Reducing de novo assembly by not pooling reads from different conditions

80 views
Skip to first unread message

Emiliano Canton

unread,
Aug 20, 2018, 9:57:50 AM8/20/18
to trinityrnaseq-users
Hello,

We are currently testing an assembly of RNAseq data for a non-model insect in our lab. We have 4 different tissues we're interested in analyzing, in 2 different conditions, and we did this in triplicate, with a total of 24 samples. For each sequenced sample we got an average of 30 million reads, 150 bp, paired. Some of them weren't that great, but the company is re-sequencing some of the samples because of the machine problems. While we wait, we are testing out the parameters of our assembly.

Initially, we are pooling only the data from two full replicates of the experiments to assemble with Trinity 2.6.6 due to our limits on RAM and CPU access at the university's server. It still amounts to around 560 million reads, which were trimmed with TrimGalore to remove adapters. After trimming, most of the reads were still above 130 bp long, so I hope that is not much of an issue. I allowed default read normalization and tested both min_kmer_cov of 1 and 2. I used the metrics script for a quick and dirty look at the assembly quality. I expected some fragmentation due to the some of the reads, but here are the outputs.

With Kmer coverage of 1

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    240907
Total trinity transcripts:    366666
Percent GC: 32.71

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 4457
    Contig N20: 3188
    Contig N30: 2411
    Contig N40: 1834
    Contig N50: 1371

    Median contig length: 415
    Average contig: 790.29
    Total assembled bases: 289771839


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 3714
    Contig N20: 2483
    Contig N30: 1770
    Contig N40: 1259
    Contig N50: 889

    Median contig length: 363
    Average contig: 628.22
    Total assembled bases: 151342796


-----------
With kmer coverage of 2

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    185850
Total trinity transcripts:    304734
Percent GC: 32.82

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 4512
    Contig N20: 3258
    Contig N30: 2514
    Contig N40: 1942
    Contig N50: 1485

    Median contig length: 412
    Average contig: 814.87
    Total assembled bases: 248319954


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 3870
    Contig N20: 2705
    Contig N30: 1986
    Contig N40: 1465
    Contig N50: 1039

    Median contig length: 346
    Average contig: 649.87
    Total assembled bases: 120778934

-------------

Assembly improved somewhat by not allowing singletons in our pooled data, but it still has a high number of small transcript fragments. Maybe going higher with the K-mer coverage or changing the kmer size might improve the assembly. Another colleague in the lab is working with a similar data set, but he decided not to pool the data from his samples and do individual assemblies per tissue instead and has gotten less partial assemblies. He attributes this to the presence of gene isoforms that may be giving the graph paths problems. Should I use that strategy to improve the assembly? At the end we want to perform differential expression analysis, but that might prove tricky to compare if each tissue has its own reference to do the mapping of the reads. Elsewhere I read that increasing the normalization coverage to 200 instead of the default 50 could help as more reads could be used to improve the assembly. Any pointers will be helpful, thank you.

Emiliano Cantón

Brian Haas

unread,
Aug 20, 2018, 12:40:32 PM8/20/18
to pemilian...@gmail.com, trinityrnaseq-users
Hi Emiliano,

I have a few recommendations for you.   First, if you can try using the latest version of Trinity (v2.8.2), that would be great.  Next, there are some other aspects of the assembly that you'll want to explore for the sake of evaluating assembly quality:
and I'd probably start with the ExN50 plots and using BUSCO.

For just about any assembly with a complex transcriptome, you're going to end up with a large number of lowly expressed transcripts that might bias your impression of the quality of the assembly.  The ExN50 attempts to take that into consideration.  You can also filter out lowly expressed fragments and other approaches for dealing with the many contigs as described here:


hope this helps!   sounds like an interesting project you have!!

~brian






--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Emiliano Canton

unread,
Aug 20, 2018, 3:29:08 PM8/20/18
to trinityrnaseq-users
Thank you for your suggestions!

I am now using v 2.8.2. Last week when I was doing the assembly the last version on the server was 2.6.6, but now it updated. Luckily, since I have no privileges to install anything. I've been reading through the github entries to do the expression abundance estimation for the 'genes' before I do the ExN50. I'll try to go through it with a couple of the experiments just to see how it comes out. As I write this, I'm running the assembly again to see if changing the normalization coverage to 200 actually improves or worsens the fragmentation. When I go through all these steps, I'll report back. Got to leave a record for people googling similar problems.

Emiliano
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Aug 20, 2018, 4:02:00 PM8/20/18
to pemilian...@gmail.com, trinityrnaseq-users
Sounds good.  The max cov up'd to 200 should help for dealing w/ polymophisms.  Also, the latest version should work slightly better if you have non-strand-specific data.

best,

~b

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Emiliano Canton

unread,
Aug 21, 2018, 11:08:40 AM8/21/18
to trinityrnaseq-users
Well....Trinity run failed in Phase 2, though I am not entirely sure why. I'm uploading to Google Drive the Trinity output from this run with V 2.8.2, min_kmer_cov of 2 and normalized coverage of 200 (job 24215489), apparently too large to attach. I'm also uploading the output from the previous run with v 2.6.6 with min_kmer_cov of 2, which successfully ran to completion on exactly the same set of reads (job 23637652).


The run seems to fail when running the following command, over and over again:

/apps/gcc/5.2.0/trinity/r20180815-2.8.2/util/support_scripts/../../Trinity --single "/scratch/local/24215489/trinity_output/read_partitions/Fb_0/CBin_82/c8277.trinity.reads.fa" --output "/scratch/local/24215489/
trinity_output/read_partitions/Fb_0/CBin_82/c8277.trinity.reads.fa.out" --CPU 1 --max_memory 1G --run_as_paired --seqType fa --trinity_complete --full_cleanup --min_kmer_cov 2

It also starts finding repeated sequences, which didn't happen with that set of reads when using v 2.6.6. I'll attempt v 2.6.6 with normalization of 200 to see if the error is reproducible with the earlier version of Trinity.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Aug 21, 2018, 11:31:20 AM8/21/18
to pemilian...@gmail.com, trinityrnaseq-users
Hi,

It looks like the files on google drive are bam files (below) from running STAR.    Were you planning on sharing these, or something else?

./2.5.4a-results
./2.5.4a-results/Aligned.sortedByCoord.out.bam
./2.5.4a-results/Log.final.out
./2.5.4a-results/Log.out
./2.5.4a-results/Log.progress.out
./2.5.4a-results/SJ.out.tab
./2.5.4a-results.tgz
./2.6.1a-results
./2.6.1a-results/Aligned.sortedByCoord.out.bam
./2.6.1a-results/Log.final.out
./2.6.1a-results/Log.out
./2.6.1a-results/Log.progress.out
./2.6.1a-results/SJ.out.tab
./2.6.1a-results.tgz

If you'd like to send some other data, the best way to do it for me is using this:

so I can easily access them from a linux server via cmd line.

best,

~b
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

>>>>> To post to this group, send email to trinityrn...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/trinityrnaseq-users.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Brian J. Haas
>>>> The Broad Institute
>>>> http://broadinstitute.org/~bhaas
>>>>
>>>>  
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

>>> To post to this group, send email to trinityrn...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/trinityrnaseq-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>>
>> --
>> --
>> Brian J. Haas
>> The Broad Institute
>> http://broadinstitute.org/~bhaas
>>
>>  
>
> --
> You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

Emiliano Canton

unread,
Aug 21, 2018, 11:35:06 AM8/21/18
to trinityrnaseq-users
Ummmm...that's weird. Those files are not mine. I only uploaded two text files to Google Drive.


Let me check the other option.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

>>>>> To post to this group, send email to trinityrn...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/trinityrnaseq-users.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Brian J. Haas
>>>> The Broad Institute
>>>> http://broadinstitute.org/~bhaas
>>>>
>>>>  
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

>>> To post to this group, send email to trinityrn...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/trinityrnaseq-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>>
>> --
>> --
>> Brian J. Haas
>> The Broad Institute
>> http://broadinstitute.org/~bhaas
>>
>>  
>
> --
> You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

Brian Haas

unread,
Aug 21, 2018, 11:36:38 AM8/21/18
to pemilian...@gmail.com, trinityrnaseq-users
oh, sorry!  my bad....  I pulled the wrong files.

just a sec...

Brian Haas

unread,
Aug 21, 2018, 11:40:05 AM8/21/18
to pemilian...@gmail.com, trinityrnaseq-users
interesting.  So, you found a bug in the latest version after all.

Could you package up 

tar -zcvf for_brian.tar.gz /scratch/local/24215489/trinity_output/read_partitions/Fb_0/CBin_82/c8277.trinity.reads.fa*

and send it to me privately?   I'll aim to resolve this asap.

best,


~brian

Emiliano Canton

unread,
Aug 21, 2018, 11:52:38 AM8/21/18
to trinityrnaseq-users
The way our server is setup, in the partition I'm using for big memory jobs once the job finishes the temporary folders are cleared. Meaning, that local scratch folder has ceased to exist. I always have to copy the final output to another folder in my user home directory. I'll contact our server administrators to see if there is anything we can do.

Emiliano Canton

unread,
Aug 21, 2018, 11:54:11 AM8/21/18
to trinityrnaseq-users
Wait,now it was my bad. I do have it backed up. I will send it over.

Brian Haas

unread,
Aug 21, 2018, 12:12:18 PM8/21/18
to pemilian...@gmail.com, trinityrnaseq-users
awesome!  I'm looking forward to looking into this.    I didn't expect this issue to surface...  I put this new check in there for v2.8 and didn't experience the repeat seq issue in any of our test data sets.  ugh...

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Emiliano Canton

unread,
Aug 23, 2018, 2:39:59 PM8/23/18
to trinityrnaseq-users
In case someone comes upon this thread while searching the internet, using Trinity 2.6.6 the assembly slightly improved when changing the normalize_read_coverage parameter to 200, thought not radically so. Using Kmer coverage of 2 did seem to work better than allowing singleton kmers into the mix. It might just be a question of our dataset. Some stats below, including after running BUSCO on the assembly with the arthropoda dataset. A combination of both paremeters appears to work best. My doubt remains if for differential expression analysis where some isoforms might be tissue specific which is the correct approach: individual per tissue assembly or pooled assembly from different tissues.


------
Default parameters:
Summarized benchmarking in BUSCO notation for file Trinity.fasta
# BUSCO was run in mode: transcriptome

    C:96.8%[S:31.0%,D:65.8%],F:2.6%,M:0.6%,n:1066

    1031    Complete BUSCOs (C)
    330    Complete and single-copy BUSCOs (S)
    701    Complete and duplicated BUSCOs (D)
    28    Fragmented BUSCOs (F)
    7    Missing BUSCOs (M)
    1066    Total BUSCO groups searched

-------------------------------------
Kmer coverage 2, normalized coverage 50
Summarized benchmarking in BUSCO notation for file Trinity.fasta
# BUSCO was run in mode: transcriptome

    C:97.6%[S:35.0%,D:62.6%],F:1.9%,M:0.5%,n:1066

    1040    Complete BUSCOs (C)
    373    Complete and single-copy BUSCOs (S)
    667    Complete and duplicated BUSCOs (D)
    20    Fragmented BUSCOs (F)
    6    Missing BUSCOs (M)
    1066    Total BUSCO groups searched

-----------------
Kmer coverage 2, normalized coverage 200

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    187813
Total trinity transcripts:    288277
Percent GC: 32.77


########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 5041
    Contig N20: 3595
    Contig N30: 2718
    Contig N40: 2061
    Contig N50: 1530

    Median contig length: 392
    Average contig: 810.86
    Total assembled bases: 233753571



#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 4140
    Contig N20: 2839
    Contig N30: 2055
    Contig N40: 1487
    Contig N50: 1041

    Median contig length: 344
    Average contig: 650.59
    Total assembled bases: 122188887

Summarized benchmarking in BUSCO notation for file Trinity.fasta
# BUSCO was run in mode: transcriptome

    C:98.6%[S:47.7%,D:50.9%],F:0.6%,M:0.8%,n:1066

    1051    Complete BUSCOs (C)
    508    Complete and single-copy BUSCOs (S)
    543    Complete and duplicated BUSCOs (D)
    6    Fragmented BUSCOs (F)
    9    Missing BUSCOs (M)
    1066    Total BUSCO groups searched


El martes, 21 de agosto de 2018, 12:12:18 (UTC-4), Brian Haas escribió:
awesome!  I'm looking forward to looking into this.    I didn't expect this issue to surface...  I put this new check in there for v2.8 and didn't experience the repeat seq issue in any of our test data sets.  ugh...

On Tue, Aug 21, 2018 at 11:54 AM Emiliano Canton <pemilian...@gmail.com> wrote:
Wait,now it was my bad. I do have it backed up. I will send it over.

El martes, 21 de agosto de 2018, 11:52:38 (UTC-4), Emiliano Canton escribió:
The way our server is setup, in the partition I'm using for big memory jobs once the job finishes the temporary folders are cleared. Meaning, that local scratch folder has ceased to exist. I always have to copy the final output to another folder in my user home directory. I'll contact our server administrators to see if there is anything we can do.

El martes, 21 de agosto de 2018, 11:40:05 (UTC-4), Brian Haas escribió:
interesting.  So, you found a bug in the latest version after all.

Could you package up 

tar -zcvf for_brian.tar.gz /scratch/local/24215489/trinity_output/read_partitions/Fb_0/CBin_82/c8277.trinity.reads.fa*

and send it to me privately?   I'll aim to resolve this asap.

best,


~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Aug 23, 2018, 3:30:06 PM8/23/18
to pemilian...@gmail.com, trinityrnaseq-users
Thx for sharing!  those busco stats look quite good.   It is interesting that min kmer cov = 2 improved things slightly.  I like the idea of doing this for deeply sequenced data sets.



To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.


--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages