Reduce false positives in Trinity assembly by tweaking the parameter settings

Qing Li

Feb 2, 2016, 12:52:41 PM
to trinityrnaseq-users
Dear all,
I am doing some benchmarking: I simulate RNA-seq reads from a reference transcriptome, assemble them with Trinity 2.0.5, and compare the assembly with the reference transcriptome. We now have longer, stranded reads (~150 bp), but the assembly gives us far more isoforms than we expected. For example, the reference has only 62 transcripts, but the assembly gives ~1191 transcripts ...
I know I can tweak --min_per_id_same_path, --max_diffs_same_path, and --max_internal_gap_same_path, but these three only set cut-offs for randomly throwing away similar transcripts. How can I know whether they are discarding true isoforms or false positives caused by sequencing errors? Any suggestions for this issue? Are there other reasonable parameters I could tweak? Thank you in advance!
Best,
Qing

Brian Haas

Feb 2, 2016, 1:04:02 PM
to Qing Li, trinityrnaseq-users
Hi Qing,

If the longer reads contain a lot of base-call errors (e.g., towards the 3' end), then this could be disruptive to Trinity. Be sure to run Trimmomatic or another tool to ensure that only the high-quality bases are included. This becomes more important the longer the read length.

Also, be sure to set the strand-specific options for stranded data.

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

李青 Li, Qing

Feb 2, 2016, 1:07:46 PM
to Brian Haas, trinityrnaseq-users
Thnx, Brian! I have trimmed the reads with PRINSEQ to Q20 and set --SS_lib_type RF. Should I trim to Q30 then...? I am worried the reads might become too short...

Brian Haas

Feb 2, 2016, 1:11:40 PM
to 李青 Li, Qing, trinityrnaseq-users
Nope... sounds like you've done the right things already.

I don't have a good explanation for why there are so many contigs, but I'm very curious to know as well. I've done similar studies and haven't experienced such a thing.

Are you setting the --min_contig_length to a very low value? Or is it at the default of 200?

~b

李青 Li, Qing

Feb 2, 2016, 1:22:21 PM
to Brian Haas, trinityrnaseq-users
Yep, I set it to 75 bp because I am assembling conotoxins, and our collaborators suggest that the shortest conotoxin sequences are ~75 bp, so I set it that low in order not to miss anything...

Mark Chapman

Feb 2, 2016, 1:27:30 PM
to Qing Li, trinityrn...@googlegroups.com, Brian Haas

Hi Qing,
If your reads are 150 b PE, then what size were the inserts? I would expect them to be larger than 150 b, so maybe the size selection could have excluded 75 b reads? Not trying to panic you, but I just thought I'd query it :)
Thanks, Mark

Brian Haas

Feb 2, 2016, 1:31:08 PM
to Mark Chapman, Qing Li, trinityrn...@googlegroups.com
Also, if you set the minimum contig length on par with or shorter than the read length, you're going to get large numbers of contigs output, as you're seeing.

In this case, you'd need to filter the output to retain only those contigs that have sufficient read support. Trinity's default parameters are tuned for longer transcripts.

You could also crank the --min_kmer_cov value high to avoid all the cruft, which is something else I don't recommend for the usual Trinity job.

~b
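
For anyone following along, here is a rough way to picture what raising --min_kmer_cov does. This is only a toy sketch in Python, not Trinity's actual code: the function names, the tiny k of 5, and the four made-up reads are all assumptions for illustration. The idea is that k-mers seen fewer times than the threshold, which is what an isolated sequencing error tends to produce, never make it into the graph.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (toy version, forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmers_passing_min_cov(reads, k, min_kmer_cov):
    """Keep only k-mers observed at least min_kmer_cov times."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_kmer_cov}

# Hypothetical reads: three error-free copies plus one copy with a single base error.
reads = ["ACGTACGGTC", "ACGTACGGTC", "ACGTACGGTC", "ACGTTCGGTC"]

all_kmers = kmers_passing_min_cov(reads, k=5, min_kmer_cov=1)
solid_kmers = kmers_passing_min_cov(reads, k=5, min_kmer_cov=2)

# The single error adds several singleton k-mers; a threshold of 2 removes them.
print(len(all_kmers), len(solid_kmers))
```

The flip side, and presumably why this is not recommended for a usual job, is that genuinely low-coverage transcripts lose their k-mers in exactly the same way.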

李青 Li, Qing

Feb 2, 2016, 1:42:02 PM
to Mark Chapman, trinityrnaseq-users, Brian Haas
Haha, thnx, Mark~ The insert size is 220 bp, so the fragment size should be 550 bp on average before trimming. But after trimming, the read length is ~97 bp on average. Shall I set --min_contig_length to 97 bp then?

李青 Li, Qing

Feb 2, 2016, 1:49:12 PM
to Brian Haas, Mark Chapman, trinityrn...@googlegroups.com
Thnx, Brian! I am also thinking of filtering the output to retain only those contigs that have sufficient read support. Thnx! 
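
A minimal sketch of that kind of post-filtering, assuming you already have a per-contig mapped-read count from whatever aligner or abundance estimator you prefer. The contig IDs, the counts, and the `min_reads` cut-off below are hypothetical, and the right threshold would have to be judged against the simulated truth set.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_read_support(fasta_text, read_counts, min_reads):
    """Keep contigs whose mapped-read count is at least min_reads."""
    kept = []
    for header, seq in parse_fasta(fasta_text):
        contig_id = header.split()[0]  # first whitespace-delimited token is the ID
        if read_counts.get(contig_id, 0) >= min_reads:
            kept.append((header, seq))
    return kept

# Hypothetical assembly and read counts; contig_3 has no mapped reads at all.
fasta_text = """>contig_1 len=12
ACGTACGTACGT
>contig_2 len=8
ACGTACGT
>contig_3 len=10
ACGTTTACGT
"""
read_counts = {"contig_1": 40, "contig_2": 1}

kept = filter_by_read_support(fasta_text, read_counts, min_reads=5)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

Contigs missing from the count table are treated as having zero support, so they are dropped rather than silently kept.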

Mark Chapman

Feb 2, 2016, 3:03:35 PM
to 李青 Li, Qing, Brian Haas, trinityrn...@googlegroups.com
Hi Qing, I'd go with your 75b minimum contig length but follow Brian's advice about the min_kmer_cov and filtering.
Hopefully your RNA size selection didn't select against the small transcripts you're looking for! :)
--Mark
--
Dr. Mark A. Chapman
+44 (0)2380 594396
------------------------------------
Centre for Biological Sciences
University of Southampton
Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ
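
Pulling the settings discussed so far into one place, a run might be assembled roughly like this. It is a sketch only: the read file names, the memory limit, and the --min_kmer_cov value of 2 are assumptions, since nobody in the thread states which values were actually used. The snippet just builds and prints the command rather than launching Trinity.

```python
import shlex

def build_trinity_command(left, right, min_contig_length=75, min_kmer_cov=2):
    """Return the Trinity argument list for a stranded paired-end run."""
    return [
        "Trinity",
        "--seqType", "fq",
        "--max_memory", "10G",                 # assumed; size to your machine
        "--left", left,
        "--right", right,
        "--SS_lib_type", "RF",                 # stranded library, as set earlier in the thread
        "--min_contig_length", str(min_contig_length),
        "--min_kmer_cov", str(min_kmer_cov),   # assumed value; the thread gives none
    ]

# Hypothetical file names for the quality-trimmed read pairs.
cmd = build_trinity_command("reads_1.trimmed.fq", "reads_2.trimmed.fq")
print(shlex.join(cmd))
```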

李青 Li, Qing

Feb 2, 2016, 4:08:50 PM
to Mark Chapman, Brian Haas, trinityrn...@googlegroups.com
Thnx!

李青 Li, Qing

Feb 17, 2016, 6:47:34 PM
to Brian Haas, Mark Chapman, trinityrn...@googlegroups.com
Hi guys,
Thank you! Adjusting the --min_kmer_cov value was really helpful, and the specificity has now increased to > 55%... no more thousands of variants...
I wonder, is there a parameter we can tweak to filter out low-complexity k-mers? Something like: if a k-mer occurs xxx times, it wouldn't be used to construct the graph. I hope this can further improve the specificity.
Thank you!
Best,
Qing
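
Whether Trinity exposes such a parameter is not answered in this thread. Purely to illustrate the idea in the question, here is a toy Python sketch that drops k-mers that are either over-abundant or compositionally simple. The entropy measure, both thresholds, and the example counts are assumptions made up for this illustration, not anything Trinity is known to do.

```python
import math
from collections import Counter

def base_entropy(kmer):
    """Shannon entropy (in bits) of the base composition of a k-mer."""
    counts = Counter(kmer)
    total = len(kmer)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def filter_kmers(kmer_counts, max_count, min_entropy):
    """Drop k-mers that occur too often or have low base-composition entropy."""
    return {
        kmer: n
        for kmer, n in kmer_counts.items()
        if n <= max_count and base_entropy(kmer) >= min_entropy
    }

# Hypothetical counts: a homopolymer, an ordinary k-mer, and an over-abundant k-mer.
kmer_counts = {"AAAAAAAA": 12, "ACGTACGT": 9, "ACGGTCAT": 5000}

kept = filter_kmers(kmer_counts, max_count=1000, min_entropy=1.0)
print(kept)
```

Note that a cap on occurrence count would also hit k-mers from genuinely highly expressed transcripts, so the two criteria are best thought of separately.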