Assembly transcript missing in clusters.txt


duart...@gmail.com

Mar 20, 2017, 5:40:32 AM
to corset-project
Hello,

I am trying to use Corset to generate a transcript-to-gene map (the clusters.txt output) to be incorporated into Trinotate. However, while loading the clusters into the Trinotate database, Trinotate reported an error: a transcript that was present in the assembly file could not be found in the map file.

Trying to track down the origin of the problem, I noticed that the transcript is not present in any of the sample summaries (*.bam.corset-reads). Therefore, I believe that bowtie2 simply did not find any reads that mapped back to that transcript.

But even if the count was zero, shouldn't the Corset "-m 0" parameter assign a "NoReadsCluster" id to this transcript (https://github.com/Oshlack/Corset/wiki/InstallingRunningUsage)? Or did something go wrong during the final clustering of these summary files?

Here are the commands I used:

$ bowtie2 --very-sensitive-local --all -x assembly.fasta \
-q -1 sample_1-1.paired.fq -2 sample_1-2.paired.fq -U sample_1.unpaired.fq \
-p 4 | samtools view -Sb > sample_1.bam

Summary generation for each sample in parallel:

$ corset -m 0 -r true-stop sample_1.bam

Clustering and counting:

$ corset -m 0 -p corset -g 1,1,1,2,2,3,3,3,4,4,4 \
-i corset sample_1.bam.corset-reads sample_2.bam.corset-reads etc....

Best wishes,

Gustavo

Nadia Davidson

Mar 20, 2017, 7:56:59 PM
to corset-project, duart...@gmail.com
Hi Gustavo,

You're right. If you used -m 0 then all the transcripts should be reported, so something has gone wrong.

My guess is that the issue only affects Corset when it's run in parallel with the .corset-reads files being output and reread. I think it's likely that all transcripts will be reported if you run like so:
corset -m 0 -p corset -g 1,1,1,2,2,3,3,3,4,4,4 sample_1.bam sample_2.bam etc....
However, that can take a lot longer to run.

I'll look into this issue some more and let you know if and when we fix it. Thanks for reporting it to us.

Cheers,
Nadia.

duart...@gmail.com

Mar 21, 2017, 3:41:09 AM
to corset-project, duart...@gmail.com
Hello Nadia,

Thanks for the fast response.

I compared the clusters.txt file against the assembly IDs and found that nearly 80 transcripts were missing (although even one is enough to make Trinotate crash).
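For anyone hitting the same Trinotate error, a quick way to list the missing IDs is to diff the assembly's FASTA headers against the first column of clusters.txt. A minimal Python sketch (the file paths and layout assumptions, tab-separated clusters.txt with the transcript ID in column one, are mine, not from the Corset docs):

```python
def fasta_ids(path):
    """Collect sequence IDs from FASTA headers (text up to the first whitespace)."""
    ids = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                ids.add(line[1:].split()[0])
    return ids

def cluster_ids(path):
    """Collect transcript IDs from the first column of clusters.txt."""
    with open(path) as fh:
        return {line.split("\t")[0].strip() for line in fh if line.strip()}

def missing_transcripts(fasta_path, clusters_path):
    """Transcripts present in the assembly but absent from the cluster map."""
    return sorted(fasta_ids(fasta_path) - cluster_ids(clusters_path))
```

With -m 0 this set should be empty; anything it reports is a transcript Trinotate will choke on.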

I think I "solved" most of my issue by running Bowtie2 again with sample-specific fragment lengths (the -X option). Although I did not check whether all 80 missing IDs are now present in the .corset-reads files, all the rare transcripts I looked for are now there, and I believe these new bam files will make the clustering more reliable.

The only thing I would like to suggest is a multithreading option for Corset. The summaries can be generated in parallel, but the clustering step can only run on a single thread, so it takes days to finish...

Best wishes,

Gustavo

swil...@gmail.com

Jun 5, 2017, 12:38:36 PM
to corset-project
Hi Nadia et al.,

I just discovered the same pattern in my results: ~20% of transcripts were missing after running with -m and -i *.corset-reads. Has there been any development on this? Is the "bug" likely to affect runs with -i *.corset-reads and a higher -m value?

Also, is there a disadvantage to using a small -m value (0 or 1) and then filtering the clustering results post hoc (e.g. at 5 or 10 reads) rather than passing that value to Corset directly?

Thanks!

Stu

Nadia Davidson

Jun 6, 2017, 4:58:21 AM
to corset-project
Hi Stu,

I think the issue should only affect -m 0 (-m 1 should still be okay). To be honest, I haven't looked into it any further because there's a workaround (running without generating .corset-reads files).

Did you use -m 0 every time you ran Corset, i.e. also when Corset processed the .corset-reads files? 20% of transcripts missing wouldn't be too surprising if the threshold was the default of 10, but it would seem high for -m 0. That would suggest that 20% of your transcripts have no reads aligning to them, which would be very unusual.

There's little difference whether you filter during or after Corset. We tend to do it during because it can make Corset a little faster, but there's certainly no reason not to leave the lowly expressed transcripts in if you like.
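Post-hoc filtering of this kind is a one-liner over Corset's counts output. A sketch, assuming the usual counts.txt layout (tab-separated, header row, cluster ID in the first column, one count column per sample):

```python
def filter_counts(counts_path, out_path, min_reads=10):
    """Copy counts.txt, keeping only clusters whose total read count
    across all samples is at least min_reads. Returns the number of
    clusters kept. Assumes a tab-separated file with a header row and
    the cluster ID in the first column."""
    kept = 0
    with open(counts_path) as fh, open(out_path, "w") as out:
        out.write(next(fh))  # copy the header row unchanged
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            total = sum(int(c) for c in fields[1:])
            if total >= min_reads:
                out.write(line)
                kept += 1
    return kept
```

Run Corset once with -m 0 (or 1), then apply this at whatever threshold you want to compare; no need to re-cluster for each cutoff.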

Cheers,
Nadia.

swil...@gmail.com

Jun 6, 2017, 11:03:00 AM
to corset-project
Hi Nadia,

Thanks for your quick reply.

I ran Corset to create the .corset-reads files as shown in the GitHub example (no other flags), and then ran with -i and a large -D for -m 0, 5, and 10 to compare the results (as well as with the defaults). It was convenient not to have to rerun from scratch with the bam files each time. It was only after noticing that RapClust was dropping 36% of transcripts (I now know it has a built-in limit of 10 reads) that I investigated whether transcripts were missing from the Corset clustering. Having now run with -m 0 and the bam files directly, I can confirm you are right: this allowed the "NoReadsCluster" IDs to be output, all 20% of them.

Yes, it is odd. This assembly is a Transfuse-merged assembly based on 5 different assemblers and a wide range of k-mer values. Interestingly, the Detonate score is slightly better for a Transfuse-ion of 2 BinPacker assemblies that has HALF the contigs but an only slightly worse Transrate score. I am wondering now if the difference may be that I mapped (for Corset and Detonate) with Bowtie2 in stranded mode (forward, fr/nofw) while Transrate/Transfuse use Salmon in an unstranded manner, IU (don't know about Snap). Maybe many of those contigs only get reads in the wrong orientation? I guess I'll have to do some tests (the list grows longer...).

So, if I understand correctly, the difference between filtering transcripts (by read count) and then clustering versus clustering and then filtering clusters is this: with -m 0, transcripts with fewer than e.g. 10 reads would either 1) end up in clusters with other transcripts, slightly increasing the read counts for those clusters, or 2) form their own clusters with low read counts. So with a limit of 10 reads, case 1 would be retained in post-cluster filtering, and case 2 would still be filtered. Is that correct?

Thanks!

Stu

Nadia Davidson

Jun 7, 2017, 5:14:59 AM
to corset-project
Hi Stu,

Yes, it is odd. This assembly is a Transfuse-merged assembly based on 5 different assemblers and a wide range of k-mer values. Interestingly, the Detonate score is slightly better for a Transfuse-ion of 2 BinPacker assemblies that has HALF the contigs but an only slightly worse Transrate score. I am wondering now if the difference may be that I mapped (for Corset and Detonate) with Bowtie2 in stranded mode (forward, fr/nofw) while Transrate/Transfuse use Salmon in an unstranded manner, IU (don't know about Snap). Maybe many of those contigs only get reads in the wrong orientation? I guess I'll have to do some tests (the list grows longer...).

This sounds very plausible! I'm not sure about all assemblers, but at least some will produce reverse-strand contigs even if the data was stranded.
 
So, if I understand correctly, the difference between filtering transcripts (by read count) and then clustering versus clustering and then filtering clusters is this: with -m 0, transcripts with fewer than e.g. 10 reads would either 1) end up in clusters with other transcripts, slightly increasing the read counts for those clusters, or 2) form their own clusters with low read counts. So with a limit of 10 reads, case 1 would be retained in post-cluster filtering, and case 2 would still be filtered. Is that correct?

Yeah, that is more or less correct, although there can be some subtle effects. For example, if you filter before clustering, you can actually still get clusters with fewer than 10 reads (because some of the 10 reads each transcript has might be shared with transcripts in another cluster and randomly get assigned to that other cluster). It's also possible that the transcripts with 10 or more reads could be clustered differently because of the presence or absence of the transcripts with fewer than 10 reads. In terms of the biology, I don't think it would change much at all (unless you're interested in genes with very low expression), and indeed I haven't seen any difference in my own differential expression results doing it during or after clustering.
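The two cases above can be made concrete with toy numbers (the cluster names and counts here are hypothetical, purely for illustration):

```python
# Hypothetical per-transcript read counts after clustering with -m 0.
clusters = {
    "Cluster-A": {"t1": 50, "t2": 4},  # low-count t2 rides along with t1
    "Cluster-B": {"t3": 6},            # a lone low-count transcript
}
threshold = 10

# Post-cluster filtering keeps any cluster whose TOTAL meets the threshold.
kept = {name for name, members in clusters.items()
        if sum(members.values()) >= threshold}
# Cluster-A totals 54 reads, so t2 survives despite having only 4 (case 1);
# Cluster-B totals 6 reads and is dropped (case 2).
```

Filtering transcripts before clustering would instead have removed both t2 and t3 up front, which is why the two orderings can differ at the margins.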
 
Cheers,
Nadia.

swil...@gmail.com

Jun 7, 2017, 10:17:39 AM
to corset-project
Hi Nadia,

Plausible, and frustrating. If getting rid of mis-assemblies like that isn't the point of Transrate/Transfuse, then what is?

I'm glad you mentioned clusters with fewer than 10 reads (or whatever the limit) still sneaking through, as I had noticed that and assumed it had to do with shared reads, so thanks for the confirmation. But the numbers of 10+ read clusters vary by only about 0.05%, so I imagine you're right that it makes little difference.

Thanks for your help!

Stu