I am trying to use Corset to generate a transcript-gene map (the clusters.txt output) to be incorporated into Trinotate. However, while loading the clusters into the Trinotate database, Trinotate reported an error in which a transcript that was present in the assembly file could not be found in the map file.
Trying to track down the beginning of the problem, I noticed that the transcript is not present in any of the sample summaries (*.bam.corset-reads). Therefore, I believe that bowtie2 simply did not identify reads that mapped back to that transcript.
But anyway, even if the count was zero, shouldn't the Corset "-m 0" parameter attribute a "NoReadsCluster" id for this transcript (https://github.com/Oshlack/Corset/wiki/InstallingRunningUsage)? Or did something go wrong during the final clustering of these summary files?
Here are the commands I used:
$ bowtie2 --very-sensitive-local --all -x assembly.fasta \
-q -1 sample_1-1.paired.fq -2 sample_1-2.paired.fq -U sample_1.unpaired.fq \
-p 4 | samptools view -Sb > sample_1.bam
Summaries generation for each sample in parallel:
$ corset -m 0 -r true-stop sample_1.bam
Clustering and counting:
$ corset -m 0 -p corset -g 1,1,1,2,2,3,3,3,4,4,4 \
-i corset sample_1.bam.corset-reads sample_2.bam.corset-reads etc....
Best wishes,
Gustavo
Thanks for the fast response.
I parsed the clusters.txt file in comparison to the assembly IDs to find that near 80 transcripts were missing (although only one is already enough to make Trinotate crash).
I think I "solved" most of my issue by running Bowtie2 again with specific fragment lengths (-X option) according to each of my samples. Although I did not check if all of those 80 missing IDs are now present in the .corset-reads files, all the rare transcripts that I looked for are now there. And I believe that these new bam files will make the clustering more reliable.
The only thing I would like to suggest is to include a multithreading option for Corset. For creating the summaries it is possible to run the analyses in paralel, but the clustering step has no option but to run using a single thread, so it takes days to finish...
Best wishes,
Gustavo
Yes, it is odd. This assembly is a Transfuse merged assembly based on 5 different assemblers and a large range of kmer values. Interestingly, the Detonate score is slightly better for a Transfuse-ion of 2 Binpacker assemblies that has HALF the contigs but only slightly worse Transrate score. I am wondering now if the difference may be that I mapped (for Corset and Detonate) with Bowtie2 with stranding (forward 0/fr nofw) while Transrate/Transfuse use Salmon in an unstranded manner, -IU (don't know about Snap). Maybe many of those contigs only get reads in the wrong orientation? I guess I'll have to do some tests (the list grows longer...).
So, if I understand correctly, the difference between filtering transcripts (by read count) and then clustering vs. clustering then filtering clusters would be that with m 0 transcripts with fewer than e.g. 10 reads would either 1) end up in clusters with other transcripts, slightly increasing read counts for that cluster or 2) form their own clusters with low read counts. So with a limit of 10 reads, #1 would be retained in post-cluster filtering, and #2 would still be filtered. Is that correct?