There's an interesting problem that I discovered with using RSEM and similar methods.
For my bacterial RNA-seq, I have done a regular read-count summation with featureCounts, which in the default mode does not use any multimappers. I've also ran RSEM and kallisto.
Now, in case of normal eukaryotic RNA-seq, you would expect the sum of estimated raw read counts to be 10-30% higher for RSEM and kallisto. This is simply because you "rescue" the multimappers.
However, most bacterial annotations DO NOT have the UTRs included! The "gene" feature is usually the same as "CDS", and if you extract "transcripts" using genome fasta and annotation, your features quite often end up being too short to map/pseudomap the reads.
So, what happens is you actually lose reads with RSEM or kallisto.
I'm trying to figure out what to do here.
I would like to use an EM-based quantification method, but it seems like genomic alignment is a must for RNA-seq. So far I've used featureCounts with multimapping option, but that one simply divides the read counts equally between the equivalent positions - not exactly ideal.
Do you think there's a way one can use genomic alignment with RSEM's EM quantification? If you could define a read as belonging to a feature if it overlaps it even by 1 bp, that would include most of the reads I believe.
All the best,
-- Alex