questions regarding STAR aligning to transcriptome output


lluc cabús fornas

Mar 31, 2021, 1:16:15 PM
to rna-star
Hi Alex,

I'm running a comparison to find the optimal NGS parameters for our samples. I started with 2 samples sequenced paired-end at 150 bp and a depth of 20M reads, and processed the fastq files to derive single-end data (taking only the first fastq file), 10M depth (taking only the first 10M reads), and 100 bp and 50 bp reads (hard-trimming the 3' end of the reads).
I followed the same processing for all of them: UMI extraction, mapping with STAR to the transcriptome, and quantification with RSEM.

From the results I have obtained, it seems that single-end yields more transcripts than paired-end, and that 50 bp yields more transcripts than 100 or 150 bp. Are all of those false alignments? This only happens in the alignment to the transcriptome, since the Aligned.sortedByCoord.out.bam file is much bigger in the paired-end analysis than in the single-end analysis.

How does the alignment to the transcriptome work? Is it normal that a sample that has a lower number of genes in the alignment to the genome has a higher number of genes in the alignment to the transcriptome?

Thank you very much in advance.

Best regards,
Lluc

Alexander Dobin

Mar 31, 2021, 5:32:40 PM
to rna-star
Hi Lluc,

Shorter reads will map to more loci in the genome (check the Log.final.out file), and to even more different transcripts in the transcriptome. You can check this by plotting a histogram of NH values for each read in the Aligned.toTranscriptome.out.bam file (you can pick just the primary alignments, since all alignments of a read have the same NH).
While RSEM is supposed to deconvolve multimappers, you are making its job harder by reducing the read length.
I think this could explain your observations.
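
The NH check described above could be sketched roughly as follows. This is a minimal, unofficial example, not part of STAR or RSEM. It assumes the transcriptome BAM has first been converted to SAM text (for instance with `samtools view`), and it counts each read name once instead of filtering on flags, which gives the same result because all alignments of a read carry the same NH. The function name `nh_histogram` and the file name `tx.sam` are hypothetical.

```python
import sys
from collections import Counter


def nh_histogram(sam_lines):
    """Return a Counter mapping NH value -> number of distinct reads.

    Each read name is counted once; both mates of a pair share a name,
    so a pair also counts once. Keeping all names in a set costs memory,
    which is acceptable for a sketch but may be heavy for very deep runs.
    """
    seen = set()
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a valid alignment record
        name = fields[0]
        if name in seen:
            continue  # this read was already counted
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                seen.add(name)
                hist[int(tag[5:])] += 1
                break
    return hist


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python nh_hist.py tx.sam
    with open(sys.argv[1]) as handle:
        hist = nh_histogram(handle)
    total = sum(hist.values())
    print("NH\treads\tfraction")
    for nh in sorted(hist):
        print(f"{nh}\t{hist[nh]}\t{hist[nh] / total:.4f}")
```

Running this on the 150 bp, 100 bp and 50 bp versions of the same sample and comparing the tables would show whether the shorter reads shift toward higher NH values.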

Another possibility is that your reads have poor-quality tails, or contain adapter sequences (i.e. the insert size is short). If that's the case, the trimming is beneficial, as it would allow more reads to map (both unique and multimappers), thus increasing the number of detected transcripts. If the number of uniquely mapped reads increases significantly as you trim the reads, it may hint at this scenario.
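
To compare the uniquely mapped fraction across the trimmed runs, something like the following sketch could be used. It assumes that Log.final.out consists of `key | value` lines and that the keys are spelled "Uniquely mapped reads %" and "% of reads mapped to multiple loci"; please check both assumptions against your own files. The helper names are hypothetical.

```python
import sys


def parse_log_final(text):
    """Parse 'key | value' lines of a Log.final.out-style file into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '88.51%' to the float 88.51."""
    return float(stats[key].rstrip("%"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python compare_logs.py run150/Log.final.out run50/Log.final.out
    print("file\tunique_%\tmulti_%")
    for path in sys.argv[1:]:
        with open(path) as handle:
            stats = parse_log_final(handle.read())
        unique = percent(stats, "Uniquely mapped reads %")
        multi = percent(stats, "% of reads mapped to multiple loci")
        print(f"{path}\t{unique}\t{multi}")
```

If the unique percentage rises clearly from 150 bp to 50 bp, that would point toward the poor-tail or adapter scenario; if mainly the multimapper percentage rises, the multimapping explanation seems more likely.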

Cheers
Alex