Reads aligning to genome but not transcriptome

Scott Youlten

unread,

Feb 4, 2016, 10:25:54 PM2/4/16

to rna-star

Hi all,

I am aligning mouse RNA-seq samples to the mm10 genome and the M5 gencode transcriptome. While this process works fine for most samples, in a handful of samples the Aligned.toTranscriptome.out.bam is very small, and following RSEM quantification it appears no reads are aligned the majority of genes. Puzzlingly the genome alignment seems to work fine for the same samples, and when I view the genome alignment with IGV there are lots of reads mapped to genes that are reported as 0 in the transcriptome alignment file.

Has anyone encountered this problem before? The parameters I am using for the alignment are below.

Any insight you could give me would be greatly appreciated!

Thanks

Scott

star_line="/home/scoyou/local/bin/STAR-2.5.1b/source/STAR

--genomeDir $genomeDir \

--runMode alignReads \

--readFilesIn $inFile1 $inFile2 \

--runThreadN $ncore \

--sjdbOverhang 124 \

--outFileNamePrefix $outDir$sample. \

--outFilterType BySJout \

--outFilterMultimapNmax 20 \

--outFilterMismatchNmax 999 \

--outFilterMismatchNoverReadLmax 0.04 \

--outFilterScoreMinOverLread 0 \

--outFilterMatchNminOverLread 0 \

--outFilterMatchNmin 80 \

--alignIntronMin 20 \

--alignIntronMax 1500000 \

--alignMatesGapMax 1500000 \

--alignSJoverhangMin 6 \

--alignSJDBoverhangMin 1 \

--outSAMattributes NH HI AS NM MD \

--quantMode TranscriptomeSAM"

starLine="/home/scoyou/local/bin/STAR-2.5.1b/source/STAR \

--runMode genomeGenerate \

--genomeDir $genomeDir \

--genomeFastaFiles $genomeDir/$sequence \

--outFileNamePrefix $outDir \

--runThreadN $ncore \

--sjdbGTFfile $genomeDir/$annotation \

--sjdbOverhang 124"

Alexander Dobin

unread,

Feb 5, 2016, 3:46:06 PM2/5/16

to rna-star

Hi Scott,

Is it possible that in these samples most of the reads are soft-clipped (adapter, poor quality tails etc)?

One option to verify the results is to use --quantMode TranscriptomeSAM GeneCounts, which will generate ReadsPerGene count file - then you can number of reads that overlap the genes.

If this does not resolve the problem, please map ~100,000 reads and send me the genomic and transcriptomic BAMs.

Cheers

Alex

Scott Youlten

unread,

Feb 7, 2016, 5:23:46 PM2/7/16

to rna-star

Hi Alex

Thanks for the incredibly fast response time!

You were right in that --quantMode TranscriptomeSAM GeneCounts produced counts for genes which had otherwise been 0 in the RSEM output, but I like RSEMs approach to counting and handling multi-mappers so I was hoping to get it working with that (Im also really wary of moving past an issue like this without knowing what was going wrong otherwise I will remain forever a rookie).

When I was subsetting the reads I realised one thing I forgot to mention (I guess because I haven't recognised any issues with it in the past) was that I had been using an in silico rRNA depletion step, aligning reads first to annotated rRNA sequences and aligning the unmapped reads to the whole mouse genome/transcriptome. So to rule it out, I tried aligning trimmed reads to the whole genome without this extra step....and all samples seem to have worked perfectly. The Aligned.to.transcriptome.sams are all ~equal size and the RSEM output looks great for all samples.

While this is good news as I can now proceed with analysis, I have no idea what could have been going on, particularly as it only seemed to affect a handful of samples and not at the genome level. Let me know if you'd still like a subset of the data if this isn't a well established phenomenon, otherwise thanks for your help and for the great program. Another happy customer.

Cheers

Scott

Alexander Dobin

unread,

Feb 8, 2016, 6:41:02 PM2/8/16

to rna-star

Hi Scott,

happy to hear that you figured out how to make it work.

I cannot think of a simple way how to explain your initial observations.

If you are interested, you can send me genomic and transcriptomic BAMs for ~100,000 reads, mapped with your original pipeline (i.e. with rRNA pre-filtering).