Read length bias on RNA counts using STAR

Prerak Desai

unread,

Feb 18, 2016, 2:57:19 PM2/18/16

to rna-star

Hi,

I am trying to perform an RNA seq analysis using STAR. I aligned reads generated from NextSeq to the latest dog ensemble assembly. Following are the star options used (basically the encode options) ,

STAR --runThreadN 16 --genomeDir dog --readFilesCommand gunzip -c --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts TranscriptomeSAM --twopassMode Basic --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesIn read1 read2 --outFileNamePrefix myoutput

I have ~75 samples. However the different samples have different read lengths. Some of the samples are 150 bp paired end, some are 75 bp paired end and some are 40 bp paired end. When I do a PCA plot of the samples, they clearly segregate by read length ( attached pdf)

I used the same reads and analysed them using CLC's RNA-seq pipeline and I dont see the read length bias in clustering.

Any suggestions or ideas as to why this could be happening. let me know if more information is needed.

Regards,

Prerak T Desai

star_dog_readlength.pdf

Alexander Dobin

unread,

Feb 19, 2016, 3:15:33 PM2/19/16

to rna-star

Hi Presak,

there are two possible culprits for the length bias: mapping and gene quantification.

Are you using ReadsPerGene counts for the PCA? How do you normalize them?

One way to figure out what's going on is to compare with the CLC pipeline which apparently dos not have the length bias. How do they map/quantify genes?

You could also try to assess the length effect by trimming all reads to 40b, and repeating mapping/quantification/PCA analysis.

The mapping parameters you are using look fine, except for --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33. These would allow single-end and discordant alignments, which I highly recommend against, since they are more likely to be mis-mappings or library prep artifacts.

Cheers

Alex

mechth...@gmail.com

unread,

Jul 10, 2017, 3:06:48 PM7/10/17

to rna-star

Hi,
I wonder wheather you found the reason for the length bias in the end? I have a similar problem when mapping rna-seq samples with different read length (101 bp paired end, 125 bp paired end and 152 bp paired end) with STAR, followed by read counting using Htseq. My sample cluster perfectly according to the readlength in a PCA plot on PC1. The base quality in the fastqc files seems to be fine for all reads over their entire length.

STAR --runThreadN 12 --genomeDir $starGenome --readFilesIn $fastqFile1 $fastqFile2 --readFilesCommand zcat --outFileNamePrefix $myoutput --twopassMode Basic --outSAMtype BAM SortedByCoordinate

Gene expression seems to differ in the expression of processed pseudogenes, which are (surprisingly) higher abundant for the samples with longer reads.
When using the --clip3pNbases option for STAR to trimm all reads to 101 bp I can remove the bias and the expression of pseudogenes for the longer reads.

The % of unmapped reads: too short in the Star logfile was also higher when mapping the longer reads.

Could this effect be connected to the higher probability of covering splice junctions for longer reads, that are then mapped to their corresponding pseudogenes? Is their any other explanation? Should it not be easier to map longer reads correctly?
Is their any other parameter I can change for mapping to account for the length bias? I already changed the --seedSearchStartLmax to 30 and tried 1-pass and 2-pass mapping without any effect on the length bias.

I could also stick to the clipping/trimming and use equal read length, but I still wonder wheather mapping to pseudogenes becomes more an issue with longer reads?

Regards,
Almut

PCA readlength-1.png

Alexander Dobin

unread,

Jul 11, 2017, 3:09:24 PM7/11/17

to rna-star

Hi Almut,

there was no follow-up to the original post, so the problem has not been resolved.

The longer reads should be easier to map with splicing, so generally there should be fewer reads mapping to pseudogenes.

Are you trimming the adapters before mapping? What is the insert size distribution in your library?

If a lot of reads have inserts << read length, they will not map well without adapter trimming, which may explain why you see increase in unmapped reads.

In this case, trimming the adapter is necessary - you can use simple built-in adapter trimming in STAR or use an external one.

If this does not help, could you send me more information for one of the 151b samples, mapped twice - once with full read length (151) and once with reads trimmed to 101b.

1. The Log.final.out files.

2. Scatter plot of read counts 151-mapping vs 101-mapping for all genes and for pseudogenes only.

Cheers

Alex

On Monday, July 10, 2017 at 3:06:48 PM UTC-4, wrote:

Hi,
I wonder wheather you found the reason for the length bias in the end? I have a similar problem when mapping rna-seq samples with different read length (101 bp paired end, 125 bp paired end and 152 bp paired end) with STAR, followed by read counting using Htseq. My sample cluster perfectly according to the readlength in a PCA plot on PC1. The base quality in the fastqc files seems to be fine for all reads over their entire length.

STAR --runThreadN 12 --genomeDir $starGenome --readFilesIn $fastqFile1 $fastqFile2 --readFilesCommand zcat --outFileNamePrefix $myoutput --twopassMode Basic --outSAMtype BAM SortedByCoordinate

Gene expression seems to differ in the expression of processed pseudogenes, which are (surprisingly) higher abundant for the samples with longer reads.
When using the --clip3pNbases option for STAR to trimm all reads to 101 bp I can remove the bias and the expression of pseudogenes for the longer reads.

The % of unmapped reads: too short in the Star logfile was also higher when mapping the longer reads

Reply all

Reply to author

Forward