RPM values computed in wig output differs from ReadsPerGene.tab

Jean-Philippe Villemin

unread,

Sep 20, 2017, 6:13:35 AM9/20/17

to rna-star

I'm using star-2.5.2b on stranded rnaseq from Illumina Trueseq.

Star gave me the wigs for each strand using unique mappers and then I visualise genes in ucsc browser.

I tried to recompute rpm value for the genes using the ReadsPerGene file using column 4 (mode reverse of htseqcount). So this column counts F2R1 reads for gene on strand + and F1R2 reads for gene on strand -.True ?

RPM = NumberOfReadsPerGene * 1000000/ Sum(ReadsPerGene)

But this value seems to be very different from the y-coordinate height I see in UCSC with wig.

What am I missing or doing wrong ?

Thanks

Alexander Dobin

unread,

Sep 20, 2017, 9:01:04 AM9/20/17

to rna-star

Hi Jean-Philippe,

the RPM values in the unique Wig files are normalized to the total number of unique reads, while you are normalizing by the number of unique reads overlapping genes.

This may explain the difference - if it does not please send me an example: a browser snapshot for one of the genes and the corresponding line from ReadPerGene, and the total number of uniquely mapped reads.

Cheers

Alex

Jean-Philippe Villemin

unread,

Sep 20, 2017, 1:22:46 PM9/20/17

to rna-...@googlegroups.com

Ok, I understand but still I have a huge difference with the browser !

ENSG00000149182(ARFGAP2) has 4133 reads. (column 4 readsPerGene) Gene is on neg strand.

60 014 431 is the number of unique reads overlapping genes.

(

4133 / 60 014 431) * 1 000 000 = 68,86

91 182 831 is the total number of unique reads.

(

4133 /91 182 831) * 1 000 000 = 45,32

In the ucsc browser, for the corresponding strand,I get 3.52601, so it don't get it. Do you see anything in my numbers that doesn't make sense ?

Update : I compute total number of unique reads as suggested here

Update2 : I tried to recompute reads count using htseq-count and I get more reads 7824 (initially 4133 reads found , even with that I still not have the number given in ucsc browser) Why I get more reads ? Do I miss something in the command line with htseq-count ?

htseq-count -f bam -s reverse T6rep2.bam gencode.v25.primary_assembly.annotation.gtf > genes.htseq-cont2.txt

The wig is transform in bigwig with wigToBigWig tool but that sould not change anything.

Update3 : Try to clarify my process if that can help

Paired reads are aligned. Then intermediate bams are merged into one and then I use the following line to create wigs. (no re-sort is done after merging and befor wig creation)

inputAlignmentsFromBAM --outWigType wiggle --outWigStrand Stranded --outWigNorm RPM

When intermediate bams are created, one readsPerGene.tab & LogFinal.out are created each time. I add the different values inside these files to compute a final statistics file for the final bam obtained after merge of the several intermediate bam.

Is that also a correct way to do ? Or do I need to first concat all fastq (R1 & R2 separately) to do alignment.

gene.png

Alexander Dobin

unread,

Sep 21, 2017, 10:40:30 AM9/21/17

to rna-star

Hi Jean-Philippe,

the math is a bit more complicated. The RPM value in the wiggle file is "per-base" (i.e. local), while the reads counts are aggregated over the whole gene.

You can think of wiggle signal as the number of reads overlapping a given base in the genome, normalized to the number of mapped reads.

Since a gene can fit GeneLength/ReadLength number of reads, RPMgene~RPMwiggle*GeneLength/ReadLength.

Note that for PE reads ReadLength=2*MateLength. I think this will reconcile your numbers.

Cheers

Alex

On Wednesday, September 20, 2017 at 1:22:46 PM UTC-4, Jean-Philippe Villemin wrote:

Ok, I understand but still I have a huge difference with the browser !

ENSG00000149182(ARFGAP2) has 4133 reads. (column 4 readsPerGene) Gene is on neg strand.

60 014 431 is the number of unique reads overlapping genes.
(
4133 / 60 014 431) * 1 000 000 = 68,86

91 182 831 is the total number of unique reads.
(
4133 /91 182 831) * 1 000 000 = 45,32

In the ucsc browser, for the corresponding strand,I get 3.52601, so it don't get it. Do you see anything in my numbers that doesn't make sense ?

Alexander Dobin

unread,

Sep 21, 2017, 10:48:36 AM9/21/17

to rna-star

Hi Jean-Philippe,

what parameters are you running STAR with? The answers to your updated questions may depend on it.

I am concerned about the difference between STAR counts and htseq counts - this is not supposed to happen.

For RPM normalization STAR uses the same numbers of unique (and multimapped if needed) reads as reported in Log.final.out - you do not need to extract these numbers with samtools.

Cheers

Alex

Jean-Philippe Villemin

unread,

Sep 21, 2017, 11:09:25 AM9/21/17

to rna-star

I have two groups of paired fastq (R1 & R2)

"GroupeA": {

"R1": "T1_R1_200.fastq.gz",

"R2": "T1_R2_200.fastq.gz"}

"GroupeB": {

"R1": "T1__R1_201.fastq.gz",

"R2": "T1__R2_201.fastq.gz" }

I'm applying star on each group. (one for groupA and then groupB). I write only the important options used.

STAR
                +" --readFilesIn GroupeAr1 GroupeAr2"
                +" --readFilesCommand zcat"     
                +" --outSAMtype BAM SortedByCoordinate "   
                +" --quantMode GeneCounts"       
                +" --outSAMattrIHstart 0 "

So here I generated the 2 logfinal.out and 2 readsperGene.tab. For example here I will extract for the same gene , the number of reads mapped in all readperGene.tab, to add them and get the global value . Wrong ? Also do that to get total uniquely mapped reads using intermediates logfinal.out.

Then I merge them using samtools :

samtools merge -r final.bam A.bam B.bam

I finaly have one bam that I will pass to :

STAR --runMode inputAlignmentsFromBAM"    \
    #+" --quantMode GeneCounts "        \
    +" --outWigType wiggle "        \
    +" --outWigStrand Stranded"  \
    +" --outWigNorm RPM"

Does the fact that I'm doing that on an unsorted bam can influe the wig ouput ?

Note for htseq counts : I have test on an unsorted bam. I was thinking that after the merge the bam is still sorted but that not the case. So I re sorted the final bam (samtools sort) and now I try the following if it match :

htseq-count -f bam -s reverse -r pos T6rep2.bam /home/jean-philippe.villemin/mount_archive2/commun.luco/ref/genes/GRCh38_PRIM_GENCODE_R25/gencode.v25.primary_assembly.annotation.gtf > genes.htseq-con3.txt

Jean-Philippe Villemin

unread,

Sep 21, 2017, 11:26:30 AM9/21/17

to rna-star

In this formula, RPMgene~RPMwiggle*GeneLength/ReadLength, RPMwiggle is the value I see on the browser ?

 3.25

RPMgene is the one I'm trying to compute naively ? Right ? 45,32

Ok for my example ENSG00000149182(ARFGAP2) GeneLength is 12 826 bp. My read length is 120 *2 = 240 .

3.25 * (12 826 / 240 ) = 173 ...humm not 45,32

Sorry but I'm lost.

Jean-Philippe Villemin

unread,

Sep 22, 2017, 2:34:34 AM9/22/17

to rna-...@googlegroups.com

Using htseq-count after samtools sort make me found the same star counts 4133.

So is sorting the bam necessary before creating the wig ?

UPDATE : No that doesn't change how the wig look like...I tested on sorted and unsorted bam.

But I still doesn't see well how to retrieve the computed value. I understand that is different from what I was doing with read counts overlapping the gene. Here, for each nucleotide your normalise by all uniq reads mapped x 1000000 so value in ucsc sould be the max value of these rations for one base coordinate in my gene.

Alexander Dobin

unread,

Sep 22, 2017, 1:04:21 PM9/22/17

to rna-star

Hi Jean-Philippe,

it's good that htseq and STAR counts agree, this is how STAR counting was designed.

the gene length in my formula meant just the exons, while the ~12kb value contains introns.

If you exclude introns, you would probably get something close to 2-3kb, which will bring the values closer together.

Gene RPM and wiggle (base) RPM are related quantities, but they cannot be derived one from another precisely.

The best you can do to get the gene RPM is to sum the wiggle signal over all exonic bases and divide that by the read length.

Even that is not a precise calculation, since (i) the mapped read length is not constant and may be < input read length; (ii) the reads that ovelrap exon/intron boundary are still counted in the gene RPM, which means that you would also need to sum over the intronic bases near the exon boundaries.