When using coverageBed computes the depth at each bed region, how can
I normalize the depth by reads number? So I can compare to others
samples.
coverageBed -abam sorted.bam -b genome.bed
Thank you.
I think the simples thing to do would be to compute the total number of reads up front (via samtools flagstat) and then awk to normalize the output of coverageBed by this total number of reads.
For example, using RPM, and assuming, for example, that you had 10,000,000 reads for your sample. Also assume that your regions.bed file is a 3-column BED file with chr, start, end. Thus the count of reads for each interval will be the 4th column.
> TOTREADS=10000000
> coverageBed -bam sorted.bam -b regions.bed | awk '{ print $0"\t"$4/($TOTREADS/1000000) }'
or for RPKM, which normalizes the RPM value by the size of the intreval:
> TOTREADS=10000000
> coverageBed -bam sorted.bam -b regions.bed | awk '{ print $0"\t"($4/($TOTREADS/1000000))/($3-$2) }'
I hopes this helps.
Best,
Aaron
It is clear.
Thank you.
According to your suggestions, I calculated the RPKM. regions.bed is
the gene length (including exon and intron). Totally I have about 35M
paired-end reads (read1+read2=70M). When I do the calculation as
follow:
TOTREADS=70M
coverageBed -bam sorted.bam -b regions.bed | awk '{ print
$0"\t"($4/($TOTREADS/1000000))/($3-$2) }'
The maximal RPKM is about 5. most of them with a very small value.
This is different to the value from cufflink.
Do you have some suggestions?
Thank you for your time.
On Tue, Nov 15, 2011 at 9:02 AM, Aaron Quinlan <aaronq...@gmail.com> wrote:
have you tried using 70000000 instead of 70M?
Anna
Thank you.
In fact, I used 70000000. Not 70M.
In fact, I want to compare affy array and RNA-seq for the same sample.
I first get the expression value from affy array. Then I want to get
the RPKM value for the same probeset region(bed file) in affy array.
But the RPKM value in the probeset region is very small.
Thank you.
I have no experience with RNAseq, but do you think it is wise to normalize by the full length of the unspliced transcript when your reads are obtained from cDNA? Perhaps this accounts for the lower values you have observed?
Best,
Aaron