Dear Sanjay,
thanks for using deepTools! Here are our answers to your questions; we hope they help:
1. Regarding your approach using bamCoverage and --scaleFactor:
- I assume that you're not really using just 3 as the scaling factor, but the precise value based on the actual numbers of mapped reads. Just checking :)
- The way you did it was to scale the smaller sample up to match the more deeply sequenced sample. To be honest, I would recommend doing it the other way around and scaling the more deeply sequenced sample down. The reason is that it is difficult to distinguish whether a region with no reads at all in the shallowly sequenced sample lacks coverage due to low sequencing depth or due to lack of mappability (in which case the region will not be covered in the more deeply sequenced sample either). Regions with zero coverage in the less deeply sequenced sample will still have zero coverage after being multiplied, while the remaining regions will get "artificially" inflated read numbers. That's why, if you stick to your procedure, I would recommend multiplying your control sample by 1/3 instead (see the sketch below).
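In case it's useful, here is a minimal Python sketch of how one could derive that down-scaling factor from the mapped read counts (the counts below are made-up placeholders, not your actual numbers):

    # Hypothetical mapped read counts (e.g. from `samtools flagstat`);
    # replace them with your actual values.
    reads_deep = 60_000_000     # the more deeply sequenced (control) sample
    reads_shallow = 20_000_000  # the shallowly sequenced sample

    # Scale the deeper sample DOWN to match the shallower one,
    # then pass the result to bamCoverage via --scaleFactor.
    scale_factor = reads_shallow / reads_deep  # = 1/3 for these numbers
    print(f"--scaleFactor {scale_factor:.4f}")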
2. Regarding your question about RPKM and 1x sequencing depth normalization:
Let's assume the following example sample:
- mouse sample (effective genome size of the mouse: ~ 2.15057 x 10^9 bp)
- 50 million mapped reads
- average size of sequenced DNA fragments: 200 bp
- bin size for the bigWig: 25 bp
- 2 example bins: no. 1 with 10 overlapping reads, no. 2 with 12 overlapping reads
RPKM takes the bin size and the number of mapped reads into consideration; it does not care about the genome size:
RPKM (per bin) = number of reads per bin / ( number of mapped reads (in millions) * bin length (in kb) )
For the example above, this would mean:
RPKM(bin1) = 10 / (50 * 0.025) = 8
For the second bin: RPKM(bin2) = 12 / (50 * 0.025) = 9.6
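The same arithmetic as a quick Python sketch (just illustrating the formula above; this is not deepTools' actual code):

    def rpkm(reads_in_bin, total_mapped_reads, bin_length_bp):
        # reads per bin / (mapped reads in millions * bin length in kb)
        return reads_in_bin / ((total_mapped_reads / 1e6) * (bin_length_bp / 1e3))

    print(rpkm(10, 50_000_000, 25))  # bin no. 1 -> 8.0
    print(rpkm(12, 50_000_000, 25))  # bin no. 2 -> 9.6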
sequencing depth = (total number of mapped reads * fragment length) / effective genome size = (50 x 10^6 * 200) / (2.15057 x 10^9) ≈ 4.65
This means that, on average, every base of the genome is expected to be covered ~4.65 times; the 1x normalization divides each bin's coverage by this factor (see the sketch below).
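And a matching sketch for the 1x normalization; here I assume, as described above, that each bin's read count is simply divided by the genome-wide sequencing depth:

    def depth_normalized(reads_in_bin, total_mapped_reads, fragment_length_bp, genome_size_bp):
        # expected genome-wide coverage if the fragments were spread out evenly
        depth = total_mapped_reads * fragment_length_bp / genome_size_bp  # ~4.65 here
        return reads_in_bin / depth

    print(depth_normalized(10, 50e6, 200, 2.15057e9))  # bin no. 1 -> ~2.15
    print(depth_normalized(12, 50e6, 200, 2.15057e9))  # bin no. 2 -> ~2.58

Note how, unlike RPKM, this normalization takes the fragment length and the effective genome size into account.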