Slow/hanged CoverageBed

Pratap, Abhishek

unread,

Feb 26, 2010, 10:24:03 PM2/26/10

to bedtools...@googlegroups.com

Hi Aaron

I am trying to assess coverage for a bed file which I must say is really big ~83 million reads. The coverageBed command is taking a long time to run. Is this on the expected lines or something is not correct.

Thanks,
-Abhi

coverageBed -a WT_BASAL_ALL.bed -b /local/seq_archive/reference/BEDformat/mouse_mm9mm9_chr_size.bed > WT_BASAL_ALL_COVERAGE.txt

>ps
17744 pts/9    05:49:06 coverageBed
17895 pts/9    05:37:12 coverageBed

>top
PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
17744 apratap   25   0 1231m 1.2g 904 R 99.7 3.8 355:15.09 coverageBed
17895 apratap   25   0 1556m 1.5g 904 R 100.1 4.8 343:30.26 coverageBed

Aaron Quinlan

unread,

Feb 27, 2010, 9:23:33 AM2/27/10

to bedtools...@googlegroups.com

Is it correct that your "-a" file (WT_BASAL_ALL.bed) has 83 million entries? If so, how many features are in mouse_mm9mm9_chr_size.bed? Could you provide the first few lines of each, so that I can get a snese of what you're doing?

In theory, the only way I could see it taking that long is if you have a BED feature for every single base in your genome for the -b file.

Aaron

Pratap, Abhishek

unread,

Feb 27, 2010, 11:51:48 AM2/27/10

to bedtools...@googlegroups.com

Hi Aaron

Basically the aim is to get chromosomal level coverage for these reads.

Yes WT_BASAL_ALL.bed has 83 million entries
Only 22 entries in the mm9_chr_size.bed

Here it is

Few lines fom WT_BASAL_ALL.bed
chr10   3129268 3129308 HWI-EAS397:1:72:765:592#0       3       -
chr10   3129348 3129388 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129403 3129443 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129458 3129498 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129486 3129526 HWI-EAS397:1:72:765:592#0       3       -

Few lines from file b (mm9_chr_size.bed)
chr10   0       129993255
chr11   0       121843856
chr12   0       121257530
chr13   0       120284312
chr14   0       125194864
chr15   0       103494974
chr16   0       98319150
chr17   0       95272651
chr18   0       90772031

The processing is still going on:
17744 pts/9    19:24:35 coverageBed
17895 pts/9    19:12:37 coverageBed

-Abhi

Aaron Quinlan

unread,

Feb 27, 2010, 11:55:24 AM2/27/10

to bedtools...@googlegroups.com

Ah, I see. What you are trying to do is best achieved with genomeCoverageBed. WT_BASAL_ALL.bed would be your "-i" file and a modified version of mm9_chr_size.bed would be your "-g" file. Your "genome" file should look like this:

chr10 129993255
chr11 121843856
chr12 121257530
chr13 120284312
chr14 125194864

Aaron

Pratap, Abhishek

unread,

Feb 27, 2010, 11:58:17 AM2/27/10

to bedtools...@googlegroups.com

Grt. Just wondering if you could briefly detail why coverageBed wont be a good choice. I am sure genomeCoverageBed exists for a reason but just curious to know why coverageBed cant do the same thing.

Thanks!
-Abhi

On Fri, Feb 26, 2010 at 10:24 PM, Pratap, Abhishek <APr...@som.umaryland.edu <x-msg://35/APr...@som.umaryland.edu> > wrote:

Pratap, Abhishek

unread,

Feb 27, 2010, 8:16:49 PM2/27/10

to bedtools...@googlegroups.com

Hi Aaron

I tried genomeCoverage anyhow. What it gives me is a binned data which is expected. In my case I just need the total coverage xx/chromosome. I know this could be easily calculated by counting reads mapped per chromosome but just wondering if this method exists in BEDtools.

-A

Aaron Quinlan

unread,

Mar 1, 2010, 8:17:11 AM3/1/10

to bedtools...@googlegroups.com

Hi Abhi,

Sorry for not responding sooner, I've been at a conference until yesterday.

There are a couple of options here. First, you could use genomeCoverageBed and sum the coverage of all of the non-zero bases to get the fraction of the chromosome that is covered by >=1, >=2, ..., >=3 reads. Alternatively, if you simply want a count of the number of reads that aligned to each chromosome, you could use the following UNIX pipeline to do this:

$ awk '{print $1}' <READS.BED> | sort | uniq -c

Also, I've had some time to reflect on the slowness you reported for coverageBed. In fact, the original experiment you did (create a BED entry for the size of each chromosome) should have worked. I suspect the slowness you experienced is a consequence of the way the coverageBed algorithm works. I try to compute the depth _and_ breadth of coverage for each BED interval in the "-b" file. Consequently, I use a data structure that becomes rather inefficient when the features in the "-b" file are large (e.g., the size of mammalian chromosomes). My original intent for coverageBed was for users to compute coverage for smaller "windows" tiling a chromosome, as the data structure I use work quite nicely in those cases.

I will give some thought to ways to make this more generally useful even for very large BED features.

I hope this helps.

Cheers,

Aaron

Aaron Quinlan, Ph.D.
NRSA Postdoctoral Fellow
Hall Laboratory

University of Virginia

Biochem. & Mol. Genetics

aaronq...@gmail.com

Pratap, Abhishek

unread,

Mar 1, 2010, 1:18:00 PM3/1/10

to bedtools...@googlegroups.com

Thanks for the detailing the issue Aaron. It would be actually nice to have coveragebed work on long features but for now I can just grep per chr to calculate rough coverage stats which should suffice.