Slow/hanged CoverageBed

303 views
Skip to first unread message

Pratap, Abhishek

unread,
Feb 26, 2010, 10:24:03 PM2/26/10
to bedtools...@googlegroups.com
Hi Aaron

 I am trying to assess coverage for a bed file which I must say is really big ~83 million reads. The coverageBed command is taking a long time to run. Is this on the expected lines or something is not correct.

Thanks,
-Abhi

 coverageBed -a WT_BASAL_ALL.bed -b /local/seq_archive/reference/BEDformat/mouse_mm9mm9_chr_size.bed  > WT_BASAL_ALL_COVERAGE.txt


>ps
17744 pts/9    05:49:06 coverageBed
17895 pts/9    05:37:12 coverageBed

>top
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                            
17744 apratap   25   0 1231m 1.2g  904 R 99.7  3.8 355:15.09 coverageBed                                                                                        
17895 apratap   25   0 1556m 1.5g  904 R 100.1  4.8 343:30.26 coverageBed

Aaron Quinlan

unread,
Feb 27, 2010, 9:23:33 AM2/27/10
to bedtools...@googlegroups.com
Is it correct that your "-a" file (WT_BASAL_ALL.bed) has 83 million entries?  If so, how many features are in mouse_mm9mm9_chr_size.bed?  Could you provide the first few lines of each, so that I can get a snese of what you're doing?

In theory, the only way I could see it taking that long is if you have a BED feature for every single base in your genome for the -b file. 

Aaron

Pratap, Abhishek

unread,
Feb 27, 2010, 11:51:48 AM2/27/10
to bedtools...@googlegroups.com
Hi Aaron

Basically the aim is to get chromosomal level coverage for these reads.

Yes WT_BASAL_ALL.bed has 83 million entries
Only 22 entries in the mm9_chr_size.bed

Here it is

Few lines fom WT_BASAL_ALL.bed
chr10   3129268 3129308 HWI-EAS397:1:72:765:592#0       3       -
chr10   3129348 3129388 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129403 3129443 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129458 3129498 HWI-EAS397:6:42:317:1068#0      1       +
chr10   3129486 3129526 HWI-EAS397:1:72:765:592#0       3       -

Few lines from file b (mm9_chr_size.bed)
chr10   0       129993255
chr11   0       121843856
chr12   0       121257530
chr13   0       120284312
chr14   0       125194864
chr15   0       103494974
chr16   0       98319150
chr17   0       95272651
chr18   0       90772031


The processing is still going on:
17744 pts/9    19:24:35 coverageBed
17895 pts/9    19:12:37 coverageBed


-Abhi

Aaron Quinlan

unread,
Feb 27, 2010, 11:55:24 AM2/27/10
to bedtools...@googlegroups.com
Ah, I see.  What you are trying to do is best achieved with genomeCoverageBed.  WT_BASAL_ALL.bed would be your "-i" file and a modified version of mm9_chr_size.bed would be your "-g" file.  Your "genome" file should look like this:
chr10 129993255
chr11 121843856
chr12 121257530
chr13 120284312
chr14 125194864


Aaron

Pratap, Abhishek

unread,
Feb 27, 2010, 11:58:17 AM2/27/10
to bedtools...@googlegroups.com
Grt. Just wondering if you could briefly detail why coverageBed wont be a good choice. I am sure genomeCoverageBed exists for a reason but just curious to know why coverageBed cant do the same thing.


Thanks!
-Abhi
On Fri, Feb 26, 2010 at 10:24 PM, Pratap, Abhishek <APr...@som.umaryland.edu <x-msg://35/APr...@som.umaryland.edu> > wrote:

Pratap, Abhishek

unread,
Feb 27, 2010, 8:16:49 PM2/27/10
to bedtools...@googlegroups.com

Hi Aaron

 

I tried genomeCoverage anyhow. What it gives me is a binned data which is expected. In my case I just need the total coverage xx/chromosome. I know this could be easily calculated by counting reads mapped per chromosome but just wondering if this method exists in BEDtools.

 

-A

Aaron Quinlan

unread,
Mar 1, 2010, 8:17:11 AM3/1/10
to bedtools...@googlegroups.com
Hi Abhi,

Sorry for not responding sooner, I've been at a conference until yesterday.  

There are a couple of options here.  First, you could use genomeCoverageBed and sum the coverage of all of the non-zero bases to get the fraction of the chromosome that is covered by >=1, >=2, ..., >=3 reads.  Alternatively, if you simply want a count of the number of reads that aligned to each chromosome, you could use the following UNIX pipeline to do this:

$ awk '{print $1}' <READS.BED> | sort | uniq -c

Also, I've had some time to reflect on the slowness you reported for coverageBed.  In fact, the original experiment you did (create a BED entry for the size of each chromosome) should have worked.  I suspect the slowness you experienced is a consequence of the way the coverageBed algorithm works.  I try to compute the depth _and_ breadth of coverage for each BED interval in the "-b" file.  Consequently, I use a data structure that becomes rather inefficient when the features in the "-b" file are large (e.g., the size of mammalian chromosomes).  My original intent for coverageBed was for users to compute coverage for smaller "windows" tiling a chromosome, as the data structure I use work quite nicely in those cases.

I will give some thought to ways to make this more generally useful even for very large BED features.

I hope this helps.  

Cheers,
Aaron

Aaron Quinlan, Ph.D.
NRSA Postdoctoral Fellow
Hall Laboratory
University of Virginia
Biochem. & Mol. Genetics

Pratap, Abhishek

unread,
Mar 1, 2010, 1:18:00 PM3/1/10
to bedtools...@googlegroups.com

Thanks for the detailing the issue Aaron.  It would be actually nice to have coveragebed work on long features but for now I can just grep per chr to calculate rough coverage stats which should suffice.

 

Thanks!

-Abhi

Reply all
Reply to author
Forward
0 new messages