Combining/Comparing BedGraph files

Assaf Gordon

unread,

Jul 14, 2010, 5:12:37 PM7/14/10

to bedtools...@googlegroups.com

Hi Aaron, all,

I have a need to combine & compare bedgraph files.
Any one know of a readily available too for that ?

If not, I'll have to write it.
Aaron, will you be interested in adding such a tool to your collection ?

-gordon

Aaron Quinlan

unread,

Jul 14, 2010, 9:49:10 PM7/14/10

to bedtools...@googlegroups.com

Hi Assaf,
I am interested to know how you would like to combine and compare the bedgraph files. Could you give a brief example?

Best,
Aaron

Assaf Gordon

unread,

Jul 15, 2010, 12:13:05 AM7/15/10

to bedtools...@googlegroups.com

Hi Aaron,

On 07/14/2010 09:49 PM, Aaron Quinlan wrote:
> I am interested to know how you would like to combine and compare the
> bedgraph files. Could you give a brief example?

I'm deliberating between two options:

First, the 'naive' (but complicated) way:
Write a program that takes multiple BedGraph files as input,
and outputs a tabular file where each row is a common chrom/start/end, and the columns are the coverage values from each BedGraph.
Since bedGraph are assumed to be sorted and non-overlapping, the only required logic is to find the smallest common region that is shared by each file.

Example:
file 1:
chr1 1000 1500 10
chr1 2000 2100 10

file 2:
chr1 900 1600 10
chr1 1700 2050 50

combined output:
chrom start end file1 file2
chr1 900 1000 0 10
chr1 1000 1500 10 10
chr1 1500 1600 0 10
chr1 1700 2000 0 50
chr1 2000 2050 10 50
chr1 2050 2100 10 0

Second, the 'sliding window' way:
Calculate a sliding-window (probably, with maximum or average coverage over a window), allowing user to specify the window-size and and window-step.
then, having a program go over each BedGraph file and calculate a sliding-window function over it, something like:

genomeCoverageBed -i FILE1.BED -bg | slidingWindowBedGraph -op max -size 10000 -step 5000 > window1.bg
genomeCoverageBed -i FILE2.BED -bg | slidingWindowBedGraph -op max -size 10000 -step 5000 > window2.bg
paste window1.bg window2.bg > combined.bed

to have a window of 10K nt.
each BedGraph would have a separate window file, and since all coordinates in the window files will be the same, combining them is trivial.

The advantage of the first method is that it's completely loseless - no information is lost, coverage is retained accurately for all combined samples.
The advantage of the second method is that it is simpler to program and scales well.

Comments and further ideas are very welcomed.

-gordon

Aaron Quinlan

unread,

Jul 15, 2010, 8:08:26 AM7/15/10

to bedtools...@googlegroups.com

Thanks for the example Gordon, your problem at hand is much more clear. The latter approach is clearly _much_ simpler to write. However, as you say, you could potentially lose a substantial amount of information, depending on the fine scale changes in coverage and the choice of window size. The first option is much more general (it could be used for the -name field, or any field for that matter, not just the score in bedgraph files). In essence, it's very much like a combined "full (inner and outer) SQL join" of multiple BED files, where the "join" key is loose in the sense that you are just looking for >=1bp of overlap.

I have a couple of ideas for how to do this rather quickly and easily, but they are beyond my email abilities at the moment. Perhaps we could chat on the phone soon to discuss? If interested, email me privately.

In short, I can seen several ways in which such a tool (actually both options) would be useful and would like to help develop them. Option 1 would have, in fact, saved me a ton of time and errors in a experiment.

Best,
Aaron

Assaf Gordon

unread,

Jul 15, 2010, 4:46:31 PM7/15/10

to bedtools...@googlegroups.com

Hi Aaron,

Here's a working alpha version - supporting only BedGraph files with a single chromosome (i.e. chrom switch is not detected/handled).

http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2

Assuming three input files:

$ cat /home/gordon/temp/2.bg
chr1 1000 1500 10
chr1 2000 2100 20

$ cat /home/gordon/temp/3.bg
chr1 900 1600 60
chr1 1700 2050 50

$ cat /home/gordon/temp/4.bg
chr1 1980 2070 80
chr1 2090 2100 20

The combined output is:
$ ./mergeBedGraph --header ~/temp/2.bg ~/temp/3.bg ~/temp/4.bg
chrom start end 2 3 4
900 1000 0 60 0
1000 1500 10 60 0
1500 1600 0 60 0
1600 1700 0 0 0
1700 1980 0 50 0
1980 2000 0 50 80
2000 2050 20 50 80
2050 2070 20 0 80
2070 2090 20 0 0
2090 2100 20 0 20

Hopefully I'll have a full working version tonight.

-gordon

Aaron Quinlan

unread,

Jul 15, 2010, 5:17:00 PM7/15/10

to bedtools...@googlegroups.com

Hi Gordon,
Nice work! I am looking forward to browsing the code to get a sense of your approach. One question: do you see a need to also report the regions without coverage in any of the files? For example, from your files 0-900 has no coverage. It seems that this might also be useful to know so that it's a complete BEDGRAPH for the whole genome.

What do you think?
Aaron

Assaf Gordon

unread,

Jul 15, 2010, 5:18:13 PM7/15/10

to bedtools...@googlegroups.com

I'll add that as an option (similar to "-bga").

Assaf Gordon

unread,

Jul 15, 2010, 6:49:53 PM7/15/10

to bedtools...@googlegroups.com

Here's a working beta version (same link as before, updated file):

http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2

Comments are very welcomed,
-gordon

====================
the usage information
=======================
$ mergeBedGraph --help
Program: mergeBedGraph (v2.8.0)
Summary: merges multiple BedGraph files into one file, allowing coverage comparison.

Usage: mergeBedGraph [OPTIONS] FILE1 FILE2 .. FILEn
or
mergeBedGraph [OPTIONS] --names FILE1 NAME1 FILE2 NAME2 .. FILEn NAMEn

FILE1, FILE2, FILEn are BedGraph files (sorted, non-overlapping),
as produced by 'genomeCoverageBed -bg' .

Options:
--header Print a header line (chrom/start/end + names of each file).

--names Each filename is followed by a descriptive name of the file.
The name will be printed in the header line (instead of the filename).

--empty Report empty regions (start/end intervals with ZERO coverage in all files).
Requires the '--genome FILE' parameter.

--genome FILE
use genome file FILE to calculate empty regions.
genome file should contain two columns:
chrom<TAB>size
for each chromosome.

--examples Show detailed usage examples.

===================================
And detailed examples:
===================================
$ mergeBedGraph --examples
Program: mergeBedGraph (v2.8.0)
Summary: merges multiple BedGraph files into one file, allowing coverage comparison.

Usage: mergeBedGraph [OPTIONS] FILE1 FILE2 .. FILEn
or
mergeBedGraph [OPTIONS] --names FILE1 NAME1 FILE2 NAME2 .. FILEn NAMEn
Examples:

Input files:

$ cat 1.bg

chr1 1000 1500 10
chr1 2000 2100 20

$ cat 2.bg

chr1 900 1600 60
chr1 1700 2050 50

$ cat 3.bg

chr1 1980 2070 80
chr1 2090 2100 20

$ cat sizes.txt
chr1 5000

Merging the files:

$ mergeBedGraph 1.bg 2.bg 3.bg
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20

Merging the files, with a header line (titles are the file names):

$ mergeBedGraph --header 1.bg 2.bg 3.bg
chrom start end 1 2 3
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20

Merging the files, with a header line and custom names:

$ mergeBedGraph --header --name 1.bg WT-1 2.bg WT-2 3.bg KO-1
chrom start end WT-1 WT-2 KO-1
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20

Merging the files, showing empty regions:

$ mergeBedGraph --header --empty --genome sizes.txt 1.bg 2.bg 3.bg
chrom start end 1 2 3
chr1 0 900 0 0 0
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1600 1700 0 0 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20
chr1 2100 5000 0 0 0

Aaron Quinlan

unread,

Jul 16, 2010, 9:07:15 AM7/16/10

to bedtools...@googlegroups.com

Hi Gordon,
This is looking really good. One thing I could see being useful is the ability to specify a directory housing all of the bedgraph files that one wants to merge. Something like:

mergeBedGraph -path myDir/

Along those lines, have you tested the tool with many (say 20 or 50 or 100) files? Just curious to know how it behaves at the extreme.

Thanks again for writing this, I think it will be very useful to many folks.

Aaron

Assaf Gordon

unread,

Jul 16, 2010, 1:32:07 PM7/16/10

to bedtools...@googlegroups.com

No problem, here's an update:
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2

with support of "--path" argument.

A small technical change:
the DEPTH variable is now a string (instead of a uint32_t),
so that any value in the input BedGraph can be merged.
It does no affect the results because no arithmatic operations are done on the value itself.

This allows merging BedGraph files that contain any kind of value in the fourth column (e.g. names or floating-point values).

The BEDGRAPH class is now a template which takes the DEPTH type as a template parameter.

-gordon

Aaron Quinlan

unread,

Jul 16, 2010, 2:21:09 PM7/16/10

to bedtools...@googlegroups.com

Excellent---a string "score" is much more flexible.

The other thing I can think of might be to limit the number of BEDGRAPH files that can be merged. POSIX systems differ in the maximum number of open file handles and some sysadmins can be very strict in this sense.

There are two approaches here:

1. Don't set a maximum and let the program crash if the max exceeds the system limit.
2. Set a maximum in the interest of preventing crashes and limiting questions like "What does this error mean?"

Given that it is unlikely that people are going to be merging more than 1K or 5K files anytime soon, perhaps option 1 is best.

Does anyone else have an opinion?

Aaron

Assaf Gordon

unread,

Jul 16, 2010, 3:18:23 PM7/16/10

to bedtools...@googlegroups.com

Aaron Quinlan wrote, On 07/16/2010 02:21 PM:
> The other thing I can think of might be to limit the number of
> BEDGRAPH files that can be merged. POSIX systems differ in the
> maximum number of open file handles and some sysadmins can be very
> strict in this sense.
>
> There are two approaches here:
>
> 1. Don't set a maximum and let the program crash if the max exceeds
> the system limit. 2. Set a maximum in the interest of preventing
> crashes and limiting questions like "What does this error mean?"
>
> Given that it is unlikely that people are going to be merging more
> than 1K or 5K files anytime soon, perhaps option 1 is best.
>
> Does anyone else have an opinion?

I think option 1 is the only acceptable option.
It's not any program's place to impose arbitrary limits.

The but the program should not crash, it should report a reasonable error to the user.

If the "ifstream" failed to open the file, the program should print the errno (and a description) to the user, explaining what happened.
If a file failed to open because too many files are opened, the kernel will return the correct error code (e.g. EMFILE = Too many files open) and with strerror() the user will get a descriptive error.

In other project I use:

===
std::string string_error(int errnum)
{
char buf[2048];
buf[2047]=0;
#if (_POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600) && ! _GNU_SOURCE
int i = strerror_r(errnum, buf, sizeof(buf)-1);
std::string s(buf);
#else
char* str = strerror_r(errnum, buf, sizeof(buf)-1);
std::string s(str);
#endif
return s;
}

string filename = XXX ;
ifstream ifs(filename, ios::in);
if ( ! ifs.good() ) {
cerr << "Error: failed to open input file '" << filename << "': " << string_error(errno) << endl;
exit(1);
}
===

And then the user will see the real reason (i.e. the O/S reason) for the file not being opened.

The pure C equivalent is the err() function from <err.h> .

I will add that to mergeBedGraph.

-gordon

Aaron Quinlan

unread,

Jul 16, 2010, 3:25:09 PM7/16/10

to bedtools...@googlegroups.com

Perfect.

Assaf Gordon

unread,

Jul 16, 2010, 6:49:42 PM7/16/10

to bedtools...@googlegroups.com

Hi Aaron, all,

The updated version of mergeBedGraph is available here:
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2

Proper error messages are shown to the user when I/O errors occur.

A galaxy tool for mergeBedGraph (and genomeCoverageBed, too) is available here:
http://cancan.cshl.edu/labmembers/gordon/files/galaxy_BEDTools_2010_07_16.tar.bz2

You can test the galaxy version here:
http://cancan.cshl.edu/publicgalaxy/root?tool_id=BEDTools_mergeBEdGraph

-gordon

Reply all

Reply to author

Forward