I have a need to combine & compare bedgraph files.
Any one know of a readily available too for that ?
If not, I'll have to write it.
Aaron, will you be interested in adding such a tool to your collection ?
-gordon
Best,
Aaron
On 07/14/2010 09:49 PM, Aaron Quinlan wrote:
> I am interested to know how you would like to combine and compare the
> bedgraph files. Could you give a brief example?
I'm deliberating between two options:
First, the 'naive' (but complicated) way:
Write a program that takes multiple BedGraph files as input,
and outputs a tabular file where each row is a common chrom/start/end, and the columns are the coverage values from each BedGraph.
Since bedGraph are assumed to be sorted and non-overlapping, the only required logic is to find the smallest common region that is shared by each file.
Example:
file 1:
chr1 1000 1500 10
chr1 2000 2100 10
file 2:
chr1 900 1600 10
chr1 1700 2050 50
combined output:
chrom start end file1 file2
chr1 900 1000 0 10
chr1 1000 1500 10 10
chr1 1500 1600 0 10
chr1 1700 2000 0 50
chr1 2000 2050 10 50
chr1 2050 2100 10 0
Second, the 'sliding window' way:
Calculate a sliding-window (probably, with maximum or average coverage over a window), allowing user to specify the window-size and and window-step.
then, having a program go over each BedGraph file and calculate a sliding-window function over it, something like:
genomeCoverageBed -i FILE1.BED -bg | slidingWindowBedGraph -op max -size 10000 -step 5000 > window1.bg
genomeCoverageBed -i FILE2.BED -bg | slidingWindowBedGraph -op max -size 10000 -step 5000 > window2.bg
paste window1.bg window2.bg > combined.bed
to have a window of 10K nt.
each BedGraph would have a separate window file, and since all coordinates in the window files will be the same, combining them is trivial.
The advantage of the first method is that it's completely loseless - no information is lost, coverage is retained accurately for all combined samples.
The advantage of the second method is that it is simpler to program and scales well.
Comments and further ideas are very welcomed.
-gordon
I have a couple of ideas for how to do this rather quickly and easily, but they are beyond my email abilities at the moment. Perhaps we could chat on the phone soon to discuss? If interested, email me privately.
In short, I can seen several ways in which such a tool (actually both options) would be useful and would like to help develop them. Option 1 would have, in fact, saved me a ton of time and errors in a experiment.
Best,
Aaron
Here's a working alpha version - supporting only BedGraph files with a single chromosome (i.e. chrom switch is not detected/handled).
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2
Assuming three input files:
$ cat /home/gordon/temp/2.bg
chr1 1000 1500 10
chr1 2000 2100 20
$ cat /home/gordon/temp/3.bg
chr1 900 1600 60
chr1 1700 2050 50
$ cat /home/gordon/temp/4.bg
chr1 1980 2070 80
chr1 2090 2100 20
The combined output is:
$ ./mergeBedGraph --header ~/temp/2.bg ~/temp/3.bg ~/temp/4.bg
chrom start end 2 3 4
900 1000 0 60 0
1000 1500 10 60 0
1500 1600 0 60 0
1600 1700 0 0 0
1700 1980 0 50 0
1980 2000 0 50 80
2000 2050 20 50 80
2050 2070 20 0 80
2070 2090 20 0 0
2090 2100 20 0 20
Hopefully I'll have a full working version tonight.
-gordon
What do you think?
Aaron
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2
Comments are very welcomed,
-gordon
====================
the usage information
=======================
$ mergeBedGraph --help
Program: mergeBedGraph (v2.8.0)
Summary: merges multiple BedGraph files into one file, allowing coverage comparison.
Usage: mergeBedGraph [OPTIONS] FILE1 FILE2 .. FILEn
or
mergeBedGraph [OPTIONS] --names FILE1 NAME1 FILE2 NAME2 .. FILEn NAMEn
FILE1, FILE2, FILEn are BedGraph files (sorted, non-overlapping),
as produced by 'genomeCoverageBed -bg' .
Options:
--header Print a header line (chrom/start/end + names of each file).
--names Each filename is followed by a descriptive name of the file.
The name will be printed in the header line (instead of the filename).
--empty Report empty regions (start/end intervals with ZERO coverage in all files).
Requires the '--genome FILE' parameter.
--genome FILE
use genome file FILE to calculate empty regions.
genome file should contain two columns:
chrom<TAB>size
for each chromosome.
--examples Show detailed usage examples.
===================================
And detailed examples:
===================================
$ mergeBedGraph --examples
Program: mergeBedGraph (v2.8.0)
Summary: merges multiple BedGraph files into one file, allowing coverage comparison.
Usage: mergeBedGraph [OPTIONS] FILE1 FILE2 .. FILEn
or
mergeBedGraph [OPTIONS] --names FILE1 NAME1 FILE2 NAME2 .. FILEn NAMEn
Examples:
Input files:
$ cat 1.bg
chr1 1000 1500 10
chr1 2000 2100 20
$ cat 2.bg
chr1 900 1600 60
chr1 1700 2050 50
$ cat 3.bg
chr1 1980 2070 80
chr1 2090 2100 20
$ cat sizes.txt
chr1 5000
Merging the files:
$ mergeBedGraph 1.bg 2.bg 3.bg
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20
Merging the files, with a header line (titles are the file names):
$ mergeBedGraph --header 1.bg 2.bg 3.bg
chrom start end 1 2 3
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20
Merging the files, with a header line and custom names:
$ mergeBedGraph --header --name 1.bg WT-1 2.bg WT-2 3.bg KO-1
chrom start end WT-1 WT-2 KO-1
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20
Merging the files, showing empty regions:
$ mergeBedGraph --header --empty --genome sizes.txt 1.bg 2.bg 3.bg
chrom start end 1 2 3
chr1 0 900 0 0 0
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1600 1700 0 0 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20
chr1 2100 5000 0 0 0
mergeBedGraph -path myDir/
Along those lines, have you tested the tool with many (say 20 or 50 or 100) files? Just curious to know how it behaves at the extreme.
Thanks again for writing this, I think it will be very useful to many folks.
Aaron
with support of "--path" argument.
A small technical change:
the DEPTH variable is now a string (instead of a uint32_t),
so that any value in the input BedGraph can be merged.
It does no affect the results because no arithmatic operations are done on the value itself.
This allows merging BedGraph files that contain any kind of value in the fourth column (e.g. names or floating-point values).
The BEDGRAPH class is now a template which takes the DEPTH type as a template parameter.
-gordon
The other thing I can think of might be to limit the number of BEDGRAPH files that can be merged. POSIX systems differ in the maximum number of open file handles and some sysadmins can be very strict in this sense.
There are two approaches here:
1. Don't set a maximum and let the program crash if the max exceeds the system limit.
2. Set a maximum in the interest of preventing crashes and limiting questions like "What does this error mean?"
Given that it is unlikely that people are going to be merging more than 1K or 5K files anytime soon, perhaps option 1 is best.
Does anyone else have an opinion?
Aaron
I think option 1 is the only acceptable option.
It's not any program's place to impose arbitrary limits.
The but the program should not crash, it should report a reasonable error to the user.
If the "ifstream" failed to open the file, the program should print the errno (and a description) to the user, explaining what happened.
If a file failed to open because too many files are opened, the kernel will return the correct error code (e.g. EMFILE = Too many files open) and with strerror() the user will get a descriptive error.
In other project I use:
===
std::string string_error(int errnum)
{
char buf[2048];
buf[2047]=0;
#if (_POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600) && ! _GNU_SOURCE
int i = strerror_r(errnum, buf, sizeof(buf)-1);
std::string s(buf);
#else
char* str = strerror_r(errnum, buf, sizeof(buf)-1);
std::string s(str);
#endif
return s;
}
string filename = XXX ;
ifstream ifs(filename, ios::in);
if ( ! ifs.good() ) {
cerr << "Error: failed to open input file '" << filename << "': " << string_error(errno) << endl;
exit(1);
}
===
And then the user will see the real reason (i.e. the O/S reason) for the file not being opened.
The pure C equivalent is the err() function from <err.h> .
I will add that to mergeBedGraph.
-gordon
The updated version of mergeBedGraph is available here:
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools-Version2.8-mergeBedGraph.tar.bz2
Proper error messages are shown to the user when I/O errors occur.
A galaxy tool for mergeBedGraph (and genomeCoverageBed, too) is available here:
http://cancan.cshl.edu/labmembers/gordon/files/galaxy_BEDTools_2010_07_16.tar.bz2
You can test the galaxy version here:
http://cancan.cshl.edu/publicgalaxy/root?tool_id=BEDTools_mergeBEdGraph
-gordon