I've noticed the flurry of activity around headers in BED files that's been happening in git over the last week or so. This is very welcome as propagating metadata to output files will make tracking various things easier.
However I think the new bedtools rules around "browser" and "track" lines are overly restrictive. In fact, they are incompatible with existing BED files and conventions in other software.
There are some small example BED files here:
http://genome.ucsc.edu/goldenPath/help/customTrack.html#EXAMPLE1
and track and browser lines are documented elsewhere on that page. From these examples, we immediately see that:
1. If browser/track lines are used as documented, there can be several browser lines and a track line in a block at the start of the file -- so they can't all be on the first line;
2. Track lines are intended so that BED files can contain multiple sets of data ("tracks"), each introduced by a track line -- so track lines (and possibly also browser lines, that's unclear) can appear throughout a BED file, not just in a block at the start.
As it happens, our pipeline has, amongst others, a data file containing repeat features organised into several subsets (as it processes different types of repeat differently). I've recently changed these data files to be in BED format (with multiple tracks corresponding to the multiple types of repeat) precisely so that people can use these same reference files with bedtools -- which our bedtools users have been finding very convenient. Unfortunately this no longer works with the current bedtools in git (which currently allows at most one browser or track line, and it must be the first line of the file).
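For readers unfamiliar with the layout, here is a rough sketch of what such a multi-track file looks like and how a reader might group records by track. The track names, coordinates, and the `split_tracks` helper are all made up for illustration; this is not our actual repeat file, nor bedtools code.

```python
# Hypothetical multi-track BED content; names and coordinates are invented.
EXAMPLE = (
    "browser position chr1:1-1000\n"
    'track name=LINE description="hypothetical LINE repeats"\n'
    "chr1\t100\t200\tL1\n"
    "chr1\t300\t400\tL2\n"
    'track name=SINE description="hypothetical SINE repeats"\n'
    "chr1\t150\t250\tAlu\n"
)


def split_tracks(lines):
    """Group BED records under the track line that introduces them.

    Returns a list of (track_line_or_None, records) pairs; records that
    appear before any track line are grouped under None.
    """
    tracks = []
    records = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        first = line.split(None, 1)[0]
        # Browser lines and "#" comments carry no features, so skip them.
        if first == "browser" or line.startswith("#"):
            continue
        if first == "track":
            # A track line may appear anywhere and starts a new group.
            records = []
            tracks.append((line, records))
            continue
        if records is None:
            records = []
            tracks.append((None, records))
        records.append(line.split("\t"))
    return tracks


tracks = split_tracks(EXAMPLE.splitlines())
for track_line, recs in tracks:
    print(track_line, len(recs))
```

The first whitespace-delimited token is compared against "browser" and "track" so that a record on a chromosome whose name merely begins with one of those words is not mistaken for a metadata line.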
Hopefully bedtools could return to more relaxed rules around these metadata lines, thus restoring interoperability with extant BED files. This could be at several levels:
1. IMHO it must return to accepting track and browser lines anywhere in the input BED file, even if it just ignores them as it did previously. This much restores basic functionality with multitrack BED files (such as our repeat data file), e.g., computing intersections against any of the features in any of the tracks.
2. Ideally -header would preserve track, browser, and "#" lines in place wherever they occurred, when writing an output BED file by filtering an input BED file (e.g., slopBed). However the existing implementation slightly augmented such that header lines in a block at the start of the file were preserved while later ones were ignored would probably suffice for most purposes.
3. Example 2 on the web page above is also intriguing in that the BED records in different tracks have different numbers of fields. I'm not proposing that bedtools needs to support that!
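To make the "in place" idea in point 2 concrete, here is a minimal sketch of a slop-like filter that passes track, browser, and "#" lines through untouched wherever they occur. The function names are hypothetical, and unlike the real slopBed this does not clip ends against chromosome sizes; it only illustrates the line handling.

```python
def is_metadata(line):
    """True for browser, track, and '#' lines, and for blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return True
    return stripped.split(None, 1)[0] in ("browser", "track")


def slop_in_place(lines, pad):
    """Widen each record by `pad` bases, emitting metadata lines unchanged.

    Simplification: the start is clamped at 0, but the end is not clipped
    to the chromosome length, since no genome file is consulted here.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if is_metadata(line):
            yield line  # preserved exactly where it occurred
            continue
        fields = line.split("\t")
        fields[1] = str(max(0, int(fields[1]) - pad))
        fields[2] = str(int(fields[2]) + pad)
        yield "\t".join(fields)


sample = [
    "track name=a",
    "chr1\t5\t20\tx",
    "# note",
    "track name=b",
    "chr1\t100\t200",
]
result = list(slop_in_place(sample, 10))
```

Because each output line corresponds to exactly one input line, the track structure of the input survives in the output without any extra bookkeeping.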
I'm also curious about the "chrom" header that Brent suggested. Is this a metadata line that's described somewhere, or do we just want to be able to have a descriptive header line naming all the columns without having to write "#chrom"? If the latter, quite apart from the potential for real BED files somewhere to have chromosomes named like "chrom22" rather than just "chr22" or "22", what about the next people who come along wanting to name their columns "Chrom" or "CHROM"?
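To illustrate the worry, here is a small sketch comparing two hypothetical detection rules. I don't know how the git version actually recognises the header, so `loose_rule` is purely an assumed stand-in for "accept chrom in any capitalisation", not bedtools code.

```python
def loose_rule(line):
    """Assumed rule: a line is a column header if it starts with 'chrom',
    ignoring case (so 'Chrom' and 'CHROM' are accepted too)."""
    return line.lower().startswith("chrom")


def hash_rule(line):
    """Alternative: only lines starting with '#' are treated as headers."""
    return line.startswith("#")


header_plain = "Chrom\tStart\tEnd"
header_hash = "#chrom\tstart\tend"
record = "chrom22\t100\t200"  # a real feature on a chromosome named chrom22

# Whether each rule would treat each line as a header:
loose_results = [loose_rule(x) for x in (header_plain, header_hash, record)]
hash_results = [hash_rule(x) for x in (header_plain, header_hash, record)]
```

Under the loose rule the `chrom22` record is silently swallowed as a header, whereas the "#" rule cannot collide with any chromosome name, since comment lines are already excluded from the data.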
But in general thanks for the very useful tools -- this email isn't intended to be criticism of any kind!
Cheers,
John
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
On Dec 9, 2011 10:12 AM, "John Marshall" <jm...@sanger.ac.uk> wrote:
>
> Hello Aaron et al,
>
> I'm also curious about the "chrom" header that Brent suggested. Is this a metadata line that's described somewhere, or do we just want to be able to have a descriptive header line naming all the columns without having to write "#chrom"? If the latter, quite apart from the potential for real BED files somewhere to have chromosomes named like "chrom22" rather than just "chr22" or "22", what about the next people who come along wanting to name their columns "Chrom" or "CHROM"?
>
I agree, and #chrom seems a better solution.
Thanks for your email; it is indeed timely. Ryan Dale and I have been having a similar discussion off-list about the development version's intolerance of interspersed headers and comments (e.g., GFF lines matching /^\#/). My goal is to provide better support for true headers, especially for VCF files, where headers are vital to the interpretation of a file's content. The current implementation supports this well, at the cost of the loss of functionality you describe. The main complication is the effect of interspersed headers/comments on the new "chrom_sweep" algorithm. Mostly out of laziness, and the sense that few people used these sorts of records, I therefore decided to ditch support.
However, now that both you and Ryan have pointed out that this runs counter to legal GFF and BED files, I will go back to the drawing board and try to devise a better solution prior to the next formal release. Specifically, I plan to tackle this by implementing what you describe in option #2 (listed again below). Namely, true headers (i.e., at the start of the file prior to any data records) will be saved and optionally regurgitated if requested via a new -header option. Interspersed headers/comments will be tolerated yet ignored.
> 2. Ideally -header would preserve track, browser, and "#" lines in place wherever they occurred, when writing an output BED file by filtering an input BED file (e.g., slopBed). However the existing implementation slightly augmented such that header lines in a block at the start of the file were preserved while later ones were ignored would probably suffice for most purposes.
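In rough terms, the behaviour I have in mind looks like the sketch below. This is illustrative Python rather than the actual bedtools C++ code, and the function names and the `with_header` flag are placeholders standing in for the planned -header option.

```python
def is_metadata(line):
    """True for browser, track, and '#' lines."""
    stripped = line.strip()
    if stripped.startswith("#"):
        return True
    parts = stripped.split(None, 1)
    return bool(parts) and parts[0] in ("browser", "track")


def read_with_header(lines):
    """Split input into (header, records).

    header: metadata lines that occur before the first data record.
    records: data lines only; interspersed metadata is tolerated but dropped.
    """
    header, records = [], []
    seen_data = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if is_metadata(line):
            if not seen_data:
                header.append(line)  # a "true" header: save it
            continue                 # later ones are silently ignored
        seen_data = True
        records.append(line)
    return header, records


def write_output(header, records, with_header=False):
    """Emit records, preceded by the saved header only when requested."""
    out = list(header) if with_header else []
    out.extend(records)
    return out


sample = [
    "browser position chr1:1-1000",
    "track name=first",
    "chr1\t10\t20",
    "track name=second",
    "chr1\t30\t40",
]
header, records = read_with_header(sample)
```

The point of dropping interspersed lines at read time is that whatever consumes the records downstream (for example a sweep over sorted input) only ever sees data lines.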
Thanks again for bringing this up; I appreciate knowing that people actually use this type of BED file.
Best,
Aaron