[Request] fastaFromBed - 2 requests - ignore unavailable BED intervals

kslowi...@gmail.com

unread,

Dec 28, 2011, 3:43:13 PM12/28/11

to bedtools...@googlegroups.com

First of all, thank you, Aaron! BEDTools is fantastic, and I hope you continue producing great tools. You make my life easier.

I have two requests:

Make fastaFromBed output ignore BED intervals that are not available in the input FASTA file.
Some means of extracting intervals from a compressed FASTA file (gz, bzip2, zip, etc.).

Request 1

Let's say I have:

a FASTA file with sequences Gm01 and Gm02.
a BED file with intervals on sequences Gm01, Gm02, and Gm03.

$ ls
chromosomes.fa intervals.bed
$
$ cat chromosomes.fa
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT
$
$ cat intervals.bed
Gm01 1 5
Gm03 1 5
Gm02 1 5
$
$ fastaFromBed -bed intervals.bed -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
>Gm01:1-5
TGCA
unable to find FASTA index entry for 'Gm03'
$

Notice the BED interval for Gm03 is listed before the interval for Gm02. Also notice that the Gm02 interval is not retrieved from the FASTA file.

I'd like fastaFromBed to retrieve all intervals from the input FASTA while ignoring the intervals that are on nonexistant sequences.

At the moment, I can't think of a quick and simple way to get around this problem by using extra programs. It seems to me that changing the code is the only way. I might eventually try to do this myself and request a pull on github... I don't know if I have enough time.

Request 2

I keep my files compressed because they're huge, especially whole eukaryotic genomes. You already let me use compressed BED interval files (since v2.13.0, I think), but I can't get fastaFromBed to work with compressed FASTA files when I use named pipes.

(Same files as in request 1.)

$ bzip2 *
$ ls
chromosomes.fa.bz2 intervals.bed.bz2
$
$ fastaFromBed -bed <(bzcat intervals.bed.bz2) -fi <(bzcat chromosomes.fa.bz2) -fo stdout
index file /dev/fd/62.fai not found, generating...
could not open index file /dev/fd/62.fai for writing!
$
$ mkfifo chromosomes.fa
$ ls
chromosomes.fa chromosomes.fa.bz2 intervals.bed.bz2
$
$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7074
$ cat < chromosomes.fa
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT
[1]+ Done bzcat chromosomes.fa.bz2 > chromosomes.fa
$
$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7077
$ bzcat intervals.bed.bz2 | fastaFromBed -bed stdin -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
^C
[1]+ Done bzcat chromosomes.fa.bz2 > chromosomes.fa
$ cat < chromosomes.fa
^C
$
$ ls
chromosomes.fa chromosomes.fa.bz2 intervals.bed.bz2

As you can see, I can successfully load the pipe with the sequence data and get it back out.

I haven't studied your code, but I expect that fastaFromBed needs to read the input FASTA file twice, once for indexing and a second time for fetching sequences. The named pipe can be read only once, so fastaFromBed never generates the index file and it does not fetch any sequences. I had to ^C out of the second call to fastaFromBed because nothing was happening.

I don't know if it's possible to make fastaFromBed work on input FASTA streams rather than files. Could you create a temporary

in-memory index from the stream instead of a fai file, and then fetch sequences from the stream by seeking back and forth? I don't know. I'd like to hear what you think.

Aaron Quinlan

unread,

Dec 29, 2011, 1:59:40 PM12/29/11

to bedtools...@googlegroups.com

Hello,

Thanks for the detailed description of your requests, and many thanks for using the tools.

Make fastaFromBed output ignore BED intervals that are not available in the input FASTA file.

This request has been met in the GitHub repository (commit 287a43df6e3). A warning to stderr is now produced when chroms are found in the BED file that are not in the FASTA file.

Some means of extracting intervals from a compressed FASTA file (gz, bzip2, zip, etc.).

Unfortunately, this is a non-trivial request based on how FASTA files are indexed. I will give some thought to ways to address this.

Best,

Aaron

kslowi...@gmail.com

unread,

Feb 27, 2012, 12:01:45 PM2/27/12

to bedtools...@googlegroups.com

Aaron, you might like to see James Taylor's explanation and code [1] about seeking in bzip2 files; he also mentions the difference between gzip and bzip2.

[1]: https://bitbucket.org/james_taylor/bx-python/wiki/IO/SeekingInBzip2Files

Aaron Quinlan

unread,

Feb 28, 2012, 9:31:28 AM2/28/12

to bedtools...@googlegroups.com

Thanks, I'll take a look. Can't promise anything soon, but I will definitely consider this.

Best,

Aaron

Stéphane Plaisance

unread,

Jul 31, 2012, 11:35:22 AM7/31/12

to bedtools...@googlegroups.com, kslowi...@gmail.com

Hi Aaron,

Any progress on the fasta.gz issue.

I have several references on my laptop and each is taking 3-4Gb per while the gz version is 800MB
Could you add support to razip compressed fasta files? this seems now a standard compressed format to me.

I tried to use a FIFO for the decompression but bedtools creates a fai index on the fly named by the fifo link, which is not OK.

Thanks for any tip,

Stephane

Aaron Quinlan

unread,

Aug 1, 2012, 1:37:31 PM8/1/12

to bedtools...@googlegroups.com

Sorry, I have made no progress in this regard and doubt anything will change for the next release. I will add this to the list for the release after next.

[Request] fastaFromBed - 2 requests - ignore unavailable BED intervals - extract from compressed FASTA

kslowi...@gmail.com

Aaron Quinlan

kslowi...@gmail.com

Aaron Quinlan

Stéphane Plaisance

Aaron Quinlan