First of all, thank you, Aaron! BEDTools is fantastic, and I hope you continue producing great tools. You make my life easier.
I have two requests:
- Make fastaFromBed output ignore BED intervals that are not available in the input FASTA file.
- Some means of extracting intervals from a compressed FASTA file (gz, bzip2, zip, etc.).
Request 1
Let's say I have:
- a FASTA file with sequences Gm01 and Gm02.
- a BED file with intervals on sequences Gm01, Gm02, and Gm03.
$ ls
chromosomes.fa intervals.bed
$
$ cat chromosomes.fa
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT
$
$ cat intervals.bed
Gm01 1 5
Gm03 1 5
Gm02 1 5
$
$ fastaFromBed -bed intervals.bed -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
>Gm01:1-5
TGCA
unable to find FASTA index entry for 'Gm03'
$
Notice the BED interval for Gm03 is listed before the interval for Gm02. Also notice that the Gm02 interval is not retrieved from the FASTA file.
I'd like fastaFromBed to retrieve all intervals from the input FASTA while ignoring the intervals that are on nonexistant sequences.
At the moment, I can't think of a quick and simple way to get around this problem by using extra programs. It seems to me that changing the code is the only way. I might eventually try to do this myself and request a pull on github... I don't know if I have enough time.
Request 2
I keep my files compressed because they're huge, especially whole eukaryotic genomes. You already let me use compressed BED interval files (since v2.13.0, I think), but I can't get fastaFromBed to work with compressed FASTA files when I use named pipes.
(Same files as in request 1.)
$ bzip2 *
$ ls
chromosomes.fa.bz2 intervals.bed.bz2
$
$ fastaFromBed -bed <(bzcat intervals.bed.bz2) -fi <(bzcat chromosomes.fa.bz2) -fo stdout
index file /dev/fd/62.fai not found, generating...
could not open index file /dev/fd/62.fai for writing!
$
$ mkfifo chromosomes.fa
$ ls
chromosomes.fa chromosomes.fa.bz2 intervals.bed.bz2
$
$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7074
$ cat < chromosomes.fa
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT
[1]+ Done bzcat chromosomes.fa.bz2 > chromosomes.fa
$
$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7077
$ bzcat intervals.bed.bz2 | fastaFromBed -bed stdin -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
^C
[1]+ Done bzcat chromosomes.fa.bz2 > chromosomes.fa
$ cat < chromosomes.fa
^C
$
$ ls
chromosomes.fa chromosomes.fa.bz2 intervals.bed.bz2
As you can see, I can successfully load the pipe with the sequence data and get it back out.
I haven't studied your code, but I expect that fastaFromBed needs to read the input FASTA file twice, once for indexing and a second time for fetching sequences. The named pipe can be read only once, so fastaFromBed never generates the index file and it does not fetch any sequences. I had to ^C out of the second call to fastaFromBed because nothing was happening.
I don't know if it's possible to make fastaFromBed work on input FASTA streams rather than files. Could you create a temporary
in-memory index from the stream instead of a fai file, and then fetch sequences from the stream by seeking back and forth? I don't know. I'd like to hear what you think.