[Request] fastaFromBed - 2 requests - ignore unavailable BED intervals - extract from compressed FASTA

336 views
Skip to first unread message

kslowi...@gmail.com

unread,
Dec 28, 2011, 3:43:13 PM12/28/11
to bedtools...@googlegroups.com
First of all, thank you, Aaron! BEDTools is fantastic, and I hope you continue producing great tools. You make my life easier.

I have two requests:
  1. Make fastaFromBed output ignore BED intervals that are not available in the input FASTA file.
  2. Some means of extracting intervals from a compressed FASTA file (gz, bzip2, zip, etc.).
Request 1

Let's say I have:
  • a FASTA file with sequences Gm01 and Gm02.
  • a BED file with intervals on sequences Gm01, Gm02, and Gm03.
$ ls
chromosomes.fa  intervals.bed

$ cat chromosomes.fa 
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT

$ cat intervals.bed 
Gm01 1 5
Gm03 1 5
Gm02 1 5

$ fastaFromBed -bed intervals.bed -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
>Gm01:1-5
TGCA
unable to find FASTA index entry for 'Gm03'

Notice the BED interval for Gm03 is listed before the interval for Gm02. Also notice that the Gm02 interval is not retrieved from the FASTA file.

I'd like fastaFromBed to retrieve all intervals from the input FASTA while ignoring the intervals that are on nonexistant sequences.

At the moment, I can't think of a quick and simple way to get around this problem by using extra programs. It seems to me that changing the code is the only way. I might eventually try to do this myself and request a pull on github... I don't know if I have enough time.

Request 2

I keep my files compressed because they're huge, especially whole eukaryotic genomes. You already let me use compressed BED interval files (since v2.13.0, I think), but I can't get fastaFromBed to work with compressed FASTA files when I use named pipes.

(Same files as in request 1.)

$ bzip2 *
$ ls
chromosomes.fa.bz2  intervals.bed.bz2

$ fastaFromBed -bed <(bzcat intervals.bed.bz2) -fi <(bzcat chromosomes.fa.bz2) -fo stdout
index file /dev/fd/62.fai not found, generating...
could not open index file /dev/fd/62.fai for writing!

$ mkfifo chromosomes.fa
$ ls
chromosomes.fa  chromosomes.fa.bz2  intervals.bed.bz2

$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7074
$ cat < chromosomes.fa
>Gm01
ATGCATGCATGCATGC
>Gm02
GCATGCATGCATGCAT
[1]+  Done                    bzcat chromosomes.fa.bz2 > chromosomes.fa

$ bzcat chromosomes.fa.bz2 > chromosomes.fa &
[1] 7077
$ bzcat intervals.bed.bz2 | fastaFromBed -bed stdin -fi chromosomes.fa -fo stdout
index file chromosomes.fa.fai not found, generating...
^C
[1]+  Done                    bzcat chromosomes.fa.bz2 > chromosomes.fa
$ cat < chromosomes.fa
^C

$ ls
chromosomes.fa  chromosomes.fa.bz2  intervals.bed.bz2

As you can see, I can successfully load the pipe with the sequence data and get it back out.

I haven't studied your code, but I expect that fastaFromBed needs to read the input FASTA file twice, once for indexing and a second time for fetching sequences. The named pipe can be read only once, so fastaFromBed never generates the index file and it does not fetch any sequences. I had to ^C out of the second call to fastaFromBed because nothing was happening.

I don't know if it's possible to make fastaFromBed work on input FASTA streams rather than files. Could you create a temporary 
in-memory index from the stream instead of a fai file, and then fetch sequences from the stream by seeking back and forth? I don't know. I'd like to hear what you think.

Aaron Quinlan

unread,
Dec 29, 2011, 1:59:40 PM12/29/11
to bedtools...@googlegroups.com
Hello,

Thanks for the detailed description of your requests, and many thanks for using the tools.

  1. Make fastaFromBed output ignore BED intervals that are not available in the input FASTA file.
This request has been met in the GitHub repository (commit 287a43df6e3).  A warning to stderr is now produced when chroms are found in the BED file that are not in the FASTA file.

  1. Some means of extracting intervals from a compressed FASTA file (gz, bzip2, zip, etc.).
Unfortunately, this is a non-trivial request based on how FASTA files are indexed.  I will give some thought to ways to address this.

Best,
Aaron

kslowi...@gmail.com

unread,
Feb 27, 2012, 12:01:45 PM2/27/12
to bedtools...@googlegroups.com
Aaron, you might like to see James Taylor's explanation and code [1] about seeking in bzip2 files; he also mentions the difference between gzip and bzip2.

Aaron Quinlan

unread,
Feb 28, 2012, 9:31:28 AM2/28/12
to bedtools...@googlegroups.com
Thanks, I'll take a look.  Can't promise anything soon, but I will definitely consider this.
Best,
Aaron


Stéphane Plaisance

unread,
Jul 31, 2012, 11:35:22 AM7/31/12
to bedtools...@googlegroups.com, kslowi...@gmail.com
Hi Aaron,

Any progress on the fasta.gz issue.
I have several references on my laptop and each is taking 3-4Gb per while the gz version is 800MB
Could you add support to razip compressed fasta files? this seems now a standard compressed format to me.

I tried to use a FIFO for the decompression but bedtools creates a fai index on the fly named by the fifo link, which is not OK.

Thanks for any tip,
Stephane

Aaron Quinlan

unread,
Aug 1, 2012, 1:37:31 PM8/1/12
to bedtools...@googlegroups.com
Sorry, I have made no progress in this regard and doubt anything will change for the next release.  I will add this to the list for the release after next.

Best,
Aaron

Reply all
Reply to author
Forward
0 new messages