support for reading selected seq from fasta.fai ?

24 views
Skip to first unread message

Brent Pedersen

unread,
Mar 11, 2015, 3:29:51 PM3/11/15
to biogo...@googlegroups.com
Hi,
is there support for using the fai index to pull sub-sequences from an indexed fasta?
I didn't see it in seqio.
thanks,
-Brent

Dan Kortschak

unread,
Mar 11, 2015, 3:48:10 PM3/11/15
to biogo...@googlegroups.com
All the code was written about a year ago, but not committed. I can do that today. Probably without tests, and then fix that up later.

Dan Kortschak

unread,
Mar 11, 2015, 8:58:50 PM3/11/15
to Brent Pedersen, biogo...@googlegroups.com
OK. That's done.

The two options you have for using it are:

1. Seek to the offset in the associated fasta file and then create
an io.LimitedReader of the appropriate length and then parse in
the sequence as normal text and add it to the Seq field of a
linear (or however you are using the data). This is probably the
easier/safer of the two. This approximately is how samtools
faidx does it.
2. Use the offset as an index into a gommap.MMap that has been
obtained from the fasta file you are using gommap[1]. This is
probably preferable in terms of performance and I think how
biopython works with this kind of sequence access.

The choice between these two approaches is probably why I didn't get any
further with it.

[1]https://godoc.org/launchpad.net/gommap

Brent Pedersen

unread,
May 13, 2015, 11:03:39 PM5/13/15
to Dan Kortschak, biogo...@googlegroups.com
Hi Dan,
I finally got around to this. For now, for simplicity, I went with
option 3: just Read() the requested chunk into a buffer.
https://github.com/brentp/faidx
I'm happy to contribute this to biogo if you think it's appropriate.

I may try the mmap approach if performance seems an issue.

Dan Kortschak

unread,
May 14, 2015, 12:08:03 AM5/14/15
to Brent Pedersen, biogo...@googlegroups.com
On Wed, 2015-05-13 at 21:03 -0600, Brent Pedersen wrote:
> I finally got around to this. For now, for simplicity, I went with
> option 3: just Read() the requested chunk into a buffer.
> https://github.com/brentp/faidx
> I'm happy to contribute this to biogo if you think it's appropriate.
>
I don't think this gets us anything that current biogo seq io doesn't
already provide. If you implement the mmap approach then it would be a
more useful addition.

The code that you have here makes a basis for that though.

Before it would be merged, indexing would need to be changed to 0-based
half open and I think providing an faidx association rather than using
an implied file extension would be better (I think
base.{bai,csi,fai,tbi} is a horrible pattern and should not ever work
it's way into library routines - app-level code may want to do that
though, maybe).

Dan

Reply all
Reply to author
Forward
0 new messages