bgzf Seek


Vasiliy Tolstov

Feb 24, 2015, 4:58:58 AM
to biogo...@googlegroups.com
Hello. I'm trying to seek using an uncompressed offset in a BGZF file with the Go library.
First - why doesn't the Go library provide a ReadSeeker? As I understand it, to do that we need to generate a virtual offset from the uncompressed offset.
As I see, http://biopython.org/DIST/docs/api/Bio.bgzf-pysrc.html#BgzfReader has make_virtual_offset and split_virtual_offset, and it also recommends using the tell method to get the current virtual offset. Would it be possible to expose this?
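For reference, the arithmetic that biopython's make_virtual_offset and split_virtual_offset implement comes from the BGZF (SAM) spec: a virtual offset packs the compressed block start into the high 48 bits and the within-block offset into the low 16 bits. A minimal Go sketch (the function names here are my own, not part of any package):

```go
package main

import "fmt"

// makeVirtualOffset packs a compressed block start offset (coffset)
// and an offset within the decompressed block (uoffset, < 65536)
// into a BGZF virtual offset: coffset<<16 | uoffset.
func makeVirtualOffset(coffset, uoffset int64) int64 {
	return coffset<<16 | uoffset
}

// splitVirtualOffset reverses makeVirtualOffset.
func splitVirtualOffset(voffset int64) (coffset, uoffset int64) {
	return voffset >> 16, voffset & 0xffff
}

func main() {
	v := makeVirtualOffset(12345, 678)
	c, u := splitVirtualOffset(v)
	fmt.Println(v, c, u)
}
```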

Dan Kortschak

Feb 24, 2015, 5:42:20 AM
to Vasiliy Tolstov, biogo...@googlegroups.com
That's an interesting idea. The driving use case for bgzf is genomics (and specifically genomics that I work on), so when I need a feature, that's most likely when it gets added (or if someone else requests it and it doesn't cost too much time). At the moment, the main feature I need is performance, so that is what I'm focussing on - the changes this involves are not trivial.

Having said that, I can see merit in this. I already plan to add CSI support (a generalisation of BAI), so while I do that I may add something like this. I also don't want to make an indexer that piggy-backs onto bgzf.Reader the way bam.Index does, since it's not necessary - in the case of indexing for BAI/CSI it is necessary to have an understanding of the decompressed content, but here that is not true. Because of this, all that is necessary is that the indexer read the gzip member's header, seek to the end of the member less 4 bytes, read ISIZE, ... and repeat along the entire file, keeping track of base offsets and decompressed sizes. This is far less costly than the decompression.

Note that it is not possible to index directly during a write (this is an unfortunate consequence of how package bgzf does write parallelisation - it may be fixable, but it is not a priority, and it can be worked around by tee-reading what is being written).

This index would then be used in a struct wrapping a bgzf.Reader and implementing io.Seeker.

Would you file an issue, linking to this discussion?

BTW, the behaviour you are asking for that biopython provides as tell already exists: LastChunk gives the starting and ending virtual offsets of the last Read/Seek. I think this is even documented.

Vasiliy Tolstov

Feb 24, 2015, 5:51:36 AM
to Dan Kortschak, biogo...@googlegroups.com
Thanks for the answer. I'll file an issue, but for now, as I understand
it, I need to calculate the offset myself (I can create my own
ReadSeeker and seek internally).
The Python docs say that each block is 65536 bytes long; does that also
hold for the Go bgzf package?
Also, the Python example about uncompressed seeking refers to the
offset of a specific block; how can I get that now (without waiting
for the bgzf package rewrite)?

--
Vasiliy Tolstov,
e-mail: v.to...@selfip.ru
jabber: va...@selfip.ru

Dan Kortschak

Feb 24, 2015, 6:13:58 AM
to Vasiliy Tolstov, biogo...@googlegroups.com
On 24/02/2015, at 9:21 PM, "Vasiliy Tolstov" <v.to...@selfip.ru> wrote:

> Thanks for the answer. I'll file an issue, but for now, as I understand
> it, I need to calculate the offset myself (I can create my own
> ReadSeeker and seek internally).
> The Python docs say that each block is 65536 bytes long; does that also
> hold for the Go bgzf package?

BGZF blocks are up to 65536 bytes long. This is in the BGZF spec and is a consequence of how the block length is stored (the uint16_t BSIZE field).

> Also, the Python example about uncompressed seeking refers to the
> offset of a specific block; how can I get that now (without waiting
> for the bgzf package rewrite)?

If you read through a BGZF file and watch the returns of LastChunk (particularly the File fields) you can collect all the block start offsets. If you also count the bytes read between offset changes you know the decompressed block size - you need to take care not to miss block changes, so you must read one byte at a time. This gives you the complete indexing data that my outline gets you, just much less efficiently.

Dan Kortschak

Feb 24, 2015, 6:21:31 AM
to Vasiliy Tolstov, biogo...@googlegroups.com
Note also that there is some ambiguity in the indexing of a BGZF file. It is potentially valid to have many virtual offsets for a given offset into the decompressed data.

This is because compressing zero bytes takes a non-zero number of bytes. This is the basis for the EOF magic block in the BGZF spec.

To get around this, I would suggest that the final virtual offset for a given decompressed data offset be the appropriate one to use - it will be the virtual offset that results in the least amount of work before getting to actual data.
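That rule is easy to apply if the index stores, for each block, its compressed base offset and the decompressed offset at which it starts. A small illustrative sketch (the types and names here are mine, not part of package bgzf), picking the last block that starts at or before the wanted offset so that empty blocks are skipped:

```go
package main

import "fmt"

// entry pairs a block's compressed base offset with the decompressed
// offset of its first byte.
type entry struct {
	base   int64 // compressed file offset of the block
	ustart int64 // decompressed offset at which the block begins
}

// virtualFor maps a decompressed offset to a virtual offset, choosing
// the LAST block whose start is <= off, so that zero-length blocks at
// the same decompressed offset are skipped. The index must be sorted
// by ustart; a real implementation would binary search.
func virtualFor(index []entry, off int64) int64 {
	var e entry
	for _, cand := range index {
		if cand.ustart <= off {
			e = cand
		}
	}
	return e.base<<16 | (off - e.ustart)
}

func main() {
	// Three blocks all start at decompressed offset 0: the first two
	// decompress to nothing, so offset 0 should map into the block
	// at compressed offset 200, not 0 or 100.
	index := []entry{{0, 0}, {100, 0}, {200, 0}, {300, 50}}
	fmt.Println(virtualFor(index, 0) >> 16)  // block base chosen for offset 0
	fmt.Println(virtualFor(index, 60) >> 16) // block base chosen for offset 60
}
```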

Dan Kortschak

Mar 9, 2015, 9:40:44 PM
to Vasiliy Tolstov, biogo...@googlegroups.com
The necessary API for doing this is now present in bgzf.Reader.

You can set the Reader to stop at the end of blocks and watch the chunk
transactions directly in the bgzf.Reader.

Minimally, to get the relationships between BGZF virtual offsets and
real offsets, you can index like so (untested):

br := bgzf.NewReader(r, 0)
br.Blocked(true)
for {
	t := br.Begin()
	// In blocked mode the copy stops at the end of each BGZF block.
	n, err := io.Copy(ioutil.Discard, br)
	if err != nil {
		break
	}
	// io.Copy reports end of input as a nil error, so an empty
	// copy means the end of the file has been reached.
	if n == 0 {
		break
	}
	indexFunc(n, t.End())
}

indexFunc takes the number of decompressed bytes and keeps a running
total, linking that to the bgzf.Chunk returned by t.End. You could use a
CSI for that, or roll your own index type.

