That's an interesting idea. The driving use case for bgzf is genomics (and specifically the genomics that I work on), so a feature is most likely to get added when I need it (or when someone else requests it and it doesn't cost too much time). At the
moment, the main feature I need is performance, so that is what I'm focussing on; the changes this involves are not trivial.
Having said that, I can see merits in this. I already plan to add CSI support (this is a generalisation of BAI), so while I do that I may add something like this. I don't want to make an indexer that piggy-backs onto bgzf.Reader the way bam.Index
does, though, since it's not necessary. Indexing for BAI/CSI needs an understanding of the decompressed content; this does not. Because of this, all that is necessary is that the indexer read the gzip member's header,
seek to the end of the member less 4 bytes and read ISIZE, then repeat along the entire file, keeping track of base offsets and decompressed sizes. This is far less costly than decompressing.
Note that it is not possible to index directly during a write (this is an unfortunate consequence of how package bgzf does write parallelisation - it may be fixable, but it is not a priority, and it can be worked around by tee-reading what is being written).
This index would then be used as a wrapping struct for a bgzf.Reader implementing io.Seeker.
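As a sketch of what the wrapper's lookup might do, the following translates a position in the decompressed stream into the file offset of the containing block plus an offset within that block, which is the same shape as a BGZF virtual offset. Member and Locate are hypothetical names, the index is assumed to be sorted by offset, and handing the resulting pair to bgzf.Reader is deliberately left out.

```go
package main

import "sort"

// Member is a hypothetical index entry for one BGZF block.
type Member struct {
	CompressedOffset   int64  // start of the block in the file
	DecompressedOffset int64  // start of its data in the decompressed stream
	DecompressedSize   uint32 // ISIZE
}

// Locate maps a decompressed position to (block file offset, offset
// within the block's decompressed data). ok is false when pos lies
// outside the indexed data.
func Locate(index []Member, pos int64) (file int64, within uint16, ok bool) {
	if pos < 0 {
		return 0, 0, false
	}
	// Binary search for the first block whose data ends after pos.
	// Empty blocks can never satisfy this, so they are skipped.
	i := sort.Search(len(index), func(i int) bool {
		m := index[i]
		return m.DecompressedOffset+int64(m.DecompressedSize) > pos
	})
	if i == len(index) {
		return 0, 0, false
	}
	m := index[i]
	// A BGZF block holds at most 64 KiB of data, so this fits in 16 bits.
	return m.CompressedOffset, uint16(pos - m.DecompressedOffset), true
}
```

An io.Seeker built on this would track the current decompressed position itself, resolve io.SeekCurrent and io.SeekEnd against it and the total size, and then call something like Locate on the result.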
Would you file an issue linking to this discussion?
BTW, the behaviour you are asking for, which biopython provides as tell, already exists. LastChunk gives the starting and ending virtual offsets of the last Read/Seek. I think this is even documented.