How to write a .bai file alongside a .bam file?


ay...@grailbio.com

unread,
Sep 5, 2017, 8:20:59 PM
to biogo-user
Hi all!

I am writing a set of alignment records to a .bam file, and I would like to write a .bai index file at the same time as writing the .bam file.

But I just have a set of records, not their associated Chunks (the example gets the Chunk from the bam.Reader).

What is the recommended way to get the Chunk that I can pass to Index.Add() ?

thanks!
- Alex

Dan Kortschak

unread,
Sep 6, 2017, 10:15:07 PM
to ay...@grailbio.com, biogo-user
It depends on what you mean by at the same time. It is not possible to
get chunk information on written SAM records as they are written
because the bam.Writer buffers data into work blocks to improve
performance. This means that it is not possible to know where a record
ends up before it has actually been written to the BGZF stream.

You can however write the BGZF output stream to an io.Writer that has
been engineered to write to the final destination and reflect the
writes into an io.Reader that you can glean the index from (this would
be similar to io.TeeReader). I highly doubt this will be more efficient
than just writing out the data and then indexing, though.
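The write-then-index approach Dan suggests can be sketched as follows. This is a minimal sketch, not code from the thread: it assumes the biogo/hts API (bam.NewReader, Reader.Read, Reader.LastChunk, Index.Add, and bam.WriteIndex), and the helper name buildIndex is made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/biogo/hts/bam"
)

// buildIndex (hypothetical helper) re-reads a finished BAM file and
// writes a .bai index for it. Reading the records back through a
// bam.Reader is what makes the bgzf.Chunk for each record available
// via LastChunk, which is exactly what Index.Add needs.
func buildIndex(bamPath, baiPath string) error {
	f, err := os.Open(bamPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// One decompression worker is enough here; indexing is I/O bound.
	br, err := bam.NewReader(f, 1)
	if err != nil {
		return err
	}
	defer br.Close()

	var idx bam.Index
	for {
		rec, err := br.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// LastChunk reports the chunk occupied by the record most
		// recently returned by Read.
		if err := idx.Add(rec, br.LastChunk()); err != nil {
			return err
		}
	}

	out, err := os.Create(baiPath)
	if err != nil {
		return err
	}
	defer out.Close()
	return bam.WriteIndex(out, &idx)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: buildindex in.bam out.bai")
		os.Exit(1)
	}
	if err := buildIndex(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

This pays the cost of decompressing the BGZF stream once more, which is the inefficiency Alex raises below, but it keeps the writer untouched.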

ay...@grailbio.com

unread,
Sep 19, 2017, 12:20:08 PM
to biogo-user
Thanks for your response!  I think I understand the issue you raised.  The code
that calls bam.Writer.Write() can't receive the chunk as a return value since the
actual calculation of the chunk happens asynchronously.

Waiting until we have the compressed bgzf byte stream seems later than necessary, though,
because that means the indexer needs to decompress the bgzf.  Do you think it's
possible for the bgzf writer to save the information necessary for the index while
it is computing the bgzf byte stream?

- Alex

Dan Kortschak

unread,
Sep 19, 2017, 6:56:24 PM
to ay...@grailbio.com, biogo-user
BGZF offsets are a two-part value: the offset into the compressed
stream of the start of the block, and the offset within the
uncompressed block corresponding to the start of the record.
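For context, the two parts Dan describes are what .bai files pack into a single 64-bit "virtual offset": the upper 48 bits hold the file offset of the containing block and the lower 16 bits hold the within-block offset (biogo/hts represents the same pair as a bgzf.Offset with File and Block fields). A small illustrative sketch of that packing, with made-up example values:

```go
package main

import "fmt"

// makeVirtual packs the two BGZF offset components into one 64-bit
// virtual offset: 48 bits of compressed-file offset, 16 bits of
// offset within the uncompressed block.
func makeVirtual(file int64, block uint16) uint64 {
	return uint64(file)<<16 | uint64(block)
}

// splitVirtual recovers the two components from a virtual offset.
func splitVirtual(v uint64) (file int64, block uint16) {
	return int64(v >> 16), uint16(v & 0xffff)
}

func main() {
	v := makeVirtual(1_234_567, 42)
	f, b := splitVirtual(v)
	fmt.Println(v, f, b)
}
```

The block component is only meaningful relative to a particular block start, which is why, as Dan notes, having one half without the other is not enough to index a record.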

I don't see how you can have both of those bits of information at the
same time as you have the records unless the bgzf.Writer keeps a
mapping of each written object (which can be any type of thing - not
just a sam.Record) and the block index.

If you have an idea about how to implement this efficiently, please
suggest it, but I don't see one.

Dan

Alexander Yip

unread,
Sep 19, 2017, 7:01:19 PM
to Dan Kortschak, biogo-user
Thanks Dan,  I'm going to revisit this and see if I can come up with something.
- Alex

Dan Kortschak

unread,
Sep 19, 2017, 7:52:53 PM
to Alexander Yip, biogo-user
One possible solution is to use a single threaded BGZF writer. This
exists at 28b030[1]. The code there will not currently compile, but
with a small amount of work it could be made to work. With some more, a
Write call could be made to update a bgzf.Chunk for each write.
Obviously, this would require that you fork the BAM writer to use that
BGZF writer. If you really want this, this would be my suggestion.

The alternative is for the existing concurrent writer to make available
the block offset component for each write (no issue here) and a block
number, and then to make available a list of block starts when they are
ready to be written. I really don't like this: it either becomes a
[]int64 or a <-chan int64. I guess this could be checked for nil and
only written to if non-nil, but the API becomes very fragile in
ensuring that the pairs of values are properly collated, and I don't
like that.

[1]https://github.com/biogo/hts/blob/28b0306e7cfd423046d3d09a34da68615a4af1ea/bgzf/bgzf.go