tabix example

80 views
Skip to first unread message

Brent Pedersen

unread,
Jul 22, 2015, 1:13:50 PM7/22/15
to biogo...@googlegroups.com
Hi, does anyone have some example code for tabix reading or is there
some in the biogo code-base? I guess I get idx.Chunks(), then seek to
each chunk.Begin with bgzf reader?

Dan Kortschak

unread,
Jul 22, 2015, 6:26:23 PM7/22/15
to Brent Pedersen, biogo...@googlegroups.com
You will need to read the index through a compress/gzip Reader, but what you have there sounds about right (I've not written integration tests or examples for this package because tabix - and others dependent on htslib - are not properly specified in hts-specs).

Brent Pedersen

unread,
Jul 24, 2015, 9:46:09 AM7/24/15
to biogo-user, dan.ko...@adelaide.edu.au
ok. thanks. I'll give it a go.

Brent Pedersen

unread,
Sep 2, 2015, 10:01:04 PM9/2/15
to biogo-user, Dan Kortschak
Just got around to looking at this. Seem like it'd be convenient if
func (i *Index) Chunks(r *sam.Reference, beg, end int) []bgzf.Chunk

in tabix.go

instead had a signature something like:

func (i *Index) Chunks(r Record) []bgzf.Chunk


since a sam.Reference has a lot of unrelated stuff to just grab some intervals.

-Brent
> --
> You received this message because you are subscribed to the Google Groups
> "biogo-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to biogo-user+...@googlegroups.com.
> To post to this group, send email to biogo...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Dan Kortschak

unread,
Sep 2, 2015, 10:24:20 PM9/2/15
to Brent Pedersen, biogo-user
That's a relatively straightforward change. Would you like to send a PR?

Brent Pedersen

unread,
Sep 3, 2015, 5:20:52 PM9/3/15
to Dan Kortschak, biogo-user
Hi Dan, I think I'm understanding how to make this work but I think
I'm hitting an error with this logic:

https://github.com/biogo/hts/blob/5ffa2ac9c04862f5b90413b29ff6e9ef91de558a/bgzf/index/index.go#L53

which gives an 'index out of range' error when it reaches the end of a
block. I think the '&&' in the line should be replace with '||'. Does
that seem right to you?

Once I get to the correct block, it's a linear scan through the
elements of that block to get the ones that I'm interested in?

Dan Kortschak

unread,
Sep 3, 2015, 6:17:50 PM9/3/15
to Brent Pedersen, biogo-user
On 04/09/2015, at 6:50 AM, "Brent Pedersen" <bped...@gmail.com> wrote:

> Hi Dan, I think I'm understanding how to make this work but I think
> I'm hitting an error with this logic:
>
> https://github.com/biogo/hts/blob/5ffa2ac9c04862f5b90413b29ff6e9ef91de558a/bgzf/index/index.go#L53
>
> which gives an 'index out of range' error when it reaches the end of a
> block. I think the '&&' in the line should be replace with '||'. Does
> that seem right to you?

Yes, that seems right. I'll fix that today after I LGTM your PR (it needs to return nil in the bad case - then squash into one commit). After that, please send a PR adding yourself to the biogo/biogo A+C.

> Once I get to the correct block, it's a linear scan through the
> elements of that block to get the ones that I'm interested in?

Yes, that's right.

Brent Pedersen

unread,
Sep 4, 2015, 1:27:38 PM9/4/15
to Dan Kortschak, biogo-user
I wrote a little wrapper so it's less code to use a tabix'ed file:
https://github.com/brentp/bix

A Query() by region returns an io.Reader that will have only the
overlapping records. To do this, I have to at least partially parse
every line from the ChunkedReader to make sure it overlaps the query
interval (and not just in the same block). For VCF, unless I
misunderstand, it has to even parse the INFO to calculate the end for
structural variants:

https://github.com/brentp/bix/blob/4532c7b65d25193fdc1afdb015963033ddff724a/bix.go#L118-L206

I'd welcome any thoughts on a better way to do this.

thanks,
-Brent

Dan Kortschak

unread,
Sep 4, 2015, 5:14:58 PM9/4/15
to Brent Pedersen, biogo-user
Do VCF files really just ignore the spec (I'm honestly not surprised if they do)?

The intervals handled by a the tabix index should be defined by the three fields specified in col_seq, col_beg, and col_end, so you should only need to find a new line, split on "\t" up to the maximum on those three values and compare against the interval they define.

Brent Pedersen

unread,
Sep 4, 2015, 5:19:23 PM9/4/15
to Dan Kortschak, biogo-user
On Fri, Sep 4, 2015 at 3:14 PM, Dan Kortschak
<dan.ko...@adelaide.edu.au> wrote:
> Do VCF files really just ignore the spec (I'm honestly not surprised if they do)?
>

https://github.com/samtools/htslib/blob/develop/tbx.c#L111-L126

> The intervals handled by a the tabix index should be defined by the three fields specified in col_seq, col_beg, and col_end, so you should only need to find a new line, split on "\t" up to the maximum on those three values and compare against the interval they define.

Yeah, I do that for non-vcf files.

Dan Kortschak

unread,
Sep 4, 2015, 5:30:09 PM9/4/15
to Brent Pedersen, biogo-user
Incredible. I give up.

Dan Kortschak

unread,
Sep 4, 2015, 5:51:02 PM9/4/15
to Brent Pedersen, biogo-user
On Fri, 2015-09-04 at 21:30 +0000, Dan Kortschak wrote:
> Incredible. I give up.

This is https://github.com/samtools/hts-specs/issues/70

Reply all
Reply to author
Forward
0 new messages