access to Linear index for bam

71 views
Skip to first unread message

Brent Pedersen

unread,
Dec 15, 2016, 12:16:22 PM12/15/16
to biogo...@googlegroups.com
I'd like to do some coverage checks using only a bam index. This is
much harder without access to the internal.Index--which has the
.Intervals for each tile.
For prototyping, I added:

func (i *Index) Index() internal.Index {
return i.idx
}

to hts/bam/index.go

is there another way to do this or a small API change so that it's accessible?
thanks,
-Brent

Dan Kortschak

unread,
Dec 15, 2016, 1:58:46 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com

How are you using the index to help you when you have it?

I'm terabyte-scale to include an internal type in the public API.

--
afk
Dan


From: biogo...@googlegroups.com <biogo...@googlegroups.com> on behalf of Brent Pedersen <bped...@gmail.com>
Sent: Friday, December 16, 2016 3:46:21 AM
To: biogo...@googlegroups.com
Subject: [biogo-user] access to Linear index for bam
 
--
You received this message because you are subscribed to the Google Groups "biogo-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biogo-user+...@googlegroups.com.
To post to this group, send an email to biogo...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brent Pedersen

unread,
Dec 15, 2016, 2:30:22 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com
I'm doing this: https://play.golang.org/p/iyt6AamdqE

where line 13 is accessing the internal.Index

This means I can get a 16KB resolution of the depth of a BAM across
the entire genome in seconds. Well, likely, under
a second.

Brent Pedersen

unread,
Dec 15, 2016, 2:37:12 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com

Dan Kortschak

unread,
Dec 15, 2016, 3:21:44 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com
How well does that estimate concord with the real values from
exhaustive accounting?

To be honest, I think this is an abuse of the indexing strategy that
would have many caveats: it depends on compression ratios being near
identical across the BAM and on the major contribution from BAM record
mass coming from SEQ/QUAL over core or optional fields.

It doesn't fill me with confidence.

Brent Pedersen

unread,
Dec 15, 2016, 5:00:19 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com
But it's terabyte scale. ;)

here's what it looks like for a small set.

most of the region is a large deletion of CN1, that's the lower blob,
the higher blob is the flanking region of CN2.
cmp.png

Dan Kortschak

unread,
Dec 15, 2016, 6:21:27 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com
On Thu, 2016-12-15 at 15:00 -0700, Brent Pedersen wrote:
> But it's terabyte scale. ;)

That doesn't work for mongo and and doesn't work here :|

> here's what it looks like for a small set.
>
> most of the region is a large deletion of CN1, that's the lower blob,
> the higher blob is the flanking region of CN2.

This is not wildly convincing from the perspective of arguing that this
is a justification for exposing internal API.

You can however get the same information from the exposed index API by
iterating over the tiles and asking for the chunks for each tile. This
will be slower, but given the abuse of the information that is going
on, I think it's a reasonable price signal. Sorry.

Brent Pedersen

unread,
Dec 15, 2016, 6:34:45 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com
I think I can get what I need by reflection, but, this method seems to
actually work quite well for getting a global view of whats going on.
The attached shows an example where we got a BAM that was missing data
for a region. It's an extreme example, but it also seems to make it
easy to see large deletions and dups that look like noise to a CNV
caller.
nant.png

Dan Kortschak

unread,
Dec 15, 2016, 6:56:55 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com
On Thu, 2016-12-15 at 16:34 -0700, Brent Pedersen wrote:
> I think I can get what I need by reflection, but, this method seems
> to
> actually work quite well for getting a global view of whats going on.

I'm sure it's useful for this case, but I don't think the single use is
a good enough reason to expose an internal component.

How are you planning to use reflection? If you want to put in that
effort, you might be better off shadowing the types in your code and
using unsafe. But how much slower is it to use the existing API and
querying the tiles?

Brent Pedersen

unread,
Dec 15, 2016, 7:55:39 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com
you mean 10's of thousands of calls to index.Chunks()? I guess if I do
1-base intervals, it should be pretty fast.

Dan Kortschak

unread,
Dec 15, 2016, 8:00:12 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com
Not one base intervals, 16kb intervals on tile boundaries.

Dan Kortschak

unread,
Dec 15, 2016, 8:29:19 PM12/15/16
to Brent Pedersen, biogo...@googlegroups.com
Actually, with this approach, you can tune it to any value you want.
Given that the best this is doing is an approximation, you should make
use of that and get it very cheaply by only sampling at the scale
that's appropriate for the problem.

Brent Pedersen

unread,
Dec 15, 2016, 9:57:30 PM12/15/16
to Dan Kortschak, biogo...@googlegroups.com
I'm not sure how to choose the chunk that I get back from Chunks(); I
tried sorting to get the smallest, but the actual chunk that I want
may have been merged into another chunk, right?
Here's what I have: https://play.golang.org/p/-xNLfiUeIL

what am I missing?

Dan Kortschak

unread,
Dec 16, 2016, 12:50:24 AM12/16/16
to Brent Pedersen, biogo...@googlegroups.com
No, you are right. I guess it's the reflect/unsafe approach then.

Brent Pedersen

unread,
Dec 16, 2016, 11:50:14 AM12/16/16
to Dan Kortschak, biogo...@googlegroups.com
Got it. This seems to work:

func getRefs(idx *bam.Index) []RefIndex {
refs := reflect.ValueOf(*idx).FieldByName("idx").FieldByName("Refs")
ptr := unsafe.Pointer(refs.Pointer())
return (*(*[1 << 28]RefIndex)(ptr))[:refs.Len()]

Brent Pedersen

unread,
Dec 16, 2016, 2:34:52 PM12/16/16
to biogo...@googlegroups.com
FWIW, I have put this into a tool here:
https://github.com/brentp/goleft/tree/master/indexcov
thanks for the help.
-Brent

Brent Pedersen

unread,
Sep 19, 2017, 4:26:40 PM9/19/17
to biogo...@googlegroups.com
Just wanted to follow up and let you know this turned out to be quite
useful and is already widely used as a QC tool.
We just published it here:
https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix090/4160383/Indexcov-fast-coverage-quality-control-for

Here is an example output on 2K samples from Simons Autism:
http://home.chpc.utah.edu/~u6000771/indexcov/simons/
where you can see sex aneuploidies and other large events.

Thanks for developing biogo. I certainly wouldn't have done this if it
didn't exist.
-Brent
Reply all
Reply to author
Forward
0 new messages