Re: [Samtools-devel] BAI2 - Longer references, embedded reference etc

Peter Cock

May 28, 2012, 6:39:31 AM

to barn...@bc.edu, samtool...@lists.sourceforge.net, bamtoo...@googlegroups.com

On Mon, May 14, 2012 at 6:41 PM, Derek Barnett <barn...@bc.edu> wrote:
> On 05/14/2012 01:04 PM, Heng Li wrote:
>> I would just wait until someone can assemble a chromosome longer
>> than 512Mbp. The opossum genome is still in scaffolds, so it is not a
>> problem for now. When we really get long chromosomes in a few
>> years, if not longer, SAM/BAM will probably be faced with other
>> challenges (e.g. very long reads and assembly in a graph). Actually
>> there is even the question of whether BAM will be replaced with
>> more space efficient formats such as cSRA and CRAM.
>>
>> As to RNA-seq data, we need an index that inserts a key frame per
>> N number of reads, instead of per X kb. For this to work, we have
>> to redesign the new index from scratch. Bamtools has such an index
>> so far as I am aware of, but I do not know the number of seek calls
>> it needs in particular given alignments spanning tens of kb regions.
>
> The alternate, fixed-N-read "BTI" scheme available in BamTools requires
> 2 seeks in all cases - once in the index file to jump to and read a
> reference's candidate blocks and then once in the BAM file when the
> proper data offset is found. Each index block in the BTI file stores its
> first alignment's start position & BAM file offset as well as the
> maximal end position (considering all of its alignments). Thus finding
> the closest block to your target region is a simple check of the
> left/right bounds of blocks, regardless of the alignment lengths.
>
> Caching of the index blocks could reduce to a single seek in the BAM
> itself only, at cost of keeping the data in memory.
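
To check my understanding of the lookup described above, here is a
minimal sketch. The struct, its field names and the function are
hypothetical illustrations, not taken from the BamTools source, and I
am assuming half-open [begin, end) coordinates with blocks sorted by
the start of their first alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory form of one BTI index block (names are
// illustrative only, not the actual BamTools layout).
struct BtiBlock {
    std::int32_t start;       // start position of the block's first alignment
    std::int64_t bam_offset;  // BAM file offset of that first alignment
    std::int32_t max_end;     // maximal end position over all alignments in the block
};

// Return the index of the first block that may hold an alignment
// overlapping [begin, end), or -1 if no block can. Only the block
// bounds are inspected, so alignment length does not matter.
int first_candidate(const std::vector<BtiBlock>& blocks,
                    std::int32_t begin, std::int32_t end) {
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (blocks[i].start >= end) {
            break;  // this block and all later ones start right of the region
        }
        if (blocks[i].max_end > begin) {
            return static_cast<int>(i);  // bounds overlap the region
        }
    }
    return -1;
}
```

The caller would then seek once in the BAM file to the returned
block's bam_offset and scan forward from there, which matches the
second of the two seeks mentioned above.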

Hi Derek,

From looking at the BamTools source code on GitHub, it appears that
the BamTools Index (BTI) stores positions as int32_t (i.e. a signed
32 bit integer), which would limit it to chromosomes/references of
at most 2Gbp. Is that correct?
https://github.com/pezmaster31/bamtools/blob/master/src/api/internal/index/BamToolsIndex_p.cpp

If so, the BamTools Index (BTI) would be fine for marsupials, and
probably most current scaffolds.

However, it would not be enough for plant genomes in progress like
the wheat 3B chromosome. Can you switch to using an unsigned
32 bit integer, which would mean support for 4Gbp, hopefully enough
for the short/medium term? Or would simply moving to a 64 bit
integer in BTI be likely to cause problems?
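
To make the limits concrete, here is a small sketch of the arithmetic.
The constant and function names are my own, and the lengths in the
usage note are made-up round numbers, not real chromosome sizes.

```cpp
#include <cstdint>
#include <limits>

// Largest position each candidate coordinate type can represent.
constexpr std::uint64_t kMaxInt32 =
    std::numeric_limits<std::int32_t>::max();   // 2^31 - 1, roughly 2Gbp
constexpr std::uint64_t kMaxUint32 =
    std::numeric_limits<std::uint32_t>::max();  // 2^32 - 1, roughly 4Gbp
constexpr std::uint64_t kMaxInt64 =
    std::numeric_limits<std::int64_t>::max();   // 2^63 - 1

// True if a reference of length_bp bases can be addressed by a
// coordinate type whose largest value is type_max.
constexpr bool fits(std::uint64_t length_bp, std::uint64_t type_max) {
    return length_bp <= type_max;
}
```

For example, fits(3000000000ULL, kMaxInt32) is false while
fits(3000000000ULL, kMaxUint32) is true, and a hypothetical 5Gbp
reference would only fit with the 64 bit type.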

Regards,

Peter

N.B. The samtools-devel thread started here:
http://sourceforge.net/mailarchive/message.php?msg_id=29261469

See also this thread on cram-devel:
http://listserver.ebi.ac.uk/pipermail/cram-dev/2012-May/000109.html
