Iterating over SAM/BAM file using fetch. Speed?

Kristoffer

Sep 10, 2011, 4:56:03 AM

to Pysam User group

Hi,

When I use the iteration method
for alignedread in samfile.fetch():
to iterate over the reads in the SAM file, I get a much better running
time than when I use the same command to iterate over the corresponding
file in BAM format (the BAM run is about 5x slower). Intuitively, it
should be the reverse. The commands inside the for loop just fetch
some info from the pysam.AlignedRead class. I have only tried it on
one file, so I don't know how the running time scales; maybe there is
a lot of setup involved in indexing the BAM file? Is there something
wrong on my part, or is the fetch method actually much slower for a
BAM file?

Best regards,
Kristoffer

Jared Simpson

Sep 10, 2011, 5:29:23 AM

to pysam-us...@googlegroups.com

Hi Kristoffer,

Would the time difference be accounted for by the decompression that must be performed on BAM files?

Best,

Jared

Kristoffer

Sep 11, 2011, 1:15:18 PM

to Pysam User group

Hi Jared,

No, I have already created the three different physical files (the
SAM, the BAM and the BAM index file). So afaik, the program starts to
iterate directly over the two different formats, unless the commands
samfile = pysam.Samfile(SAM_file, 'r') and
bamfile = pysam.Samfile(BAM_file, 'rb')
do some different index setup.

Best,
Kristoffer

Andreas

Oct 17, 2011, 4:07:44 AM

to Pysam User group

Hi Kristoffer,

Good question. Some points to note:

1. The code within pysam that reads through SAM and BAM files is
identical: it calls the C samtools API. Hence any difference in speed
is a consequence of the C samtools implementation.

2. Such things are very difficult to benchmark. BAM files are more
compact and require less I/O, but they require additional CPU cycles
for decompression. Whether reading from BAM or SAM is quicker thus
depends on the relative speed of your I/O versus your CPU. Note that
modern servers with plenty of memory can cache even large files in
memory, which can bias the results.

3. I have not benchmarked this, so the above is all conjecture.

Best wishes,
Andreas
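
The I/O versus CPU trade-off in point 2 can be measured rather than guessed. Below is a minimal sketch, assuming a current pysam in which the class is called pysam.AlignmentFile (pysam.Samfile, as used in this thread, is the older name). The helper names time_iteration, iter_reads and compare_sam_bam are hypothetical, as are any file paths passed to them. Both files are read with fetch(until_eof=True) so that the access pattern is the same, and the best of several passes is kept; later passes may be served from the operating system's file cache, which is exactly the bias mentioned above.

```python
import time


def time_iteration(open_reads, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` full passes.

    `open_reads` is a zero-argument callable that returns a fresh iterable
    of reads each time it is called.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _read in open_reads():
            pass  # iterate only; no per-read work
        best = min(best, time.perf_counter() - start)
    return best


def iter_reads(path, mode):
    """Yield every read in file order (hypothetical helper)."""
    import pysam  # imported lazily so time_iteration works without pysam

    with pysam.AlignmentFile(path, mode) as handle:
        for read in handle.fetch(until_eof=True):
            yield read


def compare_sam_bam(sam_path, bam_path, repeats=3):
    """Time one full pass over a SAM file and over the matching BAM file."""
    return {
        "sam_seconds": time_iteration(lambda: iter_reads(sam_path, "r"), repeats),
        "bam_seconds": time_iteration(lambda: iter_reads(bam_path, "rb"), repeats),
    }
```

To see the effect of caching, it may help to compare the first pass with the best pass, or to run the comparison on files larger than the available memory.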

Florian Finkernagel

Nov 9, 2011, 10:37:06 AM

to Pysam User group

Hi,

I've not checked the code, but the documentation says

"Without reference or region all reads will be fetched. The reads will
be returned ordered by reference sequence, which will not necessarily
be the order within the file."

I always took this to mean that fetch() is slower than seek(0, 0)
followed by fetch(until_eof=True) which does not order by reference
sequence.

Could that be the difference?

So long,
Florian
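
The two ways of reading a whole file that Florian contrasts can be written side by side. This is only a sketch: the function names are hypothetical, and `alignment_file` is assumed to be an already opened pysam file object (a freshly opened one, so that fetch(until_eof=True) starts at the first read and no explicit seek is needed). As far as the pysam documentation describes it, until_eof=True also returns unmapped reads, so the two counts need not agree.

```python
def count_by_reference(alignment_file):
    """Count reads using fetch() with no arguments.

    On an indexed BAM file the reads come back grouped by reference sequence.
    """
    return sum(1 for _ in alignment_file.fetch())


def count_in_file_order(alignment_file):
    """Count reads using fetch(until_eof=True).

    Reads are returned in the order they are stored, from the current
    file position to the end of the file.
    """
    return sum(1 for _ in alignment_file.fetch(until_eof=True))
```

Timing each function on the same BAM file is one way to check whether the ordering by reference accounts for any of the difference.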

Andreas

Nov 10, 2011, 6:11:26 AM

to Pysam User group

Hi Florian,

Thanks. You are correct.

There is very little difference in overhead between fetch() and
seek(0, 0); fetch(until_eof=True). The former seeks to the beginning
of each chromosome once and then reads sequentially, while the latter
seeks only once. Unless you have a very large number of chromosomes,
I guess the difference should not be very noticeable.

Best wishes,
Andreas
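
The "one seek per chromosome" behaviour described above can be pictured as a loop over the reference sequences. This is a rough illustration of the idea, not pysam's actual implementation; the function name is hypothetical, and `alignment_file` is assumed to be an indexed BAM opened with pysam, whose `references` attribute lists the reference names.

```python
def iter_reads_per_reference(alignment_file):
    """Yield reads reference by reference.

    Each fetch(reference) call seeks once to the start of that reference
    and then reads sequentially, so the number of seeks equals the number
    of reference sequences.
    """
    for reference in alignment_file.references:
        for read in alignment_file.fetch(reference):
            yield read
```

With a few dozen chromosomes the extra seeks are negligible next to the sequential reading; they would only start to matter for assemblies with a very large number of contigs.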
