getting base quality scores with pileupread.alignment.qual[pileupread.query_position] very slow

276 views
Skip to first unread message

Ellen Gunnarsdottir

unread,
Jun 1, 2015, 8:14:56 AM6/1/15
to pysam-us...@googlegroups.com
Hi,

I´m processing a bam file (size 16,000 KB) where I am counting alleles at each positions and obtaining base quality scores using Pysam. When I only get alleles with "pileupread.alignment.query_sequence[pileupread.query_position]" the processing takes 25 seconds and but if I also get the quality scores with "ord(pileupread.alignment.qual[pileupread.query_position]) - 33" it takes 380 seconds. I dont understand why the processing of the quality scores is taking so long? Has anyone else encountered this problem?

Thanks,
EDG

Marcel Martin

unread,
Jun 1, 2015, 8:36:20 AM6/1/15
to pysam-us...@googlegroups.com
Instead of

ord(pileupread.alignment.qual[pileupread.query_position]) - 33

use

pileupread.alignment.query_qualities[pileupread.query_position]

The attribute .query_qualities is easier to work with than the .qual
attribute and should be preferred (no need for ord() and no need for
subtracting 33). It was introduced a few pysam versions back.

The .qual attribute still exists, but is unfortunately very slow. It is
actually a property that converts the full sequence of qualities from
the internal representation every time it is accessed.

Marcel

Ellen Gunnarsdottir

unread,
Jun 1, 2015, 8:42:59 AM6/1/15
to pysam-us...@googlegroups.com
Thanks a million Marcel, now the whole procedure takes 40 seconds!

EDG

Kevin Jacobs

unread,
Jun 1, 2015, 8:43:59 AM6/1/15
to pysam-us...@googlegroups.com
I've run into this issue before. Every time you access a single base quality score, the code will recreate the entire quality array. One solution is to use a LRU cache (of bounded size) to cache qual arrays.  This will inexpensively mitigate almost all of the cost of array creation after the first. You can get a second speed up by pre-converting the cached qual arrays to numpy arrays, avoiding the need to call ord() and subtracting 33 each time. 

-Kevin

Sent from my tiny remote device. 
--
You received this message because you are subscribed to the Google Groups "Pysam User group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-gro...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages