Pickling AlignedSegment

Jordi Camps

Oct 4, 2021, 2:30:23 PM

to Pysam User group

Hello,

I see that this issue has already been discussed, but I'm also trying to do something with AlignedSegment that involves pickling.

Rather than showing you the particular errors I'm hitting, I'll describe what I need so you can point me to the best practice for achieving it.

I'm trying to modify third-party software that runs sequentially so that it runs in parallel. Its main purpose is a kind of duplicate-marking procedure. It traverses a coordinate-sorted but potentially unindexed BAM file, extracts sets of AlignedSegments that share the same chromosome and position, does its magic to add some tags and mark the appropriate AlignedSegments as duplicates, and writes the result to a new file.
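For illustration, the grouping step described above could be sketched roughly as follows. This is a minimal sketch and not the third-party tool's code: the records are stand-in tuples of (chromosome, start, read name), and `groups_by_start` is a hypothetical helper. With pysam, the assumption is that the same idea would apply to reads streamed with `AlignmentFile.fetch(until_eof=True)`, keyed on `(read.reference_id, read.reference_start)`.

```python
from itertools import groupby
from operator import itemgetter


def groups_by_start(records, key=itemgetter(0, 1)):
    """Yield lists of consecutive records that share the same key.

    Relies on the input being coordinate sorted, so equal keys are adjacent.
    """
    for _, group in groupby(records, key=key):
        yield list(group)


# Stand-in records: (chromosome, start, read name).
records = [
    ("chr1", 100, "r1"),
    ("chr1", 100, "r2"),
    ("chr1", 250, "r3"),
    ("chr2", 100, "r4"),
    (None, -1, "r5"),  # stand-in for an unmapped read at the end of the file
]

groups = list(groups_by_start(records))
```

Because `groupby` only merges adjacent items, this needs a single sequential pass and no index. Unmapped reads without coordinates all share one key, so they would land in a single group that probably needs separate handling.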

My initial approach is to use one multiprocessing.Process to read the BAM file sequentially, generate the sets of reads and pass them through a multiprocessing.Queue to a worker Process. The worker does its magic and sends the result through another Queue to a writer Process, which takes care of writing the data in the proper order (the same as the input).
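A minimal sketch of the reader / worker / writer layout described above, using only the standard library. It assumes each set of reads is sent as a list of plain strings (for example SAM text lines) rather than as AlignedSegment objects. The names `worker`, `drain`, `reorder`, `feed` and `run_pipeline` are hypothetical, and the upper-casing step is only a placeholder for the real duplicate-marking logic. Each task carries a sequence number so that the output order can be restored to match the input.

```python
import multiprocessing as mp
import threading

SENTINEL = None  # marks the end of a stream


def reorder(indexed_results):
    """Yield results in index order from (index, result) pairs arriving in any order."""
    pending = {}
    next_index = 0
    for index, result in indexed_results:
        pending[index] = result
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


def worker(task_queue, result_queue):
    """Process (index, group) tasks until the sentinel arrives."""
    for index, group in iter(task_queue.get, SENTINEL):
        # Placeholder for the real duplicate-marking logic.
        result_queue.put((index, [line.upper() for line in group]))
    result_queue.put(SENTINEL)  # tell the consumer this worker is done


def drain(result_queue, n_workers):
    """Yield (index, result) pairs until every worker has sent its sentinel."""
    finished = 0
    while finished < n_workers:
        item = result_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            yield item


def feed(groups, task_queue, n_workers):
    """Reader side: number the groups, then send one sentinel per worker."""
    for index, group in enumerate(groups):
        task_queue.put((index, group))
    for _ in range(n_workers):
        task_queue.put(SENTINEL)


def run_pipeline(groups, n_workers=2):
    """Yield processed groups in input order."""
    task_queue = mp.Queue(maxsize=100)  # bounded, so the reader cannot run far ahead
    result_queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(n_workers)]
    for proc in workers:
        proc.start()
    feeder = threading.Thread(target=feed, args=(groups, task_queue, n_workers))
    feeder.start()
    # Drain all results before joining, so no worker blocks on a full pipe.
    yield from reorder(drain(result_queue, n_workers))
    feeder.join()
    for proc in workers:
        proc.join()
```

Iterating over `run_pipeline(groups)` from under an `if __name__ == "__main__":` guard yields the processed groups in input order. In this sketch the reader runs in a thread and the writer is whatever consumes the generator, rather than each being a separate Process.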

It fails because of the known issue that AlignedSegment is not picklable. Making the object picklable does not seem difficult, but it involves modifying the pysam sources. I could not find a way to achieve this from the outside (without modifying the pysam sources).
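One standard-library mechanism that might be worth trying here is `copyreg`, which lets a reduction function be registered for a type without touching its source. The sketch below uses a stand-in class `Read` that refuses to pickle; `Read`, `rebuild_read` and `reduce_read` are hypothetical names. Applying this to pysam is an assumption: the text form would presumably come from `AlignedSegment.to_string()` and be rebuilt with `AlignedSegment.fromstring(sam, header)`, which also needs the header on the receiving side (for example carried as `header.to_dict()` and rebuilt with `AlignmentHeader.from_dict()`).

```python
import copyreg
import pickle


class Read:
    """Stand-in for a type that refuses to be pickled."""

    def __init__(self, sam_line):
        self.sam_line = sam_line

    def __reduce_ex__(self, protocol):
        raise TypeError("cannot pickle Read")

    def to_string(self):
        return self.sam_line


def rebuild_read(sam_line):
    """Module-level constructor, so pickle can reference it by name."""
    return Read(sam_line)


def reduce_read(read):
    """Describe a Read as (callable, args) for pickle."""
    return rebuild_read, (read.to_string(),)


# Register the reducer from outside the class; pickle consults this table
# before falling back to the object's own __reduce_ex__.
copyreg.pickle(Read, reduce_read)

clone = pickle.loads(pickle.dumps(Read("r1\t0\tchr1\t100")))
```

This does not avoid the copies: every read is still serialized to text and parsed again on the other side.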

It is also not a recommended strategy, because of the large number of object copies involved.

Another approach seen in the forum is to tell the workers which region to fetch. This does not seem like a good idea: you will quickly be limited by the underlying I/O, fetch needs an index, it does not return the unmapped reads (which I want to keep), and it offers no way of fetching only the reads that start at a single position (maybe I'm wrong, but I think that specifying a single-base region returns all reads that cover that region).

Also, doing an index lookup for every single read quickly multiplies the random disk accesses needed, whereas a single thread reading the file sequentially should be much faster.

So the question is: what is the correct approach to this problem?

Thanks a lot for your suggestions.
