pickling a bam read for parallel-processing


Stathis Kanterakis

Aug 27, 2020, 8:10:02 AM
to Pysam User group
Hello pysam users,
I want to use multiprocessing to speed up my bam-parsing code but Samfile reads are not pickleable and thus cannot be sent to different processes. I've documented the issue here:
The alternative would be to store whatever read info I need in a dict, which is pickleable, but I was wondering if anyone has tried this before and what they recommend.
Thanks!
Stathis

f.fink...@googlemail.com

Aug 27, 2020, 8:48:46 AM
to Pysam User group
Don't. Sending pickled objects will impose quite a bit of overhead that will eat your parallelization improvements. (It's turn-reads-to-dicts, turn-bytes-into-pickle, copy-pickle-in-memory-across-process-boundaries, turn-pickle-into-bytes... that's at least 4 copies, and pickling ain't terribly fast...)

The fast way is to hand off a sensible number of *regions* to each process, which will then open the bam itself and do a fetch.

You need to take care of reads being pulled in multiple regions though - I usually check if read.pos is within my start/stop.
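
A minimal sketch of the region-based approach described above, assuming an indexed BAM file. The helper names (`make_regions`, `count_reads`, `count_reads_parallel`), the chunk size, and the read-counting task are illustrative and not from this thread. It checks `read.reference_start`, which current pysam documents as the replacement for the older `read.pos` alias.

```python
from multiprocessing import Pool


def make_regions(contig_lengths, chunk_size):
    """Split each contig into half-open (contig, start, stop) chunks."""
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            regions.append((contig, start, min(start + chunk_size, length)))
    return regions


def count_reads(task):
    """Worker: open the BAM in this process and count reads starting in the region."""
    import pysam  # third-party; each worker opens its own file handle

    bam_path, contig, start, stop = task
    n = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, stop):
            # fetch() yields every read overlapping the region, so a read that
            # spans a boundary is seen by two workers. Keep it only in the
            # region where it starts.
            if start <= read.reference_start < stop:
                n += 1
    return n


def count_reads_parallel(bam_path, contig_lengths, chunk_size=1_000_000, processes=4):
    """Fan regions out to a process pool and sum the per-region counts."""
    tasks = [(bam_path, c, s, e) for c, s, e in make_regions(contig_lengths, chunk_size)]
    with Pool(processes) as pool:
        return sum(pool.map(count_reads, tasks))
```

Only small tuples cross the process boundary here, never reads. The contig lengths can be taken from the header, for example `dict(zip(bam.references, bam.lengths))`, and on platforms that spawn workers the call to `count_reads_parallel` belongs under an `if __name__ == "__main__":` guard.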

Peter LoVerso

Aug 27, 2020, 11:53:43 AM
to pysam-us...@googlegroups.com
Usually you will not want to do this, for the reasons mentioned above.

Occasionally it is beneficial, though.

You will need serializable versions of both your read, and the bam header. The fastest method is to call to_string() on your read, NOT to_dict() (which calls to_string and then splits on tabs to make it a dict). The string that comes out will be serializable. You can do this in your own __reduce__ method if you like.

To turn this back into a read, pass that string plus your bam header to AlignedSegment.fromstring().
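
A rough sketch of that round trip, assuming a recent pysam in which `bam.header` is an `AlignmentHeader` offering `to_dict()` and `from_dict()`. The function names `pack_batch` and `unpack_batch` are made up for illustration, and shipping the header as a dict once per batch is one possible choice, not something prescribed above.

```python
import pickle


def pack_batch(reads, header):
    """Turn reads plus their header into one picklable bytes blob."""
    # to_string() gives the SAM-format line for each read; the header is
    # shipped once per batch as a plain dict.
    return pickle.dumps((header.to_dict(), [read.to_string() for read in reads]))


def unpack_batch(blob):
    """Rebuild pysam.AlignedSegment objects on the receiving side."""
    import pysam  # third-party; only needed where reads are rebuilt

    header_dict, sam_lines = pickle.loads(blob)
    # Build the header once, then reuse it for every read in the batch.
    header = pysam.AlignmentHeader.from_dict(header_dict)
    return [pysam.AlignedSegment.fromstring(line, header) for line in sam_lines]
```

On the sending side this would be called as `pack_batch(reads, bam.header)`. Batching several reads per blob keeps the per-message overhead down, though the copying cost mentioned earlier in the thread still applies.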

--
You received this message because you are subscribed to the Google Groups "Pysam User group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-gro...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pysam-user-group/f2b51945-4b42-49e7-a696-5293c8ee5d58n%40googlegroups.com.

Stathis Kanterakis

Sep 14, 2020, 7:55:18 AM
to Pysam User group
Thank you all for your tips! I was able to get a 3.5x speed improvement using multiprocessing rather than threading, and the bottleneck now seems to be memory instead of CPU. It also simplified my code quite a bit. Wish I'd known from the start! I'm kind of hating parallel processing in Python by now...

f.fink...@googlemail.com

Sep 14, 2020, 8:39:44 AM
to Pysam User group
Glad you got some speed up :).

Personally, I have given up on getting python fast.

Recently I replaced a BLAS-heavy (so multicore, in parts, and copy-heavy) Cython bam-processing code with a 'proper' multithreaded Rust implementation optimized for contiguous memory access, and it went from 19 hours down to 3 minutes of processing time...

Stathis Kanterakis

Sep 14, 2020, 8:55:18 AM
to pysam-us...@googlegroups.com
jesus... time for me to upskill!!
