pysam and multiprocessing

Ernesto

Oct 9, 2012, 4:25:20 AM
to pysam-us...@googlegroups.com
Hi All,

I wrote a simple script that reads a BAM file position by position using pysam. To speed things up, I split the run across several worker processes using the multiprocessing module, with each worker handling one chromosome at a time. Although the script seems to work, I get different results when I run it repeatedly on the same simple dataset. Specifically, for each chromosome I collect all reads that show mismatches with the reference. With a single worker, the results are consistent and always the same. As I increase the number of workers, the results become unstable, and the number of recovered reads sometimes differs between runs.
Since I'm not a computer scientist, I'm wondering whether the same BAM file can be accessed by pysam from multiple processes. Is there a way to avoid this behavior? I would like to read multiple chromosomes from the same BAM file independently in order to speed up the whole job.
Thanks in advance.

Regards,

Ernesto

Andreas Heger

Oct 9, 2012, 7:54:20 AM
to pysam-us...@googlegroups.com
Dear Ernesto,

thanks for using pysam. You are on the right track.

Using the multiprocessing module should work, but to be on the safe
side, open the file separately for reading in each worker process.
That is, instead of passing a Samfile instance to a subprocess, pass
the filename and create a new Samfile from it inside the function
that runs in parallel.

Multi-threading definitely has side effects, and multi-processing
possibly does too. In general, it is best to avoid accessing a BAM
file through the same Samfile instance from multiple threads or
processes.
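
The advice above can be sketched roughly as follows. This is only a minimal illustration, not Ernesto's script: the function names and the (bam_path, chrom) task tuple are hypothetical, pysam.AlignmentFile is the name used by current pysam releases for what this thread calls Samfile, the BAM file is assumed to be indexed so that fetch() can jump to a chromosome, and the NM (edit distance) tag is assumed to be present and is used here as a stand-in for "reads showing mismatches".

```python
import multiprocessing


def count_mismatched_reads(task):
    """Count reads with a non-zero NM tag on one chromosome.

    `task` is a (bam_path, chrom) tuple so the function can be used
    directly with Pool.map. Both names are illustrative.
    """
    bam_path, chrom = task
    # Imported inside the worker; assumes pysam is installed.
    import pysam

    # Open a fresh handle in this process instead of receiving an
    # already open file object from the parent.
    bam = pysam.AlignmentFile(bam_path, "rb")
    try:
        count = 0
        # Assumption: an index exists, so fetch() can select a chromosome.
        for read in bam.fetch(chrom):
            # Assumption: the aligner wrote the NM (edit distance) tag.
            if read.has_tag("NM") and read.get_tag("NM") > 0:
                count += 1
    finally:
        bam.close()
    return chrom, count


def count_per_chromosome(bam_path, chroms, processes=4):
    """Run one task per chromosome; only plain strings are sent to workers."""
    tasks = [(bam_path, chrom) for chrom in chroms]
    with multiprocessing.Pool(processes) as pool:
        return dict(pool.map(count_mismatched_reads, tasks))
```

A script would call something like count_per_chromosome("sample.bam", ["chr1", "chr2"]) from under an `if __name__ == "__main__":` guard, where the file name and chromosome names are placeholders. Because each worker only receives the path and a chromosome name, no open file handle is shared between processes.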

Best wishes,
Andreas