pysam.fastafile.fetch in parallel using multiprocessing

594 views
Skip to first unread message

Iñigo Prada Luengo

unread,
Feb 21, 2017, 4:51:29 AM2/21/17
to Pysam User group
Hi, I am trying to run some simulations of fastq files and the way I get the sequences is via pysam.fastafile.fetch. One I run it with one core, the result is fine. However, when I run it with multiple cores, the process reading the same file get corrupted and are not able to read the sequence properly, and finally, pysam fasta fetch does not return anything.

Is this a know behaivour?

Florian Finkernagel

unread,
Feb 21, 2017, 4:56:31 AM2/21/17
to Pysam User group
Well, it is not unexpected. When forking processes (I assume you're on linux), they share the actual file handles,
and when one reads, the position within the file for all other processes also advances.

That means for you for now that you need to reopen the fasta files in each process.

I remember we had the same issue with Samfiles back in the day and added an appropriate patch,
this probably needs to be done here as well.

Iñigo Prada Luengo

unread,
Feb 21, 2017, 9:06:02 AM2/21/17
to Pysam User group
I do open the fasta file inside of each process but it does not work either

Florian Finkernagel

unread,
Feb 21, 2017, 9:24:28 AM2/21/17
to Pysam User group
Well, can you distill it down into a minimal code example showing the issue?

Iñigo Prada Luengo

unread,
Feb 21, 2017, 9:40:05 AM2/21/17
to Pysam User group
sure:

Here is the function I am implementing in parallel:

def sim_single_end(genome_fa,chr,chr_pos_start,chr_pos_end,read_length, unique_id):
#create fastafile object
fastafile = ps.FastaFile(genome_fa)
#sample left read
left_split_read = fastafile.fetch(chr, chr_pos_end - (read_length / 2), chr_pos_end)
# sample right read
right_split_read = fastafile.fetch(chr, chr_pos_start, chr_pos_start + (read_length / 2))
#reverse it
reversed_left_split_read = left_split_read[::-1]

#save the data
total_read = reversed_left_split_read + right_split_read
seq_id = "id:%s-%s|left_pos:%s-%s|right:%s-%s " % (unique_id,chr, int(chr_pos_end - (read_length / 2)), int(chr_pos_end), int(chr_pos_start),int(chr_pos_start + (read_length / 2)))
quality = "I" * read_length
fastq_string = "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)
new_record = SeqIO.read(StringIO(fastq_string), "fastq")
return(new_record)

The problem is, that when I run this function with multiple cores,the variable "left_split_read" and "right_split_read" are empty. They do not read the genome

Blaise Li

unread,
Feb 24, 2017, 12:00:27 PM2/24/17
to pysam-us...@googlegroups.com
This is not the exact same situation as you describe, but I also encountered problems when working with pysam in parallel: https://github.com/pysam-developers/pysam/issues/397

2017-02-21 10:51 GMT+01:00 Iñigo Prada Luengo <iprada...@gmail.com>:
Hi, I am trying to run some simulations of fastq files and the way I get the sequences is via pysam.fastafile.fetch. One I run it with one core, the result is fine. However, when I run it with multiple cores, the process reading the same file get corrupted and are not able to read the sequence properly, and finally, pysam fasta fetch does not return anything.

Is this a know behaivour?

--
You received this message because you are subscribed to the Google Groups "Pysam User group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-group+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andreas

unread,
Feb 26, 2017, 5:01:23 PM2/26/17
to Pysam User group
Thanks, I have opened a github issue. I will try to replicate and then investigate


Andreas

unread,
Feb 26, 2017, 5:01:51 PM2/26/17
to Pysam User group

Iñigo Prada Luengo

unread,
Mar 6, 2017, 7:39:00 AM3/6/17
to Pysam User group
Great! Than you!
Reply all
Reply to author
Forward
0 new messages