pysam.fastafile.fetch in parallel using multiprocessing

Iñigo Prada Luengo

unread,

Feb 21, 2017, 4:51:29 AM2/21/17

to Pysam User group

Hi, I am trying to run some simulations of fastq files and the way I get the sequences is via pysam.fastafile.fetch. One I run it with one core, the result is fine. However, when I run it with multiple cores, the process reading the same file get corrupted and are not able to read the sequence properly, and finally, pysam fasta fetch does not return anything.

Is this a know behaivour?

Florian Finkernagel

unread,

Feb 21, 2017, 4:56:31 AM2/21/17

to Pysam User group

Well, it is not unexpected. When forking processes (I assume you're on linux), they share the actual file handles,

and when one reads, the position within the file for all other processes also advances.

That means for you for now that you need to reopen the fasta files in each process.

I remember we had the same issue with Samfiles back in the day and added an appropriate patch,

this probably needs to be done here as well.

Iñigo Prada Luengo

unread,

Feb 21, 2017, 9:06:02 AM2/21/17

to Pysam User group

I do open the fasta file inside of each process but it does not work either

Florian Finkernagel

unread,

Feb 21, 2017, 9:24:28 AM2/21/17

to Pysam User group

Well, can you distill it down into a minimal code example showing the issue?

Iñigo Prada Luengo

unread,

Feb 21, 2017, 9:40:05 AM2/21/17

to Pysam User group

sure:

Here is the function I am implementing in parallel:

def sim_single_end(genome_fa,chr,chr_pos_start,chr_pos_end,read_length, unique_id):
    #create fastafile object
    fastafile = ps.FastaFile(genome_fa)
    #sample left read
    left_split_read = fastafile.fetch(chr, chr_pos_end - (read_length / 2), chr_pos_end)
    # sample right read
    right_split_read = fastafile.fetch(chr, chr_pos_start, chr_pos_start + (read_length / 2))
    #reverse it
    reversed_left_split_read = left_split_read[::-1]
    
    #save the data
    total_read = reversed_left_split_read + right_split_read
    seq_id = "id:%s-%s|left_pos:%s-%s|right:%s-%s " % (unique_id,chr, int(chr_pos_end - (read_length / 2)), int(chr_pos_end), int(chr_pos_start),int(chr_pos_start + (read_length / 2)))
    quality = "I" * read_length
    fastq_string = "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
    return(new_record)

The problem is, that when I run this function with multiple cores,the variable "left_split_read" and "right_split_read" are empty. They do not read the genome

Blaise Li

unread,

Feb 24, 2017, 12:00:27 PM2/24/17

to pysam-us...@googlegroups.com

This is not the exact same situation as you describe, but I also encountered problems when working with pysam in parallel: https://github.com/pysam-developers/pysam/issues/397

2017-02-21 10:51 GMT+01:00 Iñigo Prada Luengo <iprada...@gmail.com>:

Hi, I am trying to run some simulations of fastq files and the way I get the sequences is via pysam.fastafile.fetch. One I run it with one core, the result is fine. However, when I run it with multiple cores, the process reading the same file get corrupted and are not able to read the sequence properly, and finally, pysam fasta fetch does not return anything.

Is this a know behaivour?

--
You received this message because you are subscribed to the Google Groups "Pysam User group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-group+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andreas

unread,

Feb 26, 2017, 5:01:23 PM2/26/17

to Pysam User group

Thanks, I have opened a github issue. I will try to replicate and then investigate

Andreas

unread,

Feb 26, 2017, 5:01:51 PM2/26/17

to Pysam User group

Please see https://github.com/pysam-developers/pysam/issues/409

Iñigo Prada Luengo

unread,

Mar 6, 2017, 7:39:00 AM3/6/17

to Pysam User group

Great! Than you!

Reply all

Reply to author

Forward