Parallel processing in Ruffus


Dr Ashwin Kotwaliwale

Jan 8, 2016, 12:02:58 AM
to ruffus_discuss
Hi there,
I am very new to Ruffus, hence this question.

I have a set of VCF files, and my analysis program uses all of them together rather than one file at a time. I have written a Python function that accepts a list of the VCF files, and I can call this function using the @originate decorator.

Since the VCF files contain data for all chromosomes, I modified my code to work on one chromosome at a time across all the VCF files together. The issue I am facing is how to call my function in Ruffus so that all chromosomes are processed in parallel, something like:

doWork(['a.vcf', 'b.vcf'], chromosome='1')
doWork(['a.vcf', 'b.vcf'], chromosome='2')

Is there a way to do this in Ruffus?

Thanks in advance.

Leo Goodstadt 顧維斌

Jan 8, 2016, 5:03:09 AM
to ruffus_...@googlegroups.com
Dear Ashwin,

The obvious way to do this would be something like

chromosome_names = [str(cc) for cc in list(range(1, 20)) + ["X", "Y"]]
@transform(chromosome_names, add_inputs(["a.vcf", "b.vcf"]) ... )

etc

Unfortunately, in Ruffus, strings are assumed to be file names, and Ruffus will complain that the files "1", "X" etc. do not exist.

A future version of Ruffus in development will allow you to get around that but in the meantime, there are two things you can do:

1) Working with Ruffus
Bite the bullet and create a list of files called "1.chr", "2.chr" etc (some consistent extension as usual), with @originate, and then work with these files in the rest of the pipeline. The @product decorator is especially useful here (and was in fact introduced specifically for this use case) as it allows you to have each chromosome of your vcf files analysed in parallel. The syntax can be a bit tricky so please feel free to post another question if you get stuck. The following (untested) code should hopefully point you in the right direction.


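from ruffus import originate, product, formatter

vcf_files = ["a.vcf", "b.vcf"]
chromosome_names = ["%s.chr" % cc for cc in (list(range(1, 20)) + ["X", "Y"])]

# Create one empty marker file per chromosome.
@originate(chromosome_names)
def create_chromosome_names(output_file):
    with open(output_file, "w") as oo:
        pass

# One job per (chromosome marker file, VCF file) combination.
@product(create_chromosome_names, formatter(),
         vcf_files, formatter(),
         "chr{basename[0][0]}.{basename[1][0]}.output")
def analyse_chromosome(input_file_names, output_file_name):  # function name is a placeholder
    chr_name, vcf_file_name = input_file_names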

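To see which jobs the @product step would generate, the pairing can be mimicked in plain Python without Ruffus. This is only a sketch: expected_output_names is a hypothetical helper, the VCF names are the a.vcf and b.vcf from the question, and it assumes that formatter()'s basename is the file name with its directory and extension removed.

```python
import itertools
import os


def expected_output_names(chromosome_files, vcf_files):
    """Hypothetical helper: list the (inputs, output name) pair for each job."""
    jobs = []
    # Every chromosome marker file is paired with every VCF file.
    for chr_file, vcf_file in itertools.product(chromosome_files, vcf_files):
        # Assumption: basename means the file name without directory or extension.
        chr_base = os.path.splitext(os.path.basename(chr_file))[0]
        vcf_base = os.path.splitext(os.path.basename(vcf_file))[0]
        output_name = "chr%s.%s.output" % (chr_base, vcf_base)
        jobs.append(((chr_file, vcf_file), output_name))
    return jobs


chromosome_files = ["%s.chr" % cc for cc in list(range(1, 20)) + ["X", "Y"]]
jobs = expected_output_names(chromosome_files, ["a.vcf", "b.vcf"])
print(len(jobs))  # 21 marker files x 2 VCF files
print(jobs[0])
```

Each entry corresponds to one independent job, which is what lets the chromosomes run in parallel.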

The only downside is that you get a list of extra files (called "1.chr", "2.chr" etc.). We have found that it helps to organise them in their own "chromosome" directory.
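Keeping the marker files in their own directory could look something like the sketch below. The helper names (chromosome_marker_paths, touch_all) are made up for illustration, and a temporary directory stands in for your working directory. The list of paths could then be passed to @originate in place of the bare file names, as long as the directory exists before the files are created.

```python
import os
import tempfile


def chromosome_marker_paths(directory, names):
    """Hypothetical helper: build "<directory>/<name>.chr" for each chromosome."""
    return [os.path.join(directory, "%s.chr" % name) for name in names]


def touch_all(paths):
    """Create each (empty) marker file, making its parent directory if needed."""
    for path in paths:
        parent = os.path.dirname(path)
        if parent and not os.path.isdir(parent):
            os.makedirs(parent)
        with open(path, "w"):
            pass


# A temporary directory is used here only so the sketch does not touch real data.
work_dir = tempfile.mkdtemp()
marker_paths = chromosome_marker_paths(
    os.path.join(work_dir, "chromosome"),
    list(range(1, 20)) + ["X", "Y"],
)
touch_all(marker_paths)
```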

2) Working around Ruffus
If you wrap your strings in any other type, they will no longer be regarded as file names. However, you lose the handy way Ruffus constructs output file names from input strings.

class not_a_string(object):
    """Wrapper so that Ruffus does not treat the chromosome name as a file name."""
    def __init__(self, name):
        self.name = str(name)

@transform(your_vcfs, add_inputs(list(map(not_a_string, list(range(1, 20)) + ["X", "Y"]))), ...)

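As a quick check of the wrapper idea outside Ruffus, the sketch below shows that the wrapped values are no longer strings and how the name is recovered inside your own function. chromosome_of is a hypothetical helper, and the __repr__ method is an addition that only makes printed values readable.

```python
class not_a_string(object):
    """Wrapper so the chromosome name is not mistaken for a file name."""

    def __init__(self, name):
        self.name = str(name)

    def __repr__(self):
        # Not required; added here only to make debugging output readable.
        return "not_a_string(%r)" % self.name


def chromosome_of(extra_input):
    """Hypothetical helper: unwrap the chromosome name inside a task function."""
    return extra_input.name


wrapped = [not_a_string(cc) for cc in list(range(1, 20)) + ["X", "Y"]]
is_any_string = any(isinstance(w, str) for w in wrapped)
names = [chromosome_of(w) for w in wrapped]
print(is_any_string)
print(names[:3], names[-2:])
```

Because the wrapped values are plain objects, the output file names have to be built by hand from the name attribute.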

Hope that helps.

Leo



