Is there a way to prevent copying input and output files if all relevant compute resources share a common file system?

Henrik Seidel

Aug 5, 2016, 3:37:43 AM
to gc3pie
Hello,

we use gc3pie on a Linux cluster with a shared GPFS file system and IBM LSF for job scheduling. The gc3pie pipelines are started on a cluster head node which also has the GPFS file system mounted. So there is no real need to copy input files to the compute nodes and results back to the head node: the data is already there because the file system is shared. We would like to avoid copying the data because it puts a high load on the I/O (the network interfaces) of the head node. For next-generation sequencing, which involves large data volumes, copying the data back and forth often takes longer than the actual computations on the compute nodes. Also, the responsiveness of the head node degrades considerably when the pipeline is copying data for something like 50 jobs in parallel (we are talking about a terabyte in total here, distributed across 50 parallel I/O processes).

Is there a way to make gc3pie use the files directly by providing a path on a shared file system instead of copying them?

Thanks for your help, and sorry if this has been answered previously (just took over the maintenance of some pipelines, and couldn't find anything about this by searching the group postings).

Regards
Henrik

Riccardo Murri

Aug 8, 2016, 7:52:23 AM
to gc3...@googlegroups.com

Hello Henrik,

Yes, there is a way, provided that:

- there is a common file system across all compute nodes and the host where the GC3Pie script is running;

- the mount points for these file systems are the same across all nodes.

In your case, you should be fine if you run the GC3Pie driver script on the cluster head node. It would not work, however, if you ran GC3Pie on e.g. a laptop and SSH'ed into the cluster.

The trick is simple: just omit input files from the application definition, and directly refer to input files by absolute path name in the arguments parameter:

  Application(
    arguments=['myprog', '/gpfs/foo/bar'],
    inputs=[],
    outputs=['baz.out'],
    # ...
  )
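One way to assemble those parameters, as a minimal sketch: `shared_fs_kwargs` is a hypothetical helper (not part of GC3Pie) that refuses relative input paths and returns the three keyword arguments shown above. The result would be passed to `Application(...)` together with whatever other parameters your application already needs; those are not shown here.

```python
import posixpath  # cluster paths are POSIX paths, whatever host runs this


def shared_fs_kwargs(program, input_paths, outputs):
    """Build `arguments`/`inputs`/`outputs` for a job whose input files
    already live on the shared file system (hypothetical helper)."""
    for path in input_paths:
        # A relative path would be resolved against the job's working
        # directory on the compute node, so insist on absolute paths.
        if not posixpath.isabs(path):
            raise ValueError("input path must be absolute: %r" % (path,))
    return {
        'arguments': [program] + list(input_paths),
        'inputs': [],              # nothing to copy to the compute node
        'outputs': list(outputs),  # still collected the usual way
    }


kwargs = shared_fs_kwargs('myprog', ['/gpfs/foo/bar'], ['baz.out'])
```

The path check only guards against typos; it cannot verify that the path is actually mounted at the same location on every node.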

Note that the 'outputs=' parameter is still needed to make use of the usual GC3Pie mechanism for collecting output files in the location where the script is running. Should you also want to keep output files in a GPFS directory, keep 'outputs=' empty and add a 'terminated()' method:

  import shutil  # at the top of the script

  # in class MyApplication
  def terminated(self):
    shutil.move(self.execution.lrms_execdir + '/baz.out', '/gpfs/outfiles/')
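The move itself can be tried out independently of GC3Pie. The sketch below is an illustration under assumptions: `move_outputs` is a hypothetical helper, and two temporary directories stand in for the job's execution directory and for `/gpfs/outfiles/`. In a real application it would presumably be called from `terminated()` with `self.execution.lrms_execdir` as the first argument, as in the snippet above.

```python
import os
import shutil
import tempfile


def move_outputs(execdir, names, destdir):
    """Move the named files from `execdir` into `destdir`.

    Returns the list of destination paths of the files actually moved.
    """
    moved = []
    for name in names:
        src = os.path.join(execdir, name)
        # A failed job may not have written every expected file.
        if os.path.exists(src):
            moved.append(shutil.move(src, os.path.join(destdir, name)))
    return moved


# Stand-in directories so the sketch runs anywhere (assumption: on the
# cluster these would be the execution directory and a GPFS directory).
execdir = tempfile.mkdtemp()
destdir = tempfile.mkdtemp()
with open(os.path.join(execdir, 'baz.out'), 'w') as f:
    f.write('result\n')

moved = move_outputs(execdir, ['baz.out', 'missing.out'], destdir)
```

Skipping missing files keeps `terminated()` from raising when a job fails before producing its output; whether that is the right policy depends on the pipeline.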

I'm sorry I cannot provide a better worked-out example now, but I'm travelling and can only use the iPhone for processing emails.

Hope this helps!

Ciao,
R


--
You received this message because you are subscribed to the Google Groups "gc3pie" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gc3pie+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Henrik Seidel

Sep 12, 2016, 9:52:05 AM
to gc3pie
Dear Riccardo,

thanks for your recommendations, and sorry for the late answer - I probably missed your response in the large list of emails waiting for me after my vacation. We will definitely give it a try in the way you suggested.

Regards
Henrik