Interface that removes a MapNode input file triggers node re-execution


Baldur

Dec 11, 2012, 1:20:53 PM
to nipy...@googlegroups.com
With a MapNode wrapping an interface copied from Chris's gzip example (http://nipy.sourceforge.net/nipype/0.5.3/devel/cmd_interface_devel.html), execution is triggered a second time after the sub-nodes have run successfully. The result is first a gzip run that produces the expected .gz files, followed by a second gzip run that fails because the input files are now missing. The problem only appears once the MapNode has two or more inputs.
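
Roughly, the setup looks like this (a sketch, not my exact code - the interface follows the devel-guide gzip example in spirit, and the paths are illustrative):

from nipype.interfaces.base import (CommandLine, CommandLineInputSpec,
                                    TraitedSpec, File)
from nipype.pipeline.engine import MapNode, Workflow


class GZipInputSpec(CommandLineInputSpec):
    input_file = File(desc='file to compress', exists=True,
                      mandatory=True, argstr='%s')


class GZipOutputSpec(TraitedSpec):
    output_file = File(desc='compressed output file')


class GZipTask(CommandLine):
    input_spec = GZipInputSpec
    output_spec = GZipOutputSpec
    _cmd = 'gzip'

    def _list_outputs(self):
        # gzip writes <file>.gz next to the input (and removes the input)
        outputs = self.output_spec().get()
        outputs['output_file'] = self.inputs.input_file + '.gz'
        return outputs


# the problem only appears with two or more inputs to the MapNode
zip_files = MapNode(GZipTask(), name='zipTestFiles',
                    iterfield=['input_file'])
zip_files.inputs.input_file = ['/home/bvanlew/tmp/EtaOin0.txt',
                               '/home/bvanlew/tmp/EtaOin1.txt']

wf = Workflow(name='TestZip', base_dir='/tmp')
wf.add_nodes([zip_files])
wf.run()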

My assumption is that gzip's deletion of the input file is the issue here. As an experiment, substituting the gzip command with a shell script that simply copies the input <file> to <file>.gz and leaves the input file intact did not trigger a re-execution.

Could this be related to the hash_method checks? Is there any way to disable the checking for a particular node or input?

Cheers
Baldur

Chris Filo Gorgolewski

Dec 11, 2012, 1:43:36 PM
to nipy...@googlegroups.com
The easiest way to check this is to set the logging level to debug and "stop_on_first_rerun" to true in the config. This will report why a rerun would be necessary.
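
For example, something along these lines (a sketch using the standard config interface; the options live in the 'logging' and 'execution' sections):

from nipype import config, logging

# log hashing/rerun decisions and stop as soon as a rerun is attempted
config.set('logging', 'workflow_level', 'DEBUG')
config.set('execution', 'stop_on_first_rerun', 'true')
logging.update_logging(config)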

Best,
Chris



Baldur

Dec 12, 2012, 8:45:54 AM
to nipy...@googlegroups.com
I reduced the problem in size and the pypeline.log dump below shows the problem (sorry, I can't upload the files from work - I get a 340 error due to firewall restrictions).

The key to the failure is executing the MapNode with 2 or more files and using the Condor plugin (I hadn't seen this dependency yesterday).

The following configurations give the results noted (they correspond to the information in the log file):

MapNode gzip, 1 file, Condor plugin - succeeds
MapNode gzip, 2 files, Condor plugin - fails
MapNode gzip, 2 files, no plugin - succeeds
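
For reference, the only difference between these runs is the plugin argument passed to run() on the workflow sketched in my first post (illustrative only - plugin_args are omitted):

# Condor plugin - fails as soon as the MapNode has two input files
wf.run(plugin='Condor')

# no plugin (default serial execution) - succeeds with two input files
wf.run()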

The crucial part of the log does indeed indicate that the hash is playing a role - after the subnodes (_zipTestFiles0 and _zipTestFiles1) finish, an attempt is made to rerun them and the changed hash is involved:

121212-13:49:24,989 workflow INFO:
[Job finished] jobname: _zipTestFiles0 jobid: 2
121212-13:49:25,26 workflow INFO:
[Job finished] jobname: _zipTestFiles1 jobid: 3
121212-13:49:25,28 workflow INFO:
Submitting 1 jobs
121212-13:49:25,28 workflow INFO:
Executing: zipTestFiles ID: 1
121212-13:49:44,457 workflow DEBUG:
networkx 1.4 dev or higher detected
121212-13:49:44,460 workflow INFO:
Executing node zipTestFiles in dir: /tmp/tmpEJPykN/TestZip/zipTestFiles
121212-13:49:44,461 workflow DEBUG:
setting hashinput input_file-> ['/home/bvanlew/tmp/EtaOin0.txt', '/home/bvanlew/tmp/EtaOin1.txt']
121212-13:49:44,461 workflow DEBUG:
Node hash: c59d2151caedebee82b7a10d7c931023
121212-13:49:44,461 workflow DEBUG:
/tmp/tmpEJPykN/TestZip/zipTestFiles/_0xc59d2151caedebee82b7a10d7c931023_unfinished.json found and can_resume is True or Node is a MapNode - resuming execution
121212-13:49:44,462 workflow DEBUG:
writing pre-exec report to /tmp/tmpEJPykN/TestZip/zipTestFiles/_report/report.rst
121212-13:49:44,463 workflow DEBUG:
setting input 0 input_file /home/bvanlew/tmp/EtaOin0.txt
121212-13:49:44,464 workflow DEBUG:
Setting node inputs
121212-13:49:44,464 workflow INFO:
Executing node _zipTestFiles0 in dir: /tmp/tmpEJPykN/TestZip/zipTestFiles/mapflow/_zipTestFiles0
121212-13:49:44,464 workflow DEBUG:
Node hash: 084859a03d2d2df961346df37dc8e50c
121212-13:49:44,465 workflow DEBUG:
Rerunning node
121212-13:49:44,465 workflow DEBUG:
updatehash = False, self.overwrite = None, self._interface.always_run = False, os.path.exists(/tmp/tmpEJPykN/TestZip/zipTestFiles/mapflow/_zipTestFiles0/_0x084859a03d2d2df961346df37dc8e50c.json) = False, hash_method = timestamp
121212-13:49:44,465 workflow DEBUG:
Previous node hash = a754c7dec9a6c06f2f1f53df81206c50
121212-13:49:44,466 workflow DEBUG:
 values differ in fields: input_file: '/home/bvanlew/tmp/EtaOin0.txt' != [u'/home/bvanlew/tmp/EtaOin0.txt', u'2452a4cf598b13d7c663fa6c7375c763']
121212-13:49:45,403 workflow ERROR:
['Node zipTestFiles failed to run on host lkeb-gisela01.']

Cheers
Baldur

Satrajit Ghosh

Dec 12, 2012, 8:49:28 AM
to nipy-user
hi baldur,

can you set your workflow working directory to a shared location instead of /tmp and try? also, is it possible for you to use the current dev version - many changes have been made to improve stability in the latest code.

cheers,

satra


Baldur

Dec 12, 2012, 9:03:59 AM
to nipy...@googlegroups.com
In this case the condor workers run on cores of the local machine, so I don't think /tmp is a problem - but I can try something else if you really think it makes a difference.

My nipype version is '0.7.0.g4480f17-dev'.

Cheers
Baldur

Satrajit Ghosh

Dec 12, 2012, 9:12:55 AM
to nipy-user
hi,

in that case, it's not going to make a difference.

if the contents of /home/bvanlew/tmp/EtaOin0.txt are not changing, simply change the hashing method to 'content' and see if it clears the problem. 
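
e.g. (a sketch - 'hash_method' lives in the 'execution' section of the nipype config, and 'wf' below stands for your workflow object):

from nipype import config

# switch from timestamp-based to content-based hashing globally
config.set('execution', 'hash_method', 'content')

# or set it just for this workflow
wf.config['execution'] = {'hash_method': 'content'}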

if this does resolve the problem then some part of the code is modifying file access times.

cheers,

satra


Baldur

Dec 12, 2012, 9:58:47 AM
to nipy...@googlegroups.com
Well that is perhaps the problem - the zip subnode (a gzip) actually removes the input file. I'll try the hashing method change...

Cheers
Baldur