Interface that removes a MapNode input file triggers node re-execution


Baldur

Dec 11, 2012, 1:20:53 PM
to nipy...@googlegroups.com
With a MapNode wrapping an interface copied from Chris's gzip example (http://nipy.sourceforge.net/nipype/0.5.3/devel/cmd_interface_devel.html), execution is triggered a second time after the sub-nodes have run successfully. The result is first a gzip run that produces the expected .gz files, followed by a second gzip run that fails because the input files are now missing. The problem only appears once the MapNode has two or more inputs.
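
Roughly, the setup looks like this (a sketch, not my exact code - the interface follows the devel-guide gzip example in spirit, and the paths are illustrative):

from nipype.interfaces.base import (CommandLine, CommandLineInputSpec,
                                    TraitedSpec, File)
from nipype.pipeline.engine import MapNode, Workflow


class GZipInputSpec(CommandLineInputSpec):
    input_file = File(desc='file to compress', exists=True,
                      mandatory=True, argstr='%s')


class GZipOutputSpec(TraitedSpec):
    output_file = File(desc='compressed output file')


class GZipTask(CommandLine):
    input_spec = GZipInputSpec
    output_spec = GZipOutputSpec
    _cmd = 'gzip'

    def _list_outputs(self):
        # gzip writes <file>.gz next to the input (and removes the input)
        outputs = self.output_spec().get()
        outputs['output_file'] = self.inputs.input_file + '.gz'
        return outputs


# the problem only appears with two or more inputs to the MapNode
zip_files = MapNode(GZipTask(), name='zipTestFiles',
                    iterfield=['input_file'])
zip_files.inputs.input_file = ['/home/bvanlew/tmp/EtaOin0.txt',
                               '/home/bvanlew/tmp/EtaOin1.txt']

wf = Workflow(name='TestZip', base_dir='/tmp')
wf.add_nodes([zip_files])
wf.run()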

My assumption is that gzip's deletion of the input file is the issue here. As an experiment, substituting the gzip command with a shell script that simply copies the input <file> to <file>.gz and leaves the input file intact did not trigger a re-execution.

Could this be related to the hash_method checks? Is there any way to disable the checking for a particular node or input?

Cheers
Baldur

Chris Filo Gorgolewski

Dec 11, 2012, 1:43:36 PM
to nipy...@googlegroups.com
The easiest way to check this is to set the logging level to debug and "stop_on_first_rerun" to true in the config. This will report why a rerun would be necessary.
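
For example, something along these lines (a sketch using the standard config interface; the options live in the 'logging' and 'execution' sections):

from nipype import config, logging

# log hashing/rerun decisions and stop as soon as a rerun is attempted
config.set('logging', 'workflow_level', 'DEBUG')
config.set('execution', 'stop_on_first_rerun', 'true')
logging.update_logging(config)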

Best,
Chris



Baldur

Dec 12, 2012, 8:45:54 AM
to nipy...@googlegroups.com
I reduced the problem in size and the pypeline.log dump below shows the problem (sorry, I can't upload the files from work - I get a 340 error due to firewall restrictions).

The key to the failure is executing the MapNode with 2 or more files and using the Condor plugin (I hadn't seen this dependency yesterday).

The following configurations give the results noted (they correspond to the information in the log file):

MapNode gzip, 1 file, Condor plugin - succeeds
MapNode gzip, 2 files, Condor plugin - fails
MapNode gzip, 2 files, no plugin - succeeds
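
For reference, the only difference between these runs is the plugin argument passed to run() on the workflow sketched in my first post (illustrative only - plugin_args are omitted):

# Condor plugin - fails as soon as the MapNode has two input files
wf.run(plugin='Condor')

# no plugin (default serial execution) - succeeds with two input files
wf.run()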

The crucial part of the log does indeed indicate that the hash is playing a role - after the subnodes (_zipTestFiles0 and _zipTestFiles1) finish, an attempt is made to rerun them and the changed hash is involved:

121212-13:49:24,989 workflow INFO:
[Job finished] jobname: _zipTestFiles0 jobid: 2
121212-13:49:25,26 workflow INFO:
[Job finished] jobname: _zipTestFiles1 jobid: 3
121212-13:49:25,28 workflow INFO:
Submitting 1 jobs
121212-13:49:25,28 workflow INFO:
Executing: zipTestFiles ID: 1
121212-13:49:44,457 workflow DEBUG:
networkx 1.4 dev or higher detected
121212-13:49:44,460 workflow INFO:
Executing node zipTestFiles in dir: /tmp/tmpEJPykN/TestZip/zipTestFiles
121212-13:49:44,461 workflow DEBUG:
setting hashinput input_file-> ['/home/bvanlew/tmp/EtaOin0.txt', '/home/bvanlew/tmp/EtaOin1.txt']
121212-13:49:44,461 workflow DEBUG:
Node hash: c59d2151caedebee82b7a10d7c931023
121212-13:49:44,461 workflow DEBUG:
/tmp/tmpEJPykN/TestZip/zipTestFiles/_0xc59d2151caedebee82b7a10d7c931023_unfinished.json found and can_resume is True or Node is a MapNode - resuming execution
121212-13:49:44,462 workflow DEBUG:
writing pre-exec report to /tmp/tmpEJPykN/TestZip/zipTestFiles/_report/report.rst
121212-13:49:44,463 workflow DEBUG:
setting input 0 input_file /home/bvanlew/tmp/EtaOin0.txt
121212-13:49:44,464 workflow DEBUG:
Setting node inputs
121212-13:49:44,464 workflow INFO:
Executing node _zipTestFiles0 in dir: /tmp/tmpEJPykN/TestZip/zipTestFiles/mapflow/_zipTestFiles0
121212-13:49:44,464 workflow DEBUG:
Node hash: 084859a03d2d2df961346df37dc8e50c
121212-13:49:44,465 workflow DEBUG:
Rerunning node
121212-13:49:44,465 workflow DEBUG:
updatehash = False, self.overwrite = None, self._interface.always_run = False, os.path.exists(/tmp/tmpEJPykN/TestZip/zipTestFiles/mapflow/_zipTestFiles0/_0x084859a03d2d2df961346df37dc8e50c.json) = False, hash_method = timestamp
121212-13:49:44,465 workflow DEBUG:
Previous node hash = a754c7dec9a6c06f2f1f53df81206c50
121212-13:49:44,466 workflow DEBUG:
 values differ in fields: input_file: '/home/bvanlew/tmp/EtaOin0.txt' != [u'/home/bvanlew/tmp/EtaOin0.txt', u'2452a4cf598b13d7c663fa6c7375c763']
121212-13:49:45,403 workflow ERROR:
['Node zipTestFiles failed to run on host lkeb-gisela01.']

Cheers
Baldur

Satrajit Ghosh

Dec 12, 2012, 8:49:28 AM
to nipy-user
hi baldur,

can you set your workflow working directory to a shared location instead of /tmp and try? also, is it possible for you to use the current dev version - many changes have been made to improve stability in the latest code.

cheers,

satra


Baldur

Dec 12, 2012, 9:03:59 AM
to nipy...@googlegroups.com
In this case the condor workers run on cores of the local machine, so I don't think /tmp is a problem - but I can try something else if you really think it makes a difference.

My nipype version is '0.7.0.g4480f17-dev'.

Cheers
Baldur

Satrajit Ghosh

Dec 12, 2012, 9:12:55 AM
to nipy-user
hi,

in that case, it's not going to make a difference.

if the contents of /home/bvanlew/tmp/EtaOin0.txt are not changing, simply change the hashing method to 'content' and see if it clears the problem. 
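
e.g. (a sketch - 'hash_method' lives in the 'execution' section of the nipype config, and 'wf' below stands for your workflow object):

from nipype import config

# switch from timestamp-based to content-based hashing globally
config.set('execution', 'hash_method', 'content')

# or set it just for this workflow
wf.config['execution'] = {'hash_method': 'content'}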

if this does resolve the problem then some part of the code is modifying file access times.

cheers,

satra


Baldur

Dec 12, 2012, 9:58:47 AM
to nipy...@googlegroups.com
Well that is perhaps the problem - the zip subnode (a gzip) actually removes the input file. I'll try the hashing method change...

Cheers
Baldur