ArrayCombiner throws error in remote execution mode that _index file is missing

Christian Frech

Apr 23, 2015, 11:54:29 AM
to andur...@googlegroups.com
When I run the ArrayCombiner component in local mode, everything works fine:

[...]
Clearing directory /mnt/synology/data/christian/iamp/results/current/anduril/gsnap/execute/bamCounts-_ArrayCombiner1_array1_array1
Executing bamCounts-_ArrayCombiner1_array1_array1 (ArrayConstructor)
Component bamCounts-_ArrayCombiner1_array1_array1 finished with success; READY queue: bamCounts-_ArrayCombiner2_array1_combiner2
[...]

However, when I execute the same script in remote execution mode, I get the following error:

[...]
Clearing directory /mnt/synology/data/christian/iamp/results/current/anduril/gsnap/execute/bamCounts-_ArrayCombiner1_array1_array1

Executing bamCounts-_ArrayCombiner1_array1_array1 (ArrayConstructor) on biohazard
[bamCounts-_ArrayCombiner1_array1_array1] ssh -p 81 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o BatchMode=yes -l anduril biohazard /mnt/synology/data/christian/iamp/results/current/anduril/run-docker.sh anduril-remote launch /mnt/synology/data/christian/iamp/results/current/anduril/gsnap/execute/bamCounts-_ArrayCombiner1_array1_array1 /usr/local/share/anduril/builtin/components/ArrayConstructor "$( echo amF2YSAtWG14MjAwMG0gLWNwIDovb3B0L21va3Npc2thYW4vZGIvZXRjOi9vcHQvaGliZXJuYXRlL2xpYi9qcGEvaGliZXJuYXRlLWVudGl0eW1hbmFnZXItNC4zLjUuRmluYWwuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9yZXF1aXJlZC9oaWJlcm5hdGUtanBhLTIuMS1hcGktMS4wLjAuRmluYWwuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9yZXF1aXJlZC9kb200ai0xLjYuMS5qYXI6L29wdC9oaWJlcm5hdGUvbGliL3JlcXVpcmVkL2pib3NzLXRyYW5zYWN0aW9uLWFwaV8xLjJfc3BlYy0xLjAuMC5GaW5hbC5qYXI6L29wdC9oaWJlcm5hdGUvbGliL3JlcXVpcmVkL2pib3NzLWxvZ2dpbmctYW5ub3RhdGlvbnMtMS4yLjAuQmV0YTEuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9yZXF1aXJlZC9oaWJlcm5hdGUtY29yZS00LjMuNS5GaW5hbC5qYXI6L29wdC9oaWJlcm5hdGUvbGliL3JlcXVpcmVkL2phdmFzc2lzdC0zLjE4LjEtR0EuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9yZXF1aXJlZC9qYm9zcy1sb2dnaW5nLTMuMS4zLkdBLmphcjovb3B0L2hpYmVybmF0ZS9saWIvcmVxdWlyZWQvamFuZGV4LTEuMS4wLkZpbmFsLmphcjovb3B0L2hpYmVybmF0ZS9saWIvcmVxdWlyZWQvYW50bHItMi43LjcuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9yZXF1aXJlZC9oaWJlcm5hdGUtY29tbW9ucy1hbm5vdGF0aW9ucy00LjAuNC5GaW5hbC5qYXI6L29wdC9oaWJlcm5hdGUvbGliL29wdGlvbmFsL2MzcDAvaGliZXJuYXRlLWMzcDAtNC4zLjUuRmluYWwuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9vcHRpb25hbC9jM3AwL21jaGFuZ2UtY29tbW9ucy1qYXZhLTAuMi4zLjQuamFyOi9vcHQvaGliZXJuYXRlL2xpYi9vcHRpb25hbC9jM3AwL2MzcDAtMC45LjIuMS5qYXI6L3Vzci9sb2NhbC9zaGFyZS9hbmR1cmlsL2FuZHVyaWwuamFyOi46L3Vzci9sb2NhbC9zaGFyZS9hbmR1cmlsL21pY3JvYXJyYXkvbGliL2phdmE6L3Vzci9sb2NhbC9zaGFyZS9hbmR1cmlsLWJ1bmRsZXMvbW9rc2lza2Fhbi9saWIvamF2YTovdXNyL2xvY2FsL3NoYXJlL2FuZHVyaWwtYnVuZGxlcy9zZXF1ZW5jaW5nL2xpYi9qYXZhOi91c3IvbG9jYWwvc2hhcmUvYW5kdXJpbC9hbmR1cmlsLmphcjouIEFycmF5Q29uc3RydWN0b3IgL21udC9zeW5vbG9neS9kYXRhL2NocmlzdGlhbi9pYW1wL3Jlc3VsdHMvY3VycmVudC9hbmR1cmlsL2
dzbmFwL2V4ZWN1dGUvYmFtQ291bnRzLV9BcnJheUNvbWJpbmVyMV9hcnJheTFfYXJyYXkxL19jb21tYW5k | base64 -i --decode )" ''
[ERROR] Component bamCounts-_ArrayCombiner1_array1_array1: /mnt/synology/data/christian/iamp/results/current/anduril/gsnap/execute/bamCounts-_ArrayCombiner1_array1_array1/array/_index (No such file or directory)
[...]

The funny thing is that after executing the script all the expected output files are there (including the _index file), so the component was definitely executed correctly! I suppose it has to do with network latency, because in remote execution mode I am writing to a CIFS file system. However, increasing --nfs-timeout in the anduril run command does not fix the issue. My guess is that the nfs-timeout is not honored in this particular case?

I am using Anduril version 1.2.23.

Christian Frech

Apr 23, 2015, 12:09:34 PM
to andur...@googlegroups.com
I forgot to add the code snippet that produces this error (HTSeqBam2Counts):

bamCounts = HTSeqBam2Counts(alignments    = alignedBAMs,
                            annotationGTF = gtf,
                            dexseq_dir    = "/opt/DEXSeq/python_scripts/",
                            entity        = "Gene",
                            format        = "bam",
                            sorted        = true,
                            @host         = "auto")

Ville Rantanen

Apr 23, 2015, 12:30:41 PM
to andur...@googlegroups.com
This definitely sounds like the NFS issue that was fixed by adding that timeout. I would have to investigate in which cases the timeout is not honored.
In my earlier experiments, CIFS didn't have that problem, only NFS.

Maybe you can double-check the folder mappings in hosts.conf for the remote executor. I believe that might also be a possible source of the error.


Ville Rantanen

Apr 23, 2015, 1:37:10 PM
to andur...@googlegroups.com
Okay. The NFS latency check is performed in every execution mode except local, so it should be honored here.

However, if the latency check had produced the error, the error message would read:
"Output file is missing for port %s and execDir %s", which is not the case here.

So, one component is not seeing the output of the PREVIOUS component.
Somehow the ArrayCombiner and ArrayConstructor are not seeing each other's files. Maybe they run on different nodes, and there is lag.
This is a little difficult to solve within Anduril, since the command is sent over SSH and the component code has no handling for such events.

I have created a workaround that uses prefix scripts for a similar problem with NFS, but I'm not sure whether it is applicable here.

In our case these errors happen when machine A is used to dispatch processes. After a component instance has run on node B, A and B both agree that the files exist and everything is okay. Then A sends the next process to node C, which is lagging for one reason or another and doesn't see the files. Node A, where Anduril is running, has no way of knowing that C doesn't see the files yet.
My workaround reads the _command file and keeps polling for each of the inputs mentioned in it until they are accessible or a timeout is exceeded. This is not possible in remote execution mode, but with prefix scripts it is.
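
The polling idea described above could look roughly like the sketch below. It is only an illustration, not the actual prefix script: the function name wait_for_inputs, the one-second polling interval, and passing the input paths as arguments (instead of parsing them from the _command file, whose format is not shown in this thread) are all assumptions.

```shell
# Hypothetical helper (not part of Anduril): wait until every given path
# exists, or give up once a shared timeout has passed.
# Usage: wait_for_inputs TIMEOUT_SECONDS PATH [PATH...]
wait_for_inputs() {
    # Absolute deadline in seconds since the epoch, shared by all paths.
    wfi_deadline=$(( $(date +%s) + $1 ))
    shift
    for wfi_path in "$@"; do
        # Poll once per second until the path becomes visible on this node.
        until [ -e "$wfi_path" ]; do
            if [ "$(date +%s)" -ge "$wfi_deadline" ]; then
                echo "Timed out waiting for $wfi_path" >&2
                return 1
            fi
            sleep 1
        done
    done
    return 0
}
```

A prefix script might call something like this with the input paths of the component before starting the real command, and abort if it returns a non-zero status.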

Christian Frech

Apr 24, 2015, 3:15:16 AM
to andur...@googlegroups.com
The folder mappings are definitely OK, because all output files are generated at the right location, both in local and remote execution mode.

Christian Frech

Apr 24, 2015, 3:25:49 AM
to andur...@googlegroups.com
Unfortunately the prefix workaround is not an option for me, because I need remote execution mode. Maybe the problem has to do with implementation specifics of HTSeqBam2Counts?

Here is the last line of the function HTSeqBam2Counts/function.and, which I suspect to be the culprit:

return record(exon=ArrayCombiner(exonCounts),gene=ArrayCombiner(counts),force annotation=annotationF)


Lauri Lyly

Apr 24, 2015, 9:57:51 AM
to andur...@googlegroups.com
Would it make any sense to add an adjustable post-execution delay to the remote execution script (bin/anduril-remote) so that it would be possible to set it via an environment variable? Simply for debugging cases like this.

And, if you run a second time, does it work then, since the output now should be there? Then you simply need to repeat the Anduril runs on this type of error while we find an explanation...

I cannot immediately see anything that could be changed about the line you posted.
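
As a rough sketch of the two stopgaps mentioned above (a configurable post-execution delay and simply repeating the run), something like the following could be used. Everything here is hypothetical: the function names retry and post_exec_delay and the environment variable POST_EXEC_DELAY are made up for illustration and are not existing features of bin/anduril-remote or of Anduril.

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times until it succeeds.
# Usage: retry MAX_ATTEMPTS COMMAND [ARG...]
retry() {
    retry_max=$1
    shift
    retry_n=1
    until "$@"; do
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "Giving up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$(( retry_n + 1 ))
        echo "Attempt failed, retrying ($retry_n/$retry_max)" >&2
    done
    return 0
}

# Hypothetical helper: sleep for POST_EXEC_DELAY seconds (default 0, i.e. no
# delay). POST_EXEC_DELAY is an assumed variable name, not an Anduril setting.
post_exec_delay() {
    sleep "${POST_EXEC_DELAY:-0}"
}
```

The retry function would be called as "retry 3" followed by the usual anduril run command line, so that a run failing on a transient missing-file error is simply repeated.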


Christian Frech

Apr 29, 2015, 1:32:14 PM
to andur...@googlegroups.com
Yes, a second run usually fixes the problem. What's strange is that I now see the same behaviour in slurm execution mode. Not sure what's going on here; maybe there is an issue with our shared file system.

Christian Frech

May 4, 2015, 2:44:48 AM
to andur...@googlegroups.com
Ok, after some further analysis I can confirm that this is definitely a shared file system-related caching issue that has nothing to do with Anduril. I can avoid this error by simply doing an 'ls' on the execution directory after coming back from a slurm or remote execution call, presumably because this forces some reload of the directory structure/content. I'm still not sure what is causing it though.
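
A minimal sketch of that 'ls' workaround, wrapped in a small function, might look like this. The function name refresh_and_check is an assumption made for illustration, and whether a directory listing actually refreshes the cached view depends on the file system and its mount options.

```shell
# Hypothetical helper: list a directory so the client re-reads its contents,
# then report whether an expected file is visible.
# Usage: refresh_and_check DIRECTORY FILE_RELATIVE_TO_DIRECTORY
refresh_and_check() {
    # The listing output itself is irrelevant; only the side effect matters.
    ls -la "$1" >/dev/null 2>&1
    # Exit status 0 if the expected file is now visible, 1 otherwise.
    [ -e "$1/$2" ]
}
```

For the error in this thread, it would be called with the component's execution directory as the first argument and array/_index as the second.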

Ville Rantanen

May 4, 2015, 3:33:03 AM
to andur...@googlegroups.com
Good that you found this one out! 

If you want to appear 'smart', you can run this command in the folder:  ;)
(set $FOLDER to your execution instance folder)

find "$FOLDER" -noleaf -print0 | xargs --null stat 2>&1 >/dev/null

What 'ls' does is request the stat of each file, to find modification dates etc. for sorting. On some file systems this might trigger the actual syncing of the file.
With the 'find' command, you can be sure that any subfolder structure is synced too.

 


