"Terminated for an unknown reason" on a local filesystem using a the local executor

ni...@fulcrumgenomics.com

unread,
Mar 27, 2017, 5:33:05 PM3/27/17
to Nextflow
Hey Nextflow developers,

I am trying to resolve an error when running Nextflow.  Initially I thought it was due to NFS latency, but running the workflow with local execution on an ext2/ext3 file system still gives the same error.  I am able to run the failed command manually and it completes successfully with exit code 0.

When I trace the execution I get:
...
Mar-24 15:22:30.501 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Mar-24 15:22:31.108 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: local
Mar-24 15:22:31.109 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'local'
Mar-24 15:22:31.118 [main] DEBUG nextflow.executor.Executor - Initializing executor: local
Mar-24 15:22:31.120 [main] INFO  nextflow.executor.Executor - [warm up] executor > local
Mar-24 15:22:31.127 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=8; memory=24 GB; capacity=56; pollInterval=100ms; dumpInterval=5m
Mar-24 15:22:31.134 [main] DEBUG nextflow.processor.TaskDispatcher - Starting monitor: LocalPollingMonitor
Mar-24 15:22:31.135 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: local)
...
Mar-24 15:30:17.617 [Actor Thread 2] TRACE nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -c /bin/bash -ue .command.run &> /tmp/../work/d7/6090cb2bbc51d866b0b8c5b73a910b/.command.log
Mar-24 15:30:17.625 [Thread-1] TRACE nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: id: 13; name: some_task_name (some_task_tag); type: SCRIPTLET; status: 2147483647; error: null; workDirectory: /tmp/../work/13/01e580bac9dd6ad5f33698c462c8fc
  error [nextflow.exception.ProcessFailedException]: Process `some_task_name (some_task_tag)` terminated for an unknown reason -- Likely it has been terminated by the external system
Mar-24 15:30:17.628 [Actor Thread 2] INFO  nextflow.Session - [d7/6090cb] Submitted process > some_other_task (some_task_tag.2)
Mar-24 15:30:17.629 [Actor Thread 2] TRACE nextflow.processor.TaskProcessor - <some_other_task> After run
Mar-24 15:30:17.684 [Thread-1] ERROR nextflow.processor.TaskProcessor - Error executing process > 'some_task_name (some_task_tag)'

Caused by:
  Process `some_task_name (some_task_tag)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  SOME COMMAND THAT EXITS WITH ZERO STATUS

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /tmp/../work/13/01e580bac9dd6ad5f33698c462c8fc

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named: '.command.sh'
nextflow.exception.ProcessFailedException: Process `some_task_name (some_task_tag)` terminated for an unknown reason -- Likely it has been terminated by the external system
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
    at org.codehaus.groovy.reflection.CachedConstructor.doConstructorInvoke(CachedConstructor.java:77)
    at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrap.callConstructor(ConstructorSite.java:84)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:60)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:235)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247)
    at nextflow.processor.TaskProcessor.finalizeTask(TaskProcessor.groovy:1939)
    at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:438)
    at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:394)
    at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:340)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:925)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:908)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:76)
    at nextflow.processor.TaskPollingMonitor$_start_closure6.doCall(TaskPollingMonitor.groovy:319)
    at nextflow.processor.TaskPollingMonitor$_start_closure6.call(TaskPollingMonitor.groovy)
    at groovy.lang.Closure.run(Closure.java:495)
    at java.lang.Thread.run(Thread.java:745)


The exit code file (".exitcode") contains "0" for the failed process.  Interestingly, the .command.err and .command.log files have the following:
.command.run.1: line 104: 28016 Terminated              nxf_trace "$pid" .command.trace

I am running with the trace enabled.  Again, this is all on an ext2/ext3 file system with local execution, while other errors posted here and on GitHub involved either NFS or a non-local (grid) scheduler.

Thanks for any help you could offer.


Paolo Di Tommaso

unread,
Mar 27, 2017, 5:55:51 PM3/27/17
to nextflow
Are these tasks executed using Docker?

However, you removed some relevant parts from the log file. If you could share the original one, anonymizing any sensitive data, it would help.


p


ni...@fulcrumgenomics.com

unread,
Mar 27, 2017, 6:07:26 PM3/27/17
to Nextflow
Docker is not used here.

I will reply privately with the log as it is 8000+ lines long and would take too long to redact every last detail.

Paolo Di Tommaso

unread,
Mar 27, 2017, 6:41:01 PM3/27/17
to nextflow
The log file reports that you have 500 GB of RAM (that's a lot) but only 20 GB of free memory when the pipeline is launched, which is not so much. That looks a bit weird to me.


I will also try to understand why that job is crashing. Is the error deterministic? Also, try executing just that task without using NF; you can do that by changing to the directory reported in the error message and running `bash .command.run`.

Does the command execute successfully?


p


Nils Homer

unread,
Mar 27, 2017, 7:00:05 PM3/27/17
to Nextflow
1) It's a shared computer, with the Nextflow config for the executor:
executor {
    name   = 'local'
    cpus   = 8
    memory = '24GB'
}
That seems reasonable to me.

2) Running 'bash .command.run' returns exit code '0' (echo $?) but prints the "Terminated  nxf_trace "$pid" .command.trace" line as I described previously.

Paolo Di Tommaso

unread,
Mar 27, 2017, 7:07:26 PM3/27/17
to nextflow
The message `Terminated  nxf_trace.. etc` is expected.

But it looks like you are running with the -with-trace or -with-timeline command line options. Can you try to execute it without them?
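
If the reports are being switched on from nextflow.config rather than on the command line, here is a minimal sketch of how they could be turned off there instead (assuming a recent Nextflow version):

// hypothetical nextflow.config snippet: disable the trace and timeline reports
trace {
    enabled = false
}
timeline {
    enabled = false
}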

Also, can you share the process definition of the `check_illumina_directory` process ? 


p



ni...@fulcrumgenomics.com

unread,
Mar 27, 2017, 7:25:52 PM3/27/17
to Nextflow
process check_illumina_directory {
    tag { run_name }

    cpus 1
    memory '4GB'
    time '5m'

    input:
    val run_info_cid
    file params.run_folder
    file tmp_dir
    val lane from lanes_cid

    shell:
    """
    java -Xmx4g -Djava.io.tmpdir=${tmp_dir} -jar ${picard_jar} \
    CheckIlluminaDirectory \
    BASECALLS_DIR=${params.run_folder}/Data/Intensities/BaseCalls \
    LANES=${lane} \
    READ_STRUCTURE=${run_info_cid.read_structure}
    """
}

While the command itself takes less than 10 seconds to execute, does Nextflow count the process management as part of the wall-clock time?  Perhaps the "5m" is tripping it up?

Here's the `.command.sh` too:
java -Xmx4g -Djava.io.tmpdir=/path/to/tmp -jar /path/to/picard.jar     CheckIlluminaDirectory     BASECALLS_DIR=/path/to/Data/Intensities/BaseCalls     LANES=1     READ_STRUCTURE=50T8B8B50T

I'll also run with neither trace nor timeline to see what happens, though that won't be until at least tomorrow.

Thanks for responding so quickly!

N

Paolo Di Tommaso

unread,
Mar 28, 2017, 2:43:45 AM3/28/17
to nextflow
Yes, it is extremely likely that the timeout is killing the process execution.


p


Nils Homer

unread,
Mar 28, 2017, 11:57:45 AM3/28/17
to Nextflow
The actual command completed successfully, according to the logs, in under 20 seconds, so I don't think the time limit is impacting this process in this particular case.  Can you tell me how the "time" is determined?  Could there be a delay, for example, because the executor doesn't join a thread or write to a log file within the time limit?

Paolo Di Tommaso

unread,
Mar 28, 2017, 12:00:13 PM3/28/17
to nextflow
But have you tried increasing, or even better removing, the `time` directive?
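
As a sketch of that pattern (the process name and command below are hypothetical, only the directives matter), the limit can also be scaled per attempt instead of being a flat '5m':

// hypothetical minimal example: give the task more headroom and retry with a larger limit
process time_limit_sketch {

    time { 1.hour * task.attempt }   // 1h on the first attempt, 2h on the retry
    errorStrategy 'retry'
    maxRetries 1

    script:
    """
    echo "stand-in for the real CheckIlluminaDirectory command"
    """
}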


p

Nils Homer

unread,
Mar 28, 2017, 12:02:52 PM3/28/17
to Nextflow
Yes, I have.  I changed it to 12h and the process completed.  Looking at the log for the process, the tool reports successful completion in under 20 seconds, so it leads me to believe there is a delay between the tool completing and Nextflow measuring the completion time, which to me is a bug, especially when the difference is on the order of minutes (not using NFS for the tmp or work dirs).  Should I open a GitHub issue?

Paolo Di Tommaso

unread,
Mar 28, 2017, 12:05:12 PM3/28/17
to nextflow
Yes please, open a GH issue. I think there's a problem with how the elapsed time is measured when using the local executor.



Nils Homer

unread,
Mar 28, 2017, 12:41:25 PM3/28/17
to Nextflow
Thanks, Paolo, for all your help tracking this down. If you can point me to where the issue lies in the code, I may be able to submit a patch.

I am linking the GitHub issue here in case anyone else in the community has this problem: https://github.com/nextflow-io/nextflow/issues/313

Kemin Zhou

unread,
Apr 23, 2019, 5:47:06 PM4/23/19
to Nextflow
I am seeing the same error message.  I have a long-running pipeline that runs through SLURM.
The processes die randomly and I could not figure out any reason.
I did not specify a time limit; could that be the reason?

If I resume, it dies at a different place.
I am not sure whether SLURM killed one of my processes or not; otherwise, these processes finish without any problem.
I have the same pipeline written in another language without any problem, and it has been used for several years.
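
One thing I am considering, as a hypothetical nextflow.config sketch (the limits below are made up), is making the SLURM limits explicit so jobs are not cut off by a scheduler default I cannot see, and retrying transient failures:

// hypothetical defaults applied to every process
process {
    executor      = 'slurm'
    time          = '48h'
    memory        = '16 GB'
    errorStrategy = 'retry'
    maxRetries    = 1
}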

Kemin Zhou

unread,
Apr 24, 2019, 2:17:55 PM4/24/19
to Nextflow
I have this long-running pipeline with the latest Nextflow:
[12/4669b1] process > lonjob [100%] 8 of 8, cached: 7, failed: 1
It does not show which of the 8 failed. Other processes fail with -- Likely it has been terminated by the external system
I was reading through the lengthy log file, and it did not give any clue.
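
One thing I may try, as a hypothetical nextflow.config sketch (the file name and field list are my own choice), is writing a trace report so the failed task's name, exit status and work dir can be looked up afterwards:

// hypothetical trace settings in nextflow.config
trace {
    enabled = true
    file    = 'pipeline_trace.txt'
    fields  = 'task_id,name,status,exit,workdir'
}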


--
Kemin Zhou

Kemin Zhou

unread,
Apr 24, 2019, 2:57:00 PM4/24/19
to Nextflow
I checked the program: the job completed successfully according to its own log file, but Nextflow reported a failure
and killed the other running jobs.

Kemin Zhou

unread,
May 2, 2019, 2:39:12 PM5/2/19
to Nextflow
I rewrote the same program to be multi-threaded; it is now 2-3x faster, and the problem went away. Maybe it just took too long.