OSError: [Errno 24] Too many open files


davidmich...@gmail.com

Sep 2, 2016, 4:47:26 AM
to fireworkflows
Hi,

I ran into a problem with too many open files during the execution of big workflows (several thousand fireworks). The system is currently under development, so it is not a problem yet, but it will be, because the final system is intended to run many workflows of several thousand fireworks each on a dedicated cluster.

Below you can find the stored state of a simple ScriptTask that FIZZLED and the call stack that led to the problem.

My system limit was 1024 open files (ulimit -n). I have increased this limit to 4096 and everything runs fine now, but I wonder how I can reduce the number of files opened by FireWorks:

  • Should I reduce the number of fireworks and increase the number of tasks inside them, or something like that?
  • Run the ScriptTask with use_shell = False?
  • ... any other advice will be appreciated :-) (see the small sketch after this list for checking the limit from Python)
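
For reference, here is a minimal sketch of how the limit can be checked (and raised, up to the hard limit) from Python with the standard resource module; the 4096 here is just the value I used with ulimit:

import resource

# query the current soft/hard limits on open file descriptors (ulimit -n)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# raise the soft limit for this process only, staying within the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))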

And last but not least : thanks a lot for the good work and for this very nice tool that makes my life easier :-)

Cheers,
David



"stored_data": {
  • -
    "_exception": {
    • "_details": null,
    • "_failed_task_n": 0,
    • "_stacktrace": "Traceback (most recent call last):\n File \"/usr/lib/python3.4/site-packages/fireworks/core/rocket.py\", line 211, in run\n m_action = t.run_task(my_spec)\n File \"/usr/lib/python3.4/site-packages/fireworks/user_objects/firetasks/script_task.py\", line 37, in run_task\n return self._run_task_internal(fw_spec, stdin)\n File \"/usr/lib/python3.4/site-packages/fireworks/user_objects/firetasks/script_task.py\", line 48, in _run_task_internal\n shell=self.use_shell)\n File \"/usr/lib64/python3.4/subprocess.py\", line 859, in __init__\n restore_signals, start_new_session)\n File \"/usr/lib64/python3.4/subprocess.py\", line 1359, in _execute_child\n errpipe_read, errpipe_write = os.pipe()\nOSError: [Errno 24] Too many open files\n"
    },
  • "_message": "runtime error during task",
  • -
    "_task": {
    • "_fw_name": "ScriptTask",
    • -
      "script": [
      • "echo \"ending correl_S2 workflow\""
      ],
    • "use_shell": true
    }
},

Anubhav Jain

Sep 12, 2016, 8:28:24 PM
to fireworkflows
Hi David

Sorry for the late reply; somehow I did not get the latest FWS tickets in my email.

Can you tell me some details of your script? I want to be sure that the issue with too many open files pertains to FireWorks and not to what is going on inside the script (e.g., if the script is opening a file and not closing it). For example, see this:



Best,
Anubhav

davidmich...@gmail.com

Sep 13, 2016, 4:17:35 AM
to fireworkflows
Hi Anubhav,

I don't think that is my case: I do not open files myself, I just do a lot of file movement with some FileTransferTasks. I use some subprocesses, but not with Popen, as I don't need the output, just the return code.

I use the older high-level API's call() function (https://docs.python.org/3/library/subprocess.html#older-high-level-api). I guess it closes its file descriptors after execution...
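
To be concrete, my calls look roughly like this (the command here is just a placeholder, not the real one):

import subprocess

# run the command and keep only the return code; call() waits for
# completion, so the pipes it creates are released when it returns
ret = subprocess.call(["mv", "input.dat", "archive/input.dat"])
if ret != 0:
    raise RuntimeError("command failed with return code %d" % ret)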

The script that caused the error (too many open files) is just a dummy ScriptTask echoing that the workflow execution has completed:

import inspect

from fireworks import Firework, ScriptTask


def get_dummy_end_fw():
    # get caller module name
    caller = inspect.currentframe().f_back
    caller_name = caller.f_globals['__name__']

    # create dummy end fw
    ft = ScriptTask.from_str("echo \"ending %s workflow\"" % caller_name)
    task_name = "dummy end: %s" % caller_name
    fw_end = Firework([ft], name=task_name)
    return fw_end

The only part of the code where I do an explicit open is inside a with construct:

with open(param_path, "w") as paramfile:
    paramfile.write("%s" % param_content)

I'm an experienced programmer (C, C++, Fortran & Perl mainly), but pretty new to Python...

Maybe I have missed something... for example, I use the psycopg2 lib to connect to a PostgreSQL DB.

So when I create a cursor like this inside a function:

cursor = db_connector.cursor()

I suppose it is destroyed at the end of the function, since no reference to this object remains...
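
If relying on garbage collection turns out to be the problem, I could also close it explicitly; psycopg2 cursors can be used as context managers, so something like this (the query is purely illustrative, db_connector is my existing connection):

# close the cursor deterministically instead of waiting for garbage collection
with db_connector.cursor() as cursor:
    cursor.execute("SELECT 1")
    row = cursor.fetchone()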

The number of fireworks across all the workflows in the LaunchPad when it crashed was greater than 1024 (more than 1300, actually).

I will run further tests and check with the lsof command which process opens which files. Not this week, because I am travelling, but next week I will tell you more.

Best,
David

davidmich...@gmail.com

Sep 22, 2016, 6:18:27 AM
to fireworkflows
Hi,

I think I have solved the problem: it was related to the logging system I use. Since several Python interpreter instances are launched during the execution of the workflows, I was reinitializing the logger in each firework. This was OK when running a singleshot, but not in rapidfire mode with a large number of launches, since rapidfire runs several fireworks in the same Python interpreter, opening new sockets for logging on each firework launch... I have now fixed the problem and it runs fine.
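
For the archives, the fix amounts to a guard like the sketch below (the logger name, host and port are specific to my setup and only illustrative): attach the socket handler to the logger only once per interpreter, so repeated launches in the same process do not pile up open sockets.

import logging
import logging.handlers

def get_wf_logger(name="my_workflows", host="localhost",
                  port=logging.handlers.DEFAULT_TCP_LOGGING_PORT):
    # return a process-wide logger, attaching the SocketHandler only once;
    # without this guard, every rapidfire launch added another open socket
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.addHandler(logging.handlers.SocketHandler(host, port))
        logger.setLevel(logging.INFO)
    return logger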

Thank you for your attention and sorry for the inconvenience caused by my irrelevant question.

Best,
David

Anubhav Jain

Sep 22, 2016, 10:19:34 AM
to fireworkflows, davidmich...@gmail.com
OK, no problem. Thanks for updating the list with the answer to what happened.