JIT Compiler(s) seem to fail across multiple nodes

Miguel Rodriguez

Jan 3, 2018, 4:36:10 PM
to fenics-support
I recently built FEniCS 2017.2.0 from source on Berkeley's cluster, Savio. Everything seems to run fine when a job is submitted to one node (24 processes). If I then submit the same job across multiple nodes without clearing the Instant and dijitso caches, the job completes, but with non-fatal errors, as shown below (30 processes):

**************************************************************************
**************************************************************************

Configuration: gcc-openmpi
Running on host n0035.savio2
Time: Wed Jan 3 12:42:19 PST 2018
Directory: /global/home/users/miguelr/python/test/test-fenics-2017.2.0
Using 30 processors across 2 nodes
Memory: 63500 MB per node
[n0035:16098] *** Process received signal ***
[n0035:16098] Signal: Segmentation fault (11)
[n0035:16098] Signal code: Address not mapped (1)
[n0035:16098] Failing at address: 0x3287190
[n0036:14278] *** Process received signal ***
[n0036:14278] Signal: Segmentation fault (11)
[n0036:14278] Signal code: Address not mapped (1)
[n0036:14278] Failing at address: 0x24ba180
[n0036:14278] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x2b34712b45e0]
[n0036:14278] [ 1] /global/software/sl-7.x86_64/modules/langs/python/3.5/lib/python3.5/lib-dynload/_posixsubprocess.so(+0x2045)[0x2b347a575045]
[n0036:14278] [ 2] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyCFunction_Call+0xf9)[0x2b3470e715e9]
[n0036:14278] [ 3] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x8fb5)[0x2b3470ef8bd5]
[n0036:14278] [ 4] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x144b49)[0x2b3470ef9b49]
[n0036:14278] [ 5] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x91d5)[0x2b3470ef8df5]
[n0036:14278] [ 6] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x144b49)[0x2b3470ef9b49]
[n0036:14278] [ 7] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalCodeEx+0x48)[0x2b3470ef9cd8]
[n0036:14278] [ 8] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x9a661)[0x2b3470e4f661]
[n0036:14278] [ 9] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyObject_Call+0x56)[0x2b3470e1c236]
[n0036:14278] [10] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x8377c)[0x2b3470e3877c]
[n0036:14278] [11] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyObject_Call+0x56)[0x2b3470e1c236]
[n0036:14278] [12] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0xd84c3)[0x2b3470e8d4c3]
[n0036:14278] [13] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0xcedaf)[0x2b3470e83daf]
[n0036:14278] [14] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyObject_Call+0x56)[0x2b3470e1c236]
[n0036:14278] [15] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x66f4)[0x2b3470ef6314]
[n0036:14278] [16] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x9546)[0x2b3470ef9166]
[n0036:14278] [17] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x9546)[0x2b3470ef9166]
[n0036:14278] [18] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x144b49)[0x2b3470ef9b49]
[n0036:14278] [19] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalCodeEx+0x48)[0x2b3470ef9cd8]
[n0036:14278] [20] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalCode+0x3b)[0x2b3470ef9d1b]
[n0036:14278] [21] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x137dfe)[0x2b3470eecdfe]
[n0036:14278] [22] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyCFunction_Call+0xf9)[0x2b3470e715e9]
[n0036:14278] [23] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6ba0)[0x2b3470ef67c0]
[n0036:14278] [24] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x144b49)[0x2b3470ef9b49]
[n0036:14278] [25] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x91d5)[0x2b3470ef8df5]
[n0036:14278] [26] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x9546)[0x2b3470ef9166]
[n0036:14278] [27] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x9546)[0x2b3470ef9166]
[n0036:14278] [28] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x9546)[0x2b3470ef9166]
[n0036:14278] [29] /global/software/sl-7.x86_64/modules/langs/python/3.5/bin/../lib/libpython3.5m.so.1.0(+0x144b49)[0x2b3470ef9b49]
[n0036:14278] *** End of error message ***
[n0036:14301] *** Process received signal ***
[n0036:14301] Signal: Segmentation fault (11)
[n0036:14301] Signal code: Address not mapped (1)
[n0036:14301] Failing at address: 0x33e3500
[n0036:14302] *** Process received signal ***
[n0036:14302] Signal: Segmentation fault (11)
[n0036:14302] Signal code: Address not mapped (1)
[n0036:14302] Failing at address: 0x3fedc00
[n0036:14322] *** Process received signal ***
[n0036:14322] Signal: Segmentation fault (11)
[n0036:14322] Signal code: Address not mapped (1)
[n0036:14322] Failing at address: 0x35cd1d0
Process 0: Solving linear variational problem.
Process 9: Solving linear variational problem.
Process 13: Solving linear variational problem.
Process 7: Solving linear variational problem.
Process 12: Solving linear variational problem.
Process 14: Solving linear variational problem.
Process 1: Solving linear variational problem.
Process 11: Solving linear variational problem.
Process 3: Solving linear variational problem.
Process 5: Solving linear variational problem.
Process 2: Solving linear variational problem.
Process 10: Solving linear variational problem.
Process 6: Solving linear variational problem.
Process 4: Solving linear variational problem.
Process 8: Solving linear variational problem.
Process 16: Solving linear variational problem.
Process 20: Solving linear variational problem.
Process 26: Solving linear variational problem.
Process 19: Solving linear variational problem.
Process 21: Solving linear variational problem.
Process 24: Solving linear variational problem.
Process 25: Solving linear variational problem.
Process 28: Solving linear variational problem.
Process 27: Solving linear variational problem.
Process 22: Solving linear variational problem.
Process 17: Solving linear variational problem.
Process 15: Solving linear variational problem.
Process 18: Solving linear variational problem.
Process 23: Solving linear variational problem.
Process 29: Solving linear variational problem.

real 0m8.888s
user 1m19.635s
sys 0m12.147s

**************************************************************************

The results appear identical in both cases. On the other hand, if I submit the same job across 2 nodes with clean Instant and dijitso caches, the job fails as shown below (25 processes):

**************************************************************************
**************************************************************************

Configuration: gcc-openmpi
Running on host n0035.savio2
Time: Wed Jan 3 12:53:20 PST 2018
Directory: /global/home/users/miguelr/python/test/test-fenics-2017.2.0
Using 25 processors across 2 nodes
Memory: 63500 MB per node
Calling FFC just-in-time (JIT) compiler, this may take some time.
------------------- Start compiler output ------------------------
[n0035:16463] *** Process received signal ***
[n0035:16463] Signal: Segmentation fault (11)
[n0035:16463] Signal code: Address not mapped (1)
[n0035:16463] Failing at address: 0x412e2d0

-------------------  End compiler output  ------------------------
Compilation failed! Sources, command, and errors have been written to: /global/home/users/miguelr/python/test/test-fenics-2017.2.0/jitfailure-ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f
Traceback (most recent call last):
  File "test-linear_solver.py", line 11, in <module>
    V = dlf.FunctionSpace(mesh, "CG", 2)
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/functions/functionspace.py", line 199, in __init__
    self._init_convenience(*args, **kwargs)
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/functions/functionspace.py", line 249, in _init_convenience
    constrained_domain=constrained_domain)
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/functions/functionspace.py", line 218, in _init_from_ufl
    dolfin_element, dolfin_dofmap = _compile_dolfin_element(element, mesh, constrained_domain=constrained_domain)
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/functions/functionspace.py", line 82, in _compile_dolfin_element
    ufc_element, ufc_dofmap = jit(element, mpi_comm=mesh.mpi_comm())
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/compilemodules/jit.py", line 107, in mpi_jit
    error_msg)
  File "/global/home/groups/fc_biome/sl7/programs/gcc-programs/fenics-2017.2.0/lib/python3.5/site-packages/dolfin/cpp/common.py", line 2739, in dolfin_error
    return _common.dolfin_error(location, task, reason)
RuntimeError: 

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***
*** Remember to include the error message listed below and, if possible,
*** include a *minimal* running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to perform just-in-time compilation of form.
*** Reason:  Compilation failed on root node..
*** Where:   This error was encountered inside jit.py.
*** Process: 18
*** 
*** DOLFIN version: 2017.2.0
*** Git changeset:  0baf73825079a581e43ab1705370043040aa213d
*** -------------------------------------------------------------------------

[OMITTED REPEATED LINES]

--------------------------------------------------------------------------
mpirun noticed that process rank 21 with PID 14564 on node n0036.savio2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

real 0m6.524s
user 0m45.460s
sys 0m7.269s

**************************************************************************

The directory jitfailure-ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f is generated with the following error log:

**************************************************************************
**************************************************************************

[n0035:16463] *** Process received signal ***
[n0035:16463] Signal: Segmentation fault (11)
[n0035:16463] Signal code: Address not mapped (1)
[n0035:16463] Failing at address: 0x412e2d0

**************************************************************************

When I run recompile.sh from the command line, no errors occur.

I've attached the Python script used, the SLURM job file used to submit the job, the output log files for all three cases, and the jitfailure* directory mentioned above.
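For completeness, the cache clearing mentioned above is just deleting
the two JIT cache directories before submitting. A minimal sketch,
assuming the default cache locations (i.e., INSTANT_CACHE_DIR and
DIJITSO_CACHE_DIR are not set):

    # remove stale Instant and dijitso JIT caches before submitting
    rm -rf ~/.cache/instant ~/.cache/dijitso
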
fenics-2017.2.0-savio_testing-minimal.tar.gz

Jan Blechta

Jan 15, 2018, 6:57:18 AM
to Miguel Rodriguez, fenics-support
It is known that the Python 2 subprocess module does something between
fork and exec that interacts badly with Infiniband (or with threading).
But since you are running Python 3, which is considered safe in this
regard, I am not sure where the problem is, although it looks similar:
when dijitso or Instant tries to run the C++ compiler, a segfault
happens.

I would try running with the environment variables

DIJITSO_SYSTEM_CALL_METHOD="OS_SYSTEM"
INSTANT_SYSTEM_CALL_METHOD="OS_SYSTEM"

and check if that helps.
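For example, exported in the SLURM script before the launch line (a
sketch only; the script name below is taken from your traceback):

    # in the batch script, before launching the Python job
    export DIJITSO_SYSTEM_CALL_METHOD="OS_SYSTEM"
    export INSTANT_SYSTEM_CALL_METHOD="OS_SYSTEM"
    mpirun python3 test-linear_solver.py

The idea is to make dijitso and Instant invoke the compiler via a
plain os.system() call instead of the subprocess module, bypassing
the fork/exec path described above.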

Another step to help identify the problem could be to disable
Infiniband; see the documentation of your MPI ("--mca btl ^openib"
should do the job with OpenMPI).
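With OpenMPI that would be something like (a sketch):

    # exclude the openib BTL so MPI does not use Infiniband
    mpirun --mca btl ^openib python3 test-linear_solver.py

Note that this forces MPI traffic off Infiniband (typically onto
TCP), so it is a diagnostic step rather than a fix.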

Also, are you sure that recompile.sh runs in the same environment in
which FEniCS runs on the nodes? Did you try running recompile.sh
through the same submit script?

Please let us know how it goes.

Jan


On Wed, 3 Jan 2018 13:36:09 -0800 (PST)
Miguel Rodriguez <marodri...@gmail.com> wrote:

> I recently built FEniCS 2017.2.0 from source on Berkeley's cluster,
> savio
> <http://research-it.berkeley.edu/services/high-performance-computing>.

Miguel Rodriguez

Jan 17, 2018, 12:25:08 AM
to fenics-support
Running with the above environment variables did not seem to have any effect. I ran recompile.sh through the job scheduler and got:

**************************************************************************
**************************************************************************

Configuration: gcc-openmpi
Running on host n0027.savio2
Time: Tue Jan 16 20:54:36 PST 2018
Directory: /global/home/users/miguelr/python/test/test-fenics-2017.2.0
Using 25 processors across 2 nodes
Memory: 63500 MB per node
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory
g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp: No such file or directory

real    0m0.539s
user    0m0.104s
sys     0m0.204s

**************************************************************************

Disabling Infiniband with the flag you provided above worked. The job completed successfully.

Jan Blechta

Jan 17, 2018, 10:11:50 AM
to Miguel Rodriguez, fenics-support
On Tue, 16 Jan 2018 21:25:08 -0800 (PST)
Miguel Rodriguez <marodri...@gmail.com> wrote:

> Running with the above environment variables did not seem to have any
> effect. I ran the recompile.sh through the job scheduler and got:
>
> **************************************************************************
> **************************************************************************
>
>
> Configuration: gcc-openmpi
> Running on host n0027.savio2
> Time: Tue Jan 16 20:54:36 PST 2018
> Directory: /global/home/users/miguelr/python/test/test-fenics-2017.2.0
> Using 25 processors across 2 nodes
> Memory: 63500 MB per node
> g++: error: ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f.cpp:
> No such file or directory

Apparently you are in the wrong working directory. Check the contents
of recompile.sh.
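I.e., in the submit script, change into the jitfailure directory
first; a sketch using the path from your earlier log:

    # run recompile.sh from inside the jitfailure directory
    cd /global/home/users/miguelr/python/test/test-fenics-2017.2.0/jitfailure-ffc_element_5b6081f30aebcbadbc21c15812333d05758ea45f
    sh recompile.sh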

Jan

Miguel Rodriguez

Jan 17, 2018, 12:27:32 PM
to fenics-support
You are correct. It ran without any errors when I submitted the job from the proper directory.

Jan Blechta

Jan 17, 2018, 1:44:37 PM
to Miguel Rodriguez, fenics-support
OK, so the problem appears to be in fork/exec when dijitso calls the
compiler. Did you try with Infiniband turned off?

Jan


Miguel Rodriguez

Jan 17, 2018, 1:48:47 PM
to Jan Blechta, fenics-support
Yes, I did. Using the flag you provided to disable Infiniband worked. The job completed successfully.