TACC Stampede

sheshu

Aug 5, 2013, 2:29:16 PM
to dea...@googlegroups.com
Hi,

I installed deal.II on Stampede following the instructions in:

http://www.geodynamics.org/cig/Members/emheien/aspect_stampede/at_download/file

The modules I used are:
Currently Loaded Modules:
  1) TACC         4) cluster         7) intel/13.0.079  10) petsc/3.4
  2) TACC-paths   5) cluster-paths   8) metis/5.0.2     11) phdf5/1.8.9
  3) Linux        6) cmake/2.8.9     9) mvapich2/1.9a2  12) trilinos/10.12.2



I did not get any errors while building the library, and all the serial programs run fine. But when I try to run any of the parallel programs (step-17, step-32, step-40), I get runtime errors. Example output for step-40 is shown below:


$ ibrun ./step-40
TACC: Starting up job 1378918
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Cycle 0:
   Number of active cells:       1024
   Number of degrees of freedom: 4225
   Solved in 10 iterations.



+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |     0.236s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly                        |         1 |    0.0164s |       6.9% |
| output                          |         1 |    0.0267s |        11% |
| setup                           |         1 |    0.0423s |        18% |
| solve                           |         1 |    0.0848s |        36% |
+---------------------------------+-----------+------------+------------+

Cycle 1:
   Number of active cells:       1954
   Number of degrees of freedom: 8399
   Solved in 10 iterations.



+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |     0.198s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly                        |         1 |      0.02s |        10% |
| output                          |         1 |    0.0141s |       7.1% |
| refine                          |         1 |     0.114s |        58% |
| setup                           |         1 |    0.0347s |        18% |
| solve                           |         1 |    0.0145s |       7.3% |
+---------------------------------+-----------+------------+------------+

Cycle 2:
   Number of active cells:       3664
   Number of degrees of freedom: 16183
   Solved in 11 iterations.



+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |     0.343s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly                        |         1 |    0.0377s |        11% |
| output                          |         1 |    0.0228s |       6.7% |
| refine                          |         1 |     0.197s |        58% |
| setup                           |         1 |     0.063s |        18% |
| solve                           |         1 |    0.0217s |       6.3% |
+---------------------------------+-----------+------------+------------+

Cycle 3:
   Number of active cells:       7036
   Number of degrees of freedom: 31483
/work/01366/sheshu/local/examples/step-40/step-40: symbol lookup error: /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: mkl_serv_mkl_malloc
/work/01366/sheshu/local/examples/step-40/step-40: symbol lookup error: /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: mkl_serv_mkl_malloc
/work/01366/sheshu/local/examples/step-40/step-40: symbol lookup error: /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: mkl_serv_mkl_malloc
[c558-103.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[c558-103.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c558-103.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 5, pid: 32777) exited with status 127
[c558-103.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 0, pid: 32772) exited with status 127
[c558-103.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 32773) exited with status 127
TACC: MPI job exited with code: 1
 
TACC: Shutdown complete. Exiting.


What am I doing wrong?

Thanks.
Sheshu

Timo Heister

Aug 5, 2013, 3:13:06 PM
to dea...@googlegroups.com
Are you loading the same modules in your job script? What happens if
you run "mpirun -n 2 ./step-40" on the login node (or the one where
you compiled)? Can you run "ldd step-40" on the login node and inside
a job script to see if there are any differences?
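
Since ldd prints a load address at the end of every line, and that address changes from run to run, a plain diff of the two outputs is noisy. A minimal sketch of one way to compare them, assuming the two outputs have been saved as text: the helper names (normalize_ldd, ldd_differences) and the job-side sample lines are hypothetical and only illustrate the idea.

```python
import re

# Matches the trailing load address that ldd prints, e.g. " (0x00002b35db7b6000)".
ADDRESS = re.compile(r"\s*\(0x[0-9a-fA-F]+\)\s*$")


def normalize_ldd(output):
    """Return the set of ldd lines with the load addresses stripped."""
    lines = set()
    for line in output.splitlines():
        line = ADDRESS.sub("", line.strip())
        if line:
            lines.add(line)
    return lines


def ldd_differences(login_output, job_output):
    """Return (only_on_login, only_in_job) as sorted lists of lines."""
    login = normalize_ldd(login_output)
    job = normalize_ldd(job_output)
    return sorted(login - job), sorted(job - login)


# Hypothetical sample data: same libc, different MKL directory.
login_sample = """\
    libc.so.6 => /lib64/libc.so.6 (0x0000003069400000)
    libmkl_core.so => /opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so (0x00002b35ef641000)
"""
job_sample = """\
    libc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)
    libmkl_core.so => /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so (0x00007f0000100000)
"""

only_login, only_job = ldd_differences(login_sample, job_sample)
for line in only_login:
    print("login node only:", line)
for line in only_job:
    print("job script only:", line)
```

On the cluster, the two inputs could be captured with something like "ldd ./step-40 > ldd_login.txt" on the login node and the same command inside the job script, then read back from the files.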

--
Timo Heister
http://www.math.clemson.edu/~heister/

sheshu

Aug 5, 2013, 4:53:06 PM
to dea...@googlegroups.com
Hi Timo,

Thanks for the reply.
This is what I get when I run "mpirun -n 2 ./step-40" from the login node:

librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4

Cycle 0:
   Number of active cells:       1024
   Number of degrees of freedom: 4225
./step-40: symbol lookup error: /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: mkl_serv_mkl_malloc
./step-40: symbol lookup error: /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: mkl_serv_mkl_malloc

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 127
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

There is no difference between running "ldd step-40" on the login node and in the job script, except for the hexadecimal numbers at the end of each line.

Timo Heister

Aug 5, 2013, 5:09:16 PM
to dea...@googlegroups.com
Can you post the output of ldd here? I assume you are mixing mkl
versions between 13.0 and 13.1. Can you also please dig up any info
regarding mkl in detailed.log (in your deal.II build directory)?

sheshu

Aug 5, 2013, 5:53:07 PM
to dea...@googlegroups.com
Hi Timo,

Output of ldd:

    linux-vdso.so.1 =>  (0x00007fff417a0000)
    libdeal_II.g.so.8.1.pre => /work/01366/sheshu/local/lib/libdeal_II.g.so.8.1.pre (0x00002b35db7b6000)
    libpetsc.so => /opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libpetsc.so (0x00002b35e7e45000)
    libimf.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libimf.so (0x00002b35e9241000)
    libteuchos.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libteuchos.so (0x00002b35e96fd000)
    libmkl_gf_lp64.so => /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_gf_lp64.so (0x00002b35e9e4c000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000000306a000000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000306c800000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000306b400000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003069400000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003069c00000)
    libmetis.so => /opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libmetis.so (0x00002b35ea5a9000)
    libmkl_intel_lp64.so => /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b35ea83f000)
    libml.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libml.so (0x00002b35eaf8b000)
    libifpack.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libifpack.so (0x00002b35eb4b6000)
    libamesos.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libamesos.so (0x00002b35eb850000)
    libaztecoo.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libaztecoo.so (0x00002b35ebad7000)
    libepetra.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libepetra.so (0x00002b35ebd65000)
    libhdf5.so.7 => /opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib/libhdf5.so.7 (0x00002b35ec077000)
    libp4est.so.0 => /work/01366/sheshu/local/FAST/lib/libp4est.so.0 (0x00002b35ec644000)
    libsc.so.0 => /work/01366/sheshu/local/FAST/lib/libsc.so.0 (0x00002b35ec8e1000)
    libmkl_intel_thread.so => /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so (0x00002b35ecb2e000)
    libiomp5.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so (0x00002b35edaa2000)
    libmpich.so.8 => /opt/apps/intel13/mvapich2/1.9/lib/libmpich.so.8 (0x00002b35eddb0000)
    libintlc.so.5 => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5 (0x00002b35ee389000)
    libsvml.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libsvml.so (0x00002b35ee5d7000)
    libmkl_sequential.so => /opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_sequential.so (0x00002b35eefa3000)
    libmkl_core.so => /opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so (0x00002b35ef641000)
    libparmetis.so => /opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libparmetis.so (0x00002b35f084f000)
    libhdf5_fortran.so.7 => /opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib/libhdf5_fortran.so.7 (0x00002b35f0a9f000)
    libhdf5_hl.so.7 => /opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib/libhdf5_hl.so.7 (0x00002b35f0ce2000)
    libz.so.1 => /lib64/libz.so.1 (0x000000306a400000)
    libmpichf90.so.8 => /opt/apps/intel13/mvapich2/1.9/lib/libmpichf90.so.8 (0x00002b35f0f16000)
    libifport.so.5 => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libifport.so.5 (0x00002b35f1118000)
    libifcore.so.5 => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libifcore.so.5 (0x00002b35f1347000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003069800000)
    libmpichcxx.so.8 => /opt/apps/intel13/mvapich2/1.9/lib/libmpichcxx.so.8 (0x00002b35f167e000)
    libopa.so.1 => /opt/apps/intel13/mvapich2/1.9/lib/libopa.so.1 (0x00002b35f189f000)
    libmpl.so.1 => /opt/apps/intel13/mvapich2/1.9/lib/libmpl.so.1 (0x00002b35f1aa1000)
    libibmad.so.5 => /opt/ofed/lib64/libibmad.so.5 (0x00002b35f1ca5000)
    librdmacm.so.1 => /opt/ofed/lib64/librdmacm.so.1 (0x00002b35f1ebc000)
    libibumad.so.3 => /opt/ofed/lib64/libibumad.so.3 (0x00002b35f20c5000)
    libibverbs.so.1 => /opt/ofed/lib64/libibverbs.so.1 (0x00002b35f22cb000)
    librt.so.1 => /lib64/librt.so.1 (0x000000306a800000)
    liblimic2.so.0 => /opt/apps/limic2/0.5.5//lib/liblimic2.so.0 (0x00002b35f24da000)
    libirng.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libirng.so (0x00002b35f26db000)
    libcilkrts.so.5 => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libcilkrts.so.5 (0x00002b35f28e2000)
    libirc.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libirc.so (0x00002b35f2b18000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003069000000)
    libgaleri.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libgaleri.so (0x00002b35f2d67000)
    libisorropia.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libisorropia.so (0x00002b35f2f9b000)
    libepetraext.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libepetraext.so (0x00002b35f3223000)
    libzoltan.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libzoltan.so (0x00002b35f351d000)
    libtriutils.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libtriutils.so (0x00002b35f3835000)
    libtpetraext.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libtpetraext.so (0x00002b35f3a73000)
    libtpetrainout.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libtpetrainout.so (0x00002b35f3c76000)
    libtpetra.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libtpetra.so (0x00002b35f3e84000)
    libkokkoslinalg.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libkokkoslinalg.so (0x00002b35f40f6000)
    libkokkosnodeapi.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libkokkosnodeapi.so (0x00002b35f42f8000)
    libkokkos.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libkokkos.so (0x00002b35f4504000)
    libtpi.so => /opt/apps/intel13/mvapich2_1_9/trilinos/10.12.2/10.12.2/lib/libtpi.so (0x00002b35f4705000)
    libsz.so.2 => /opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib/libsz.so.2 (0x00002b35f490f000)
    liblua-5.1.so => /usr/lib64/liblua-5.1.so (0x000000306f400000)

mkl in detailed.log:

#            LAPACK_LIBRARIES = /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_gf_lp64.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_lp64.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so;/opt/apps/intel/13/composer_xe_2013.0.079/compiler/lib/intel64/libiomp5.so;-lm;/usr/lib/gcc/x86_64-redhat-linux/4.4.7/libgfortran.so;/usr/lib64/libm.so
#            P4EST_LIBRARIES = /work/01366/sheshu/local/FAST/lib/libp4est.so;/work/01366/sheshu/local/FAST/lib/libsc.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_gf_lp64.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_lp64.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so;/opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so;
#            PETSC_LIBRARIES = /opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libpetsc.so;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libsuperlu_4.3.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libHYPRE.a;/opt/apps/intel13/mvapich2/1.9/lib/libmpichcxx.so;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libspai.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libsuperlu_dist_3.3.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libcmumps.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libdmumps.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libsmumps.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libzmumps.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libmumps_common.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libpord.a;/opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libscalapack.a;/opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_intel_lp64.so;/opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_sequential.so;/opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so;
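
The log entries above are long enough that the MKL directory each one uses is hard to spot by eye. A minimal sketch that reduces each "NAME = a;b;c" line to the set of composer_xe versions its MKL paths point at: the function name is hypothetical, the "NAME = value" layout is assumed from the lines shown here, and the sample is shortened from those lines.

```python
import re
from collections import defaultdict

# Captures the version in a directory name such as "composer_xe_2013.0.079".
VERSION = re.compile(r"composer_xe_(\d+\.\d+\.\d+)")


def mkl_versions_by_variable(log_text):
    """Map each *_LIBRARIES variable to the composer_xe versions of its MKL paths."""
    result = defaultdict(set)
    for line in log_text.splitlines():
        line = line.lstrip("# \t")  # drop the leading "#" and indentation
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        for entry in value.split(";"):
            entry = entry.strip()
            match = VERSION.search(entry)
            if match and "/mkl/" in entry:
                result[name.strip()].add(match.group(1))
    return dict(result)


# Shortened from the detailed.log lines in this thread.
sample_log = """\
#            LAPACK_LIBRARIES = /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so;-lm
#            PETSC_LIBRARIES = /opt/apps/intel13/mvapich2_1_9/petsc/3.4/sandybridge/lib/libpetsc.so;/opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so;
"""

for variable, versions in mkl_versions_by_variable(sample_log).items():
    print(variable, sorted(versions))
```

If two variables report different versions, as LAPACK_LIBRARIES and PETSC_LIBRARIES do in the sample, that is a hint about which dependency brings in the second MKL.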


Kindly let me know if you need any other information.

Thanks,
Sheshu

Timo Heister

Aug 5, 2013, 5:58:59 PM
to dea...@googlegroups.com
Okay, the problem is that you are picking up libraries from three
different Intel installations:
/opt/apps/intel/13/composer_xe_2013.0.*
/opt/apps/intel/13/composer_xe_2013.1.*
/opt/apps/intel/13/composer_xe_2013.2.*
You can see this in the ldd output you sent.

You basically have to figure out why anything other than *2013.0.* gets
pulled in and stop that from happening. It could be pulled in by any of
the libraries you are using, or it could be something in your
LD_LIBRARY_PATH.
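
One way to see the mix at a glance is to group the resolved libraries by the composer_xe version in their path, and to list which LD_LIBRARY_PATH entries point into an Intel tree. The sketch below does both; the function names and the LD_LIBRARY_PATH value are hypothetical, while the three ldd lines are copied from the output earlier in this thread.

```python
import re
from collections import defaultdict

# Captures the version in a directory name such as "composer_xe_2013.2.146".
VERSION = re.compile(r"composer_xe_(\d+\.\d+\.\d+)")


def group_by_composer_version(ldd_output):
    """Map each composer_xe version to the libraries that ldd resolved from it."""
    groups = defaultdict(list)
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        match = VERSION.search(target)
        if arrow and match:
            groups[match.group(1)].append(name)
    return dict(groups)


def composer_path_entries(ld_library_path):
    """Return the LD_LIBRARY_PATH entries inside a composer_xe tree, in search order."""
    return [entry for entry in ld_library_path.split(":") if VERSION.search(entry)]


# Three lines taken from the ldd output posted above.
ldd_sample = """\
    libmkl_intel_thread.so => /opt/apps/intel/13/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so (0x00002b35ecb2e000)
    libiomp5.so => /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so (0x00002b35edaa2000)
    libmkl_core.so => /opt/apps/intel/13/composer_xe_2013.1.117/mkl/lib/intel64/libmkl_core.so (0x00002b35ef641000)
"""

# Hypothetical value; on the cluster this would come from os.environ["LD_LIBRARY_PATH"].
path_sample = "/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64:/opt/apps/intel13/mvapich2/1.9/lib"

for version, libraries in sorted(group_by_composer_version(ldd_sample).items()):
    print(version, libraries)
print(composer_path_entries(path_sample))
```

In the sample, libmkl_intel_thread.so comes from 2013.0.079 while libmkl_core.so comes from 2013.1.117, which would be consistent with the undefined mkl_serv_mkl_malloc symbol if the two MKL releases do not match.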