[maker-devel] Issues with locking in MPI mode


Michele Vidotto

Jan 21, 2021, 12:40:34 PM
to maker...@yandell-lab.org
Dear all,

As reported in the subject, I'm having issues with the locking mechanism of MAKER when it runs in parallel mode through MPI.
I'm using MAKER version 3.01.03, but the same happens on my system when I build and install version 2.31.11.
All prerequisites were installed in a conda environment. Perl 5.26.2 was installed from the anaconda channel. Hard-coded paths to the compilers were fixed. The necessary Perl modules were installed via cpanm:

"DBD::SQLite",
"DBI",
"Error",
"Error::Simple",
"File::NFSLock",
"File::Which",
"forks",
"forks::shared",
"Inline",
"Inline::C",
"IO::All",
"IO::Prompt",
"LWP::Simple",
"Perl::Unsafe::Signals",
"PerlIO::gzip",
"Proc::Simple",
"URI::Escape",
"DBD::Pg"

Additional libraries and components were installed via conda:

  - gcc_linux-64=7.3.0
  - gxx_linux-64=7.3.0
  - openmpi=4.1.0
  - zlib=1.2.11
  - libdb=6.1.26
  - expat=2.2.9
  - libxml2=2.9.10
  - exonerate=2.4.0
  - snoscan=1.0
  - rapsearch=2.24

Other components were installed manually. MAKER compiles and installs with no errors, but when I execute the program via MPI with:

# to avoid an Open MPI segmentation fault
export THREADS_DAEMON_MODEL=1

mpiexec -mca btl ^openib -n 1 \
maker \
-force \
-cpus 8 \
--fix_nucleotides \
maker_opts.ctl \
maker_bopts.ctl \
maker_exe.ctl

it always ends with the following error:


STATUS: Parsing control files...
ERROR: The directory is locked.  Perhaps by an instance of MAKER.

--> rank=NA, hostname=april.corp.igatechnology.com
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[19321,1],0]
  Exit code:    10
--------------------------------------------------------------------------

If I look inside the *.maker.output directory, a lock file remains:

.NFSLock.gi_lock.NFSLock

If I instead run MAKER with the -nolock flag, it runs with no problems at all.

My filesystem is OneFS from Isilon, exported to a virtual server over the NFSv4 protocol.
Looking at the code, MAKER uses the File::NFSLock Perl module for locking. This module fails some of its tests when installed on my system with cpanm:

#   Failed test at t/300_bl_sh.t line 115.
Shared locks not running simultaneously at t/300_bl_sh.t line 116, <$rd3> line 18.
# Looks like your test exited with 4 just after 27.
t/300_bl_sh.t ..... Dubious, test returned 4 (wstat 1024, 0x400)
Failed 47/73 subtests
t/400_kill.t ...... ok
t/410_die.t ....... ok
t/420_crash.t ..... ok
t/430_taint.t ..... ok

Test Summary Report
-------------------
t/300_bl_sh.t   (Wstat: 1024 Tests: 27 Failed: 1)
  Failed test:  27
  Non-zero exit status: 4
  Parse errors: Bad plan.  You planned 73 tests but ran 27.



Nevertheless, I was able to install it with the --notest flag.
Do you have any idea how I can overcome this problem and have MAKER run in parallel with MPI?
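
One way to narrow down whether the filesystem is involved: File::NFSLock is documented as relying on hard links being atomic over NFS, so a quick sanity check is to create a hard link in the affected directory and confirm that the link count goes to 2. The sketch below assumes a POSIX shell with `ln`, `ls` and `awk` available; `check_hardlink` is a hypothetical helper name, not part of MAKER or File::NFSLock. A passing result does not prove that locking works (the failing test above concerns shared locks), but a failure would point at the NFS export rather than at MAKER.

```shell
#!/bin/sh
# Sketch only: check_hardlink is a hypothetical helper, not part of MAKER.
# It checks that a hard link can be created in directory $1 and that the
# link count of the original file then reads 2.
# Exit status: 0 = hard links work, 1 = they do not, 2 = cannot write to $1.
check_hardlink() {
    dir=$1
    src="$dir/.linktest.$$"
    dst="$src.lock"
    : > "$src" || return 2           # cannot even create a file here
    status=1
    if ln "$src" "$dst" 2>/dev/null; then
        # Second field of `ls -ld` is the hard-link count.
        nlink=$(ls -ld "$src" | awk '{print $2}')
        [ "$nlink" -eq 2 ] && status=0
    fi
    rm -f "$src" "$dst"              # clean up both names
    return "$status"
}
```

For example, `check_hardlink /path/to/genome.maker.output && echo "hard links OK"`, where the path is a placeholder for the real output directory.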

Thanks in advance,




---
Michele Vidotto
mailto: michele...@gmail.com

Carson Holt

Mar 10, 2021, 4:28:23 PM
to Michele Vidotto, maker...@yandell-lab.org
Look for hidden files with .NFSLock in the name, delete them, and see if they come back.

find <search_folder> -name '*.NFSLock*' -exec rm {} +
find <search_folder> -name '*.NFSLock*'

If the files come back after you delete them, it can mean another MAKER job is still running and updating the lock. This can happen in unusual situations, for example when a process manager like Slurm OOM-kills a job but terminates only some of its processes.
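
The delete-and-recheck procedure can be sketched as a pair of shell functions. This is only an illustration: `list_locks` and `locks_reappear` are hypothetical helper names, and the 60-second default wait is an assumption, since the interval at which a running MAKER job refreshes its lock is not stated here.

```shell
#!/bin/sh
# Sketch only: list_locks and locks_reappear are hypothetical helpers.

# Print every regular file under $1 whose name contains ".NFSLock".
list_locks() {
    find "$1" -type f -name '*.NFSLock*'
}

# Delete the lock files under $1, wait $2 seconds (default 60, an assumed
# value), then succeed (exit 0) only if lock files exist again.
locks_reappear() {
    find "$1" -type f -name '*.NFSLock*' -exec rm -f {} +
    sleep "${2:-60}"
    [ -n "$(list_locks "$1")" ]
}
```

If `locks_reappear <search_folder>` succeeds, something is still writing the lock. On Linux, `pgrep -af maker` is one way to look for leftover processes from an earlier job.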

—Carson



_______________________________________________
maker-devel mailing list
maker...@yandell-lab.org
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
