Suggestion for time-out problem when using coMDD for solving many "planes"(>50?) in qMDD

Troels Linnet

Apr 13, 2013, 6:03:12 AM
to mdd...@googlegroups.com
Dear Maxim and collaborators.

Thank you for a very nice GUI for reconstructing NLS data.
We use it a lot, and it saves us days of time on the magnet.

We have tried to reconstruct data that is recorded "HSQC-like".
For CPMG and T1rho measurements, this gives a set of experiments with
"planes" of HSQCs, recorded under different magnet settings.

We use coMDD to reconstruct the planes, taking all the planes into consideration,
and it works very well.

But we have found a little timing bug, at least in qMDD 2.2.

When you initialize coMDD, a coproc.sh script is generated and executed.
Part of that looks like this:
--------------------------------------------------
echo HD data disassembling
set i=0
foreach dir ( $procDirs )
    echo "setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log"
    setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while data disassembling
        echo "coMDD fail: in $dir"
        exit($ecode)
    endif
    cd ../$dir
    ./proc1.sh >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while indirect FT
        exit($ecode)
    endif
    cd ../coMDD
    @ i++
end
----------------------------------------------------
The problem is that, if you have many planes (>50?), the line
" setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log "
is not "done" before the exit code is checked,
and so the whole run is discarded.
The error is not easy to reproduce, since sometimes the computer is fast enough to be "done" and other times it is not.
So when running in the GUI, it sometimes works and sometimes it does not.

We found the following solution to be fruitful.
We loop over an array of increasing waiting times until we get an exit code of 0, with a maximum total waiting time of 0+1+...+10 = 55 seconds.
--------------------------------------------------------
echo HD data disassembling
set i=0
set cowait=`seq 0 10`
foreach dir ( $procDirs )
    foreach cw ( $cowait )
        echo "   "$dir
        echo "In directory: $PWD"
        echo "setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log"
        setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
        set ecode=$?
        echo "setHD succesfull? 0=yes, 1=no. Result= $ecode"
        if ( $ecode ) then
            echo "ecode is $ecode. Sleeping $cw"
            sleep $cw
        else
            break            
        endif
    end
    echo "In dir: $dir"
    if ( $ecode ) then
        echo Error while data disassembling
        echo "coMDD fail: in $dir"
        exit($ecode)
    endif
    cd ../$dir
    echo "Running proc1.sh in ${PWD}"
    ./proc1.sh >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while indirect FT
        exit($ecode)
    endif
    cd ../coMDD
    @ i++
end
------------------------------------------
The log file from running qMDD in a shell then looks like this.
Notice that sometimes there is a need for just a little more time (here 1 second, but sometimes it waits 4 rounds = 10 s).

---------------------------------------
[8]  - Done                          echo mddsolver hd27.mdd 25 0 50 1e-8 0.005 2345 0 1 hd27.res > hd27.log | csh
localhost:2 CPU load 23%. 0 jobs are running, 2 additional jobs

PIDs are: 14771 14790 14867 14889 14985 15164 15216 15235 15323 15342 15565 15584 15662 15686 15780 15864 16021 16073 16092 16180 16258 16326 16484 16536 16555 16643 16721 16877
/sbinlab2/software/mddnmr2.2/com/queMM.sh : OK
HD data disassembling

   prot_pH6_5C_0Murea_t1rho1.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho1.proc/MDD/region%02d.res hd%02d.res 0 >>&log
setHD successful? 0=yes, 1=no. Result= 1
ecode is 1. Sleeping 0
   prot_pH6_5C_0Murea_t1rho1.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho1.proc/MDD/region%02d.res hd%02d.res 0 >>&log
setHD successful? 0=yes, 1=no. Result= 0
In dir: prot_pH6_5C_0Murea_t1rho1.proc
Running proc1.sh in /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/prot_pH6_5C_0Murea_t1rho1.proc

   prot_pH6_5C_0Murea_t1rho2.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho2.proc/MDD/region%02d.res hd%02d.res 1 >>&log
setHD successful? 0=yes, 1=no. Result= 0
In dir: prot_pH6_5C_0Murea_t1rho2.proc
Running proc1.sh in /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/prot_pH6_5C_0Murea_t1rho2.proc

--------------------------
Thanks again for a very nice program.
Maybe the cowait array could be implemented in the coming version of qMDD?

Best
Troels

Maxim Mayzel

Apr 15, 2013, 4:38:47 AM
to mdd...@googlegroups.com
Dear Troels,

Thank you so much for the feedback; it's no problem at all to include your fix in coproc.sh, although I don't understand why and when the problem that you've mentioned arises.
To my knowledge, when you call an external program in tcsh like this
program arguments >>& log
it is executed in the foreground and the shell waits until it is done.
So the line that you pointed out,
setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
seems perfectly fine to me.
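Just to illustrate what I mean, here is a minimal tcsh check (nothing qMDD-specific; sleep and grep only stand in for setHD, and "log" is just a scratch file):
--------------------------------------------------
#!/bin/tcsh
# foreground call: tcsh should block here for ~2 s before moving on
sleep 2 >>& log
set ecode=$?
echo "exit code after sleep: $ecode"    # expected: 0

# a failing command should leave a non-zero exit code in $?
grep no_such_pattern /dev/null >>& log
set ecode=$?
echo "exit code after grep: $ecode"     # expected: 1
--------------------------------------------------
So I would expect $? to always reflect the setHD call that has just finished.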

So do you have any explanation for this error?
And did you try running qMDD without the GUI, or just "coproc.sh c *proc" from a terminal window?

Best
Maxim

Troels Emtekær Linnet

Apr 15, 2013, 5:23:15 AM
to Maxim Mayzel, mdd...@googlegroups.com
Dear Maxim.

Without really having found the true explanation, I think it starts in queMM.sh.
We tried to play a little with increasing the number of MDDTHREADS in queMM.sh, since we have a computer with 24 cores.

We tried to put this in:

if( ! $?MDDTHREADS && $HOST == "haddock" ) set MDDTHREADS=24   # <-- don't put this in
if( ! $?MDDTHREADS ) set MDDTHREADS=2
set hosts=(localhost.$MDDTHREADS) # <host>.<number of CPUs>
#
# for distributed computing on hosts with names one,two,three with 4 CPUs each uncomment the following line
#set hosts=( one.4 two.4 three.4 ) 

But you should NOT do this.
The rest of that script automatically distributes the jobs to different threads. Am I right?
So it auto-detects that you have more cores.

So, when you have "many" planes, that you compare, i think that not all the distributed jobs are done,
before you ask for
setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log

We found this when using the GUI.
Sometimes it worked, sometimes it did not, and we could not figure out why.
But it appears when you have many "planes"/"folders" that you run together.
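One alternative to the fixed sleeps might be to poll for the region result files that setHD needs before calling it. This is only a sketch (untested), assuming the solvers write one regionNN.res file per region into ../$dir/MDD and that a file which exists is already complete:
--------------------------------------------------------
    # wait up to ~60 s until all region result files for this plane exist
    set tries=0
    while ( $tries < 60 )
        set nres=`ls ../$dir/MDD |& grep -c '^region.*\.res$'`
        if ( $nres >= $nreg ) break
        sleep 1
        @ tries++
    end
    setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
    set ecode=$?
--------------------------------------------------------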

We are now bypassing the call to the generated coproc.sh file and using the slightly modified version instead.
I could not detect any difference between the generated coproc.sh files; they always seem to be the same.
So I would actually suggest that you make it a general script, kept in one place?

Best
Troels Emtekær Linnet


Maxim Mayzel

Apr 16, 2013, 9:10:16 AM
to mdd...@googlegroups.com, Maxim Mayzel
Dear Troels,

You can find coproc.sh in $MDD_NMR/com.
When you start coprocessing it is copied to the directory where you have all the spectra you want to process, and then it is modified by qMDD.py to insert either "$RUNQUE regions.runs" or "rproc login@host $RUNQUE coMDD". Honestly, I didn't test it thoroughly...
You can define $RUNQUE in $MDD_NMR/GUI/qMDD.rc. The queMM.sh dispatcher is a very simple and not very smart solution for our local "cluster", so it could well be that the bug you observed is related to queMM.sh.
If you have just one node with 24 CPUs, I would suggest you write a simpler script...
I would probably make a script that uses srun (https://computing.llnl.gov/linux/slurm/srun.html), since we are now using a cluster with the SLURM system; let me know if you need it.
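If it helps before that, something as simple as the following might already be enough for one node. This is only a sketch (untested); the name runlocal.sh is made up, and it assumes the runs file (e.g. regions.runs) contains one complete command per line:
--------------------------------------------------
#!/bin/tcsh -f
# runlocal.sh <runsfile>: run every command line from the runs file on the
# local node, MDDTHREADS at a time, and return only when all have finished.
if ( ! $?MDDTHREADS ) set MDDTHREADS=24
set runs=$1
set n=0
foreach cmd ( "`cat $runs`" )          # double quotes: split on newlines only
    echo "$cmd" | csh >>& runlocal.log &
    @ n++
    # crude throttle: wait for the whole batch before starting the next one
    if ( $n % $MDDTHREADS == 0 ) wait
end
wait                                   # block until the last batch is done
--------------------------------------------------
The batch-wise wait is not elegant, but if something like this were used in place of queMM.sh, the setHD step in coproc.sh would only run after every solver job has written its result.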

Best,
Maxim 
