Suggestion for time-out problem when using coMDD for solving many "planes"(>50?) in qMDD

Troels Linnet

Apr 13, 2013, 6:03:12 AM
to mdd...@googlegroups.com
Dear Maxim and collaborators.

Thank you for a very nice GUI for reconstructing NLS data.
We use it a lot, and it saves us days of time on the magnet.

We have tried to reconstruct data that is recorded "HSQC-like".
For CPMG and T1rho measurements, this gives a set of experiments with
"planes" of HSQCs, recorded under different magnet settings.

We use coMDD to reconstruct the planes, taking all the planes into consideration,
and it works very well.

But we have found a little timing bug, at least in qMDD 2.2.

When you initialize coMDD, a coproc.sh script is generated and executed.
Part of that looks like this:
--------------------------------------------------
echo HD data disassembling
set i=0
foreach dir ( $procDirs )
    echo "setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log"
    setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while data disassembling
        echo "coMDD fail: in $dir"
        exit($ecode)
    endif
    cd ../$dir
    ./proc1.sh >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while indirect FT
        exit($ecode)
    endif
    cd ../coMDD
    @ i++
end
----------------------------------------------------
The problem is that, if you have many planes (>50?), the line
" setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log "
is not "done" before the exit code is checked,
and so the whole run is discarded.
The error is not easy to reproduce, since sometimes the computer is fast enough to be "done" and other times it is not.
So when running in the GUI, it sometimes works and sometimes it does not.

We found the following solution to be fruitful.
We loop over an array of increasing waiting times until we get an exit code of 0, with a maximum total waiting time of 0+1+...+10 = 55 seconds.
--------------------------------------------------------
echo HD data disassembling
set i=0
set cowait=`seq 0 10`
foreach dir ( $procDirs )
    foreach cw ( $cowait )
        echo "   "$dir
        echo "In directory: $PWD"
        echo "setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log"
        setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
        set ecode=$?
        echo "setHD succesfull? 0=yes, 1=no. Result= $ecode"
        if ( $ecode ) then
            echo "ecode is $ecode. Sleeping $cw"
            sleep $cw
        else
            break            
        endif
    end
    echo "In dir: $dir"
    if ( $ecode ) then
        echo Error while data disassembling
        echo "coMDD fail: in $dir"
        exit($ecode)
    endif
    cd ../$dir
    echo "Running proc1.sh in ${PWD}"
    ./proc1.sh >>&log
    set ecode=$?
    if ( $ecode ) then
        echo Error while indirect FT
        exit($ecode)
    endif
    cd ../coMDD
    @ i++
end
------------------------------------------
The log file from running qMDD in a shell then looks like this.
Notice that sometimes there is a need for just a little more time (here 1 second, but sometimes it waits 4 rounds = 10 s).

---------------------------------------
[8]  - Done                          echo mddsolver hd27.mdd 25 0 50 1e-8 0.005 2345 0 1 hd27.res > hd27.log | csh
localhost:2 CPU load 23%. 0 jobs are running, 2 additional jobs

PIDs are: 14771 14790 14867 14889 14985 15164 15216 15235 15323 15342 15565 15584 15662 15686 15780 15864 16021 16073 16092 16180 16258 16326 16484 16536 16555 16643 16721 16877
/sbinlab2/software/mddnmr2.2/com/queMM.sh : OK
HD data disassembling

   prot_pH6_5C_0Murea_t1rho1.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho1.proc/MDD/region%02d.res hd%02d.res 0 >>&log
setHD successful? 0=yes, 1=no. Result= 1
ecode is 1. Sleeping 0
   prot_pH6_5C_0Murea_t1rho1.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho1.proc/MDD/region%02d.res hd%02d.res 0 >>&log
setHD successful? 0=yes, 1=no. Result= 0
In dir: prot_pH6_5C_0Murea_t1rho1.proc
Running proc1.sh in /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/prot_pH6_5C_0Murea_t1rho1.proc

   prot_pH6_5C_0Murea_t1rho2.proc
In directory: /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/coMDD
setHD coMDD.hd res 28 ../prot_pH6_5C_0Murea_t1rho2.proc/MDD/region%02d.res hd%02d.res 1 >>&log
setHD successful? 0=yes, 1=no. Result= 0
In dir: prot_pH6_5C_0Murea_t1rho2.proc
Running proc1.sh in /home/tlinnet/kte/t1rho/t1rho/prot_pH6_5C_0Murea/prot_pH6_5C_0Murea_t1rho2.proc

--------------------------
Thanks again for a very nice program.
Maybe the cowait array could be implemented in the coming version of qMDD?

Best
Troels

Maxim Mayzel

Apr 15, 2013, 4:38:47 AM
to mdd...@googlegroups.com
Dear Troels,

Thank you so much for the feedback; it's no problem at all to include your fix in coproc.sh, although I don't understand why and when the problem that you've mentioned arises.
To my knowledge, when you call an external program in tcsh like this
program arguments >>& log
it is executed in the foreground and the shell waits until it is done.
So the line that you pointed out,
setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
seems perfectly fine to me.
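Just to illustrate what I mean, here is a minimal tcsh check (nothing qMDD-specific; sleep and grep only stand in for setHD, and "log" is just a scratch file):
--------------------------------------------------
#!/bin/tcsh
# foreground call: tcsh should block here for ~2 s before moving on
sleep 2 >>& log
set ecode=$?
echo "exit code after sleep: $ecode"    # expected: 0

# a failing command should leave a non-zero exit code in $?
grep no_such_pattern /dev/null >>& log
set ecode=$?
echo "exit code after grep: $ecode"     # expected: 1
--------------------------------------------------
So I would expect $? to always reflect the setHD call that has just finished.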

So do you have any explanation for this error?
And did you try running qMDD without the GUI, or just "coproc.sh c *proc" from a terminal window?

Best
Maxim

Troels Emtekær Linnet

Apr 15, 2013, 5:23:15 AM
to Maxim Mayzel, mdd...@googlegroups.com
Dear Maxim.

Without really having found the true explanation, I think it starts in queMM.sh.
We tried to play a little with increasing the number of MDDTHREADS in queMM.sh, since we have a computer with 24 cores.

We tried to put this in:

if( ! $?MDDTHREADS && $HOST == "haddock" ) set MDDTHREADS=24   # <-- don't put this in
if( ! $?MDDTHREADS ) set MDDTHREADS=2
set hosts=(localhost.$MDDTHREADS) # <host>.<number of CPUs>
#
# for distributed computing on hosts with names one,two,three with 4 CPUs each uncomment the following line
#set hosts=( one.4 two.4 three.4 ) 

But you should NOT do this.
The rest of that script automatically distributes the jobs to different threads. Am I right?
So it auto-detects that you have more cores.

So, when you have "many" planes, that you compare, i think that not all the distributed jobs are done,
before you ask for
setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log

We found this when using the GUI.
Sometimes it worked, sometimes it did not, and we could not figure out why.
But it appears when you have many "planes"/"folders" that you run together.
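One alternative to the fixed sleeps might be to poll for the region result files that setHD needs before calling it. This is only a sketch (untested), assuming the solvers write one regionNN.res file per region into ../$dir/MDD and that a file which exists is already complete:
--------------------------------------------------------
    # wait up to ~60 s until all region result files for this plane exist
    set tries=0
    while ( $tries < 60 )
        set nres=`ls ../$dir/MDD |& grep -c '^region.*\.res$'`
        if ( $nres >= $nreg ) break
        sleep 1
        @ tries++
    end
    setHD coMDD.hd res $nreg ../$dir/MDD/region%02d.res hd%02d.res $i >>&log
    set ecode=$?
--------------------------------------------------------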

We are now bypassing the call to the generated coproc.sh file and using the slightly modified version instead.
I could not detect any difference between the generated coproc.sh files; they always seem to be the same.
So I would actually suggest that you make it a general script, kept in one place?

Best
Troels Emtekær Linnet


Maxim Mayzel

Apr 16, 2013, 9:10:16 AM
to mdd...@googlegroups.com, Maxim Mayzel
Dear Troels,

You can find coproc.sh in $MDD_NMR/com.
When you start coprocessing it is copied to the directory where you have all the spectra you want to process, and then it is modified by qMDD.py to insert either "$RUNQUE regions.runs" or "rproc login@host $RUNQUE coMDD". Honestly, I didn't test it thoroughly...
You can define $RUNQUE in $MDD_NMR/GUI/qMDD.rc. The queMM.sh dispatcher is a very simple and not very smart solution for our local "cluster", so it could well be that the bug you observed is related to queMM.sh.
If you have just one node with 24 CPUs, I would suggest you write a simpler script...
I would probably make a script that uses srun (https://computing.llnl.gov/linux/slurm/srun.html), since we are now using a cluster with the SLURM system; let me know if you need it.
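If it helps before that, something as simple as the following might already be enough for one node. This is only a sketch (untested); the name runlocal.sh is made up, and it assumes the runs file (e.g. regions.runs) contains one complete command per line:
--------------------------------------------------
#!/bin/tcsh -f
# runlocal.sh <runsfile>: run every command line from the runs file on the
# local node, MDDTHREADS at a time, and return only when all have finished.
if ( ! $?MDDTHREADS ) set MDDTHREADS=24
set runs=$1
set n=0
foreach cmd ( "`cat $runs`" )          # double quotes: split on newlines only
    echo "$cmd" | csh >>& runlocal.log &
    @ n++
    # crude throttle: wait for the whole batch before starting the next one
    if ( $n % $MDDTHREADS == 0 ) wait
end
wait                                   # block until the last batch is done
--------------------------------------------------
The batch-wise wait is not elegant, but if something like this were used in place of queMM.sh, the setHD step in coproc.sh would only run after every solver job has written its result.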

Best,
Maxim 
