Julia getting stuck in long for loop / help debugging non-terminating code


axsk

Oct 8, 2015, 8:37:03 AM
to julia-users
Hello dear Julia fellows!

I use a for loop to batch-run a clustering script with different data and parameters; you can find my code in the gist.

Most of the work (~120 s per call) is spent in the Hokusai.cluster call, a pure function that uses pmap (julia -p 4) to compute clusterings over a further set of parameters (sigma, tau).
Furthermore I use ProgressMeter.jl to track the progress p, and JLD/DataFrames to save my results after each successful computation.

The loop is run 578 times, each time appending ~60 new rows to the DataFrame hdf.
So all in all this task takes about 20 hours.

Unfortunately, after some hours (if I recall correctly, it was at 6, 9, and 10 hours when I checked), the loop stops making progress and hangs,
as can be seen from the missing progress messages as well as in htop.
Normally all four workers are at full load, but once it hangs only one worker is at full load (for multiple hours) while the others are idle.

If I restart Julia and continue the loop by loading the saved hdf::DataFrame, it skips all entries computed so far (line 7) and resumes after the last saved entry (I checked that). It then runs without problems until it gets stuck again at some random later point.

I really do not know how to debug/solve this.
As this happens on seemingly random iterations, it takes a long time to reproduce.

Do you have any ideas what might cause this or how to find the bug?

Best,
Alex

Isaiah Norton

Oct 8, 2015, 12:00:08 PM
to julia...@googlegroups.com
Your best bet may be to log in to a node (if possible) and attach gdb to the hanging process to try to see where it hung. This sounds like some form of corruption, and is going to be tough to debug without an isolation case.
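
For reference, attaching gdb to the stuck worker could look roughly like the sketch below. It assumes gdb is installed on the node and that you are allowed to attach to the process; the function name dump_backtraces and the default output file name are made up for illustration and do not come from the thread.

```shell
#!/usr/bin/env bash
# Sketch: dump a backtrace of every thread in a running process using gdb.
# Assumption: gdb is installed and you have permission to attach to the PID.
dump_backtraces() {
    local pid="$1"
    local out="${2:-hang-backtrace.txt}"   # hypothetical default file name

    # Refuse anything that is not a plain numeric PID.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: dump_backtraces PID [OUTFILE]" >&2
            return 2
            ;;
    esac

    # -p attaches to the running process; -batch runs the -ex commands and
    # then exits, which detaches again so the process can keep running.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}
```

The PID of the one worker that is still at full load can be read off htop, or listed with `pgrep -fl julia`; then run `dump_backtraces <PID>` and look at the saved file. Without debug symbols the frames may be hard to interpret, so treat the output as a starting point only.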