Running Dedalus on Several Cores


nand...@swarthmore.edu

Jun 19, 2018, 2:48:21 PM
to Dedalus Users
Hi, I'm trying to run the 3D Rayleigh–Bénard Dedalus example on PSC Bridges. I've made a batch script that seems to work on one node:
==========================
#!/bin/bash
#SBATCH -p RM
#SBATCH -t 00:05:00
#SBATCH -N 1
#SBATCH --ntasks-per-node 28

#echo commands to stdout
set -x

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

#activate dedalus
source /home/nanders4/dedalus/bin/activate

#run mpi program
mpirun -np $SLURM_NTASKS python3 rayleigh_benard.py

#copy output to $HOME

srun -N $SLURM_NNODES --ntasks-per-node=28 \
sh -c 'cp $LOCAL/* /home/nanders4/DedalusExamples/'
==============================================

But when I increase the number of nodes to 2, I get the following error from ranks 28-55 (the processes running on the second node):
2018-06-19 14:36:46,995 __main__ 28/56 ERROR :: Exception raised, triggering end of main loop.

For the 1-node case, several snapshots folders were output, each coming from a different process, but for the 2-node case I only got snapshots_s1.

What do I need to do to allow a Dedalus script, say the 3D Rayleigh–Bénard example, to use several nodes?

-Nick

Daniel Lecoanet

Jun 19, 2018, 2:51:32 PM
to dedalu...@googlegroups.com
Hi Nick,

Normally there's another error message above the "ERROR :: Exception raised, triggering end of main loop." line -- could you try to find it? It's usually more descriptive.

Daniel


--
You received this message because you are subscribed to the Google Groups "Dedalus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dedalus-users+unsubscribe@googlegroups.com.
To post to this group, send email to dedalu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dedalus-users/70b7dfcd-7d0d-4eb0-88b5-bc8321c4f309%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nand...@swarthmore.edu

Jun 19, 2018, 3:01:39 PM
to Dedalus Users
Not really; here's a larger excerpt:
+ mpirun -np 56 python3 rayleigh_benard.py
2018-06-19 14:36:40,161 pencil 0/56 INFO :: Building pencil matrix 1/128 (~1%) Elapsed: 0s, Remaining: 6s, Rate: 2.1e+01/s
2018-06-19 14:36:40,703 pencil 0/56 INFO :: Building pencil matrix 13/128 (~10%) Elapsed: 1s, Remaining: 5s, Rate: 2.2e+01/s
2018-06-19 14:36:41,291 pencil 0/56 INFO :: Building pencil matrix 26/128 (~20%) Elapsed: 1s, Remaining: 5s, Rate: 2.2e+01/s
2018-06-19 14:36:41,879 pencil 0/56 INFO :: Building pencil matrix 39/128 (~30%) Elapsed: 2s, Remaining: 4s, Rate: 2.2e+01/s
2018-06-19 14:36:42,469 pencil 0/56 INFO :: Building pencil matrix 52/128 (~41%) Elapsed: 2s, Remaining: 3s, Rate: 2.2e+01/s
2018-06-19 14:36:43,071 pencil 0/56 INFO :: Building pencil matrix 65/128 (~51%) Elapsed: 3s, Remaining: 3s, Rate: 2.2e+01/s
2018-06-19 14:36:43,670 pencil 0/56 INFO :: Building pencil matrix 78/128 (~61%) Elapsed: 4s, Remaining: 2s, Rate: 2.2e+01/s
2018-06-19 14:36:44,270 pencil 0/56 INFO :: Building pencil matrix 91/128 (~71%) Elapsed: 4s, Remaining: 2s, Rate: 2.2e+01/s
2018-06-19 14:36:44,873 pencil 0/56 INFO :: Building pencil matrix 104/128 (~81%) Elapsed: 5s, Remaining: 1s, Rate: 2.2e+01/s
2018-06-19 14:36:45,472 pencil 0/56 INFO :: Building pencil matrix 117/128 (~91%) Elapsed: 5s, Remaining: 1s, Rate: 2.2e+01/s
2018-06-19 14:36:45,975 pencil 0/56 INFO :: Building pencil matrix 128/128 (~100%) Elapsed: 6s, Remaining: 0s, Rate: 2.2e+01/s
2018-06-19 14:36:45,980 __main__ 0/56 INFO :: Solver built
2018-06-19 14:36:46,572 __main__ 0/56 INFO :: Initialization time: 7.008661
2018-06-19 14:36:46,572 __main__ 0/56 INFO :: Starting loop

2018-06-19 14:36:46,995 __main__ 28/56 ERROR :: Exception raised, triggering end of main loop.
.
.
.

What might be helpful is that farther down, after the processes have failed, there's this:
Traceback (most recent call last):
  File "rayleigh_benard.py", line 136, in <module>
    solver.step(dt)
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/solvers.py", line 483, in step
    self.timestepper.step(self, dt)
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/timesteppers.py", line 111, in step
    evaluator.evaluate_scheduled(**evaluator_kw)
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/evaluator.py", line 107, in evaluate_scheduled
    self.evaluate_handlers(scheduled_handlers, wall_time=wall_time, sim_time=sim_time, iteration=iteration, **kw)
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/evaluator.py", line 153, in evaluate_handlers
    handler.process(**kw)
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/evaluator.py", line 544, in process
    file = self.get_file()
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/evaluator.py", line 418, in get_file
    self.create_current_file()
  File "/home/nanders4/dedalus/src/dedalus/dedalus/core/evaluator.py", line 458, in create_current_file
    file = h5py.File(str(self.current_path), 'w-')
  File "/home/nanders4/dedalus/lib/python3.6/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/nanders4/dedalus/lib/python3.6/site-packages/h5py/_hl/files.py", line 146, in make_fid
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = '/home/nanders4/DedalusExamples/snapshots/snapshots_s1/snapshots_s1_p45.h5', errno = 2, error message = 'No such file or directory', flags = 15, o_flags = c2)

It's repeated many times with different values of X in ...snapshots_s1_pX.h5, ranging from 28 to 55, but I think this is more a result of the failed processes than the cause.

Ben Brown

Jun 19, 2018, 3:02:41 PM
to dedalu...@googlegroups.com
Nick,
    How many cores are you trying to run on with 1 and 2 nodes respectively?

—Ben

Ben Brown

Jun 19, 2018, 3:03:59 PM
to dedalu...@googlegroups.com
Never mind; I saw your follow-up message. Could you try running on 32 cores across two nodes and let us know if that works?

nand...@swarthmore.edu

Jun 19, 2018, 3:25:51 PM
to Dedalus Users
Yes, I just ran it on two nodes with 16 tasks per node, but it's more or less the same:

...
2018-06-19 15:18:41,959 pencil 0/32 INFO :: Building pencil matrix 128/128 (~100%) Elapsed: 6s, Remaining: 0s, Rate: 2.3e+01/s
2018-06-19 15:18:41,963 __main__ 0/32 INFO :: Solver built
2018-06-19 15:18:56,154 __main__ 0/32 INFO :: Initialization time: 19.888969
2018-06-19 15:18:56,154 __main__ 0/32 INFO :: Starting loop
2018-06-19 15:19:03,462 __main__ 30/32 ERROR :: Exception raised, triggering end of main loop.
2018-06-19 15:19:03,462 __main__ 31/32 ERROR :: Exception raised, triggering end of main loop.
2018-06-19 15:19:03,462 __main__ 29/32 ERROR :: Exception raised, triggering end of main loop.
2018-06-19 15:19:03,462 __main__ 28/32 ERROR :: Exception raised, triggering end of main loop.


Traceback (most recent call last):

...as before

Jeffrey S. Oishi

Jun 19, 2018, 3:39:33 PM
to dedalu...@googlegroups.com
Hi Nick,

We've seen something like this before. If it's the same problem, it comes from an NFS configuration in which some nodes don't see the creation of the output directory (which is done by the root process) before trying to write into it. You can add

FILEHANDLER_TOUCH_TMPFILE = True

to your dedalus.cfg file as a workaround, but I think that's the wrong approach here.
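For reference, that option goes in the config file Dedalus reads at startup; a minimal sketch, assuming the file-handler options live under an [analysis] section (check the dedalus.cfg shipped with your installation for the exact section name):

```
[analysis]
FILEHANDLER_TOUCH_TMPFILE = True
```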

Instead, on Bridges (and any other XSEDE resource), you should write your data to $SCRATCH, not your home directory. I would try writing to $SCRATCH first and see if that fixes the problem. $SCRATCH uses the Lustre parallel filesystem, which shouldn't have this issue. You can find details of how to do that in the Bridges documentation.
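To make the failure mode concrete: rank 0 creates the snapshot set directory, and every rank then exclusively creates its own process file inside it; on an NFS mount, the other nodes may not yet see the new directory and fail with errno 2, exactly as in the traceback above. A serial, hypothetical sketch of the pattern (plain files stand in for the h5py calls; no MPI or Dedalus required):

```python
import os
import tempfile

def write_snapshot(base, rank):
    """Mimic one Dedalus file handler: each rank writes snapshots_s1_p<rank>.h5.

    Only rank 0 creates the set directory. On NFS, another node can attempt
    its exclusive create before the directory is visible to it, raising
    errno 2. The touch-tmpfile workaround amounts to each rank poking the
    path first, modeled here by an idempotent makedirs on every rank.
    """
    setdir = os.path.join(base, "snapshots_s1")
    if rank == 0:
        os.makedirs(setdir, exist_ok=True)  # root process creates the set directory
    os.makedirs(setdir, exist_ok=True)      # workaround analogue: force the dir into view
    path = os.path.join(setdir, f"snapshots_s1_p{rank}.h5")
    with open(path, "xb"):                  # 'x' mirrors h5py's exclusive 'w-' mode
        pass
    return path

base = tempfile.mkdtemp()
paths = [write_snapshot(base, rank) for rank in range(4)]
```

Run serially this always succeeds; the race only appears across NFS clients, which is why moving the output to the Lustre-backed $SCRATCH sidesteps it entirely.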

Let us know if that helps!

Jeff


nand...@swarthmore.edu

Jun 20, 2018, 9:05:46 AM
to Dedalus Users
That fixed everything, thank you!

Jeffrey S. Oishi

Jun 20, 2018, 9:43:17 AM
to dedalu...@googlegroups.com
Fantastic!


