Scaling studies for isentropic vortex

Robert Sawko

Mar 2, 2017, 12:19:41 PM
to pyfrmai...@googlegroups.com
Hi,

I posted my first go at benchmarking PyFR with the CUDA backend.

https://github.com/robertsawko/PyFR-bench

I'd like to work more on it and extend it to the remaining cases in your JCP
paper.

The actual study is in isentropic_vortex.sh, and I apologise for the amount of
cluster-specific content. There is some LSF data manager setup in there which
may be confusing if you haven't seen it before.


I am testing the code on Power8 nodes with four GPUs each, arranged as two
GPUs per socket. At the moment the results are pretty random and I don't see
weak or strong scaling at all. This may be down to our cluster or to my
installation, but I thought I'd share the files and ask if you have any
feedback.

The only thing I was able to verify is that the installation does spawn
processes across GPUs on different nodes.

Let me know if you have any ideas. Also, please let me know if this is the
right place for this kind of discussion.

Best wishes,
Robert
--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington
WA4 4AD
United Kingdom
--
Email (IBM): RSa...@uk.ibm.com
Email (STFC): robert...@stfc.ac.uk
Phone (office): +44 (0) 1925 60 3301
Phone (mobile): +44 778 830 8522
Profile page:
http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko

Freddie Witherden

Mar 2, 2017, 5:45:35 PM
to pyfrmai...@googlegroups.com
Hi Robert,

The vortex test case is 2D and runs on a grid with relatively few elements.
As such, even running the case as-is you are already close to the strong
scaling limit; the working set here is almost small enough to fit into cache!

The case should therefore not be used for benchmarking. For that you will
want to consider a 3D Navier-Stokes case, which is also somewhat more
realistic than the 2D Euler equations.

Regards, Freddie.

Robert Sawko

Mar 2, 2017, 9:12:46 PM
to Freddie Witherden, pyfrmai...@googlegroups.com
Great... Yes, I didn't estimate the degrees of freedom for this case; I was
trying to be too quick. I've uploaded the sd7003 case. I added residual
printing and I can already see a 2x speedup going from one to two nodes. I am
running a full test now.


I have several related questions and comments:
1) What is [backend] rank-allocator = linear? Does this not conflict with MPI
   options, e.g. -rank-by from OMPI or the binding policy of MVAPICH? This is
   significant for me as I have two GPUs per socket and 64 hardware threads
   per socket. I don't want 4 processes to run on the first socket alone.

   I print my bindings in MVAPICH and they look OK, but I want to double check
   that Python is not doing something else under the hood.

2) What is the rough DoF estimate for the strong scaling limit you observed
with PyFR?

3) At the moment I am setting 4 MPI processes per node as I've got 4 GPUs, but
   I assume there's nothing to stop me from using more. Has anyone looked at
   the optimal ratio of MPI processes to GPUs?

Thanks,

Freddie Witherden

Mar 2, 2017, 9:26:20 PM
to pyfrmai...@googlegroups.com
Hi Robert,

On 02/03/2017 21:12, Robert Sawko wrote:
> Great... Yes, I didn't estimate the degrees of freedom for this case; I was
> trying to be too quick. I've uploaded the sd7003 case. I added residual
> printing and I can already see a 2x speedup going from one to two nodes. I
> am running a full test now.
>
>
> I have several related questions and comments:
> 1) What is [backend] rank-allocator = linear? Does this not conflict with MPI
> options, e.g. -rank-by from OMPI or the binding policy of MVAPICH? This is
> significant for me as I have two GPUs per socket and 64 hardware threads
> per socket. I don't want 4 processes to run on the first socket alone.

So the rank allocator decides how partitions in the mesh should be
mapped onto MPI ranks. The linear allocator is exactly what you would
expect: the first MPI rank gets the first partition, and so on and so
forth. There is also a random allocator. Having four processes on one
socket is probably okay; I doubt you would notice much of a difference
compared to an even split. When running with the CUDA backend PyFR is
entirely single threaded and offloads all relevant computation to the
GPU. We also work hard to mask latencies for host-to-device transfers
so even sub-optimal assignments usually work out.

> I print my bindings in MVAPICH and they look OK, but I want to double check
> that Python is not doing something else under the hood.
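
For what it is worth, one way to double check this independently of PyFR is a
tiny standalone script launched with exactly the same mpirun/mpiexec line you
use for PyFR. This is only a sketch (it assumes mpi4py and Linux; the file
name is arbitrary):

# check_binding.py -- standalone sanity check, not part of PyFR.
# Launch with the same mpirun line used for PyFR, e.g.
#   mpirun -np 8 python check_binding.py
import os
import socket

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Node-local rank via a shared-memory communicator split
local_rank = comm.Split_type(MPI.COMM_TYPE_SHARED).Get_rank()

# CPU affinity of this process (Linux only) and any GPU restriction
# the launcher may have exported
cpus = sorted(os.sched_getaffinity(0))
gpus = os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')

print('rank {} (local {}) on {}: cpus={} CUDA_VISIBLE_DEVICES={}'
      .format(rank, local_rank, socket.gethostname(), cpus, gpus))

If the reported CPU sets and hostnames match what MVAPICH claims, then nothing
on the Python side is interfering with your bindings.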
>
> 2) What is the rough DoF estimate for the strong scaling limit you observed
> with PyFR?

That is a good question and one which I do not have much of a feel for.
I would say that you want on the order of a thousand *elements* per GPU,
and below that you may begin to experience a flattening out of your
strong scaling curve.
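
To put that in DoF terms for a 3D Navier-Stokes case on hexahedra (a
back-of-the-envelope sketch only; the orders and the five conserved variables
below are my assumptions, not measured figures):

# Rough DoF-per-GPU figure at ~1000 elements per GPU (illustrative only)
def dof_per_gpu(nele, order, nvars=5):
    # Hexahedra carry (order + 1)**3 solution points, nvars fields each
    return nele * (order + 1)**3 * nvars

for p in (3, 4):
    print('p = {}: ~{:,} DoF per GPU'.format(p, dof_per_gpu(1000, p)))
# p = 3: ~320,000 DoF per GPU
# p = 4: ~625,000 DoF per GPU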

> 3) At the moment I am setting 4 MPI processes per node as I've got 4 GPUs,
> but I assume there's nothing to stop me from using more. Has anyone looked
> at the optimal ratio of MPI processes to GPUs?

One MPI rank per GPU will be optimal. Anything else will just introduce
extra overheads.
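
If you ever do want to manage the rank-to-GPU mapping yourself rather than
leaving it to the launcher, the usual pattern is to derive a device index from
the node-local rank. Again, this is just a sketch of the general pattern
rather than anything specific to PyFR:

# Map each rank to one GPU via its node-local rank (illustration only)
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_rank = comm.Split_type(MPI.COMM_TYPE_SHARED).Get_rank()

ngpus_per_node = 4  # assumption: 4 GPUs per Power8 node, 4 ranks per node

device = local_rank % ngpus_per_node
print('rank {} -> node-local rank {} -> GPU {}'
      .format(comm.Get_rank(), local_rank, device))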

Regards, Freddie.

Vincent, Peter E

Mar 10, 2017, 10:18:53 AM
to PyFR Mailing List, Freddie Witherden
Hi Robert

>> 2) What is the rough DoF estimate for the strong scaling limit you observed
>> with PyFR?
>
> That is a good question and one which I do not have much of a feel for. I would say that you want on the order of a thousand *elements* per GPU, and below that you may begin to experience a flattening out of your strong scaling curve.

Based on our Gordon Bell results on the K20X, things start to tail off for 3D compressible Navier-Stokes when you get down to ~1 element per CUDA core, where an element here is a P4 hexahedron with 5 x 5 x 5 solution points and 5 field variables, i.e. 625 DoF.
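
In raw DoF terms (my arithmetic below, using the published K20X core count
rather than a figure from the paper) that works out to roughly:

# ~1 element per CUDA core on a K20X, P4 hexes with 5 fields
cuda_cores = 2688           # Tesla K20X spec (assumption, not from the paper)
dof_per_ele = 5**3 * 5      # 5x5x5 solution points x 5 field variables = 625

print('{:,} elements -> ~{:,} DoF per GPU'
      .format(cuda_cores, cuda_cores * dof_per_ele))
# 2,688 elements -> ~1,680,000 DoF per GPU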

Cheers

Peter

Dr Peter Vincent MSci ARCS DIC PhD
Reader in Aeronautics and EPSRC Fellow
Department of Aeronautics
Imperial College London
South Kensington
London
SW7 2AZ
UK

web: www.imperial.ac.uk/aeronautics/research/vincentlab
twitter: @Vincent_Lab
