Jie,
> First I want to say, it is so much easier and more pleasant to work
> with the PETScWrappers in deal.II than writing PETSc code in C.
> Developing finite element code based on deal.II is 1000 times more
> efficient than writing from scratch!
Thank you for the kind words!
> Recently I parallelized a time-dependent linear elasticity code with
> the PETScWrappers, based on step-40. I tested its performance on my
> desktop, but I am not sure whether the result is "normal".
>
> The test case is the bending of a 3D cantilever beam. There are 262144
> active cells and 839619 DoFs. I ran 2 time steps, which involve two
> assembly and solve calls, on 1, 2, 4, and 8 processors. The wall
> times look like this:
>
>  ranks | assemble (s) | solve (s) | total (s)
> -------+--------------+-----------+----------
>  n = 1 |        274.0 |     44.44 |    362.0
>  n = 2 |       140.50 |     27.54 |    192.0
>  n = 4 |        75.72 |     17.12 |    106.3
>  n = 8 |        64.74 |     16.62 |    92.82
>
> The speedup from n = 1 up to n = 4 is quite clear, but from n = 4 to
> n = 8 it is insignificant; in fact, running with 8 ranks is sometimes
> even slower than running with 4 ranks. My desktop has one Intel
> i7-6700K CPU, which has 4 cores but 8 threads, and 16 GB of memory. I
> do not quite understand the difference between a "thread" and a
> "rank". Should I expect the performance to scale up to 4 or up to 8
> MPI ranks?
You can't expect to gain a factor of 2 when going from 4 to 8 MPI ranks
on this processor. That's because the i7-6700K has only 4 real cores, see
https://en.wikipedia.org/wiki/Intel_Core#Core_i7
which means that there are four physical processing units on this chip.
But each of them presents itself as two "virtual cores", i.e., it can
execute two threads at the same time, even though it really only has
the resources for one instruction stream at a time (roughly speaking;
I am glossing over details here). This helps
because in reality instructions often sit idle waiting for data to
arrive from memory, and in this case the physical infrastructure can
work on an instruction from the other thread. In your case, this
improves performance by 10-15% (your table shows 106.3s / 92.82s ≈ 1.15
for the total time), but ultimately you are still limited by the fact
that your processor only has four units to do floating-point addition,
four units to do floating-point multiplication, etc. -- because
it really only has 4 cores.
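If it helps to make the thread/rank distinction concrete: a "rank" is
an MPI process you start via mpirun, whereas the hardware threads are
what the operating system reports as "CPUs". Here is a minimal, purely
illustrative sketch (not code from your program) that prints both
numbers; note that MultithreadInfo::n_cores() reports what the OS
exposes, i.e., logical CPUs:

  #include <deal.II/base/mpi.h>
  #include <deal.II/base/multithread_info.h>
  #include <iostream>

  int main (int argc, char *argv[])
  {
    using namespace dealii;

    // Initialize MPI; the last argument limits each rank to one thread.
    Utilities::MPI::MPI_InitFinalize mpi_init (argc, argv, 1);

    // Only rank 0 prints, to avoid duplicated output.
    if (Utilities::MPI::this_mpi_process (MPI_COMM_WORLD) == 0)
      std::cout << "MPI ranks: "
                << Utilities::MPI::n_mpi_processes (MPI_COMM_WORLD)
                << "\nlogical CPUs (hardware threads): "
                << MultithreadInfo::n_cores()
                << std::endl;
  }

On your machine, this would report 8 logical CPUs no matter how many
ranks you start, because the 8 hardware threads live on only 4
physical cores.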
> Another general question: in distributed parallel applications, I use
> temporary objects all the time to copy a non-ghosted vector to a
> ghosted vector, or vice versa. For example, if I use a non-ghosted
> vector to store my solution, I have to copy it to a ghosted vector
> when I output results or refine the mesh. On the other hand, if I use
> a ghosted vector to store my solution, I have to copy it to a
> non-ghosted vector when I manipulate it with
> PETScWrappers::VectorBase::add (for example when subtracting time
> discretization terms from it).
> I want to ask: is this copy operation expensive? Is there a way to
> avoid it?
It's almost certainly not expensive enough for you to worry about. A
copy touches each vector entry once and communicates only the ghost
entries, whereas a matrix-vector multiplication has to touch every
nonzero of the matrix -- so the copy is significantly cheaper than
even one matrix-vector product.
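For example, in the style of step-40, updating a ghosted copy of the
solution looks like this (a minimal sketch; the index sets and
communicator are assumed to be set up as usual, and the header was
called petsc_parallel_vector.h in older deal.II versions):

  #include <deal.II/base/index_set.h>
  #include <deal.II/lac/petsc_vector.h>

  using namespace dealii;

  void update_ghosted_copy (const IndexSet &locally_owned_dofs,
                            const IndexSet &locally_relevant_dofs,
                            MPI_Comm        mpi_communicator)
  {
    // Non-ghosted vector: stores only the locally owned entries and is
    // the one you write into (solver output, VectorBase::add(), ...).
    PETScWrappers::MPI::Vector solution;
    solution.reinit (locally_owned_dofs, mpi_communicator);

    // Ghosted vector: additionally holds read-only copies of the ghost
    // entries needed for output and mesh refinement.
    PETScWrappers::MPI::Vector ghosted_solution;
    ghosted_solution.reinit (locally_owned_dofs,
                             locally_relevant_dofs,
                             mpi_communicator);

    // The copy: operator= moves the locally owned entries and triggers
    // the communication that updates the ghost entries.
    ghosted_solution = solution;
  }

The communication involved is only with the few neighboring processes
that own your ghost entries, which is why it is so much cheaper than a
matrix-vector product.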
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth          email: bang...@colostate.edu
                           www:   http://www.math.colostate.edu/~bangerth/