MPI code is slow compared to serial code on one processor


Praveen C

Nov 5, 2014, 12:37:11 AM
to Deal.II Googlegroup
Dear all

I have a parallel DG code using Trilinos MPI vectors, and I am trying to
make it fast. Here is an example of my problem. The vectors newton_update
and right_hand_side are initialized as

   right_hand_side.reinit (locally_owned_dofs, mpi_communicator);
   newton_update.reinit (locally_owned_dofs, mpi_communicator);


Here is a piece of code that I am timing in both the serial code and the
MPI code, with the latter run on one processor.

   for (; cell != endc; ++cell)
     if (cell->is_locally_owned())
       {
         const unsigned int cell_no = cell_number (cell);

         cell->get_dof_indices (dof_indices);
         for (unsigned int i = 0; i < fe.dofs_per_cell; ++i)
           newton_update(dof_indices[i]) = dt(cell_no) *
                                           right_hand_side(dof_indices[i]) *
                                           inv_mass_matrix[cell_no][i];
       }



dt and inv_mass_matrix are serial vectors.

Timings are

serial code = 24.4 sec
mpi code    = 121 sec

This seems like a large difference. What am I doing wrong?

In another post, Martin suggested I try deal.II's parallel vectors. Would
that help in this situation?

Thanks
praveen

Martin Kronbichler

Nov 5, 2014, 11:18:37 AM
to dea...@googlegroups.com
Dear Praveen,

> serial code = 24.4 sec
> mpi code = 121 sec
>
>
> This seems like a large difference. What am I doing wrong?

It is difficult to say with the information you give. Is the difference
between 'serial code' and 'mpi code' that you use
TrilinosWrappers::MPI::Vector instead of TrilinosWrappers::Vector? Or is
the first number from the program run on one processor and the second
from a run on two or more processors?

To find out more yourself, I suggest you use some profiler and check
where the bottlenecks of the program are. valgrind's callgrind is one
example (with MPI, run "mpirun -n 2 valgrind
--tool=callgrind ./program_name"), but there are also commercial tools
like Intel's VTune.

I would definitely not be surprised if running only the code you show on
two processors is slower than on one processor. For DG, your code
essentially scales right_hand_side by the inverse diagonal mass matrix
and the time step. If run in serial, all vector accesses are at least
close to direct array access for Trilinos vectors (there are some checks,
but the CPU should be able to predict those branches perfectly). If run
in parallel, you need to do index computations. In particular, if your
right_hand_side vector includes ghosts, that is going to be quite
expensive.

> In another post, Martin suggested I try deal.II's parallel vectors.
> Would that help in this situation?

For the code you show, parallel::distributed::Vector should be faster
than Trilinos vectors. If your Trilinos right_hand_side vector includes
ghosts, the difference could easily be a factor of 3-4. For a Trilinos
vector with one contiguous range, the two vectors should not be far
apart, because then both vectors need to translate the locally owned
range into the index range [0, local_size). But even then I am just
guessing, so a profiler will reveal what actually happens.
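
To make the suggestion a bit more concrete, here is a rough sketch of how
such vectors could be set up. This is only an illustration: it assumes
the usual dof_handler, locally_owned_dofs and mpi_communicator objects
from your program, so adapt the names to your code.

   #include <deal.II/dofs/dof_tools.h>
   #include <deal.II/lac/parallel_vector.h>

   // ghost entries are needed for the vector that is read from:
   IndexSet locally_relevant_dofs;
   DoFTools::extract_locally_relevant_dofs (dof_handler,
                                            locally_relevant_dofs);

   parallel::distributed::Vector<double> right_hand_side;
   parallel::distributed::Vector<double> newton_update;

   // locally owned range plus ghosts for the vector that is read ...
   right_hand_side.reinit (locally_owned_dofs, locally_relevant_dofs,
                           mpi_communicator);
   // ... and only the locally owned range for the vector that is written:
   newton_update.reinit (locally_owned_dofs, mpi_communicator);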

Best,
Martin


Praveen C

Nov 5, 2014, 11:24:48 AM
to Deal.II Googlegroup
Hello Martin

In the serial code I use Vector<double>, and in the MPI code I use
TrilinosWrappers::MPI::Vector.

The timings I report are both for a single processor: I run the MPI code
on one processor to compare it with the serial code. That is why I am
surprised the MPI code is so much slower than the serial one.

I will try the profiling you suggest.

Best regards
praveen




Timo Heister

Nov 5, 2014, 11:28:29 AM
to dea...@googlegroups.com
Is your serial code using multithreading? Some of the vector operations
use multithreading under the hood. Check with 'top' whether you get more
than 100% usage (or look at the output of "time ./program"). If yes, the
comparison is not fair this way, and you need to run with several MPI
tasks.
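
One way to make the comparison fair is to explicitly limit deal.II to a
single thread per MPI process. A minimal sketch, assuming a recent
deal.II version where MPI_InitFinalize accepts the thread limit as its
third argument (otherwise MultithreadInfo::set_thread_limit(1) or the
DEAL_II_NUM_THREADS environment variable serve the same purpose):

   #include <deal.II/base/mpi.h>

   int main (int argc, char **argv)
   {
     // limit each MPI process to one thread so that the 'serial' run and
     // the 'mpi on one core' run use the same resources
     dealii::Utilities::MPI::MPI_InitFinalize mpi_initialization (argc,
                                                                  argv, 1);

     // ... rest of the program as before ...

     return 0;
   }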
--
Timo Heister
http://www.math.clemson.edu/~heister/

Martin Kronbichler

Nov 5, 2014, 11:38:52 AM
to dea...@googlegroups.com
Hi Praveen,

> In the serial code I use Vector<double>, and in the MPI code I use
> TrilinosWrappers::MPI::Vector.
>
> The timings I report are both for a single processor: I run the MPI
> code on one processor to compare it with the serial code. That is why
> I am surprised the MPI code is so much slower than the serial one.

Then what you see is the difference between direct array access
(Vector<double>) and the translation between global and local indices in
the case where the translation actually does nothing. I am a bit
surprised that it costs that much, but then again it does not really
surprise me that the cost of the index recomputation shows up. Using
global indices in a code where vector access is an important part - as
in your case - is almost always a bad idea. This is why we introduced
parallel::distributed::Vector in MatrixFree, with a way to get direct
array access in local index space inside the algorithms. But even that
turns out to be too expensive in some of our algorithms (we would need
to vectorize reads/writes as much as possible, but that is lower on my
agenda than certain other matrix-free functionality).
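
To illustrate what I mean by direct array access in local index space,
here is a small sketch (nothing more than an illustration) using the
local_element() interface of parallel::distributed::Vector:

   // Locally owned entries live in the contiguous local range
   // [0, local_size()) and can be read and written without any
   // global-to-local index translation. The copy below is only meant
   // to show the access pattern.
   for (unsigned int i = 0; i < newton_update.local_size(); ++i)
     newton_update.local_element(i) = right_hand_side.local_element(i);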

As a side note, my guess is that the cost of
parallel::distributed::Vector sits somewhere between Vector<double> and
TrilinosWrappers::MPI::Vector.

If the operation you show is the only one you worry about regarding
parallel speed, my suggestion would be to put the inverse of the diagonal
mass matrix into a vector with the same layout as right_hand_side and use
Vector::scale/add operations, as sketched below. (And if you do, make
sure to be fair with multithreading, as Timo says, because the deal.II
vectors do those operations multithreaded.)
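
As a sketch, with a hypothetical vector inv_mass_times_dt that you would
have to build yourself (same parallel layout as right_hand_side, holding
dt(cell_no) * inv_mass_matrix[cell_no][i] for each locally owned degree
of freedom):

   // fill once per time step, cell by cell, e.g.
   //   inv_mass_times_dt(dof_indices[i]) = dt(cell_no) *
   //                                       inv_mass_matrix[cell_no][i];
   //
   // the per-dof update then becomes two whole-vector operations:
   newton_update = right_hand_side;          // plain copy
   newton_update.scale (inv_mass_times_dt);  // entry-wise multiplication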

Best,
Martin


Praveen C

Nov 5, 2014, 11:23:22 PM
to Deal.II Googlegroup
Hello Martin 

In the DG scheme, many vector accesses are required. I will use
parallel::distributed::Vector and see if it improves the speed. I will
also replace operations with scale/add/sadd calls wherever possible.

By the way, in the documentation, the link

   IAMCS preprint 2011-187
   (http://iamcs.tamu.edu/file_dl.php?type=preprint&preprint_id=237)

says "no results found".

Best regards
praveen




Wolfgang Bangerth

Nov 5, 2014, 11:29:19 PM
to dea...@googlegroups.com

> By the way, in the documentation, the link
>
> IAMCS preprint 2011-187
> <http://iamcs.tamu.edu/file_dl.php?type=preprint&preprint_id=237>
>
> says no results found.

Yes, they must have moved the files around. I've just removed the link altogether.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@math.tamu.edu
www: http://www.math.tamu.edu/~bangerth/
