Dear Praveen,
> serial code = 24.4 sec
> mpi code = 121 sec
>
>
> This seems like a large difference. What am I doing wrong?
It is difficult to say from the information you give. Is the difference
between 'serial code' and 'mpi code' that you use
TrilinosWrappers::MPI::Vector instead of TrilinosWrappers::Vector? Or is
the first number the program run on one processor and the second the
same program run on two or more processors?
To find out more yourself, I suggest you use some profiler and check
where the bottlenecks of the program are. valgrind's callgrind is one
example (with MPI, run "mpirun -n 2 valgrind
--tool=callgrind ./program_name"), but there are also commercial tools
like Intel's VTune.
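With callgrind, each MPI rank writes its own callgrind.out.<pid> file,
which you can then inspect with callgrind_annotate or kcachegrind.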
I would definitely not be surprised if running only the code you show
is slower on two processors than on one. For DG, your code essentially
adds right_hand_side, scaled by the inverse diagonal mass matrix and the
time step, to the solution. If run in serial, all vector accesses are at
least close to direct array accesses for Trilinos vectors (there are
some checks, but the CPU should be able to predict those branches
perfectly). If run in parallel, you need to do index computations. In
particular, if your right_hand_side vector includes ghost entries, that
is going to be quite expensive.
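
To illustrate what I mean by index computations, here is a rough sketch
of the kind of update I have in mind. I am making up the function and
the vector names here, so your code certainly looks different in detail:

  #include <deal.II/base/index_set.h>
  #include <deal.II/base/types.h>
  #include <deal.II/lac/trilinos_vector.h>

  using namespace dealii;

  // Explicit update solution += dt * M^{-1} * rhs, written as a loop
  // over the locally owned degrees of freedom.
  void explicit_update(TrilinosWrappers::MPI::Vector       &solution,
                       const TrilinosWrappers::MPI::Vector &inv_mass_diagonal,
                       const TrilinosWrappers::MPI::Vector &right_hand_side,
                       const IndexSet                      &locally_owned_dofs,
                       const double                         time_step)
  {
    for (unsigned int i = 0; i < locally_owned_dofs.n_elements(); ++i)
      {
        const types::global_dof_index gi =
          locally_owned_dofs.nth_index_in_set(i);
        // Each operator() call has to translate the global index gi into
        // a position inside the local Trilinos array: cheap in serial,
        // not for free with MPI, and particularly costly on ghosted
        // vectors.
        solution(gi) +=
          time_step * inv_mass_diagonal(gi) * right_hand_side(gi);
      }
  }
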
> In another post, Martin suggested I try deal.II parallel vectors. Would
> that help in this situation?
On the code you show, parallel::distributed::Vector should be faster
than Trilinos vectors. If your Trilinos right_hand_side vector includes
ghosts, the difference could easily be a factor of 3-4. For a Trilinos
vector with one contiguous locally owned range, the two vector classes
should not be far apart, because then both need to translate the locally
owned range into the index range [0, local_size). But even so, I am just
guessing, so the profiler will reveal what actually happens.
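
Just to make the comparison concrete, the same update with
parallel::distributed::Vector could look roughly like this (again with
made-up names); here one works directly on the locally owned range, so
there is no global-to-local index translation at all:

  #include <deal.II/lac/parallel_vector.h>

  using namespace dealii;

  // Same update as above, but acting directly on the local array.
  void explicit_update(parallel::distributed::Vector<double>       &solution,
                       const parallel::distributed::Vector<double> &inv_mass_diagonal,
                       const parallel::distributed::Vector<double> &right_hand_side,
                       const double                                 time_step)
  {
    for (unsigned int i = 0; i < solution.local_size(); ++i)
      solution.local_element(i) += time_step *
                                   inv_mass_diagonal.local_element(i) *
                                   right_hand_side.local_element(i);
  }
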
Best,
Martin