Speed up for Step-12


neeraj sarna

Apr 17, 2016, 7:17:56 AM4/17/16
to deal.II User Group
Hello everyone,

I am presently working on a linear hyperbolic system, which I have been able to solve with my own assembly routine in a DG framework. I now plan to use MeshWorker::loop because I think it can give me better performance. So before implementing MeshWorker::loop for my problem directly, I have been exploring step-12 from the tutorials.

Since MeshWorker::loop uses TBB by default, I have been studying the time taken during assembly for different numbers of threads on my personal machine (max 4 threads). Surprisingly, I do not see much of an improvement when changing the number of threads. The observation remains the same when I run the program on my university's cluster (max 64 threads).
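
Roughly, the driver I use for each measurement looks like the following sketch (this is just the shape of the code, not my exact program; the assemble callable stands in for the step-12 style assemble_system()):

#include <deal.II/base/multithread_info.h>
#include <deal.II/base/timer.h>

#include <functional>
#include <iostream>

using namespace dealii;

// Cap the TBB thread pool at n_threads and report the wall time spent in
// the given assembly routine.
void time_assembly(const unsigned int n_threads,
                   const std::function<void()> &assemble)
{
  MultithreadInfo::set_thread_limit(n_threads);

  Timer timer;   // the default constructor starts the timer
  assemble();    // e.g. the MeshWorker::loop based assembly
  timer.stop();

  std::cout << "threads = " << n_threads
            << ", assembly wall time = " << timer.wall_time() << " s"
            << std::endl;
}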

Details of the system are:
1.) no of DOFs = 16384
2.) no of active cells = 4096

The following are the results of the experiment done on the cluster; the timings were calculated using the built-in timer of deal.II:

Threads    Total time    % time on assembly
 1         0.425 s       13
 4         0.508 s        3.8
 8         0.348 s        7.5
16         0.362 s       10.0
32         0.404 s       15

It would be really helpful if someone could explain why I am getting such poor performance.

Thanks,
Neeraj

Martin Kronbichler

Apr 18, 2016, 5:13:46 AM4/18/16
to dea...@googlegroups.com
Dear Neeraj,

the speedup you observe is indeed not very good. I have a few questions:

> The observation remains the same when I run the program on my
> university's cluster (max 64 threads).
>
> Details of the system are:
> 1.) no of DOFs = 16384
> 2.) no of active cells = 4096

This looks like a very small problem. How exactly have you been using
the parallelism? Did you call MultithreadInfo::set_thread_limit()? How
did you run the loops? The standard WorkStream loop that is used by
MeshWorker, IIRC, does the writing to the matrix in serial. If writing
to the matrix takes a significant amount of time, that might be an
explanation. Better performance should be possible by using coloring,
where you give WorkStream a colored list of cells that can be worked on
independently; see the WorkStream paper mentioned in the glossary.
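
Schematically, the colored variant looks like this (just a sketch of the
general WorkStream mechanism, not of MeshWorker::loop itself; the
ScratchData/CopyData objects are empty placeholders and the quadrature
work in the worker is elided):

#include <deal.II/base/graph_coloring.h>
#include <deal.II/base/work_stream.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/lac/full_matrix.h>
#include <deal.II/lac/sparse_matrix.h>

#include <functional>
#include <vector>

using namespace dealii;

template <int dim>
void assemble_colored(const DoFHandler<dim> &dof_handler,
                      SparseMatrix<double>  &system_matrix)
{
  typedef typename DoFHandler<dim>::active_cell_iterator CellIterator;
  const unsigned int dofs_per_cell = dof_handler.get_fe().dofs_per_cell;

  // Placeholders: in a real code, FEValues etc. live in ScratchData and
  // the local right-hand side would sit next to the cell matrix.
  struct ScratchData {};
  struct CopyData
  {
    FullMatrix<double>                   cell_matrix;
    std::vector<types::global_dof_index> dof_indices;
  };

  // Two cells conflict if they touch the same global DoFs.
  const auto get_conflict_indices =
    [](const CellIterator &cell) -> std::vector<types::global_dof_index>
    {
      std::vector<types::global_dof_index> indices(cell->get_fe().dofs_per_cell);
      cell->get_dof_indices(indices);
      return indices;
    };

  const std::vector<std::vector<CellIterator> > colored_cells =
    GraphColoring::make_graph_coloring(
      dof_handler.begin_active(), dof_handler.end(),
      std::function<std::vector<types::global_dof_index>(const CellIterator &)>(
        get_conflict_indices));

  WorkStream::run(
    colored_cells,
    // worker: compute the local matrix (quadrature work elided here)
    [dofs_per_cell](const CellIterator &cell, ScratchData &, CopyData &copy)
    {
      copy.cell_matrix.reinit(dofs_per_cell, dofs_per_cell);
      copy.dof_indices.resize(dofs_per_cell);
      cell->get_dof_indices(copy.dof_indices);
      // ... fill copy.cell_matrix ...
    },
    // copier: cells of one color share no DoFs, so these writes can
    // safely run concurrently
    [&system_matrix](const CopyData &copy)
    {
      system_matrix.add(copy.dof_indices, copy.cell_matrix);
    },
    ScratchData(),
    CopyData());
}

Whether this pays off at your problem size is something you would have to
measure, of course.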

> The following are the results of the experiment done on the cluster;
> the timings were calculated using the built-in timer of deal.II:
>
> Threads    Total time    % time on assembly
>  1         0.425 s       13
>  4         0.508 s        3.8
>  8         0.348 s        7.5
> 16         0.362 s       10.0
> 32         0.404 s       15

This is the output of Timer::wall_time(), I assume? Regarding the
percentages: the share of time spent in assembly goes up and down a bit.
Why is that? What is the rest of the code doing? Does it use threading?
Without context, this column does not give a lot of information.

For the rest: how do the timings look for a four times larger problem?
How much time do you spend on computing versus writing into the matrix?
(You can check the latter by running the loop manually and doing just
the computation part.)
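
For example, something along these lines would time the computation part
alone (a rough sketch; the FE setup and names here are made up for
illustration, and the actual quadrature work is elided):

#include <deal.II/base/quadrature.h>
#include <deal.II/base/timer.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/lac/full_matrix.h>

#include <iostream>

using namespace dealii;

// Run the cell loop, do only the local computations, and never touch the
// global matrix; comparing this wall time with the full assembly tells
// you how expensive the writing is.
template <int dim>
double time_local_computation_only(const DoFHandler<dim> &dof_handler,
                                   const Quadrature<dim> &quadrature)
{
  const FiniteElement<dim> &fe = dof_handler.get_fe();
  FEValues<dim> fe_values(fe, quadrature,
                          update_values | update_gradients |
                          update_JxW_values);
  FullMatrix<double> cell_matrix(fe.dofs_per_cell, fe.dofs_per_cell);

  Timer timer; // starts on construction
  typename DoFHandler<dim>::active_cell_iterator
    cell = dof_handler.begin_active(),
    endc = dof_handler.end();
  for (; cell != endc; ++cell)
    {
      fe_values.reinit(cell);
      cell_matrix = 0;
      // ... the same quadrature-point work the real cell worker does ...
      // but no system_matrix.add(), so only the computation is timed
    }
  timer.stop();

  std::cout << "computation-only loop: " << timer.wall_time() << " s"
            << std::endl;
  return timer.wall_time();
}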

Best,
Martin

neeraj sarna

Apr 19, 2016, 12:29:32 PM4/19/16
to deal.II User Group
Dear Martin,

Thank you for your reply. Yes, I used set_thread_limit() to set the number of threads I wish to use, and I found that the time for computation was almost 20 times the time taken to write into the system matrix.

I think you are right in pointing out that the size of the problem was small. I increased it by a factor of four, and the results now look quite good. Not much overall speedup can be expected, though, because everything apart from the assembly is serial and almost 60-70% of the total time is taken up by the GMRES solver. In any case, this resolves the doubt I had regarding step-12.

I also have a few questions regarding the system I am presently solving. Since the time spent writing values into the system matrix is small compared to the assembly itself, I decided to use MeshWorker::loop for my problem. The problem has the form A u_x + B u_y = P, where A and B are constant matrices and u has 6 components.

The size of the problem (with a third-order DG scheme) is:

Number of cells: 2560
Number of DoFs: 230400

The code does the following jobs, in this order:
1.) builds the sparsity pattern and initializes the matrices
2.) assembles the system (using MeshWorker::loop)
3.) solves with GMRES
4.) evaluates the error (the exact solution is known)
5.) performs h-adaptivity using the Kelly error estimator
6.) writes the solution to a file for the last refinement cycle

The complete code is serial apart from step 2 above. For this problem, almost 40-50% of the total wall-clock time of the serial run is spent on assembly, so I was expecting a good speedup from using MeshWorker::loop. But instead of a speedup, the code has become slower. To understand this I profiled the code with VTune while running on 4 threads. Comparing it with the serial run, I found that the call to tbb::internal::allocate_root_proxy consumes most of the CPU time. You can find attached a snapshot of the results obtained from VTune; as a reference I have also attached the file containing the assembly function. The setup is very similar to step-12 of the tutorials.
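
Since the attached file is not shown inline, here is the step-12 pattern that my assembly follows, for reference (this is the tutorial's own structure, with the tutorial's class and worker names, rather than a copy of my attachment):

#include <deal.II/meshworker/dof_info.h>
#include <deal.II/meshworker/integration_info.h>
#include <deal.II/meshworker/loop.h>
#include <deal.II/meshworker/simple.h>

// Body as in step-12: the fe, mapping, dof_handler, system_matrix and
// right_hand_side members and the static integrate_* workers belong to
// the tutorial's AdvectionProblem class.
template <int dim>
void AdvectionProblem<dim>::assemble_system()
{
  MeshWorker::IntegrationInfoBox<dim> info_box;

  const unsigned int n_gauss_points = dof_handler.get_fe().degree + 1;
  info_box.initialize_gauss_quadrature(n_gauss_points,
                                       n_gauss_points,
                                       n_gauss_points);

  info_box.initialize_update_flags();
  const UpdateFlags update_flags =
    update_quadrature_points | update_values | update_gradients;
  info_box.add_update_flags(update_flags, true, true, true, true);
  info_box.initialize(fe, mapping);

  MeshWorker::DoFInfo<dim> dof_info(dof_handler);

  MeshWorker::Assembler::SystemSimple<SparseMatrix<double>, Vector<double> >
    assembler;
  assembler.initialize(system_matrix, right_hand_side);

  // MeshWorker::loop distributes the cell, boundary and face integrations
  // over TBB tasks; the assembler takes care of writing into the matrix.
  MeshWorker::loop<dim, dim, MeshWorker::DoFInfo<dim>,
                   MeshWorker::IntegrationInfoBox<dim> >(
    dof_handler.begin_active(), dof_handler.end(),
    dof_info, info_box,
    &AdvectionProblem<dim>::integrate_cell_term,
    &AdvectionProblem<dim>::integrate_boundary_term,
    &AdvectionProblem<dim>::integrate_face_term,
    assembler);
}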

It would be helpful if you could provide some explanation as to why this might happen.

Best,
Neeraj
VTune_result.png
assemble_system_meshworker.h

Wolfgang Bangerth

Apr 19, 2016, 6:52:42 PM4/19/16
to dea...@googlegroups.com

> Number of cells: 2560
> Number of DoFs: 230400
>
> The code does the following jobs, in this order:
> 1.) builds the sparsity pattern and initializes the matrices
> 2.) assembles the system (using MeshWorker::loop)
> 3.) solves with GMRES
> 4.) evaluates the error (the exact solution is known)
> 5.) performs h-adaptivity using the Kelly error estimator
> 6.) writes the solution to a file for the last refinement cycle
>
> The complete code is serial apart from step 2 above. For this problem,
> almost 40-50% of the total wall-clock time of the serial run is spent on
> assembly, so I was expecting a good speedup from using MeshWorker::loop.
> But instead of a speedup, the code has become slower. To understand this
> I profiled the code with VTune while running on 4 threads. Comparing it
> with the serial run, I found that the call to
> tbb::internal::allocate_root_proxy consumes most of the CPU time. You can
> find attached a snapshot of the results obtained from VTune; as a
> reference I have also attached the file containing the assembly function.
> The setup is very similar to step-12 of the tutorials.

I'm not overly familiar with Intel VTune, but I'd like to point out two
issues:

- 2560 cells is still a pretty small problem to try to parallelize. If
you're interested in measuring whether a code really does what you
expect it to do as far as parallelization is concerned, choose the
biggest problem you can work with (i.e., typically what still fits into
your machine's memory), measure the speedup there, and then work your
way back to smaller problems.

- My suspicion is that VTune tells you how much CPU time is spent in
each of these functions. But that is not necessarily the time they spend
doing useful things. For example, a common strategy is to create 8
threads at the beginning of the program (on, say, an 8-core machine).
Seven of those will do nothing for the largest part of your program,
with the exception of those parts of the program that are parallelized
and during which they are assigned some work. If your program is
completely sequential, some profilers will then tell you that you spend
7/8 = 87.5% of the overall CPU time in a function that does nothing.

I don't know whether that's the case here, and it's hard to tell from a
distance, but these are the things you need to be aware of if you want
to interpret the numbers you get from a tool like VTune. As a general
rule, you will often get *lots of data* but *little information* from
performance profilers, and it takes significant experience to learn what
it actually *means*.

Best
W.


--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@math.tamu.edu
www: http://www.math.tamu.edu/~bangerth/

neeraj sarna

Apr 22, 2016, 6:30:39 AM4/22/16
to deal.II User Group
Dear Wolfgang,

Thank you for your suggestions. The size of the problem was indeed small, and I now get good scalability for meshes with around 10,000 or more cells.

Best,
Neeraj

Wolfgang Bangerth

Apr 22, 2016, 7:34:28 AM4/22/16
to dea...@googlegroups.com
On 04/22/2016 05:30 AM, neeraj sarna wrote:
>
> Thank you for your suggestions. The size of the problem was indeed
> small, and I now get good scalability for meshes with around 10,000 or
> more cells.

Ah, great, good to know!
Best
Wolfgang