> No of cells : 2560
> No of DOF : 230400
>
> The code does the following jobs in the specified order:
> 1.) develops sparsity pattern and initializes matrices
> 2.) assembles the system (using MeshWorker::loop)
> 3.) solves using GMRES
> 4.) evaluates the error (exact solution is known)
> 5.) does h_adaptivity using kelly error estimator
> 6.) writes the solution to the file for the last refinement cycle
>
> The complete code is serial apart from step-2. For this problem I spend
> almost 40-50% of the total wall clock time, when run in serial, on
> assembly so I was expecting good speed up after using the
> MeshWorker::loop. But instead of getting a speedup, the code has become
> slower. To understand this I used VTune to profile my code which ran on
> 4 threads. By comparing it with the serial code, I found that the call
> to tbb::internal::allocate_root_proxy consumes most of the CPU time. You
> can find attached a snapshot of the results obtained from VTune. As a
> reference I have also attached a file which contains the function for
> the assembly. The setup is very similar to Step-12 of the tutorials.
I'm not overly familiar with Intel VTune, but I'd like to point out two
issues:
- 2560 cells is still a pretty small problem to try to parallelize. If
you want to measure whether a code really does what you expect it to do
as far as parallelization is concerned, choose the biggest problem you
can work with (i.e., typically what still fits into memory on your
machine), measure the speedup there, and then work your way back to
smaller problems.
- My suspicion is that VTune tells you how much CPU time is spent in
each of these functions. But that is not necessarily time spent doing
useful work. For example, a common strategy is to create 8 threads at
the beginning of the program (on, say, an 8-core machine). 7 of those
will do nothing for most of the program's run time, except in those
parts of your program that are parallelized and during which they are
assigned some work. If your program is completely sequential, then some
profilers will tell you that you spend 7/8=87.5% of the overall CPU time
in a function that does nothing.
I don't know whether that's the case here, and it's hard to tell from a
distance, but these are the things you need to be aware of if you want
to interpret whatever numbers you get from a program like VTune. As a
general rule, you will often get *lots of data* but *little information*
from performance profilers, and it takes significant experience to learn
what the data actually *means*.
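
To make the two points above concrete, here is a minimal C++ sketch of
the arithmetic involved. It is an illustration only: the names
idle_fraction, speedup, and wall_seconds are hypothetical helpers, not
part of deal.II, TBB, or VTune, and the sketch assumes the simple model
described above, in which a fixed pool of threads exists for the whole
run and only some of them ever receive work.

```cpp
#include <cassert>
#include <chrono>

// Fraction of the total thread time that belongs to workers with nothing
// to do, assuming a pool of n_threads of which only n_busy receive work.
// A profiler that reports raw CPU time per function may attribute this
// share to whatever function the idle workers wait in.
double idle_fraction(unsigned int n_threads, unsigned int n_busy)
{
  assert(n_threads > 0 && n_busy <= n_threads);
  return static_cast<double>(n_threads - n_busy) / n_threads;
}

// Speedup of a parallel run over a serial run, computed from wall-clock
// times (not CPU times) of the same piece of work.
double speedup(double serial_seconds, double parallel_seconds)
{
  assert(parallel_seconds > 0.0);
  return serial_seconds / parallel_seconds;
}

// Wall-clock seconds taken by one call of f (hypothetical helper).
template <typename F>
double wall_seconds(F &&f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const std::chrono::duration<double> elapsed =
    std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

For the example above, idle_fraction(8, 1) evaluates to 0.875, i.e. the
87.5% figure. To judge the assembly itself, one could wrap only the
assembly call in wall_seconds (for instance
wall_seconds([&]() { assemble_system(); }), where assemble_system stands
in for whatever your assembly function is called), once for the serial
and once for the threaded build, and pass the two numbers to speedup.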
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth          email: bang...@math.tamu.edu
                           www:   http://www.math.tamu.edu/~bangerth/