Victor,
most of the parallelism would have to happen in the application program
(step-35 in your case). The key loop of that program looks like this:
  loop:
    interpolate_velocity();
    diffusion_step(...);
    projection_step(...);
    update_pressure(...);
The second and third of these do expensive things like assembling linear
systems and solving them. For assembling the linear system, the program uses
WorkStream, which we know scales reasonably well to perhaps a dozen cores (or
maybe two dozen, depending on what the workload actually is). For solving
linear systems, there isn't much parallelism to be had: the matrices are very
small (31k and 4k rows), so matrix-vector products do not scale well beyond at
most a handful of cores, and the preconditioner used (ILU) has no parallelism
at all -- not because a parallel version isn't implemented, but because
forward/backward substitution is inherently sequential.
So I'm not surprised you don't get much speed-up. You'd need (i) a much larger
problem so that the parallelized operations have enough work to scale, and
(ii) different algorithms for solving the linear systems than the ones used in
this program. Both of these are of course possible, it's just not what this
program does. Nor what its intent is: the tutorials are meant to *teach* how
to write finite element codes; they're not *intended* to be HPC-ready
applications.
Best
W.