New deal.II 8.5.1 is 20% slower than deal.II 8.0.0


drgul...@gmail.com

Dec 26, 2017, 6:52:03 AM
to deal.II User Group
deal.II 8.5.1 seems to be 20% slower than 8.0.0. This is the timing I get when running the step-23 tutorial (screen and VTK output suppressed):

deal.II version 8.0.0:

$ time ./step-23
Number of active cells: 16384
Number of degrees of freedom: 16641

real    0m3.432s
user    0m6.320s
sys    0m0.612s

deal.II version 8.5.1:

$ time ./step-23
Number of active cells: 16384
Number of degrees of freedom: 16641

real    0m4.430s
user    0m7.080s
sys    0m0.492s

In general, I get about a 20% slowdown for my own code when upgrading from 8.0.0 to 8.5.1. What is the reason for such a slowdown? Is deal.II heading in the right direction, given that new versions become gradually slower?!

Matthias Maier

Dec 26, 2017, 10:08:17 AM
to dea...@googlegroups.com
Hi,

I get relatively comparable results for both versions:

dev: ./step-23 55.55s user 1.64s system 131% cpu 43.637 total
8.0: ./step-23 55.85s user 1.48s system 129% cpu 44.130 total

Is this the unmodified step-23 tutorial program?

For measuring performance regressions, a total runtime of less than 5
seconds doesn't say much. Newer versions allocate and precompute quite
a bit of data upfront, which may result in a small (problem-independent)
fixed runtime overhead (of a second or less).
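
If you want to pin down where the time goes, something like deal.II's TimerOutput class can attribute wall time to named sections instead of relying on one total figure. A minimal sketch (the section names are made up for illustration):

#include <deal.II/base/timer.h>
#include <iostream>

using namespace dealii;

void run()
{
  // Prints a per-section wall-time summary when `timer` goes out of scope.
  TimerOutput timer(std::cout, TimerOutput::summary, TimerOutput::wall_times);

  {
    TimerOutput::Scope scope(timer, "assemble rhs");
    // ... e.g. the VectorTools::create_right_hand_side() call ...
  }
  {
    TimerOutput::Scope scope(timer, "solve");
    // ... the CG solve ...
  }
}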

Best,
Matthias

drgul...@gmail.com

Dec 26, 2017, 1:35:10 PM
to deal.II User Group
Thanks. This is strange, as I still consistently get 15-20% better results in favor of the older versions on three different machines. Two more measurements on other systems are attached below.

TEST: Step-23 (integration time modified from 5 to 150, output suppressed)
CMAKE_BUILD_TYPE: "Release".

MACHINE 1: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

deal v8.1.0 (built and installed from source):

$ time ./step-23
real    1m23.768s
user    5m46.080s
sys    0m4.079s

deal v8.5.1 (built and installed from source):

$ time ./step-23
real    1m42.416s
user    5m37.018s
sys    0m4.340s

MACHINE 2: Intel(R) Xeon(R) CPU X5690  @ 3.47GHz
$ lsb_release -a
Description:    Ubuntu 14.04.5 LTS
$ gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4

deal v8.1.0 (built and installed from source):

$ time ./step-23
real    2m49.114s
user    11m41.429s
sys    0m48.882s

deal v8.5.1 (built and installed from source):
$ time ./step-23
real    3m20.583s
user    10m54.850s
sys    2m18.989s

Matthias Maier

Dec 26, 2017, 2:10:44 PM
to dea...@googlegroups.com
Would you mind sending us the "detailed.log" files?

Best,
Matthias

drgul...@gmail.com

Dec 26, 2017, 3:22:34 PM
to deal.II User Group
Yes, both logs are attached. The key lines from their diff:

$ diff detailed.log-v8.1.0 detailed.log-v8.5.1
...
< #  Compiler flags used for this build:
< #        CMAKE_CXX_FLAGS:              -pedantic -fpic -Wall -Wpointer-arith -Wwrite-strings -Wsynth -Wsign-compare -Wswitch -Wno-unused-local-typedefs -Wno-long-long -Wno-deprecated -Wno-deprecated-declarations -std=c++11 -Wno-parentheses -Wno-long-long
< #        DEAL_II_CXX_FLAGS_RELEASE:    -O2 -funroll-loops -funroll-all-loops -fstrict-aliasing -Wno-unused
---
> #  Base configuration (prior to feature configuration):
> #        DEAL_II_CXX_FLAGS:            -pedantic -fPIC -Wall -Wextra -Wpointer-arith -Wwrite-strings -Wsynth -Wsign-compare -Wswitch -Woverloaded-virtual -Wno-long-long -Wno-deprecated-declarations -Wno-literal-suffix -std=c++11
> #        DEAL_II_CXX_FLAGS_RELEASE:    -O2 -funroll-loops -funroll-all-loops -fstrict-aliasing -Wno-unused-local-typedefs
18c19
< #        DEAL_II_LINKER_FLAGS:         -Wl,--as-needed -rdynamic -pthread
---
> #        DEAL_II_LINKER_FLAGS:         -Wl,--as-needed -rdynamic -fuse-ld=gold
...
> #            BOOST_CXX_FLAGS = -Wno-unused-local-typedefs
...
> #      ( DEAL_II_WITH_BZIP2 = OFF )
> #        DEAL_II_WITH_CXX11 = ON
> #      ( DEAL_II_WITH_CXX14 = OFF )
> #      ( DEAL_II_WITH_GSL = OFF )
...
> #            THREADS_CXX_FLAGS = -Wno-parentheses
> #            THREADS_LINKER_FLAGS = -pthread
Attachments: detailed.log-v8.1.0, detailed.log-v8.5.1

Martin Kronbichler

Dec 27, 2017, 3:49:42 AM
to dea...@googlegroups.com

In general, we strive to make deal.II faster with new releases, and for many cases that is also true, as I can confirm from my own applications. I have run step-23 on release 8.0 as well as on the current development sources, and I can confirm that the new version is slower on my machine: with the output of step-23 disabled, I get a run time of 4.7 seconds for version 8.0 and 5.3 seconds for the current version.

After some investigation I found that while some solver-related operations did indeed get faster (the problem with 16k dofs is small enough to run from L3 cache in my case), we are slower in the FEValues::reinit() calls. These appear in VectorTools::create_right_hand_side() and VectorTools::interpolate_boundary_values() in the time loop. The reason is that we nowadays call MappingQGeneric::compute_mapping_support_points() also for the bilinear mapping MappingQ1, which allocates and de-allocates a vector. While this is uncritical for higher-order mappings, in 2D with linear shape functions the time spent there is indeed not negligible.

This is unfortunate for your use case, but I want to stress that the changes were made in the hope of making that part of the code more reliable. Furthermore, those parts of the code are not performance critical and are not accurately tracked. It is a rather isolated issue that got worse here, so from this single example one definitely cannot say that we are going in the wrong direction as a project.

There are plenty of things I could imagine to make this particular case more efficient in the application code, well beyond the performance of what version 8.0 provided; note that I would not write the code like that if it were performance critical. The one obvious library-side improvement is to work around the memory allocations by not returning a vector from MappingQGeneric::compute_mapping_support_points(), but rather filling an existing array in MappingQGeneric::InternalData::mapping_support_points. None of us developers has this high on the priority list right now, but we would definitely appreciate it if one of our users, like you, wanted to look into it. I could guide you to the right spots.
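
To illustrate the kind of change I mean (a sketch only; the actual deal.II signatures and member names may differ from what is written here), the per-cell allocation disappears if the function fills a caller-owned buffer instead of returning a fresh vector:

// Current pattern (sketch): every call allocates, and later frees, a vector.
std::vector<Point<spacedim>>
compute_mapping_support_points(const cell_iterator &cell) const;

// Proposed pattern (sketch): reuse a buffer stored in InternalData, so the
// resize() is a no-op on every cell after the first one.
void
compute_mapping_support_points(const cell_iterator &cell,
                               std::vector<Point<spacedim>> &support_points) const
{
  // For the bilinear MappingQ1, the support points are just the vertices.
  support_points.resize(GeometryInfo<dim>::vertices_per_cell);
  // ... fill support_points in place ...
}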

Best regards,
Martin


drgul...@gmail.com

Dec 27, 2017, 8:16:44 AM
to deal.II User Group
Thank you!

Some guidance on how I could optimize the code would be appreciated. I am using deal.II to solve a time-dependent nonlinear 2D problem (a sort of sine-Gordon equation, but a more advanced model which includes a history dependence, https://github.com/drgulevich/mitmojco). Most of the time in my deal.II code is spent in:

1. fe_values.get_function_values -- most of the wall time (70%)
2. fe_values.reinit -- less often
3. CG solver -- even less often

Kind regards,
Dmitry

Martin Kronbichler

Dec 27, 2017, 8:25:31 AM
to dea...@googlegroups.com

Dear Dmitry,

Thanks for the info. This sounds interesting. May I ask for some more details: are you using Q1 elements in your 2D model (this is the element used in step-23)? I have not looked into your history model, but I guess it can be adapted.

Since your code is limited by FEValues::get_function_values, I suggest you take a look at the step-48 tutorial program. That code does have performance-tuned components. It requires a somewhat different approach to writing the loops, but it should pay off for time-dependent cases: I expect a speedup of a factor of 4-6 in 2D with linear elements for the part corresponding to FEValues::get_function_values and FEValues::reinit. I would be happy to guide you more closely if this sounds applicable.
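
To give an idea of what the step-48 approach looks like, here is a sketch only, assuming a MatrixFree object has already been set up; the exact names and call signatures vary between deal.II versions (e.g., n_macro_cells() is called n_cell_batches() in later releases):

#include <deal.II/matrix_free/matrix_free.h>
#include <deal.II/matrix_free/fe_evaluation.h>
#include <deal.II/lac/la_parallel_vector.h>

using namespace dealii;

// Evaluate the solution at all quadrature points of all cells, replacing the
// per-cell FEValues::reinit() + get_function_values() pattern.
template <int dim>
void evaluate_solution(const MatrixFree<dim, double> &matrix_free,
                       const LinearAlgebra::distributed::Vector<double> &solution)
{
  FEEvaluation<dim, 1> phi(matrix_free); // degree 1 = Q1 elements
  for (unsigned int cell = 0; cell < matrix_free.n_macro_cells(); ++cell)
    {
      phi.reinit(cell);
      phi.read_dof_values(solution);
      phi.evaluate(true, false); // values only, no gradients
      for (unsigned int q = 0; q < phi.n_q_points; ++q)
        {
          const auto u_q = phi.get_value(q); // vectorized over several cells
          // ... evaluate the nonlinearity at u_q ...
        }
    }
}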

Best,
Martin

drgul...@gmail.com

Dec 27, 2017, 9:13:42 AM
to deal.II User Group
Thanks, that is helpful and looks very promising. I will look into step-48 then. And yes, I am using Q1 elements.
Cheers,
Dmitry

luca.heltai

Dec 30, 2017, 1:10:36 PM
to Deal.II Users
If all you are solving is a two-dimensional problem, you could encode your “get_function_values” into a matrix-vector multiplication to drastically improve the situation.

I’m thinking of a matrix of size (n_quadrature_points x n_active_cells) x n_dofs, and then you slice the results cellwise instead of repeatedly calling get_function_values.

once:

M[q+active_cell_index*n_dofs_per_cell, i] = fe_values.shape_value(i,q);

at every solution step, before you actually need the values:
M.vmult(values, solution);

in every cell:
local_values = ArrayView(values[active_cell_index*n_dofs_per_cell], n_dofs_per_cell)

L.

Praveen C

Dec 31, 2017, 12:09:12 AM
to Deal. II Googlegroup


On 30-Dec-2017, at 11:40 PM, luca.heltai <luca....@gmail.com> wrote:

> I’m thinking of a matrix of size (n_quadrature_points x n_active_cells) x n_dofs, and then you slice the results cellwise instead of repeatedly calling get_function_values.
>
> once:
>
> M[q+active_cell_index*n_dofs_per_cell, i] = fe_values.shape_value(i,q);

This looks like a Vandermonde matrix which would be identical on every cell. In that case, it is not necessary to store it for every cell.

Thanks
praveen

luca.heltai

Dec 31, 2017, 11:26:07 AM
to Deal.II Users
This depends on the finite element. For FE_Q, yes; for other finite elements, definitely not (Raviart-Thomas, Nedelec, etc.).

Also, the code above was wrong with the numbering, of course. On the left you should put the global dof number associated with i, not i itself.

Even if the local matrix is identical, storing it with the correct numbering in a big sparse matrix is only marginally expensive, while saving a lot when extracting local dofs.

One vmult, followed by memory-contiguous access on every cell, is much cheaper than searching the global vector for the local dofs and then performing one local multiplication (maybe with one identical Vandermonde matrix).
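
Putting the two messages together, a hedged sketch of the corrected construction (variable names like M, values, and solution are illustrative; M is a SparseMatrix<double> whose sparsity pattern has n_q_points * n_active_cells rows and dof_handler.n_dofs() columns):

// Assumed context (illustrative): dof_handler, fe, quadrature are set up;
// values and solution are Vector<double> of the appropriate sizes.
const unsigned int dofs_per_cell = fe.dofs_per_cell;
const unsigned int n_q_points   = quadrature.size();
std::vector<types::global_dof_index> dof_indices(dofs_per_cell);
FEValues<dim> fe_values(fe, quadrature, update_values);

// Build M once: row = quadrature point q of cell c, column = global dof.
typename DoFHandler<dim>::active_cell_iterator
  cell = dof_handler.begin_active(), endc = dof_handler.end();
for (; cell != endc; ++cell)
  {
    fe_values.reinit(cell);
    cell->get_dof_indices(dof_indices);
    const unsigned int c = cell->active_cell_index();
    for (unsigned int q = 0; q < n_q_points; ++q)
      for (unsigned int i = 0; i < dofs_per_cell; ++i)
        M.set(q + c * n_q_points, dof_indices[i],
              fe_values.shape_value(i, q));
  }

// At every solution step, one matrix-vector product yields the solution
// values at all quadrature points of all cells:
M.vmult(values, solution);

// On cell c, the values are then the contiguous slice
// values[c * n_q_points] ... values[(c + 1) * n_q_points - 1].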

L.

drgul...@gmail.com

Jan 5, 2018, 3:16:38 PM
to deal.II User Group
Thanks a lot. It certainly makes sense.

I guess the same considerations apply to the function SineGordonProblem::compute_nl_matrix in step-25, where reinit() and get_function_values() are invoked inside the cell loop. This is where I originally started from in my code.

D.

Wolfgang Bangerth

Jan 8, 2018, 2:23:54 AM
to dea...@googlegroups.com
On 01/05/2018 01:16 PM, drgul...@gmail.com wrote:
>
>
> I guess the same considerations should apply to the function
> SineGordonProblem::compute_nl_matrix in Step-25 where reinit() and
> get_function_values() are invoked inside the cell loop. This is where I was
> originally starting from in my code.

Yes, that is correct. There is definitely room to improve this program.

If you feel like it and figure out how to accelerate something like step-25,
would you be interested in writing a few paragraphs for the "Possibilities for
extensions" section of step-25 that discusses how this program could be made
faster?

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

drgul...@gmail.com

Jan 8, 2018, 10:30:37 AM
to deal.II User Group
Sure, I will be happy to try.

Kind regards,

D.