[Gfs-devel] optimizing cell traverse operations


Daniel Fuster

Jun 2, 2012, 11:27:39 AM
to GFS developper discussion list
Hi Stephane

As you know, I am trying to see whether there is any way to optimize the speed of the code. After some time looking at it, I would like to report some observations and hear your comments on them.

I attach a patch file (it contains 4 patches) with the modifications of the code I made for some initial trials. It is not final code at all, so please pull these patches into a separate folder and do not pay too much attention to the programming style (I think it is clear enough to be understood easily, but if it is not, please tell me).

It is a little difficult to explain what I did in a short email, but I will try. Basically, the final goal is to know whether there is a way to traverse the cells more quickly, and to find out what makes the cell traverse slow in some situations.

As you will see in the patch, I have created new structures to make the cell traverse quicker. I know perfectly well that this does not mean that the method I propose is better, but it is a good test for comparing the cell_traverse functions against the best one could do.

The basic idea is to create a structure (MatrixCells) that contains the following important elements:
FttCell ** cellpointer: an array of pointers; each element points to a cell of the domain.
gint ** lists: a matrix that contains, for each traverse list, the indices in cellpointer of the cells belonging to it. Note that depending on the cell traverse operation, a different list is defined; for example, lists[0][i] contains the indices in cellpointer of the cells of a FTT_CELL_TRAVERSE_LEAFS traversal. I add -1 to mark the end of each list (this structure may be optimized with a dynamic array).
gint * neighbors: an array with FTT_CELL_NEIGHBOR*TOTALCELLS elements where I save the positions in cellpointer of the neighbouring cells of each cell.
gdouble * facevalues: an array with FTT_CELL_NEIGHBOR*TOTALCELLS elements where I directly save a given value at the faces of each cell.

This structure is complemented with another one for the variables (variableaux) that basically contains an array of real values: the ith element corresponds to the value of the variable in the cell at the same position as cellpointer[i]. One could also imagine simply making a list of pointers to the value of a GfsVariable in the ith cell.
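
To fix notation, here is a minimal sketch of what these two structures look like (field names follow the description above; the actual patch may declare them somewhat differently):

/* Minimal sketch of the flat structures described above; the actual
   patch may declare them somewhat differently.  Requires ftt.h for
   FttCell and glib.h for gint/gdouble. */
typedef struct {
  FttCell ** cellpointer; /* cellpointer[i]: pointer to the i-th cell of the domain */
  gint ** lists;          /* lists[k][j]: index in cellpointer of the j-th cell of
                             traverse list k, terminated by -1 */
  gint * neighbors;       /* one entry per (cell, direction) pair: index in
                             cellpointer of the corresponding neighbouring cell */
  gdouble * facevalues;   /* one entry per (cell, direction) pair: value stored
                             directly at that face of that cell */
} MatrixCells;

typedef struct {
  gdouble * val;          /* val[i]: value of the variable in cell cellpointer[i] */
} VariableAux;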

The initialization of the structures is coded at the beginning of mac_projection. Once the structures are initialized, I have tried to compare the speed of a given operation.

The speed-up in very simple operations is impressive. For the example I attach below (just apply the patches and run gerris2D skew.gfs), scale_divergence takes:
scale standard  0.26 s
scale opt       0.01 s

However, I do not always get such a good improvement, and I do not really understand why I get such a large difference.

The gfs_normal_divergence function is much more illustrative of the speed optimization problems. I realized that using GFS_STATE (cell)->f[face.d].un or gfs_domain_face_fraction carries a large penalty, but only when I use my optimized structure. Indeed, even if I save the values in the FttCell structure, accessing them takes an appreciable amount of time. This is the reason why I created a structure giving direct access to the values I need (obviously at the expense of memory). The results I got are the following:

gfs_normal_divergence (standard)  0.35 s
gfs_normal_divergence_opt  0.01 s  (completely optimized)
gfs_normal_divergence_opt  2.04 s  (surprisingly: optimized traverse, but reading a value stored in the FttCell structure)
gfs_normal_divergence_opt  0.28 s  (optimized traverse, using gfs_domain_face_fraction)

Something similar happens with GFS_STATE (cell)->f[face.d].un 
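
In other words, the fully optimized variant is essentially a loop over flat arrays. A sketch (using the structure above, and assuming the signs and face fractions are folded into facevalues at initialization; FTT_NEIGHBORS is the number of faces per cell):

/* Sketch of the fully optimized divergence: everything the inner loop
   needs has been flattened into MatrixCells beforehand, so no FttCell
   is dereferenced and no function is called per face. */
static void normal_divergence_flat (const MatrixCells * m, gdouble * div)
{
  gint i, d;
  for (i = 0; m->lists[0][i] >= 0; i++) {   /* list 0: leaf cells, -1 terminated */
    gint c = m->lists[0][i];
    gdouble sum = 0.;
    for (d = 0; d < FTT_NEIGHBORS; d++)
      sum += m->facevalues[c*FTT_NEIGHBORS + d];
    div[c] = sum;
  }
}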

Now the questions:
1) Do you know exactly why this happens?
2) Based on these observations, do you think there is a way to optimize ftt_cell_traverse without using my structure, but helping the compiler/CPU anticipate the memory that is going to be used next?
3) Do you think we could store a MatrixCells structure for each box and use it to traverse the cells? If so, do you think it is worthwhile to pre-compute and save all the information we expect to use?
4) Related to 3, and as you already know, I have verified that what would make a huge difference in the computational time is storing the matrix coefficients required to compute the gradients. This is somehow related to the suggestion of pre-computing required quantities, but it does not necessarily require changing the cell_traverse. I would appreciate some information about the data I get from GfsLinearProblem. I guess I only need the GfsStencil, but to be honest I would be glad if you could tell me how this information is structured.

best
Daniel


--
www.danielfuster.com
Attachments: skew.gfs, optimization

Stephane Popinet

Jun 12, 2012, 9:25:44 PM
to GFS developper discussion list
Hi Daniel,

> As you will see in the patch, I have created new structures to make the cell
> traverse quicker. I know perfectly well that this does not mean that the method
> I propose is better, but it is a good test for comparing the cell_traverse
> functions against the best one could do.

This is also what ftt_cell_traverse_new() etc. does, isn't it? Did you
compare your implementation with this?

Also, if your goal is to optimise things such as relax(), then using
the GfsStencil stuff is definitely the way to go. I have done some
tests which indicate that one could speed up relax() by a factor > 5
when using the GfsStencil and GfsLinearProblem data structures.
Getting such good speedups is very dependent on using the cache
efficiently. This involves redesigning the current GfsStencil data
structure (with only small changes to the interface).
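
To give an idea of what "using the cache efficiently" means here, a generic sketch of a flat, CSR-like stencil layout and the corresponding Jacobi-type sweep (this only illustrates the layout argument; it is not the actual GfsStencil/GfsLinearProblem interface):

/* Generic CSR-like layout: the coefficients and column indices of each
   row are contiguous in memory, so a relaxation sweep reads them
   sequentially.  This is NOT the GfsStencil API, only an illustration. */
typedef struct {
  gint n;          /* number of rows (cells) */
  gint * start;    /* entries of row i are start[i] .. start[i+1]-1 */
  gint * col;      /* cell index of each entry */
  gdouble * coeff; /* coefficient of each entry, diagonal stored first */
  gdouble * rhs;   /* right-hand side */
} FlatStencil;

static void jacobi_sweep (const FlatStencil * s, const gdouble * u, gdouble * unew)
{
  gint i, j;
  for (i = 0; i < s->n; i++) {
    gdouble sum = s->rhs[i];
    for (j = s->start[i] + 1; j < s->start[i+1]; j++)  /* off-diagonal entries */
      sum -= s->coeff[j]*u[s->col[j]];
    unew[i] = sum/s->coeff[s->start[i]];               /* divide by the diagonal */
  }
}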

The problem is that the expensive part (for the multigrid) is actually
interpolations/restrictions between levels rather than relaxation.
Optimising this part involves redesigning most of the stuff in ftt.c

As I am sure you realise this stuff is difficult because "the devil is
in the details". A classical example is inverting the indices of the
inner and outer loops when going through a multi-dimensional array.
This does not make any difference "semantically" but has a large
impact on performance. Similarly the overhead of memory indirections
(i.e. -> operators) is difficult to predict as it depends on the
details of memory layout etc...
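
For instance, in C (row-major storage) the two loops below are semantically identical, but the first walks memory sequentially while the second jumps a full row at every iteration:

#define N 1024
static double a[N][N];

void good_order (void) {   /* inner index is the contiguous one */
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a[i][j] = 1.;
}

void bad_order (void) {    /* inner index strides by N doubles */
  int i, j;
  for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
      a[i][j] = 1.;
}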

I think the correct approach is to select the basic operations which
you would like to optimise e.g.:

- straight traversal
- access to neighboring cells (e.g. relax)
- interpolation between levels

and then benchmark (maybe even independently of Gerris) different
quad/octree implementations (and also look at things like p4est
mentioned by Maka).

The next step is to define an interface (hopefully as close as
possible to the current one) which would not result in degradation of
performance.

The choice of basic operations above needs to be based on detailed
profiling of current simulations, so this is the first step. I am not
so sure we currently have that good an understanding of what parts of
the code are slow.

hope this helps somewhat

Stephane


Daniel Fuster

Jun 13, 2012, 2:58:23 AM
to GFS developper discussion list
Thanks for your comments Stephane

> This is also what ftt_cell_traverse_new() etc. does, isn't it?

No, it does not. I test different methods with subtle differences, but
the impact on the computational time is large. The
ftt_cell_traverse_new() approach is not the same as going through an
array of pointers to cells, and going through an array of doubles is
different again and much faster (although I guess it is not always
convenient, given the general structure of the code).
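
Schematically, the two access patterns I am comparing are (the names are mine, not those of the module):

static void loop_cell_pointers (FttCell ** cellpointer, guint ncells, GfsVariable * v)
{
  guint i;
  for (i = 0; i < ncells; i++)  /* one indirection through the cell per iteration */
    GFS_VALUE (cellpointer[i], v) = 1.;
}

static void loop_doubles (gdouble * val, guint ncells)
{
  guint i;
  for (i = 0; i < ncells; i++)  /* purely sequential access to a flat array */
    val[i] = 1.;
}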

> Did you compare your implementation with this?

Yes, I did. To my surprise, ftt_cell_traverse_new() does not improve
things much in the tests I did.

> The problem is that the expensive part (for the multigrid) is actually
> interpolations/restrictions between levels rather than relaxation.
> Optimising this part involves redesigning most of the stuff in ftt.c

I see. I did some quick tests and verified that relax was taking a
significant amount of time. In any case, I agree that optimizing
get_from_below/above will also be required.

> and then benchmark (maybe even independently of Gerris) different
> quad/octree implementations (and also look at things like p4est
> mentioned by Maka).
>

OK, I have a module where I already test different straight traverse
methods. Methods to speed up access to neighbouring cells (e.g. relax)
and interpolation between levels require some more work.
However, I am not sure this is what I want: what I was thinking of is
adding some additional structures to minimize "expensive" operations
in the relax and get_from_below/above functions. This would not
require any important change to the existing structures, but I am not
sure that what I am saying makes sense...

> hope this helps somewhat

it does, thanks so much

Daniel




--
www.danielfuster.com

Stephane Zaleski

Jun 14, 2012, 3:03:09 PM
to GFS developper discussion list
Hi

> Similarly the overhead of memory indirections
> (i.e. -> operators) is difficult to predict as it depends on the
> details of memory layout etc...

I am at a conference on multiphase CFD and have discussed these issues with some people.
Indeed, many people agree that the problem is the predictability of how the compiler
uses the memory layout. This is also a problem for other codes and for new architectures
such as manycore designs.

This makes it difficult to find a simple fix for these problems. Certainly benchmarks
will help; we are preparing a set of such benchmarks here.

cheers

SZ

Daniel Fuster

Jun 15, 2012, 6:19:53 AM
to GFS developper discussion list
I am sending you a patch that contains a module to benchmark different
cell_traverse methods for different operations. The patch just
contains a new module, modules/optimized.c. I also attach the file I
use to test the module:

gerris3D optimized.gfs

It is really funny: for example, when initializing a variable to one,
the standard cell traverse is 20% quicker doing
GFS_VALUE (cell, v) = GFS_STATE (cell)->f[0].un;
GFS_VALUE (cell, v) = 1.;

than simply doing

GFS_VALUE (cell, v) = ftt_cell_size (cell);

Any idea why?
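
(The two callbacks passed to gfs_domain_cell_traverse() are essentially the following; the module may differ in the details.)

/* Both callbacks are cast to FttCellTraverseFunc and passed to
   gfs_domain_cell_traverse (domain, FTT_PRE_ORDER, FTT_TRAVERSE_LEAFS, -1,
                             (FttCellTraverseFunc) func, v); */
static void init_two_assignments (FttCell * cell, GfsVariable * v)
{
  GFS_VALUE (cell, v) = GFS_STATE (cell)->f[0].un;
  GFS_VALUE (cell, v) = 1.;
}

static void init_cell_size (FttCell * cell, GfsVariable * v)
{
  GFS_VALUE (cell, v) = ftt_cell_size (cell);
}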

In general, if one needs to call functions recursively, my feeling is
that it is always better to store the information that remains
constant if, by doing that, one avoids any function call outside the
cell. The gain in those cases is around a factor of 20-30 (it depends
on the system). Unfortunately, I am afraid this is not usually
possible, but...
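
What I mean is something along these lines (a sketch with my own array names):

/* "Store what remains constant": the cell sizes are saved once at
   initialization, so the hot loop below does not call ftt_cell_size()
   (or any other function) per cell. */
static void store_sizes (FttCell ** cellpointer, guint ncells, gdouble * size)
{
  guint i;
  for (i = 0; i < ncells; i++)
    size[i] = ftt_cell_size (cellpointer[i]);  /* paid only once */
}

static void init_from_stored_sizes (gdouble * val, const gdouble * size, guint ncells)
{
  guint i;
  for (i = 0; i < ncells; i++)  /* no per-cell function call */
    val[i] = size[i];
}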

Here are the results on my machine:

number of cells 2396745
time init matrix cells 1.68
time init vble 0
time init vbleP 0.14

SIMPLE INITIALIZATION vble=1
gfs_domain_cell_traverse time 7.37
gfs_domain_cell_traverse time (trick) 5.88
time ftt_cell_traverse_next 8.23
cell traverse VariableAux 0.38
cell traverse VariableAuxP 2.35

INITIALIZATION vble=ftt_cell_size (cell)
gfs_domain_cell_traverse time 7.74
time ftt_cell_traverse_next 8.19
cell traverse VariableAux 4.67
cell traverse VariableAuxP 4.9

INITIALIZATION vble=GFS_STATE (cell)->f[0].un
gfs_domain_cell_traverse time 5.72
cell traverse VariableAux 3.76
cell traverse VariableAuxP 4.68

INITIALIZATION vble=sum(gfs_domain_face_fraction)
gfs_domain_cell_traverse time 9.16
cell traverse VariableAux 9.67
cell traverse VariableAuxP 9.39

deallocate opt 0.45


best
Daniel


--
www.danielfuster.com
Attachments: optpatch, optimized.gfs

Daniel Fuster

Jun 15, 2012, 6:21:24 AM
to GFS developper discussion list
Sorry, I meant that it is quicker doing

GFS_VALUE (cell, v) = GFS_STATE (cell)->f[0].un;
GFS_VALUE (cell, v) = 1.;

than simply doing

GFS_VALUE (cell, v) = 1.;
--
www.danielfuster.com
Attachments: optpatch, optimized.gfs

Vladimir Kolobov

Jul 10, 2014, 11:59:26 AM
to gfs-...@lists.sourceforge.net
Hi Daniel,

I was wondering if there has been any progress with speeding up cell
traversal operations in Gerris?

Are you aware of any other codes that do this better than Gerris?

Thanks,

Vladimir

Stephane Popinet

Jul 15, 2014, 9:53:22 AM
to GFS developper discussion list
Hi Vladimir,
> I was wondering if there has been any progress with speeding up cell
> traversal operations in Gerris?

I have been working for about a year now on a new framework which is
meant to replace Gerris at some point. There is already quite a lot of
functionality and documentation available here:

http://basilisk.fr

The overall goal is to obtain a framework which is as efficient as pure
Cartesian mesh implementations and much more efficient than Gerris when
using quad/octrees. It is also a much simpler and better documented code
than Gerris. Initial results show that it is roughly 5 times faster than
Gerris on quadtrees.

I will talk about this at the upcoming meeting:

http://gfs.sourceforge.net/wiki/index.php/Meeting_on_Numerical_challenges_in_two-phase_flows

cheers

Stéphane



Vladimir Kolobov

Sep 8, 2014, 2:53:59 PM
to GFS developper discussion list
Hi Stephane,

I wonder whether Basilisk is designed to do something Gerris cannot,
rather than to incrementally improve on all of Gerris's capabilities.

In particular:

1) Will Basilisk have capabilities for parallel grid generation similar to
http://www.sciencedirect.com/science/article/pii/S0045782514001340 ?

2) Could it be adapted for GPU computing using patches, as GAMER does?
http://iccs.lbl.gov/research/isaac/GAMER_Framework.html

Thanks,

Vladimir