Fully distributed triangulation (level 0)


Yi-Chung Chen

Jan 31, 2017, 9:43:46 AM
to deal.II User Group

Hi all,
I have a project that requires reading a large coarse mesh (> 1M dofs) from gmsh into deal.II. Most cells have their own characteristics, which means I cannot combine them into a coarser mesh.
Currently I have implemented this with the shared-memory triangulation for parallelization. Since I want to scale to a cluster and target a 100M-cell mesh (no mesh refinement needed), I have to use the distributed triangulation via MPI (is there a better solution?). I found that the initial cost is large because the triangulation and the p4est forest are duplicated on every process. I was wondering whether there is a way to remove part of the triangulation or p4est data, or whether I can build the initial distributed triangulation myself. As far as I understand, the vertices and cells need a cross reference for their IDs (cell_id acts like the deal.II-p4est permutation vectors). However, such changes require a comprehensive understanding of deal.II's whole triangulation system. I would appreciate any ideas. Thanks.

Here is one simple trick if you have a similar project. To read cells with different characteristics, the cell id is a good way to identify each cell and assign its own parameters. However, deal.II stores this id as a char in types.h, so the maximum number of ids is 256. A simple fix is to change the type in types.h to unsigned int, together with the corresponding instantiations.
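
For illustration, assigning per-cell parameters through the id read from gmsh could look roughly like this. This is only a sketch, not my actual code: the file name and parameter values are made up, and I use the material id that GridIn fills from the gmsh physical ids.

#include <deal.II/grid/grid_in.h>
#include <deal.II/grid/tria.h>

#include <fstream>
#include <map>

using namespace dealii;

void read_mesh_with_parameters()
{
  Triangulation<3> triangulation;

  GridIn<3> grid_in;
  grid_in.attach_triangulation(triangulation);
  std::ifstream input("design.msh");   // hypothetical mesh file
  grid_in.read_msh(input);

  // Hypothetical per-material parameters, e.g. thermal conductivities.
  std::map<types::material_id, double> conductivity = {{1, 148.0}, {2, 1.4}};

  for (const auto &cell : triangulation.active_cell_iterators())
    {
      const double k = conductivity.at(cell->material_id());
      (void)k; // would be used during assembly on this cell
    }
}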

On the other hand, if you use SolutionTransfer with shared_tria, you can add "(all_out[0]).compress(VectorOperation::insert);" in solution_transfer.cc, so that the interpolate function becomes:
template <int dim, typename VectorType, typename DoFHandlerType>
void SolutionTransfer<dim, VectorType, DoFHandlerType>::interpolate
  (const VectorType &in,
   VectorType       &out) const
{
  Assert (in.size() == n_dofs_old,
          ExcDimensionMismatch(in.size(), n_dofs_old));
  Assert (out.size() == dof_handler->n_dofs(),
          ExcDimensionMismatch(out.size(), dof_handler->n_dofs()));

  std::vector<VectorType> all_in(1);
  all_in[0] = in;
  std::vector<VectorType> all_out(1);
  all_out[0] = out;
  interpolate(all_in, all_out);
  (all_out[0]).compress(VectorOperation::insert); // this helps avoid the interpolate error
  out = all_out[0];
}
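
For context, this patched interpolate() is called from the usual SolutionTransfer workflow, which looks roughly like the sketch below. All names are placeholders for the usual objects of a shared-triangulation program, not my actual code; the PETSc vector header is named differently depending on the deal.II version (petsc_parallel_vector.h in older releases, petsc_vector.h in newer ones), and old_solution is assumed to provide the values needed on all cells this process touches.

#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe.h>
#include <deal.II/grid/tria.h>
#include <deal.II/lac/petsc_parallel_vector.h>
#include <deal.II/numerics/solution_transfer.h>

using namespace dealii;

template <int dim>
void refine_and_transfer(Triangulation<dim>               &triangulation,
                         DoFHandler<dim>                  &dof_handler,
                         const FiniteElement<dim>         &fe,
                         const PETScWrappers::MPI::Vector &old_solution,
                         const MPI_Comm                    mpi_communicator)
{
  SolutionTransfer<dim, PETScWrappers::MPI::Vector> solution_transfer(dof_handler);

  // Refinement flags are assumed to have been set by the caller.
  solution_transfer.prepare_for_coarsening_and_refinement(old_solution);
  triangulation.execute_coarsening_and_refinement();
  dof_handler.distribute_dofs(fe);

  PETScWrappers::MPI::Vector interpolated;
  interpolated.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);

  solution_transfer.interpolate(old_solution, interpolated); // runs the code above
  // 'interpolated' now holds the solution on the refined mesh.
}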

Another simple trick to speed up the setup: turn off the mesh reordering function, order the mesh yourself, and supply a dummy permutation. The point of p4est is to limit communication, so it is enough to make sure adjacent cells are grouped properly. This trick saves a lot of computation time (> 1M dofs).
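
To give an idea of what "order the mesh yourself" means, here is a minimal sketch (illustrative only, standard deal.II API, not my production code) that hands already ordered cells directly to create_triangulation(), so no GridIn/reordering pass is involved:

#include <deal.II/base/point.h>
#include <deal.II/grid/tria.h>

#include <vector>

using namespace dealii;

void build_mesh_directly()
{
  // Two unit squares side by side; vertex numbering follows deal.II's
  // standard ordering for quadrilaterals.
  const std::vector<Point<2>> vertices = {Point<2>(0, 0), Point<2>(1, 0), Point<2>(2, 0),
                                          Point<2>(0, 1), Point<2>(1, 1), Point<2>(2, 1)};

  std::vector<CellData<2>> cells(2);
  cells[0].vertices[0] = 0;  cells[0].vertices[1] = 1;   // left cell
  cells[0].vertices[2] = 3;  cells[0].vertices[3] = 4;
  cells[1].vertices[0] = 1;  cells[1].vertices[1] = 2;   // right cell
  cells[1].vertices[2] = 4;  cells[1].vertices[3] = 5;
  cells[0].material_id = 0;
  cells[1].material_id = 1;

  Triangulation<2> triangulation;
  triangulation.create_triangulation(vertices, cells, SubCellData());
}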

best
YC Chen

Wolfgang Bangerth

Jan 31, 2017, 10:08:10 AM
to dea...@googlegroups.com

YC,

> I have a project that requires reading a large coarse mesh (> 1M dofs) from
> gmsh into deal.II. Most cells have their own characteristics, which means I
> cannot combine them into a coarser mesh.
> Currently I have implemented this with the shared-memory triangulation for
> parallelization. Since I want to scale to a cluster and target a 100M-cell
> mesh (no mesh refinement needed), I have to use the distributed triangulation
> via MPI (is there a better solution?). I found that the initial cost is large
> because the triangulation and the p4est forest are duplicated on every
> process. I was wondering whether there is a way to remove part of the
> triangulation or p4est data.

No, unfortunately there is not. p4est is built on the assumption that the
coarse mesh is replicated on all processors, and deal.II inherits this
assumption. If your coarse mesh has 1M cells, that may just barely be
tolerable, although it will likely lead to inefficient code in some places
where you loop over all cells and almost all of them turn out to be
artificial. But I suspect that you will be in serious trouble if your coarse
mesh has 100M cells.

You should really try to come up with a coarser coarse mesh that you can then
refine.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Yi-Chung Chen

Jan 31, 2017, 4:57:34 PM
to deal.II User Group, bang...@colostate.edu

Hi Wolfgang, 

Thank you for your prompt reply. I was wondering whether I can integrate other partitioning tools, such as METIS or ParMETIS, to handle a fully distributed triangulation. I can develop that part myself (or with some help from the community). Do you have any suggestions? My next project also relies on this, since I will try to manage the number of cells on each processor. With p4est, it is hard to control the number of cells based on an in-house algorithm. My application is IC design, which may involve millions to billions of cells. A fully distributed triangulation helps reduce memory usage. The current shared-memory setup can handle 20M cells (single core) on a system with 32 GB of main memory.
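
For reference, what I use today through the shared triangulation is essentially deal.II's existing METIS hook, roughly like the sketch below (requires deal.II built with METIS; it only assigns cell ownership and keeps the full mesh on every process, so it does not solve the memory problem):

#include <deal.II/grid/grid_tools.h>
#include <deal.II/grid/tria.h>

using namespace dealii;

void partition_replicated_mesh(Triangulation<3> &triangulation,
                               const unsigned int n_partitions)
{
  // Calls METIS internally; afterwards every active cell carries a subdomain id.
  GridTools::partition_triangulation(n_partitions, triangulation);
}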

Any design with 1M cells on the distributed triangulation has a computation-time problem because of the reordering step. This is why I bypassed it and provided an already sorted mesh to GridIn (read_msh). For a problem with 5M cells, this saves about 200 sec in create_triangulation.

Regards
YC Chen

Wolfgang Bangerth

Jan 31, 2017, 5:57:35 PM
to Yi-Chung Chen, deal.II User Group

Yi-Chung,

> Thank you for your prompt reply. I was wondering whether I can integrate
> other partitioning tools, such as METIS or ParMETIS, to handle a fully
> distributed triangulation. I can develop that part myself (or with some
> help from the community). Do you have any suggestions?

I believe that it would not be impossible to develop this, but it will
probably not be a small exercise. You would have to develop a fair share
of code, and gain an understanding of parts of the library you don't
usually get to see if you use deal.II.

If you're willing to do this, we can guide you along the process, but
it's going to be a bit of work for sure.


> My next project also relies on this, since I will try to manage the number
> of cells on each processor. With p4est, it is hard to control the number of
> cells based on an in-house algorithm.

That is actually not quite true. p4est (and the deal.II classes that use
it) allow you to describe a "weight" for each cell, and p4est will
partition in a way that the sum of weights is roughly equal among all
processors.
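
A rough sketch of how this looks in user code follows; the signal has been called cell_weight or weight depending on the deal.II version, and the exact slot signature has changed over time, so take it only as an illustration.

#include <deal.II/distributed/tria.h>

using namespace dealii;

// Assumes MPI has already been initialized (e.g. via Utilities::MPI::MPI_InitFinalize).
void weighted_partitioning_sketch()
{
  parallel::distributed::Triangulation<3> triangulation(MPI_COMM_WORLD);
  // ... create or read the coarse mesh here ...

  // C++14 generic lambda, to stay agnostic about the exact slot signature.
  triangulation.signals.cell_weight.connect(
    [](const auto &cell, const auto /*status*/) -> unsigned int
    {
      // Hypothetical cost model: cells with material id 1 count ten times
      // as much as the others, so each process gets fewer of them.
      return (cell->material_id() == 1) ? 10u : 1u;
    });

  triangulation.repartition(); // weights are also used when refining
}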


> My application is IC design, which may involve millions to billions of
> cells. A fully distributed triangulation helps reduce memory usage. The
> current shared-memory setup can handle 20M cells (single core) on a system
> with 32 GB of main memory.

That's already quite impressive :-) What kind of meshes do you have that
require so many cells? Are they geometrically incredibly complicated to
require that many cells already at the coarse level?


> Any design with 1M cells on the distributed triangulation has a
> computation-time problem because of the reordering step. This is why I
> bypassed it and provided an already sorted mesh to GridIn (read_msh). For a
> problem with 5M cells, this saves about 200 sec in create_triangulation.

Yes, I'm willing to believe this. The algorithm wasn't intended for
meshes of this size, though we did test it with ~300k cells in 2d and we
know that it scales like O(N). So 200 seconds seems like a long time. Is
this in debug mode?

Best
Wolfgang

Timo Heister

Feb 1, 2017, 1:38:23 PM
to dea...@googlegroups.com
> I believe that it would not be impossible to develop this, but it will probably not be a small exercise. You would have to develop a fair share of code, and gain an understanding of parts of the library you don't usually get to see if you use deal.II.
>
> If you're willing to do this, we can guide you along the process, but it's going to be a bit of work for sure.

One of my students is working on the initial steps towards being able to
handle very large distributed triangulations. The main limitation will be
(for now) that the partitioning is static and
needs to happen in an offline process (so you can't really do adaptive
refinement or change the number of processors effectively). This might
be enough for your problem, though.

I can go into more detail if you want.

--
Timo Heister
http://www.math.clemson.edu/~heister/

Yi-Chung Chen

Feb 2, 2017, 1:12:27 PM
to deal.II User Group, yichung...@gmail.com, bang...@colostate.edu
Hi Wolfgang, 



On Tuesday, January 31, 2017 at 10:57:35 PM UTC, Wolfgang Bangerth wrote:

Yi-Chung,

> Thank you for your prompt reply. I was wondering whether I can integrate
> other partitioning tools, such as METIS or ParMETIS, to handle a fully
> distributed triangulation. I can develop that part myself (or with some
> help from the community). Do you have any suggestions?

I believe that it would not be impossible to develop this, but it will
probably not be a small exercise. You would have to develop a fair share
of code, and gain an understanding of parts of the library you don't
usually get to see if you use deal.II.

If you're willing to do this, we can guide you along the process, but
it's going to be a bit of work for sure.

I'm willing to work on this part. Please let me know how I should start; I believe this code will help the community.
 
> My next project also relies on this, since I will try to manage the number
> of cells on each processor. With p4est, it is hard to control the number of
> cells based on an in-house algorithm.

That is actually not quite true. p4est (and the deal.II classes that use
it) allow you to describe a "weight" for each cell, and p4est will
partition in a way that the sum of weights is roughly equal among all
processors.

I know that p4est can partition by weight. However, my current application needs to control exactly how the cells are distributed, so it would be better to control the partitioning myself rather than leave it to p4est.
 
> My application is IC design, which may involve millions to billions of
> cells. A fully distributed triangulation helps reduce memory usage. The
> current shared-memory setup can handle 20M cells (single core) on a system
> with 32 GB of main memory.

That's already quite impressive :-) What kind of meshes do you have that
require so many cells? Are they geometrically incredibly complicated to
require that many cells already at the coarse level?
Actually, this is the interesting part. We try to simulate the thermal profile of an integrated circuit. A CPU has billions of transistors, and each of them has its own power trace as the RHS. That is why we have to start with a large coarse mesh. I did some model reduction for the transistors, but I still want my tool to be able to simulate 100M cells to ensure accuracy.
 

> Any design with 1M cells on the distributed triangulation has a
> computation-time problem because of the reordering step. This is why I
> bypassed it and provided an already sorted mesh to GridIn (read_msh). For a
> problem with 5M cells, this saves about 200 sec in create_triangulation.

Yes, I'm willing to believe this. The algorithm wasn't intended for
meshes of this size, though we did test it with ~300k cells in 2d and we
know that it scales like O(N). So 200 seconds seems like a long time. Is
this in debug mode?
Unfortunately, it was not in debug mode. I guess the reordering is more like O(N^2) or O(N^4), if I recall correctly.
It searches for the cell with the minimum number of neighbors and then searches recursively for its neighbors. With an increasing number of dofs, the time grows very quickly.
For a 2-D problem it might be fast; for a 3-D problem with > 1M dofs it takes a while to reorder.
In my setup, a steady-state 3-D thermal simulation (distributed tria) of a 5M-dof problem on two cores requires 200 sec for reordering, 80 sec for setup_system, 80 sec solver time (PETSc-MPI), 100 sec for output, 80 sec for create_tria, and 45 sec for assembly. Each core requires 9 GB of memory. This is why I want to reduce memory usage.

My solution is to create the mesh myself, since I generate it from a real IC. Here is an example if anyone wants to generate such a mesh. For a 3x3 2-D mesh, I would normally number the cells along x first and then y. However, p4est wants a mesh with minimal (or balanced) communication, so it is better to group neighbors together. The cell ids then become:

7 8 9        7 8 9
4 5 6   =>   3 4 6
1 2 3        1 2 5

In this way, it is easy to distribute the cells by simply splitting the numbering linearly.
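
The renumbering above is essentially a z-order (Morton) curve, which is also what p4est uses internally. For a structured arrangement of cells it can be computed directly; here is an illustrative snippet (not part of my tool) whose ordering reproduces the 3x3 example above:

#include <cstdint>

// Interleave the bits of the (i, j) grid coordinates of a cell. Sorting cells
// by this index keeps neighbors close together in the resulting 1-D numbering,
// which can then be split linearly across MPI ranks.
std::uint64_t morton_index_2d(const std::uint32_t i, const std::uint32_t j)
{
  const auto spread = [](std::uint64_t v) {
    // Insert a zero bit between every bit of the 32-bit input.
    v &= 0xffffffffULL;
    v = (v | (v << 16)) & 0x0000ffff0000ffffULL;
    v = (v | (v << 8))  & 0x00ff00ff00ff00ffULL;
    v = (v | (v << 4))  & 0x0f0f0f0f0f0f0f0fULL;
    v = (v | (v << 2))  & 0x3333333333333333ULL;
    v = (v | (v << 1))  & 0x5555555555555555ULL;
    return v;
  };
  return spread(i) | (spread(j) << 1);
}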


Best
YC Chen

Yi-Chung Chen

Feb 2, 2017, 1:24:53 PM
to deal.II User Group
Hi Timo, 

Thank you for your information.
Yes, please let me know the details or share the code. I just need a fully distributed triangulation that stores only its locally owned data; I don't even need p4est. An MPI protocol that handles the data communication should work. There is no need for adaptivity, since the cells in my case are atomic. I can also work on the code, since I believe it will help the development of a fully distributed triangulation. Actually, a friend and I are working on asynchronous distribution; all of this could be included at the current stage.

Best 
YC Chen

Wolfgang Bangerth

Feb 7, 2017, 12:54:28 PM
to dea...@googlegroups.com

Timo (and Yi-Chung):

> One of my students is working on the initial steps towards being able to
> handle very large distributed triangulations. The main limitation will be
> (for now) that the partitioning is static and
> needs to happen in an offline process (so you can't really do adaptive
> refinement or change the number of processors effectively). This might
> be enough for your problem, though.

Are you willing to share that code, Timo?

I suspect that if implemented right, it should not be terribly difficult to do
refinement of the mesh, but because you can't repartition the coarse mesh, it
will quickly become unbalanced if processors refine differently (i.e., in
practice, if processors do not all refine globally).

Do you implement this by building another class on top of
dealii::Triangulation so that the base class only stores the coarse mesh plus
one layer of ghosts, and the derived class is responsible for the
communication? And then derive another class from DoFHandlerPolicy to deal
with this triangulation?

Best
W.

Wolfgang Bangerth

Feb 7, 2017, 12:54:29 PM
to Yi-Chung Chen, deal.II User Group
Yi-Chung,

> I'm willing to work on this part. Please let me know how I should start; I
> believe this code will help the community.

No doubt! Thank you for your offer of help!

I'm going to comment in more detail below, but will point out that for all
major development, it is always helpful to do it in a way so that you can get
feedback early and often. There is nothing worse than going to work for two or
three months, uploading all of your code, and then getting feedback that
something could have been done in a much simpler way, or should have been done
in a different way to make it more general. In other words, whenever you have
something that is working, put it into a github pull request and let others
take a look and comment on it!


> > My application is IC design, which may involve millions to billions of
> > cells. A fully distributed triangulation helps reduce memory usage. The
> > current shared-memory setup can handle 20M cells (single core) on a system
> > with 32 GB of main memory.
>
> That's already quite impressive :-) What kind of meshes do you have that
> require so many cells? Are they geometrically incredibly complicated to
> require that many cells already at the coarse level?
>
> Actually, this is the interesting part. We try to simulate the thermal
> profile of an integrated circuit. A CPU has billions of transistors, and
> each of them has its own power trace as the RHS. That is why we have to
> start with a large coarse mesh. I did some model reduction for the
> transistors, but I still want my tool to be able to simulate 100M cells to
> ensure accuracy.

That makes sense. My question was more motivated by why you need a *coarse*
mesh that is so fine? If your geometry is relatively simple, but you need high
resolution, then you can just take a coarse mesh with simple cells and refine
it a sufficient number of times. That already works -- we have done
computations on 100M cells many times.

The only reason I can see for a very fine coarse mesh is if you need to
resolve geometries with lots and lots and lots of curved edges and corners.


> Yes, I'm willing to believe this. The algorithm wasn't intended for
> meshes of this size, though we did test it with ~300k cells in 2d and we
> know that it scales like O(N). So 200 seconds seems like a long time. Is
> this in debug mode?
>
> Unfortunately, it was not in debug mode. I guess the reordering is more like
> O(N^2) or O(N^4), if I recall correctly.

Hm, that is strange. Would you be willing to share one or two of your meshes
with one or a few million cells? If you can't share them publicly, can you
share them with me?


> It searches for the cell with the minimum number of neighbors and then
> searches recursively for its neighbors. With an increasing number of dofs,
> the time grows very quickly.

Ah, I see -- that's the reordering in the Cuthill-McKee step in
parallel::distributed::Triangulation. Interesting. I wonder whether we could
come up with a more scalable implementation of this algorithm by building up
different data structures.


> In my setup, a steady-state 3-D thermal simulation (distributed tria) of a
> 5M-dof problem on two cores requires 200 sec for reordering, 80 sec for
> setup_system, 80 sec solver time (PETSc-MPI), 100 sec for output, 80 sec for
> create_tria, and 45 sec for assembly. Each core requires 9 GB of memory.
> This is why I want to reduce memory usage.

Yes, I can see why the reorder step is annoying then.

Timo Heister

Feb 8, 2017, 1:54:05 PM
to dea...@googlegroups.com
> Are you willing to share that code, Timo?

Yes, we will be creating a PR for that soon.

> I suspect that if implemented right, it should not be terribly difficult to
> do refinement of the mesh, but because you can't repartition the coarse
> mesh, it will quickly become unbalanced if processors refine differently
> (i.e., in practice, if processors do not all refine globally).

The issue with that is that you would need to update the ghost layer
correctly. Doable, but definitely requires some extra thought.

> Do you implement this by building another class on top of
> dealii::Triangulation so that the base class only stored the coarse mesh
> plus one layer of ghosts, and the derived class is responsible for the
> communication? And then derive another class from DoFHandlerPolicy to deal
> with this triangulation?

Yes. It is derived from parallel::Triangulation (and an alternative to
shared::Tria and distributed::Tria).

Timo Heister

Feb 9, 2017, 4:33:05 PM
to dea...@googlegroups.com

Yi-Chung Chen

Feb 10, 2017, 7:58:14 AM
to deal.II User Group, yichung...@gmail.com, bang...@colostate.edu
Hi Wolfgang,

Thank you for the reply. Unfortunately, Google does not allow me to attach large meshes here (the file is too big: > 300 MB for a 3-D mesh with 2.5M dofs in .msh format). Is there another way? I would like to share it with you.

Here is the website of our project:
MTA

I did not share the deal.II code in the package, simply because the program needs a customized deal.II to compile. Therefore, I embedded all dependencies and shared it in binary form. The mesh generator inside the package can create a coarse mesh based on a definition of the structure (an xml file) and do refinement. For some examples I do not refine, because the mesh is already very fine. I plan to share the code in the future. My friends and I implemented some time steppers such as BDF, stable trapezoidal, etc. I believe this would help the community.

I find the best way to avoid the reordering step is to create the mesh yourself and bypass it. I'm also working on this; if there is any progress, I'd like to share it.

Thanks
YC Chen

Yi-Chung Chen

Feb 10, 2017, 8:00:15 AM
to deal.II User Group
Thank you, Timo.
I'm going to reconfigure my deal.II for this. Hopefully I can migrate my whole design to it soon. I'm sure I can report some performance analysis later on.

Regards
YC Chen

Timo Heister

Feb 10, 2017, 8:11:17 AM
to dea...@googlegroups.com
> I'm going to reconfigure my deal.II for this. Hopefully I can migrate my
> whole design to it soon. I'm sure I can report some performance analysis
> later on.

That would be interesting to see. Keep in mind that this is still
quite far from being usable, though. Right now I can create meshes,
but a lot of the data structures are not set up correctly yet and I
haven't started working on distributing DoFs.