Parallel computation fails to converge

Junxiang Wang

Feb 7, 2025, 4:51:50 AM
to deal.II User Group
Dear all,

I am modelling a multiphysics problem in parallel.

The code runs perfectly in serial or on a small number of cores, but fails to converge once I run it on more cores, say more than 8.

What could cause this problem?

Thanks a lot.

Junxiang.

Bruno Turcksin

Feb 7, 2025, 9:42:04 AM
to deal.II User Group
Hello Junxiang,

It's very hard to know what the issue could be. What you could do is print some information about your matrix and right hand side (for instance, different norms) and check that they are the same in serial and in parallel. By comparing the serial and parallel simulations at different points in your code, you should be able to find out why there is a difference.
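
For example, a minimal sketch of such a check, assuming Trilinos wrappers and that your objects are called system_matrix, system_rhs, and pcout (adjust to whatever you actually use):

    // These numbers should agree (up to roundoff) between the serial
    // and the parallel run, independently of the number of MPI ranks.
    pcout << "matrix Frobenius norm: " << system_matrix.frobenius_norm() << '\n'
          << "matrix l1 norm:        " << system_matrix.l1_norm() << '\n'
          << "rhs l2 norm:           " << system_rhs.l2_norm() << '\n'
          << "rhs linfty norm:       " << system_rhs.linfty_norm() << std::endl;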

Best,

Bruno

Junxiang Wang

Feb 8, 2025, 6:24:07 AM
to deal.II User Group
Dear Bruno,

Thanks a lot for your advice. 

I tested it. The code running in parallel has exactly the same stiffness-matrix norm as the serial run, but a different norm of the RHS residual.

It is really frustrating that increasing the number of cores can make a difference...

Best,

Junxiang.

Junxiang Wang

Feb 8, 2025, 6:26:06 AM
to deal.II User Group
In the first 80 steps, the RHS and stiffness matrix are all the same. After several steps of mesh refinement, the RHS becomes different from the serial one.

Subramanya G

Feb 8, 2025, 7:25:34 AM
to dea...@googlegroups.com
Are you making sure you're initializing the vector to 0 in release mode?

Subramanya.  
सुब्रह्मण्य .


Wolfgang Bangerth

Feb 8, 2025, 11:37:20 AM
to dea...@googlegroups.com
On 2/8/25 04:24, Junxiang Wang wrote:
>
> It is really frustrating that increasing the number of cores can make a difference...

Yes, but at least you now have something that you know should not happen, but
does happen. It is a very concrete thing (unlike "solver not converging") that
you can explore and debug because you know that the vectors should be the same
regardless of the number of cores you are using.

Best
W.

Bruno Turcksin

Feb 8, 2025, 4:57:10 PM
to dea...@googlegroups.com
What happens if you don't refine the mesh? Is everything still the same?

Bruno

Junxiang Wang

Feb 10, 2025, 1:18:21 AM
to deal.II User Group
Dear all,

Thanks a lot for your advice.

The code works fine in both parallel and serial if the mesh remains the same without any refinement. 

To further investigate the issue, I exported the global residual from both the serial run and a parallel run with a high number of cores.

I found that one particular nodal residual value becomes abnormal, as shown in the attached pictures.

Can I therefore simply add up the residual values of local_rhs at the node on the edge shared by the two elements, and compare the sum to the final residual vector, system_pde_residual?


          if (residual_only)
            {
              constraints_update.distribute_local_to_global(
                local_rhs, local_dof_indices, system_pde_residual);

              {
                constraints_update.distribute_local_to_global(
                  local_rhs, local_dof_indices, system_total_residual);
              }
            }
          else
            {
              constraints_update.distribute_local_to_global(
                local_matrix,
                local_rhs,
                local_dof_indices,
                system_pde_matrix,
                system_pde_residual);
            }


Best,

Junxiang.
[Attachments: parallel.png, serial.png]

Bruno Turcksin

Feb 10, 2025, 8:02:49 AM
to dea...@googlegroups.com
Hello Junxiang,

It looks like the issue happens at a hanging node. In step-40, we show how to see which processor each cell belongs to. I think it would be interesting to output that image and see if the issue happens because the hanging node is shared between different processors.
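
For reference, step-40 does this roughly as follows (a minimal sketch; triangulation and data_out here stand for your own objects):

    // Write the owning subdomain of each locally owned cell into a
    // per-cell vector and attach it to the graphical output.
    Vector<float> subdomain(triangulation.n_active_cells());
    for (unsigned int i = 0; i < subdomain.size(); ++i)
      subdomain(i) = triangulation.locally_owned_subdomain();
    data_out.add_data_vector(subdomain, "subdomain");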

Best,

Bruno

Junxiang Wang

Feb 11, 2025, 2:32:47 AM
to deal.II User Group
Hi Bruno,

Thanks a lot.

I exported the subdomain_id.

334.png is the mesh before refinement and 335.png is the refined mesh; 334_1 is processor 1, 334_2 is processor 2, etc.

Based on the images, the parts of the mesh owned by processor 0 and processor 1 both change after refinement. The abnormal residual occurs right where the ownership changes.

Could this be the cause of the problem?

Best 

Junxiang.

[Attachments: 335.png, 334_0.png, 334_1.png, 334.png, 335_0.png, 335_1.png]

Junxiang Wang

Feb 11, 2025, 5:17:39 AM
to deal.II User Group
Hi Bruno,

I further printed the RHS of the two elements that carry the abnormal residual value. It turns out the nodal values are all the same between the serial and parallel simulations.

Can I therefore interpret the issue as a problem with distributing the local_rhs when the hanging node lies on an edge shared by two elements that belong to different processors?

Best 

Junxiang.

Junxiang Wang

Feb 11, 2025, 5:43:57 AM
to deal.II User Group
Sorry, the node distribution is as follows.
[Attachments: 335_0_node.png, 335_1_node.png]

Wolfgang Bangerth

Feb 11, 2025, 11:23:03 AM
to dea...@googlegroups.com
On 2/11/25 03:17, Junxiang Wang wrote:
>
> Can I therefore interpret the issue as a problem with distributing the
> local_rhs when the hanging node lies on an edge shared by two elements
> that belong to different processors?

Junxiang:
I don't know what the concrete problem in your code is, but in general deal.II
can deal with hanging nodes just fine. You just need to make sure that you
distribute local contributions into the global matrix and vector objects using
the AffineConstraints::distribute_local_to_global() functions. step-40 is the
canonical place to see how that is done. I would try to follow the structure
shown in step-40 as closely as possible in your code.
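
For reference, the parallel assembly loop in step-40 looks roughly like this (a minimal sketch; your matrix, vector, and constraint objects will have different names), including the compress() calls after the loop:

    for (const auto &cell : dof_handler.active_cell_iterators())
      if (cell->is_locally_owned())
        {
          // ... assemble cell_matrix and cell_rhs for this cell ...
          cell->get_dof_indices(local_dof_indices);
          constraints.distribute_local_to_global(cell_matrix,
                                                 cell_rhs,
                                                 local_dof_indices,
                                                 system_matrix,
                                                 system_rhs);
        }

    // Exchange contributions that were added to rows owned by other processors:
    system_matrix.compress(VectorOperation::add);
    system_rhs.compress(VectorOperation::add);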

My suggestion for you is to see if you can simplify the situation (say, use a
right hand side f=1 for simplicity), output the local contributions to make
sure you get the same with one or multiple processes, and output the completed
right hand side vector (again to make sure you get the same with one and
multiple processes).

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/


Junxiang Wang

Feb 14, 2025, 2:08:00 AM
to deal.II User Group
Dear all,

Thanks a lot for your suggestions. I checked the problem further and conducted several rounds of debugging based on step-40.

The problem remains and has become clearer.

Basically, in the serial model or with a small number of cores, the hanging-node element and its adjacent element always belong to the same subdomain, even after several levels of refinement. In this case, everything works perfectly.

Once the hanging-node element and its adjacent element end up on different processors after refinement, as happens with a larger number of cores, the issue occurs. For example, the hanging-node elements and their adjacent elements all belong to subdomain 0 before refinement (334_m.png). After refinement, the hanging-node element belongs to subdomain 0, while the adjacent element belongs to subdomain 1 (335_m.png). We verified that the abnormal residual always appears together with this kind of redistribution of the hanging-node element and its adjacent element onto different processors.

We also exported the rhs of the abnormal element before refinement:

 dofs | nodal value
   26 |   -1.73171
   27 |   34.7134
   30 |   -2.69509
   31 |   40.9745
  166 |    1.19627
  167 |  -39.1402
  168 |    3.23053
  169 |  -36.5477

and after refinement:

 dofs | nodal value
    0 | -1.73171e+00
    1 |  3.47134e+01
    2 | -2.69509e+00
    3 |  4.09745e+01
    4 |  1.19627e+00
    5 | -3.91402e+01
    6 |  3.23053e+00
    7 | -3.65477e+01

It turns out that the element rhs values are the same.

Will the distribution process change the data, or is there anything I need to take care of?

The code also includes solution data interpolation and history data transfer to the new refined mesh.

I would deeply appreciate any further advice on how to tackle this issue.

Best,

Junxiang.
[Attachments: residual.png, 334_m.png, 335_m.png]

Wolfgang Bangerth

Feb 19, 2025, 5:07:11 PM
to dea...@googlegroups.com


On 2/14/25 00:07, Junxiang Wang wrote:
>
> Will the distribution process change the data, or is there anything I
> need to take care of?
>
> The code also includes solution data interpolation and history data
> transfer to the new refined mesh.

My general strategy for cases such as yours is to make the program as
simple as possible. Save the current state of the program somewhere
(ideally in a version control system). Then throw out everything that is
not necessary to demonstrate the problem. I imagine that you don't need
the solution data interpolation and history data transfer, for example.
Pretty much every problem of this sort I've ever run into could be
demonstrated with a 200-line code, and often substantially less.


> I would deeply appreciate any further advice on how to tackle this issue.

I don't really have any further suggestions on how to debug this, other
than stating that we have been using adaptive mesh refinement in
parallel for ~15 years now and do not observe these sorts of issues. I
am pretty sure that you can find similar cases to yours when you run
step-40, for example.

The only thing I could suggest is to show us how you initialize the
AffineConstraints object. It needs to know both the locally owned and
the locally relevant DoFs -- you may want to take a look at step-40 as
an example.
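
For reference, a minimal sketch of the step-40-style initialization (assuming a recent deal.II release; older releases pass only the locally relevant DoFs to reinit()):

    const IndexSet locally_owned_dofs = dof_handler.locally_owned_dofs();
    const IndexSet locally_relevant_dofs =
      DoFTools::extract_locally_relevant_dofs(dof_handler);

    AffineConstraints<double> constraints;
    constraints.reinit(locally_owned_dofs, locally_relevant_dofs);
    DoFTools::make_hanging_node_constraints(dof_handler, constraints);
    // ... boundary conditions, etc. ...
    constraints.close();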

Best
W.