Independent tasks appear to be running sequentially on multiple threads


Михаил Мичуров

Apr 1, 2025, 9:42:07 PM
to Legion Users
Greetings again,

I am trying to implement a program that solves a differential equation on a 2D grid (source code is in the attachment). Unfortunately, I have not been able to achieve any speedup by utilizing multiple worker threads. I've also used Legion Prof to check if the tasks that should run in parallel (jacobi_step) do, and it turned out they don't.
ZeNMk4zEdlQ.jpg
Then I used Legion Spy to check whether these tasks are independent, and if I understand its output correctly, they are.
q2UHQWzQKUI.jpg
I am using the image function to partition the grid into overlapping (input) and non-overlapping (output) rectangular subregions (tasks partition_ghost_2d and partition_interior_2d), since I could not find any documentation for restrict. However, I did use -lg:partcheck to verify that the disjoint output partition really is disjoint.

Is there something I can do to make the tasks run in parallel?

Thank you,

Mikhail


jacobi.rg

Elliott Slaughter

Apr 2, 2025, 7:19:14 PM
to Михаил Мичуров, Legion Users
Hi Mikhail,

You have an if statement inside your for loop. That will defeat the index launch optimization, and depending on the exact set of optimizations you have enabled, possibly cause the entire execution to block on every iteration.
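
To illustrate, a loop body shaped roughly like this (I'm only guessing at your exact names and condition here; the point is the branch around the task call) cannot be turned into a single index launch:

    -- hypothetical reconstruction of the problematic loop, not your actual code
    for i = 0, num_subregions * num_subregions do
        if iterations % 2 == 0 then
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_even[i], interior_partition_odd[i], deltas_partition[i])
        else
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_odd[i], interior_partition_even[i], deltas_partition[i])
        end
    end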

I would try to phrase your application so that you can launch all of the parallel tasks at once.

I'm not that familiar with the Jacobi method, but can you do something like:

    while delta >= eps do
        __demand(__index_launch)
        for i = 0, num_subregions * num_subregions, 2 do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_even[i], interior_partition_odd[i], deltas_partition[i])
        end
        __demand(__index_launch)
        for i = 1, num_subregions * num_subregions, 2 do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_odd[i], interior_partition_even[i], deltas_partition[i])
        end

I added the __demand(__index_launch) to make the compiler throw an error if the index launch optimization fails. If the code compiles like this, there is no possible way to miss the optimization.

Anyway, this shows you the general pattern you can use, and if your parallelism structure is more complicated (e.g., you need diagonal sets of tiles, or something), hopefully you can see how to structure things.



--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

Михаил Мичуров

Apr 3, 2025, 7:20:21 AM
to Legion Users
Hello Elliott,

Thank you for your suggestion. I will try to implement this pattern and will report the results.

On Thursday, April 3, 2025 at 06:19:14 UTC+7, Elliott Slaughter wrote:

Михаил Мичуров

Apr 6, 2025, 6:52:30 AM
to Legion Users
My main loop now looks like this:

    while delta >= eps do
        __demand(__index_launch)
        for i = 0, num_subregions * num_subregions do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_even[i], interior_partition_odd[i], deltas_partition[i])
        end

        __demand(__index_launch)
        for i = 0, num_subregions * num_subregions do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_odd[i], interior_partition_even[i], deltas_partition[i])
        end

        delta = max1d(deltas)

        iterations += 2
    end

Sadly, the tasks still run sequentially, and increasing the number of worker threads does not seem to affect the execution time at all.
Screenshot 2025-04-06 172954.png

Anything else I could try?

Thank you,

Mikhail


On Thursday, April 3, 2025 at 18:20:21 UTC+7, Михаил Мичуров wrote:

Michael Bauer

Apr 8, 2025, 1:34:24 PM
to Михаил Мичуров, Legion Users
How many processors are you configuring Realm with when you run this application?

Михаил Мичуров

Apr 10, 2025, 11:15:00 AM
to Legion Users
Do you mean the -ll:cpu option? I'm running my program using the following command (omitting program arguments):
regent.py jacobi.rg -ll:cpu 4
This was the only way I managed to find to specify the number of threads.
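
For completeness, when I also collect a profile, the full command looks roughly like this (assuming I'm using the standard Legion profiler flags correctly; program arguments omitted again):

regent.py jacobi.rg -ll:cpu 4 -lg:prof 1 -lg:prof_logfile prof_%.log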


On Wednesday, April 9, 2025 at 00:34:24 UTC+7, Michael Bauer wrote:

Michael Bauer

Apr 13, 2025, 7:24:16 PM
to Михаил Мичуров, Legion Users
Do you have Legion Prof log files for us to look at? The visualization you shared above looks like it was generated by some other profiling tool.

Михаил Мичуров

Apr 14, 2025, 8:52:57 AM
to Legion Users
Sure, the log file is in the attachment. I'm not sure it was produced by exactly the same code as before, since I might have changed the code since my last messages, but the problem is the same.
I was using legion_prof trace to convert the traces to the Google Trace Viewer format and viewing them in the Perfetto UI.

On Monday, April 14, 2025 at 06:24:16 UTC+7, Michael Bauer wrote:
prof_0.log

Michael Bauer

Apr 16, 2025, 5:15:10 AM
to Михаил Мичуров, Legion Users
Which commit of Legion are you using for generating this profile? It doesn't seem to be either the latest release or the latest master branch.

Михаил Мичуров

May 7, 2025, 10:44:19 AM
to Legion Users
Greetings again!

I apologize for taking so long to reply.

I have now reinstalled Legion; it is up to date with origin/master according to git. Unfortunately, the issue is still there.
The tasks are still executed by a single thread:
image_2025-05-07_21-30-19.png

The log file is in the attachment.

I have also tried running matrix multiplication on multiple threads, and it seems to work as expected: submatrices are computed in parallel.
On Wednesday, April 16, 2025 at 16:15:10 UTC+7, Michael Bauer wrote:
prof_0.log

Michael Bauer

May 14, 2025, 4:19:17 AM
to Михаил Мичуров, Legion Users
Looking at this profile in more detail, what it looks like to me is that you are blocking in your top-level task on every single iteration of your Jacobi solver to test for convergence. The top-level task runs a single Jacobi iteration task and then blocks to wait for the result of that task (you can see from the backtrace that it is waiting on a future):
Screenshot 2025-05-14 at 1.12.30 AM.png
Meanwhile, there is only one jacobi_step task to run, so it runs on the one CPU processor. It eventually produces the future result, which wakes up the top-level task, and then the cycle repeats itself. I suspect that the reason you're not getting any task parallelism within the iterations is that "num_subregions" == 1, and that is why you're only seeing one sub-task at a time for each Jacobi iteration. Can you confirm whether that is the case? You should be able to put in an assertion before the loop. Regent's __demand(__index_launch) syntax should fail at compile time if it can't actually convert the loop to an index launch.

I would also recommend not checking for convergence every iteration, so that Legion's dependence analysis can get ahead and not expose any latency when launching sub-tasks. Maybe only check for convergence every 100 iterations (or fewer if you want the convergence check to be tighter, but the larger the interval the better).
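
Something along these lines, for example (a rough sketch reusing the names from the loop you posted earlier; "check_interval" is arbitrary and I haven't compiled this):

    regentlib.assert(num_subregions * num_subregions > 1, "expected more than one subregion")

    var check_interval = 100  -- arbitrary; tune as needed
    while delta >= eps do
        for k = 0, check_interval / 2 do
            __demand(__index_launch)
            for i = 0, num_subregions * num_subregions do
                jacobi_step(X0, Y0, h_x, h_y, ghost_partition_even[i], interior_partition_odd[i], deltas_partition[i])
            end

            __demand(__index_launch)
            for i = 0, num_subregions * num_subregions do
                jacobi_step(X0, Y0, h_x, h_y, ghost_partition_odd[i], interior_partition_even[i], deltas_partition[i])
            end
        end

        -- the only blocking point: one future wait per check_interval iterations
        delta = max1d(deltas)
        iterations += check_interval
    end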

Михаил Мичуров

May 19, 2025, 12:35:39 AM
to Legion Users
> I suspect that the reason that you're not getting any task parallelism within the iterations is that "num_subregions" == 1 and that is the reason you're only seeing one sub-task at a time for each jacobi iteration. Can you confirm if that is the case or not?
I'm pretty sure there are 4 subregions. When I run my program with a small fixed number of iterations, there seem to be 4 * iterations "jacobi_step" tasks, meaning (if I understand correctly) that 4 subregions are processed in each iteration. Below is an example for 10 iterations:
Screenshot from 2025-05-19 11-10-38.png
I've also removed convergence checks completely and fixed the number of iterations:
    var max_iter = 620

    for i = 0, max_iter / 2 do
        __demand(__index_launch)
        for i = 0, num_subregions * num_subregions do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_even[i], interior_partition_odd[i], deltas_partition[i])
        end

        __demand(__index_launch)
        for i = 0, num_subregions * num_subregions do
            jacobi_step(X0, Y0, h_x, h_y, ghost_partition_odd[i], interior_partition_even[i], deltas_partition[i])
        end
    end

It seems like the program speeds up now: it runs for 10.7s, 9.6s and 8.7s on 1, 2 and 4 threads (-ll:cpu option), respectively. However, it still does not look like all threads are utilized at all times. A zoomed-in fragment of the trace:
Screenshot from 2025-05-19 11-27-30.png

The log file is in the attachment.

On Wednesday, May 14, 2025 at 15:19:17 UTC+7, Michael Bauer wrote:
prof_620_fixed_0.log

Michael Bauer

May 26, 2025, 5:55:43 AM
to Михаил Мичуров, Legion Users
This profile is different from the one you sent before. You're not getting the warning from the profiler that you've used the default mapper for performance modeling, so presumably you've written a custom mapper that overrides the default mapper (or just doesn't inherit from it at all). Given that these do look like index space task launches, I believe there is likely a performance bug in your mapper that is assigning lots of sub-tasks to the first CPU processor and not distributing the point tasks evenly. How did you design your mapper?

Михаил Мичуров

May 26, 2025, 8:47:22 PM
to Legion Users
I am getting the warning; I just didn't think it was worth mentioning since I was getting it all the time (I have been using the default mapper).

Just to be sure, I've collected the profile one more time, and here's the output:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! YOU ARE PROFILING USING THE DEFAULT MAPPER!!!
!!! THE DEFAULT MAPPER IS NOT FOR PERFORMANCE !!!
!!! PLEASE CUSTOMIZE YOUR MAPPER TO YOUR      !!!
!!! APPLICATION AND TO YOUR TARGET MACHINE    !!!
First use of the default mapper in address space 0
occurred when task main (UID 2) invoked the "select_task_options" mapper call
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!WARNING WARNING WARNING WARNING WARNING WARNING!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

N = 2000
num_subregions = 2
eps = 0.0000000100
[0 - 70a0e6290780]    0.107202 {4}{runtime}: LEGION WARNING: Region requirement 1 of operation Copy (UID 10, provenance: jacobi.rg:364) in parent task main (UID 2) is using uninitialized data for field(s) 101 of logical region (1,2,2) (from file /home/mmichurov/Apps/legion/runtime/legion/legion_ops.cc:1894)
algorithm start
iterations = 620
difference = 0.0000203955


And the profile doesn't look too different from the last one:
Screenshot from 2025-05-27 07-35-21.png
The log file is attached.

I assume it is possible to write a custom mapper that explicitly maps tasks to different cores. Could you please provide an example? The Legion Mapper API looks a bit overwhelming, and I can't find any documentation on writing mappers in Regent.
On Monday, May 26, 2025 at 16:55:43 UTC+7, Michael Bauer wrote:
prof_620_fixed_0.log