Questions about debugging Legion/Regent programs

Михаил Мичуров

Nov 7, 2024, 7:03:49 AM
to Legion Users

Greetings,

I’m a postgraduate student at the Institute of Computational Mathematics and Mathematical Geophysics of the Siberian Branch of the Russian Academy of Sciences. For my thesis, I’m researching automated debugging methods implemented in various task-based parallel programming systems. From what I understand, both the Legion runtime and the Regent compiler implement several automated checks for common error types. I would like to ask several questions about how these checks are implemented and about the errors that are possible in Legion/Regent programs:


1. Where can I learn about how the runtime checks (privilege checks, bounds checks and partition checks) are implemented? 

2. What classes of mistakes are 1) impossible in Legion but are possible in lower level distributed programming, 2) possible in Legion but are detected by Regent compiler or optional runtime checks and 3) still possible in both Legion and Regent and cannot be detected automatically?

3. In particular, is it possible for a deadlock caused by a cycle in the task dependency graph to occur in a program that uses Legion C++ API? Are data races possible in Legion/Regent and if so, what could cause them?


Thank you,

Mikhail

Elliott Slaughter

Nov 7, 2024, 1:54:08 PM
to Михаил Мичуров, Legion Users
Hi Mikhail,

Legion is a C++ API and therefore it is always possible to do silly and nonsensical things like:

int *x = nullptr;
printf("%d\n", *x);

So in all of the discussion below there is some assumption of a basic level of sanity in the user code (otherwise all bets are off). But generally speaking:
  • Deadlocks are impossible in Legion and Regent. They simply cannot happen unless you call out to another networking library like MPI (which, therefore, would not be a pure Legion program).
  • Data races are also impossible in both Legion and Regent with the same caveat.
  • As a general rule, a well-behaved Legion program will always have sequential semantics. That rules out by definition every class of parallel bug. So then the remaining question is how the user might violate the Legion programming model itself, because these are the ways that things can go wrong.
  • Legion provides dynamic checks for privileges, bounds and partitions. Some checks are more expensive than others and require additional opt-in settings. Specifically:
    • If a task calls a subtask without the correct privileges, Legion will issue an error at runtime. This check is always on.
    • If a task attempts to create an accessor without the appropriate privilege, Legion will issue an error at runtime. This might be on by default now?
    • If a task accesses a region out of bounds, Legion will issue an error at runtime. This check requires explicit opt-in at compile time because it is very expensive.
    • If the user creates a partition with the incorrect annotations (disjoint/complete when it is not or vice versa), Legion will issue an error at runtime. This check is behind an opt-in runtime flag because it is expensive.
    • If the user makes a control replicated task which violates control determinism, Legion will issue an error at runtime. This check is behind an opt-in runtime flag because it is expensive.
    • Legion does not track the types of futures, fields, etc. That means if you create a field of type int and access it as float, Legion will not detect the error. This is currently just up to the user to catch.
  • In general Regent checks all of the above, except in cases where it's especially expensive. So:
    • A task calling a subtask without correct privileges is a compile time error in Regent.
    • A task accessing a region without correct privileges is a compile time error in Regent.
    • A task accessing a region out of bounds is a runtime error and requires an opt-in flag, because it is expensive to check, even for a compiler.
    • In most cases Regent correctly infers the types of partitions. In some cases it cannot do so and then the checks are left to runtime (behind a flag).
    • Regent has a conservative check for control determinism. I believe that if Regent says a task is safe to control replicate, that will be true. However, users who use a lot of C/C++ code from Regent will often need to circumvent this check (note: specifically for their C/C++ code only) because Regent can't see into C/C++ code to check it. So for pure Regent programs this works well and users are on their own if they compile Regent with C/C++. Note that even in the combined case they can still fall back to the Legion checks.
    • Regent does track types for fields, futures, etc. so those types of bugs are checked at compile time.
  • Overall, I think everything can be checked automatically as long as:
    • You are willing to pay for (potentially expensive) runtime checks at least in some cases.
    • You do not write inherently unsafe C/C++ as noted at the top of this email.
    • You do not call out to external parallel libraries like MPI.
Hope that helps. You may also be interested in the Legion manual, which has among other things, sections on accessors (where privilege/bounds checks are performed) and partitioning: https://legion.stanford.edu/pdfs/legion-manual.pdf

Anything that requires a dynamic check will have a flag. So you can also be sure you're getting a complete list of checks by consulting the Legion documentation on available flags:


Hope this helps.

--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

Michael Bauer

Nov 9, 2024, 6:17:31 PM
to Михаил Мичуров, Legion Users, slau...@cs.stanford.edu
I'll add a few thoughts as well:

> 1. Where can I learn about how the runtime checks (privilege checks, bounds checks and partition checks) are implemented? 

See some of the links below. If you have questions about what specific bits of code are doing please ask as a follow-up.

> What classes of mistakes are 1) impossible in Legion but are possible in lower level distributed programming, 2) possible in Legion but are detected by Regent compiler or optional runtime checks and 3) still possible in both Legion and Regent and cannot be detected automatically?

I would generally agree that Legion/Regent code should always have sequential semantics, so unless you opt in to one of the more advanced features (e.g., relaxed coherence modes), pretty much every kind of parallel programming bug that you might normally encounter (races, deadlocks, livelocks) should be impossible. You can still introduce logic bugs that make your code incorrect, but that can happen in any program.

> Data races are also impossible in both Legion and Regent with the same caveat.

Unless you use the relaxed coherence modes like "simultaneous". If you use "simultaneous" coherence then you can get two tasks using the same physical instance concurrently, but only because you asked for it, and then it's up to you to manage any races. In general, though, using "exclusive" or "atomic" coherence should guarantee that your program is free of data races. There is one other place you can get races, which is inside your task bodies. If your task uses parallel execution in its body (e.g., executing an OpenMP parallel loop or launching a CUDA kernel), then Legion doesn't do anything to protect you from races that might happen in those contexts. Races between different tasks, though, are in general impossible without relaxed coherence modes.

> In particular, is it possible for a deadlock caused by a cycle in the task dependency graph to occur in a program that uses Legion C++ API?

If Legion is implemented correctly then you should never be able to get a cycle in the Realm event graph that Legion constructs. That said, there have occasionally been bugs in the implementation of Legion that have introduced such cycles. Legion has a formal verification tool called Legion Spy that checks that the Realm event graph that Legion creates maintains the sequential semantics of the original Legion program. Included in Legion Spy is a cycle detector (the '-c' option) to help identify deadlocks (which are always a bug in the implementation of Legion and not something the user should ever be able to induce). Legion Spy formally validates Legion's execution for large parts of our test suite on both single node and multi-node execution in the Legion CI. You can also use Legion Spy for validating that Legion did the right thing for your program too (up to the point where the verification becomes too expensive).
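
To illustrate what a cycle detector over an event graph does in principle, here is a minimal sketch. It is not Legion Spy's implementation: the `EventGraph` type and the `has_cycle` function are hypothetical names made up for this example, and the graph is just an adjacency list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event graph: g[i] lists the events that wait on event i.
using EventGraph = std::vector<std::vector<std::size_t>>;

enum class Mark { Unvisited, InProgress, Done };

// Depth-first search. Reaching a node that is still InProgress means we
// followed edges back to an ancestor, i.e. the graph has a cycle.
static bool visit(const EventGraph &g, std::size_t node,
                  std::vector<Mark> &marks) {
  marks[node] = Mark::InProgress;
  for (std::size_t next : g[node]) {
    assert(next < g.size());  // every edge must point at a real node
    if (marks[next] == Mark::InProgress) return true;
    if (marks[next] == Mark::Unvisited && visit(g, next, marks)) return true;
  }
  marks[node] = Mark::Done;
  return false;
}

// Returns true if any cycle exists (which would mean a potential deadlock).
bool has_cycle(const EventGraph &g) {
  std::vector<Mark> marks(g.size(), Mark::Unvisited);
  for (std::size_t n = 0; n < g.size(); ++n)
    if (marks[n] == Mark::Unvisited && visit(g, n, marks)) return true;
  return false;
}
```

A diamond-shaped graph such as `{{1, 2}, {3}, {3}, {}}` has no cycle, while `{{1}, {2}, {0}}` does.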

> Are data races possible in Legion/Regent and if so, what could cause them?

Data races are possible only if the user opts into them by using relaxed coherence modes or by doing something illegal like accessing data outside of the logical region they requested privileges on or accessing such data with the wrong privileges (e.g., writing when they requested read-only). If you use normal Legion accessors and turn on the bounds checks (expensive) then we can eliminate all illegal uses of data and prevent all races except the ones the user has explicitly sanctioned.

> If a task attempts to create an accessor without the appropriate privilege, Legion will issue an error at runtime. This might be on by default now?

If you use the Legion FieldAccessor class then it is always checked. If you use the UnsafeFieldAccessor class then it will not be checked.
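
As a rough illustration of what "checked at accessor creation" means, here is a toy sketch. It does not use the Legion API: `ToyRegion`, `CheckedAccessor` and the privilege constants are hypothetical stand-ins invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical privilege bits for this sketch only.
constexpr unsigned READ = 1u;
constexpr unsigned WRITE = 2u;
constexpr unsigned READ_WRITE = READ | WRITE;

// Toy stand-in for a region as seen by a task: data plus granted privileges.
struct ToyRegion {
  std::vector<int> data;
  unsigned granted;
};

// Construction fails at runtime unless every requested privilege was granted.
template <unsigned REQUESTED>
class CheckedAccessor {
public:
  explicit CheckedAccessor(ToyRegion &region) : region_(region) {
    if ((REQUESTED & region.granted) != REQUESTED)
      throw std::runtime_error("accessor requests a privilege not held");
  }
  int read(std::size_t i) const {
    static_assert((REQUESTED & READ) != 0, "accessor lacks read privilege");
    assert(i < region_.data.size());
    return region_.data[i];
  }
  void write(std::size_t i, int value) const {
    static_assert((REQUESTED & WRITE) != 0, "accessor lacks write privilege");
    assert(i < region_.data.size());
    region_.data[i] = value;
  }
private:
  ToyRegion &region_;
};
```

With a region granted only `READ`, constructing `CheckedAccessor<READ>` succeeds and constructing `CheckedAccessor<READ_WRITE>` throws. An unchecked variant would simply skip the test in the constructor.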

> If a task accesses a region out of bounds, Legion will issue an error at runtime. This check requires explicit opt-in at compile time because it is very expensive.

This is controlled by the CHECK_BOUNDS template parameter on the FieldAccessor class. This also means you can opt in to bounds checking on individual accessors, without turning it on across your whole application, if you're tracking down a bug in a specific task. That said, the most common approach is to turn on bounds checks for all accessors at once.
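
The general pattern of a compile-time bounds-check switch can be sketched as follows. This is a toy, not the real FieldAccessor: `ToyAccessor` is a hypothetical class over a `std::vector`, and only the parameter name CHECK_BOUNDS is borrowed from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy accessor: bounds checking is selected per accessor type at compile time.
template <typename T, bool CHECK_BOUNDS = false>
class ToyAccessor {
public:
  explicit ToyAccessor(std::vector<T> &storage) : storage_(storage) {}

  T &operator[](std::size_t i) const {
    // With CHECK_BOUNDS == false the condition is constant-false, so the
    // compiler can drop the test entirely.
    if (CHECK_BOUNDS && i >= storage_.size())
      throw std::out_of_range("ToyAccessor: index out of bounds");
    return storage_[i];
  }

private:
  std::vector<T> &storage_;
};
```

Here `ToyAccessor<int, true>` throws on an out-of-range index, while `ToyAccessor<int>` performs no check, so the two can be mixed within one program.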

> If the user creates a partition with the incorrect annotations (disjoint/complete when it is not or vice versa), Legion will issue an error at runtime. This check is behind an opt-in runtime flag because it is expensive.

The flag for this is `-lg:partcheck` if you want to follow the code in the runtime.
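
Conceptually, verifying a disjoint/complete annotation means counting how many subregions cover each point of the parent. The sketch below shows that idea only; it is not the runtime's code, and `PartitionProperties`, `compute_properties` and `matches_declaration` are hypothetical names. It assumes points are numbered 0..parent_size-1 and that no subregion lists the same point twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PartitionProperties {
  bool disjoint;  // no point belongs to more than one subregion
  bool complete;  // every point of the parent belongs to some subregion
};

// Counts the coverage of every parent point. The cost grows with the total
// number of points, which hints at why such a check is not free.
PartitionProperties compute_properties(
    std::size_t parent_size,
    const std::vector<std::vector<std::size_t>> &subregions) {
  std::vector<unsigned> cover(parent_size, 0);
  for (const auto &sub : subregions)
    for (std::size_t p : sub) {
      assert(p < parent_size);  // subregions must lie inside the parent
      ++cover[p];
    }
  PartitionProperties props{true, true};
  for (unsigned c : cover) {
    if (c > 1) props.disjoint = false;
    if (c == 0) props.complete = false;
  }
  return props;
}

// True when the user's annotation agrees with what was actually computed.
bool matches_declaration(const PartitionProperties &declared,
                         const PartitionProperties &actual) {
  return declared.disjoint == actual.disjoint &&
         declared.complete == actual.complete;
}
```

For a parent of four points, the subregions `{0, 1}` and `{2, 3}` are disjoint and complete, whereas `{0, 1, 2}` and `{2, 3}` overlap at point 2 and would contradict a "disjoint" annotation.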

> If the user makes a control replicated task which violates control determinism, Legion will issue an error at runtime. This check is behind an opt-in runtime flag because it is expensive.

The flag for this is `-lg:safe_ctrlrepl`.
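
A toy model of the property being checked, assuming control determinism means that every shard of a replicated task issues the same runtime calls in the same order. The sketch is not how the runtime implements the check: `CallLog` and `first_divergent_shard` are hypothetical names, and each shard's calls are modeled as a list of strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of the runtime calls one shard made, in order.
using CallLog = std::vector<std::string>;

// Returns the index of the first shard whose call sequence differs from
// shard 0, or -1 if every shard made the same calls in the same order.
int first_divergent_shard(const std::vector<CallLog> &shards) {
  assert(!shards.empty());  // need at least one shard to compare against
  for (std::size_t s = 1; s < shards.size(); ++s)
    if (shards[s] != shards[0]) return static_cast<int>(s);
  return -1;
}
```

If one shard launches its subtasks in a different order from the others, the function reports that shard's index. Comparing full logs like this is costly, which fits with the real check being opt-in.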

> Legion does not track the types of futures, fields, etc. That means if you create a field of type int and access it as float, Legion will not detect the error. This is currently just up to the user to catch.

That said, we do some rudimentary checking to ensure that the type you assert for a future has the same size as the future's actual buffer. In general, we don't really have a type system for data in Legion. We just deal with the sizes of fields/futures/etc.
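
To show what a size-only check can and cannot catch, here is a small sketch. It does not use Legion's Future class: `ToyFuture` is a hypothetical type that stores raw bytes and, on read, compares only `sizeof(T)` against the stored size.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Toy future: remembers the bytes of a value, but not its type.
class ToyFuture {
public:
  template <typename T>
  static ToyFuture from(const T &value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "sketch only handles plain data");
    ToyFuture f;
    f.bytes_.resize(sizeof(T));
    std::memcpy(f.bytes_.data(), &value, sizeof(T));
    return f;
  }

  // Size-only check: any type with the same size is accepted.
  template <typename T>
  T get() const {
    static_assert(std::is_trivially_copyable<T>::value,
                  "sketch only handles plain data");
    if (sizeof(T) != bytes_.size())
      throw std::runtime_error("future size mismatch");
    T out;
    std::memcpy(&out, bytes_.data(), sizeof(T));
    return out;
  }

private:
  ToyFuture() = default;
  std::vector<unsigned char> bytes_;
};
```

Storing an `int` and reading it back as a `double` is rejected on platforms where the two sizes differ. Reading it as a `float` passes the check wherever `int` and `float` have the same size, and returns a meaningless value. That is exactly the int-versus-float mistake described above.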

Elliott Slaughter

Nov 11, 2024, 12:51:49 PM
to Михаил Мичуров, Legion Users
In addition to everything that has been said so far, let me also say that we're generally happy to review drafts of manuscripts if you plan to publish something. Sometimes people write things about Legion without consulting us first, which has led to some inaccurate and/or nonsensical claims about Legion (fortunately not in any major papers, but in some published minor papers). So we are generally happy to review things written about Legion to help ensure that they are factual and accurate.