Hi Dmitry, Marco,
We have developed a set of sound static analyses that significantly reduce ThreadSanitizer's overhead by eliminating unnecessary instrumentation ahead-of-time.
Our benchmarks on the Chromium codebase are very promising:
A 30% improvement on the overall Speedometer 3 score (up to 2x speedups on individual sub-tests like TodoMVC-React-Redux).
Significant gains (up to 2.2x) in other Blink performance tests.
We have also seen substantial speedups on other major applications:
SQLite: Up to 4.7x.
FFmpeg: Up to 2.27x.
Redis: Up to 1.72x.
MySQL: Up to 1.11x.
These results are achieved by a framework of five analyses: Escape Analysis (finds thread-local objects), Lock Ownership (finds lock-protected globals), Single-Threaded Context (detects pre-threading code), SWMR Pattern Detection (finds read-mostly globals), and Dominance-Based Elimination (prunes redundant checks).
While the full framework is extensive, we propose two analyses as excellent candidates for initial upstreaming:
Intra-procedural Escape Analysis: We suggest starting with an intra-procedural version. It is simpler to integrate yet highly effective. It is sound by construction, operating on provably thread-local data.
Dominance-Based Elimination: This is extremely powerful in hot loops. We acknowledge this affects report granularity (the race is reported on the dominating access), but it guarantees detection, which we believe is a valuable trade-off for the performance gain.
Our approach preserves TSan's zero-false-positive guarantee and adds no runtime overhead. We have a working implementation integrated with the TSan LLVM pass and are prepared to do the work to adapt it for upstreaming.
Would you be open to discussing this further and guiding us on the contribution process?
Hi Dmitry,
Thank you very much for your prompt and encouraging reply! We're very glad you find our results promising, and we are excited to upstream this work under your guidance.
Let me first answer your questions below:
1. What do you mean by "affects reporting granularity"?
Indeed, the dominance-based analysis is sound and complete in the following sense: TSan without the dominance analysis reports a race if and only if TSan with the dominance analysis reports a race. However, as with some optimizations TSan already implements, the number of race reports after the dominance-based optimization may be lower, because we report only the “dominating” instruction as racing and avoid instrumenting (and therefore reporting) the instructions it dominates.
To illustrate this, let's consider an example:
// BB1: Dominator block
__tsan_write(&x);   // TSan instrumentation call
x = 1;              // Dominating access I₁ (instrumented)
if (condition) {
  // BB3: Dominated block
  x = 2;            // Dominated access I₂ (uninstrumented)
}
In this code, the access x = 1 dominates x = 2. Our analysis removes the instrumentation for x = 2. If another thread writes to x concurrently, creating a race with the access at x = 2, then TSan will still detect this race, but the report will point to the line with x = 1.
What this means for a developer in practice is that a race report on the dominating access `I₁` is a strong signal that any access to the same location in the region it dominates may participate in the same race.
2. How exactly can we help?
Thanks for clarifying the process – a GitHub PR that passes existing tests and adds new ones sounds like a clear plan.
We have a working and debugged implementation as a set of LLVM passes integrated with ThreadSanitizerPass. To simplify the review and integration process, we could start with one of the two analyses proposed in the initial email (Intra-procedural Escape Analysis or Dominance-Based Elimination), as this would allow us to focus on a single set of changes.
Your guidance on how best to scope and sequence these changes would be invaluable. We are ready to get started and look forward to working with you.
Hello Dmitry and all,
I’m implementing getUnderlyingObjectsThroughLoads — a stronger variant of getUnderlyingObjects that augments ValueTracking (VT) with MemorySSA clobber-chasing and, when safe, follows stores to recover the pointer value stored in memory. The function is used by our Escape Analysis (EA) to enumerate all potential underlying objects for a pointer read (so EA can prove objects are thread-local / non-escaping).
The core problem I ran into is practical and important for soundness:
getUnderlyingObjects(Value *V, ..., MaxLookup) accepts a MaxLookup parameter. In our tree, passing MaxLookup = 0 behaves as “unbounded”, so we call VT with 0 to request a full expansion.
But the VT API does not report whether it completed exploration or silently stopped because of an internal cutoff (cycle protection, internal step limits, etc.). That means callers cannot distinguish “VT explored everything” vs “VT stopped early”. If a client (EA) treats a truncated VT result as complete, it may incorrectly assume that no other underlying objects exist — producing unsound decisions.
Because EA must be sound (it should conservatively bail out when the analysis is incomplete), this ambiguity is a real problem.
Options I’m considering (with tradeoffs):

1. Keep calling VT with MaxLookup = 0 (unbounded) and trust VT.
   Pro: simplest, no API changes.
   Con: implicit semantics; possible hangs/explosion in pathological cases; no way to detect truncated results.

2. Clone / reimplement relevant parts of ValueTracking inside my analysis.
   Pro: full control; I can detect internal cutoffs and return an Incomplete flag.
   Con: code duplication; maintenance burden; risk of divergence as VT evolves.

3. Propose an upstream API change to VT, for example an out-parameter:

   void getUnderlyingObjects(const Value *V, SmallVectorImpl<const Value*> &Bases,
                             LoopInfo *LI, unsigned MaxLookup,
                             bool *Incomplete = nullptr);

   or a result struct:

   struct UnderlyingResult {
     SmallVector<const Value*> Bases;
     bool Complete;
   };
   UnderlyingResult getUnderlyingObjectsEx(const Value *V, LoopInfo *LI,
                                           unsigned MaxLookup);

   Pro: clean; callers know whether VT was complete.
   Con: API change; requires a PR, tests, and review.
Which of these solutions would you consider most reasonable or maintainable?
Is the current non-signalling behavior of getUnderlyingObjects intentional? Has this been discussed before, and are there reasons VT deliberately hides completeness (internal complexity, historical reasons, etc.)?
If an API change is acceptable, which form is preferred:
an out-parameter (bool *Incomplete),
a new return struct,
or a new overload (e.g. getUnderlyingObjectsEx)?
Are there codebase precedents for adding “completion” flags to analyses, and is there a preferred pattern to follow? Are there performance or style concerns I should keep in mind when preparing a PR (e.g., prefer a new symbol vs. changing the existing signature)?