Performance regression using multi-threaded compilation

Yanming Wang

Jun 20, 2022, 1:34:11 PM
to XLA development
Hi XLA folks,

I tried to reduce PyTorch/XLA compile time using multi-threading by setting the thread-pool option in XRT or PjRT. Multi-threading did help a lot in reducing the overall compile time. However, I noticed a slight performance regression (~5%) when running the bert-base-uncased model compiled with multi-threading.

It appears that the XLA compiler uses llvm::SplitModule to achieve compilation parallelism: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc#L1263-L1270. After forcing the split modules to be compiled serially, I still saw the performance issue, so the issue may come from llvm::SplitModule itself.
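
As a rough illustration of that flow (a minimal sketch, not XLA's actual code: CompileToPtx is a hypothetical placeholder for the per-part optimization and PTX emission, and the exact SplitModule signature differs slightly between LLVM versions):

#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/SplitModule.h"
#include <memory>

// Hypothetical placeholder for per-part optimization + PTX emission.
void CompileToPtx(llvm::Module &part);

void SplitAndCompile(llvm::Module &module, unsigned num_parts) {
  llvm::SplitModule(module, num_parts,
                    [](std::unique_ptr<llvm::Module> part) {
                      // Each part only keeps its own global definitions;
                      // globals that ended up in another part become external
                      // declarations here. XLA hands the parts to a thread
                      // pool (each part needs its own LLVMContext to be
                      // compiled concurrently).
                      CompileToPtx(*part);
                    });
}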

Is this difference an expected behavior? If not, how can we improve it?

Thanks,
Yanming

Stephan Herhut

Jul 5, 2022, 10:36:57 AM
to Yanming Wang, XLA development
Hi Yanming.

Have you profiled where the extra 5% is spent? If it is a specific kernel, you could try comparing the LLVM modules or the generated PTX from a sequential and a parallel compile. Maybe that can give a hint as to what is going wrong. A performance difference is certainly not intended.

Cheers
  Stephan

Yanming Wang

Jul 12, 2022, 6:53:30 PM
to XLA development
Hi Stephan,

Thanks for the info. I did some debugging as you suggested and found that the performance regression is primarily due to less constant folding when multi-threaded compilation is used. After the module is split, a global constant variable is kept in one submodule while the other submodules only see it as an external declaration. Only the submodule that owns the global constant can constant-fold accesses to it; the other submodules lower them to global load instructions.

For example, I got the following IR and PTX for a global constant after splitting the module.
In module_0000.SyncTensorsGraph.760.ir-with-opt.0.ll:
@buffer_for_constant_121 = local_unnamed_addr addrspace(1) constant [8 x i8] c" \00\00\00\00\00\00\00", align 128

In module_0000.SyncTensorsGraph.760.ir-with-opt.1.ll:
@buffer_for_constant_121 = external local_unnamed_addr addrspace(1) constant [8 x i8], align 128

In module_0000.SyncTensorsGraph.760.1.ptx:
ld.global.nc.u64        %rd379, [buffer_for_constant_121]

After optimization, @buffer_for_constant_121 was constant-folded in submodule 0 while it still requires a global read instruction in submodule 1.

I tested a workaround that makes the global variable local to each fused computation in HLO by modifying the GpuSanitizeConstantNames pass, and it seems to regain most of the performance. Maybe the best way to fix the global-constant folding issue is to use LLVM link-time optimization, though I'm not sure how to do that given my limited knowledge of LLVM.

Do you have any suggestions on fixing this issue?

Regards,
Yanming

Stephan Herhut

Aug 29, 2022, 4:22:24 AM
to Yanming Wang, XLA development
Hi Yanming.

Thanks for investigating this further. In the current flow, each submodule is compiled independently to PTX, so the CUDA linker would have to do the optimizations. Under the hood, XLA calls the cuLinkXXX methods from the CUDA driver API to perform linking.
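
For reference, this is roughly what that linking step looks like with the driver API (a sketch only, with error handling omitted; the helper name LinkPtxModules is mine, not XLA's):

#include <cuda.h>
#include <string>
#include <vector>

// Sketch: link independently compiled PTX submodules into one cubin.
std::vector<char> LinkPtxModules(const std::vector<std::string> &ptx_modules) {
  CUlinkState link_state;
  cuLinkCreate(/*numOptions=*/0, /*options=*/nullptr,
               /*optionValues=*/nullptr, &link_state);

  for (const std::string &ptx : ptx_modules) {
    // Cross-module references (e.g. an external constant buffer) are only
    // resolved here, after each part was already optimized in isolation.
    cuLinkAddData(link_state, CU_JIT_INPUT_PTX,
                  const_cast<char *>(ptx.c_str()), ptx.size() + 1,
                  /*name=*/"submodule", /*numOptions=*/0, nullptr, nullptr);
  }

  void *cubin = nullptr;
  size_t cubin_size = 0;
  cuLinkComplete(link_state, &cubin, &cubin_size);
  std::vector<char> result(static_cast<char *>(cubin),
                           static_cast<char *>(cubin) + cubin_size);
  cuLinkDestroy(link_state);  // The cubin buffer is owned by link_state.
  return result;
}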

Looking at the use case here, it seems that duplicating constants is only worthwhile if accesses to them later get constant-folded. I would suspect this only happens for small constants anyway, so maybe it is good enough to copy only small constants. The size overhead in the generated code, if constant folding does not happen, should be marginal. 
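
To make that concrete, a possible shape of the fix at the LLVM level could look like the following. This is only a sketch of the idea, not an existing pass: the function name and the 64-byte threshold are made up, and it assumes the split parts still share the original module's LLVMContext (otherwise the initializer would have to be cloned).

#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

constexpr uint64_t kSmallConstantBytes = 64;  // assumed threshold

// Give `part` its own private copy of small constants that it currently only
// declares, so later passes can fold loads from them. `original` is the
// unsplit module that still holds every initializer.
void DuplicateSmallConstants(llvm::Module &original, llvm::Module &part) {
  for (llvm::GlobalVariable &decl : part.globals()) {
    if (!decl.isDeclaration() || !decl.isConstant()) continue;

    llvm::GlobalVariable *def =
        original.getGlobalVariable(decl.getName(), /*AllowInternal=*/true);
    if (!def || !def->hasInitializer()) continue;

    const llvm::DataLayout &dl = part.getDataLayout();
    if (dl.getTypeAllocSize(def->getValueType()) > kSmallConstantBytes)
      continue;

    // Turn the external declaration into a private definition with the same
    // initializer; private linkage avoids duplicate symbols at link time.
    decl.setInitializer(def->getInitializer());
    decl.setLinkage(llvm::GlobalValue::PrivateLinkage);
  }
}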

What is interesting is that this constant access is not folded earlier. Can you share the HLO fragment that uses constant_121?

Cheers
  Stephan


George Karpenkov

Aug 31, 2022, 4:21:35 AM
to Stephan Herhut, Yanming Wang, XLA development

Stephan Herhut

Sep 5, 2022, 8:00:16 AM
to George Karpenkov, Yanming Wang, XLA development
Looks like it. Thanks Yanming for fixing the issue!

Cheers
  Stephan