Efficient parallel loops in XLA

José Valim

Dec 31, 2020, 4:25:51 AM
to XLA development
Hi everyone,

I am working on a higher-level language that compiles down to XLA, and I am wondering what the most efficient way (in both computation and memory) is to compile a loop. For example, take this Python code:

    import numpy as np

    def _smooth(x):
        # 3x3 box filter: each interior element becomes the integer mean of
        # its 3x3 neighbourhood; the border is left uninitialized.
        out = np.empty_like(x)
        for i in range(1, x.shape[0] - 1):
            for j in range(1, x.shape[1] - 1):
                out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i,     j - 1] + x[i,     j] + x[i,     j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
        return out

Let's assume I don't want to unroll this loop for now.

One option is to compile this code down to two XlaBuilder::While ops (one per loop). The issue, though, is that while-loop execution is sequential, and the loop above is parallelizable.
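
For concreteness, here is a minimal sketch of that sequential lowering, written in JAX rather than against the XlaBuilder C++ API (my assumption being that jax.lax.fori_loop lowers to the same While HLO, so it stands in for the builder calls):

    import jax.numpy as jnp
    from jax import lax

    def smooth_while(x):
        # One fori_loop (i.e. one While HLO) per Python-level loop. The
        # original code leaves the border of `out` uninitialized; zeros are
        # used here instead.
        h, w = x.shape
        out = jnp.zeros_like(x)

        def inner(j, carry):
            i, out = carry
            window = lax.dynamic_slice(x, (i - 1, j - 1), (3, 3))
            return (i, out.at[i, j].set(jnp.sum(window) // 9))

        def outer(i, out):
            _, out = lax.fori_loop(1, w - 1, inner, (i, out))
            return out

        return lax.fori_loop(1, h - 1, outer, out)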

Another option I considered was to implement it using XlaBuilder::Reduce or XlaBuilder::Map. In this scenario, I would call XlaBuilder::Map with three tensors: x, an iota along axis 0, and an iota along axis 1, to give me the indices. Would XlaBuilder be able to parallelize this setup? Would the iotas be preallocated in memory as large tensors, or is XLA smart enough to treat the iotas as increasing counters?
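
Sketched again in JAX, and only as a rough assumption of how the wiring would look (jax.vmap plays the role of the elementwise map here, and lax.broadcasted_iota supplies the index tensors):

    import jax
    import jax.numpy as jnp
    from jax import lax

    def smooth_via_map(x):
        # Two iotas give the (i, j) coordinate of every output element.
        rows = lax.broadcasted_iota(jnp.int32, x.shape, 0)
        cols = lax.broadcasted_iota(jnp.int32, x.shape, 1)

        def at(i, j):
            # The original loop skips the border; clamp instead so every
            # element gets a defined (if slightly different) value.
            i = jnp.clip(i, 1, x.shape[0] - 2)
            j = jnp.clip(j, 1, x.shape[1] - 2)
            window = lax.dynamic_slice(x, (i - 1, j - 1), (3, 3))
            return jnp.sum(window) // 9

        return jax.vmap(jax.vmap(at))(rows, cols)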

Finally, is there any other approach I could use here that I have missed? My goal is to understand what XLA can do and match that with the design of the high-level language, so we can compile down to XLA efficiently.

Thank you,

Adrian Kuegel

Jan 5, 2021, 4:47:55 AM
to José Valim, XLA development
I can answer at least part of this question. XLA is smart enough to treat iotas as increasing counters: they are not preallocated in memory as large tensors unless they cannot be fused with their users.
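
If you want to check this for a concrete program, you can look at the post-optimization HLO. A small sketch using JAX as the frontend, purely as an illustration:

    import jax
    import jax.numpy as jnp
    from jax import lax

    def f(x):
        return x + lax.iota(jnp.float32, x.shape[0])

    # In the optimized HLO the iota should appear inside a fusion rather
    # than as a separately materialized buffer.
    print(jax.jit(f).lower(jnp.ones(1024)).compile().as_text())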

--
You received this message because you are subscribed to the Google Groups "XLA development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xla-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xla-dev/a6e847d1-7890-4647-bbcc-9d207d3490c8n%40googlegroups.com.


--


Google Germany GmbH

Erika-Mann-Straße 33

80636 München


Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg


Diese E-Mail ist vertraulich. Wenn Sie nicht der richtige Adressat sind, leiten Sie diese bitte nicht weiter, informieren Sie den Absender und löschen Sie die E-Mail und alle Anhänge. Vielen Dank.

      

This e-mail is confidential. If you are not the right addressee please do not forward it, please inform the sender, and please erase this e-mail including any attachments. Thanks.

Peter Hawkins

Jan 5, 2021, 9:58:50 AM
to José Valim, XLA development
One thing I will note is that this particular code is pretty much XLA's ReduceWindow operator. So you should write this using ReduceWindow.

Unlike something like, say, Halide, XLA isn't really good at compiling scalar loop-nest code as you've shown. It's just not intended for that purpose: it tends to do best when you express your computation using higher-level patterns like ReduceWindow.
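
For reference, here is roughly what that formulation looks like via JAX's lax.reduce_window wrapper (the underlying HLO op is the same ReduceWindow; treat this as a sketch rather than the exact XlaBuilder incantation):

    import jax.numpy as jnp
    from jax import lax

    def smooth_reduce_window(x):
        # Sum every 3x3 window; VALID padding gives the (H-2, W-2) interior,
        # which is exactly the region the original loop writes.
        window_sums = lax.reduce_window(
            x,
            jnp.array(0, x.dtype),   # identity element for the add reduction
            lax.add,
            window_dimensions=(3, 3),
            window_strides=(1, 1),
            padding='VALID')
        return window_sums // 9

Note that the VALID-padded result is only the interior; if you need the original shape back, pad the border on afterwards.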

Peter

Sanjoy Das

Jan 5, 2021, 11:07:53 AM
to Peter Hawkins, José Valim, XLA development
Also, while conceptually the loop could be a map, XLA's map will not let you pass in `x` as an additional argument. The arguments to map all need to be of the same shape, which is also the iteration space of the map operation.

This could be fixed, but as Peter says, it is better to use reduce-window or something similar that already has the semantics you're looking for.

-- Sanjoy

George Karpenkov

Jan 5, 2021, 7:15:18 PM
to Sanjoy Das, Peter Hawkins, José Valim, XLA development
+1 on using reduce-window; sadly, however, there is no fast emitter for reduce-window on GPU, so the generated code could be quite slow on GPUs for this particular example.

Jacques Pienaar

Jan 5, 2021, 9:12:29 PM
to George Karpenkov, Sanjoy Das, Peter Hawkins, José Valim, XLA development
Hey,

Probably also good to mention that there are multiple reduce-window lowerings/codegen paths being worked on. Of particular interest may be https://github.com/google/iree/blob/73993da71576fce2ae4b8a480059aa5737f6022d/iree/test/e2e/xla_ops/reduce_window.mlir (just an example; you'd probably want to look at other examples, or use the equivalent opt tool with dumping enabled, to see how these are finally lowered during codegen), where the HLO reduce-window is lowered to LinAlg (see https://mlir.llvm.org/docs/Dialects/Linalg/) and Affine (see https://mlir.llvm.org/docs/Dialects/Affine/). There you might see some familiar elements (and parallel loop constructs ;-)). Generating the LinAlg/Affine/SCF form directly may be easier, and of course you can combine HLO with those too.

These are lower-level (well ... mostly) components and abstractions used during codegen, rather than the higher-level ops in XLA. With XLA it is almost like Matlab: if you are writing the loop iterations yourself, it will probably be slow. These lower-level paths, on the other hand, are not yet tuned by any means, performance burn-down work only recently started (of course, contributions welcome!), and we don't yet allow intermixing them at the XLA interface boundary. So today, mapping to the most semantically appropriate, highest-level HLO op is your best bet for performance.

Best,

Jacques


José Valim

Jan 6, 2021, 5:21:56 AM
to XLA development
The language I am working on prefers functional constructs (map, reduce, etc.), so I appreciate the suggestion to point people towards reduce_window. We will probably support loops, but I will mention that the higher-level constructs are generally easier to optimize.

Thanks everyone for the insights!