HLO Code Generation


amacd...@knuedge.com

Apr 20, 2017, 2:53:26 PM
to XLA development, Linda Wogulis
I am working with lwogulis. We have been trying to understand how to access the appropriate data from HLO in order to generate ASM code for our hardware.

My, albeit naive, understanding of XLA is that I create a compute graph like

y[k, i, j] = Sum( b[i, m] * c[m, k, j] for m in [0, M] )
x[i, j]   = Sum( f(y[k, i, j] * a[l, i]) for k, l in [0, K] x [0, L] )

Then XLA would process the graph and smash these operations together into (assume, for the sake of argument, that this is a good idea/optimal)

x[i, j]   = Sum( f(Sum( a[l, i] * b[i, m] * c[m, k, j] for m in [0, M] )) for k, l in [0, K] x [0, L] )

AND, on top of that, give hints about which loops I should run through first, how to access memory, whether I should cache data in registers or push it out to memory, etc. So far, we have been unable to see any of this. We found a graph dump and can get HLO strings from various calls, but they are no more informative than the TF protobuf graph: they only say in what order to do the operations, not how to access memory and such.
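For concreteness, the staged and fused formulations above can be checked against each other numerically. This is a toy sketch with made-up dimensions (the sizes and the choice of f are illustrative, not anything XLA produces):

```python
import numpy as np

# Toy sizes; index names match the post (i, j, k, l, m).
I, J, K, L, M = 2, 3, 4, 5, 6
rng = np.random.default_rng(0)
a = rng.standard_normal((L, I))
b = rng.standard_normal((I, M))
c = rng.standard_normal((M, K, J))
f = np.tanh  # stand-in for the elementwise f

# Staged form: materialize y first, then compute x from it.
y = np.einsum('im,mkj->kij', b, c)        # y[k,i,j] = Sum_m b[i,m]*c[m,k,j]
x_staged = sum(f(y[k] * a[l][:, None])    # f(y[k,i,j] * a[l,i])
               for k in range(K) for l in range(L))

# Fused form: the factor a[l,i] pulled inside the sum over m.
x_fused = sum(f(np.einsum('i,im,mj->ij', a[l], b, c[:, k, :]))
              for k in range(K) for l in range(L))

assert np.allclose(x_staged, x_fused)
```

The two agree because a[l, i] is constant in m, so it can be pulled into (or out of) the inner sum before f is applied.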

Can someone help me understand how to get the information I want? Or I might be completely wrong, in which case, can you clarify what XLA is a bit for me?

Thanks

ja...@nervanasys.com

Apr 21, 2017, 2:05:14 PM
to XLA development
You can use the hardware-independent passes (in xla/service/*) for things like common subexpression elimination, and depending on what class of hardware you're developing for (as described here), you may be able to reuse some of the existing passes from the CPU or GPU backend. For a really novel device, you are on the hook for the hardware-specific passes: memory allocation strategies, lowering to asm, etc.
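As a toy illustration of what a hardware-independent pass like common subexpression elimination does (a Python sketch over tuple-encoded expressions; this is not XLA's actual pass API):

```python
def cse(exprs):
    """Deduplicate structurally identical sub-expressions.

    Each expression is a nested tuple like ('add', ('mul', 'a', 'b'), 'c').
    Returns (bindings, roots): each shared sub-expression is bound to one
    temporary, and later occurrences reuse that temporary's name.
    """
    seen = {}      # canonicalized expression -> temp name
    bindings = []  # (temp name, op, operand names...)

    def visit(e):
        if not isinstance(e, tuple):   # leaf: a parameter name
            return e
        key = (e[0],) + tuple(visit(op) for op in e[1:])
        if key not in seen:
            seen[key] = f"t{len(seen)}"
            bindings.append((seen[key],) + key)
        return seen[key]

    return bindings, [visit(e) for e in exprs]

# 'a*b' appears in both roots but is emitted only once, as t0.
binds, roots = cse([('add', ('mul', 'a', 'b'), 'c'),
                    ('sub', ('mul', 'a', 'b'), 'd')])
```

XLA's real passes work over HloInstruction graphs rather than tuples, but the idea (hash each node's opcode plus canonicalized operands, reuse on a hit) is the same.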

amacd...@knuedge.com

Apr 26, 2017, 2:31:09 PM
to XLA development
We've been trying that. If you look at the other posts we (lwogulis and myself) have made, you can see the problems we've run into. Mostly, we can't get a new device registered.

amacd...@knuedge.com

Jun 6, 2017, 5:15:05 PM
to XLA development
I would just like to bump this post. I do not feel like this has been answered adequately. 


Eli Bendersky

Jun 6, 2017, 5:28:25 PM
to amacd...@knuedge.com, XLA development
Can you please try to ask more concrete, smaller-scope questions? These may be easier for folks to answer.

Eli
 

The information contained in this message is intended only for the recipient, and may be a confidential attorney-client communication or may otherwise be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, any dissemination or copying of this communication is strictly prohibited. If you have received this communication in error, please immediately notify us by replying to the message and then deleting it from your computer. Please note that KnuEdge, Inc. reserves the right to monitor and review the content of any electronic message or information sent to or from KnuEdge employee e-mail addresses without informing the sender or recipient of the message.

--
You received this message because you are subscribed to the Google Groups "XLA development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xla-dev+unsubscribe@googlegroups.com.
To post to this group, send email to xla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xla-dev/45abcf8b-dc7b-4ebe-abf2-d579a64f4034%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Aidan MacDonald

Jun 6, 2017, 5:40:49 PM
to Eli Bendersky, XLA development
From what I have read, I believe XLA is a means of compacting linear algebra and compiling it into a binary for the hardware to handle. I have not found a way to get at the compacted linear algebra in a useful form so that we can process it and compile it ourselves. I don't care at this point about compiling the linear algebra to binary, just viewing the compacted linear algebra; hence the LA written in my first post.



Chris Leary

Jun 7, 2017, 2:46:12 PM
to Aidan MacDonald, Eli Bendersky, XLA development
Hi Aidan,

If I'm understanding the original question properly: XLA does not suggest a loop interchange order for arbitrary new hardware, nor are the fusion optimizations we have in the code base going to work well out of the box on arbitrary hardware.

The last post makes it sound like you just want to get at the computation graph. You can look at how CpuCompiler and GpuCompiler are implemented in the code base to see how the HloModule is passed to a backend. HLOs are high-level operations; they are documented in the operation semantics doc you referenced.

What have you tried so far to get at the graph?

Please do note: XLA, certainly today, is not "plug and play" for arbitrary new exotic hardware, but we consider it the best available starting point since a) all of the operations have defined semantics, b) if your target supports LLVM, you can likely reuse big chunks of the CPU backend to get started, and c) the fact that we decompose the TF ops into a smaller set makes it easier to get coverage than implementing all the TF kernels directly. It's still definitely work, potentially a lot of it! Hopefully we didn't give the opposite impression somehow; if we did, that would be interesting to know so we can update the documentation accordingly.

The notion is that TF operations decompose into a small set of higher level array operators with semantics, and new platforms can focus on slotting into those to compile/run TF-emitted code sequences with the highest possible performance.

HTH!

- Leary


lwog...@knuedge.com

Jun 7, 2017, 5:50:21 PM
to XLA development
Hi Leary,

I work with Aidan and made most of the attempts at solving this. We aren't exactly looking for the graph - we are able to use the .dot-formatted output from the recommended XLA flags to see the structure. What we really want is the operation descriptions in HLO, which we imagine would look something like what's in Aidan's original post. We've tried to turn on and manipulate maximum debugging/printing output through VLOG specifications. We've looked through all of the hlo_graph_dumper, hlo_instruction, etc. files. We've basically tried looking through every file we thought might give us useful output, and either attempted to directly print out information or used the VLOG calls. But we've never managed to find what the operations actually look like in HLO.

We would expect something that would describe how the operation is constructed and called by the machine. What we can access currently is the structure with labels like 'multiplication' for each operation, but this does not tell us how multiplication happens.

Thank you,
Linda

Chris Leary

Jun 7, 2017, 5:55:20 PM
to lwog...@knuedge.com, XLA development
Hi Linda,

Just to understand the context better and make sure we answer the right question: is "the machine" in your case a standard CPU? Would looking at the LLVM IR that targets a standard CPU be what you're looking for, or is it something else, like the assembly itself? If you're looking for loop-level constructs that run on a standard CPU, dumping the LLVM IR would probably suffice.

The HLO is exactly what you say -- just a "multiply these things" node in a graph, that has an associated type for its operands and output, like `f32[8, 8]`. How that gets lowered down to the next level (which we call LLO, for low-level optimizer / low-level operations) is backend-specific, though we have examples in the tree for CPU and GPU using LLVM.
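To make that concrete, a "multiply these things" node with typed operands could be sketched like this (illustrative Python only; XLA's real HloInstruction is a C++ class with much more state):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Shape:
    """A result type: element dtype plus dimensions, e.g. f32[8, 8]."""
    dtype: str
    dims: tuple

    def __str__(self):
        return f"{self.dtype}[{', '.join(map(str, self.dims))}]"

@dataclass
class HloInstruction:
    """One node in the graph: an opcode, a result shape, and operands."""
    opcode: str          # e.g. 'parameter', 'multiply', 'add'
    shape: Shape
    operands: list = field(default_factory=list)

    def __str__(self):
        ops = ", ".join(str(o.shape) for o in self.operands)
        return f"{self.shape} {self.opcode}({ops})"

f32_8x8 = Shape("f32", (8, 8))
p0 = HloInstruction("parameter", f32_8x8)
p1 = HloInstruction("parameter", f32_8x8)
mul = HloInstruction("multiply", f32_8x8, [p0, p1])
# str(mul) → "f32[8, 8] multiply(f32[8, 8], f32[8, 8])"
```

That really is all an HLO says about a multiply: the op, the operand types, and the result type. Everything about loops, registers, and memory is decided later, in the backend's lowering.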

Thanks!

- Leary


amacd...@knuedge.com

Jun 7, 2017, 6:13:53 PM
to XLA development
She tagged me in. Thanks for the clarification! That was really helpful; I was very confused.

Our processor is not a standard CPU. Technically, it's a bunch of custom DSPs linked together with hierarchical communication between DSPs. There is shared memory between the DSPs and register-to-register communication for each DSP. You can think of it as a large cluster computer shrunk down to a single chip.

I am not an expert on how our compiler works, but I don't think CPU IR is sufficient for us because our LLVM compiler is heavily modified. Generating a macro assembly would be most useful, because we can write good macros that do small tasks for us. That said, our ASM is custom and has a lot of different instructions.

Thanks again.


Chris Leary

Jun 7, 2017, 6:22:45 PM
to Aidan MacDonald, XLA development
No problem. For very custom things you may want to consider doing your own customized HLO conversion to your compiler IR, to achieve good performance and fit inside shared memories and such, since the CPU backend hopes that emitting fairly vanilla LLVM IR will do decently good things.

One other thing to note: if the LLVM is forked and heavily modified, the CPU backend *could* be a non-trivial starting point; we consistently track LLVM at a tag very close to HEAD. It'll be best if the forked LLVM merges from upstream consistently as well.

- Leary


amacd...@knuedge.com

Jun 7, 2017, 9:24:43 PM
to XLA development
I don't think there's much hope of merging our LLVM. We have it emitting custom IR for our C++ compiler. On top of that, our compiler team was recently laid off, so few people here have the skills to develop the LLVM compiler. How would I best approach developing a direct-to-assembly LLO? Generating our native assembly is probably the easiest route.

Chris Leary

Jun 7, 2017, 9:37:38 PM
to Aidan MacDonald, XLA development
For each HLO you'd want to emit the corresponding assembly for that operation. If you have a high-level assembler (e.g. one that doesn't make you allocate registers by hand), it may be a lot easier to emit into a string buffer and then feed it to the assembler. An example subclass of IrEmitter is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/ir_emitter.h#L70

Probably the best way to get started is to take a neural net you're interested in running on your hardware and implement each of the Handle* methods it takes to get it to execute correctly and produce similar-valued results to the CPU backend.
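As a sketch of that Handle*-style visitor pattern (toy Python; the op tuples, handler names, and the 'vmul'/'vadd'/'load' instruction set are all made up for illustration — XLA's real IrEmitter is a C++ visitor over HloInstructions):

```python
class AsmEmitter:
    """Walks a toy op list in order and emits assembly text into a buffer."""

    def __init__(self):
        self.buffer = []

    def emit(self, ops):
        # Each op is (result name, opcode, operand names).
        for name, opcode, operands in ops:
            # Dispatch to handle_<opcode>, mirroring XLA's Handle* methods.
            handler = getattr(self, f"handle_{opcode}", self.handle_default)
            handler(name, operands)
        return "\n".join(self.buffer)

    def handle_parameter(self, name, operands):
        self.buffer.append(f"load {name}")

    def handle_multiply(self, name, operands):
        self.buffer.append(f"vmul {name}, {', '.join(operands)}")

    def handle_add(self, name, operands):
        self.buffer.append(f"vadd {name}, {', '.join(operands)}")

    def handle_default(self, name, operands):
        # Unimplemented ops fail loudly, which is useful while porting:
        # each failure tells you the next Handle* method to write.
        raise NotImplementedError(f"no handler for {name}")

asm = AsmEmitter().emit([
    ("p0", "parameter", []),
    ("p1", "parameter", []),
    ("m0", "multiply", ["p0", "p1"]),
    ("a0", "add", ["m0", "p0"]),
])
# asm == "load p0\nload p1\nvmul m0, p0, p1\nvadd a0, m0, p0"
```

Emitting into a string buffer like this and handing the result to a high-level assembler is the approach suggested above; a real backend would also track operand shapes and buffer assignments per instruction.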
