"multi-device function on TPUs"


Moshe Maor

Jan 20, 2022, 3:29:10 PM
to XLA development
Hi folks,
Can you please shed some light on the comment in the jit_compile documentation that says: "Set this value to False when directly running a multi-device function on TPUs (e.g. two TPU cores, one TPU core and its host CPU)"? Is that to say that multi-device functions must be executed only via auto-clustering on TPUs and other XLA-only devices?

Thanks, 
Moshe 
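
[Editorial aside: a toy Python model of the tri-state jit_compile semantics may help frame the question. It is purely illustrative; the helper name and return strings below are invented, and this is not TensorFlow's actual dispatch logic.]

```python
# Toy model of tf.function's jit_compile argument (illustration only):
#   True  -> must-compile: error out if any op cannot be lowered to XLA
#   None  -> let the runtime decide (auto-clustering may still apply)
#   False -> never XLA-compile the function body itself
def resolve_compilation(jit_compile, all_ops_compilable):
    """Return 'xla', 'auto', or 'eager' for a traced function call."""
    if jit_compile is True:
        if not all_ops_compilable:
            raise ValueError("must-compile function contains uncompilable ops")
        return "xla"
    if jit_compile is False:
        return "eager"  # the function body runs op-by-op
    return "auto"       # jit_compile=None: the runtime's choice
```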

George Karpenkov

Jan 23, 2022, 11:30:52 PM
to Moshe Maor, XLA development
Hi Moshe,

I would assume the documentation mentions it because TPU has its own entry points (tpu.replicate and friends).

To avoid an XY-problem discussion: what would you like to achieve?

George


Moshe Maor

Jan 25, 2022, 1:10:05 AM
to XLA development
Hi George, 

We plan to use an XLA device with must-compile, per our previous discussions. I want to make sure there are no pending issues with that direction.
That is why I asked about the feature that allows must-compile functions with uncompilable ops to be executed.
Another feature we will need is a change in the eager executor to treat all functions as must-compile, as is done today for TPU (and XLA_GPU/CPU).
I would guess that any XLA device would be interested in such a feature, either automatically or even as a device attribute, e.g. "treat all functions as must-compile" - what do you think?

Thanks, 
Moshe 

George Karpenkov

Jan 25, 2022, 12:07:54 PM
to Moshe Maor, XLA development
> I would guess that any XLA device would be interested in such a feature, either automatically or even as an attribute to the device e.g. "treat all functions as must-compile" - what do you think? 

Yes, it totally makes sense to generalize the branch in execute.cc so that it is not hardcoded to TPUs but applies to all "compilation" devices.

> I want to make sure there are no pending issues with that direction. 

Since TPUs do that, I would assume you are fine.
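
[Editorial aside: the generalization discussed here could look roughly like the sketch below. The real branch lives in execute.cc and is C++; this Python model, the registry set, and the treat_as_must_compile name are all assumptions for illustration, not actual TensorFlow code.]

```python
# Sketch of replacing a hardcoded `device_type == "TPU"` branch with a
# registry of "compilation" device types (illustration, not real TF code).
COMPILATION_DEVICE_TYPES = {"TPU", "XLA_CPU", "XLA_GPU"}

def treat_as_must_compile(device_type):
    """True if every function placed on this device must be XLA-compiled."""
    return device_type in COMPILATION_DEVICE_TYPES

# A third-party backend (e.g. the XLA_APU device mentioned later in this
# thread) would register itself once instead of patching the branch:
COMPILATION_DEVICE_TYPES.add("XLA_APU")
```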

Moshe Maor

Feb 2, 2022, 11:40:05 AM
to XLA development
hi George, 
When peeking into the code that should automatically treat functions as must-compile for XLA devices (see here), it seems the check on the device name occurs before the placer has a chance to place the op on the XLA device. So, at least in the following example, this code fails to treat the function as must-compile:
[assume that I am forcing TF to register the XLA_CPU device with high priority]

import os
os.environ['TF_XLA_FLAGS'] = '--tf_xla_cpu_global_jit --tf_xla_enable_xla_devices'

import tensorflow as tf

a = tf.Variable([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)
b = tf.Variable([[5, 6, 7, 8], [1, 2, 3, 4]], dtype=tf.float32)

@tf.function
def myfunc(a, b):
    c = a + b
    d = c * b
    e = d - c
    f = e / a
    return f

print(myfunc(a, b))

Moshe Maor

Feb 2, 2022, 12:02:45 PM
to XLA development
George, 
Before you tell me to avoid an XY-problem discussion :)
To be clear: I want every function to be treated as must-compile when it is destined to execute on an XLA device.

Thanks, 
Moshe 

George Karpenkov

Feb 2, 2022, 3:12:32 PM
to Moshe Maor, XLA development
One bit I don't follow is why you are using XLA_CPU specifically.

Why not register your own device and have it behave effectively like a TPU device? 

Moshe Maor

Feb 3, 2022, 1:49:13 AM
to XLA development
XLA_CPU is just an experiment to demonstrate the issue.
I bumped into this behavior when trying to do the same for our device, which is an XLA device.

E.g. let's call it XLA_APU. I was trying to add the change to the code under MustCompileWithXLA():
...
if (op->GetDeviceParsedName().type == "XLA_APU" ||
...
and I realized that when this function is called from GetOrCreateKernelAndDevice(), the op is not yet placed, so this code at the end of MustCompileWithXLA() has no effect.

Unfortunately, it seems I cannot let the placer run before this call, since the placer (via SelectDevice()) requires the locked NodeDef, which captures the value of the _XlaMustCompile attribute. So we have a chicken-and-egg problem: I would like to know the function's placement to decide on must-compile, but the placer requires the must-compile attribute to already be locked...

Thanks,
Moshe
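
[Editorial aside: the chicken-and-egg ordering described above can be reduced to a small Python model. The class and function names are invented stand-ins for the real C++ code paths, not actual TensorFlow code.]

```python
# Toy model of the ordering problem: the must-compile check runs before
# the placer, so it only sees the user-requested (possibly empty) device.
class Op:
    def __init__(self, requested_device=""):
        self.device = requested_device  # empty until the placer runs

def must_compile_with_xla(op):
    # Stand-in for MustCompileWithXLA(), called from
    # GetOrCreateKernelAndDevice(), i.e. BEFORE placement:
    return "XLA" in op.device

def run_placer(op, chosen_device="XLA_CPU:0"):
    # Stand-in for the placer, which runs later and itself needs the
    # _XlaMustCompile attribute already locked into the NodeDef.
    if not op.device:
        op.device = chosen_device

op = Op()                                   # no device requested by the user
assert must_compile_with_xla(op) is False   # check fires too early: no XLA yet
run_placer(op)                              # only now does the op land on XLA
assert must_compile_with_xla(op) is True    # too late for the kernel decision
```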