It does seem to be an illegal memory access, like you said, but I don't think it is inside my GPU kernel.
For context, inside my op kernel I have something like this (with a lot of unnecessary stuff removed):
void Compute(OpKernelContext* context) override {
  // ...
  Tensor* output = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, TensorShape({b, n, h, w}), &output));
  OP_REQUIRES(context, output->NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in output"));
  printf("Allocating[y]: %i \n", b * n * h * w);

  functor::MyFunctor<Device, T>()(
      // ...
      output->flat<T>().data());
}
Everything works fine past this point (the GPU kernel itself runs with no errors reported), but I get the error once the output tensor is actually used. Specifically, if I call my custom op and just try to print the output tensor, I get the illegal memory access (due to the copy from GPU to CPU):
y = my_module.my_op(
    x=x,
    w=w,
    # ...
)
print(y)  # Error is here
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed.
I am really not sure how to progress with this, as I'm pretty sure there is no mistake in my GPU kernel. Do I need to do something with output inside the Compute function? (The TensorFlow custom op guide doesn't do anything further with it.)
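For what it's worth, one thing I'm considering to narrow it down: adding an explicit stream sync right after the kernel launch inside the GPU functor, since launches are asynchronous and an out-of-bounds write would otherwise only be reported at the next sync point (here, the GPU-to-CPU copy triggered by print(y)). This is just a sketch of what I mean; MyCudaKernel, the launch configuration, and the argument list are placeholders, not my real code:

// In the .cu.cc file, where GPUDevice is Eigen::GpuDevice.
// MyCudaKernel and the launch config below are placeholders.
template <typename T>
struct MyFunctor<GPUDevice, T> {
  void operator()(const GPUDevice& d, int size, const T* in, T* out) {
    int block_count = (size + 255) / 256;
    MyCudaKernel<T><<<block_count, 256, 0, d.stream()>>>(size, in, out);

    // Debug only: force any pending kernel error to surface here instead of
    // at the later device-to-host copy.
    cudaError_t err = cudaStreamSynchronize(d.stream());
    if (err != cudaSuccess) {
      printf("MyFunctor kernel failed: %s\n", cudaGetErrorString(err));
    }
  }
};

I could also run the whole script under cuda-memcheck (or compute-sanitizer on newer CUDA toolkits), which should tell me whether the out-of-bounds access really happens inside the kernel or somewhere else.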