keras model with TF 2.2.0 crashing during training with TPU and tf.data.Dataset: "The Encode() method is not implemented for DatasetVariantWrapper objects"


Fabien Tarrade

May 24, 2020, 3:18:14 PM
to Discuss

Hi there,

I am fine-tuning BERT for a classification task using the transformers library and Keras (TensorFlow 2.2.0). I am using SST-2 TFRecord files for training and evaluation. While there are no issues on CPU or GPU (K80/V100), the same code and data crash during training (model.fit) when using a TPU (same issue on GCP and Colab):

[INFO] training the model ...
2020-05-23 16:02:07.319339: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-05-23 16:02:07.326142: E tensorflow/core/framework/dataset.cc:88] The Encode() method is not implemented for DatasetVariantWrapper objects.
Fatal Python error: Segmentation fault

Current thread 0x00007fae55e21780 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py", line 741 in dataset_cardinality
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/cardinality.py", line 66 in cardinality
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 733 in _validate_args
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 699 in __init__
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 1112 in __init__
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 815 in fit
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 66 in _method_wrapper
  File "/content/proj_multilingual_text_classification/src/model/tf_bert_classification/model.py", line 252 in train_and_evaluate

I followed some of the tutorials that exist for TensorFlow 2.1.0 and TPUs:
https://colab.research.google.com/notebooks/tpu.ipynb

I am creating the model in the TPUStrategy scope and it seems to be able to find the TPU:

INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8

I guess the issue is coming from my pipeline, which reads the data from my TFRecord files and creates the tf.data.Dataset:
- I have one TFRecord file for training and one for serving
- I am doing very standard things with tf.data.TFRecordDataset: shuffle, cache, map and prefetch
- I guess something I am doing is not correct when using a TPU:

features_spec = {
    'input_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'attention_mask': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'token_type_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=0)
}

example = tf.io.parse_single_example(record, features_spec)

f0 = tf.ensure_shape(tf.io.parse_tensor(example['input_ids'], out_type=tf.int32), (None,))
f1 = tf.ensure_shape(tf.io.parse_tensor(example['attention_mask'], out_type=tf.int32), (None,))
f2 = tf.ensure_shape(tf.io.parse_tensor(example['token_type_ids'], out_type=tf.int32), (None,))
return {'input_ids': f0, 'attention_mask': f1, 'token_type_ids': f2}, example['label']
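For context, here is roughly how the parsing and the rest of the pipeline fit together. This is a simplified, self-contained sketch, not my exact code; names like `parse_fn` and `input_fn` are illustrative:

```python
import tensorflow as tf

features_spec = {
    'input_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'attention_mask': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'token_type_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=0),
}

def parse_fn(record):
    # Each feature is a serialized int32 tensor stored as a bytes feature.
    example = tf.io.parse_single_example(record, features_spec)
    features = {
        name: tf.ensure_shape(
            tf.io.parse_tensor(example[name], out_type=tf.int32), (None,))
        for name in ('input_ids', 'attention_mask', 'token_type_ids')
    }
    return features, example['label']

def input_fn(tfrecord_path, batch_size=32):
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(parse_fn,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.cache()
    dataset = dataset.shuffle(buffer_size=1000)
    # TPUs require static shapes, hence drop_remainder=True.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
```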


I am debugging my code but I haven't managed to understand what the error message means:
"The Encode() method is not implemented for DatasetVariantWrapper objects"
I looked at the code but it didn't help: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/dataset.cc
It seems that tf.data.Dataset takes as an argument a DT_VARIANT tensor that represents the dataset.

Why does the issue only appear in Keras model.fit and only with TPUs? I can loop over the dataset and check that the shapes and the data are correct. Where in my code should I look? What is specific to TPUs in this case?

Thanks for any suggestion and guidance.

Thanks
Cheers
Fabien

--
Dr. Fabien Tarrade

Senior Data Scientist at AXA

I am a senior Data Scientist at AXA with the mission of helping AXA become a data-driven organisation by using advanced analytics and Big Data.
I have over 10 years of experience in the management of large projects; the processing, modelling and statistical treatment of large volumes of experimental data
up to 10 petabytes; as well as the development and maintenance of advanced and complex computer programs.

Zurich, Switzerland


Jiri Simsa

May 26, 2020, 3:13:01 PM
to Fabien Tarrade, Ruoxin Sang, Tom O'Malley, Discuss
+Ruoxin Sang +Tom O'Malley 

Hi Fabien, thank you for your inquiry. Could you please create an issue on github.com/tensorflow with instructions on how to reproduce this?

The error hints at a placement issue. In particular, the `Encode` method would be invoked when the dataset variant is accessed on a different device than the one it was created on -- it is unexpected for this to happen when Keras checks the cardinality of the dataset. It would be good to understand which devices the dataset and cardinality ops are placed on. When you create the issue on GitHub, please include the output of your program after setting tf.debugging.set_log_device_placement(True) (see https://www.tensorflow.org/api_docs/python/tf/debugging/set_log_device_placement).
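For illustration, a minimal toy example of how the flag is used (the dataset here is just a stand-in, not your pipeline):

```python
import tensorflow as tf

# Must be set before any op is created so that every placement is logged.
tf.debugging.set_log_device_placement(True)

dataset = tf.data.Dataset.range(4)
# Keras calls cardinality internally before training starts; the device
# placement log shows where the cardinality op (and the dataset) land.
num_elements = tf.data.experimental.cardinality(dataset)
print(num_elements.numpy())
```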

Best,

Jiri


Fabien Tarrade

May 27, 2020, 1:05:28 PM
to Ruoxin Sang, Jiri Simsa, Tom O'Malley, Discuss

Hi Ruoxin and Jiri,

Thanks for the follow-up and the suggestion. It took me a lot of time, but I managed to find a small example that reproduces the issue. Everything is documented in this issue:
https://github.com/tensorflow/tensorflow/issues/39913

I added tf.debugging.set_log_device_placement(True) and part of the log with and without the segfault (tell me if you need more).

My code is based on Colab examples, like the one you mentioned from Martin, and code examples from the GCP repo.

I am really puzzled, and maybe the issue is on my side. I also found many cases where my test in Colab stopped working and then worked again without any new changes!

To reproduce the issue:
- open a Colab and select TPU
- copy the code in a python file
- in a Colab cell execute: !python test.py

def main(argv):
    ...
    print("take(5) ok", valid_dataset.take(5))
    if use_tpu:
        print('setting up TPU: cluster resolver')
        tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        # --> every call on the dataset starting from here results in a segfault
    ...

if __name__ == '__main__':
    app.run(main)

The same code (no main, no app.run) works when copied into a single Colab cell.
I didn't manage to reproduce the issue in Colab cells (without calling the python file). I saw it a few times but didn't manage to reproduce it more than once!
Since I am using GCP AI Platform training, I need to have my code in a python file with a main. With CPU and GPU it always works.

A monolithic python file (no main, no app.run) with the same content fails as well.

Thanks
Cheers
Fabien

Hi Fabien,

Could you share your code in a Colab? I tried the standard Colab here and it still works with TF 2.2. It is possible that some changes triggered a new problem.

Jiri Simsa

May 27, 2020, 6:51:31 PM
to Fabien Tarrade, Ruoxin Sang, Tom O'Malley, Discuss
TPU initialization is required to happen before any TensorFlow ops are created. I commented on the issue you created with a suggestion on how to fix your problem.
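In short: initialize the TPU system before building any dataset or model. A rough, runnable sketch of the ordering (the try/except fallback is only there so the same script also runs on CPU/GPU; the tiny dataset and model are stand-ins):

```python
import tensorflow as tf

# Initialize the TPU system BEFORE creating any other TensorFlow ops,
# datasets included; ops created earlier may otherwise end up placed on
# the wrong device.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
except ValueError:
    # No TPU reachable: fall back to the default (CPU/GPU) strategy.
    strategy = tf.distribute.get_strategy()

# Only now build the dataset and the model.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((8, 4)), tf.zeros((8, 1)))).batch(4, drop_remainder=True)
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')
```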

Best,

Jiri

Fabien Tarrade

May 29, 2020, 10:28:46 AM
to Jiri Simsa, Ruoxin Sang, Tom O'Malley, Discuss

Hi Jiri.

yes, my mistake. Thanks for the hint.

Thanks
Cheers
Fabien

Jiri Simsa

May 29, 2020, 5:40:24 PM
to Fabien Tarrade, Ruoxin Sang, Tom O'Malley, Discuss
No problem. I'm glad you were able to resolve the issue.

Best,

Jiri