Hi there,
I am fine-tuning BERT on a classification task using
transformers and Keras (TensorFlow 2.2.0). I am using SST-2
TFRecord files for training and evaluation. While there are no
issues on CPU or GPU (K80/V100), the same code and data crash
during training (model.fit) when using a TPU (same issue
on GCP or Colab):
[INFO] training the model ...
2020-05-23 16:02:07.319339: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-05-23 16:02:07.326142: E tensorflow/core/framework/dataset.cc:88] The Encode() method is not implemented for DatasetVariantWrapper objects.
Fatal Python error: Segmentation fault

Current thread 0x00007fae55e21780 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py", line 741 in dataset_cardinality
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/cardinality.py", line 66 in cardinality
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 733 in _validate_args
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 699 in __init__
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py", line 1112 in __init__
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 815 in fit
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 66 in _method_wrapper
  File "/content/proj_multilingual_text_classification/src/model/tf_bert_classification/model.py", line 252 in train_and_evaluate
I followed some of the existing tutorials for TensorFlow 2.1.0
and TPU:
https://colab.research.google.com/notebooks/tpu.ipynb
I am creating the model inside the TPUStrategy scope, and it
seems to find the TPU:
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
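For reference, the setup follows the standard TF 2.2 Colab pattern (a sketch; `create_tpu_strategy` and `build_model` are placeholder names, and `tpu=''` is the Colab default address):

```python
import tensorflow as tf

def create_tpu_strategy():
    # Connect to the TPU runtime and build the distribution strategy;
    # this is what prints the "Found TPU system" lines above.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.experimental.TPUStrategy(resolver)

# On a TPU runtime:
# strategy = create_tpu_strategy()
# with strategy.scope():
#     model = build_model()  # hypothetical model builder
```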
I suspect the issue comes from my input pipeline, which reads the
data from my TFRecord file and creates a tf.data.Dataset:
- I have one TFRecord file for training and one for serving
- I am doing very standard things: tf.data.TFRecordDataset, then
shuffle, cache, map, and prefetch
- I guess something I am doing is not correct when using TPU:
features_spec = {
    'input_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'attention_mask': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'token_type_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'label': tf.io.FixedLenFeature([], tf.int64, default_value=0)
}
example = tf.io.parse_single_example(record, features_spec)
f0 = tf.ensure_shape(tf.io.parse_tensor(example['input_ids'], out_type=tf.int32), (None,))
f1 = tf.ensure_shape(tf.io.parse_tensor(example['attention_mask'], out_type=tf.int32), (None,))
f2 = tf.ensure_shape(tf.io.parse_tensor(example['token_type_ids'], out_type=tf.int32), (None,))
return {'input_ids': f0, 'attention_mask': f1, 'token_type_ids': f2}, example['label']
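Roughly, the parse function is wired into the pipeline like this (a sketch; `parse_record`, `input_fn`, and the file path are placeholder names, and the write side is assumed to have used tf.io.serialize_tensor, which is why tf.io.parse_tensor is needed on the read side):

```python
import tensorflow as tf

def parse_record(record):
    # Parse one serialized tf.train.Example into (features, label).
    features_spec = {
        'input_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
        'attention_mask': tf.io.FixedLenFeature([], tf.string, default_value=''),
        'token_type_ids': tf.io.FixedLenFeature([], tf.string, default_value=''),
        'label': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    }
    example = tf.io.parse_single_example(record, features_spec)
    f0 = tf.ensure_shape(tf.io.parse_tensor(example['input_ids'], out_type=tf.int32), (None,))
    f1 = tf.ensure_shape(tf.io.parse_tensor(example['attention_mask'], out_type=tf.int32), (None,))
    f2 = tf.ensure_shape(tf.io.parse_tensor(example['token_type_ids'], out_type=tf.int32), (None,))
    return {'input_ids': f0, 'attention_mask': f1, 'token_type_ids': f2}, example['label']

def input_fn(tfrecord_path, batch_size=32):
    # Standard read pipeline: parse, cache, shuffle, batch, prefetch.
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(parse_record, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.cache()
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(batch_size, drop_remainder=True)  # TPUs want static batch shapes
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
```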
I am debugging my code but I haven't managed to understand what
the error message means:
"The Encode() method is not implemented for DatasetVariantWrapper
objects"
I looked at the source code, but it didn't help:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/dataset.cc
It seems that a tf.data.Dataset is represented internally by a
DT_VARIANT tensor.
Why does the issue only appear in Keras model.fit and only on
TPU? I can loop over the dataset and check that the shapes and
the data are correct. Where in my code should I look? What is
specific to TPU in this case?
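One detail worth noting from the stack trace: the crash happens inside dataset_cardinality, which model.fit calls while validating its input. So the same call can be made on the dataset directly, outside of fit, to see whether it alone triggers the problem (toy dataset here, just to show the call):

```python
import tensorflow as tf

# model.fit validates its input via tf.data.experimental.cardinality,
# which is exactly where the stack trace ends; calling it directly on
# the dataset may isolate the crash from the rest of fit().
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10)).batch(2)
print(tf.data.experimental.cardinality(dataset).numpy())  # 5 batches
```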
Thanks for any suggestion and guidance.
Thanks
Cheers
Fabien
Hi Ruoxin and Jiri,
Thanks for the follow-up and the suggestions. It took me a lot of
time, but I managed to find a small example that reproduces the
issue. Everything is documented in this issue:
https://github.com/tensorflow/tensorflow/issues/39913
I added tf.debugging.set_log_device_placement(True) and attached
part of the log with and without the seg fault (tell me if you
need more).
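For reference, the flag just makes TensorFlow log which device each op runs on (toy op below for illustration):

```python
import tensorflow as tf

# Log the device each op is placed on; the placement lines
# then show up in the job log.
tf.debugging.set_log_device_placement(True)

x = tf.constant([1.0, 2.0])
y = x * 2  # the placement of this multiply is now logged
```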
My code is based on Colab examples, like the one you mentioned
from Martin, and on code examples from the GCP repo.
I am really puzzled, and maybe the issue is on my side. I also
found many cases where my test in Colab stopped working and then
worked again without any new changes!
To reproduce the issue:
- open a Colab and select TPU
- copy the code in a python file
- in a Colab cell execute: !python test.py
The same code (no main, no app.run) works when copied into a
single Colab cell.
I didn't manage to reproduce the issue in Colab cells (without
calling the python file). I saw it a few times but couldn't
reproduce it more than once!
Since I am using GCP AI Platform training, I need to have my code
in a python file with a main. With CPU and GPU it always works.
A monolithic python file (no main, no app.run) with the same
content fails as well.
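To be clear about the structure that fails, the script entry point looks like this (a minimal skeleton; `train_and_evaluate` stands in for my actual training code):

```python
import tensorflow as tf
from absl import app

def train_and_evaluate():
    # TPU resolver/strategy setup, dataset creation and model.fit
    # go here; this placeholder stands in for the real training code.
    pass

def main(argv):
    del argv  # unused
    train_and_evaluate()

if __name__ == '__main__':
    app.run(main)
```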
Thanks
Cheers
Fabien
Hi Fabien,
Could you share your code in a Colab? I tried the standard Colab here and it still works with TF 2.2. It is possible that some changes triggered a new problem.
Hi Jiri,
yes, my mistake. Thanks for the hint.
Thanks
Cheers
Fabien