Huggingface Transformers and TFX?

Joshua Pham

Mar 30, 2020, 5:08:55 PM
to TensorFlow Extended (TFX)
Hello all,

I'm wondering if anyone has considered integrating the Huggingface Transformer library with TFX.

So far I have written a custom component that operates in eager mode in order to make use of the tokenizer (batch_encode_plus()), then generates a DistilBERT embedding.

But to tokenize a phrase, the library operates on raw text input (and uses numpy ops, not TF ops). This is a barrier to using the tokenizer in TF Transform, and consequently to composing a transform graph that can be used in TF Serving... has anyone explored shimming Tensor ops here? Or is there another way to have the tokenizer work within the graph? I'm curious if there are workarounds or successful integrations to speak of... or if there are other issues with using Huggingface Transformers with TFX in production that we should be aware of!

Many thanks,
Joshua Pham

Robert Crowe

Mar 30, 2020, 5:15:12 PM
to TensorFlow Extended (TFX)
Unless I'm missing something, this sounds a lot like our recent blog post about using a BERT model with transfer learning:


Part 2 should be coming out soon with more details on the implementation.

Robert Crowe | TensorFlow Developer Advocate | rober...@google.com  | @robert_crowe



--
You received this message because you are subscribed to the Google Groups "TensorFlow Extended (TFX)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tfx+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/tfx/286c9522-4b48-43a8-86f3-6b034612cbcd%40tensorflow.org.

Joshua Pham

Mar 31, 2020, 3:27:18 PM
to TensorFlow Extended (TFX)
Thank you Robert! The blog post certainly has a lot that we can look at using. As a progress update I've integrated that BertTokenizer implementation into our Transform component to output tensors for input ids and masks, and am attempting to get it working with a forward pass into Huggingface DistilBERT, using their library, in a Trainer component. Fingers crossed that it plays nicely.
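
For readers unfamiliar with the "input ids and masks" mentioned above, here is a minimal pure-Python sketch of the shape of those outputs. It is not the Transform code from this thread: the function name `pad_to_length`, the pad id of 0, and the sample token ids are assumptions for illustration, and a real `preprocessing_fn` would have to express the same logic with TF ops so that it can live in the graph.

```python
from typing import Dict, List


def pad_to_length(
    token_ids: List[int], max_seq_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Truncate or pad token ids and build the matching input mask."""
    # Naive truncation: a real pipeline would keep the final separator token.
    ids = token_ids[:max_seq_length]
    padding = max_seq_length - len(ids)
    return {
        # Real tokens first, then pad ids up to the fixed length.
        "input_ids": ids + [pad_id] * padding,
        # 1 marks a real token, 0 marks padding.
        "input_mask": [1] * len(ids) + [0] * padding,
    }


# Illustrative token ids only; they are placeholders, not tokenizer output.
features = pad_to_length([101, 7592, 2088, 102], max_seq_length=8)
print(features["input_ids"])   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(features["input_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Both lists always have length `max_seq_length`, which is what lets a downstream model accept them as fixed-shape tensors.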

Chris Fregly

Mar 31, 2020, 3:35:04 PM
to Joshua Pham, TensorFlow Extended (TFX)
@Joshua: I'm very interested in seeing what you come up with. Can you share when you're done? Would love to hack on it when it's ready!

Super cool.


Chris Fregly

Mar 31, 2020, 4:38:59 PM
to TensorFlow Extended (TFX), Joshua Pham
@Robert:  The Colab notebook referenced in the Part 1 blog post is all kindsa broken.


Can you take a look?  Maybe I'm doing something wrong?  Although I'm just clicking "Run All", so I'm not sure how I could mess that up.   :)

Also, when is Part 2 expected?  I'd love to see the deep-dive!

Joshua Pham

Mar 31, 2020, 5:13:36 PM
to TensorFlow Extended (TFX)
Robert -- there's a hitch when I'm trying to run TF Text v1.15.1 in a TF Transform component in Kubeflow... I get the error "tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'CaseFoldUTF8' in binary"


In case it's relevant, I read that TF Text depends on TF 2.0, but the best option I have is to use the TF 2.0 API within TF 1.15.2 since we haven't migrated to TF 2.0 yet. This is why I'm using TF Text v1.15.1 (because 2.0.0 release notes say the major version numbers track with TF)


It looks like there was some fix around the `CaseFoldUTF8` op in release v1.15.0: https://github.com/tensorflow/text/releases/tag/v1.15.0. I'm curious if this error I'm seeing is related? Or am I good to use TF Text 2.0+ versions with older TF 1.x?

Hannes Hapke

Apr 1, 2020, 11:23:45 AM
to TensorFlow Extended (TFX)
Hi Chris, 

Thank you for your feedback. Without knowing what exact error you encountered, here are two thoughts:
1) The notebook takes some time to execute, and if you hit "Run all", your Colab session might expire. I would recommend running it cell-by-cell so you don't hit the time limit (also increase the number of training steps if needed).
2) We discovered an issue with exporting the model when the Keras model contains a tf.hub model, so the TFMA step might fail. This process works fine when you go the model_to_estimator route. A bunch of good people are looking into this issue and I hope we can fix it in the coming days. Just to be clear: this problem is specific to this pipeline; the native Keras implementation is working.
 
I'll keep you posted.

Cheers, 
Hannes



Hannes Hapke

Apr 1, 2020, 11:30:44 AM
to TensorFlow Extended (TFX)
Hi Joshua, 

The TF Text version needs to match the TF version: TF Text 2.1 for TF 2.1, etc.

- Hannes
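
A minimal sketch of the version check described above, which could be run in each environment involved (locally, on the Kubeflow pods, on the Dataflow workers). The helper names are hypothetical, and comparing only major.minor is an assumption based on the "2.1 for 2.1" rule of thumb. A matching pair is a necessary condition; it does not by itself rule out an "Op type not registered" error.

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.15.2'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_match(tf_version: str, text_version: str) -> bool:
    """True when TF and TF Text agree on major and minor version."""
    return major_minor(tf_version) == major_minor(text_version)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what this environment actually has installed (None if missing).
tf_version = installed_version("tensorflow")
text_version = installed_version("tensorflow-text")
print("tensorflow:", tf_version, "| tensorflow-text:", text_version)
if tf_version and text_version:
    print("major.minor match:", versions_match(tf_version, text_version))
```

For the versions mentioned in this thread, `versions_match("1.15.2", "1.15.1")` is true, while `versions_match("1.15.2", "2.1.1")` is false.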

Joshua Pham

Apr 1, 2020, 11:53:50 AM
to TensorFlow Extended (TFX)
Thanks Hannes, I've been sticking with TF Text v1.15.1. I verified that TF Text is installed on the Dataflow workers, thinking that it might have just been an issue where it was only installed on the Kubeflow pods. But I'm still getting the same error... using a different model now, so it doesn't use CaseFoldUTF8. But I'm getting "Op type not registered 'RegexSplitWithOffsets'". I think I will create an issue in TF Transform...

Hannes Hapke

Apr 1, 2020, 11:56:08 AM
to Joshua Pham, TensorFlow Extended (TFX)
Hi Joshua, 

I think the op "RegexSplitWithOffsets" is only available in tf.text 2.1 and higher. 

- Hannes


---
Hannes Hapke
t: @hanneshapke

For secure messages, please use this pgp key: http://bit.ly/1EJhUxJ 




Chris Fregly

Apr 1, 2020, 1:32:10 PM
to Hannes Hapke, TensorFlow Extended (TFX)
I saw the TFMA step fail after I pulled the notebook into my own Jupyter environment outside of Colab.

Colab is still broken no matter which way I run it.  Possibly my Colab is defaulting to TF 1.x and I need 2.x?  The errors went well beyond session expiration and the other possible causes you mentioned.  It doesn't matter at this point since I pulled the notebook in-house, but others might have a similar issue.

Look forward to the TFMA fix.  This is blocking me at the moment.


Joshua Pham

Apr 1, 2020, 3:04:44 PM
to TensorFlow Extended (TFX)
Hi Hannes, I ran the pipeline locally, successfully (produced output). It is currently not working in Kubeflow/Dataflow. I verified my TF Text version to be 1.15.1. https://github.com/tensorflow/text/releases/tag/v1.15.0 indicates that there was a regex_split_op added; does this refer to RegexSplitWithOffsets?



Robert Crowe

Apr 1, 2020, 3:54:48 PM
to TensorFlow Extended (TFX)
Quick update to let you know that the team is aware of the issues with Colab and is currently working on a fix.  I'll update again when we have one.


Robert Crowe | TensorFlow Developer Advocate | rober...@google.com  | @robert_crowe



Chris Fregly

Apr 9, 2020, 5:38:25 PM
to Robert Crowe, TensorFlow Extended (TFX)
Any update, Robert?  Anxiously awaiting this fix!

Robert Crowe

Apr 10, 2020, 5:32:35 PM
to Chris Fregly, TensorFlow Extended (TFX)

Chris Fregly

Apr 10, 2020, 6:20:06 PM
to Robert Crowe, TensorFlow Extended (TFX)
Hey Robert!

These links are for the taxi cab example.  I was hoping for a fix on the original BERT / IMDB notebook.  It's still busted.  


I can't seem to get past this in the notebook...  any idea?  Excited to use this notebook!!

Hannes Hapke

Apr 10, 2020, 6:31:41 PM
to Chris Fregly, Robert Crowe, TensorFlow Extended (TFX)
Hi Chris, 

A bunch of people are still looking into the issue. We suspect a problem with the underlying TFX code.
I will post an update once we have a fix. The notebook with the same pipeline but with the estimator implementation works fine, and the models are running in prod.

- Hannes


Robert Crowe

Apr 10, 2020, 7:06:45 PM
to Hannes Hapke, Chris Fregly, TensorFlow Extended (TFX)
Sorry again, I was focused on the pip install issues which were blocking previously, not just this one but the taxi ones too.  I've updated the pip install in the BERT notebook.


Robert Crowe | TensorFlow Developer Advocate | rober...@google.com  | @robert_crowe


Robert Crowe

Apr 10, 2020, 7:12:15 PM
to Hannes Hapke, Chris Fregly, TensorFlow Extended (TFX)
I'm able to get past StatisticsGen now, but it gets stuck on Evaluator.


Robert Crowe | TensorFlow Developer Advocate | rober...@google.com  | @robert_crowe


Chris Fregly

Apr 22, 2020, 4:07:52 PM
to Robert Crowe, Hannes Hapke, TensorFlow Extended (TFX)
Any update?  I tried the latest TFX 0.21.4 and this TFX+BERT example still isn't working.

Sorry to keep bugging you all, but I'm hosting a 400 person online event this Saturday - and I'd love to highlight this TFX example.

Is it possible to get this working by then?  Not a big deal, but super-nice-to-have.  This example has been broken since it launched, unfortunately.


Hannes Hapke

Jun 4, 2020, 6:38:51 PM
to Chris Fregly, Robert Crowe, TensorFlow Extended (TFX)
Hi Chris, 

Quick follow up: With the release of TFX 0.22, the pipeline example is working with the native Keras implementation:
https://github.com/tensorflow/workshops/blob/master/blog/TFX_Pipeline_for_Bert_Preprocessing.ipynb

Example code for requesting predictions from a TFServing instance can be found here: 
https://colab.research.google.com/gist/hanneshapke/e6b031484ed2474165c4eccb37f69ce3/request-prediction-from-bert-model-with-tft-preprocessing.ipynb

Since the prediction requests are a bit cumbersome with the tf.Example data structure (as shown in the example above), I created a second example of the same pipeline but without the tf.Example dependency of the exported model:
https://gist.github.com/hanneshapke/f0980b7422d367808dae409536fe9b46

I hope the examples help. Feedback and comments are appreciated. 

Best regards,
Hannes
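
To illustrate why requests against a tf.Example-based signature are more cumbersome than raw-text requests, here is a minimal sketch of the two TF Serving REST request bodies. The input names `examples` and `text` are assumptions (the real names depend on the exported serving signature), and the serialized bytes are a placeholder for what `tf.train.Example(...).SerializeToString()` would produce.

```python
import base64
import json


def example_request_body(serialized_example: bytes, input_name: str = "examples") -> str:
    """Build a REST body for a signature that expects serialized tf.Example."""
    # TF Serving's REST API expects binary values as {"b64": "<base64 string>"}.
    encoded = base64.b64encode(serialized_example).decode("utf-8")
    return json.dumps({"instances": [{input_name: {"b64": encoded}}]})


def raw_text_request_body(text: str, input_name: str = "text") -> str:
    """Build a REST body for a signature that accepts raw strings directly."""
    return json.dumps({"instances": [{input_name: text}]})


# Placeholder bytes standing in for a real serialized tf.Example.
print(example_request_body(b"placeholder-serialized-example"))
print(raw_text_request_body("This movie was great."))
```

Either body would be POSTed to the model's `:predict` endpoint; the second form avoids constructing and serializing a protocol buffer on the client.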


Chris Fregly

Jun 5, 2020, 2:42:20 AM
to Hannes Hapke, Robert Crowe, TensorFlow Extended (TFX)
Nice!!  Thanks, Hannes.  Very exciting.  And yes, I was stumbling through that tf.Example sample.  Extra high-five for clearing my path.

And please release your book!  We’re all waiting for it!  I check amazon every few days!! :)


Hannes Hapke

Jun 5, 2020, 11:42:03 AM
to Chris Fregly, Robert Crowe, TensorFlow Extended (TFX)

Thank you, Chris.

Thanks to the wonderful feedback from Robert, Irene, and others, Catherine and I were able to finalize the draft and to ship the final version to O'Reilly. The book "Building Machine Learning Pipelines" is now in their hands. 
If everything goes according to the production schedule, there will be a digital version available at the end of July, and the print version will ship by the end of August.

- Hannes

Robert Crowe

Jun 5, 2020, 12:19:18 PM
to Hannes Hapke, Catherine Nelson, Irene Giannoumis, Chris Fregly, TensorFlow Extended (TFX)
+Catherine Nelson +Irene Giannoumis 

That's great news, and congratulations!  It's a great book, and a huge contribution to the ML community.


Robert Crowe | TensorFlow Developer Advocate | rober...@google.com  | @robert_crowe


刘超

Nov 26, 2020, 3:55:08 AM
to TensorFlow Extended (TFX), hannes...@gmail.com, rober...@google.com, Chris Fregly
Hi Hannes,

Thanks for your great code. I've run through all the code above, but I got stuck when changing the model. When I use this model ("https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/3"), I get the errors below. Do you have any idea how to fix it?

  ValueError: Could not find matching function to call loaded from the SavedModel. Got:
      Positional arguments (3 total):
        * [<tf.Tensor 'inputs:0' shape=(None, 64) dtype=int32>, <tf.Tensor 'inputs_1:0' shape=(None, 64) dtype=int32>, <tf.Tensor 'inputs_2:0' shape=(None, 64) dtype=int32>]
        * False
        * None
      Keyword arguments: {}

    Expected these arguments to match one of the following 4 option(s):

    Option 1:
      Positional arguments (3 total):
        * {'input_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_mask'), 'input_word_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_word_ids'), 'input_type_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_type_ids')}
        * True
        * None
      Keyword arguments: {} 

Hannes Hapke

Nov 26, 2020, 1:40:36 PM
to 刘超, TensorFlow Extended (TFX), rober...@google.com, Chris Fregly
The new versions of BERT and ALBERT on TF Hub changed the data structure in which the model receives its input data. It changed from a list to a dictionary.

The input variables should be passed to the model as follows:
    bert_layer = load_bert_layer()
    encoder_inputs = dict(
        input_word_ids=tf.reshape(input_word_ids, (-1, max_seq_length)),
        input_mask=tf.reshape(input_mask, (-1, max_seq_length)),
        input_type_ids=tf.reshape(input_type_ids, (-1, max_seq_length)),
    )
    outputs = bert_layer(encoder_inputs)

The example pipelines for BERT and ALBERT have been updated to reflect the latest TFX version and the updates to the TF Hub models.
Hannes
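
For code that still builds the older list-style inputs, a small adapter along these lines may help during migration. It is a sketch, not part of the updated pipelines: the helper name `to_encoder_inputs` is hypothetical, the key names are taken from the error message above, and the assumed list order (word ids, mask, type ids) should be checked against your own code.

```python
from collections.abc import Mapping
from typing import Any, Dict, Sequence, Union

# Key names as they appear in the SavedModel signature in the error message.
ENCODER_KEYS = ("input_word_ids", "input_mask", "input_type_ids")


def to_encoder_inputs(inputs: Union[Mapping, Sequence[Any]]) -> Dict[str, Any]:
    """Convert list-style encoder inputs into the dictionary form."""
    if isinstance(inputs, Mapping):
        # Already a dictionary: verify that every expected key is present.
        missing = [key for key in ENCODER_KEYS if key not in inputs]
        if missing:
            raise KeyError(f"missing encoder inputs: {missing}")
        return {key: inputs[key] for key in ENCODER_KEYS}
    if len(inputs) != len(ENCODER_KEYS):
        raise ValueError(f"expected {len(ENCODER_KEYS)} inputs, got {len(inputs)}")
    # Assumed order: [word ids, mask, type ids].
    return dict(zip(ENCODER_KEYS, inputs))


# Strings stand in for tensors here so the sketch runs without TensorFlow.
print(to_encoder_inputs(["word_ids_tensor", "mask_tensor", "type_ids_tensor"]))
```

The resulting dictionary can then be passed to the layer as a single argument, in the same way as `encoder_inputs` above.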

Robert Crowe

Nov 29, 2020, 2:11:09 PM
to Hannes Hapke, 刘超, TensorFlow Extended (TFX), Chris Fregly
Thanks again Hannes!

Robert Crowe | TensorFlow Developer Engineer | rober...@google.com  | @robert_crowe


Satwik Ram

Mar 25, 2022, 8:06:00 AM
to TensorFlow Extended (TFX), Chris Fregly, hannes...@gmail.com, TensorFlow Extended (TFX), rober...@google.com
Chris,

Did you overcome the problem?

If yes, could you briefly expand on the solution?

If possible please share the updated colab notebook link.

Thanks in advance.

Ryan Clough

Mar 25, 2022, 11:47:34 AM
to Satwik Ram, TensorFlow Extended (TFX), Chris Fregly, hannes...@gmail.com, rober...@google.com
I don't have a direct answer to your question, but perhaps these will help: there are now two reasonably comprehensive NLP examples[1] in the TFX repo that show how to fine-tune a pre-trained BERT model from TF Hub, and they could be extrapolated to other models once you understand how they work.

If you want to use huggingface itself, you will likely need custom components unless you can figure out a way to get their preprocessing to translate into TF-Text constructs (seems it may be possible in some cases, but I haven't tried it). In the ideal case HF might have a TF model you can use in Trainer, but you will need to ensure tokenization/preprocessing can be done in the way the model expects, which might otherwise need a custom component in some cases.

TFX's existing support only extends to TF Hub, which doesn't have nearly as many models as HF, but has most of the popular ones.




--
Ryan Clough (He/Him)
ML Infrastructure, Spotify NYC

Sayak Paul

Mar 26, 2022, 9:55:22 AM
to Ryan Clough, jo...@huggingface.co, Satwik Ram, TensorFlow Extended (TFX), Chris Fregly, hannes...@gmail.com, rober...@google.com
Adding Joao (from the Hugging Face team) to this thread.
Sayak Paul | sayak.dev


