What exactly is considered as OOV for the Universal Sentence Encoder Transformer version?


krupal Modi

Feb 26, 2019, 2:28:05 AM
to TensorFlow Hub
Hi All,

I am currently using the Universal Sentence Encoder (Transformer version) provided at https://tfhub.dev/google/universal-sentence-encoder-large/3, and it works great except when the model encounters OOV words. I explored the model graph and found a DT_STRING tensor with 200,004 words (one of them an <UNK> tag), which I was able to extract from the model's proto. But I also noticed the model has a hash map with 400,000 buckets:

module/text_preprocessor/string_to_index_Lookup/hash_bucket
Operation: StringToHashBucketFast
Attributes (1)
num_buckets {"i":400000}

and an embedding tensor (module/Embeddings_en/sharded_*) broken into 17 shards, each shard holding a 35,297 x 320 tensor, which makes 35,297 x 17 = 600,049 rows in the first dimension.
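The shard arithmetic above can be checked quickly (the numbers are taken from the graph inspection described in this message):

```python
# Check the dimensions reported above: 17 shards, each 35,297 x 320,
# versus the 200,004-word explicit vocab (including <UNK>).
shards, rows_per_shard, dim = 17, 35_297, 320
total_rows = shards * rows_per_shard
print(total_rows)                # 600049
vocab_size = 200_004
print(total_rows - vocab_size)   # 400045 -- close to 400,000
```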

This is what confuses me. It would be a great favour, and would really help me, if someone could answer a few questions:
  • Why are there more buckets than the static vocab, and even more rows in the embedding tensor? Does the model break the words of a sentence into units smaller than words?
  • Can you please explain how the model embeds the units of a sentence before they reach the Transformer?
  • What exactly is considered OOV for the model?
Let me know if you need more clarification from my end. I am currently using this model for short-text similarity.

Daniel Cer

Feb 26, 2019, 2:09:12 PM
to krupal Modi, TensorFlow Hub
Hi Krupal!

OOVs are any words not found in the explicit vocab backed by the ~200k DT_STRING Tensor. 

We do use 400k OOV buckets. Each OOV word is hashed and mapped to one of the buckets to obtain its embedding vector. Using a large number of OOV buckets helps the model handle diverse training data from the web, while still benefiting from the explicit 200k vocab for more common words.  
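The lookup scheme Dan describes can be sketched in a few lines of Python. This is an illustrative toy, not the model's actual code: the real graph uses StringToHashBucketFast (a FarmHash-based op), while Python's built-in hash() stands in here, and the tiny vocab and bucket count are placeholders for the real 200,004 and 400,000.

```python
# Sketch: vocab lookup with OOV hash buckets. In-vocab words get their
# vocab index; OOV words are hashed into one of NUM_OOV_BUCKETS extra
# rows appended after the vocab, so every token maps to a valid
# embedding row.
VOCAB = ["<UNK>", "the", "cat", "sat"]   # stand-in for the ~200k vocab
NUM_OOV_BUCKETS = 6                      # stand-in for the 400k buckets

word_to_index = {w: i for i, w in enumerate(VOCAB)}

def token_id(word: str) -> int:
    if word in word_to_index:
        return word_to_index[word]
    # hash() is a stand-in for FarmHash; only the bucketing idea matters
    return len(VOCAB) + (hash(word) % NUM_OOV_BUCKETS)

# The embedding table then needs len(VOCAB) + NUM_OOV_BUCKETS rows --
# mirroring 200,004 + 400,000 = 600,004, which the real model stores as
# 17 shards x 35,297 = 600,049 rows (i.e. with a little padding).
for w in ["the", "cat", "frobnicate"]:
    assert 0 <= token_id(w) < len(VOCAB) + NUM_OOV_BUCKETS
```

This also explains the count mismatch in the original question: the embedding tensor covers the explicit vocab plus the OOV buckets.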

Dan

--
You received this message because you are subscribed to the Google Groups "TensorFlow Hub" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hub+uns...@tensorflow.org.
Visit this group at https://groups.google.com/a/tensorflow.org/group/hub/.

Arshad Javeed

Aug 7, 2020, 2:31:00 PM
to TensorFlow Hub, krupal...@gmail.com
Hi Dan,

Does it imply that whenever the model encounters an OOV word, it builds an embedding representation for it using the 400k OOV buckets?

Also, I was curious how different such a representation would be compared to what the BERT models do, using the skip-gram word representation.


Thanks,
Arshad


Daniel Cer

Aug 17, 2020, 6:46:00 PM
to Arshad Javeed, TensorFlow Hub, krupal Modi
Yes, for the universal-sentence-encoder-large model, OOVs are hashed to map them to one of the 400k OOV buckets. 

For our multilingual models (e.g., universal-sentence-encoder-multilingual-large), we use SentencePiece for tokenization. This is similar to the WordPiece tokenization used by BERT.

However, our training objective still differs from BERT's in that our models are trained largely on sentence-level retrieval/ranking tasks.
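The subword idea Dan mentions can be sketched with a toy greedy longest-match tokenizer. This is a heavy simplification (real SentencePiece learns its pieces from data and uses different matching), but it shows why subword models have no true OOV: any word decomposes into smaller known pieces, down to single characters. The piece inventory below is invented for illustration.

```python
# Toy greedy longest-match subword tokenizer (illustrative only; not
# how SentencePiece actually works internally).
PIECES = {"un", "break", "able", "b", "r", "e", "a", "k", "u", "n", "l"}

def subword_tokenize(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # take the longest known piece that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in PIECES:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece covers {word[i]!r}")
    return pieces

print(subword_tokenize("unbreakable"))  # ['un', 'break', 'able']
```

A word absent from the vocabulary still gets a representation by falling back to shorter pieces, which is why the multilingual models do not need OOV hash buckets.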

Dan  