Hi All,
Currently I am using the Universal Sentence Encoder Transformer version provided at
https://tfhub.dev/google/universal-sentence-encoder-large/3, and it works great except when the model encounters OOV words. While exploring the model graph I found a DT_STRING tensor with 200,004 words (one of which is an <UNK> tag), which I was able to extract from the model's proto. But I also noticed that the model has a hash map with 400,000 buckets:
module/text_preprocessor/string_to_index_Lookup/hash_bucket
Operation: StringToHashBucketFast
num_buckets: 400000
and an embedding tensor (module/Embeddings_en/sharded_*) split into 17 shards, each shard holding a 35297 x 320 tensor, which makes 35297 x 17 = 600,049 rows across the first dimension.
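For reference, this is roughly how I inspected those ops and tried out the hashing behaviour (a minimal sketch, assuming TF 1.x and tensorflow_hub; the op-name filters and num_buckets=400000 just mirror what I listed above):

```python
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
_ = embed(["a sentence"])  # instantiate the apply graph so the preprocessing ops exist

# List the preprocessing and embedding ops in the default graph
graph = tf.get_default_graph()
for op in graph.get_operations():
    if "text_preprocessor" in op.name or "Embeddings_en" in op.name:
        print(op.name, op.type)

# The same op type hashes any token string (in-vocab or not) into one of 400,000 buckets
tokens = tf.constant(["hello", "zxqvw"])
bucket_ids = tf.string_to_hash_bucket_fast(tokens, num_buckets=400000)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(bucket_ids))
```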
This is what confuses me. It would be a great favour and would really help me if someone could answer a few questions:
- Why are there more hash buckets than entries in the static vocab, and even more rows in the embedding tensor? Does the model break the words of a sentence into units smaller than words?
- Can you please explain how the model embeds the units of a sentence before they hit the Transformer?
- What exactly is considered OOV by the model?
Let me know if you need more clarification from my end. I am currently using this model for short-text similarity.
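For context, here is roughly how I am computing the similarity (a minimal sketch, assuming TF 1.x and tensorflow_hub; the sentences are just placeholders):

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
sentences = ["how do I reset my password", "password reset instructions"]  # placeholder short texts
embeddings = embed(sentences)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    a, b = sess.run(embeddings)
    # cosine similarity between the two 512-dim sentence embeddings
    print(np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```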