How to reproduce tokenization/normalization for CJK languages' NNLM models?


Aki Ariga

Oct 30, 2018, 9:57:40 PM
to TensorFlow Hub
Hi,

As far as I can tell from the documentation on tfhub.dev, we need to tokenize words before using pre-trained text embedding models such as nnlm-ja-dim128[1], nnlm-ko-dim128[2], and nnlm-zh-dim128[3].

To prevent a tokenization mismatch between the pre-trained models and new sentences fed to them, and to reduce the number of OOV tokens, we should use the same tokenization and/or normalization method (such as NFKC normalization[4]; see also the example normalization in mecab-ipadic-neologd[5]).
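
To make it concrete, this is the kind of normalization I have in mind (a sketch using Python's standard unicodedata module; whether these exact steps match what was used when training the models is precisely what I'd like to confirm):

import unicodedata

# NFKC folds full-width Latin letters/digits to ASCII and half-width
# katakana to full-width katakana.
text = "ＴｅｎｓｏｒＦｌｏｗ　Ｈｕｂ　１２３　ｶﾞｲﾄﾞ"
print(unicodedata.normalize("NFKC", text))  # "TensorFlow Hub 123 ガイド"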

Is there any code or documentation for reproducing the tokenization/normalization? Or do we not need to worry about preprocessing for word tokenization?



arnoegw

Nov 2, 2018, 9:04:51 AM
to TensorFlow Hub
Hi Aki, thanks for reaching out!

The TF Hub modules for text embeddings take entire sentences as input and internally take care of preprocessing (such as tokenization before a table lookup). This way, token-based and RNN-based modules should be usable interchangeably, with no need to do tokenization on your side.
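
For example, a minimal usage sketch (TF 1.x style, using the example inputs from the module's documentation page) looks like this:

import tensorflow as tf
import tensorflow_hub as hub

# The module maps each input string to a 128-dimensional embedding.
embed = hub.Module("https://tfhub.dev/google/nnlm-ja-dim128/1")
embeddings = embed(["ネコ", "猫 と 犬"])

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)  # (2, 128)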

arnoegw

Aki Ariga

Nov 2, 2018, 10:23:58 AM
to arnoegw, TensorFlow Hub
Thanks for your response, arnoegw!

As you may know, Japanese, Chinese, and Korean don't put spaces between words, yet as far as I can tell from the TF Hub NNLM documentation, the module assumes words are separated by spaces.

For example, this documentation page shows the following example:
https://tfhub.dev/google/nnlm-ja-dim128/1

embed(["ネコ", "猫 と 犬"])

While "猫 と 犬" includes spaces between words, original sentence should be "猫と犬", which doesn't have any space.

I also found a blog post mentioning that we need to normalize half-width alphabetic/numeric characters into full-width ones for Japanese text.
https://tjo.hatenablog.com/entry/2018/06/26/220234

Do you mean that spaces between words aren't required for the TF Hub module?

Regards,
Aki

On Fri, Nov 2, 2018 at 22:04, 'arnoegw' via TensorFlow Hub <h...@tensorflow.org> wrote:
--
-------
Michiaki Ariga

arnoegw

Nov 2, 2018, 10:57:55 AM
to TensorFlow Hub, arn...@google.com
Hi

Thanks for elaborating! I checked again, and you are right: for CJK, the automatic segmentation I described does not work. Indeed, the modules only tokenize naively on spaces, as you wrote. (So if you can put those spaces in yourself, you at least get the individual word embeddings combined within the module and can save coding the combiner step.) Moreover, the tokenization used at training time is currently not available in open source. I understand that this makes it very inconvenient to use those modules out of the box, but this is where we stand right now (and for the near future).
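
For instance, one way to put the spaces in is a tokenizer like MeCab (just a sketch using mecab-python3; MeCab is only one possible choice, and since the tokenization used at training time isn't published, its output is not guaranteed to match the module's vocabulary):

import MeCab

# -Owakati asks MeCab for space-separated (wakati-gaki) output.
tagger = MeCab.Tagger("-Owakati")
segmented = tagger.parse("猫と犬").strip()
print(segmented)  # "猫 と 犬" -- ready to pass to embed([...])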

I regret not having a better answer for you at this time, and I apologize for the misunderstanding in my first response.

arnoegw

Aki Ariga

Nov 2, 2018, 11:09:32 AM
to arnoegw, TensorFlow Hub
No problem.

It would be really helpful for us if the tokenization code were open sourced, or at least if the documentation listed the prerequisites for normalization.

On Fri, Nov 2, 2018 at 23:57, 'arnoegw' via TensorFlow Hub <h...@tensorflow.org> wrote: