Text embedding for non-English text


Henri B

Apr 29, 2023, 5:54:35 PM
to simple-ml-for...@googlegroups.com
I was reading the document provided in this discussion:

I understand the model was pre-trained on the English Google News dataset. I was wondering if there is any plan to expand to the non-English Google News datasets.
Thank you!

Dennis Pickert

Apr 30, 2023, 2:04:18 AM
to User Group | Simple ML for Sheets (Public)
Hi Henri,

If you are referring to https://arxiv.org/pdf/2009.09991.pdf (p. 4, section 4.2 Methods):
"PreTrained is a 128-dimension term based text embedding [3] trained on the English Google News 200B corpus. 1"

As I understand it, that embedding is only used as one of the comparison methods in the evaluation.
Simple ML uses the "Categorical Sets" approach instead (GreedyMask does not use pre-trained embeddings):
https://github.com/tensorflow/decision-forests/blob/main/documentation/text_features.md#better-strategies
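
To illustrate why this approach does not depend on an English corpus, here is a rough sketch in plain Python. It is not the actual Simple ML or TensorFlow Decision Forests code. The function names are made up for the example, and I am assuming simple lowercase plus whitespace tokenization and a split condition that is true when the token set shares at least one token with a mask.

```python
# Hypothetical sketch, not the real library code: a categorical-set feature
# is just the set of tokens in a text cell, so no pre-trained embedding
# (English or otherwise) is involved.

def to_categorical_set(text):
    """Lowercase the text and split it on whitespace into a set of tokens.

    Assumption: whitespace tokenization, chosen only to keep the example small.
    """
    return frozenset(text.lower().split())


def matches_mask(tokens, mask):
    """Assumed split condition: true if the token set and the mask intersect."""
    return not tokens.isdisjoint(mask)


examples = [
    "Le chat dort sur le canapé",      # French
    "Die Katze schläft auf dem Sofa",  # German
]

token_sets = [to_categorical_set(text) for text in examples]

# A mask like this would be learned from the training data; it is
# hand-written here purely for illustration.
mask = {"chat", "katze"}

for text, tokens in zip(examples, token_sets):
    print(text, "->", matches_mask(tokens, mask))
```

Note that whitespace splitting only makes sense for languages that separate words with spaces. Text in Chinese or Japanese, for example, would need a different tokenization step before it can be treated as a set of tokens.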

If you like, you can find further pre-trained language models here (e.g. Google Chinese News ...):
https://tfhub.dev/s?q=nnlm

Hope that helps,
Dennis

User Group | Simple ML for Sheets (Public)

May 8, 2023, 2:40:29 AM
to User Group | Simple ML for Sheets (Public)
Hi everyone,

Dennis is right, thank you for chiming in!

Best, Richard