Text embedding for non-English text

Henri B

Apr 29, 2023, 5:54:35 PM

to simple-ml-for...@googlegroups.com

I was reading the document provided in this discussion:

https://groups.google.com/g/simple-ml-for-sheets-users/c/yz1bp_U_pV4

I understand the model was pre-trained on the English Google News dataset. I was wondering whether there are any plans to extend this to non-English Google News datasets.

Thank you!

Dennis Pickert

Apr 30, 2023, 2:04:18 AM

to User Group | Simple ML for Sheets (Public)

Hi Henri,

If you are referring to https://arxiv.org/pdf/2009.09991.pdf (p. 4, section 4.2 Methods):
"PreTrained is a 128-dimension term based text embedding [3] trained on the English Google News 200B corpus."

As I understand it, that embedding is only used as one of the comparison methods in the paper's evaluation.
SimpleML itself uses the "Categorical Sets" approach (GreedyMask doesn't use pre-trained embeddings), so it doesn't depend on an English-only embedding:
https://github.com/tensorflow/decision-forests/blob/main/documentation/text_features.md#better-strategies
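
To illustrate why the categorical-set approach is not tied to English, here is a minimal sketch of the idea: text is turned into a set of tokens, and a split condition asks whether that set shares any token with a mask. The function names (`tokenize`, `matches_mask`), the sample sentences, and the hand-picked mask are all hypothetical; this is not the actual Simple ML or TF-DF implementation, where the mask is learned during training. It also assumes words are separated by whitespace or punctuation, so languages such as Chinese would need a different tokenizer.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    # \w is Unicode-aware in Python 3, so non-ASCII letters are kept.
    return frozenset(re.findall(r"\w+", text.lower()))


def matches_mask(tokens, mask):
    """Categorical-set style condition: True if any mask token is present."""
    return not tokens.isdisjoint(mask)


# Hypothetical French and German inputs; no pre-trained embedding is involved.
reviews = ["Le film était excellent", "Der Film war langweilig"]
token_sets = [tokenize(r) for r in reviews]

# Hand-picked mask for illustration only (a real learner would choose it).
mask = {"excellent", "super"}
results = [matches_mask(t, mask) for t in token_sets]
print(results)
```

Because the condition only compares tokens for equality, the same mechanism applies to any language that can be split into tokens.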

If you like, you can find further pre-trained language models here (e.g. Google Chinese News ...):
https://tfhub.dev/s?q=nnlm

Hope that helps,
Dennis

User Group | Simple ML for Sheets (Public)

May 8, 2023, 2:40:29 AM

to User Group | Simple ML for Sheets (Public)

Hi everyone,

Dennis is right, thank you for chiming in!

Best, Richard
