Unsupervised Tokenization Learning


Anton Kolonin @ Gmail

May 24, 2022, 12:10:35 AM
to link-grammar, lang-learn
Paper: https://arxiv.org/abs/2205.11443
In the presented study, we discover that the so-called "transition freedom"
metric appears superior for unsupervised tokenization purposes compared
to statistical metrics such as mutual information and conditional
probability, providing F-measure scores in the range from 0.71 to 1.0
across the explored corpora. We find that different languages require
different derivatives of that metric (such as variance and "peak values")
for successful tokenization. Larger training corpora do not necessarily
result in better tokenization quality, while compacting the models by
eliminating statistically weak evidence tends to improve performance. The
proposed unsupervised tokenization technique provides quality better than
or comparable to a lexicon-based one, depending on the language.
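For intuition, here is a minimal Python sketch of the general idea behind
such metrics: count how many distinct characters can follow a given context,
and place token boundaries where that "freedom" peaks above a threshold. The
unigram character context, the function names (`successor_freedom`,
`tokenize`), the toy corpus, and the thresholding rule are my illustrative
assumptions, not the paper's actual algorithm or metric derivatives.

```python
from collections import defaultdict

def successor_freedom(corpus):
    """For each character, count the distinct characters observed after it.

    This is a crude stand-in for a transition-freedom metric: a character
    that many different characters can follow is a likely token end.
    """
    nxt = defaultdict(set)
    for text in corpus:
        for a, b in zip(text, text[1:]):
            nxt[a].add(b)
    return {c: len(s) for c, s in nxt.items()}

def tokenize(text, freedom, threshold):
    """Split `text` after every character whose freedom exceeds `threshold`."""
    tokens, start = [], 0
    for i, ch in enumerate(text[:-1]):
        if freedom.get(ch, 0) > threshold:
            tokens.append(text[start:i + 1])
            start = i + 1
    tokens.append(text[start:])
    return tokens

# Toy training corpus: concatenations of the "words" cat and dog.
freedom = successor_freedom(["catdog", "dogcat", "catcat", "dogdog"])
# Word-internal transitions are deterministic (freedom 1), while word-final
# 't' and 'g' have seen two different continuations (freedom 2).
print(tokenize("catdogcat", freedom, threshold=1))  # → ['cat', 'dog', 'cat']
```

In a real setting the context would be longer than one character and the
threshold (or peak detection) would be tuned per language, which is where the
variance and "peak value" derivatives mentioned in the abstract come in.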

Source: https://t.me/internlp

I will be presenting this work at a limited-access workshop this Friday -
see the bottom of:
https://wiki.opencog.org/w/AGI_Discussion_Forum

Best regards,

--
-Anton Kolonin
telegram/skype/facebook: akolonin
mobile/WhatsApp: +79139250058
akol...@aigents.com
https://aigents.com
https://www.youtube.com/aigents
https://www.facebook.com/aigents
https://wt.social/wt/aigents
https://medium.com/@aigents
https://steemit.com/@aigents
https://reddit.com/r/aigents
https://twitter.com/aigents
https://golos.in/@aigents
https://vk.com/aigents
https://aigents.com/en/slack.html
https://www.messenger.com/t/aigents
https://web.telegram.org/#/im?p=@AigentsBot