[ANN] Unicode text segmentation


Matt Sherman

May 7, 2020, 12:06:18 PM
to golang-nuts
Hi gophers, I’ve implemented Unicode text segmentation for Go: https://github.com/clipperhouse/uax29/words

It tokenizes text into words, sentences, or graphemes according to the Unicode text segmentation spec (UAX #29). I’d been tokenizing text in ad hoc ways until I learned that there is a Unicode standard for it.

Hopefully useful for you, feedback welcome. (I’m also talking to @mpvl about how such functionality might be useful in x/text.)

Matt Sherman

May 7, 2020, 12:07:15 PM
to golang-nuts
Sorry, bad link. Here it is: https://github.com/clipperhouse/uax29