[ANN] Unicode text segmentation

Matt Sherman

May 7, 2020, 12:06:18 PM

to golang-nuts

Hi gophers, I’ve implemented Unicode text segmentation for Go: https://github.com/clipperhouse/uax29/words

It tokenizes text into words, sentences, or graphemes according to the Unicode spec (UAX #29). I’d been tokenizing text in ad hoc ways, and then learned that there is a Unicode standard for it.

Hopefully useful for you, feedback welcome. (I’m also talking to @mpvl about how such functionality might be useful in x/text.)

Matt Sherman

May 7, 2020, 12:07:15 PM

to golang-nuts

Sorry, bad link. Here it is: https://github.com/clipperhouse/uax29
