Hi gophers, I’ve implemented Unicode text segmentation for Go:
https://github.com/clipperhouse/uax29/words
It tokenizes text into words, sentences, or graphemes according to Unicode Standard Annex #29 (UAX #29, Unicode Text Segmentation). I’d been tokenizing text in ad hoc ways before learning that Unicode defines a standard for this.
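To give a sense of what "ad hoc" can look like, here is a minimal sketch using only the standard library. It is not the uax29 API, and adHocWords is a hypothetical helper made up for illustration. It splits on anything that is not a letter or a number, which breaks contractions and decimals apart:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// adHocWords is a hypothetical helper, not part of any library.
// It treats every rune that is neither a letter nor a number as a separator.
func adHocWords(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	// The apostrophe and the decimal point are treated as separators,
	// so "Don't" and "3.50" are each split in two.
	fmt.Printf("%q\n", adHocWords("Don't pay $3.50 for it."))
	// prints ["Don" "t" "pay" "3" "50" "for" "it"]
}
```

The word boundary rules in the Unicode spec keep "Don't" and "3.50" together as single tokens, which is the kind of case that is easy to get wrong by hand.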
Hopefully it’s useful to you; feedback is welcome. (I’m also talking to @mpvl about how such functionality might be useful in x/text.)