x/text: Interest in Unicode text segmentation?


Matt Sherman

Apr 15, 2020, 5:30:28 PM
to golang-nuts
Hi, I am working on a tokenizer based on Unicode text segmentation (UAX 29). I am wondering if there would be an interest in adding range tables for word break categories to the x/text or unicode packages. It appears they could be code-gen’d alongside the rest of the range tables.

Pardon if this is already being done and I have missed it. I see some mention of those categories (e.g. ALetter) in other places.
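For illustration, a range table for a word break category could look roughly like the sketch below. The table name `aLetterSubset` and the helper `isALetter` are hypothetical, and the table is a tiny hand-written subset of ALetter rather than generated data; a real table would be produced from WordBreakProperty.txt.

```go
package main

import (
	"fmt"
	"unicode"
)

// aLetterSubset is a hypothetical, hand-written subset of the UAX 29
// ALetter word break category. It only covers a few Latin-1 code points
// and is NOT the full category; a real table would be code-generated.
var aLetterSubset = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x0041, Hi: 0x005A, Stride: 1}, // A-Z
		{Lo: 0x0061, Hi: 0x007A, Stride: 1}, // a-z
		{Lo: 0x00AA, Hi: 0x00AA, Stride: 1}, // FEMININE ORDINAL INDICATOR
		{Lo: 0x00B5, Hi: 0x00B5, Stride: 1}, // MICRO SIGN
	},
}

// isALetter reports whether r is in the illustrative subset above.
func isALetter(r rune) bool {
	return unicode.Is(aLetterSubset, r)
}

func main() {
	for _, r := range "aZ1 µ" {
		fmt.Printf("%q ALetter(subset)=%v\n", r, isALetter(r))
	}
}
```

Because the table is an ordinary `*unicode.RangeTable`, it works with the existing `unicode.Is` and `unicode.In` functions without any new API.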

My code is here. Thanks.

Ian Lance Taylor

Apr 15, 2020, 5:56:57 PM
to Matt Sherman, Marcel van Lohuizen, golang-nuts
[ +mpvl ]

mp...@golang.org

Apr 16, 2020, 1:53:29 PM
to Ian Lance Taylor, Marcel van Lohuizen, Matt Sherman, golang-nuts
Yes, that would be interesting, especially if it can be generated from the raw Unicode data upon updates.

Matt Sherman

Apr 16, 2020, 3:47:13 PM
to mp...@golang.org, golang-nuts
Great. Yes, the data files are here: https://unicode.org/reports/tr41/tr41-26.html#Props0

I’ve done a proof of concept here: https://github.com/clipperhouse/uax29

To do it properly, I assume we’d want to use the house style here? https://github.com/golang/text/blob/master/unicode/rangetable/gen.go

mp...@golang.org

Apr 17, 2020, 1:47:32 AM
to Matt Sherman, golang-nuts, mp...@golang.org
Most of the x/text packages use tries and not rangetables. Tries allow arbitrary data (as long as it fits in an int) to be associated with runes, and they allow operating on UTF-8 without having to convert to runes.
https://godoc.org/golang.org/x/text/internal/triegen. But that’s not a requirement. 

The package 
https://godoc.org/golang.org/x/text/internal/gen/bitfield converts Go structs to ints and can be used to pack the rune data in a convenient way. 

Furthermore Package 
can be used for reading UCD files

And Package 
can be used to generate Go tables other than the trie and includes utilities for generating canonical x/text files, for example by including the Unicode and CLDR versions.

The top-level file gen.go is used to orchestrate building x/text and captures dependencies between packages.

I may have some designs lying around for the API.

Matt Sherman

Apr 17, 2020, 11:58:21 AM
to mp...@golang.org, golang-nuts
Nice. Well, happy to discuss how I might be helpful — implementation, API design, etc.

For the work I’m doing on UAX 29, the key API is unicode.Is. I am satisfied with the perf so far. unicode.Is dominates the profiling, but that’s to be expected, as my scanner is basically a tight loop evaluating rune categories. Certainly open to using a different trie-driven API.
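To make the "tight loop evaluating rune categories" concrete, here is a minimal sketch of that pattern. It is not the UAX 29 algorithm and not the code from the linked repository: the functions `classify` and `segment` are hypothetical, and the standard `unicode.Letter` and `unicode.Digit` tables stand in for word break categories, which the standard library does not provide.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify returns a coarse category for r using unicode.Is.
// Assumption: Letter and Digit are stand-ins for real word break
// categories such as ALetter or Numeric.
func classify(r rune) int {
	switch {
	case unicode.Is(unicode.Letter, r):
		return 1
	case unicode.Is(unicode.Digit, r):
		return 2
	default:
		return 0
	}
}

// segment splits s wherever the category changes between adjacent runes.
// This is a deliberate simplification; UAX 29 has many more rules.
func segment(s string) []string {
	var out []string
	start := 0
	prev := -1 // no previous category yet
	for i, r := range s {
		c := classify(r)
		if prev != -1 && c != prev {
			out = append(out, s[start:i])
			start = i
		}
		prev = c
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	fmt.Printf("%q\n", segment("abc 123"))
}
```

Every rune passes through one or more `unicode.Is` calls, so it is unsurprising that the lookup dominates a profile of a loop shaped like this.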