Personally I have a preference for the modern C++ approach too. It makes it easier to integrate with other C++ libraries and traditional C++ lint scanner tools. Though I’m not adamant about this preference.
The rest of this email is about an alternate implementation that already tries to overcome the shortcomings of the current RBBI implementation. It’s informational, and I’m not proposing changes to the design proposal.
I should mention that the Unicode Inflection project has its own wrapper around the ICU BreakIterator and redefines the meaning of a “word” to be usable in linguistic terms, rather than the double-click selection notion of a word. People involved with TTS systems will typically want to break a string into phonetic terms instead of linguistic terms. This implementation is needed for scenarios where you have something like “fishmarket” in a language that likes to compound words. You need to decompound it into “fish” and “market”, and detect the grammemes (grammatical category values) of each subword. This is important for grammatical agreement when you want to add a preposition or make the word definite. That way you don’t need all possible combinations of compound words, which keeps the lexical dictionary minimal. Words that combine numbers with letters, like “9AM”, also get separated.
The Unicode Inflection tokenizer API has the following features.
- The start and end index in UTF-16 code units for each token (segment) so that it’s easy to create a substring in C, C++, or Java.
- A copy of the actual string. This isn’t memory efficient. I don’t recommend this choice for large strings, but it’s helpful if you ever need to chunk the segmentation without needing to keep the entire original string in memory.
- The ability to reference a specific token from a TokenChain (Segments) by index. Though this isn’t typical usage, and some programming languages make this indexing feature hard to work with because of the misalignment of the meaning of the index value (code unit vs code point vs grapheme cluster).
- Each token has a token type. The current ubrk_getRuleStatus/UWordBreak API isn’t easy to interpret or use. You can sort of derive this information from the break iterator now, but the Unicode Inflection implementation currently ignores this information and deduces the type directly from character properties. Such information is helpful to have when you’re looking for significant words to inflect. It also allows you to count the number of words in a string.
- A lot of new users of this API tend to convert the significant word tokens to an array of the native string type. Why? Because it’s easier to pass such information around in a larger framework ecosystem, and it provides an easier way to ignore the fluff of whitespace and punctuation.
- More advanced users will iterate over the token chain. In an even more advanced usage (not included in Unicode Inflection), a contraction has to be expanded, and two adjacent tokens will have the same index range. The value will be the original value, and the clean value will have the uncontracted words. So you can get the original text, but map it to other normalized values, which is another reason to have actual strings instead of aliased strings.
- The clean value is the normalized value. Currently it’s just language-specific lowercasing.
- Each tokenizer is thread safe. An ICU break iterator currently can’t be shared between threads because the builder and iterator are in the same object, so each thread needs its own instance. This API caches break iterators and clones a new one when the cache runs out. Loading a break iterator is slow, which is especially noticeable when working on numerous small strings in multiple threads.
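To make the token model above a little more concrete, here is a rough sketch of what such a token chain could look like. This is not the Unicode Inflection code. The names (Token, TokenType, tokenize, significantWords) are made up for illustration, the classification is ASCII-only as a stand-in for real Unicode character properties, and ASCII lowercasing stands in for the language-specific clean value.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token categories; these are not the UWordBreak rule-status values.
enum class TokenType { Word, Number, Whitespace, Other };

struct Token {
    int32_t start;              // inclusive, in UTF-16 code units
    int32_t end;                // exclusive, in UTF-16 code units
    TokenType type;
    std::u16string value;       // owned copy of the original text
    std::u16string cleanValue;  // normalized form (ASCII lowercase in this sketch)
};

// ASCII-only classification. Anything else, including non-ASCII text, is Other.
static TokenType classify(char16_t c) {
    if ((c >= u'a' && c <= u'z') || (c >= u'A' && c <= u'Z')) return TokenType::Word;
    if (c >= u'0' && c <= u'9') return TokenType::Number;
    if (c == u' ' || c == u'\t' || c == u'\n' || c == u'\r') return TokenType::Whitespace;
    return TokenType::Other;
}

static std::u16string toAsciiLower(std::u16string s) {
    for (char16_t& c : s) {
        if (c >= u'A' && c <= u'Z') c = static_cast<char16_t>(c - u'A' + u'a');
    }
    return s;
}

// Splits text into maximal runs of one type, so "9AM" yields "9" and "AM".
// Each Other code unit becomes its own token.
std::vector<Token> tokenize(const std::u16string& text) {
    std::vector<Token> tokens;
    const int32_t length = static_cast<int32_t>(text.size());
    int32_t start = 0;
    while (start < length) {
        const TokenType type = classify(text[start]);
        int32_t end = start + 1;
        while (end < length && type != TokenType::Other && classify(text[end]) == type) {
            ++end;
        }
        const std::u16string value = text.substr(start, end - start);
        tokens.push_back({start, end, type, value, toAsciiLower(value)});
        start = end;
    }
    return tokens;
}

// Collects the clean values of word and number tokens, skipping the fluff.
std::vector<std::u16string> significantWords(const std::vector<Token>& tokens) {
    std::vector<std::u16string> words;
    for (const Token& t : tokens) {
        if (t.type == TokenType::Word || t.type == TokenType::Number) {
            words.push_back(t.cleanValue);
        }
    }
    return words;
}
```

Because each token owns its strings, a caller can drop the original text and still work with the chain, at the memory cost mentioned above. Counting words is then just the size of the significantWords result.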
If I were to create a minimal API, I’d provide the following:
- Provide the start and end range so that I can use substr on the original string. This would avoid the aliasing and memory management. Though providing some form of a string is typically what people want to use in NLP type systems. The indexes would typically be wanted in editors.
- Provide a clear type for the range of the string. I may not care about all ranges, but I still need to reference the content between significant words on occasion. For future extensibility, other annotations may be needed, like what you see in the Lucene analyzers.
- Make the builder and iterator separate, like the existing regex API. This is the immutability rationale in the design proposal.
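As a rough illustration of those three points, a minimal shape might look like the following. Everything here is hypothetical (Segment, SegmentType, Segmenter, Segmenter::Builder, SegmentCursor, and the splitDigitsFromLetters option) and none of it is taken from ICU or the design proposal. The classification is again ASCII-only, just to keep the sketch self-contained.

```cpp
#include <cstdint>
#include <string>

// Hypothetical range types; a real API would likely need more annotations.
enum class SegmentType { Word, Number, Whitespace, Other };

// Only indexes and a type: no string copy, no aliasing of the original text.
struct Segment {
    int32_t start;  // inclusive, in UTF-16 code units
    int32_t end;    // exclusive, in UTF-16 code units
    SegmentType type;
};

class Segmenter;

// Holds all mutable iteration state. One cursor per string, per thread.
// The Segmenter and the text must outlive the cursor.
class SegmentCursor {
public:
    bool next(Segment& out);
private:
    friend class Segmenter;
    SegmentCursor(const Segmenter& rules, const std::u16string& text)
        : rules_(rules), text_(text) {}
    const Segmenter& rules_;
    const std::u16string& text_;
    int32_t pos_ = 0;
};

// Immutable once built, so a single instance can be shared across threads.
class Segmenter {
public:
    class Builder {
    public:
        Builder& splitDigitsFromLetters(bool value) { split_ = value; return *this; }
        Segmenter build() const { return Segmenter(split_); }
    private:
        bool split_ = true;
    };
    SegmentCursor segment(const std::u16string& text) const {
        return SegmentCursor(*this, text);
    }
    // ASCII-only stand-in for real Unicode character properties.
    SegmentType classify(char16_t c) const {
        if ((c >= u'a' && c <= u'z') || (c >= u'A' && c <= u'Z')) return SegmentType::Word;
        if (c >= u'0' && c <= u'9') return split_ ? SegmentType::Number : SegmentType::Word;
        if (c == u' ' || c == u'\t' || c == u'\n' || c == u'\r') return SegmentType::Whitespace;
        return SegmentType::Other;
    }
private:
    explicit Segmenter(bool split) : split_(split) {}
    const bool split_;
};

// Advances to the next maximal run of one type; returns false at the end.
inline bool SegmentCursor::next(Segment& out) {
    const int32_t length = static_cast<int32_t>(text_.size());
    if (pos_ >= length) return false;
    const SegmentType type = rules_.classify(text_[pos_]);
    int32_t end = pos_ + 1;
    while (end < length && type != SegmentType::Other && rules_.classify(text_[end]) == type) {
        ++end;
    }
    out = {pos_, end, type};
    pos_ = end;
    return true;
}
```

A caller that wants a string would use text.substr(s.start, s.end - s.start) on the original, and an editor could use the indexes directly. The split between Segmenter and SegmentCursor mirrors how a compiled regex pattern is kept separate from its matcher.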
George