API proposal: C++ API for Segmenter


Elango Cheran

May 15, 2026, 8:17:28 PM
to icu-d...@unicode.org
Hi everyone,
As I begin to look at the C++ implementation of the Segmenter API, a few important high-level design decisions have already come up. I would like to get your opinions and carry them into the discussion in our next TC meeting.

I've added the C++ specific API design and questions in the C++ Design Details tab of the design doc. Specifically, I've added the following high level questions:

  1. Input string type

    • UTF-16: std::u16string_view or UnicodeString?

    • UTF-8: std::string_view or StringPiece?

  2. Return type: Classic ICU or modern C++?

    • Classic ICU: return a pointer, and by convention the caller takes ownership.

    • Modern C++: return a smart pointer, so that ownership is clearly indicated by the type.

  3. If modern C++ return type: which style?

    • ICU style: return LocalPointer<Segments>

    • Standard C++ style: std::unique_ptr<Segments>
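To make questions 2 and 3 concrete, here is a minimal sketch contrasting the two ownership styles. `Segments`, `segmentClassic`, `segmentModern`, and `ownershipDemo` are hypothetical stand-ins invented for illustration, not the proposed ICU API. ICU's `LocalPointer<T>` would play the same role as `std::unique_ptr<T>` in the second function.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>

// Hypothetical stand-in for the result type; not the real ICU class.
struct Segments {
    std::u16string text;
};

// "Classic ICU" style: the caller owns the result by convention only.
Segments *segmentClassic(std::u16string_view s) {
    return new Segments{std::u16string(s)};
}

// "Modern C++" style: ownership is part of the return type.
std::unique_ptr<Segments> segmentModern(std::u16string_view s) {
    return std::make_unique<Segments>(Segments{std::u16string(s)});
}

// Exercises both styles and returns the total number of code units seen.
std::size_t ownershipDemo() {
    Segments *raw = segmentClassic(u"hello world");
    assert(raw != nullptr);
    std::size_t n = raw->text.size();
    delete raw;  // Forgetting this leaks; nothing in the signature warns about it.

    auto owned = segmentModern(u"hello world");
    n += owned->text.size();
    return n;  // `owned` is released automatically when it goes out of scope.
}
```

With the raw pointer, the ownership transfer is documented only by convention. With the smart pointer, a caller cannot forget to release the result.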


Here is an example of how the answers to the above questions might look in `segmenter.h`:

class U_COMMON_API_CLASS Segmenter : public UObject {
public:
   ~Segmenter() override;

   virtual std::unique_ptr<Segments> segment(std::u16string_view s, UErrorCode &errorCode);

   virtual std::unique_ptr<SegmentsUTF8> segment(StringPiece s, UErrorCode &errorCode);
};
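As a rough idea of what a call site could look like with such a declaration, here is a self-contained sketch. Everything in namespace `sketch` is a stand-in so that it compiles without ICU headers: the two-value `UErrorCode`, the `U_FAILURE` helper, the `Segments` and `SegmentsUTF8` structs, and the use of `std::string_view` in place of `StringPiece` are all assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string_view>

namespace sketch {

// Minimal stand-ins so the sketch compiles without ICU headers.
enum UErrorCode { U_ZERO_ERROR = 0, U_ILLEGAL_ARGUMENT_ERROR = 1 };
inline bool U_FAILURE(UErrorCode code) { return code > U_ZERO_ERROR; }

// Hypothetical result types; the real Segments API is still under design.
struct Segments { std::u16string_view source; };
struct SegmentsUTF8 { std::string_view source; };

class Segmenter {
public:
    virtual ~Segmenter() = default;

    virtual std::unique_ptr<Segments> segment(std::u16string_view s, UErrorCode &errorCode) {
        if (U_FAILURE(errorCode)) { return nullptr; }  // ICU convention: do nothing after a prior failure.
        return std::make_unique<Segments>(Segments{s});
    }

    // std::string_view stands in for StringPiece here.
    virtual std::unique_ptr<SegmentsUTF8> segment(std::string_view s, UErrorCode &errorCode) {
        if (U_FAILURE(errorCode)) { return nullptr; }
        return std::make_unique<SegmentsUTF8>(SegmentsUTF8{s});
    }
};

// Shows that overload resolution picks the UTF-16 or UTF-8 variant from the literal type.
inline bool callSiteDemo() {
    Segmenter segmenter;
    UErrorCode status = U_ZERO_ERROR;
    auto utf16 = segmenter.segment(u"The quick brown fox", status);
    auto utf8 = segmenter.segment("The quick brown fox", status);
    assert(!U_FAILURE(status));
    return utf16 != nullptr && utf8 != nullptr;
}

}  // namespace sketch
```

One thing the sketch makes visible: if the result only holds a view of the input, the caller has to keep the original string alive for as long as the result is used.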


Your answers to the questions will help inform the rest of the design & impl details of the API, and could be useful for how we approach future API design.

Thanks,
Elango

Fredrik Roubert

May 18, 2026, 9:49:23 AM
to Elango Cheran, icu-d...@unicode.org
On Sat, May 16, 2026 at 2:17 AM Elango Cheran <ela...@unicode.org> wrote:

> Classic ICU or modern C++?

I'd say that unless you have a very compelling reason for why doing
"Classic ICU" would make this easier to use, modern C++ should be the
default choice whenever possible.

--
Fredrik Roubert
rou...@google.com

Robin Leroy

May 18, 2026, 11:29:26 AM
to Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
I agree with Fredrik that modern C++ is where we should go (and where we have been going in recent ICU4C work).

It is not clear to me that returning a pointer (even a unique_ptr), presumably to do user-visible polymorphism on Segments, is the modern C++ way to go here; this feels like a translation of the Java rather than a modern C++ design.

I would expect some kind of range-like object, like we did for the UTF iterators. I don’t think it is useful or necessary to do the whole range adaptor thing, so this can work on u16string_view, but it should be a range, so that the state will be in the iterator rather than the range, and attempting to mirror the Java polymorphism won’t go well.

In other words: Segments cannot both be reasonably idiomatic C++ and look like its Java counterpart, and without seeing what Segments looks like I don’t think I can say anything useful about this proposal.
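To illustrate the kind of range-like object meant here, below is a minimal sketch over `std::u16string_view` in which the range stores only the text and all iteration state lives in the iterator. The names `SegmentRange`, `nextBoundary`, and `collectSegments` are hypothetical, and the boundary rule (runs of spaces versus runs of non-spaces) is a toy assumption standing in for real UAX #29 segmentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Toy boundary rule (an assumption for illustration): a boundary falls wherever
// "is a space" changes. Real segmentation would apply UAX #29 rules instead.
inline std::size_t nextBoundary(std::u16string_view text, std::size_t from) {
    assert(from <= text.size());
    if (from >= text.size()) { return text.size(); }
    const bool space = text[from] == u' ';
    std::size_t i = from + 1;
    while (i < text.size() && (text[i] == u' ') == space) { ++i; }
    return i;
}

// Hypothetical range type: it only stores the view. All iteration state
// lives in the iterator, so one range can be traversed many times.
class SegmentRange {
public:
    class iterator {
    public:
        iterator(std::u16string_view text, std::size_t start)
            : text_(text), start_(start), end_(nextBoundary(text, start)) {}
        std::u16string_view operator*() const { return text_.substr(start_, end_ - start_); }
        iterator &operator++() {
            start_ = end_;
            end_ = nextBoundary(text_, start_);
            return *this;
        }
        bool operator!=(const iterator &other) const { return start_ != other.start_; }
    private:
        std::u16string_view text_;
        std::size_t start_;
        std::size_t end_;
    };

    explicit SegmentRange(std::u16string_view text) : text_(text) {}
    iterator begin() const { return iterator(text_, 0); }
    iterator end() const { return iterator(text_, text_.size()); }

private:
    std::u16string_view text_;
};

// Usage: a plain range-based for loop, with no virtual dispatch and no heap allocation for the range.
inline std::vector<std::u16string_view> collectSegments(std::u16string_view text) {
    std::vector<std::u16string_view> out;
    for (std::u16string_view segment : SegmentRange(text)) { out.push_back(segment); }
    return out;
}
```

Because the range is a small value type, it can be returned by value rather than through a pointer.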

Best regards,

Robin Leroy


Markus Scherer

11:36 AM
to Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
On Mon, May 18, 2026 at 8:29 AM Robin Leroy <eggr...@unicode.org> wrote:
> It is not clear to me that returning a pointer (even a unique_ptr), presumably to do user-visible polymorphism on Segments, is the modern C++ way to go here; this feels like a translation of the Java rather than a modern C++ design.
>
> I would expect some kind of range-like object, like we did for the UTF iterators.

Different level.

Class Segments logically represents the segmentation of a string, but it itself isn't Iterable. Instead, it has methods which return Streams or Iterables -- and would return "ranges" in C++.
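A minimal sketch of that layering follows: a `Segments` value that is not itself iterable but hands out ranges from its methods. The method names `ranges()` and `subSequences()` are assumptions chosen for illustration, not a confirmed C++ API. The sketch returns eagerly built `std::vector`s for brevity, where a real design would more likely return lazy ranges, and the space-based boundary rule is a toy stand-in for real segmentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical value type: logically "the segmentation of one string".
// It is not iterable itself; its methods hand out ranges.
class Segments {
public:
    explicit Segments(std::u16string_view text) : text_(text) {}

    // Segments as [start, limit) offsets in UTF-16 code units.
    std::vector<std::pair<std::size_t, std::size_t>> ranges() const {
        std::vector<std::pair<std::size_t, std::size_t>> out;
        std::size_t start = 0;
        while (start < text_.size()) {
            // Toy rule (assumption): a segment is a maximal run of spaces or of non-spaces.
            const bool space = text_[start] == u' ';
            std::size_t limit = start + 1;
            while (limit < text_.size() && (text_[limit] == u' ') == space) { ++limit; }
            assert(limit > start);
            out.emplace_back(start, limit);
            start = limit;
        }
        return out;
    }

    // The same segmentation exposed as substrings of the original text.
    std::vector<std::u16string_view> subSequences() const {
        std::vector<std::u16string_view> out;
        for (const auto &[start, limit] : ranges()) {
            out.push_back(text_.substr(start, limit - start));
        }
        return out;
    }

private:
    std::u16string_view text_;
};
```

Callers pick the view they need, offsets for editors or substrings for text processing, from the same underlying segmentation.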

markus

George Rhoten

1:10 PM
to Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
Personally I have a preference for the modern C++ approach too. It makes it easier to integrate with other C++ libraries and traditional C++ lint scanner tools.  Though I’m not adamant about this preference.

The rest of this email is about an alternate implementation that already tries to overcome the shortcomings of the current RBBI implementation.  It’s informational, and I’m not proposing changes to the design proposal.

I should mention that the Unicode Inflection project has its own wrapper around the ICU BreakIterator and redefines the meaning of a “word” to be usable in linguistic terms, rather than the double-click-selection notion of a word.  People involved with TTS systems will typically want to break a string into phonetic terms instead of linguistic terms. This implementation is needed for scenarios where you have something like “fishmarket” in a language that likes to compound words. You need to decompound it into “fish” and “market”, and detect the grammemes (grammatical category values) of each subword.  This is important for grammatical agreement, for example when you want to add a preposition or make the phrase definite. That way the lexical dictionary doesn’t need every possible compound word and can stay minimal.  Words that combine numbers with letters, like “9AM”, also get split apart.

That Unicode Inflection API has the following features for the tokenizer API.

  • The start and end index in UTF-16 code units for each token (segment) so that it’s easy to create a substring in C, C++, or Java.
  • A copy of the actual string. This isn’t memory efficient. I don’t recommend this choice for large strings, but it’s helpful if you ever need to chunk the segmentation without needing to keep the entire original string in memory.
  • The ability to reference a specific token from a TokenChain (Segments) by index.  This isn’t typical usage, though, and some programming languages make this indexing feature hard to work with because they disagree on what the index value means (code unit vs code point vs grapheme cluster).
  • Each token has a token type. The current ubrk_getRuleStatus/UWordBreak API isn’t easy to interpret or use. You can sort of derive this information from the break iterator now, but the Unicode Inflection implementation currently ignores this information and deduces it based on character properties directly. Such information is helpful to have when you’re looking for significant words to inflect. It also allows you to count the number of words in a string.
  • A lot of new users to this API tend to convert the significant word tokens to an array of the native string type.  Why? Because it’s easier to pass such information around in a larger framework ecosystem, and it provides an easier way to ignore the fluff involving whitespace, and punctuation.
  • More advanced users will iterate over the token chain. In an even more advanced usage (not included in Unicode Inflection), a contraction has to be expanded, and 2 adjacent tokens will have the same index range.  The value will be the original value, and the clean value will have the uncontracted words.  So you can get the original text but map it to other normalized values, which is another reason to have actual strings instead of aliased strings.
  • The clean value is the normalized value. Currently it’s just the language-specific lowercased form.
  • Each tokenizer is thread safe.  The ICU break iterator currently has to be kept in separate threads because the builder and iterator are in the same object.  This API caches the break iterator and it will clone a new one if it runs out of ICU break iterators.  Loading the break iterator is slow, especially when working on numerous small strings in multiple threads.
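The caching behavior described in the last bullet could be sketched roughly as follows. This is not the Unicode Inflection implementation. `FakeBreakIterator` and `IteratorPool` are hypothetical names, and the fake iterator merely stands in for an object that is slow to load, cheap to clone, and not safe to share between threads.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive-to-load, cheap-to-clone iterator.
struct FakeBreakIterator {
    int rulesId;
    std::unique_ptr<FakeBreakIterator> clone() const {
        return std::make_unique<FakeBreakIterator>(*this);
    }
};

class IteratorPool {
public:
    explicit IteratorPool(FakeBreakIterator prototype) : prototype_(prototype) {}

    // Hands out a cached iterator, or clones the prototype when the cache is empty.
    std::unique_ptr<FakeBreakIterator> acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!idle_.empty()) {
                auto it = std::move(idle_.back());
                idle_.pop_back();
                return it;
            }
        }
        return prototype_.clone();  // prototype_ is never mutated, so no lock is needed.
    }

    // Returns an iterator to the cache so that any thread can reuse it.
    void release(std::unique_ptr<FakeBreakIterator> it) {
        assert(it != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(it));
    }

    std::size_t idleCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    const FakeBreakIterator prototype_;
    std::mutex mutex_;
    std::vector<std::unique_ptr<FakeBreakIterator>> idle_;
};
```

Each thread holds its iterator exclusively between `acquire()` and `release()`, so the iterator itself needs no locking. Only the small cache is guarded by the mutex.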

If I were to create a minimal API, I’d provide the following:

  • Provide the start and end range so that I can use substr on the original string.  This would avoid the aliasing and memory management. Though providing some form of a string is typically what people want to use in NLP type systems.  The indexes would typically be wanted in editors.
  • Provide a clear type for the range of the string.  I may not care about all ranges, but I still need to reference the content between significant words on occasion.  For future extensibility, other annotations may be needed, like what you see in the Lucene analyzers.
  • Make builder and iterator separate, like the existing regex API. This is the immutability reason for the design proposal.
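As a sketch of the first two points of that minimal API, a token could carry just an index range and a type, with the text recovered from the original string on demand. The names `Token`, `TokenType`, `textOf`, and `significantWords`, as well as the three token categories, are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token categories; the actual set would come from the API design.
enum class TokenType { Word, Whitespace, Other };

struct Token {
    std::size_t start;  // inclusive, in UTF-16 code units
    std::size_t end;    // exclusive, in UTF-16 code units
    TokenType type;
};

// Recovers the text of a token from the original string; no copy is made.
inline std::u16string_view textOf(std::u16string_view original, const Token &token) {
    assert(token.start <= token.end && token.end <= original.size());
    return original.substr(token.start, token.end - token.start);
}

// Copies only the significant words into owned strings, skipping whitespace and other tokens.
inline std::vector<std::u16string> significantWords(std::u16string_view original,
                                                    const std::vector<Token> &tokens) {
    std::vector<std::u16string> words;
    for (const Token &token : tokens) {
        if (token.type == TokenType::Word) {
            words.emplace_back(textOf(original, token));
        }
    }
    return words;
}
```

An editor would typically use the index ranges directly, while an NLP pipeline would more likely call something like `significantWords` and pass the resulting strings along.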

George

Markus Scherer

1:28 PM
to George Rhoten, Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
Hi George,

On Thu, May 21, 2026 at 10:10 AM 'George Rhoten' via icu-design <icu-d...@unicode.org> wrote:
> Personally I have a preference for the modern C++ approach too. It makes it easier to integrate with other C++ libraries and traditional C++ lint scanner tools.  Though I’m not adamant about this preference.

With the small crowd in the ICU-TC meeting today, we did agree on trying to use modern C++ constructs, breaking with ICU tradition.

> The rest of this email is about an alternate implementation that already tries to overcome the shortcomings of the current RBBI implementation.  It’s informational, and I’m not proposing changes to the design proposal.

Good thoughts.
Remember that we are not starting totally from scratch. We are pretty happy with the ICU 78 Java Segmenter/Segments/... API, and I think it covers most if not all of the non-linguistic-specific things you are describing: Thread-safe, easy access to both segment boundaries and substrings, colloquial iteration, ...

Take a look:

tnx
markus