API proposal: C++ API for Segmenter


Elango Cheran

May 15, 2026, 8:17:28 PM
to icu-d...@unicode.org
Hi everyone,
As I begin looking at the C++ implementation of the Segmenter API, a few important high-level design decisions have already come up. I'd like to get your opinions and carry them into the discussion at our next TC meeting.

I've added the C++ specific API design and questions in the C++ Design Details tab of the design doc. Specifically, I've added the following high level questions:

  1. Input string type

    • UTF-16: std::u16string_view or UnicodeString ?

    • UTF-8: std::string_view or StringPiece ?

  2. Return type: Classic ICU or modern C++?

    • Classic ICU: return a pointer, and by convention the caller takes ownership.

    • Modern C++: Return a smart pointer -> clearly indicated ownership

  3. If modern C++ return type: which style?

    • ICU style: return LocalPointer<Segments>

    • Standard C++ style: std::unique_ptr<Segments>


Here is an example of what one set of answers to the above questions might look like in `segmenter.h`:

class U_COMMON_API_CLASS Segmenter : public UObject {
public:
   ~Segmenter() override;

   virtual std::unique_ptr<Segments> segment(std::u16string_view s, UErrorCode &errorCode);

   virtual std::unique_ptr<SegmentsUTF8> segment(StringPiece s, UErrorCode &errorCode);
};
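
To make question 2 concrete, here is a minimal sketch of how the two return styles differ at the call site. Everything in it is a hypothetical stand-in (`ErrorCode`, `Segments`, `segmentClassic`, `segmentModern`), not the ICU headers and not the proposed implementation; it only assumes the usual ICU convention that a function does nothing when the error code already signals failure.

```cpp
#include <memory>
#include <string_view>

// Hypothetical stand-ins; none of these are the real ICU types.
enum ErrorCode { kOk = 0, kIllegalArgument = 1 };

struct Segments {
    std::u16string_view text;  // aliases the caller's string, does not copy it
};

// "Classic ICU" style: the caller owns the result only by convention,
// so the caller must remember to delete it.
inline Segments* segmentClassic(std::u16string_view s, ErrorCode& errorCode) {
    if (errorCode != kOk) { return nullptr; }
    return new Segments{s};
}

// "Modern C++" style: the ownership transfer is part of the signature,
// and the result is freed automatically when it goes out of scope.
inline std::unique_ptr<Segments> segmentModern(std::u16string_view s,
                                               ErrorCode& errorCode) {
    if (errorCode != kOk) { return nullptr; }
    return std::make_unique<Segments>(Segments{s});
}
```

With the first form, a forgotten `delete` leaks silently; with the second, the compiler and standard lint tools can track ownership.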


Your answers to the questions will help inform the rest of the design & impl details of the API, and could be useful for how we approach future API design.

Thanks,
Elango

Fredrik Roubert

May 18, 2026, 9:49:23 AM
to Elango Cheran, icu-d...@unicode.org
On Sat, May 16, 2026 at 2:17 AM Elango Cheran <ela...@unicode.org> wrote:

> Classic ICU or modern C++?

I'd say that unless you have a very compelling reason for why doing
"Classic ICU" would make this easier to use, modern C++ should be the
default choice whenever possible.

--
Fredrik Roubert
rou...@google.com

Robin Leroy

May 18, 2026, 11:29:26 AM
to Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
I agree with Fredrik that modern C++ is where we should go (and where we have been going in recent ICU4C work).

It is not clear to me that returning a pointer (even a unique_ptr), presumably to do user-visible polymorphism on Segments, is the modern C++ way to go here; this feels like a translation of the Java rather than a modern C++ design.

I would expect some kind of range-like object, like we did for the UTF iterators. I don’t think it is useful or necessary to do the whole range adaptor thing, so this can work on u16string_view, but it should be a range, so that the state will be in the iterator rather than the range, and attempting to mirror the Java polymorphism won’t go well.

In other words: Segments cannot both be reasonably idiomatic C++ and look like its Java counterpart, and without seeing what Segments looks like I don’t think I can say anything useful about this proposal.
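
As a rough illustration of the "state in the iterator rather than the range" idea, the following sketch works directly on a `std::u16string_view`. The class name `ToySegmentRange` is hypothetical, and the segmentation rule is a deliberately trivial assumption (maximal runs of spaces or of non-spaces) standing in for real break rules.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical sketch: a range whose iterators carry all iteration state.
// Toy rule: a segment is a maximal run of spaces or of non-spaces.
class ToySegmentRange {
public:
    explicit ToySegmentRange(std::u16string_view text) : text_(text) {}

    class iterator {
    public:
        iterator(std::u16string_view text, std::size_t start)
            : text_(text), start_(start), limit_(findLimit(text, start)) {}

        // Dereferencing yields a view of the current segment; nothing is copied.
        std::u16string_view operator*() const {
            return text_.substr(start_, limit_ - start_);
        }
        iterator& operator++() {
            start_ = limit_;
            limit_ = findLimit(text_, start_);
            return *this;
        }
        bool operator==(const iterator& other) const { return start_ == other.start_; }
        bool operator!=(const iterator& other) const { return !(*this == other); }

    private:
        // Returns the end of the segment that begins at start.
        static std::size_t findLimit(std::u16string_view text, std::size_t start) {
            if (start >= text.size()) { return text.size(); }
            const bool space = text[start] == u' ';
            std::size_t i = start + 1;
            while (i < text.size() && (text[i] == u' ') == space) { ++i; }
            return i;
        }
        std::u16string_view text_;
        std::size_t start_;
        std::size_t limit_;
    };

    // The range itself is immutable, so several loops can traverse it independently.
    iterator begin() const { return iterator(text_, 0); }
    iterator end() const { return iterator(text_, text_.size()); }

private:
    std::u16string_view text_;
};
```

A caller would write `for (std::u16string_view seg : ToySegmentRange(text)) { ... }` with no pointer or polymorphism in sight.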

Best regards,

Robin Leroy

--
You received this message because you are subscribed to the Google Groups "icu-design" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-design+...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-design/CAPLBv_Po-bn1z5woAvwiSpUS7%2BPtbRQJ8rd7St_2UH59Voo4Xw%40mail.gmail.com.
For more options, visit https://groups.google.com/a/unicode.org/d/optout.

Markus Scherer

May 21, 2026, 11:36:43 AM
to Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
On Mon, May 18, 2026 at 8:29 AM Robin Leroy <eggr...@unicode.org> wrote:
> It is not clear to me that returning a pointer (even a unique_ptr), presumably to do user-visible polymorphism on Segments, is the modern C++ way to go here; this feels like a translation of the Java rather than a modern C++ design.
>
> I would expect some kind of range-like object, like we did for the UTF iterators.

Different level.

Class Segments logically represents the segmentation of a string, but it itself isn't Iterable. Instead, it has methods which return Streams or Iterables -- and would return "ranges" in C++.
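
A sketch of that layering, under stated assumptions: `ToySegments` and its method names are hypothetical and chosen only for illustration, the break rule is a toy (a boundary at every switch between space and non-space), and `std::vector` is returned for brevity where a real design would more likely return lazy ranges.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch; names are illustrative, not the ICU API.
class ToySegments {
public:
    // Toy rule: a boundary falls at every switch between space and non-space.
    explicit ToySegments(std::u16string_view text) : text_(text) {
        boundaries_.push_back(0);
        for (std::size_t i = 1; i < text.size(); ++i) {
            if ((text[i] == u' ') != (text[i - 1] == u' ')) { boundaries_.push_back(i); }
        }
        if (!text.empty()) { boundaries_.push_back(text.size()); }
    }

    // The object is not iterable itself; each accessor returns something that is.
    const std::vector<std::size_t>& boundaries() const { return boundaries_; }

    // Pairs of (start, limit) code unit indices, one per segment.
    std::vector<std::pair<std::size_t, std::size_t>> ranges() const {
        std::vector<std::pair<std::size_t, std::size_t>> result;
        for (std::size_t i = 0; i + 1 < boundaries_.size(); ++i) {
            result.emplace_back(boundaries_[i], boundaries_[i + 1]);
        }
        return result;
    }

    // Views into the original text, one per segment; no characters are copied.
    std::vector<std::u16string_view> subSequences() const {
        std::vector<std::u16string_view> result;
        for (const auto& [start, limit] : ranges()) {
            result.push_back(text_.substr(start, limit - start));
        }
        return result;
    }

private:
    std::u16string_view text_;
    std::vector<std::size_t> boundaries_;
};
```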

markus

George Rhoten

May 21, 2026, 1:10:42 PM
to Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
Personally I have a preference for the modern C++ approach too. It makes it easier to integrate with other C++ libraries and traditional C++ lint scanner tools.  Though I’m not adamant about this preference.

The rest of this email is about an alternate implementation that already tries to overcome the shortcomings of the current RBBI implementation.  It’s informational, and I’m not proposing changes to the design proposal.

I should mention that the Unicode Inflection project has its own wrapper around the ICU BreakIterator and redefines the meaning of a “word” to be usable in linguistic terms, rather than the double-click selection notion of a word.  People involved with TTS systems will typically want to break a string into phonetic terms instead of linguistic terms. This implementation is needed for scenarios where you have something like “fishmarket” in a language that likes to compound words. You need to decompound it into “fish” and “market”, and detect the grammemes (grammatical category values) of each subword.  This is important for grammatical agreement when you want to add a preposition or make the word definite. That way you don’t need all possible combinations of compound words, which keeps the lexical dictionary minimal.  Words that combine numbers with letters, like “9AM”, also get separated.

That Unicode Inflection API has the following features for the tokenizer API.

  • The start and end index in UTF-16 code units for each token (segment) so that it’s easy to create a substring in C, C++, or Java.
  • A copy of the actual string. This isn’t memory efficient. I don’t recommend this choice for large strings, but it’s helpful if you ever need to chunk the segmentation without needing to keep the entire original string in memory.
  • The ability to reference a specific token from a TokenChain (Segments) by index.  Though this isn’t typical usage, and some programming languages make this indexing feature hard to work with because of the misalignment of the meaning of the index value (code unit vs code point vs grapheme cluster).
  • Each token has a token type. The current ubrk_getRuleStatus/UWordBreak usage isn’t an easy API to interpret or use. You can sort of derive this information from the break iterator now, but the Unicode Inflection implementation currently ignores this information and deduces it based on character properties directly. Such information is helpful to have when you’re looking for significant words to inflect. It also allows you to count the number of words in a string.
  • A lot of new users to this API tend to convert the significant word tokens to an array of the native string type.  Why? Because it’s easier to pass such information around in a larger framework ecosystem, and it provides an easier way to ignore the fluff involving whitespace, and punctuation.
  • More advanced users will iterate over the token chain. In an even more advanced usage (not included in Unicode Inflection) a contraction has to be expanded, and 2 adjacent tokens will have the same index range.  The value will be the original value, and the clean value will have the uncontracted words.  So you can get the original text, but map it to other normalized values, which is another reason to have actual strings instead of aliased strings.
  • The clean value is the normalized value. Currently it’s just language specific lowercased.
  • Each tokenizer is thread safe.  The ICU break iterator currently has to be kept in separate threads because the builder and iterator are in the same object.  This API caches the break iterator and it will clone a new one if it runs out of ICU break iterators.  Loading the break iterator is slow, especially when working on numerous small strings in multiple threads.
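
The caching behavior in the last bullet could look roughly like the sketch below. It is not the Unicode Inflection code: `ToyBreakIterator` and `IteratorPool` are hypothetical names, and the stand-in iterator only models something that is slow to load but cheap to clone.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive-to-load, cheap-to-clone iterator.
struct ToyBreakIterator {
    int rulesId = 0;
    std::unique_ptr<ToyBreakIterator> clone() const {
        return std::make_unique<ToyBreakIterator>(*this);
    }
};

class IteratorPool {
public:
    explicit IteratorPool(std::unique_ptr<ToyBreakIterator> prototype)
        : prototype_(std::move(prototype)) {}

    // Hands out a cached iterator, or clones the prototype if the cache is empty.
    std::unique_ptr<ToyBreakIterator> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty()) { return prototype_->clone(); }
        std::unique_ptr<ToyBreakIterator> it = std::move(idle_.back());
        idle_.pop_back();
        return it;
    }

    // Returns an iterator to the cache so that a later caller can reuse it.
    void release(std::unique_ptr<ToyBreakIterator> it) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(it));
    }

    // Number of iterators currently waiting in the cache.
    std::size_t idleCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    std::mutex mutex_;
    std::unique_ptr<ToyBreakIterator> prototype_;
    std::vector<std::unique_ptr<ToyBreakIterator>> idle_;
};
```

Each thread works on its own iterator between `acquire()` and `release()`, so only the pool bookkeeping needs the lock.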

If I were to create a minimal API, I’d provide the following:

  • Provide the start and end range so that I can use substr on the original string.  This would avoid the aliasing and memory management. Though providing some form of a string is typically what people want to use in NLP type systems.  The indexes would typically be wanted in editors.
  • Provide a clear type for the range of the string.  I may not care about all ranges, but I still need to reference the content between significant words on occasion.  For future extensibility, other annotations may be needed, like what you see in the Lucene analyzers.
  • Make builder and iterator separate, like the existing regex API. This is the immutability reason for the design proposal.
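
A minimal sketch of the first two bullets, assuming a toy segmentation rule (maximal runs of spaces or non-spaces) and hypothetical names (`SegmentType`, `Segment`, `toySegment`, `words`): each segment carries its index range and a type, and the caller uses `substr` on the original string only for the segments it cares about.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical types for illustration only.
enum class SegmentType { Word, Whitespace };

struct Segment {
    std::size_t start;  // index of the first UTF-16 code unit
    std::size_t limit;  // index one past the last code unit
    SegmentType type;
};

// Toy segmentation: maximal runs of spaces or of non-spaces.
inline std::vector<Segment> toySegment(std::u16string_view text) {
    std::vector<Segment> out;
    std::size_t start = 0;
    while (start < text.size()) {
        const bool space = text[start] == u' ';
        std::size_t limit = start + 1;
        while (limit < text.size() && (text[limit] == u' ') == space) { ++limit; }
        out.push_back({start, limit,
                       space ? SegmentType::Whitespace : SegmentType::Word});
        start = limit;
    }
    return out;
}

// Collects only the significant segments as owned strings, skipping whitespace.
inline std::vector<std::u16string> words(std::u16string_view text) {
    std::vector<std::u16string> out;
    for (const Segment& seg : toySegment(text)) {
        if (seg.type == SegmentType::Word) {
            out.emplace_back(text.substr(seg.start, seg.limit - seg.start));
        }
    }
    return out;
}
```

An editor would consume the index ranges directly, while an NLP pipeline would more likely call something like `words()` and pass the strings along.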

George

Markus Scherer

May 21, 2026, 1:28:27 PM
to George Rhoten, Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
Hi George,

On Thu, May 21, 2026 at 10:10 AM 'George Rhoten' via icu-design <icu-d...@unicode.org> wrote:
> Personally I have a preference for the modern C++ approach too. It makes it easier to integrate with other C++ libraries and traditional C++ lint scanner tools.  Though I’m not adamant about this preference.

With the small crowd in the ICU-TC meeting today, we did agree on trying to use modern C++ constructs, breaking with ICU tradition.

> The rest of this email is about an alternate implementation that already tries to overcome the shortcomings of the current RBBI implementation.  It’s informational, and I’m not proposing changes to the design proposal.

Good thoughts.
Remember that we are not starting totally from scratch. We are pretty happy with the ICU 78 Java Segmenter/Segments/... API, and I think it covers most if not all of the non-linguistic-specific things you are describing: Thread-safe, easy access to both segment boundaries and substrings, colloquial iteration, ...

Take a look:

tnx
markus

Shane Carr

May 27, 2026, 9:18:35 AM
to Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, Elango Cheran, icu-d...@unicode.org
I feel like ICU4C should continue to use its existing safe abstractions, like LocalPointer and UnicodeString, instead of the C++ equivalents, but if the correct abstraction doesn't already exist in ICU4C, then use the modern C++ version instead of creating a new one.

--
You received this message because you are subscribed to the Google Groups "ICU - Team" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-team+u...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-team/CAN49p6qesgHBngu4HmHKCns%2Be5HjkFSOBcw0t9A9qS%3D_Bg91ww%40mail.gmail.com.

Elango Cheran

May 27, 2026, 10:52:46 AM
to Shane Carr, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-d...@unicode.org
Shane -- are there specific reasons why you prefer ICU constructs over modern C++ constructs? The obvious argument for using modern C++ constructs is that it follows the same spirit and purpose as the API modernization work, which is to make something that is easier to use effectively. (Why only follow that partially?)

Shane Carr

May 27, 2026, 11:04:41 AM
to Elango Cheran, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-d...@unicode.org
(1) I see the goal of this work as creating safer APIs that are easier to use correctly and harder to use incorrectly; I don't see C++ safe abstractions as more aligned with the spirit and purpose of the project. Both ICU4C safe abstractions (LocalPointer and UnicodeString) and C++ safe abstractions (std::unique_ptr and std::u16string) accomplish that goal.

(2) The ICU4C abstractions contain functionality that ties in better with ICU4C operations, like LocalPointer's handling of out-of-memory errors, and UnicodeString's handling of stack strings and UTF-8 conversions.

(3) For users of ICU4C, we already have landed a great deal of APIs using UnicodeString and LocalPointer. It seems disruptive to start using C++ abstractions only now. It seems like a worse outcome for half of our "modern" APIs like NumberFormatter and MeasureFormat to use one style and the other half of our "modern" APIs to use the other style.

(4) We've already done a great deal of work to make the ICU4C abstractions and C++ abstractions interoperate with each other, with custom constructors, etc. So I don't see "interoperates with other C++ code better" as being an argument, either.
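
To illustrate points (2) and (4), here is a stand-alone imitation of the pattern being described: a smart pointer whose constructor records an allocation failure in an error code instead of throwing, and which can hand its object over to a `std::unique_ptr`. `CheckedPtr`, `ErrorCode`, and `toUnique()` are hypothetical names for this sketch; it is not the LocalPointer source and makes no claim about its exact interface.

```cpp
#include <memory>
#include <new>
#include <utility>

// Hypothetical stand-ins that imitate ICU conventions; not the ICU headers.
enum ErrorCode { kZeroError = 0, kMemoryAllocationError = 7 };

template <typename T>
class CheckedPtr {
public:
    // Takes ownership; flags an allocation error if p is null and no earlier
    // error has been recorded.
    CheckedPtr(T* p, ErrorCode& errorCode) : ptr_(p) {
        if (p == nullptr && errorCode == kZeroError) {
            errorCode = kMemoryAllocationError;
        }
    }

    T* get() const { return ptr_.get(); }
    bool isNull() const { return ptr_ == nullptr; }

    // Hands ownership over to standard C++ code; only callable on an rvalue.
    std::unique_ptr<T> toUnique() && { return std::move(ptr_); }

private:
    std::unique_ptr<T> ptr_;
};
```

Paired with `new (std::nothrow)`, this lets a library that does not use exceptions report out-of-memory through the same error channel as every other failure.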

Alan Liu

May 27, 2026, 11:24:00 AM
to Shane Carr, Elango Cheran, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-d...@unicode.org
One argument for using C++ abstractions is that developers already know idiomatic C++ constructs. (Coding agent proficiency can also be expected to be high.) Existing C++ mechanisms like std::unique_ptr have idiomatic usage patterns that are well-known. A library-specific mechanism increases the developer's cognitive load, asking them to learn a new smart pointer API, for example.


Shane Carr

May 27, 2026, 12:44:56 PM
to Alan Liu, Elango Cheran, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-design
It's a valid question to ask about who our audience is. 

I'll point out that we had C++11 when I did NumberFormatter, and we decided then to keep using the ICU4C types.

Elango Cheran

May 27, 2026, 1:48:19 PM
to Shane Carr, Alan Liu, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-design
The points in the recent comments are good, although they do echo discussion from Thursday's meeting, so I want to refrain from repeating or relitigating too much. I took notes in a tab of the design doc, but even then, I feel I only captured half of the discussion. We did spend time deliberating in order to pay attention to all of these considerations.

Alan brings up a couple of interesting points. The latter point about cognitive load came up in the meeting, and it gives a different take on Shane's point #1. For Shane's points #3 & 4, the BreakIterator API already exists, and the Segmenter API can be seen as a layer on top. As for the changes to API conventions, the TC discussed them and came to a different conclusion. In terms of audience, we should also think about where a particular decision or concern falls on the tradeoff between a benefit to the user and an implementation complexity cost borne by ICU developers.

Shane Carr

May 27, 2026, 2:06:32 PM
to Elango Cheran, Alan Liu, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-design
It sounds like the points are mostly the same as in 2017. Do you have an idea of why the TC of 2026 would reach a different conclusion?

I am just catching up on my email backlog and assumed the TC was still open to feedback.

Shane Carr

May 27, 2026, 2:20:11 PM
to Elango Cheran, Alan Liu, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-design
One point I remember discussing in 2017 was that most C++ developers are accustomed to using a bunch of weird abstractions, whether it's boost, absl, or something specific to your company (e.g. to prevent panics/exceptions). If we feel or have data that a critical mass of today's C++ developers are more familiar with std abstractions than they were at that time, then that would be a reason to reach a different conclusion.

Elango Cheran

May 28, 2026, 12:54:18 PM
to Shane Carr, Alan Liu, Markus Scherer, George Rhoten, Robin Leroy, Fredrik Roubert, icu-design
Yep, that's the answer to your question -- more people are now familiar with unique_ptr, which came out in C++11, both overall and among ICU developers. u16string_view only came out in C++17.

Note: NumberFormatter was developed in 2017. It was made stable in ICU 60, and ICU only adopted C++11 in ICU 59, so presumably NumberFormatter didn't have C++11 during formative development. It definitely didn't have C++17 available, especially considering ICU only adopted C++17 a couple of years ago.