ICU API proposal: Unicode helper APIs

Markus Scherer

Jul 9, 2025, 1:54:41 PM

to icu-design, Robin Leroy, Richard T. Gillam, Mihai ⦅U⦆ Niță

Dear ICU team & users,

I would like to propose the following API for: ICU 78

Please provide feedback by: next Wednesday, 2025-07-16

Designated API reviewers: Robin & Rich & Mihai

Ticket: ICU-23152

WIP changes: https://github.com/unicode-org/icu/pull/3539

I would like to plug some long-standing gaps where I have seen code with magic numbers, multi-line constructs, and incomplete solutions. Most of the additions are for C++, working with code points and writing them to UTF-8/16/32 destinations. A few are for Java.

Among these are four C++ functions for writing a code point to a C++ standard string. I know that this is valuable functionality, but I have notes/questions about the exact API shape for them; see below.

unicode/utf.h

/**

* Is c a Unicode code point U+0000..U+10FFFF?

* https://www.unicode.org/glossary/#code_point

* @param c 32-bit code point

* @return true or false

* @draft ICU 78

*/

#define U_IS_CODE_POINT(c)

/**

* Is c a Unicode scalar value, that is, a non-surrogate code point?

* Only scalar values can be represented in well-formed UTF-8/16/32.

* https://www.unicode.org/glossary/#unicode_scalar_value

* @param c 32-bit code point

* @return true or false

* @draft ICU 78

*/

#define U_IS_SCALAR_VALUE(c)
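To make the intended semantics of the two predicates concrete, here is a minimal standalone sketch. The names `isCodePointSketch` and `isScalarValueSketch` are hypothetical stand-ins, not the proposed macros, and the actual macro bodies in the PR may well differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for U_IS_CODE_POINT / U_IS_SCALAR_VALUE; not ICU code.
// Casting to uint32_t folds the "negative value" check into one comparison.
constexpr bool isCodePointSketch(int32_t c) {
    return static_cast<uint32_t>(c) <= 0x10FFFF;
}

// A scalar value is a code point outside the surrogate block U+D800..U+DFFF.
constexpr bool isScalarValueSketch(int32_t c) {
    return static_cast<uint32_t>(c) < 0xD800 ||
           (0xE000 <= c && c <= 0x10FFFF);
}

// Compile-time spot checks of the ranges named in the doc comments.
static_assert(isCodePointSketch(0xD800), "surrogates are code points");
static_assert(!isScalarValueSketch(0xD800), "surrogates are not scalar values");

// Runtime spot checks, for callers that prefer assert() over static_assert.
inline void checkPredicatesSketch() {
    assert(isCodePointSketch(0x10FFFF));
    assert(!isCodePointSketch(0x110000));
    assert(isScalarValueSketch(0xE000));
}
```

The only difference between the two checks is the surrogate block, which is why surrogates pass the first and fail the second.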

unicode/utf8.h

/**

* Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.

* Returns 1 for 0..0xc1 as well as for 0xf5..0xff.

* leadByte might be evaluated multiple times.

* @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.

* @return 1..4

* @draft ICU 78

*/

#define U8_LENGTH_FROM_LEAD_BYTE(leadByte)

/**

* Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.

* Returns 1 for 0..0xc1. Undefined for 0xf5..0xff.

* leadByte might be evaluated multiple times.

* @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.

* @return 1..4

* @draft ICU 78

*/

#define U8_LENGTH_FROM_LEAD_BYTE_UNSAFE(leadByte)
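The lead-byte mapping described for the safe macro can be sketched as a plain function. This is only an illustration under the ranges stated in the doc comment above; `lengthFromLeadByteSketch` and `countLeadByteStepsSketch` are hypothetical names, and the real macro may be implemented quite differently (for example with a lookup table or bit tricks).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for U8_LENGTH_FROM_LEAD_BYTE; not ICU code.
// 0x00..0xC1 and 0xF5..0xFF cannot start a well-formed multi-byte sequence,
// so they map to length 1.
constexpr int lengthFromLeadByteSketch(uint8_t lead) {
    return lead < 0xC2 ? 1 :
           lead < 0xE0 ? 2 :
           lead < 0xF0 ? 3 :
           lead < 0xF5 ? 4 : 1;
}

// Usage example: step through a string one lead byte at a time.
// Trail bytes are not validated here; a truncated sequence simply ends the loop.
inline std::size_t countLeadByteStepsSketch(const std::string &s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        i += static_cast<std::size_t>(
            lengthFromLeadByteSketch(static_cast<uint8_t>(s[i])));
    }
    assert(count <= s.size());  // every step consumes at least one byte
    return count;
}
```

This is the kind of call site where code today tends to carry magic numbers for the lead-byte ranges.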

unicode/utfiterator.h

/**

* A C++ "range" over all Unicode code points U+0000..U+10FFFF.

* https://www.unicode.org/glossary/#code_point

* Intended for test and builder code.

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @draft ICU 78

* @see U_IS_CODE_POINT

*/

template<typename CP32>

class AllCodePoints {

public:

/** Constructor. @draft ICU 78 */

AllCodePoints() {}

/**

* @return an iterator over all Unicode code points.

* The iterator returns CP32 integers.

* @draft ICU 78

*/

auto begin() const

/**

* @return an exclusive-end iterator over all Unicode code points.

* @draft ICU 78

*/

auto end() const

};

/**

* A C++ "range" over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

* That is, all code points except surrogates.

* Only scalar values can be represented in well-formed UTF-8/16/32.

* https://www.unicode.org/glossary/#unicode_scalar_value

* Intended for test and builder code.

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @draft ICU 78

* @see U_IS_SCALAR_VALUE

*/

template<typename CP32>

class AllScalarValues {

public:

/** Constructor. @draft ICU 78 */

AllScalarValues() {}

/**

* @return an iterator over all Unicode scalar values.

* The iterator returns CP32 integers.

* @draft ICU 78

*/

auto begin() const

/**

* @return an exclusive-end iterator over all Unicode scalar values.

* @draft ICU 78

*/

auto end() const

};
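For illustration, a range like this can be built from a small iterator that jumps over the surrogate block. The sketch below is an assumption about the general shape, not the class from the PR: `AllScalarValuesSketch` and `countScalarValuesSketch` are hypothetical names, the code point type is fixed to char32_t rather than a CP32 template parameter, and only the operations needed by a range-based for loop are provided.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of an "all scalar values" range; not the ICU class.
class AllScalarValuesSketch {
public:
    class Iterator {
    public:
        explicit Iterator(char32_t c) : c_(c) {
            assert(c <= 0x110000);  // 0x110000 is the exclusive end
        }
        char32_t operator*() const { return c_; }
        Iterator &operator++() {
            // Jump over the surrogate block U+D800..U+DFFF.
            c_ = (c_ == 0xD7FF) ? 0xE000 : c_ + 1;
            return *this;
        }
        bool operator!=(const Iterator &other) const { return c_ != other.c_; }
    private:
        char32_t c_;
    };

    Iterator begin() const { return Iterator(0); }
    Iterator end() const { return Iterator(0x110000); }  // exclusive end
};

// Usage example: count how many values the range produces.
inline int32_t countScalarValuesSketch() {
    int32_t n = 0;
    for (char32_t c : AllScalarValuesSketch()) {
        (void)c;  // a test or builder would do real work with c here
        ++n;
    }
    return n;
}
```

A range over all code points would be the same loop without the jump from U+D7FF to U+E000.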

Note:

I propose the following code point to string functions for utfiterator.h. They are not related to iterators but would have similar code if & when we add input or output iterators that encode code points to strings.
I can’t think of another existing header file where these would fit.
We could create a whole new header file just for these four functions.
WDYT?

Note:

The append... functions have a parameter order of (string, code point). This seems intuitive to me.
However, for C++ functions we usually prefer inputs before outputs.
WDYT?

Note:

These names are kind of long, especially the stringFrom... functions which require the StringClass template parameter.

... stringFromCodePointOrFFFD<std::u8string>(U'カ') ...

Ideas for shorter names that don’t become confusing/ambiguous?
Also, rather than raw functions in the icu::header namespace, these could live on a class or two like UTFString or maybe UnsafeUTFString.

... UTFString<std::u8string>::fromCodePointOrFFFD(U'カ') ...
... UTFString<std::u8string>::fromCodePointUnsafe(U'カ') ...
... UnsafeUTFString<std::u8string>::fromCodePoint(U'カ') ...

Calling code could create mini inline functions with short names like “strFromCP()” for the versions that it uses.

Note:

For these functions, I am proposing to just take UChar32 input values, as usual for ICU, rather than a tparam typename CP32. That is, these rely on compilers not complaining about implicit conversions between UChar32=int32_t and uint32_t/char32_t. (Tests pass on CI including treat-warnings-as-errors configs.)

/**

* Appends the code point to the string.

* Appends the U+FFFD replacement character instead if c is not a scalar value.

* See https://www.unicode.org/glossary/#unicode_scalar_value

* @tparam StringClass A version of std::basic_string (or a compatible type)

* @param s The string to append to

* @param c The code point to append

* @return s

* @draft ICU 78

* @see U_IS_SCALAR_VALUE

*/

template<typename StringClass>

U_FORCE_INLINE StringClass &appendCodePointOrFFFD(StringClass &s, UChar32 c)

/**

* Appends the code point to the string.

* The code point must be a scalar value; otherwise the behavior is undefined.

* See https://www.unicode.org/glossary/#unicode_scalar_value

* @tparam StringClass A version of std::basic_string (or a compatible type)

* @param s The string to append to

* @param c The code point to append (must be a scalar value)

* @return s

* @draft ICU 78

* @see U_IS_SCALAR_VALUE

*/

template<typename StringClass>

U_FORCE_INLINE StringClass &appendCodePointUnsafe(StringClass &s, UChar32 c)

/**

* Returns the code point as a string of code units.

* Returns the U+FFFD replacement character instead if c is not a scalar value.

* See https://www.unicode.org/glossary/#unicode_scalar_value

* @tparam StringClass A version of std::basic_string (or a compatible type)

* @param c The code point

* @return the string of c's code units

* @draft ICU 78

* @see U_IS_SCALAR_VALUE

*/

template<typename StringClass>

U_FORCE_INLINE StringClass stringFromCodePointOrFFFD(UChar32 c)

/**

* Returns the code point as a string of code units.

* The code point must be a scalar value; otherwise the behavior is undefined.

* See https://www.unicode.org/glossary/#unicode_scalar_value

* @tparam StringClass A version of std::basic_string (or a compatible type)

* @param c The code point

* @return the string of c's code units

* @draft ICU 78

* @see U_IS_SCALAR_VALUE

*/

template<typename StringClass>

U_FORCE_INLINE StringClass stringFromCodePointUnsafe(UChar32 c)
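As a rough idea of what the "OrFFFD" variants have to do, here is a standalone sketch that picks the encoding form from the size of the string's code unit type. It is an illustration only: `appendOrFFFDSketch` and `encodeOrFFFDSketch` are hypothetical names, the sketch takes int32_t instead of UChar32 so that it compiles without ICU headers, and the PR's implementation is presumably built on ICU's existing UTF macros rather than hand-written shifts.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch; not the proposed ICU implementation.
template<typename StringClass>
StringClass &appendOrFFFDSketch(StringClass &s, int32_t c) {
    using Unit = typename StringClass::value_type;
    uint32_t u = static_cast<uint32_t>(c);
    if (u > 0x10FFFF || (0xD800 <= u && u <= 0xDFFF)) {
        u = 0xFFFD;  // not a scalar value: substitute the replacement character
    }
    if constexpr (sizeof(Unit) == 1) {  // UTF-8
        if (u <= 0x7F) {
            s.push_back(static_cast<Unit>(u));
        } else if (u <= 0x7FF) {
            s.push_back(static_cast<Unit>(0xC0 | (u >> 6)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else if (u <= 0xFFFF) {
            s.push_back(static_cast<Unit>(0xE0 | (u >> 12)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else {
            s.push_back(static_cast<Unit>(0xF0 | (u >> 18)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 12) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        }
    } else if constexpr (sizeof(Unit) == 2) {  // UTF-16
        if (u <= 0xFFFF) {
            s.push_back(static_cast<Unit>(u));
        } else {
            s.push_back(static_cast<Unit>(0xD7C0 + (u >> 10)));    // lead surrogate
            s.push_back(static_cast<Unit>(0xDC00 | (u & 0x3FF)));  // trail surrogate
        }
    } else {  // UTF-32
        s.push_back(static_cast<Unit>(u));
    }
    return s;
}

// The "string from code point" shape is a thin wrapper around append.
template<typename StringClass>
StringClass encodeOrFFFDSketch(int32_t c) {
    StringClass s;
    appendOrFFFDSketch(s, c);
    assert(!s.empty());  // every input yields at least one code unit
    return s;
}
```

An "Unsafe" variant would be the same code without the initial scalar-value check, which is exactly why its behavior is undefined for surrogates and out-of-range values.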

UnicodeString

Add a version of toUTF8String() without a result in/out parameter.

/**

* Convert the UnicodeString to a UTF-8 string.

* Unpaired surrogates are replaced with U+FFFD.

* Calls toUTF8().

* @tparam StringClass A std::string or a std::u8string (or a compatible type)

* @return A std::string or a std::u8string (or a compatible object)

* with the UTF-8 version of the string.

* @draft ICU 78

* @see toUTF8

*/

template<typename StringClass>

StringClass toUTF8String() const
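To show what "return by value, replace unpaired surrogates with U+FFFD" means in practice, here is a standalone sketch over a std::u16string_view. It is not ICU's implementation (the proposed member function calls toUTF8()); `toUTF8StringSketch` is a hypothetical free function that takes the UTF-16 text as a parameter so that it compiles without ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical sketch; not ICU's UnicodeString::toUTF8String().
template<typename StringClass>
StringClass toUTF8StringSketch(std::u16string_view src) {
    using Unit = typename StringClass::value_type;
    StringClass out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        uint32_t c = src[i];
        if (0xD800 <= c && c <= 0xDBFF && i + 1 < src.size() &&
                0xDC00 <= src[i + 1] && src[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into a supplementary code point.
            c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        } else if (0xD800 <= c && c <= 0xDFFF) {
            c = 0xFFFD;  // unpaired surrogate
        }
        if (c <= 0x7F) {
            out.push_back(static_cast<Unit>(c));
        } else if (c <= 0x7FF) {
            out.push_back(static_cast<Unit>(0xC0 | (c >> 6)));
            out.push_back(static_cast<Unit>(0x80 | (c & 0x3F)));
        } else if (c <= 0xFFFF) {
            out.push_back(static_cast<Unit>(0xE0 | (c >> 12)));
            out.push_back(static_cast<Unit>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<Unit>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<Unit>(0xF0 | (c >> 18)));
            out.push_back(static_cast<Unit>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<Unit>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<Unit>(0x80 | (c & 0x3F)));
        }
    }
    assert(out.size() >= src.size());  // UTF-8 never uses fewer units than UTF-16
    return out;
}
```

With the return-by-value shape, a call site can initialize a string in one expression instead of declaring an empty result first and passing it in.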

Make UnicodeString a “range” of UTF-16 code units (char16_t).

/**

* @return an iterator to the first code unit in this string.

* The iterator may be a pointer or a contiguous-iterator object.

* @draft ICU 78

*/

auto begin() const { return std::u16string_view(*this).begin(); }

/**

* @return an iterator to just past the last code unit in this string.

* The iterator may be a pointer or a contiguous-iterator object.

* @draft ICU 78

*/

auto end() const { return std::u16string_view(*this).end(); }

/**

* @return a reverse iterator to the last code unit in this string.

* The iterator may be a pointer or a contiguous-iterator object.

* @draft ICU 78

*/

auto rbegin() const { return std::u16string_view(*this).rbegin(); }

/**

* @return a reverse iterator to just before the first code unit in this string.

* The iterator may be a pointer or a contiguous-iterator object.

* @draft ICU 78

*/

auto rend() const { return std::u16string_view(*this).rend(); }

Make std::back_inserter(UnicodeString) work:

/**

* Appends the code unit `c` to the UnicodeString object.

* @param c the code unit to append

* @draft ICU 78

*/

inline void push_back(char16_t c) { append(c); }
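To illustrate why begin()/end() and push_back() matter for generic code, here is a small sketch using a hypothetical stand-in class. `Utf16BufferSketch` is not UnicodeString; it only mimics the members proposed above, plus a `value_type` typedef, which the sketch needs because the assignment operator of std::back_insert_iterator is declared in terms of `Container::value_type`.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>

// Hypothetical minimal UTF-16 container; stands in for UnicodeString here.
class Utf16BufferSketch {
public:
    using value_type = char16_t;  // needed by std::back_insert_iterator
    void push_back(char16_t c) { units_.push_back(c); }
    // Same shape as the proposed accessors: iterate via a u16string_view.
    auto begin() const { return std::u16string_view(units_).begin(); }
    auto end() const { return std::u16string_view(units_).end(); }
private:
    std::u16string units_;
};

// Usage example: a standard algorithm writes into the container through
// std::back_inserter, which calls push_back() for each code unit.
inline Utf16BufferSketch upperAsciiSketch(std::u16string_view src) {
    Utf16BufferSketch dest;
    std::transform(src.begin(), src.end(), std::back_inserter(dest),
                   [](char16_t c) {
                       return (u'a' <= c && c <= u'z')
                                  ? static_cast<char16_t>(c - 0x20)
                                  : c;
                   });
    assert(std::distance(dest.begin(), dest.end()) ==
           static_cast<std::ptrdiff_t>(src.size()));
    return dest;
}
```

The result can then be read back with a range-based for loop or passed to any algorithm that takes an iterator pair.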

IterableOfInt.java – new file

package com.ibm.icu.lang;

import java.util.PrimitiveIterator;

/**

* Subinterface of Iterable whose iterator() returns a {@link PrimitiveIterator.OfInt}.

* Allows direct use of the primitive iterator without downcasting.

* @draft ICU 78

*/

public interface IterableOfInt extends Iterable<Integer> {

/**

* @return a {@link PrimitiveIterator.OfInt}

* @draft ICU 78

*/

@Override

public PrimitiveIterator.OfInt iterator();

}

UCharacter.java

(FYI: Java already has [U]Character.isValidCodePoint(c); C++ already has U_IS_UNICODE_NONCHAR())

/**

* {@icu} Is cp a Unicode scalar value, that is, a non-surrogate code point?

* Only scalar values can be represented in well-formed UTF-8/16/32.

* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

* Unicode Scalar Value</a>.

* @param cp the code point to check

* @return true if cp is a Unicode scalar value

* @draft ICU 78

*/

public static final boolean isScalarValue(int cp)

/**

* {@icu} Is cp a Unicode noncharacter?

* See <a href="https://www.unicode.org/glossary/#noncharacter">Unicode Glossary:

* Noncharacter</a>.

* @param cp the code point to check

* @return true if cp is a Unicode noncharacter code point

* @draft ICU 78

*/

public static final boolean isNoncharacter(int cp)

/**

* {@icu} Returns an IterableOfInt over all Unicode code points U+0000..U+10FFFF.

* See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.

* <p>Intended for test and builder code.

* @return an IterableOfInt over all Unicode code points U+0000..U+10FFFF.

* @draft ICU 78

*/

public static final IterableOfInt allCodePoints()

/**

* {@icu} Returns an IterableOfInt over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

* Unicode Scalar Value</a>.

* <p>Intended for test and builder code.

* @return an IterableOfInt over all Unicode scalar values.

* @draft ICU 78

*/

public static final IterableOfInt allScalarValues()

/**

* {@icu} Returns an IntStream over all Unicode code points U+0000..U+10FFFF.

* See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.

* <p>Intended for test and builder code.

* @return an IntStream over all Unicode code points U+0000..U+10FFFF.

* @draft ICU 78

*/

public static final IntStream allCodePointsStream()

/**

* {@icu} Returns an IntStream over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

* Unicode Scalar Value</a>.

* <p>Intended for test and builder code.

* @return an IntStream over all Unicode scalar values.

* @draft ICU 78

*/

public static final IntStream allScalarValuesStream()

Sincerely,

markus

Markus Scherer

Jul 17, 2025, 1:16:09 PM

to icu-design, Robin Leroy, Richard T. Gillam, Mihai ⦅U⦆ Niță

The TC approved these APIs today with the following changes:

Change the code-point-to-string functions like this:

New header unicode/utfstring.h
namespace icu::header::utfstring

template<typename StringClass>
U_FORCE_INLINE StringClass &appendOrFFFD(StringClass &s, UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass &appendUnsafe(StringClass &s, UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass encodeOrFFFD(UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass encodeUnsafe(UChar32 c)

Example for call sites:

auto s16 = icu::header::utfstring::encodeOrFFFD<std::u16string>(U'カ');
using namespace icu::header::utfstring;
std::string s = ...;
appendUnsafe(s, U'カ');
