ICU API proposal: Unicode helper APIs

Markus Scherer

Jul 9, 2025, 1:54:41 PM
to icu-design, Robin Leroy, Richard T. Gillam, Mihai ⦅U⦆ Niță

Dear ICU team & users,


I would like to propose the following API for: ICU 78

Please provide feedback by: next Wednesday, 2025-07-16

Designated API reviewers: Robin & Rich & Mihai

Ticket: ICU-23152

WIP changes: https://github.com/unicode-org/icu/pull/3539


I would like to plug some long-standing gaps where I have seen code with magic numbers, multi-line constructs, and incomplete solutions. Most of the additions are for C++, working with code points and writing them to UTF-8/16/32 destinations. A few are for Java.


Among these are four C++ functions for writing a code point to a C++ standard string. I know that this is valuable functionality, but I have notes/questions about the exact API shape for them; see below.

unicode/utf.h


/**

 * Is c a Unicode code point U+0000..U+10FFFF?

 * https://www.unicode.org/glossary/#code_point

 *

 * @param c 32-bit code point

 * @return true or false

 * @draft ICU 78

 */

#define U_IS_CODE_POINT(c)


/**

 * Is c a Unicode scalar value, that is, a non-surrogate code point?

 * Only scalar values can be represented in well-formed UTF-8/16/32.

 * https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * @param c 32-bit code point

 * @return true or false

 * @draft ICU 78

 */

#define U_IS_SCALAR_VALUE(c)
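
For readers who want to see the two predicates spelled out, here is a minimal sketch. It is not the macro implementation from the PR; the function names are hypothetical and plain int32_t stands in for UChar32. It only illustrates the ranges named in the doc comments above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the proposed macros; these names are not ICU API.
// Casting to uint32_t makes negative inputs compare as very large values.
constexpr bool isCodePointSketch(int32_t c) {
    return static_cast<uint32_t>(c) <= 0x10FFFF;
}

// A scalar value is a code point outside the surrogate block U+D800..U+DFFF.
constexpr bool isScalarValueSketch(int32_t c) {
    return static_cast<uint32_t>(c) < 0xD800 ||
           (0xE000 <= c && c <= 0x10FFFF);
}

// Boundary checks: surrogates are code points but not scalar values.
inline void checkBoundariesSketch() {
    assert(isCodePointSketch(0xD800) && !isScalarValueSketch(0xD800));
    assert(isScalarValueSketch(0xD7FF) && isScalarValueSketch(0xE000));
    assert(!isCodePointSketch(-1) && !isCodePointSketch(0x110000));
}
```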

unicode/utf8.h

/**

 * Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.

 * Returns 1 for 0..0xc1 as well as for 0xf5..0xff.

 * leadByte might be evaluated multiple times.

 *

 * @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.

 * @return 1..4

 * @draft ICU 78

 */

#define U8_LENGTH_FROM_LEAD_BYTE(leadByte)


/**

 * Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.

 * Returns 1 for 0..0xc1. Undefined for 0xf5..0xff.

 * leadByte might be evaluated multiple times.

 *

 * @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.

 * @return 1..4

 * @draft ICU 78

 */

#define U8_LENGTH_FROM_LEAD_BYTE_UNSAFE(leadByte)
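
As an illustration of the documented contract (not the actual macro bodies), the mapping from lead byte to sequence length could be sketched as below. The function names are hypothetical, and the unsafe variant models "undefined" as a debug-time assertion, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical function, not the ICU macro: length implied by a UTF-8 lead byte.
// Follows the documented "safe" contract: 1 for 0..0xc1 and for 0xf5..0xff.
constexpr int lengthFromLeadByteSketch(uint8_t leadByte) {
    return leadByte < 0xC2 ? 1 :  // ASCII, trail bytes, and C0/C1 (overlong only)
           leadByte < 0xE0 ? 2 :
           leadByte < 0xF0 ? 3 :
           leadByte < 0xF5 ? 4 :
           1;                     // F5..FF never start a well-formed sequence
}

// Hypothetical "unsafe" variant: the caller promises leadByte <= 0xf4.
inline int lengthFromLeadByteUnsafeSketch(uint8_t leadByte) {
    assert(leadByte <= 0xF4);  // outside this range the result is unspecified
    return leadByte < 0xC2 ? 1 : leadByte < 0xE0 ? 2 : leadByte < 0xF0 ? 3 : 4;
}
```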

unicode/utfiterator.h


/**

 * A C++ "range" over all Unicode code points U+0000..U+10FFFF.

 * https://www.unicode.org/glossary/#code_point

 *

 * Intended for test and builder code.

 *

 * @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

 * @draft ICU 78

 * @see U_IS_CODE_POINT

 */

template<typename CP32>

class AllCodePoints {

public:

    /** Constructor. @draft ICU 78 */

    AllCodePoints() {}

    /**

     * @return an iterator over all Unicode code points.

     *     The iterator returns CP32 integers.

     * @draft ICU 78

     */

    auto begin() const

    /**

     * @return an exclusive-end iterator over all Unicode code points.

     * @draft ICU 78

     */

    auto end() const

};


/**

 * A C++ "range" over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

 * That is, all code points except surrogates.

 * Only scalar values can be represented in well-formed UTF-8/16/32.

 * https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * Intended for test and builder code.

 *

 * @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

 * @draft ICU 78

 * @see U_IS_SCALAR_VALUE

 */

template<typename CP32>

class AllScalarValues {

public:

    /** Constructor. @draft ICU 78 */

    AllScalarValues() {}

    /**

     * @return an iterator over all Unicode scalar values.

     *     The iterator returns CP32 integers.

     * @draft ICU 78

     */

    auto begin() const

    /**

     * @return an exclusive-end iterator over all Unicode scalar values.

     * @draft ICU 78

     */

    auto end() const

};
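
To make the "range" idea concrete, here is a minimal sketch of a scalar-value range that works with a range-based for loop. It is not the class from the PR; the class and function names are hypothetical, and the iterator is deliberately bare (no iterator traits).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not the ICU class: a range over all Unicode scalar values.
template<typename CP32>
class AllScalarValuesSketch {
public:
    class Iterator {
    public:
        explicit Iterator(uint32_t c) : c_(c) {}
        CP32 operator*() const { return static_cast<CP32>(c_); }
        Iterator &operator++() {
            assert(c_ < 0x110000);  // must not advance past the exclusive end
            // Jump over the surrogate block U+D800..U+DFFF.
            c_ = (c_ == 0xD7FF) ? 0xE000 : c_ + 1;
            return *this;
        }
        bool operator==(const Iterator &other) const { return c_ == other.c_; }
        bool operator!=(const Iterator &other) const { return c_ != other.c_; }
    private:
        uint32_t c_;
    };

    Iterator begin() const { return Iterator(0); }
    Iterator end() const { return Iterator(0x110000); }  // exclusive end
};

// Counts the values produced by the range, as test code might do.
inline long countScalarValuesSketch() {
    long n = 0;
    for (char32_t c : AllScalarValuesSketch<char32_t>()) {
        (void)c;
        ++n;
    }
    return n;
}
```

The count should equal the number of code points (0x110000) minus the 0x800 surrogates.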


Note:

  • I propose the following code point to string functions for utfiterator.h. They are not related to iterators but would have similar code if & when we add input or output iterators that encode code points to strings.

  • I can’t think of another existing header file where these would fit.

  • We could create a whole new header file just for these four functions.

  • WDYT?


Note:

  • The append... functions have a parameter order of (string, code point). This seems intuitive to me.

  • However, for C++ functions we usually prefer inputs before outputs.

  • WDYT?


Note:

  • These names are kind of long, especially the stringFrom... functions which require the StringClass template parameter.

    • ... stringFromCodePointOrFFFD<std::u8string>(U'カ') ...

  • Ideas for shorter names that don’t become confusing/ambiguous?

  • Also, rather than raw functions in the icu::header namespace, these could live on a class or two like UTFString or maybe UnsafeUTFString.

    • ... UTFString<std::u8string>::fromCodePointOrFFFD(U'カ') ...

    • ... UTFString<std::u8string>::fromCodePointUnsafe(U'カ') ...

    • ... UnsafeUTFString<std::u8string>::fromCodePoint(U'カ') ...

  • Calling code could create mini inline functions with short names like “strFromCP()” for the versions that it uses.


Note:

  • For these functions, I am proposing to just take UChar32 input values, as usual for ICU, rather than a tparam typename CP32. That is, these rely on compilers not complaining about implicit conversions between UChar32=int32_t and uint32_t/char32_t. (Tests pass on CI including treat-warnings-as-errors configs.)


/**

 * Appends the code point to the string.

 * Appends the U+FFFD replacement character instead if c is not a scalar value.

 * See https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * @tparam StringClass A version of std::basic_string (or a compatible type)

 * @param s The string to append to

 * @param c The code point to append

 * @return s

 * @draft ICU 78

 * @see U_IS_SCALAR_VALUE

 */

template<typename StringClass>

U_FORCE_INLINE StringClass &appendCodePointOrFFFD(StringClass &s, UChar32 c)


/**

 * Appends the code point to the string.

 * The code point must be a scalar value; otherwise the behavior is undefined.

 * See https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * @tparam StringClass A version of std::basic_string (or a compatible type)

 * @param s The string to append to

 * @param c The code point to append (must be a scalar value)

 * @return s

 * @draft ICU 78

 * @see U_IS_SCALAR_VALUE

 */

template<typename StringClass>

U_FORCE_INLINE StringClass &appendCodePointUnsafe(StringClass &s, UChar32 c)


/**

 * Returns the code point as a string of code units.

 * Returns the U+FFFD replacement character instead if c is not a scalar value.

 * See https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * @tparam StringClass A version of std::basic_string (or a compatible type)

 * @param c The code point

 * @return the string of c's code units

 * @draft ICU 78

 * @see U_IS_SCALAR_VALUE

 */

template<typename StringClass>

U_FORCE_INLINE StringClass stringFromCodePointOrFFFD(UChar32 c)


/**

 * Returns the code point as a string of code units.

 * The code point must be a scalar value; otherwise the behavior is undefined.

 * See https://www.unicode.org/glossary/#unicode_scalar_value

 *

 * @tparam StringClass A version of std::basic_string (or a compatible type)

 * @param c The code point

 * @return the string of c's code units

 * @draft ICU 78

 * @see U_IS_SCALAR_VALUE

 */

template<typename StringClass>

U_FORCE_INLINE StringClass stringFromCodePointUnsafe(UChar32 c)
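
For illustration only, the "OrFFFD" behavior could be sketched as follows. This is not the code from the PR: the function names are hypothetical, int32_t stands in for UChar32, and choosing the encoding from the size of the string's code unit type is an assumption of this sketch. It requires C++17 for "if constexpr".

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch, not ICU API: append c, or U+FFFD if c is not a scalar value.
// The encoding form is chosen from the size of StringClass's code unit type.
template<typename StringClass>
StringClass &appendCodePointOrFFFDSketch(StringClass &s, int32_t c) {
    using Unit = typename StringClass::value_type;
    uint32_t u = static_cast<uint32_t>(c);
    if (u > 0x10FFFF || (0xD800 <= u && u <= 0xDFFF)) { u = 0xFFFD; }
    assert(u <= 0x10FFFF);  // from here on, u is a scalar value
    if constexpr (sizeof(Unit) == 1) {  // UTF-8
        if (u < 0x80) {
            s.push_back(static_cast<Unit>(u));
        } else if (u < 0x800) {
            s.push_back(static_cast<Unit>(0xC0 | (u >> 6)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else if (u < 0x10000) {
            s.push_back(static_cast<Unit>(0xE0 | (u >> 12)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else {
            s.push_back(static_cast<Unit>(0xF0 | (u >> 18)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 12) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        }
    } else if constexpr (sizeof(Unit) == 2) {  // UTF-16
        if (u < 0x10000) {
            s.push_back(static_cast<Unit>(u));
        } else {
            s.push_back(static_cast<Unit>(0xD7C0 + (u >> 10)));    // lead surrogate
            s.push_back(static_cast<Unit>(0xDC00 | (u & 0x3FF)));  // trail surrogate
        }
    } else {  // UTF-32
        s.push_back(static_cast<Unit>(u));
    }
    return s;
}

// Hypothetical sketch of the value-returning form, built on the append form.
template<typename StringClass>
StringClass stringFromCodePointOrFFFDSketch(int32_t c) {
    StringClass s;
    appendCodePointOrFFFDSketch(s, c);
    return s;
}
```

With this sketch, U+30AB (カ) appended to a std::string yields the three bytes E3 82 AB, and an unpaired surrogate such as U+D800 yields EF BF BD (U+FFFD).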

UnicodeString


Add a version of toUTF8String() without a result in/out parameter.


  /**

   * Convert the UnicodeString to a UTF-8 string.

   * Unpaired surrogates are replaced with U+FFFD.

   * Calls toUTF8().

   *

   * @tparam StringClass A std::string or a std::u8string (or a compatible type)

   * @return A std::string or a std::u8string (or a compatible object)

   *        with the UTF-8 version of the string.

   * @draft ICU 78

   * @see toUTF8

   */

  template<typename StringClass>

  StringClass toUTF8String() const


Make UnicodeString a “range” of UTF-16 code units (char16_t).


  /**

   * @return an iterator to the first code unit in this string.

   *     The iterator may be a pointer or a contiguous-iterator object.

   * @draft ICU 78

   */

  auto begin() const { return std::u16string_view(*this).begin(); }

  /**

   * @return an iterator to just past the last code unit in this string.

   *     The iterator may be a pointer or a contiguous-iterator object.

   * @draft ICU 78

   */

  auto end() const { return std::u16string_view(*this).end(); }

  /**

   * @return a reverse iterator to the last code unit in this string.

   *     The iterator is a reverse-iterator adaptor (such as std::reverse_iterator), not a pointer.

   * @draft ICU 78

   */

  auto rbegin() const { return std::u16string_view(*this).rbegin(); }

  /**

   * @return a reverse iterator to just before the first code unit in this string.

   *     The iterator is a reverse-iterator adaptor (such as std::reverse_iterator), not a pointer.

   * @draft ICU 78

   */

  auto rend() const { return std::u16string_view(*this).rend(); }


Make std::back_inserter(UnicodeString) work:


  /**

   * Appends the code unit `c` to the UnicodeString object.

   * @param c the code unit to append

   * @draft ICU 78

   */

  inline void push_back(char16_t c) { append(c); }
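
To show what these members enable, here is a minimal sketch with a hypothetical stand-in class rather than the real UnicodeString. begin()/end() make range-based for loops and algorithms work, and push_back() is what std::back_inserter calls. The sketch also declares value_type because std::back_insert_iterator's assignment operator is specified in terms of Container::value_type. How the real class satisfies that is not shown in this proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>

// Hypothetical stand-in for UnicodeString; only what the two protocols need.
class TinyU16StringSketch {
public:
    // std::back_insert_iterator declares its assignment in terms of value_type.
    using value_type = char16_t;

    operator std::u16string_view() const { return units_; }

    // Range protocol: enables range-based for loops and <algorithm> calls.
    auto begin() const { return std::u16string_view(*this).begin(); }
    auto end() const { return std::u16string_view(*this).end(); }

    // Back-insertion protocol: enables std::back_inserter(dest).
    void push_back(char16_t c) { units_.push_back(c); }

private:
    std::u16string units_;
};

// Copies src into a fresh sketch string through std::back_inserter.
inline TinyU16StringSketch copyViaBackInserterSketch(std::u16string_view src) {
    TinyU16StringSketch dest;
    std::copy(src.begin(), src.end(), std::back_inserter(dest));
    assert(std::u16string_view(dest) == src);
    return dest;
}
```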

IterableOfInt.java – new file

package com.ibm.icu.lang;

import java.util.PrimitiveIterator;

/**

* Subinterface of Iterable whose iterator() returns a {@link PrimitiveIterator.OfInt}.

* Allows direct use of the primitive iterator without downcasting.

*

* @draft ICU 78

*/

public interface IterableOfInt extends Iterable<Integer> {

   /**

    * @return a {@link PrimitiveIterator.OfInt}

    * @draft ICU 78

    */

   @Override

   public PrimitiveIterator.OfInt iterator();

}

UCharacter.java

(FYI: Java already has [U]Character.isValidCodePoint(c); C++ already has U_IS_UNICODE_NONCHAR())


   /**

    * {@icu} Is cp a Unicode scalar value, that is, a non-surrogate code point?

    * Only scalar values can be represented in well-formed UTF-8/16/32.

    * See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

    * Unicode Scalar Value</a>.

    *

    * @param cp the code point to check

    * @return true if cp is a Unicode scalar value

    * @draft ICU 78

    */

   public static final boolean isScalarValue(int cp)


   /**

    * {@icu} Is cp a Unicode noncharacter?

    * See <a href="https://www.unicode.org/glossary/#noncharacter">Unicode Glossary:

    * Noncharacter</a>.

    *

    * @param cp the code point to check

    * @return true if cp is a Unicode noncharacter code point

    * @draft ICU 78

    */

   public static final boolean isNoncharacter(int cp)


   /**

    * {@icu} Returns an IterableOfInt over all Unicode code points U+0000..U+10FFFF.

    * See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.

    *

    * <p>Intended for test and builder code.

    *

    * @return an IterableOfInt over all Unicode code points U+0000..U+10FFFF.

    * @draft ICU 78

    */

   public static final IterableOfInt allCodePoints()

   /**

    * {@icu} Returns an IterableOfInt over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

    * See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

    * Unicode Scalar Value</a>.

    *

    * <p>Intended for test and builder code.

    *

    * @return an IterableOfInt over all Unicode scalar values.

    * @draft ICU 78

    */

   public static final IterableOfInt allScalarValues()

   /**

    * {@icu} Returns an IntStream over all Unicode code points U+0000..U+10FFFF.

    * See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.

    *

    * <p>Intended for test and builder code.

    *

    * @return an IntStream over all Unicode code points U+0000..U+10FFFF.

    * @draft ICU 78

    */

   public static final IntStream allCodePointsStream()

   /**

    * {@icu} Returns an IntStream over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.

    * See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:

    * Unicode Scalar Value</a>.

    *

    * <p>Intended for test and builder code.

    *

    * @return an IntStream over all Unicode scalar values.

    * @draft ICU 78

    */

   public static final IntStream allScalarValuesStream()


Sincerely,
markus

Markus Scherer

Jul 17, 2025, 1:16:09 PM
to icu-design, Robin Leroy, Richard T. Gillam, Mihai ⦅U⦆ Niță
The TC approved these APIs today with the following changes:

Change the code-point-to-string functions like this:

  • New header unicode/utfstring.h

  • namespace icu::header::utfstring

    • template<typename StringClass>
      U_FORCE_INLINE StringClass &appendOrFFFD(StringClass &s, UChar32 c)

    • template<typename StringClass>
      U_FORCE_INLINE StringClass &appendUnsafe(StringClass &s, UChar32 c)

    • template<typename StringClass>
      U_FORCE_INLINE StringClass encodeOrFFFD(UChar32 c)

    • template<typename StringClass>
      U_FORCE_INLINE StringClass encodeUnsafe(UChar32 c)

  • Example for call sites:

    • auto s = icu::header::utfstring::encodeOrFFFD<std::u16string>(U'カ');


    • using namespace icu::header::utfstring;
      std::string s = ...;
      appendUnsafe(s, U'カ');

