Dear ICU team & users,
I would like to propose the following API for: ICU 78
Please provide feedback by: next Wednesday, 2025-07-16
Designated API reviewers: Robin & Rich & Mihai
Ticket: ICU-23152
WIP changes: https://github.com/unicode-org/icu/pull/3539
I would like to plug some long-standing gaps where I have seen code with magic numbers, multi-line constructs, and incomplete solutions. Most of the additions are for C++, working with code points and writing them to UTF-8/16/32 destinations. A few are for Java.
Among these are four C++ functions for writing a code point to a C++ standard string. I know that this is valuable functionality, but I have notes/questions about the exact API shape for them; see below.
/**
* Is c a Unicode code point U+0000..U+10FFFF?
* https://www.unicode.org/glossary/#code_point
*
* @param c 32-bit code point
* @return true or false
* @draft ICU 78
*/
#define U_IS_CODE_POINT(c)
/**
* Is c a Unicode scalar value, that is, a non-surrogate code point?
* Only scalar values can be represented in well-formed UTF-8/16/32.
* https://www.unicode.org/glossary/#unicode_scalar_value
*
* @param c 32-bit code point
* @return true or false
* @draft ICU 78
*/
#define U_IS_SCALAR_VALUE(c)
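To make the intended distinction concrete, here is a minimal sketch of the two predicates as constexpr functions. The names isCodePointSketch and isScalarValueSketch are hypothetical stand-ins for illustration, not the proposed macros or their actual implementation; the only assumption is the glossary definitions cited above.

```cpp
#include <cstdint>

// Hypothetical stand-in for U_IS_CODE_POINT: true for U+0000..U+10FFFF.
// Taking uint32_t means a negative int32_t input wraps to a large value
// and is rejected by the single comparison.
constexpr bool isCodePointSketch(uint32_t c) {
    return c <= 0x10FFFF;
}

// Hypothetical stand-in for U_IS_SCALAR_VALUE: a code point that is
// not a surrogate (U+D800..U+DFFF).
constexpr bool isScalarValueSketch(uint32_t c) {
    return c < 0xD800 || (0xE000 <= c && c <= 0x10FFFF);
}

// Surrogates are code points but not scalar values.
static_assert(isCodePointSketch(0xD800) && !isScalarValueSketch(0xD800));
```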
/**
* Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.
* Returns 1 for 0..0xc1 as well as for 0xf5..0xff.
* leadByte might be evaluated multiple times.
*
* @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.
* @return 1..4
* @draft ICU 78
*/
#define U8_LENGTH_FROM_LEAD_BYTE(leadByte)
/**
* Returns the length of a well-formed UTF-8 byte sequence according to its lead byte.
* Returns 1 for 0..0xc1. Undefined for 0xf5..0xff.
* leadByte might be evaluated multiple times.
*
* @param leadByte The first byte of a UTF-8 sequence. Must be 0..0xff.
* @return 1..4
* @draft ICU 78
*/
#define U8_LENGTH_FROM_LEAD_BYTE_UNSAFE(leadByte)
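For illustration, the documented behavior of the safe variant could be sketched as a plain function. The name u8LengthFromLeadByteSketch is hypothetical, and the real macro may well use a different technique (for example bit tricks or a lookup table); this only mirrors the documented results.

```cpp
#include <cstdint>

// Hypothetical sketch of the documented results of U8_LENGTH_FROM_LEAD_BYTE:
// 1 for 0..0xc1 and 0xf5..0xff (ASCII, trail bytes, and bytes that cannot
// start a well-formed sequence), otherwise 2..4.
constexpr int u8LengthFromLeadByteSketch(uint8_t leadByte) {
    if (leadByte < 0xC2 || leadByte > 0xF4) { return 1; }
    if (leadByte < 0xE0) { return 2; }  // 0xc2..0xdf
    if (leadByte < 0xF0) { return 3; }  // 0xe0..0xef
    return 4;                           // 0xf0..0xf4
}
```

A caller would typically use the result to skip ahead in a byte buffer, while still validating the trail bytes separately.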
/**
* A C++ "range" over all Unicode code points U+0000..U+10FFFF.
* https://www.unicode.org/glossary/#code_point
*
* Intended for test and builder code.
*
* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t
* @draft ICU 78
* @see U_IS_CODE_POINT
*/
template<typename CP32>
class AllCodePoints {
public:
/** Constructor. @draft ICU 78 */
AllCodePoints() {}
/**
* @return an iterator over all Unicode code points.
* The iterator returns CP32 integers.
* @draft ICU 78
*/
auto begin() const
/**
* @return an exclusive-end iterator over all Unicode code points.
* @draft ICU 78
*/
auto end() const
};
/**
* A C++ "range" over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.
* That is, all code points except surrogates.
* Only scalar values can be represented in well-formed UTF-8/16/32.
* https://www.unicode.org/glossary/#unicode_scalar_value
*
* Intended for test and builder code.
*
* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t
* @draft ICU 78
* @see U_IS_SCALAR_VALUE
*/
template<typename CP32>
class AllScalarValues {
public:
/** Constructor. @draft ICU 78 */
AllScalarValues() {}
/**
* @return an iterator over all Unicode scalar values.
* The iterator returns CP32 integers.
* @draft ICU 78
*/
auto begin() const
/**
* @return an exclusive-end iterator over all Unicode scalar values.
* @draft ICU 78
*/
auto end() const
};
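As a rough picture of what such a range could look like, here is a minimal sketch that works with a range-based for loop. ScalarValueRangeSketch and its nested Iterator are hypothetical names; the proposed classes return `auto` from begin()/end(), so their real iterator types are not specified here.

```cpp
#include <cstdint>

// Hypothetical sketch of a range over all Unicode scalar values.
// The iterator steps through code points and jumps over the
// surrogate block U+D800..U+DFFF.
template<typename CP32>
class ScalarValueRangeSketch {
public:
    class Iterator {
    public:
        explicit Iterator(CP32 c) : c_(c) {}
        CP32 operator*() const { return c_; }
        Iterator &operator++() {
            ++c_;
            if (c_ == 0xD800) { c_ = 0xE000; }  // skip surrogates
            return *this;
        }
        bool operator!=(const Iterator &other) const { return c_ != other.c_; }
    private:
        CP32 c_;
    };
    Iterator begin() const { return Iterator(0); }
    // Exclusive end: one past U+10FFFF.
    Iterator end() const { return Iterator(0x110000); }
};
```

Test code could then write `for (char32_t c : ScalarValueRangeSketch<char32_t>()) { ... }` instead of a hand-written loop with magic numbers.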
Note:
I propose the following code point to string functions for utfiterator.h. They are not related to iterators but would have similar code if & when we add input or output iterators that encode code points to strings.
I can’t think of another existing header file where these would fit.
We could create a whole new header file just for these four functions.
WDYT?
Note:
The append... functions have a parameter order of (string, code point). This seems intuitive to me.
However, for C++ functions we usually prefer inputs before outputs.
WDYT?
Note:
These names are kind of long, especially the stringFrom... functions which require the StringClass template parameter.
... stringFromCodePointOrFFFD<std::u8string>(U'カ') ...
Ideas for shorter names that don’t become confusing/ambiguous?
Also, rather than raw functions in the icu::header namespace, these could live on a class or two like UTFString or maybe UnsafeUTFString.
... UTFString<std::u8string>::fromCodePointOrFFFD(U'カ') ...
... UTFString<std::u8string>::fromCodePointUnsafe(U'カ') ...
... UnsafeUTFString<std::u8string>::fromCodePoint(U'カ') ...
Calling code could create mini inline functions with short names like “strFromCP()” for the versions that it uses.
Note:
For these functions, I am proposing to just take UChar32 input values, as usual for ICU, rather than a tparam typename CP32. That is, these rely on compilers not complaining about implicit conversions between UChar32=int32_t and uint32_t/char32_t. (Tests pass on CI including treat-warnings-as-errors configs.)
/**
* Appends the code point to the string.
* Appends the U+FFFD replacement character instead if c is not a scalar value.
* See https://www.unicode.org/glossary/#unicode_scalar_value
*
* @tparam StringClass A version of std::basic_string (or a compatible type)
* @param s The string to append to
* @param c The code point to append
* @return s
* @draft ICU 78
* @see U_IS_SCALAR_VALUE
*/
template<typename StringClass>
U_FORCE_INLINE StringClass &appendCodePointOrFFFD(StringClass &s, UChar32 c)
/**
* Appends the code point to the string.
* The code point must be a scalar value; otherwise the behavior is undefined.
* See https://www.unicode.org/glossary/#unicode_scalar_value
*
* @tparam StringClass A version of std::basic_string (or a compatible type)
* @param s The string to append to
* @param c The code point to append (must be a scalar value)
* @return s
* @draft ICU 78
* @see U_IS_SCALAR_VALUE
*/
template<typename StringClass>
U_FORCE_INLINE StringClass &appendCodePointUnsafe(StringClass &s, UChar32 c)
/**
* Returns the code point as a string of code units.
* Returns the U+FFFD replacement character instead if c is not a scalar value.
* See https://www.unicode.org/glossary/#unicode_scalar_value
*
* @tparam StringClass A version of std::basic_string (or a compatible type)
* @param c The code point
* @return the string of c's code units
* @draft ICU 78
* @see U_IS_SCALAR_VALUE
*/
template<typename StringClass>
U_FORCE_INLINE StringClass stringFromCodePointOrFFFD(UChar32 c)
/**
* Returns the code point as a string of code units.
* The code point must be a scalar value; otherwise the behavior is undefined.
* See https://www.unicode.org/glossary/#unicode_scalar_value
*
* @tparam StringClass A version of std::basic_string (or a compatible type)
* @param c The code point
* @return the string of c's code units
* @draft ICU 78
* @see U_IS_SCALAR_VALUE
*/
template<typename StringClass>
U_FORCE_INLINE StringClass stringFromCodePointUnsafe(UChar32 c)
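To illustrate the kind of code these functions would replace, here is a minimal sketch of the "OrFFFD" append behavior for UTF-8/16/32 string types. appendOrFFFDSketch is a hypothetical name, it dispatches on sizeof of the code unit type (an assumption of this sketch), and it requires C++17; it is not the implementation in the PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: append c to s, or U+FFFD if c is not a scalar value.
// The encoding is chosen from the size of the string's code unit type.
template<typename StringClass>
StringClass &appendOrFFFDSketch(StringClass &s, int32_t c) {
    using Unit = typename StringClass::value_type;
    uint32_t u = static_cast<uint32_t>(c);
    if (u > 0x10FFFF || (0xD800 <= u && u <= 0xDFFF)) { u = 0xFFFD; }
    if constexpr (sizeof(Unit) == 1) {  // UTF-8
        if (u <= 0x7F) {
            s.push_back(static_cast<Unit>(u));
        } else if (u <= 0x7FF) {
            s.push_back(static_cast<Unit>(0xC0 | (u >> 6)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else if (u <= 0xFFFF) {
            s.push_back(static_cast<Unit>(0xE0 | (u >> 12)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        } else {
            s.push_back(static_cast<Unit>(0xF0 | (u >> 18)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 12) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | ((u >> 6) & 0x3F)));
            s.push_back(static_cast<Unit>(0x80 | (u & 0x3F)));
        }
    } else if constexpr (sizeof(Unit) == 2) {  // UTF-16
        if (u <= 0xFFFF) {
            s.push_back(static_cast<Unit>(u));
        } else {
            s.push_back(static_cast<Unit>(0xD7C0 + (u >> 10)));     // lead surrogate
            s.push_back(static_cast<Unit>(0xDC00 | (u & 0x3FF)));   // trail surrogate
        }
    } else {  // UTF-32
        s.push_back(static_cast<Unit>(u));
    }
    return s;
}
```

Returning the string by reference, as the proposed functions do, allows chaining such as `appendOrFFFDSketch(s, a).append(...)`.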
Add a version of toUTF8String() without a result in/out parameter.
/**
* Convert the UnicodeString to a UTF-8 string.
* Unpaired surrogates are replaced with U+FFFD.
* Calls toUTF8().
*
* @tparam StringClass A std::string or a std::u8string (or a compatible type)
* @return A std::string or a std::u8string (or a compatible object)
* with the UTF-8 version of the string.
* @draft ICU 78
* @see toUTF8
*/
template<typename StringClass>
StringClass toUTF8String() const
Make UnicodeString a “range” of UTF-16 code units (char16_t).
/**
* @return an iterator to the first code unit in this string.
* The iterator may be a pointer or a contiguous-iterator object.
* @draft ICU 78
*/
auto begin() const { return std::u16string_view(*this).begin(); }
/**
* @return an iterator to just past the last code unit in this string.
* The iterator may be a pointer or a contiguous-iterator object.
* @draft ICU 78
*/
auto end() const { return std::u16string_view(*this).end(); }
/**
* @return a reverse iterator to the last code unit in this string.
* The iterator is a reverse-iterator adapter (such as std::reverse_iterator) over the code units.
* @draft ICU 78
*/
auto rbegin() const { return std::u16string_view(*this).rbegin(); }
/**
* @return a reverse iterator to just before the first code unit in this string.
* The iterator is a reverse-iterator adapter (such as std::reverse_iterator) over the code units.
* @draft ICU 78
*/
auto rend() const { return std::u16string_view(*this).rend(); }
Make std::back_inserter(UnicodeString) work:
/**
* Appends the code unit `c` to the UnicodeString object.
* @param c the code unit to append
* @draft ICU 78
*/
inline void push_back(char16_t c) { append(c); }
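The pattern of forwarding begin()/end() to a std::u16string_view conversion, plus push_back() for std::back_inserter, can be sketched with a small stand-in class. Utf16TextSketch and reversedCopySketch are hypothetical and only mimic the relevant surface of UnicodeString; the stand-in declares value_type because std::back_insert_iterator refers to Container::value_type.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a UTF-16 string class that is convertible to
// std::u16string_view and exposes itself as a range of char16_t code units.
class Utf16TextSketch {
public:
    using value_type = char16_t;  // used by std::back_insert_iterator

    Utf16TextSketch() = default;
    explicit Utf16TextSketch(std::u16string s) : s_(std::move(s)) {}

    operator std::u16string_view() const { return s_; }

    // The iterators refer to the underlying buffer, so they stay valid
    // after the temporary string_view is gone (until the text is modified).
    auto begin() const { return std::u16string_view(*this).begin(); }
    auto end() const { return std::u16string_view(*this).end(); }
    auto rbegin() const { return std::u16string_view(*this).rbegin(); }
    auto rend() const { return std::u16string_view(*this).rend(); }

    // Enough for std::back_inserter to append code units.
    void push_back(char16_t c) { s_.push_back(c); }

private:
    std::u16string s_;
};

// Usage sketch: build a reversed copy from the reverse iterators.
inline std::u16string reversedCopySketch(const Utf16TextSketch &t) {
    return std::u16string(t.rbegin(), t.rend());
}
```

With this shape, standard algorithms such as std::copy with std::back_inserter, and range-based for loops, work without extra adapter code.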
package com.ibm.icu.lang;
import java.util.PrimitiveIterator;
/**
* Subinterface of Iterable whose iterator() returns a {@link PrimitiveIterator.OfInt}.
* Allows direct use of the primitive iterator without downcasting.
*
* @draft ICU 78
*/
public interface IterableOfInt extends Iterable<Integer> {
/**
* @return a {@link PrimitiveIterator.OfInt}
* @draft ICU 78
*/
@Override
public PrimitiveIterator.OfInt iterator();
}
(FYI: Java already has [U]Character.isValidCodePoint(c); C++ already has U_IS_UNICODE_NONCHAR())
/**
* {@icu} Is cp a Unicode scalar value, that is, a non-surrogate code point?
* Only scalar values can be represented in well-formed UTF-8/16/32.
* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:
* Unicode Scalar Value</a>.
*
* @param cp the code point to check
* @return true if cp is a Unicode scalar value
* @draft ICU 78
*/
public static final boolean isScalarValue(int cp)
/**
* {@icu} Is cp a Unicode noncharacter?
* See <a href="https://www.unicode.org/glossary/#noncharacter">Unicode Glossary:
* Noncharacter</a>.
*
* @param cp the code point to check
* @return true if cp is a Unicode noncharacter code point
* @draft ICU 78
*/
public static final boolean isNoncharacter(int cp)
/**
* {@icu} Returns an IterableOfInt over all Unicode code points U+0000..U+10FFFF.
* See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.
*
* <p>Intended for test and builder code.
*
* @return an IterableOfInt over all Unicode code points U+0000..U+10FFFF.
* @draft ICU 78
*/
public static final IterableOfInt allCodePoints()
/**
* {@icu} Returns an IterableOfInt over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.
* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:
* Unicode Scalar Value</a>.
*
* <p>Intended for test and builder code.
*
* @return an IterableOfInt over all Unicode scalar values.
* @draft ICU 78
*/
public static final IterableOfInt allScalarValues()
/**
* {@icu} Returns an IntStream over all Unicode code points U+0000..U+10FFFF.
* See <a href="https://www.unicode.org/glossary/#code_point">Unicode Glossary: Code Point</a>.
*
* <p>Intended for test and builder code.
*
* @return an IntStream over all Unicode code points U+0000..U+10FFFF.
* @draft ICU 78
*/
public static final IntStream allCodePointsStream()
/**
* {@icu} Returns an IntStream over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF.
* See <a href="https://www.unicode.org/glossary/#unicode_scalar_value">Unicode Glossary:
* Unicode Scalar Value</a>.
*
* <p>Intended for test and builder code.
*
* @return an IntStream over all Unicode scalar values.
* @draft ICU 78
*/
public static final IntStream allScalarValuesStream()
Change the code-point-to-string functions like this:
New header unicode/utfstring.h
namespace icu::header::utfstring
template<typename StringClass>
U_FORCE_INLINE StringClass &appendOrFFFD(StringClass &s, UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass &appendUnsafe(StringClass &s, UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass encodeOrFFFD(UChar32 c)
template<typename StringClass>
U_FORCE_INLINE StringClass encodeUnsafe(UChar32 c)
Example for call sites:
auto s = icu::header::utfstring::encodeOrFFFD<std::u16string>(U'カ');
using namespace icu::header::utfstring;
std::string s = ...;
const char *p = appendUnsafe(s, U'カ').c_str();