ICU4C API proposal: C++ Unicode string code point iterators

Markus Scherer

Mar 10, 2025, 8:23:30 PM

to icu-d...@unicode.org, Robin Leroy

Dear ICU team & users,

I would like to propose the following API for: ICU 78

Please provide feedback by: next Wednesday, 2025-03-19

Designated API reviewer: Robin

Ticket: ICU-23004 / draft pull request: icu/pull/3096

I would like to propose new C++ header-only APIs for iterating over the Unicode code points in a Unicode string, and more generally over the code units from a code unit iterator. These are modern C++ equivalents of some of the long-standing C macros for iterating over UTF-8 and UTF-16. This C++ API also supports UTF-32.

FYI: UTF-8 and UTF-16 encode code points with variable-length code unit sequences. A validating iterator needs to read and check all of the code units for one code point. When a code unit sequence is ill-formed, then the returned subsequence must be a prefix of a well-formed sequence. (Except we always return at least one code unit, so that we always progress.) (UTF-32 still has validation, but sequences always have length one.)
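To make the progress rule concrete, here is a minimal standalone sketch (not ICU code) of validating and decoding one UTF-16 code point. The names DecodedUnits and decodeOne16 are hypothetical and exist only for this illustration; an unpaired surrogate is reported as ill-formed and consumes exactly one code unit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical result type for this sketch; not the proposed CodeUnits class.
struct DecodedUnits {
    char32_t codePoint;  // U+FFFD if the sequence is ill-formed
    std::size_t length;  // number of code units consumed, always >= 1
    bool wellFormed;
};

// Decodes the code point at the start of a non-empty UTF-16 string.
inline DecodedUnits decodeOne16(std::u16string_view s) {
    char16_t lead = s[0];
    if (lead < 0xD800 || lead > 0xDFFF) {
        // Not a surrogate: a BMP code point in a single code unit.
        return {static_cast<char32_t>(lead), 1, true};
    }
    if (lead <= 0xDBFF && s.size() >= 2) {
        char16_t trail = s[1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            // Well-formed surrogate pair: combine lead and trail.
            char32_t c = 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
                         (char32_t(trail) - 0xDC00);
            return {c, 2, true};
        }
    }
    // Unpaired surrogate: consume exactly one code unit so that iteration progresses.
    return {0xFFFD, 1, false};
}
```

A caller would advance by the returned length after each call, so even a string made entirely of unpaired surrogates is traversed in a finite number of steps.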

The proposed API can read code units from a C++ input_iterator or forward_iterator or bidirectional_iterator. The latter includes code unit pointers like const char * and const char16_t *. There is a convenience API for std::string_views.

The main class is called UTFIterator. Its operator*() returns a value serving a variety of use cases: Class CodeUnits provides the code point, the start of its minimal subsequence, the number of code units, and whether they are well-formed. (All functions are declared inline. An optimizing compiler will usually omit fields that are not used, and the code for computing them.)

UTFIterator has the API of a C++ STL iterator. It has template parameters for the code unit iterator type, for the code point type, and for how to handle ill-formed subsequences. std::make_reverse_iterator works for making reverse-range iterators.
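The following standalone sketch shows why the STL iterator shape matters: once an iterator provides the usual member types plus operator++ and operator--, std::make_reverse_iterator works without extra code. CodePointIter16 and lastCodePoint16 are hypothetical names; the class is non-validating, assumes well-formed UTF-16, and is not the proposed UTFIterator.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string_view>

// Hypothetical, non-validating code point iterator over well-formed UTF-16.
class CodePointIter16 {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t *;
    using reference = char32_t;  // code points are returned by value

    CodePointIter16() = default;
    explicit CodePointIter16(const char16_t *p) : p_(p) {}

    char32_t operator*() const {
        char32_t c = p_[0];
        if (c >= 0xD800 && c <= 0xDBFF) {  // lead surrogate: combine with the trail
            c = 0x10000 + ((c - 0xD800) << 10) + (p_[1] - 0xDC00);
        }
        return c;
    }
    CodePointIter16 &operator++() {
        p_ += (*p_ >= 0xD800 && *p_ <= 0xDBFF) ? 2 : 1;
        return *this;
    }
    CodePointIter16 operator++(int) { auto old = *this; ++*this; return old; }
    CodePointIter16 &operator--() {
        --p_;
        if (*p_ >= 0xDC00 && *p_ <= 0xDFFF) { --p_; }  // trail surrogate: step back to its lead
        return *this;
    }
    CodePointIter16 operator--(int) { auto old = *this; --*this; return old; }
    bool operator==(const CodePointIter16 &o) const { return p_ == o.p_; }
    bool operator!=(const CodePointIter16 &o) const { return p_ != o.p_; }

private:
    const char16_t *p_ = nullptr;
};

// Returns the last code point of a non-empty, well-formed UTF-16 string.
inline char32_t lastCodePoint16(std::u16string_view s) {
    CodePointIter16 limit(s.data() + s.size());
    return *std::make_reverse_iterator(limit);
}
```

Dereferencing the reverse iterator decrements a copy of the wrapped iterator first, which is why operator--() has to step over a whole surrogate pair.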

The convenience class UTFStringCodePoints turns a std::string_view (of variable code unit type) into a code point iteration “range” with begin()/end()/rbegin()/rend() functions.

There are convenience functions utfIterator() and utfStringCodePoints() to simplify call sites; they deduce the code unit and base iterator types.

For each of these classes and convenience functions, there is also an “unsafe” version, just like for the C macros. The normal versions validate the code unit sequences. The “unsafe” ones assume/require that the strings/sequences are well-formed. As a result, they yield much smaller and faster code.
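As a rough illustration of what "unsafe" buys, here is a standalone sketch of non-validating UTF-8 decoding. The function name unsafeDecodeOne8 is hypothetical and the code is not taken from ICU. It picks the sequence length from the lead byte alone and never checks bounds, trail bytes, overlong forms, or surrogates, which is only acceptable when the input is known to be well-formed.

```cpp
#include <cstdint>

// Decodes one code point from well-formed UTF-8 and reports its length in bytes.
// No validation: the caller guarantees that p points at a complete, well-formed sequence.
inline char32_t unsafeDecodeOne8(const char *p, int &length) {
    uint8_t b = static_cast<uint8_t>(p[0]);
    if (b < 0x80) {  // ASCII
        length = 1;
        return b;
    }
    if (b < 0xE0) {  // two-byte sequence
        length = 2;
        return ((b & 0x1F) << 6) | (p[1] & 0x3F);
    }
    if (b < 0xF0) {  // three-byte sequence
        length = 3;
        return ((b & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    length = 4;  // four-byte sequence
    return ((b & 0x07) << 18) | ((p[1] & 0x3F) << 12) | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}
```

A validating decoder has to add a limit check and a range check per trail byte, which is where the extra code size and branches come from.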

Sample code

using U_HEADER_ONLY_NAMESPACE::utfIterator;

using U_HEADER_ONLY_NAMESPACE::utfStringCodePoints;

using U_HEADER_ONLY_NAMESPACE::unsafeUTFIterator;

using U_HEADER_ONLY_NAMESPACE::unsafeUTFStringCodePoints;

int32_t rangeLoop16(std::u16string_view s) {

// We are just adding up the code points for minimal-code demonstration purposes.

int32_t sum = 0;

for (auto units : utfStringCodePoints<UChar32, UTF_BEHAVIOR_NEGATIVE>(s)) {

sum += units.codePoint(); // < 0 if ill-formed

}

return sum;

}

int32_t loopIterPlusPlus16(std::u16string_view s) {

auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);

int32_t sum = 0;

for (auto iter = range.begin(), limit = range.end(); iter != limit;) {

sum += (*iter++).codePoint(); // U+FFFD if ill-formed

}

return sum;

}

int32_t backwardLoop16(std::u16string_view s) {

auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s);

int32_t sum = 0;

for (auto start = range.begin(), iter = range.end(); start != iter;) {

sum += (*--iter).codePoint(); // surrogate code point if unpaired / ill-formed

}

return sum;

}

int32_t reverseLoop8(std::string_view s) {

auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);

int32_t sum = 0;

for (auto iter = range.rbegin(), limit = range.rend(); iter != limit; ++iter) {

sum += iter->codePoint(); // U+FFFD if ill-formed

}

return sum;

}

int32_t countCodePoints16(std::u16string_view s) {

auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s);

return std::distance(range.begin(), range.end());

}

int32_t unsafeRangeLoop16(std::u16string_view s) {

int32_t sum = 0;

for (auto units : unsafeUTFStringCodePoints<UChar32>(s)) {

sum += units.codePoint();

}

return sum;

}

int32_t unsafeReverseLoop8(std::string_view s) {

auto range = unsafeUTFStringCodePoints<UChar32>(s);

int32_t sum = 0;

for (auto iter = range.rbegin(), limit = range.rend(); iter != limit; ++iter) {

sum += iter->codePoint();

}

return sum;

}

char32_t firstCodePointOrFFFD16(std::u16string_view s) {

if (s.empty()) { return 0xfffd; }

auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);

return range.begin()->codePoint();

}

std::string_view firstSequence8(std::string_view s) {

if (s.empty()) { return {}; }

auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);

auto units = *(range.begin());

if (units.wellFormed()) {

return units.stringView();

} else {

return {};

}

}

Proposed public API signatures

New header file: unicode/utfiterator.h

// Some defined behaviors for handling ill-formed Unicode strings.

typedef enum UTFIllFormedBehavior {

// Returns a negative value instead of a code point.

UTF_BEHAVIOR_NEGATIVE,

// Returns U+FFFD Replacement Character.

UTF_BEHAVIOR_FFFD,

// UTF-8: Not allowed;

// UTF-16: returns the unpaired surrogate;

// UTF-32: returns the surrogate code point, or U+FFFD if out of range.

UTF_BEHAVIOR_SURROGATE

} UTFIllFormedBehavior;

namespace U_HEADER_ONLY_NAMESPACE {

/**

* Result of decoding a minimal Unicode code unit sequence.

* Returned from non-validating Unicode string code point iterators.

* Base class for class CodeUnits which is returned from validating iterators.

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;

* should be signed if UTF_BEHAVIOR_NEGATIVE

* @tparam UnitIter An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @see UnsafeUTFIterator

* @see UnsafeUTFStringCodePoints

* @draft ICU 78

*/

template<typename CP32, typename UnitIter, typename = void>

class UnsafeCodeUnits {

public:

UnsafeCodeUnits(const UnsafeCodeUnits &other) = default;

UnsafeCodeUnits &operator=(const UnsafeCodeUnits &other) = default;

/**

* @return the Unicode code point decoded from the code unit sequence.

* If the sequence is ill-formed and the iterator validates,

* then this is a replacement value according to the iterator's

* UTFIllFormedBehavior template parameter.

* @draft ICU 78

*/

UChar32 codePoint() const { return c; }

/**

* @return the start of the minimal Unicode code unit sequence.

* Only enabled if UnitIter is a (multi-pass) forward_iterator or better.

* @draft ICU 78

*/

UnitIter data() const { return p; }

/**

* @return the length of the minimal Unicode code unit sequence.

* @draft ICU 78

*/

uint8_t length() const { return len; }

/**

* @return a string_view of the minimal Unicode code unit sequence.

* Only enabled if UnitIter is a pointer.

* @draft ICU 78

*/

stringView() const {

};

/**

* Result of validating and decoding a minimal Unicode code unit sequence.

* Returned from validating Unicode string code point iterators.

* Adds function wellFormed() to base class UnsafeCodeUnits.

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;

* should be signed if UTF_BEHAVIOR_NEGATIVE

* @tparam UnitIter An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @see UTFIterator

* @see UTFStringCodePoints

* @draft ICU 78

*/

template<typename CP32, typename UnitIter, typename = void>

class CodeUnits : public UnsafeCodeUnits<CP32, UnitIter> {

public:

CodeUnits(const CodeUnits &other) = default;

CodeUnits &operator=(const CodeUnits &other) = default;

bool wellFormed() const { return ok; }

};

validating

/**

* Validating iterator over the code points in a Unicode string.

*

* The UnitIter can be

* an input_iterator, a forward_iterator, or a bidirectional_iterator (including a pointer).

* The UTFIterator will have the corresponding iterator_category.

*

* For reverse iteration, either use this iterator directly as in <code>*--iter</code>

* or wrap it using std::make_reverse_iterator(iter).

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;

* should be signed if UTF_BEHAVIOR_NEGATIVE

* @tparam behavior How to handle ill-formed Unicode strings

* @tparam UnitIter An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @draft ICU 78

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename UnitIter>

class UTFIterator {

public:

// Constructor with start <= p < limit.

// All of these iterators/pointers should be at code point boundaries.

// Only enabled if UnitIter is a (multi-pass) forward_iterator or better.

// TODO: Should we enable this only for a bidirectional_iterator?

inline UTFIterator(UnitIter start, UnitIter p, UnitIter limit) :

// Constructs an iterator with start=p.

inline UTFIterator(UnitIter p, UnitIter limit) :

// Constructs an iterator start or limit sentinel.

// Requires UnitIter to be copyable.

inline UTFIterator(UnitIter p)

inline UTFIterator(UTFIterator &&src) noexcept = default;

inline UTFIterator &operator=(UTFIterator &&src) noexcept = default;

inline UTFIterator(const UTFIterator &other) = default;

inline UTFIterator &operator=(const UTFIterator &other) = default;

inline bool operator==(const UTFIterator &other) const {

inline bool operator!=(const UTFIterator &other) const { return !operator==(other); }

inline CodeUnits<CP32, UnitIter> operator*() const {

/**

* @return the current decoded subsequence via an opaque proxy object

* so that <code>iter->codePoint()</code> etc. works.

* @draft ICU 78

*/

inline Proxy operator->() const {

inline UTFIterator &operator++() { // pre-increment

/**

* @return a copy of this iterator from before the increment.

* If UnitIter is a single-pass input_iterator, then this function

* returns an opaque proxy object so that <code>*iter++</code> still works.

* @draft ICU 78

*/

inline UTFIterator operator++(int) { // post-increment

// Only enabled if UnitIter is a bidirectional_iterator (including a pointer).

inline UTFIterator &operator--() { // pre-decrement

// Only enabled if UnitIter is a bidirectional_iterator (including a pointer).

inline UTFIterator operator--(int) { // post-decrement

};

/**

* A C++ "range" for validating iteration over all of the code points of a Unicode string.

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;

* should be signed if UTF_BEHAVIOR_NEGATIVE

* @tparam behavior How to handle ill-formed Unicode strings

* @tparam Unit Code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @draft ICU 78

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename Unit>

class UTFStringCodePoints {

public:

/**

* Constructs a C++ "range" object over the code points in the string.

* @draft ICU 78

*/

UTFStringCodePoints(std::basic_string_view<Unit> s) : s(s) {}

/** @draft ICU 78 */

UTFStringCodePoints(const UTFStringCodePoints &other) = default;

/** @draft ICU 78 */

UTFStringCodePoints &operator=(const UTFStringCodePoints &other) = default;

/** @draft ICU 78 */

UTFIterator<CP32, behavior, const Unit *> begin() const {

/** @draft ICU 78 */

UTFIterator<CP32, behavior, const Unit *> end() const {

/**

* @return std::reverse_iterator(end())

* @draft ICU 78

*/

auto rbegin() const {

/**

* @return std::reverse_iterator(begin())

* @draft ICU 78

*/

auto rend() const {

};

/**

* UTFIterator factory function for start <= p < limit.

* Only enabled if UnitIter is a (multi-pass) forward_iterator or better.

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @tparam behavior How to handle ill-formed Unicode strings

* @tparam UnitIter Can usually be omitted/deduced:

* An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @param start start code unit iterator

* @param p current-position code unit iterator

* @param limit limit (exclusive-end) code unit iterator

* @return a UTFIterator<CP32, behavior, UnitIter>

* for the given code unit iterators or character pointers

* @draft ICU 78

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename UnitIter>

auto utfIterator(UnitIter start, UnitIter p, UnitIter limit) {

/**

* UTFIterator factory function for start = p < limit.

* ...

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename UnitIter>

auto utfIterator(UnitIter p, UnitIter limit) {

/**

* UTFIterator factory function for a start or limit sentinel.

* Requires UnitIter to be copyable.

* ...

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename UnitIter>

auto utfIterator(UnitIter p) {

/**

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;

* should be signed if UTF_BEHAVIOR_NEGATIVE

* @tparam behavior How to handle ill-formed Unicode strings

* @tparam StringView Can usually be omitted/deduced: A std::basic_string_view<Unit>

* @param s input string_view

* @return a UTFStringCodePoints<CP32, behavior, Unit>

* for the given std::basic_string_view<Unit>,

* deducing the Unit character type

* @draft ICU 78

*/

template<typename CP32, UTFIllFormedBehavior behavior, typename StringView>

auto utfStringCodePoints(StringView s) {

non-validating

/**

* Non-validating iterator over the code points in a Unicode string.

* The string must be well-formed.

*

* The UnitIter can be

* an input_iterator, a forward_iterator, or a bidirectional_iterator (including a pointer).

* The UTFIterator will have the corresponding iterator_category.

*

* For reverse iteration, either use this iterator directly as in <code>*--iter</code>

* or wrap it using std::make_reverse_iterator(iter).

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @tparam UnitIter An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @draft ICU 78

*/

template<typename CP32, typename UnitIter>

class UnsafeUTFIterator {

public:

inline UnsafeUTFIterator(UnitIter p) : p_(p), units_(0, 0, p) {}

inline UnsafeUTFIterator(UnsafeUTFIterator &&src) noexcept = default;

inline UnsafeUTFIterator &operator=(UnsafeUTFIterator &&src) noexcept = default;

inline UnsafeUTFIterator(const UnsafeUTFIterator &other) = default;

inline UnsafeUTFIterator &operator=(const UnsafeUTFIterator &other) = default;

inline bool operator==(const UnsafeUTFIterator &other) const {

inline bool operator!=(const UnsafeUTFIterator &other) const { return !operator==(other); }

inline UnsafeCodeUnits<CP32, UnitIter> operator*() const {

/**

* @return the current decoded subsequence via an opaque proxy object

* so that <code>iter->codePoint()</code> etc. works.

* @draft ICU 78

*/

inline Proxy operator->() const {

inline UnsafeUTFIterator &operator++() { // pre-increment

/**

* @return a copy of this iterator from before the increment.

* If UnitIter is a single-pass input_iterator, then this function

* returns an opaque proxy object so that <code>*iter++</code> still works.

* @draft ICU 78

*/

inline UnsafeUTFIterator operator++(int) { // post-increment

// Only enabled if UnitIter is a bidirectional_iterator (including a pointer).

inline UnsafeUTFIterator &operator--() { // pre-decrement

// Only enabled if UnitIter is a bidirectional_iterator (including a pointer).

inline UnsafeUTFIterator operator--(int) { // post-decrement

};

/**

* A C++ "range" for non-validating iteration over all of the code points of a Unicode string.

* The string must be well-formed.

*

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @tparam Unit Code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @draft ICU 78

*/

template<typename CP32, typename Unit>

class UnsafeUTFStringCodePoints {

public:

/**

* Constructs a C++ "range" object over the code points in the string.

* @draft ICU 78

*/

UnsafeUTFStringCodePoints(std::basic_string_view<Unit> s) : s(s) {}

/** @draft ICU 78 */

UnsafeUTFStringCodePoints(const UnsafeUTFStringCodePoints &other) = default;

/** @draft ICU 78 */

UnsafeUTFStringCodePoints &operator=(const UnsafeUTFStringCodePoints &other) = default;

/** @draft ICU 78 */

UnsafeUTFIterator<CP32, const Unit *> begin() const {

/** @draft ICU 78 */

UnsafeUTFIterator<CP32, const Unit *> end() const {

/**

* @return std::reverse_iterator(end())

* @draft ICU 78

*/

auto rbegin() const {

/**

* @return std::reverse_iterator(begin())

* @draft ICU 78

*/

auto rend() const {

};

/**

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @tparam UnitIter Can usually be omitted/deduced:

* An iterator (often a pointer) that returns a code unit type:

* UTF-8: char or char8_t or uint8_t;

* UTF-16: char16_t or uint16_t or (on Windows) wchar_t

* @param iter code unit iterator

* @return an UnsafeUTFIterator<CP32, UnitIter>

* for the given code unit iterator or character pointer

* @draft ICU 78

*/

template<typename CP32, typename UnitIter>

auto unsafeUTFIterator(UnitIter iter) {

/**

* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t

* @tparam StringView Can usually be omitted/deduced: A std::basic_string_view<Unit>

* @param s input string_view

* @return an UnsafeUTFStringCodePoints<CP32, Unit>

* for the given std::basic_string_view<Unit>,

* deducing the Unit character type

* @draft ICU 78

*/

template<typename CP32, typename StringView>

auto unsafeUTFStringCodePoints(StringView s) {

} // namespace U_HEADER_ONLY_NAMESPACE
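To illustrate the three UTFIllFormedBehavior values declared at the top of this header, here is a standalone sketch of the value each one might report for an unpaired UTF-16 surrogate. SketchBehavior and valueForUnpairedSurrogate are hypothetical stand-ins so that the code compiles without ICU. The concrete negative value (-1) is an assumption; the proposal only says "a negative value".

```cpp
#include <cstdint>

// Local stand-in for the proposed enum; not the ICU declaration.
enum SketchBehavior { SKETCH_NEGATIVE, SKETCH_FFFD, SKETCH_SURROGATE };

// Value reported for an unpaired UTF-16 surrogate under each behavior.
template<SketchBehavior behavior>
constexpr int32_t valueForUnpairedSurrogate(char16_t unit) {
    if constexpr (behavior == SKETCH_NEGATIVE) {
        return -1;      // assumption: any negative value would satisfy the description
    } else if constexpr (behavior == SKETCH_FFFD) {
        return 0xFFFD;  // U+FFFD Replacement Character
    } else {
        return unit;    // the unpaired surrogate itself
    }
}
```

Because the behavior is a template parameter, the choice is made at compile time and costs nothing at run time. It also shows why the code point type should be signed for the negative behavior.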

Sincerely,

markus

Markus Scherer

Apr 4, 2025, 8:07:53 PM

to icu-d...@unicode.org, Robin Leroy

Update:

Between sending this proposal and the ICU-TC review, Robin and I made the following changes.

With these changes, the TC approved the proposal.

(And requested some non-API-signature changes which I made.)

a) Changed [Unsafe]CodeUnits function data() to begin() and end().

This fits, because they return iterators, not generally pointers.

And it makes the CodeUnits more like an actual C++ "range" of code units.

class UnsafeCodeUnits {

// Post-proposal change:

// Renamed data() to begin() (because it returns a UnitIter which need not be a pointer)

// and added the corresponding end().

/**

* @return the start of the minimal Unicode code unit sequence.

* Only enabled if UnitIter is a (multi-pass) forward_iterator or better.

* @draft ICU 78

*/

UnitIter begin() const { return p; }

/**

* @return the limit (exclusive end) of the minimal Unicode code unit sequence.

* Only enabled if UnitIter is a (multi-pass) forward_iterator or better.

* @draft ICU 78

*/

UnitIter end() const { return limit_; }

b) We changed the return types of the [Unsafe]UTFStringCodePoints begin() and end() functions to opaque "auto", making the exact types implementation details, in case we need to change them later.

class UTFStringCodePoints {

// Post-proposal change: (twice here, twice in UnsafeUTFStringCodePoints)

// Make the begin()/end() return types opaque.

// Returns a UTFIterator<CP32, behavior, UnitIter> where the UnitIter may vary;

// it may be a const Unit * or a basic_string_view<Unit>::iterator.

/** @draft ICU 78 */

auto begin() const {

/** @draft ICU 78 */

auto end() const {

class UnsafeUTFStringCodePoints {

/** @draft ICU 78 */

auto begin() const {

/** @draft ICU 78 */

auto end() const {

markus

Markus Scherer

Apr 4, 2025, 8:11:12 PM

to icu-d...@unicode.org, Robin Leroy

Addendum:

(The TC has not seen this yet.)

While writing more test code, I found two problems that require API additions and changes.

Please let me know if you disagree.

First:

A C++ forward_iterator is required to be default-constructible (have a default constructor).

Therefore, I have added:

class UTFIterator {

/**

* Default constructor. Makes a non-functional iterator.

*

* @draft ICU 78

*/

U_FORCE_INLINE UTFIterator()

class UnsafeUTFIterator {

/**

* Default constructor. Makes a non-functional iterator.

*

* @draft ICU 78

*/

U_FORCE_INLINE UnsafeUTFIterator()
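A small standalone sketch of the requirement follows; SketchIter is a hypothetical wrapper, not the proposed class. The default constructor value-initializes the wrapped code unit iterator, so the result is not usable for iteration, but it satisfies the default-constructible requirement, and two such iterators compare equal.

```cpp
#include <type_traits>

// Hypothetical iterator wrapper; only the constructors and comparisons matter here.
template<typename UnitIter>
class SketchIter {
public:
    SketchIter() : p_{} {}  // default constructor: value-initialized, non-functional
    explicit SketchIter(UnitIter p) : p_(p) {}
    bool operator==(const SketchIter &other) const { return p_ == other.p_; }
    bool operator!=(const SketchIter &other) const { return !(*this == other); }

private:
    UnitIter p_;
};

static_assert(std::is_default_constructible_v<SketchIter<const char16_t *>>,
              "a forward iterator must be default-constructible");
```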

Second:

I found that

template<typename CP32, UTFIllFormedBehavior behavior, typename StringView>

auto utfStringCodePoints(StringView s)

and

template<typename CP32, typename StringView>

auto unsafeUTFStringCodePoints(StringView s)

did not work when the StringView was not literally a std::basic_string_view<CharType>.

(I should have known better...)

In order for these convenience functions to work with other inputs – std::string variants, string literals, UnicodeString – I had to replace each of these two functions with five overloads, without the StringView template parameter:

template<typename CP32, UTFIllFormedBehavior behavior>

auto utfStringCodePoints(std::string_view s) {

template<typename CP32, UTFIllFormedBehavior behavior>

auto utfStringCodePoints(std::u16string_view s) {

template<typename CP32, UTFIllFormedBehavior behavior>

auto utfStringCodePoints(std::u32string_view s) {

template<typename CP32, UTFIllFormedBehavior behavior>

auto utfStringCodePoints(std::u8string_view s) {

template<typename CP32, UTFIllFormedBehavior behavior>

auto utfStringCodePoints(std::wstring_view s) {

and

template<typename CP32>

auto unsafeUTFStringCodePoints(std::string_view s) {

template<typename CP32>

auto unsafeUTFStringCodePoints(std::u16string_view s) {

template<typename CP32>

auto unsafeUTFStringCodePoints(std::u32string_view s) {

template<typename CP32>

auto unsafeUTFStringCodePoints(std::u8string_view s) {

template<typename CP32>

auto unsafeUTFStringCodePoints(std::wstring_view s) {
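The underlying C++ rule can be sketched in a few standalone lines: template argument deduction takes the argument type as-is and does not apply implicit conversions, whereas overloads with concrete string_view parameters do. The names below (IsStringView, deducesStringView, unitsOverloaded) are hypothetical and exist only for this illustration; whether this was the exact failure in the pull request is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>

// Trait: true only for an actual std::basic_string_view<Unit>.
template<typename T>
struct IsStringView : std::false_type {};
template<typename Unit>
struct IsStringView<std::basic_string_view<Unit>> : std::true_type {};

// Deduced form: StringView becomes whatever type the caller passes
// (const char * for a literal, std::string for a string).
template<typename StringView>
bool deducesStringView(StringView) {
    return IsStringView<StringView>::value;
}

// Overloaded form: the concrete parameter types allow implicit conversions
// from string literals and std::basic_string.
inline std::size_t unitsOverloaded(std::string_view s) { return s.size(); }
inline std::size_t unitsOverloaded(std::u16string_view s) { return s.size(); }
```

With the overloads, a call such as unitsOverloaded("abc") converts the literal to std::string_view, while the deduced form sees a const char * and never gets a string_view at all.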

Sincerely,

markus

Robin Leroy

May 21, 2025, 12:17:54 PM

to Markus Scherer, icu-d...@unicode.org

On Tue, Mar 11, 2025 at 01:23, Markus Scherer <marku...@gmail.com> wrote:

/**
* Result of decoding a minimal Unicode code unit sequence.
* Returned from non-validating Unicode string code point iterators.
* Base class for class CodeUnits which is returned from validating iterators.
*
* @tparam CP32 Code point type: UChar32 (=int32_t) or char32_t or uint32_t;
* should be signed if UTF_BEHAVIOR_NEGATIVE
* @tparam UnitIter An iterator (often a pointer) that returns a code unit type:
* UTF-8: char or char8_t or uint8_t;
* UTF-16: char16_t or uint16_t or (on Windows) wchar_t
* @see UnsafeUTFIterator
* @see UnsafeUTFStringCodePoints
* @draft ICU 78
*/
template<typename CP32, typename UnitIter, typename = void>
class UnsafeCodeUnits {
public:

// […]

    /**
     * @return the Unicode code point decoded from the code unit sequence.
     * If the sequence is ill-formed and the iterator validates,
     * then this is a replacement value according to the iterator's
     * UTFIllFormedBehavior template parameter.
     * @draft ICU 78
     */
    UChar32 codePoint() const { return c; }

This should be CP32.

Markus Scherer

May 21, 2025, 12:46:42 PM

to Robin Leroy, icu-d...@unicode.org

Oh my :-(

Yes! Good catch, the whole point of the typename CP32 template parameter is to store and return a type chosen by the call site, to fit with their code.

We need to fix that.
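A minimal standalone sketch of the intended shape, with a hypothetical class name SketchCodeUnits standing in for the real one: the accessor returns the call site's CP32 type rather than a fixed UChar32.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the result class; only the return type matters here.
template<typename CP32>
class SketchCodeUnits {
public:
    explicit SketchCodeUnits(CP32 c) : c_(c) {}
    CP32 codePoint() const { return c_; }  // CP32, not a hard-coded int32_t

private:
    CP32 c_;
};

// The return type follows the template argument chosen by the caller.
static_assert(std::is_same_v<decltype(SketchCodeUnits<char32_t>(U'a').codePoint()), char32_t>);
static_assert(std::is_same_v<decltype(SketchCodeUnits<int32_t>(-1).codePoint()), int32_t>);
```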

tnx

markus

Markus Scherer

Jun 14, 2025, 6:04:22 PM

to Robin Leroy, icu-d...@unicode.org

Dear ICU team & users,

We have another addendum.

Robin has been testing the new iterators with C++20 and C++23. The new iterators require changes in order to work with C++20 range iterators: for example, C++20 iterators are no longer required to define an iterator_category tag, and some ranges use a different “sentinel” type for end() than for the actual iterator. Robin and I worked on it, with Robin doing the heavy lifting on figuring out what the new C++ versions can do and what they expect.

We have a pending pull request with the changes: https://github.com/unicode-org/icu/pull/3499

Note that the changes are basically transparent to C++17 code. In particular, none of the C++17 test code had to change. (See icu4c/source/test/intltest/utfiteratortest.cpp in the pull request.)

The public API changes in the following ways compared to the previous proposal.

Please speak up if you disagree with any of these changes.

a) The validating class UTFIterator gains an additional template type parameter for the code unit iteration limit (sentinel) which defaults to the same type as the code unit start/current iterators.

template<typename CP32, UTFIllFormedBehavior behavior, typename UnitIter>

class UTFIterator {

→

template<typename CP32, UTFIllFormedBehavior behavior,

typename UnitIter, typename LimitIter = UnitIter>

class UTFIterator {

b) The UTFIterator constructor “limit” parameters change type from UnitIter to LimitIter. Same for the utfIterator() function parameters.

Again, these types are either the same (and often deduced), or the LimitIter is a different “sentinel” type.

c) When a code unit input_iterator uses a sentinel of a different type, that code unit sentinel type also works as the sentinel for the code point iterator (the UTFIterator).

This is facilitated by additional, conditional operator==() and operator!=() overloads.

d) class UTFStringCodePoints and the utfStringCodePoints() functions used to only work with one of the string_view types (std::basic_string_view<Unit> for various Unit types).

They now work with any code unit range object, and by reference or by value.

This also means that when a caller has a std::string, the begin() and end() iterators of the per-code-point CodeUnits object are compatible with the input string iterators, rather than the previous version making a string_view for the string and operating on string_view iterators.

e) class UTFStringCodePoints now also has non-const begin() and end(). (Required for some concepts and algorithms.)

f) Instead of multiple utfStringCodePoints() functions, overloaded for different versions of string_view, there is now one function-like object.

We are declaring an @internal “adaptor” class which has a constructor that takes any kind of “range” and has an operator() which returns the appropriate UTFStringCodePoints. When using C++23, the “adaptor” class is-a std::ranges::range_adaptor_closure.

The end effect is that this object can still be used like the function before, but it works with more than just string_view, and it works in C++23 stream pipeline expressions. And since it works with any range, we no longer need 5 overloads for the different string_view types.

z) Almost all of these changes also apply to UnsafeUTFIterator, unsafeUTFIterator(), UnsafeUTFStringCodePoints, and unsafeUTFStringCodePoints.

Exception: class UnsafeUTFIterator does not have “limit” parameters, so it does not get the additional LimitIter template parameter. But it does also work with code unit sentinel types.
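For readers unfamiliar with sentinels (item c above), here is a standalone C++17 sketch of an iterator whose end is marked by a value of a different type. NulSentinel, SketchUnitIter, and countUnits are hypothetical names. The iterator steps one code unit at a time for brevity, so it shows only the comparison overloads, not code point decoding.

```cpp
#include <cstddef>

// Hypothetical sentinel type: marks the end of a NUL-terminated code unit string.
struct NulSentinel {};

// Hypothetical iterator over code units, compared against the sentinel
// instead of against another iterator of the same type.
class SketchUnitIter {
public:
    explicit SketchUnitIter(const char16_t *p) : p_(p) {}
    char16_t operator*() const { return *p_; }
    SketchUnitIter &operator++() { ++p_; return *this; }

    // Additional comparison overloads that accept the sentinel type.
    friend bool operator==(const SketchUnitIter &iter, NulSentinel) { return *iter.p_ == 0; }
    friend bool operator!=(const SketchUnitIter &iter, NulSentinel s) { return !(iter == s); }

private:
    const char16_t *p_;
};

// Counts the code units before the terminating NUL without computing the length first.
inline std::size_t countUnits(const char16_t *s) {
    std::size_t n = 0;
    for (SketchUnitIter iter(s); iter != NulSentinel{}; ++iter) { ++n; }
    return n;
}
```

The loop never needs an end iterator of the same type, which is the situation the additional operator==() and operator!=() overloads are meant to support.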

Sincerely,

markus

Markus Scherer

Jul 16, 2025, 7:30:57 PM

to Robin Leroy, icu-d...@unicode.org

Dear ICU team & users,

Yet another addendum, this one minor, and based on feedback from using an early version of these iterators.

We want to add default constructors to the ...CodePoints classes:

class UTFStringCodePoints

/**

* Constructs an empty C++ "range" object.

* @draft ICU 78

*/

UTFStringCodePoints()

class UnsafeUTFStringCodePoints

/**

* Constructs an empty C++ "range" object.

* @draft ICU 78

*/

UnsafeUTFStringCodePoints()

We also want to add “iterator” typedefs to these two classes, but declare them @internal as for other such definitions:

/** C++ iterator boilerplate @internal */

using iterator = [Unsafe]UTFIterator<...>;

This is sometimes handy, but C++20 views and range adaptors generally do not provide these, and there is a chance that this evolves with experience and with C++ standard changes, so we are not formally declaring this part of the public API.