The C++ Iterator API is terrible. How can we fix it?

1,524 views
Skip to first unread message

Matthew Fioravante

unread,
Oct 10, 2016, 2:11:52 PM10/10/16
to ISO C++ Standard - Future Proposals
I was watching the cppcon 2016 talk “Building and Extending the Iterator Hierarchy in a Modern, Multicore World".

At the very beginning, the speaker asks the audience who likes and who loves iterators. Then he goes on to define iterator categories from scratch. While not much of this is exactly new, it really reminded me of the beauty of the standard algorithms and the iterator categories.

For those who haven't seen the talk, he basically defines iterators using a new syntax with concepts lite.

Iterator:
successor(i) -> returns next iterator
operator==() -> compare 2 iterators

Readable:
source(i) -> reads the value from the iterator.

Writable:
sink(i) -> writes the value to the iterator.

Bidirectional Iterator:
predecessor(i) -> returns the prev iterator

Random Access Iterator:
operator+(i,n) -> increment by N
operator-(i,n) -> decrement by N
operator-(i,j) -> difference between 2 iterators


He then goes on to further with contiguous iterators, segmented iterators, etc..

What I found most interesting is that instead of using the operator syntax we already have, he had to define his own in order to make the concepts he was explaining more clear. Finally I thought back to the first question "Who loves iterators?". When I heard this question I thought to myself that I certainly do not love iterators but as I watched the talk I came to realize why. Iterators themselves are actually great. The problem is that whenever I think about iterators, I think about the horrible confusing pain that comes with implementing them.

To get all the way up to random access iterator, we only need the above functions.

However in real C++ code, to write a compliant iterator you have to write all of this stupid boilerplate such as operator->(), operator++(int), operator--(int). 

For random access, you get to write non-member operator+(i,j) along with member operator+=(j).

The big mistake is that in the beginning we tried to model the iterator API to match the API of pointers. A pointer is an iterator, but an iterator is not a pointer. 

All of these extra functions such as operator +=() and ++(int) are useless. When reading the code, they only serve to obscure the definition of the iterator. For the developer, its more boilerplate functions that have to be unit tested. These functions are notoriously easy to write copy and paste errors. How many of you had a unit test catch a ++ inside of your operator--(int) implementation? Since they are required but not often used, such bugs can easily go unnoticed unless you are rigorous with unit testing.

Think back to your generic code, how useful is iterator post increment really? We even had to invent a rule about not using i++ post increment in for loops because i might be an iterator.

Operator [] for random access iterators is also problematic and pretty much useless. If I'm storing a proxy internally, its much easier to require the external user to increment the iterator (thus incrementing the proxy) and then simply return a reference to the new proxy state using operator*(). With operator[], I have to somehow update some state somewhere and then return a reference to it. Or a return a new proxy that somehow acts like a reference. The whole process is pointless and complicated, especially for a function like operator[] that nobody ever uses with iterators but still has to implement to be technically compliant.

I also find source(i) and sink(i) to be incredibly readable in terms of what happens when you try to read and write to the iterator. operator*() and operator*() const, along with paired operator->() and operator->() const not so much. The requirement of operator->() also greatly complicates using proxies because the result of operator->() needs to be dereferenced to the same object as what is returned by operator*().  

Operator * and -> is also confusing when you have an iterator whose value type is a pointer, as you need to do multiple dereferences. source(i)->value is much cleaner than (*i)->value.

Finally, there is a huge teachability problem here. Generic algorithms are one of the key strengths of C++, but in order to teach them we have torture our students with non-sense about post increment and operator->(). All of this just serves to confuse the core idea behind iterators and generic algorithms.

So now, how can we make things better? I can think of a few solutions:

Solution 1: Add more defaulted operators.

This solution preserves the status quo but makes it less painful. It would also help in developing other mathematical and smart pointer types that use operator overloading. Some of this may already be provided by defaulted comparison operators proposals. Basically it would boil down to something like this:

* If T::operator++() is defined and T is copyable, autogenerate T::operator++(int).
* If T::operator--() is defined and T is copyable, autogenerate T::operator--(int).
* If T::operator*() is defined and returns an lvalue reference, autogenerate T::operator->().
* If T::operator*() const is defined and returns an lvalue reference, autogenerate T::operator->() const.
* If T::operator+=(U) is defined, autogenerate operator+(T, U). (or visa-versa)
* If T::operator-=(U) is defined, autogenerate operator-(T, U). (or visa-versa)
* If T::operator+(T,U) is defined and T::operator*() is defined, autogenerate T::operator[](U) (I could see this being problematic)
* If T::operator+(T,U) is defined and T::operator*() const is defined, autogenerate T::operator[](U) const (I could see this being problematic)
* If operator==(T,U) is defined, autogenerate operator!=(T,U). (or visa-versa)
* If operator<(T,U) is defined, autogenerate operator>=(T,U). (or visa-versa)
* If operator>(T,U) is defined, autogenerate operator<=(T,U). (or visa-versa)
* If operator<(T,U) and operator==(T,U) are defined, autogenerate operator>(T,U)
* If operator>(T,U) and operator==(T,U) are defined, autogenerate operator<(T,U)

Obviously, = delete can be used to suppress the generation of such functions and = default can be used to be explicit that you want the compiler generated defaults.

We still have this what I think are ugly and meaning obscuring operators. But at least we have removed a large implementation burden on the writer and and maintainability burden on the reader.

Solution 2: Use a wrapper template which essentially does solution 1 in the library.

I don't like this solution at all. Its less generic because it forces iterator implementations to inherit from some template. It doesn't help iterator types from other code bases which don't use the wrapper. We also don't gain the benefits of defaulted operators for other use cases such as writing mathematical types.

Solution 3: Completely redefine the iterator API, with implementation support backwards compatibility.

In this solution, we essentially redefine the iterator API based on named free functions, similar to the above and deprecate all of the operators. ADL lookup is used to resolve the function calls. The only operators that would be retained are the comparison operators, using the same autogeneration logic I proposed in solution 1.

In the best of all worlds, I would prefer something like this:

* source(i) - read from iterator i. (Readable concept)
* sink(i) - Return something that can be assigned to write to iterator i. (Writable concept)
* sucessor(i) - Increment the iterator by 1. (Iterator concept)
* operator==(i,j) - Compare equal (Iterator concept, auto generates operator!=() if not explicitly defined)
* predecessor(i) - Decrement the iterator by 1. (Bidir Iterator concept)
* sucessor(i, n) - Increment the iterator by n. (Random Access)
* predecessor(i, n) - Decrement the iterator by n. (Random Access)
* difference(i, j) - Difference between 2 iterators. (Random Access)
* operator<(i, j) - Less than comparison (Random Access Concept, autogenerates <=, >, >= if not explicitly defined).

Thats only 9 functions you need to define to get a random access iterator. Actually in many cases only 7, because I reused the names successor() and predecessor() so that many random access iterator implementations could just write one function with a default argument

RAIterator successor(RAIterator iter, int n = 1) { return RAIterator{ iter.impl + n; }; }


Unfortunately we already have some names reserved in the standard. So in order to reuse them, a more realistic approach might be something like this:

* source(i)
* sink(i) 
* advance(i) - we already have std::advance()
* operator==(i,j) - auto generates operator!=() if not explicitly defined.
* retreat(i) - matches std::advance() name
* next(i, n) - we already have std::next() in C++11
* prev(i, n) - we already have std::prev() in C++11
* difference(i, j) - * Not std::distance()!! *
* operator<(i, j) - autogenerates <=, >, >= if not explicitly defined.

Note that difference() is very different from std::distance(). difference is a required operation of the random access iterator concept, and only defined for random access and above iterators. std::distance() is actually a generic algorithm, and I believe it belongs in the <algorithm> header, not the <iterator> header.


The big question is of course backwards compatibility. This one should actually not be hard and I believe it can be done without waiting for an STL 2.0. 

First, all of the std:: versions of the above should include fallbacks to the old operators. Anyone writing a generic algorithm using the new paradigm can say using std::XYZ; (or using namespace std;) in their function body and automatically add support for legacy iterators. All of the standard library algorithm implementations would need to be migrated to the new paradigm.

In addition, all of the iterator tags in std::iterator_traits can be derived automatically by testing for the existence of these functions using meta programming. Iterator class authors should not need to write any typedefs. I expect iterator_traits eventually goes away once we have concepts.

Tom Honermann

unread,
Oct 10, 2016, 2:33:49 PM10/10/16
to std-pr...@isocpp.org
On 10/10/2016 2:11 PM, Matthew Fioravante wrote:

So now, how can we make things better? I can think of a few solutions:
Have you seen the iterator facade proposal?

http://wg21.link/p0186

Tom.

Matthew Fioravante

unread,
Oct 10, 2016, 2:41:59 PM10/10/16
to ISO C++ Standard - Future Proposals
Also I want to emphasize how difficult this API is to use when you want to create iterators that aren't simple pointer like things that walk through a container.

One time I tried to implement an iterator adapter which took an iterator and a function object. The idea was that whenever you dereferenced the iterator instead of getting the original value x, you would get f(x). 

Using my new proposed API, all you really need to do here is implement decltype(auto) source(i); to return f(source(i)). All of the other operations are the same and you are done. Using the current iterator API however is incredibly difficult because you have to return things that act like lvalue references because we're trying to make accessing our iterators value look like dereferencing a pointer. 

Nicol Bolas

unread,
Oct 10, 2016, 3:46:10 PM10/10/16
to ISO C++ Standard - Future Proposals
Any sort of ground-up rethink about iterators should be handled for a prospective version 2 of the standard library. Trying to shoehorn it into the way we do things now makes things much more complicated. People already have algorithms based on the established paradigm, and there's no way to take a new-paradigm iterator and turn it into an old-paradigm iterator. Well, not automatically.

The two worlds ought to be separate.

Also, let's not forget that we already de-facto adopted the current paradigm iterators into the core language, thanks to range-based for. Which now means you'd have to make a language change in order for new-paradigm iterators to work.

As to the particulars of your design, the function-based iterator isn't a bad idea. But until we have some form of unified call syntax, to deal with our current global-function-to-object problems, I think your design is going to encounter problems. For example, to make this global function-based interface work, you have to either 1) make all members of your iterator type public; 2) have a second, member-function interface for your iterators that your global functions have to call, effectively increasing the amount of code you have to write; or 3) make all of these global interface functions friends, which requires a large list of friend declarations in the class.

None of these are attractive alternatives. The nice thing about the current interface is that it's based off of in-members stuff. Global functions call the in-member stuff automatically.

Also, I can't say I like your function-based iterators in certain ways. They're all based on the assumption that the contents of an iterator are simple and fast to copy. Your `successor` function returns a copy of the iterator; it doesn't increment the iterator in-situ. Whereas the current paradigm's iteration, primarily designed around the use of prefix ++ and --, performs in-situ modification of the iterator. That's going to be more efficient if the iterator is at all significant in size.

Oh, and no iterator design should be considered complete these days without including how it interacts with ranges. Particularly with iterator+sentinel ranges. It seems to me that range-based iterators work out better overall.

I think we can mitigate most of your issues without breaking the world, however. For example, you have problems with having to offer postfix ++ and --. So... change the concept so that we don't have to offer that. It's almost never useful, and you can always get around it by being slightly more verbose.

The problem with your source/sink API is that it doesn't permit the possibility of wanting to do both in the same operation. Or at least, the possibility of wanting to do both without knowing that the object comes from an iterator. For input/output iterators, I like the interface. But for any multi-pass iterator, using an actual object makes using them much easier.

Matthew Fioravante

unread,
Oct 10, 2016, 3:48:42 PM10/10/16
to ISO C++ Standard - Future Proposals
That's essentially solution 2 of my proposal. 

In my opinion, iterator_facade only makes the problem worse. Now we have 2 different API's, one for writing iterators and one for using iterators. You just increased the mental overhead for everyone because now we have to learn both and the mapping between them.

Before we go and add another layer on top of this house of cards, lets step back and think about how we might implement iterators if we were doing it from scratch today. I don't think the solution today would be all of these complicated operator overloads + iterator_facade.

How will we teach all of these levels of nonsense to students? None of it has anything to do with the core concepts of generic algorithms and iterators. Its all based on the mistaken (with all respect, these problems are hard) design decisions of the past. If a professor has to spend more time teaching students implementation details of how to write and use iterators in C++ than he does teaching the idea of generic algorithms then C++ has failed. Not only that, he would be very wise to teach generic algorithms using a different language that doesn't have all of this implementation overhead.

I'm also not sure if iterator_facade lets me have iterators which when accessed return by value instead of by reference. Like in my iterator function example which could perform an arbitrary transformation on the value it returns. Finally, what are the implications of all of this on ranges and range adapters?

 

Tom Honermann

unread,
Oct 10, 2016, 4:09:09 PM10/10/16
to std-pr...@isocpp.org
On 10/10/2016 3:48 PM, Matthew Fioravante wrote:
> I'm also not sure if iterator_facade lets me have iterators which when
> accessed return by value instead of by reference. Like in my iterator
> function example which could perform an arbitrary transformation on
> the value it returns.
I think it does. The requirements on read() are that it return auto&&.

See also: http://wg21.link/P0022

Tom.

Matthew Fioravante

unread,
Oct 10, 2016, 4:17:51 PM10/10/16
to ISO C++ Standard - Future Proposals


On Monday, October 10, 2016 at 2:46:10 PM UTC-5, Nicol Bolas wrote:
Any sort of ground-up rethink about iterators should be handled for a prospective version 2 of the standard library. Trying to shoehorn it into the way we do things now makes things much more complicated.

 
People already have algorithms based on the established paradigm, and there's no way to take a new-paradigm iterator and turn it into an old-paradigm iterator. Well, not automatically.


This is true, however I think that's the price you pay by using the new paradigm. Just like if you want to use C++X you can only compile against headers which compile under C++X and libraries that are C++X ABI compatible.

The key necessary feature is backwards compatibility in the algorithms, so old code using old iterators continue to work with updated libraries. Its very rare that you can achieve perfect backwards compatibility which such a major change like this but here I believe its possible.
 

The two worlds ought to be separate.

Also, let's not forget that we already de-facto adopted the current paradigm iterators into the core language, thanks to range-based for. Which now means you'd have to make a language change in order for new-paradigm iterators to work.

Makes the proposal a bit heavier than a strictly library proposal but not impossible. Also if we wait for STL2 to revamp iterators, this issue will still exist. Range for doesn't know whether you're using STL1 or STL2 iterators. There would have to be some kind of compatibility there.
 

As to the particulars of your design, the function-based iterator isn't a bad idea. But until we have some form of unified call syntax, to deal with our current global-function-to-object problems, I think your design is going to encounter problems. For example, to make this global function-based interface work, you have to either 1) make all members of your iterator type public; 2) have a second, member-function interface for your iterators that your global functions have to call, effectively increasing the amount of code you have to write; or 3) make all of these global interface functions friends, which requires a large list of friend declarations in the class.

None of these are attractive alternatives. The nice thing about the current interface is that it's based off of in-members stuff. Global functions call the in-member stuff automatically.

This is a valid concern. I would probably go with friend functions but unified call syntax would solve the problem nicely.

Another option is having std::next() etc.. delegate to member next() if it exists, and then everyone has to say using std::next; in their algorithms. I don't like this at all because its easy to forget and makes one liner expressions impossible. We already have this problem with swap().
 

Also, I can't say I like your function-based iterators in certain ways. They're all based on the assumption that the contents of an iterator are simple and fast to copy. Your `successor` function returns a copy of the iterator; it doesn't increment the iterator in-situ. Whereas the current paradigm's iteration, primarily designed around the use of prefix ++ and --, performs in-situ modification of the iterator. That's going to be more efficient if the iterator is at all significant in size.

Iterators are supposed to be designed to be cheap copyable value types. Thats one aspect of pointers which I think is wise to leverage because it makes iterators much easier to reason about and orders of magnitude easier to use when you can assume they are simple value types.

All of the standard algorithms take and return iterators by value instead of by reference. begin(), end(), etc.. return copies. We also have std::advance(), std::next(), and std::prev() already in the standard and they all return copies by value.

I don't think fast in-situ modification is a desirable property for iterator design and the standard library is certainly going to be your enemy if you're depending on it for performance. Finally, the whole concept of generic algorithms and iterators is based on inlining, which also means we should expect those copies to be inlined away.
 

Oh, and no iterator design should be considered complete these days without including how it interacts with ranges. Particularly with iterator+sentinel ranges. It seems to me that range-based iterators work out better overall.

My understanding is ranges are essentially a pair of objects. Either 2 iterators or (iterator, sentinel). Having iterators that can "dereference" without lvalue semantics enables a whole host of interesting applications with ranges. Want to iterate over a list of squared integers? Thats easy, just write an iterator adapter which squares the value of the original integer iterator and returns it by value. With reference semantics you'd have to store the squared value in the iterator and return it by reference, which is a pessimization if that iterator object crosses inline boundries and can't be optimized out entirely by the compiler. Or maybe you could do it with some horrible proxy that pretends to be a reference.

Matthew Fioravante

unread,
Oct 10, 2016, 4:40:26 PM10/10/16
to ISO C++ Standard - Future Proposals


On Monday, October 10, 2016 at 2:46:10 PM UTC-5, Nicol Bolas wrote:
I think we can mitigate most of your issues without breaking the world, however. For example, you have problems with having to offer postfix ++ and --. So... change the concept so that we don't have to offer that. It's almost never useful, and you can always get around it by being slightly more verbose.

Sure, and by god please also take out operator[].

That being said I think post increment and post decrement adding default operators as suggested in solution 1 is better. ++i and i++ are pretty much ubiquitous in C and C++. I think it would be rather strange to have a type able to ++t but not t++. Again all of the complication here comes from trying to be clever with operator overloading in the initial design. If it was named something other than operator++(), we wouldn't need a post version at all.
 
This also doesn't solve the need for both redundant T::operator+=(U) and operator+(T,U). I don't think we can remove one of those from the concept in favor of the other. Again if it were named something other than +, wouldn't have an issue here.


The problem with your source/sink API is that it doesn't permit the possibility of wanting to do both in the same operation. Or at least, the possibility of wanting to do both without knowing that the object comes from an iterator. For input/output iterators, I like the interface. But for any multi-pass iterator, using an actual object makes using them much easier.

That's true, but it does make strictly input  and output iterators more explicit with what they can do.

Non-const operator*() doesn't differentiate between read-write and write-only. Also when you actually ever do both a read and write in a single expression? If you do, is it clear which operation happens first?

Nicol Bolas

unread,
Oct 10, 2016, 5:21:00 PM10/10/16
to ISO C++ Standard - Future Proposals
On Monday, October 10, 2016 at 4:17:51 PM UTC-4, Matthew Fioravante wrote:
On Monday, October 10, 2016 at 2:46:10 PM UTC-5, Nicol Bolas wrote:
Also, I can't say I like your function-based iterators in certain ways. They're all based on the assumption that the contents of an iterator are simple and fast to copy. Your `successor` function returns a copy of the iterator; it doesn't increment the iterator in-situ. Whereas the current paradigm's iteration, primarily designed around the use of prefix ++ and --, performs in-situ modification of the iterator. That's going to be more efficient if the iterator is at all significant in size.

Iterators are supposed to be designed to be cheap copyable value types.

And the more range adapters you apply, the less cheap such copying gets.

Thats one aspect of pointers which I think is wise to leverage because it makes iterators much easier to reason about and orders of magnitude easier to use when you can assume they are simple value types.

All of the standard algorithms take and return iterators by value instead of by reference. begin(), end(), etc.. return copies. We also have std::advance(), std::next(), and std::prev() already in the standard and they all return copies by value.

But those are not the primary iterator interfaces. They're helper functions. The primary, lowest-level interfaces are defined by the Iterator concepts, not these global functions.

Oh, and no iterator design should be considered complete these days without including how it interacts with ranges. Particularly with iterator+sentinel ranges. It seems to me that range-based iterators work out better overall.

My understanding is ranges are essentially a pair of objects.

Perhaps you should look at some of the Range adapters from Ranges v3. Also, there are the range helper types that make it almost trivially easy to construct ranges. This includes the iterators they depend on.

Ranges are far more than just a `pair<iterator, iterator>`.

Either 2 iterators or (iterator, sentinel). Having iterators that can "dereference" without lvalue semantics enables a whole host of interesting applications with ranges. Want to iterate over a list of squared integers? Thats easy, just write an iterator adapter which squares the value of the original integer iterator and returns it by value. With reference semantics you'd have to store the squared value in the iterator and return it by reference, which is a pessimization if that iterator object crosses inline boundries and can't be optimized out entirely by the compiler. Or maybe you could do it with some horrible proxy that pretends to be a reference.

If you're making a range adapter that squares the stored value, then you're making an immutable range adapter. Which means that your iterators will effectively be const-iterators.

I believe `*i` returning by value is permitted.

Matthew Woehlke

unread,
Oct 12, 2016, 4:12:15 PM10/12/16
to std-pr...@isocpp.org
On 2016-10-10 16:40, Matthew Fioravante wrote:
> Non-const operator*() doesn't differentiate between read-write and
> write-only. Also when you actually ever do both a read and write in a
> single expression? If you do, is it clear which operation happens first?

auto x = ...;
swap(*iter, x);

...or:

*iter += 5;

...or:

mutate(*iter);

None of those seem unclear to me.

--
Matthew

Reply all
Reply to author
Forward
Message has been deleted
0 new messages