proposal to add vector::push_back_()

1420 views
Skip to first unread message

Victor....@ncl.ac.uk

unread,
Sep 6, 2013, 6:05:03 PM9/6/13
to std-pr...@isocpp.org
[I suspect I am re-inventing the wheel, but a quick search of this forum didn't yield any similar proposal.]
 
vector::push_back() is inefficient, as every time it checks whether the vector has reached its full capacity and needs to allocate more memory. Such a check is often unnecessary, e.g. if reserve() was called, see the example below. Hence, the proposal is to add vector::push_back_() which skips this check. Calling it for a full vector results in undefined behaviour (consistent with operator[] which does not check that the index is valid). Of course, an assertion checking the capacity should be used in push_back_() to simplify debugging, but the optimised code has no overhead.
 
---------
Example: The vector class is often used like this:
 
vector<Widget> vw;
vw.reserve(...);
for(...){
    vw.push_back(...); // inefficient due to unnecessary capacity checks; push_back_() should be used instead
}
 
The capacity check is unnecessary in this example and leads to loss of performance (code like this is likely to be in a tight internal loop). Furthermore, unnecessary overhead cannot be removed, e.g. replacing this code by
 
vector<Widget> vw(...);
for(...){
    vw[...]=...;
}
 
does not solve the problem, as now the default constructor of Widget is called many times, only for the created objects to be immediately overwritten in the loop.
---------
 
Of course, the same argument works for emplace_back_(). One might also consider adding back_inserter_ to further exploit the new functionality, e.g.:
 
vector<Widget> vw;
vw.reserve(...);
transform(...,back_inserter_(vw),...);
 
 

David Krauss

unread,
Sep 6, 2013, 10:34:31 PM9/6/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk
In principle it's a valid problem, such checks can adversely affect inner loops, but the solution is way too specialized and obtuse.

How about a means to pass hints to the allocator, maybe a dispatch tag indicating to forward arguments from push_*, insert_*, emplace_*, and constructors to allocator::allocate? This would in general enable dumb allocators where the client can specify where allocations go, and when they are allowed. It would be a powerful interface.

the.ultim...@gmail.com

unread,
Sep 7, 2013, 4:59:45 AM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk
Hello Victor,

 
vector::push_back() is inefficient, as every time it checks whether the vector has reached its full capacity and needs to allocate more memory. Such a check is often unnecessary, e.g. if reserve() was called, see the example below.

I don't deny that there is a textual check at every push_back(). Do you have measurements on the acutal impact on execution? My theory is, if you reserved enough memory, the check should pass everytime and branch prediction would make your insertions throttle full speed.

Laurent

David Krauss

unread,
Sep 7, 2013, 7:10:25 AM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk

I can vouch that it makes a difference. For frequent enough insertions, the cost comes not from mispredicted branches but the integer instructions to fetch and compare the capacity pointer.

The workaround is to not use push_back, but IMHO it should be the right tool for the job.

Ville Voutilainen

unread,
Sep 7, 2013, 7:57:45 AM9/7/13
to std-pr...@isocpp.org
Well, if I don't want to resize the vector (and construct objects), I can't see what work-around would
allow avoiding push_back. A vector of optionals would allow the resize without constructing
the underlying objects, and then I could just use operator[].

the.ultim...@gmail.com

unread,
Sep 7, 2013, 8:31:22 AM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk

I can vouch that it makes a difference. For frequent enough insertions, the cost comes not from mispredicted branches but the integer instructions to fetch and compare the capacity pointer.

The workaround is to not use push_back, but IMHO it should be the right tool for the job.

And we're back at the von Neumann bottleneck. Wouldn't a cleaner way to work around this be to leave the control flow of the filling to the vector itself?

My thought on this isn't complete yet, but now that we can write function objects more easily, wouldn't the right approach be to generalize the following "fill" constructor:

vector (size_type n, const value_type& val,
 
const allocator_type& alloc = allocator_type());

Into
:

template<typename Generator>
vector
(size_type n, Generator gen,
 
const allocator_type& alloc = allocator_type());

If we then had a way to write the algorithm as a coroutine, the vector would determine its capacity only once and let the filling flow.

Maurice Bos

unread,
Sep 7, 2013, 8:40:08 AM9/7/13
to std-pr...@isocpp.org
Shouldn't this be a task of the compiler? (To inline push_back and optimize away the 'if (size == capacity)' check if it knows that 'size < capacity' because of a previous .reserve().)

--
 
---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposal...@isocpp.org.
To post to this group, send email to std-pr...@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.

DeadMG

unread,
Sep 7, 2013, 9:45:35 AM9/7/13
to std-pr...@isocpp.org
You should refactor your loop into a range, and construct the vector directly from that. This would resolve the issue.

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 2:12:24 PM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk

On Saturday, September 7, 2013 3:34:31 AM UTC+1, David Krauss wrote:
In principle it's a valid problem, such checks can adversely affect inner loops, but the solution is way too specialized and obtuse.
 
How about a means to pass hints to the allocator, maybe a dispatch tag indicating to forward arguments from push_*, insert_*, emplace_*, and constructors to allocator::allocate? This would in general enable dumb allocators where the client can specify where allocations go, and when they are allowed. It would be a powerful interface.
 
If we modify vector class anyway, such a solution is more heavy-weight. Also difficult to teach. In fact, I'm not convinced it would actually give any speedup in practice.

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 2:19:03 PM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk
Just to add: the implementation of push_back_() would be something like
 
*ptr_end++=parameter;
 
and so almost certainly inlined. In contrast, push_back() is unlikely to get inlined, as it does some complicated stuff, like reallocation and handling the ugly special case like v.push_back(v[i]).
 
 

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 2:26:21 PM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk
I believe this will not result in any speedup, as the generator will have internal state, and will need extra calls (unless fully inlined). Moreover, it is less convinient to use - there are generate and generate_n functions in <algorithm> which do essentially the same, and in my experience they are rarely used.
 

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 2:32:48 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 1:40:08 PM UTC+1, Maurice Bos wrote:
Shouldn't this be a task of the compiler? (To inline push_back and optimize away the 'if (size == capacity)' check if it knows that 'size < capacity' because of a previous .reserve().)
 
 
As I said before, push_back is quite complicated, so unlikely to be inlined. Even it it is, the compiler has to be very clever to find this optimisation. In some situations, like inner loop or critical section, the programmer would want to ensure that this optimisation has been applied.

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 2:36:12 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 2:45:35 PM UTC+1, DeadMG wrote:
You should refactor your loop into a range, and construct the vector directly from that. This would resolve the issue.
 
Or, much easier, allocate raw memory using malloc, and work with it.

DeadMG

unread,
Sep 7, 2013, 2:39:14 PM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk
It would unfortunately be much easier because ranges are so irritating to construct. This, however, is immaterial to the fact that the other solutions are a disgusting hack which is only an approximation to the real solution.

Magnus Fromreide

unread,
Sep 7, 2013, 2:58:51 PM9/7/13
to std-pr...@isocpp.org
I would expect the implementetion of push_back to be something along the
line of

inline void vector<T>::push_back(const T& parameter) {
if (ptr_end == ptr_capacity)
grow();
allocator.construct(ptr_end, parameter);
++ptr_end;
}

where grow() grows the vector if needed. Here I don't expect grow() to
be fully inlined but I do not see why the rest couldn't be inlined.

Note that the your push_back_ also needs to do

allocator.construct(ptr_end, parameter);
++ptr_end; // Increment after as construct could throw

since I expect the memory from ptr_end to ptr_capacity to be
uninitialized, so it really isn't that much simpler.

/MF


Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 3:03:15 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 12:57:45 PM UTC+1, Ville Voutilainen wrote:

Well, if I don't want to resize the vector (and construct objects), I can't see what work-around would
allow avoiding push_back. A vector of optionals would allow the resize without constructing
the underlying objects, and then I could just use operator[].
 
Doesn't solve the problem: still many optionals are constructed, only to be immediately overwritten.

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 3:09:47 PM9/7/13
to std-pr...@isocpp.org
Ok, point taken about allocator.construct.
 
However, your implementation of puch_back has a subtle bug: it has undefined behaviour for calls like v.push_back(v[0]) when v is full. Hence, you need to add some code to handle this special case. This complicates the implementation and makes it less likely to be inlined.

Ville Voutilainen

unread,
Sep 7, 2013, 3:14:17 PM9/7/13
to std-pr...@isocpp.org
Assuming that the Widgets in your original example are costly to construct and overwrite,
the optional<Widget>s are not costly to construct.

Magnus Fromreide

unread,
Sep 7, 2013, 3:21:58 PM9/7/13
to std-pr...@isocpp.org
Ok, point taken about the v[i] case, but the whole insertion in the
grow() case could be passed of to another method (grow(parameter)) since
that is a problem only when a reallocation happens so there still is no
reason this would prevent inlining of the non-growing case.

/MF

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 3:41:28 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 8:21:58 PM UTC+1, Magnus Fromreide wrote:
Ok, point taken about the v[i] case, but the whole insertion in the
grow() case could be passed of to another method (grow(parameter)) since
that is a problem only when a reallocation happens so there still is no
reason this would prevent inlining of the non-growing case.

/MF

 
Ok, maybe one can achieve inlining for push_bask, in principle at least. Below is an actual implementation from VS2012. I find it difficult to believe it's inlined...

 

void push_back(value_type&& _Val)
 { // insert by moving into element at end
 if (_Inside(_STD addressof(_Val)))
  { // push back an element
  size_type _Idx = _STD addressof(_Val) - this->_Myfirst;
  if (this->_Mylast == this->_Myend)
   _Reserve(1);
  _Orphan_range(this->_Mylast, this->_Mylast);
  this->_Getal().construct(this->_Mylast,
   _STD forward<value_type>(this->_Myfirst[_Idx]));
  ++this->_Mylast;
  }
 else
  { // push back a non-element
  if (this->_Mylast == this->_Myend)
   _Reserve(1);
  _Orphan_range(this->_Mylast, this->_Mylast);
  this->_Getal().construct(this->_Mylast,
   _STD forward<value_type>(_Val));
  ++this->_Mylast;
  }
 }

 

 

 
 

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 3:50:33 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 8:14:17 PM UTC+1, Ville Voutilainen wrote:

Assuming that the Widgets in your original example are costly to construct and overwrite,
the optional<Widget>s are not costly to construct.
 
Sure - this can work in some circumstances. However, we are talking about saving a couple of processor instructions here, so "costly" should be viewed from this perspective. Also, that extra flag in optional<> can increase the element size, spoil the alignment, and result in something that's actually slower and takes more memory. The bottom line: occasionally it works, but it's not a general solution to the problem.

Felipe Magno de Almeida

unread,
Sep 7, 2013, 3:54:30 PM9/7/13
to std-pr...@isocpp.org
On Fri, Sep 6, 2013 at 7:05 PM, <Victor....@ncl.ac.uk> wrote:
> [I suspect I am re-inventing the wheel, but a quick search of this forum
> didn't yield any similar proposal.]
>
> vector::push_back() is inefficient, as every time it checks whether the
> vector has reached its full capacity and needs to allocate more memory. Such
> a check is often unnecessary, e.g. if reserve() was called, see the example
> below. Hence, the proposal is to add vector::push_back_() which skips this
> check. Calling it for a full vector results in undefined behaviour
> (consistent with operator[] which does not check that the index is valid).
> Of course, an assertion checking the capacity should be used in push_back_()
> to simplify debugging, but the optimised code has no overhead.

I think a elegant solution for this (and a lot of other similar)
problem would be
something like this:

void foo(std::vector<char>& v)
{
assume(v.reserve() > v.size());
v.push_back('a'); // the compiler would do dead-code elimination and
then inline
}

[snip - example]

Regards,
--
Felipe Magno de Almeida

Ville Voutilainen

unread,
Sep 7, 2013, 3:58:50 PM9/7/13
to std-pr...@isocpp.org
True enough. I think this proposal has quite decent merit. I can't really work around it otherwise than postponing
the construction with optional, so it looks like a change to vector is what's necessary. If benchmarks show
that it's necessary, that is. ;)

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 4:28:16 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 8:54:30 PM UTC+1, Felipe Almeida wrote:

I think a elegant solution for this (and a lot of other similar)
problem would be
something like this:

void foo(std::vector<char>& v)
{
  assume(v.reserve() > v.size());
  v.push_back('a'); // the compiler would do dead-code elimination and
then inline
}
 
This is an interesting idea, but not quite perfect:
 * in a critical section or a tight loop I'd positively want to make sure that the optimisation has occured - but "assume" does not give any positive guarantees
 * it would be non-trivial for the compiler to do this optimisation for the following reasons:
    - the user can only refer to public methods like capacity() and size() in the "assume" clause while the implementation of puch_back would probably refer to private data directly, and the compiler will have to figure out the relashionship between these.
    - there is a special case in push_back that is handled separately: v.push_back(v[0]); you need to say to the compiler somehow that if the allocation is not required then this case should not be treated as special.

Magnus Fromreide

unread,
Sep 7, 2013, 4:36:27 PM9/7/13
to std-pr...@isocpp.org
That is a quality of implementation issue but I would measure rather
than guess regarding optimizations of standard library functions since
optimizers can be surprisingly good.

If one looks at GCC instead the implementation is

void
push_back(value_type&& __x)
{ emplace_back(std::move(__x)); }

template<typename _Tp, typename _Alloc>
template<typename... _Args>
void
vector<_Tp, _Alloc>::
emplace_back(_Args&&... __args)
{
if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)
{
_Alloc_traits::construct(this->_M_impl, this->_M_impl._M_finish,
std::forward<_Args>(__args)...);
++this->_M_impl._M_finish;
}
else
_M_emplace_back_aux(std::forward<_Args>(__args)...);
}



Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 4:41:08 PM9/7/13
to std-pr...@isocpp.org, Victor....@ncl.ac.uk

On Saturday, September 7, 2013 7:39:14 PM UTC+1, DeadMG wrote:
It would unfortunately be much easier because ranges are so irritating to construct. This, however, is immaterial to the fact that the other solutions are a disgusting hack which is only an approximation to the real solution.
 
By that logic, vector::operator[] is also a "disgusting hack" as it results in indefined behaviour if the index is out of range, instead of throwing an exception. However, the committee felt that efficiency was more important here than protecting the user from shooting themselves in a foot. The C++ philosophy is not to protect programmers from themselves, but to offer a choice - purists can still use the old good push_back. However, in some situation I do want maximum efficiency.
 
As I said in the original posting, any reasonable STL implementation would add an assertion to push_back_ to check that the assumption on the capacity actually holds - this would make debugging such violations as easy as debugging out-of[range problems.

Nevin Liber

unread,
Sep 7, 2013, 4:49:20 PM9/7/13
to std-pr...@isocpp.org
Why is that check any cheaper than the check for push_back to see if
the vector should grow?

--
Nevin ":-)" Liber <mailto:ne...@eviloverlord.com> (312) 623-5420

Howard Hinnant

unread,
Sep 7, 2013, 4:50:51 PM9/7/13
to std-pr...@isocpp.org
On Sep 7, 2013, at 3:41 PM, Victor....@ncl.ac.uk wrote:

> Ok, maybe one can achieve inlining for push_bask, in principle at least. Below is an actual implementation from VS2012. I find it difficult to believe it's inlined...
>
> void push_back(value_type&& _Val)
> { // insert by moving into element at end
> if (_Inside(_STD addressof(_Val)))
> { // push back an element
> size_type _Idx = _STD addressof(_Val) - this->_Myfirst;
> if (this->_Mylast == this->_Myend)
> _Reserve(1);
> _Orphan_range(this->_Mylast, this->_Mylast);
> this->_Getal().construct(this->_Mylast,
> _STD forward<value_type>(this->_Myfirst[_Idx]));
> ++this->_Mylast;
> }
> else
> { // push back a non-element
> if (this->_Mylast == this->_Myend)
> _Reserve(1);
> _Orphan_range(this->_Mylast, this->_Mylast);
> this->_Getal().construct(this->_Mylast,
> _STD forward<value_type>(_Val));
> ++this->_Mylast;
> }
> }

I recommend you stick to quoting open sourced implementations. Here's one where you won't get in trouble:

http://libcxx.llvm.org

template <class _Tp, class _Allocator>
_LIBCPP_INLINE_VISIBILITY inline
void
vector<_Tp, _Allocator>::push_back(const_reference __x)
{
if (this->__end_ != this->__end_cap())
{
__alloc_traits::construct(this->__alloc(),
_VSTD::__to_raw_pointer(this->__end_), __x);
++this->__end_;
}
else
__push_back_slow_path(__x);
}

As others have alluded to, if push_back, and reserve+push_back is too expensive for you, then you are in to a specialty niche situation, and you should be prepared to do some extra work.

1. You need to tell the vector up front how many elements there are going to be.

2. You need to create a Widget generator which the function can use to construct widgets with.

3. Cast the generator in the form of a random access iterator. Random access is important as that is how the vector will get your up-front size statement. I strongly suspect you can safely ignore the requirement that a random access iterator must return a Widget& from operator*(). vector doesn't need or use that requirement. And it is a dumb requirement anyway that we should change.

Here's a sketch:

#include <vector>

class Widget
{
int data_;
public:
explicit Widget(int data) : data_(data) {}
};

class generator_Widget
{
int count_;
int state_;

public:
typedef std::random_access_iterator_tag iterator_category;
typedef Widget value_type;
typedef std::ptrdiff_t difference_type;
typedef const Widget* pointer;
typedef Widget reference;

explicit generator_Widget(int count, int state = 0)
: count_(count), state_(state) {}

reference operator*() const {return Widget(state_);}
pointer operator->() const {return nullptr;}

generator_Widget& operator++() {++count_; --state_; return *this;}
generator_Widget operator++(int)
{generator_Widget tmp(*this); ++(*this); return tmp;}

generator_Widget& operator--() {--count_; ++state_; return *this;}
generator_Widget operator--(int)
{generator_Widget tmp(*this); --(*this); return tmp;}

generator_Widget& operator+=(difference_type n) {count_ += n; state_ -= n; return *this;}
generator_Widget operator+(difference_type n) const
{generator_Widget tmp(*this); tmp += n; return tmp;}
friend generator_Widget operator+(difference_type n, generator_Widget x)
{x += n; return x;}
generator_Widget& operator-=(difference_type n) {return *this += -n;}
generator_Widget operator-(difference_type n) const
{generator_Widget tmp(*this); tmp -= n; return tmp;}

reference operator[](difference_type n) const {return *(*this + n);}

friend
bool
operator==(const generator_Widget& x, const generator_Widget& y)
{return x.count_ == y.count_;}

friend
bool
operator!=(const generator_Widget& x, const generator_Widget& y)
{return !(x == y);}

friend
bool
operator<(const generator_Widget& x, const generator_Widget& y)
{return x.count_ < y.count_;}

friend
bool
operator>(const generator_Widget& x, const generator_Widget& y)
{return y < x;}

friend
bool
operator<=(const generator_Widget& x, const generator_Widget& y)
{return !(y < x);}

friend
bool
operator>=(const generator_Widget& x, const generator_Widget& y)
{return !(x < y);}

friend
difference_type
operator-(const generator_Widget& x, const generator_Widget& y)
{return x.count_ - y.count_;}

};

int
main()
{
std::vector<Widget> v(generator_Widget(0, 10), generator_Widget(10));
}

Yes, its a pain to write all this boilerplate. Maybe you can generalize the iterator/generator so you only have to write it once. Or maybe somebody (boost?) has already written it for you. But it isn't going to get much faster than this. The vector constructor will subtract the pair of random access iterators in its constructor and allocate the right amount of memory. And then it will iterate through the range constructing each Widget from the dereferenced iterator.

Another technique you could use is to give the vector an allocator whose construct member is a no-op:

template <class U, class... Args>
void
construct(U* p, Args&&...)
{
}


Then construct your vector:

vector<Widget, my_alloc<Widget>> v(N); // allocate but don't construct the Widgets

And then loop through your vector and actually construct each Widget:

for (auto& w : v)
::new(&w) Widget(...);

You'll need to watch out for exception safety on this one. If any of your Widget constructors throws, then ~Widget() better be able to cope with an unspecified Widget state.

Neither of these techniques is very elegant. But they are both workable. They seem to me a better approach than adding an extremely error-prone member function to vector.

If we're going to further increase the performance of vector, the next most effective step to take is to teach allocators how to expand existing allocations in place (realloc without the possibility of an address change), and teach containers how to use such allocators. Doing this can produce as much as a 2X performance improvement compared to an un-reserved push_back loop. However even this at-best-2X improvement was not sufficiently enticing for the committee in the past. I have no reason to think it will be in the future.

Howard

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 4:59:32 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 9:49:20 PM UTC+1, Nevin ":-)" Liber wrote:
Why is that check any cheaper than the check for push_back to see if
the vector should grow?
Because assertion are only active in the debug mode, and are removed in the release build.

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 5:12:44 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 9:36:27 PM UTC+1, Magnus Fromreide wrote:
That is a quality of implementation issue but I would measure rather
than guess regarding optimizations of standard library functions since
optimizers can be surprisingly good.
 
 
I wouldn't overestimate the intellect of optimisers, especially when pointers are involved...
 
In any case, there is no positive assurance that the optimisation is applied. This is also unstable - minor changes in the code can upset it. Moreover, it is easy enought to confuse the optimiser in practice. E.g. imagine several threads compute some good stuff and push their results into a vector, protecting puch_back calls by a lock. The user knows the upper bound on the total number of results, and so can reserve memory in advance. However, the code executed by the threads is far away from the call to reserve(), so the compiler is powerless to optimise the critical section.
 
 
If one looks at GCC instead the implementation is
 
Ok, can actually be inlined if you're lucky with your STL vendor :-)

Victor....@ncl.ac.uk

unread,
Sep 7, 2013, 5:33:10 PM9/7/13
to std-pr...@isocpp.org

On Saturday, September 7, 2013 9:50:51 PM UTC+1, Howard Hinnant wrote:
If we're going to further increase the performance of vector, the next most effective step to take is to teach allocators how to expand existing allocations in place (realloc without the possibility of an address change), and teach containers how to use such allocators.  Doing this can produce as much as a 2X performance improvement compared to an un-reserved push_back loop.  However even this at-best-2X improvement was not sufficiently enticing for the committee in the past.  I have no reason to think it will be in the future.

Howard
 
The generator solution is unelegant, I think I'd rather use malloc and manage memory myself than doing anything like this. It is also of limited use compared to push_back_ - e.g. imagine several threads push their results into a vector, protecting puch_back_ calls by a lock; this is difficult to reformulate as generators.
 
The allocator solution is interesting, and would probably work, but I believe it is no less "error-prone" than push_back_. Do you really want to recommend it? I think push_back_ is no more difficult / error-prone than operator[], so the users can cope with it.

Howard Hinnant

unread,
Sep 7, 2013, 5:45:32 PM9/7/13
to std-pr...@isocpp.org
On Sep 7, 2013, at 5:33 PM, Victor....@ncl.ac.uk wrote:

> The generator solution is unelegant, I think I'd rather use malloc and manage memory myself than doing anything like this.

That is certainly another possibility.

> It is also of limited use compared to push_back_ - e.g. imagine several threads push their results into a vector, protecting puch_back_ calls by a lock; this is difficult to reformulate as generators.

And the generator method would also be unnecessary. The cost of the lock would put the cost of the one if-statement you're trying to avoid into the noise.

>
> The allocator solution is interesting, and would probably work, but I believe it is no less "error-prone" than push_back_. Do you really want to recommend it? I think push_back_ is no more difficult / error-prone than operator[], so the users can cope with it.

The no-construct allocator technique is very error prone. The one advantage it has over push_back_ is that it is in C++11 and later, and push_back_ isn't.

Howard

Howard Hinnant

unread,
Sep 7, 2013, 7:45:19 PM9/7/13
to std-pr...@isocpp.org
On Sep 7, 2013, at 4:50 PM, Howard Hinnant <howard....@gmail.com> wrote:

> Here's a sketch:
>
> #include <vector>
>
> class Widget
> {
> int data_;
> public:
> explicit Widget(int data) : data_(data) {}
> };
>
> class generator_Widget
> {

Fwiw, I decided it was time to see some real code and real numbers so we know what we're dealing with. I'm using the cheapest Widget I can imagine - just wraps an int. This emphasizes any performance gains as much as possible. As Widget becomes more expensive to construct, the performance gains of one technique over the other will probably become negligible.

Here's the code I'm testing:

#include <iostream>
#include <chrono>
#include <vector>

class Widget
{
int data_;
public:
explicit Widget(int data) : data_(data) {}
};

class generator_Widget
{
int count_;
int state_;

public:
// typedef std::input_iterator_tag iterator_category;
const unsigned N = 1000000;
using namespace std::chrono;
auto t0 = high_resolution_clock::now();
#if 1
std::vector<Widget> v(generator_Widget(0, N), generator_Widget(N));
#else
std::vector<Widget> v;
int state = N;
v.reserve(N);
for (unsigned i = 0; i < N; ++i, --state)
v.push_back(Widget(state));
#endif
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << duration_cast<microseconds>(t1-t0).count() << "\xC2\xB5s\n";
}

I'm testing the above with clang++ -O3, tip-of-trunk libc++. For me this outputs:

$ a.out
2468µs
$ a.out
2431µs
$ a.out
2443µs
$ a.out
2709µs

If I change the above to use push_back():

#if 0

Then I get:

$ a.out
3848µs
$ a.out
3608µs
$ a.out
3635µs
$ a.out
3765µs

That's around a 40% performance hit, which is significant.

I also noted a generator/iterator adaptor in boost:

http://www.boost.org/doc/libs/1_54_0/libs/utility/generator_iterator.htm

However this adaptor is strictly conforming, and thus classifying itself as an input_iterator. This is a performance killer. I modeled this with:

typedef std::input_iterator_tag iterator_category;
// typedef std::random_access_iterator_tag iterator_category;

and got:

$ a.out
5256µs
$ a.out
5426µs
$ a.out
5233µs
$ a.out
5388µs

Another 48% slower than the push_back loop. The main reason for this is that vector can't predict a-preori the final capacity it will need, and thus must grow its capacity geometrically during the iteration. libc++ grows with a factor of 2. This is also the approximate time I get with the push_back() method if I remove the v.reserve(N).

I haven't bothered testing the allocator-no-op-constructor method. I believe it will be no faster, and possibly slower than the random-access-iterator-generator technique because it requires two passes over the elements, and one of the passes might be optimized out.

Howard

Billy O'Neal

unread,
Sep 8, 2013, 12:43:27 AM9/8/13
to std-proposals
Note that you can use the generator even if you aren't constructing the vector by calling assign or insert. While the generator solution is clunky if someone has really really profiled their code, and really found that a significant fraction is spent in push_back (which I think is *exceedingly* unlikely), then they can use the generator solution.

> The no-construct allocator technique is very error prone.  The one advantage it has over push_back_ is that it is in C++11 and later, and push_back_ isn't.
 
It also has the advantage that the client can't "screw up" and go off the end of the allocated region. The resulting program is perfectly safe, which wouldn't happen with the proposed push_back_ solution. (Yes, the generator is more complex, but the vast majority of bugs there will result in easy to diagnose crashes or simply fail to compile)
 
The benchmark above neglects to mention that there is also a size modification going on inside push_back -- writes are typically more expensive than reads, such as the comparison against capacity. If the vector has been reserved in advance, then the capacity check is going to be predicted with high accuracy by any machine worth its salt, so it should parallelize well.
 
However, writing the value of size in the above example changes on every iteration of the loop, and causes cross loop body memory dependencies. It also doubles or triples the amount of memory being written to, because each int being written is causing a size_t to be written for the size update.
 
The push_back_ proposal doesn't remove the size modification, only the capacity check. Are we sure that the capacity check is actually the cause of that 40% difference?
 
 

Billy O'Neal
Malware Response Instructor - BleepingComputer.com



Howard

--

---
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposal...@isocpp.org.
To post to this group, send email to std-pr...@isocpp.org.
Visit this group at http://groups.google.com/a/isocpp.org/group/std-proposals/.

Magnus Fromreide

unread,
Sep 8, 2013, 2:25:38 AM9/8/13
to std-pr...@isocpp.org
I did some quick performance runs with Howard's code (gcc-3.9 20130903,
x86_64) but added two more cases

1 - generate_Widget
2 - reserve, push_back
3 - reserve, emplace_back
4 - reserve, push_back_

What I ended up with was that

* 1 and 4 resulted in the same code
* 3 is marginally faster than 2, but still in the same ballpark

Looking at the generated assembler it seems that the real killer is that
the ptr_end and ptr_capacity values are updated (written to memory) on
each lap through the code in case 3 and 4. (The state variable is also
spilled but forcing that into a register doesn't seem to affect the
performance)

/MF


Billy O'Neal

unread,
Sep 8, 2013, 3:44:43 AM9/8/13