"STL: Amazing Speed Differences between std::vector and std::set (Observed with an UndoRedoAction)"


Lynn McGuire

May 26, 2021, 4:27:45 PM
"STL: Amazing Speed Differences between std::vector and std::set
(Observed with an UndoRedoAction)"

https://www.codeproject.com/Tips/5303529/STL-Amazing-Speed-Differences-between-std-vector-a

I have seen this myself. We used a std::map for a very large set
(>10,000 members) just because it is much faster than std::vector.

Lynn

Scott Lurndal

May 26, 2021, 4:45:03 PM
Isn't that computer science 101? O(N) vs. [O(1) .. O(Log N)] for
vector vs std::map (red-black tree).

Lynn McGuire

May 26, 2021, 6:07:18 PM
Sorry, never took CS 101 at TAMU. My degree is in Mechanical
Engineering. I did take CS 204 or 304, IBM 370 Assembly Language
Programming, for grins.

Lynn

MrSpook...@a07vn7.tv

May 27, 2021, 4:24:21 AM
On Wed, 26 May 2021 17:06:59 -0500
Lynn McGuire <lynnmc...@gmail.com> wrote:
>On 5/26/2021 3:44 PM, Scott Lurndal wrote:
>> Lynn McGuire <lynnmc...@gmail.com> writes:
>>> "STL: Amazing Speed Differences between std::vector and std::set
>>> (Observed with an UndoRedoAction)"
>>>
>>>
>https://www.codeproject.com/Tips/5303529/STL-Amazing-Speed-Differences-between-std-vector-a
>>>
>>> I have seen this myself. We used a std::map for a very large set
>>> (>10,000 members) just because it is much faster than std::vector.
>>
>> Isn't that computer science 101? O(N) vs. [O(1) .. O(Log N)] for
>> vector vs std::map (red-black tree).
>
>Sorry, never took CS 101 at TAMU. My degree is in Mechanical

Perhaps you need to learn the basics before you go and write important
software in C++; otherwise you'll write some POS that maintenance
programmers down the line will have to suffer with for years.

>Engineering. I did take CS 204 or 304, IBM 370 Assembly Language
>Programming, for grins.

Not many programmers would assume they could just go and design a bridge,
yet there seems to be a certain arrogance in engineering circles that
programming is something anyone can do, akin to making a coffee, just
because they once wrote

10 print "hello"
20 goto 10

Well guess what, we all built buildings out of lego when we were kids too.

Lynn McGuire

May 27, 2021, 1:08:55 PM
Way too late dude. I am 60 and climbing. Been running my own
engineering software company for 26 years now. The eight of us are
shepherding about 1.3 million lines of C++ and F77 code that has been
ported from several mainframes to VAXes to Unix boxes to PCs to Windows.

In fact, I'll bet that TAMU has not offered CS 204, IBM 370 Assembly
Language Programming, in over three decades. I took the course in 1979
after placing out of all the Fortran courses.

Lynn


Vir Campestris

May 27, 2021, 4:38:41 PM
It's funny, I usually say "Use vector. No really, use vector. Oh,
perhaps if you have a special case..."

But having glanced through that code, it's not clear to me _why_ set
works so much better for that case. Most likely lots of insertions or
searches.

Andy

Louis Krupp

May 27, 2021, 9:04:02 PM
The update code for vectors loops through elements until it finds a
match. In the Code Project page:

for (auto it = _aPixelActionsVec.begin(); it != _aPixelActionsVec.end(); it++)
{
    if (it->ManipulationType == PIXEL_MANIPULATION &&
        it->ColumnOrPalIdx == lCol &&
        it->Row == lRow)
    {
        it->NewColIdxOrRef = dwNewColIdxOrRef;
        it->NewMask = bNewMask;
        return true;
    }
}

The update code for sets calls the set's "find" method:

IMAGMANIPULATION im;
SetPixelManipulation(im, lCol, lRow, dwOldColIdxOrRef, dwNewColIdxOrRef,
                     bOldMask, bNewMask);

auto it = _aPixelActionsSet.find(im);
if (it != _aPixelActionsSet.end())
{
    if (it->NewColIdxOrRef != dwNewColIdxOrRef ||
        it->NewMask != bNewMask)
    {
        _aPixelActionsSet.erase(it);
        _aPixelActionsSet.insert(im);
    }
    return true;
}

A set *could* be implemented as a vector, but it's almost certainly not;
as Scott said, it uses a self-balancing binary search tree. The set's
"find" method would take advantage of this to find matches quickly;
think of how you would code a binary search to find something in a
sorted array.

If there's no need to cycle through all the elements of a set in a
particular sequence, an unordered set might be even faster than a set;
from what I've read, it uses a hash table instead of a binary search tree.

If there's a need to cycle through the elements in their order of
insertion, it might be possible to maintain both a vector and a set,
adding a vector element pointer to the set element. The code could find
matches quickly using the set while still being able to cycle through
the elements of the vector. The bookkeeping required to keep the vector
and the set in sync might or might not be worth the time and trouble.
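A minimal sketch of that idea (hypothetical names throughout; I've used a
key-to-index map rather than raw pointers into the vector, since indices
survive vector reallocation):

#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Elements live in the vector in insertion order; the map gives
// fast lookup by key.
struct IndexedLog {
    std::vector<std::pair<std::string, int>> items;      // insertion order
    std::unordered_map<std::string, std::size_t> index;  // key -> position

    void add(std::string key, int value) {
        index[key] = items.size();                  // position of new element
        items.emplace_back(std::move(key), value);
    }

    std::optional<int> find(const std::string& key) const {
        auto it = index.find(key);
        if (it == index.end()) return std::nullopt;
        return items[it->second].second;            // O(1) average
    }
};

Deletion is where the bookkeeping cost really shows up: erasing from the
middle of the vector shifts elements and invalidates the stored indices.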

Louis


Öö Tiib

May 27, 2021, 11:10:28 PM
On Wednesday, 26 May 2021 at 23:27:45 UTC+3, Lynn McGuire wrote:
> "STL: Amazing Speed Differences between std::vector and std::set
> (Observed with an UndoRedoAction)"
>
> https://www.codeproject.com/Tips/5303529/STL-Amazing-Speed-Differences-between-std-vector-a

The article is TL;DR garbage about undo/redo actions with pixels that I
don't care about and so can't profile. It is likely important to a lot of
other people, but I would like a minimal reproducible example without all
that befoggement and confusion.

What is its point? Is it that searching an unsorted vector can be orders
of magnitude slower than searching a set? Big surprise.

> I have seen this myself. We used a std::map for a very large set
> (>10,000 members) just because it is much faster than std::vector.

As performance does not matter in the majority of code, I have quite a
lot of std::sets in my code. Where performance matters, I use one of
sorted std::vector, std::unordered_set, or boost::intrusive::set.
I have not met a case where std::set outperforms all three. So lately
my default searchable container is std::unordered_set, and sorted
std::vector / boost::intrusive::set are performance optimizations.

Bonita Montero

May 28, 2021, 1:23:54 AM
> As performance does not matter in the majority of code, I have quite a
> lot of std::sets in my code. Where performance matters, I use one of
> sorted std::vector, std::unordered_set, or boost::intrusive::set.
> I have not met a case where std::set outperforms all three. So lately
> my default searchable container is std::unordered_set, and sorted
> std::vector / boost::intrusive::set are performance optimizations.

Of course: if you sort a vector and do a binary search on it,
you have a lot of random memory accesses which aren't
predictable by the prefetcher, and the number of memory accesses
is usually higher than that of a hash set.

Paavo Helde

May 28, 2021, 1:59:35 AM
On 28.05.2021 04:03, Louis Krupp wrote:
> On 5/27/2021 2:38 PM, Vir Campestris wrote:
>> On 26/05/2021 21:44, Scott Lurndal wrote:
>>> Lynn McGuire <lynnmc...@gmail.com> writes:
>>>> "STL: Amazing Speed Differences between std::vector and std::set
>>>> (Observed with an UndoRedoAction)"
>>>>
>>>> https://www.codeproject.com/Tips/5303529/STL-Amazing-Speed-Differences-between-std-vector-a
>>>>
>>>>
>>>> I have seen this myself.   We used a std::map for a very large set
>>>> (>10,000 members) just because it is much faster than std::vector.
>>>
>>> Isn't that computer science 101?   O(N) vs. [O(1) .. O(Log N)] for
>>> vector  vs std::map (red-black tree).
>>>
>> It's funny, I usually say "Use vector. No really, use vector. Oh,
>> perhaps if you have a special case..."
>>
>> But having glanced through that code, it's not clear to me _why_ set
>> works so much better for that case. Most likely lots of insertions or
>> searches.
>>
>
> The update code for vectors loops through elements until it finds a
> match. In the Code Project page:
>
>             for (auto it = _aPixelActionsVec.begin(); it != _aPixelActionsVec.end(); it++)

One can have efficient search in vectors, beating std::set. But for that
the vector must be sorted and one must use std::lower_bound(), not
linear search.
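
A minimal sketch of that, assuming the vector is already sorted:

#include <algorithm>
#include <vector>

// Binary search in a sorted vector: O(log N) comparisons.
bool contains_sorted(const std::vector<int>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key);
    return it != v.end() && *it == key;  // lower_bound may point past a miss
}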

Juha Nieminen

May 28, 2021, 2:45:46 AM
If you are programming something for efficiency, you should know the theory
behind algorithms and data containers. You should understand the difference
between linear-time operations and logarithmic-time operations. You should
know how data containers behave internally and what their computational
complexity is.

It's quite obvious that a binary search on a sorted data container (which
is essentially what std::set does) is going to be dramatically faster
than a linear search on an unsorted data container (the larger the number
of elements, the larger the difference in speed, growing roughly as N/log N).

To a competent programmer something like std::set shouldn't merely be a
black box that does some strange magic inside. The competent programmer
knows what it's doing behind the scenes, and what its strengths and
weaknesses are.

In the same way as std::vector is not good for everything, likewise
std::set is not good for everything. There are many situations where
std::vector (even with a very large number of elements) is actually much
more efficient than std::set. It depends on what exactly you are doing.
(And, of course, std::vector and std::set are not fully interchangeable
in all situations, because std::set is an ordered data container. It does
not preserve the order of insertion, like std::vector does, but always
keeps the elements sorted. That's what allows the logarithmic search,
insertion and deletion of elements.)
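
A trivial illustration of that ordering difference:

#include <iostream>
#include <set>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2};
    std::set<int>    s{3, 1, 2};
    for (int x : v) std::cout << x << ' ';  // 3 1 2  (insertion order)
    std::cout << '\n';
    for (int x : s) std::cout << x << ' ';  // 1 2 3  (sorted order)
    std::cout << '\n';
}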

When the logarithmic-time insertion, deletion and searching for a large
data container are not needed, std::set can be detrimental to performance
compared to an implementation that uses std::vector efficiently. That's
because std::set allocates memory dynamically for every single element
(and each element has quite some overhead in the form of several pointers
and other data members). This not only means that std::set consumes more
memory, but it's also slower to create new elements and, in some cases,
to traverse the container because the elements are not placed in
contiguous memory locations (and, even if they happened to be, they
would be spaced out a lot more, causing more cache misses).

Depending on the task at hand, std::vector can be as fast as, if not faster
than std::set even when requiring fast searching. This is if you can keep
the vector sorted and use binary searching (std::lower_bound). Of course
this applies mostly if you can generate the entire vector in one go and
then you just need to search but don't need to insert or remove elements.

Scott Lurndal

May 28, 2021, 10:44:45 AM
A vector miss will always be O(N). The average hit will be O(N/2). From a big-O
perspective, they're identical complexities.

And either the sort needs to be done after every insert or the insert
complexity becomes high (due to the need to move all the higher
elements up one in the vector).

A miss on a balanced tree structure will always be O(log N), with
some complexity on insert due to balancing.

James Kuyper

May 28, 2021, 11:25:44 AM
On 5/28/21 10:44 AM, Scott Lurndal wrote:
> Paavo Helde <myfir...@osa.pri.ee> writes:
...
>> One can have efficient search in vectors, beating std::set. But for that
>> the vector must be sorted and one must use std::lower_bound(), not
>> linear search.
>
> A vector miss will always be O(N). The average hit will be O(N/2). From a big-O
> perspective, they're identical complexities.

The description for std::lower_bound<T>() says "They are especially
appropriate for random access iterators, because these algorithms do a
logarithmic number of steps through the data structure." (25.8.3p1). For
types other than bool, std::vector<T> is a contiguous container
(22.3.11.1), and therefore has "member types iterator and const_iterator
[that] meet the Cpp17 RandomAccessIterator requirements (23.3.5.6)"
(22.2.1p13). Therefore lower_bound should be O(ln(N)), not O(N/2),
regardless of whether there's a hit or a miss.


> And either the sort needs to be done after every insert or the insert

If lower_bound() fails to find an exact match, the value it does return
will indicate precisely where the new item should be inserted to keep
the vector in order, so there's no need to sort. On the other hand, it
is true that:

> complexity becomes high (due to the need to move all the higher
> elements up one in the vector).

That's true, but its importance depends upon how frequently the sorted
vector will be searched compared with how frequently it will need to be
modified. The comment you were responding to was only about searching.
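
A minimal sketch of that insert-in-order idea:

#include <algorithm>
#include <vector>

// Keep a vector sorted by inserting each new key at the position
// std::lower_bound reports; no re-sort is needed afterwards.
void insert_sorted(std::vector<int>& v, int key) {
    auto pos = std::lower_bound(v.begin(), v.end(), key);
    v.insert(pos, key);  // O(N) element moves, but order is preserved
}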

Juha Nieminen

May 28, 2021, 12:45:17 PM
Scott Lurndal <sc...@slp53.sl.home> wrote:
>>One can have efficient search in vectors, beating std::set. But for that
>>the vector must be sorted and one must use std::lower_bound(), not
>>linear search.
>
> A vector miss will always be O(N). The average hit will be O(N/2). From a big-O
> perspective, they're identical complexities.

What do you mean by "vector miss"?

> And either the sort needs to be done after every insert or the insert
> complexity becomes high (due to the need to move all the higher
> elements up one in the vector).

I think the idea was that if you already have a sorted vector (for example
a dataset that you read eg. from a file and sorted once), making repeated
searches for different values will be as fast as, if not even slightly
faster than with std::set (not asymptotically faster, but faster in
terms of wallclock time).

Bonita Montero

May 28, 2021, 12:46:37 PM
> One can have efficient search in vectors, beating std::set. ...

Use unordered_set if you don't need sorted iteration.
When the load factor is set correctly, that will be faster.
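
A minimal sketch of what setting it up might look like (the 1.0 cap and
the up-front reserve are my assumptions, not prescriptions):

#include <cstddef>
#include <unordered_set>

// Size the hash set up front so bucket chains stay short and no
// rehashing happens while inserting the expected number of elements.
std::unordered_set<int> make_hash_set(std::size_t expected) {
    std::unordered_set<int> s;
    s.max_load_factor(1.0f);  // at most ~1 element per bucket on average
    s.reserve(expected);      // allocate enough buckets in advance
    return s;
}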

Branimir Maksimovic

May 28, 2021, 1:02:28 PM
Well, that is because a vector is more cache friendly ;)


--
current job title: senior software engineer
skills: x86 assembler, c++, c, rust, go, nim, haskell...

press any key to continue or any other to quit...

Branimir Maksimovic

May 28, 2021, 1:06:00 PM
The problem is that if the binary tree does not reorganize internally to
be cache friendly on every insert (balancing and keeping nearby nodes
close to each other in memory), traversing the set is awfully slow.

Branimir Maksimovic

May 28, 2021, 1:13:24 PM
While this is theoretically OK, sorting a list in place linearly
with sequential quicksort is faster than sorting it by relinking nodes.
And after that, traversal is much faster, as the in-place sort doesn't
change memory order. Chasing pointers is unforgiving on modern hardware:
you get less than one instruction per cycle ...

Scott Lurndal

May 28, 2021, 1:24:01 PM
James Kuyper <james...@alumni.caltech.edu> writes:
>On 5/28/21 10:44 AM, Scott Lurndal wrote:
>> Paavo Helde <myfir...@osa.pri.ee> writes:
>...
>>> One can have efficient search in vectors, beating std::set. But for that
>>> the vector must be sorted and one must use std::lower_bound(), not
>>> linear search.
>>
>> A vector miss will always be O(N). The average hit will be O(N/2). From a big-O
>> perspective, they're identical complexities.
>
>The description for std::lower_bound<T>() says

I missed the reference to lower_bound (i.e. binary search). It's true that
using a binary search on a sorted vector will be O(log N).

Bonita Montero

May 28, 2021, 1:32:12 PM
> The problem is that if the binary tree does not reorganize internally to
> be cache friendly on every insert (balancing and keeping nearby nodes
> close to each other in memory), traversing the set is awfully slow.

A vector isn't cache-friendly either when you do a binary search.
Random-access memory accesses are always slow.

Scott Lurndal

May 28, 2021, 1:51:00 PM
Not really unforgiving; we spend significant resources on tuning
the cache replacement algorithms and cache prefetchers to detect
and support pointer-chasing applications.

Bonita Montero

May 28, 2021, 1:55:18 PM
I just wrote a little test:

#include <iostream>
#include <set>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

using namespace std;
using namespace chrono;

int main()
{
    using hrc_tp = time_point<high_resolution_clock>;
    size_t const ROUNDS = 10'000'000;

    // fill the set with 0 .. 999 so every key uid() can draw is present
    // (find() below would otherwise return end(), and &*end() is UB)
    set<int> si;
    for( int i = 1000; i--; )
        si.insert( i );

    mt19937_64 mt( (default_random_engine())() );
    uniform_int_distribution<int> uid( 0, 999 );
    int const *volatile pvi;

    // time ROUNDS random lookups in the set
    hrc_tp start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        pvi = &*si.find( uid( mt ) );
    double ns = (int64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / (double)ROUNDS;
    cout << "set: " << ns << endl;

    // the same keys in a sorted vector, searched with binary_search
    vector<int> vi( 1'000 );
    for( size_t i = 1'000; i--; vi[i] = (int)i );
    bool volatile found;
    start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        found = binary_search( vi.begin(), vi.end(), uid( mt ) );
    ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now()
        - start ).count() / (double)ROUNDS;
    cout << "vec: " << ns << endl;
}

On my Ryzen Threadripper 3990X the times for set are 52.2 ns and for vec
they are 48.3 ns. I expected the difference to be somewhat larger, but as
the set and map trees are usually red-black trees, which encode 2-3-4
trees (up to 4 logical descendants per node), there's some kind of binary
lookup inside the nodes, so the memory access patterns become not so
random.

Scott Lurndal

May 28, 2021, 4:20:28 PM
Bonita Montero <Bonita....@gmail.com> writes:
>>> The problem is that if the binary tree does not reorganize internally
>>> to be cache friendly on every insert (balancing and keeping nearby
>>> nodes close to each other in memory), traversing the set is awfully slow.
>
>> A vector isn't cache-friendly either when you do a binary search.
>> Random-access memory accesses are always slow.
>
>I just wrote a little test:
>
>#include <iostream>
>#include <set>
>#include <vector>
>#include <algorithm>
>#include <random>
>#include <chrono>
>
>using namespace std;
>using namespace chrono;
>
>int main()
>{
>    using hrc_tp = time_point<high_resolution_clock>;
>    size_t const ROUNDS = 10'000'000;
>    set<int> si;
>    for( int i = 1000; i--; )
>        si.insert( i );

A thousand integers will fit in the L1 cache. Try
it with a hundred million.

Branimir Maksimovic

May 28, 2021, 5:32:03 PM
Look, caches are pretty large these days. Rarely will you have a vector
larger than L3 ;)
For example, take a look at this bench:
[code]
~/.../examples/sort >>> ./list_sort 1000000
seed: 1622237019
N: 1000000
unsorted

0x102515e70 299531
0x102515e50 398346
0x102515e30 273381
0x102515e10 875323
0x102515df0 289978
0x102515dd0 399336
0x102515db0 21582
0x102515d90 874270
0x102515d70 114179
0x102515d50 778108
0x102515d30 451177
0x102515d10 570319
0x102515cf0 621189
0x102515cd0 138605
0x102515cb0 908552
0x102515c90 267934
array radix elapsed 0.029208 seconds
list radix elapsed 1.070269 seconds
list merge elapsed 0.329870 seconds
equal

equal

sorted

0x100d80710 0
0x101931470 1
0x101bb1dd0 4
0x10198ef10 6
0x101b8a350 6
0x102453390 7
0x1008aa4b0 10
0x101938ab0 10
0x100d8a650 11
0x102093f30 12
0x101a94eb0 13
0x10086a5f0 13
0x101538210 15
0x100876930 16
0x10079f010 16
0x100e9faf0 17
size of node 12, length 1000000
[/code]
Fits in cache (L3)
Now look at list sort vs array sort with 8 million elements.
That goes beyond the cache size ;)
[code]
~/.../examples/sort >>> ./list_sort 8000000
seed: 1622237118
N: 8000000
unsorted

0x1112ff670 5357700
0x1112ff650 5227237
0x1112ff630 6764785
0x1112ff610 2977649
0x1112ff5f0 5339439
0x1112ff5d0 3588837
0x1112ff5b0 6605927
0x1112ff590 1268989
0x1112ff570 5989679
0x1112ff550 492070
0x1112ff530 4063176
0x1112ff510 2130048
0x1112ff4f0 2334049
0x1112ff4d0 5751483
0x1112ff4b0 1179952
0x1112ff490 973123
array radix elapsed 0.239062 seconds
list radix elapsed 10.192995 seconds
list merge elapsed 4.259887 seconds
equal

equal

sorted

0x10ffe8410 0
0x10e433430 0
0x107ba73b0 1
0x11021d0b0 2
0x10b008290 3
0x1034351f0 4
0x10f710f90 5
0x10b8be5b0 6
0x10395b170 9
0x102da6fd0 9
0x10e729ef0 10
0x110a91990 11
0x103f86a90 12
0x10853c310 13
0x1065f8890 16
0x104453390 16
size of node 12, length 8000000
[/code]
While the array does not fit in cache either, it is more cache friendly
than the list. The same goes for a binary tree. That is why a B-tree is
so much faster than a plain ordinary binary tree on modern hardware ;)
But C++ containers, because of their iterator guarantees, can't use
B-trees :(
Also, merge sort is much faster on a list than radix sort? Can you
believe that?
Now look at the trees benchmark: binary trees vs B-tree.
[code]
~/.../rust/trees >>> ./target/release/binary_trees ±[●][master]
[... several hundred lines of per-insert and per-delete red-black tree
dumps and validity checks elided; the timing summary follows ...]

trees! 64
treest! 64
treesr! 64
average op 5360
t insert time 1.591332069
true
height 20
weight (422175,577824)
average op 5802
t1 insert time 1.722261029
true
height 26
weight (670296,329703)
average op 4334
t2 insert time 1.29017692
true
height 20
weight (422175,577824)
average op 4717
t3 insert time 1.401961993
true
height 23
weight (612266,387733)
counter 31780
average op 3397
bt insert time 1.014782747
true size 1000000
t find time 1.166620321
true
sum 1783293664
t1 find time 1.72787108
true
sum 1783293664
t2 find time 1.192506122
true
sum 1783293664
t3 find time 1.376387028
true
sum 1783293664
bt find time 1.042876439
true
sum 1783293664
average op 375
t iter time 0.110657066
true
sum 1783293664
average op 391
t1 iter time 0.115330211
true
sum 1783293664
average op 376
t2 iter time 0.110930662
true
sum 1783293664
average op 388
t3 iter time 0.114391853
true
sum 1783293664
average op 69
bt iter time 0.020547062
true
sum 1783293664
t delete time 1.806868028
true size 0
empty tree
t1 delete time 1.734909562
true size 0
empty tree
t2 delete time 1.47146888
true size 0
empty tree
t3 delete time 3.051412126
true size 0
counter 19
empty tree
bt delete time 1.052457722
true
[/code]
t, t1, t2, t3 are various binary trees, while bt is a B-tree.
Look how superior the B-tree is, because of cache friendliness :)

Branimir Maksimovic

May 28, 2021, 5:41:33 PM
+1

Branimir Maksimovic

May 28, 2021, 5:50:52 PM
On 2021-05-28, Bonita Montero <Bonita....@gmail.com> wrote:
Your test is wrong: too many iterations while the data fits in cache.
Look how it goes when the data fits in cache:
~/.../bmaxa_data/examples >>> ./vecvsset
set: 45.2584
vec: 48.2688
so in this case set is even faster than vec, but this is an illusion.
Look how things go when the data does not fit in cache:
~/.../bmaxa_data/examples >>> g++ -O2 vecvsset.cpp -march=native -o vecvsset [147]
~/.../bmaxa_data/examples >>> ./vecvsset
set: 226.181
vec: 56.9002
Holy moly, vec is 4 times faster than set ;)
Do you understand now what I am talking about?

Paavo Helde

May 28, 2021, 7:24:23 PM
Are you kidding? Making big-O tests with N=1000? This is an insult to
big-O!

Here are timings of your program with a bit more meaningful N. I also
added unordered_set out of curiosity, in the last column.

N              ROUNDS       set time /ns/   vec time /ns/   hash_set /ns/
1'000          10'000'000        52.3            50.0            4.88
100'000        100'000          208             115             14.6
10'000'000     1000            1322             340             53
100'000'000    100             2242             690            122
1'000'000'000  10              5320            1380            420

So, std::set is the clear loser here. std::unordered_set lookup is
fastest, but OTOH its memory consumption, construction and destruction
time are the worst (not shown in the table, but building and destroying
the last hash set with 1G elements took ages and it ate up almost 64GB
RAM!). A (sorted) vector takes *much* less memory, is *much* faster to
construct and destruct, and binary lookup in it is better than std::set,
regardless of size.

Bonita Montero

May 28, 2021, 11:18:23 PM
> A thousand integers will fit in the L1 cache.
> Try it with a hundred million.

It will be slower - but the relationship will be the same.

Bonita Montero

May 28, 2021, 11:25:58 PM
> Your test is wrong: too many iterations while the data fits in cache.
> ...

When I do this:

#include <iostream>
#include <set>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

using namespace std;
using namespace chrono;

int main()
{
    using hrc_tp = time_point<high_resolution_clock>;
    size_t const N = 1'000'000,
                 ROUNDS = 100'000;
    set<int> si;
    for( int i = N; i--; )
        si.insert( i );
    mt19937_64 mt( (default_random_engine())() );
    // note: keys are still drawn only from [0, 999], so the lookups
    // touch a small, cache-warm corner of the million-element containers
    uniform_int_distribution<int> uid( 0, 999 );
    int const *volatile pvi;
    hrc_tp start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        pvi = &*si.find( uid( mt ) );
    double ns = (int64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / (double)ROUNDS;
    cout << "set: " << ns << endl;
    vector<int> vi( N );
    for( size_t i = N; i--; vi[i] = (int)i );
    bool volatile found;
    start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        found = binary_search( vi.begin(), vi.end(), uid( mt ) );
    ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now()
        - start ).count() / (double)ROUNDS;
    cout << "vec: " << ns << endl;
}

set is about 78.2 and vector is about 53.8.
Not a big difference ...

Bonita Montero

May 28, 2021, 11:42:53 PM
And this is an extension to unordered_set:

#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

using namespace std;
using namespace chrono;

int main()
{
    using hrc_tp = time_point<high_resolution_clock>;
    size_t const N = 10'000'000,
                 ROUNDS = 10'000'000;
    set<int> si;
    for( int i = N; i--; )
        si.insert( i );
    mt19937_64 mt( (default_random_engine())() );
    // as before, keys are drawn only from [0, 999]
    uniform_int_distribution<int> uid( 0, 999 );
    int const *volatile pvi;
    hrc_tp start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        pvi = &*si.find( uid( mt ) );
    double ns = (int64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count() / (double)ROUNDS;
    cout << "set: " << ns << endl;
    unordered_set<int> usi;
    usi.max_load_factor( 2.0f );
    usi.reserve( N );
    for( int i = N; i--; )
        usi.insert( i );
    bool volatile found;
    start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        pvi = &*usi.find( uid( mt ) );
    ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now()
        - start ).count() / (double)ROUNDS;
    cout << "uset: " << ns << endl;
    vector<int> vi( N );
    for( size_t i = N; i--; vi[i] = (int)i );
    start = high_resolution_clock::now();
    for( size_t r = ROUNDS; r--; )
        found = binary_search( vi.begin(), vi.end(), uid( mt ) );
    ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now()
        - start ).count() / (double)ROUNDS;
    cout << "vec: " << ns << endl;
}

So with unordered_set the lookup is 15.1 ns. I think that's also because
the lookups get OoO-parallelized, since the bucket chains aren't so deep.

Branimir Maksimovic

May 29, 2021, 6:45:00 AM
That still fits in cache. Try with 8-16 million depending on your cache
size.

Bonita Montero

May 29, 2021, 7:01:05 AM
> That still fits in cache. Try with 8-16 million depending on your cache
> size.

10^7 ints are 40MB in the vector version. I've got only 4MB of L2 cache,
but 256MB (8 * 32MB) of L3 cache. But a cache line in the L3 cache of one
CPU die of my multi-die CPU is populated only by the cores of that same
die; other dies can't populate it.

These are the numbers from memory:

set: 1872.5
uset: 188.302
vec: 695.665


Öö Tiib

May 29, 2021, 11:17:23 AM
Whether it is one way or the other can be predicted per use case.
But as std::lower_bound is constexpr while std::unordered_map is not,
the latter cannot compete in the field of O(0): not doing the work at
runtime at all.

Bonita Montero

May 29, 2021, 12:12:44 PM
> Whether it is one way or the other can be predicted per use case.

If you have a reasonable load factor on a hash set, it always outperforms
the vector.

> But as std::lower_bound is constexpr ...

If you've got an array with static content that lower_bound scans, and a
very smart compiler, this would help - otherwise not.
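
For reference, a sketch of the compile-time case Öö Tiib describes,
assuming C++20 (where std::lower_bound is constexpr) and exactly such a
static sorted table:

#include <algorithm>
#include <array>

constexpr std::array<int, 5> table{2, 3, 5, 7, 11};  // sorted, static content

constexpr bool has(int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key);
    return it != table.end() && *it == key;
}

// Both lookups run entirely in the compiler: "O(0)" at runtime.
static_assert(has(7));
static_assert(!has(8));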

Juha Nieminen

May 30, 2021, 2:40:44 AM
He was talking about traversing the set, not searching it.
In other words, for(auto& element: theSet).

(This, of course, assuming that the amount of data is so large
that it won't fit entirely even in the L3 cache. Or, if we are
just traversing the set for the first time since all of its contents
have been flushed from the caches.)

Of course even with std::vector it depends on the size of the
element. Traversing a (very large) std::vector linearly from
beginning to end isn't magically going to be very fast either,
if each element is large enough. And "large enough" is actually
quite small. If I remember correctly, cache line sizes are typically
64 bytes or so. This means that if the vector element type is an
object of size 64 bytes or more, and you are accessing just one member
variable of each object, then you'll get no benefit from linear
traversal compared to random access (in the case that the contents of
the vector are not already in the caches).

You only get a speed advantage for (very large) vectors whose element
size is very small, like 4 or 8 bytes. For example, if the vector
represents a bitmap image, with each "pixel" element taking e.g. 4 bytes,
then a linear traversal will be quite efficient (assuming none of the
vector contents were in any cache to begin with, you'll get an
extremely heavy cache miss only every 16 pixels.)
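
To make the arithmetic concrete, a sketch (the 64-byte element is
constructed to fill exactly one cache line; names are hypothetical):

#include <cstdint>
#include <vector>

// Reading one 4-byte field per 64-byte element touches a fresh cache
// line every iteration; a plain vector<int32_t> packs 16 values per line.
struct Pixel64 {
    std::int32_t color;
    char padding[60];  // pads the element to a full 64-byte cache line
};

std::int64_t sum_colors(const std::vector<Pixel64>& v) {
    std::int64_t sum = 0;
    for (const auto& p : v) sum += p.color;  // ~1 cold miss per element
    return sum;
}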

Of course almost none of this applies if the vector or set is
small enough to fit in L1 cache, and it has already been loaded in
there in its entirety previously. Then almost none of this matters
at all. It starts mattering a bit more if the vector is too large
for L1 but small enough for L2 cache, and furthermore if it's too
large for L2 but small enough for L3 cache.

Modern CPUs tend to have a quite large L3 cache, which mitigates
the problems of cache misses in many instances. For example my CPU
has an L3 cache of 12 MB. Thus if I need to, for example, perform some
operations repeatedly on, let's say, an image that fits comfortably
within those 12 MB, then it will be very fast.

It's only when the dataset is much larger than L3 that cache locality
really starts having a very pronounced effect (when performing
operations repeatedly on the entire dataset).