Avoiding unnecessary counting

Paul

unread,

Dec 10, 2018, 4:55:22 PM12/10/18

to

Suppose that we want to check if an "enough" condition is satisfied.
Is there a way to use library count functions efficiently -- is this taken
care of by the compiler?

I know that sounds vague. I'll explain the type of thing I mean.
Suppose I have a large vector<int> v
and my code is:
if(std::count(v.begin(), v.end(), 7) > 2)
{
//// more code
}

Suppose the vector actually has a million occurrences of 7.
Do standard compiler optimizations like -03 on gcc let the count stop
after the third occurrence? Or must time be wasted counting the 4th, 5th
... occurrence of 7?

How about the count member function of unordered_multiset.
If s is a multiset,
does the code:
if(s.count(7) > 15)
{
//more code
}
need to count every instance of 7 or can it just execute the loop as soon
as it has found 16 occurrences?

Thank you,

Paul

Jorgen Grahn

unread,

Dec 10, 2018, 5:34:11 PM12/10/18

to

If everything relevant is visible to the compiler, in theory it could
be optimized. But I wouldn't count on it.

If it's important to you, write your own at_least() algorithm.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Paul

unread,

Dec 10, 2018, 6:02:43 PM12/10/18

to

I'm not sure how to proceed in that case. I'm doing the Boyer-Moore majority
problem as follows. If a vector of ints has an element that occurs
more than half the time, output that number, otherwise output 0.5.
There's a standard algorithm to do this by incrementing and decrementing
a counter. But I'm pretending I don't know that, and trying to do
the problem time-efficiently via data structures instead.

My code is below but I don't want the multiset to do so much unnecessary
counting. Basically, I don't know how to code an at_least function for
a multiset. Any ideas gratefully received. Thanks. Paul

double DataStructuresMajority(const std::vector<int>& vec)
{
const double noMajority = 0.5;
std::unordered_multiset<int> duplicates(vec.begin(), vec.end());
for(std::unordered_set<int>::const_iterator iter = duplicates.begin(); iter!= duplicates.end(); ++iter)
if(duplicates.count(*iter) > vec.size()/2)
return *iter;

return noMajority;
}

Alf P. Steinbach

unread,

Dec 10, 2018, 8:20:23 PM12/10/18

to

I'd sort the vector and just count run lengths.

Cheers & hth.,

- Alf

Jorgen Grahn

unread,

Dec 11, 2018, 1:45:06 AM12/11/18

to

I didn't read the rest, but are you saying you don't know how to
implement this?

/**
* Return true if std::count(first, last, val) >= n.
* Will not iterate all the way to last unless necessary.
*/
template <class It, class T>
bool at_least(It first, It last, T val, unsigned n);

Paul

unread,

Dec 11, 2018, 1:50:31 AM12/11/18

to

Thanks, Alf
I'm not sure about practical considerations but I'm doing this
as part of learning about data structures and algorithms where
the "big O" of an algorithm is of primary importance.

Sorting the vector would be regarded as terrible because it
uses an O(n log n) algorithm where linear time is available.

I know vectors are faster so your idea might work in practice.

But how about my original question of efficiently verifying
a condition like m.count(3) > 4; where m is an unordered_multiset<int>?
I suppose an approach might be to research how count() works and then
modify it.

Paul

Paul

unread,

Dec 11, 2018, 1:53:33 AM12/11/18

to

No, but you're saying that you don't understand my code.
For a multiset, using std::count rather than multiset::count()
is a really bad idea.

Thanks for not reading my post, and then complaining about what you
didn't read.

Paul

Jorgen Grahn

unread,

Dec 11, 2018, 3:02:50 AM12/11/18

to

I was referring to your first example problem. If the second one with
unordered_multiset was more important to you, I missed that fact.

> Thanks for not reading my post, and then complaining about what you
> didn't read.

Where am I complaining or saying I don't understand your code?
I was trying to be helpful.

Stuart Redmann

unread,

Dec 11, 2018, 3:04:23 AM12/11/18

to

I don't think that Jorgen did not get your point or did not understand your
problem, even though he wrote that he didn't read your posting to the very
end. I find Jorgen's answer quite informative, and BTW Jorgen does not
propose to use std::count on the implementation of at_least, he just left
the implementation empty as an (rather simple) exercise.

Jorgen probably misunderstood you in that regard that he thinks you, Paul,
were not able to roll your own version of at_least, but I'm pretty sure
that you are able to do so. You statement "I'm not sure how to proceed in
this case" is very likely meant as "I'm not sure how I can find out whether
the compiler will abort the loop prematurely or not."

So this is more or less a classic misunderstanding. However, I think that
your answer to Jorgen is a bit to disrespectful, with regard to Jorgen
being one of the top posters of this group. I'm just afraid that two
competent participants, Jorgen and you, might start to dislike one another
over a simple misunderstanding, which would be a pity.

Regards,
Stuart

Öö Tiib

unread,

Dec 11, 2018, 4:45:48 AM12/11/18

to

On Tuesday, 11 December 2018 01:02:43 UTC+2, Paul wrote:
>
> I'm not sure how to proceed in that case. I'm doing the Boyer-Moore majority
> problem as follows. If a vector of ints has an element that occurs
> more than half the time, output that number, otherwise output 0.5.

You return int as double and use 0.5 as indicator of "not found"?
That is obscure. Return something that makes sense like
std::optional<vote> or std::pair<bool,vote>.

> There's a standard algorithm to do this by incrementing and decrementing
> a counter. But I'm pretending I don't know that, and trying to do
> the problem time-efficiently via data structures instead.

The standard algorithm does it with two passes on unsorted
sequence so O(2*N). Sorting the sequence and then one-pass
counting is O(N*log(N)+N). Copying to another, sorted or hashed
container and then one-pass counting is about as time-complex
as with sorting plus it takes double memory too. Your algorithm
below loses to all of those.

> My code is below but I don't want the multiset to do so much unnecessary
> counting. Basically, I don't know how to code an at_least function for
> a multiset. Any ideas gratefully received. Thanks. Paul
>
> double DataStructuresMajority(const std::vector<int>& vec)
> {
> const double noMajority = 0.5;
> std::unordered_multiset<int> duplicates(vec.begin(), vec.end());
> for(std::unordered_set<int>::const_iterator iter = duplicates.begin(); iter!= duplicates.end(); ++iter)
> if(duplicates.count(*iter) > vec.size()/2)
> return *iter;
>
> return noMajority;
> }

That is mixing up iterators of unordered_multiset and unordered_set.
Does it compile? Also you seemingly call count for every element in
equal range so it looks something like O(N*log(N) + N*N). Sad.

Alf P. Steinbach

unread,

Dec 11, 2018, 5:05:15 AM12/11/18

to

<> Thanks, Alf
> I'm not sure about practical considerations but I'm doing this
> as part of learning about data structures and algorithms where
> the "big O" of an algorithm is of primary importance.
>
> Sorting the vector would be regarded as terrible because it
> uses an O(n log n) algorithm where linear time is available.

Well, in a situation of paid work, if whoever commissioned the work has
that opinion and e.g. refuses to pay, then of course.

But otherwise just keep things simple.

The cost of developing this code in a way that can be marginally more
efficient for a zillion values, and the costs of maintenance of that
more than necessary complex code, likely far outweighs the gains from
possibly running it once on a zillion items.

> I know vectors are faster so your idea might work in practice.
>
> But how about my original question of efficiently verifying
> a condition like m.count(3) > 4; where m is an unordered_multiset<int>?
> I suppose an approach might be to research how count() works and then
> modify it.

For b.count(k) for an unordered associative container b, C++17 table 91,
in §26.2.7, gives complexity requirements average case O(b.count(k)) and
worst case O(b.size()).

In §20.4.1.4 it's explained that “Complexity requirements specified in
the library clauses are upper bounds, and implementations that provide
better complexity guarantees satisfy the requirements.”

The problem with an /unordered/ container is that there's no way to
select only the key values = k that you're interested in, for the
purpose of counting them. An ordered container gives you that, but at
the cost of a log(n) factor of complexity. So you're into trade-offs,
which can be interesting. :)

Cheers!,

- Alf

Paul

unread,

Dec 11, 2018, 5:40:59 AM12/11/18

to

Thanks for your reply. A number of points. In similar exercises, it is
common to have a problem where solutions are assigned a non-negative integer
and -1 is a sentinel to represent the no-solution case.
Here, -1 is valid in the case where a solution is present. So I tried to
mimic this basic technique and chose 0.5 as a sentinel.
However, the technique doesn't work as well because it necessitates changing
the return type which isn't ideal. I hadn't heard of std::optional -- many
thanks for introducing this to me.
The time complexity of my algorithm is O(N)
Thanks for pointing out the mixing of the two types of iterators. I hadn't
noticed that. It does compile with gcc. I avoid posting non-compilable code
by copy-pasting of a verified program (although you couldn't have known
that). I would rather it didn't compile so that I could have uncovered this
mistake myself. I should have just used a range-based for loop.

Because of the hashing, my algorithm is O(N). Unordered_multiset uses hashing.
You might be correct that "it looks something like O(N*log(N) + N*N)" but
it is O(N) regardless of what it looks like.
See Alf's post for more info.

Paul

Paul

unread,

Dec 11, 2018, 5:46:22 AM12/11/18

to

Ok, no offence intended and all that.
But note that what you (Stuart) are clearly saying is that
1) Jorgen didn't read all of my posting.
2) Based on only a partial reading, Jorgen underestimated me.
You used the word "misunderstood" and the misunderstanding was to think
I couldn't do something that I can do.

Anyway, yes, Jorgen is helpful and knows a lot.

Thanks to everyone, etc.

Paul

Paul

unread,

Dec 11, 2018, 6:05:09 AM12/11/18

to

Hi Alf,

I am training for a codility test.
Your algorithm does indeed pass the codility requirements, so it's perfectly
ok for my purposes.

Codility aim to make deductions for time-inefficient algorithm but yours
scores 100%.
BTW, my idea with the unordered_multiset count also scores 100% (and it
certainly wouldn't have done if it was O(N ^ 2)).

Paul

Öö Tiib

unread,

Dec 11, 2018, 9:26:19 AM12/11/18

to

On Tuesday, 11 December 2018 12:40:59 UTC+2, Paul wrote:
> On Tuesday, December 11, 2018 at 9:45:48 AM UTC, Öö Tiib wrote:
> > On Tuesday, 11 December 2018 01:02:43 UTC+2, Paul wrote:
> > >
> > > I'm not sure how to proceed in that case. I'm doing the Boyer-Moore majority
> > > problem as follows. If a vector of ints has an element that occurs
> > > more than half the time, output that number, otherwise output 0.5.
> >
> > You return int as double and use 0.5 as indicator of "not found"?
> > That is obscure. Return something that makes sense like
> > std::optional<vote> or std::pair<bool,vote>.
> >
> > > There's a standard algorithm to do this by incrementing and decrementing
> > > a counter. But I'm pretending I don't know that, and trying to do
> > > the problem time-efficiently via data structures instead.
> >
> > The standard algorithm does it with two passes on unsorted
> > sequence so O(2*N). Sorting the sequence and then one-pass
> > counting is O(N*log(N)+N). Copying to another, sorted or hashed
> > container and then one-pass counting is about as time-complex
> > as with sorting plus it takes double memory too. Your algorithm
> > below loses to all of those.

It is pity that you did ignore it.

> >
> > > My code is below but I don't want the multiset to do so much unnecessary
> > > counting. Basically, I don't know how to code an at_least function for
> > > a multiset. Any ideas gratefully received. Thanks. Paul
> > >
> > > double DataStructuresMajority(const std::vector<int>& vec)
> > > {
> > > const double noMajority = 0.5;
> > > std::unordered_multiset<int> duplicates(vec.begin(), vec.end());
> > > for(std::unordered_set<int>::const_iterator iter = duplicates.begin(); iter!= duplicates.end(); ++iter)
> > > if(duplicates.count(*iter) > vec.size()/2)
> > > return *iter;
> > >
> > > return noMajority;
> > > }
> >
> > That is mixing up iterators of unordered_multiset and unordered_set.
> > Does it compile? Also you seemingly call count for every element in
> > equal range so it looks something like O(N*log(N) + N*N). Sad.
>
> Thanks for your reply. A number of points. In similar exercises, it is
> common to have a problem where solutions are assigned a non-negative integer
> and -1 is a sentinel to represent the no-solution case.
> Here, -1 is valid in the case where a solution is present. So I tried to
> mimic this basic technique and chose 0.5 as a sentinel.
> However, the technique doesn't work as well because it necessitates changing
> the return type which isn't ideal. I hadn't heard of std::optional -- many
> thanks for introducing this to me.

The whole point posting was to help you.

> The time complexity of my algorithm is O(N)

It is clearly not.

> Thanks for pointing out the mixing of the two types of iterators. I hadn't
> noticed that. It does compile with gcc. I avoid posting non-compilable code
> by copy-pasting of a verified program (although you couldn't have known
> that). I would rather it didn't compile so that I could have uncovered this
> mistake myself. I should have just used a range-based for loop.

Yes, but more importantly it can be made one pass algorithm.

> Because of the hashing, my algorithm is O(N). Unordered_multiset uses hashing.
> You might be correct that "it looks something like O(N*log(N) + N*N)" but
> it is O(N) regardless of what it looks like.
> See Alf's post for more info.

Can you try to explain it a bit better? Telling instead that hashtable is
using hashing clarifies nothing of your line of thought.
Learn to explain things with your own words; I can't find support to
your O(N) in any of Alf's posts.

I can explain why your complexity "looks" what I said:
A) Complexity of constructor
https://en.cppreference.com/w/cpp/container/unordered_multiset/unordered_multiset
type 2) so linear to quadratic. Starting O formula:

O(N*something + ...

B) Complexity of for loop over all elements
That is linear. Adding to O forrmula:

O(N*something + N * ...

C) Complexity of calling count in for loop
https://en.cppreference.com/w/cpp/container/unordered_multiset/count
That is also linear to the number of equal elements.

O(N*something + N * M)

Where N is number of elements and M is number of equal
elements. That can't be reduced down to O(N) .... and looks expensive
so go ahead and explain why it is not that.

Paul

unread,

Dec 11, 2018, 10:03:51 AM12/11/18

to

I'll repeat the (bugged because of the mix in iterators) code for convenience.
Comments about the time complexity will be interspersed with the code.

Ok, yeah, I can see it's not linear, now.
I was wrong. Mea culpa.
Yes, I always realised you were trying to help.

Paul

double DataStructuresMajority(const std::vector<int>& vec)
{
const double noMajority = 0.5;

std::unordered_multiset<int> duplicates(vec.begin(), vec.end()); // O(N)

for(std::unordered_set<int>::const_iterator iter = duplicates.begin(); iter!= duplicates.end(); ++iter)
if(duplicates.count(*iter) > vec.size()/2)
return *iter;

// Above is bugged -- wrong iterator type but still compiles and
// even seems to work.
// The count function is O(number of occurrences of element being counted)
return noMajority;
}

Paul

unread,

Dec 11, 2018, 4:13:02 PM12/11/18

to

This is a correct O(n) implementation of the same basic idea.

double DataStructuresMajority(const std::vector<int>& vec)
{
const double noMajority = 0.5;

std::unordered_set<int> nonDuplicates(vec.begin(), vec.end());

std::unordered_multiset<int> duplicates(vec.begin(), vec.end());

for(int i : nonDuplicates)
if(duplicates.count(i) > vec.size()/2)
return i;

return noMajority;
}