set_intersection

Frederick Gotham

unread,

Sep 24, 2020, 5:15:53 AM9/24/20

to

I tried to post this here yesterday but it ended up in comp.lang.c because the new Google Groups interface can't handle a + sign in the name of a group.

I'm trying to write a function just like "set_intersection", except that the containers don't need to be sorted.

#include <type_traits> /* remove_cv, remove_reference */
#include <algorithm> /* sort */

template<class InputIt1, class InputIt2, class OutputIt>
OutputIt set_intersection_unsorted(InputIt1 first1, InputIt1 last1,
InputIt2 first2, InputIt2 last2,
OutputIt d_first)
{
/* The iterators might be plain ol' pointers, e.g. char* */
/* To keep compatibility with C++11, don't use 'remove_cv_t' */

typedef typename std::remove_cv< typename std::remove_reference<decltype(*first1)>::type >::type type1;
typedef typename std::remove_cv< typename std::remove_reference<decltype(*first2)>::type >::type type2;

vector<type1> container1(first1,last1);
vector<type2> container2(first2,last2);

std::sort(container1.begin(), container1.end());
std::sort(container2.begin(), container2.end());

return std::set_intersection(container1.begin(), container1.end(),
container2.begin(), container2.end(),
d_first);
}

Does this look okay? Is there a better way of doing it?

Paavo Helde

unread,

Sep 24, 2020, 3:07:53 PM9/24/20

to

The above seems OK, but let's see if it can be enhanced.

Let's say the lengths of ranges are M and N. This algorithm sorts both
ranges, so has complexity O(M*log(M) + N*log(N)).

An alternative would be to only sort the shorter range, say M, then
iterate over the longer range and look up values in M via
std::binary_search(). The complexity of that would be O(M*log(M) +
N*log(M)). This is not worse than the original, and should perform
better if M << N. It is also better memory-wise. So this algorithm ought
to be better for this specific task (not tested!)

Potential drawbacks are that the result range remains unsorted and the
behavior with repeated elements is different from std::set_intersection.
So it isn't an exact replacement.

Öö Tiib

unread,

Sep 24, 2020, 6:30:11 PM9/24/20

to

It is generally OK. However it is for that generic case of how
operator<(type1,type2) exists and then that destination range
accepts type1. When I care about performance then I do not make it
generic silver bullet as such don't exist.

Instead I take into account every thing about given input sequences,
given type, output sequence and constraints to resource usage.
Isn't it usually same type? Can't that type usually be cheaply hashed?
Are the sequence lengths usually similar or different? Is there plenty
of memory? Can there be several equal values in input sequences? Is
the output iterator really output iterator or just pointer of buffer
that I can sort afterwards? Is it even needed that result is sorted?

Making two equal capacity hash multisets and intersecting those and
then sorting the result is bit cheaper in my tests than your algorithm
but uses more memory.

Paavo Helde's suggestion is also great when hashing is unavailable
or memory is precious, especially when size difference between
two sequences is major but it takes some trick of marking
already matched elements on case the sequences can contain multiple
equal elements.

set_intersection_sorted

Frederick Gotham

Paavo Helde

Öö Tiib