An algorithm question, asking for optimal solution

Sticker

unread,

Apr 16, 2007, 9:40:16 PM4/16/07

to

k sorted sequences. Each sequence contains numeric IDs from smallest
to largest. ID varies from 0 to 100,000s. Average length of the
sequences is 10,000s. Generally speaking the sequences can be very
sparse.

Find out whether the k sequences contain a same ID. Further more, find
out all such IDs.

Example:

2 6 8 159
1 8 50 220

The answer is yes, one ID is consistent, which is 8.

Thad Smith

unread,

Apr 16, 2007, 11:11:48 PM4/16/07

to

You can adapt an n-way merge to efficiently find duplicates.

--
Thad

user923005

unread,

Apr 17, 2007, 3:01:03 PM4/17/07

to

A simple idea:
Create k bit strings large enough to hold the largest value (e.g. if
500,000 is the maximum value then each bit string will be (500000/
CHAR_BIT + 500000%CHAR_BIT ? 1:0) bytes in length.
I guess from the sound of it, we are talking 100KB per string or so.
Now, just and the bytes of the strings together. If any resultant
anded byte is not zero, we have a common value. It can easily be
deciphered with bit operations like this (from the C-FAQ):

20.8: How can I implement sets or arrays of bits?

A: Use arrays of char or int, with a few macros to access the
desired bit at the proper index. Here are some simple macros to
use with arrays of char:

#include <limits.h> /* for CHAR_BIT */

#define BITMASK(b) (1 << ((b) % CHAR_BIT))
#define BITSLOT(b) ((b) / CHAR_BIT)
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITTEST(a, b) ((a)[BITSLOT(b)] & BITMASK(b))

(If you don't have <limits.h>, try using 8 for CHAR_BIT.)

References: H&S Sec. 7.6.7 pp. 211-216.

Suppose that k is 100,
you just and together each byte of the 100 strings and if it is non-
zero then you have a common value which is designated by the position
of the bits.

Now, it is not clear if you are trying to find values present in each
and every string, or values common to two or more strings. The first
problem is simple and efficient, the second is far more difficult,
because of the exploding combinations.

Oliver Wong

unread,

Apr 17, 2007, 6:29:36 PM4/17/07

to

"Sticker" <stick...@gmail.com> wrote in message
news:1176774016.8...@y5g2000hsa.googlegroups.com...

Well, I can think of an O(n) algorithm, but I'm unable to prove that
O(n) is optimal.

- Oliver

Mark P

unread,

Apr 17, 2007, 6:38:08 PM4/17/07

to

What is n? The size of the input? It seems pretty obvious that you
need to read all of the input in general.

To the OP, iterative set intersection (easily done in linear time for
sorted input) would be my approach. This is linear in the total input size.

user923005

unread,

Apr 17, 2007, 7:39:36 PM4/17/07

to

On Apr 17, 3:29 pm, "Oliver Wong" <o...@castortech.com> wrote:
> "Sticker" <sticker...@gmail.com> wrote in message

Don't you mean O(k*n), where k is the number of vectors and n is the
vector length? If you really mean O(n), then you're a lot more clever
than I am.

Logan Shaw

unread,

Apr 18, 2007, 1:23:41 AM4/18/07

to

Mark P wrote:
> Oliver Wong wrote:
>> "Sticker" <stick...@gmail.com> wrote in message
>> news:1176774016.8...@y5g2000hsa.googlegroups.com...
>>> k sorted sequences. Each sequence contains numeric IDs from smallest
>>> to largest. ID varies from 0 to 100,000s. Average length of the
>>> sequences is 10,000s. Generally speaking the sequences can be very
>>> sparse.
>>>
>>> Find out whether the k sequences contain a same ID. Further more, find
>>> out all such IDs.

>> Well, I can think of an O(n) algorithm, but I'm unable to prove
>> that O(n) is optimal.

> What is n?

I took it as the total input size.

> The size of the input? It seems pretty obvious that you
> need to read all of the input in general.

It's not at all obvious. The original poster said that in each list,
the numeric IDs were ordered from smallest to largest. This opens up
the possibility of doing binary search.

For example, imagine the input consists of the following sequences:

10 20 30
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
1 4 7 10 13 16 19 22 25 28 31

The answer in this case can be found by doing a binary search for
10, 20, and 30 in the other two sequences.

In general, if there are qty. S sequences and the longest sequence
has length L_max and the shortest sequence has length L_min, you
can do it in O(L_min * S * log(L_max)). All you have to do is
use S steps to find the shortest sequence, then for each item in
that sequence, do a binary search for it in all the other sequences.

Is that better than O(n)? Well if L_min = L_max (i.e. all the
sequences are the same length), then it's equivalent to O(n * log n).
But if L_min is a lot smaller than L_max, it might be better.

Also, you may be able to do a little better than repeatedly doing
binary search. In the above example, if you do a binary search
for 20 in those other two sequences, once you find 20 (or the
spot where it would be), you can use *that* position as a starting
point for your search for 10. The interval that 10 lies in (if it
is there at all) *must* be to the left of the spot where you found
20. I think of this as sort of a "bulk binary search" algorithm,
which is something you can do to find all elements of a sorted
listed within another sorted list.

So now you have, instead of about 3*log2(n) steps for each sequence,
you have log2(n) to find the 20 plus 2*log2(n/2) to find the other
two numbers. In general, optimization changes n*log2(n) into
approximately

log2(n) + 2*log2(n/2) + 4*log2(n/4) + ...

(where the sequence ends when the coefficients have added up to n).

Someone who is better with me than I am can perhaps find a simpler
form for this function. :-) But the point is that it's less
than log2(n).

There may be other tricks as well.

- Logan

Mark P

unread,

Apr 18, 2007, 2:35:30 AM4/18/07

to

Yes, but these are all special cases, and I said "in general". How will
you do better than O(n) when your input consists of, say, k identical
lists of m items _except_ that for each of the m items, with probability
1/2, that item has been deleted from exactly one of the lists, chosen
randomly (and w.p. 1/2, that item is present in all lists).

Mark

Logan Shaw

unread,

Apr 18, 2007, 2:58:23 AM4/18/07

to

Mark P wrote:
> Logan Shaw wrote:
>> Mark P wrote:

>>> The size of the input? It seems pretty obvious that you need to read
>>> all of the input in general.
>>
>> It's not at all obvious. The original poster said that in each list,
>> the numeric IDs were ordered from smallest to largest. This opens up
>> the possibility of doing binary search.

> Yes, but these are all special cases, and I said "in general". How will

> you do better than O(n) when your input consists of, say, k identical
> lists of m items _except_ that for each of the m items, with probability
> 1/2, that item has been deleted from exactly one of the lists, chosen
> randomly (and w.p. 1/2, that item is present in all lists).

That is a very good point.

I did not catch your meaning when you said "in general", but now it's clear
that you mean that a general solution will have to count on reading all n
items sometimes. That's not proven, but it does seem likely.

I guess it just depends on what you mean by "optimal". Oliver said he
was unable to prove O(n) was "optimal", but does that refer to a worst
case behavior or an average case one? I guess it depends on how you
define it.

For what it's worth, if we assume there is no structure to how the
sequences are created and the numbers are just random, the worst case
(or at least very bad case) you describe above is extremely unlikely.
But then there might be some pattern in the data that makes it happen
all the time. :-)

- Logan