Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Union slowdown when SameTest is specified

25 views
Skip to first unread message

J Siehler

unread,
Jul 3, 2009, 5:36:43 AM7/3/09
to
Hello group! I was a little surprised and puzzled by the following:

In[1]:= $Version
Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"

In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
Timing[Union[data];]
Out[3]= {0.008138, Null}

So far so good; so far so speedy. But why does this happen:

In[4]:= Timing[Union[data, SameTest -> Equal];]
Out[4]= {15.313, Null}

Or more egregiously,

In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
Out[5]= {86.293, Null}

There's nothing special about the form of the data here, just it
happens to be similar in form to the data I was working on when I
experienced terrible slowdown with the SameTest that I've shown in the
third example - which doesn't seem like a demanding comparison. So
can anyone explain the orders-of-magnitude change in times here? Much
appreciated!

DrMajorBob

unread,
Jul 4, 2009, 6:36:55 AM7/4/09
to
I assume that default tests are hard-coded in a preferential way.

The same might explain why, for a version or two, Tally was broken for
some tests but unbroken for others (including the default).

Bobby

--
DrMaj...@bigfoot.com

Leonid Shifrin

unread,
Jul 4, 2009, 6:37:47 AM7/4/09
to
Hi,

This problem has been noticed and discussed before.

http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html,

probably also on this group. I also discussed it in my book:

http://www.mathprogramming-intro.org/book/node290.html


This is how I understand this issue (I may be wrong):
in the case of Union, the main reason for the huge performance difference
is as follows: having a sameness function is a weaker requirement than
having a comparison function. For native sameness function SameQ, there
exists a corresponding comparison function OrderedQ (or its internal
equivalent), therefore O(NlogN) algorithm can be used. For a generic
comparison function, if one wants to stay completely general, all one can do
is to compare elements pairwise, which leads to O(N^2) complexity.

On top of this, there is another factor (perhaps, less important in the
case of Union): when SameTest is not specified, the code of Union (or Sort
etc) is entirely internal, written in C and very fast. When we specify
SameTest, the Union code has to constantly interface with the high - level
Mathematica code (to call SameTest comparison function), and that slows it
down a lot. This is a general situation with Mathematica - there is on the
average an order of magnitude or so gap between built - in and best-written
user-defined functions in Mathematica, if the functionality is not "too
close" to built-ins so that several of them can be "glued" together with a
minimal performance hit. One reason for this is that built-ins are written
in a much lower-level compiled language, another (related) is that they
avoid the main evaluation loop and symbolic rewritings involved in general
Mathematica computations.This shows up to a various degree in many problems.

Returning to Union, I think it would be nice if Union had an option to
provide a comparison function if it exists, and then use the O(NlogN)
algorithm. The function below represents an attempt in this direction:

Clear[unionBy];
unionBy[list_List, hashF : (_Symbol | _Function)] :=
unionBy[list, hashF /@ list];

unionBy[list_List, hashlist_List] /;
Length[list] == Length[hashlist] :=
With[{ord = Ordering[list]},
With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
Extract[sorted,
Sort[Union[sortedHashlist] /.
Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];

As it follows from its name, this will work if there exists some kind of
"hash"
function that maps your data to another list on which we can use the native
sort/union. The implementation hinges on Sort / Union algorithms being
stable (that means, they don't displace elements which are already "equal").
Initial sorting of a list and list of hash values is needed so that the
produced
result is exactly the same as for standard Union - here we also use the fact

that Sort, Ordering and Union used without SameTest use the same underlying
sorting algorithm with the same native (canonical) sorting function. If,
instead of the "hash-function", you already have a list of "hashcodes",
then <unionBy> will be even faster.

Examples:

1. Union of integer intervals, sametest - their length:

In[1] =
test = RandomInteger[{1, 100}, {500, 2}];

In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] //
Short // Timing

Out[2] =
{0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[3] = unionBy[test, Subtract @@ # &] // Short // Timing

Out[3] =
{0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[4] = unionBy[test, Subtract @@ # &] ===
Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]

Out[4] = True

2. Your example (smaller sample):

In[5] =
largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
Short // Timing

Out[6] =
{6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[7] =unionBy[largetest, First] // Short // Timing

Out[7] =
{0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing

Out[8] =
{0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
unionBy[largetest, First]

Out[9] = True

The main idea behind unionBy is described in my book at:

http://www.mathprogramming-intro.org/book/node466.html

where the code is analyzed line by line.

Hope this helps.

Regards,
Leonid

Szabolcs

unread,
Jul 4, 2009, 6:38:30 AM7/4/09
to

Without running any tests---my guess is that the default algorithm
used by Union exploits the fact that the elements of the list can be
sorted. When you specify your own SameTest, Union won't know any more
what less-than and greater-than comparison to use that is compatible
with your specific SameTest.

da...@wolfram.com

unread,
Jul 4, 2009, 6:38:51 AM7/4/09
to
> Hello group! I was a little surprised and puzzled by the following:
>
> In[1]:= $Version
> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>
> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
> Timing[Union[data];]
> Out[3]= {0.008138, Null}
>
> So far so good; so far so speedy. But why does this happen:
>
> In[4]:= Timing[Union[data, SameTest -> Equal];]
> Out[4]= {15.313, Null}
>
> Or more egregiously,
>
> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
> Out[5]= {86.293, Null}
>
> There's nothing special about the form of the data here, just it
> happens to be similar in form to the data I was working on when I
> experienced terrible slowdown with the SameTest that I've shown in the
> third example - which doesn't seem like a demanding comparison. So
> can anyone explain the orders-of-magnitude change in times here? Much
> appreciated!

Default behavior uses sorting and then comparisons of neighbors only. For
a set of n elements you'd get O(n log(n)) behavior. (There is an oft-cited
code for a Union-no-sort that is O(n) though with larger hidden constant,
but that's another story).

When given a SameTest, Union will not sort but rather test a given element
against every other (remaining) element. This is in general O(n^2).
Another drawback is that it must make a trip through the Mathematica
evaluator to evaluate the provided SameTest. This can be considerably
slower than internal code for the default test, and moreover the speed can
scale with the complexity of the test (witness the distinction between
your second and third examples). These are the issues behind the timing
differences you are seeing.

Daniel Lichtblau
Wolfram Research


DrMajorBob

unread,
Jul 5, 2009, 4:53:10 AM7/5/09
to
Here's a faster and (arguably) simpler method, using unsortedUnion (Tally):

unsortedUnion[x_List] := Tally[x][[All, 1]]
unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] :=
Tally[x, test][[All, 1]]
unsortedUnion[x_List, test : (_Symbol | _Function)] :=
Tally[x, test][[All, 1]]

test = RandomInteger[{1, 100}, {500, 2}];

Timing[one =
Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.087628, Null}

Timing[two = unionBy[test, Subtract @@ # &];]

{0.002282, Null}

Timing[three =
unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.00181, Null}

one == two == three

True

largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];]

{0.990508, Null}

Timing[two = unionBy[largetest, First];]

{0.006182, Null}

Timing[three =
unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];]

{0.003405, Null}

one == two == three

True

Bobby

On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lsh...@gmail.com>
wrote:

>> Hello group! I was a little surprised and puzzled by the following:
>>
>> In[1]:= $Version
>> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>>
>> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
>> Timing[Union[data];]
>> Out[3]= {0.008138, Null}
>>
>> So far so good; so far so speedy. But why does this happen:
>>
>> In[4]:= Timing[Union[data, SameTest -> Equal];]
>> Out[4]= {15.313, Null}
>>
>> Or more egregiously,
>>
>> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
>> Out[5]= {86.293, Null}
>>
>> There's nothing special about the form of the data here, just it
>> happens to be similar in form to the data I was working on when I
>> experienced terrible slowdown with the SameTest that I've shown in the
>> third example - which doesn't seem like a demanding comparison. So
>> can anyone explain the orders-of-magnitude change in times here? Much
>> appreciated!
>>
>>
>

--
DrMaj...@bigfoot.com

Jacob Siehler

unread,
Jul 5, 2009, 4:54:45 AM7/5/09
to
Thanks for the uniformly excellent replies. I figured on overhead for
calling the function but that wasn't enough to explain what was going
on, which was pretty clearly (well, "clearly", based on more tests
than I put in my original post) an outright change in complexity. In
hindsight the utility of using an ordering relation to make fast
sorting part of the Union algorithm makes perfect sense, and explains
the problem here to my satisfaction. It would have been a long time
before I thought of that on my own. Special kudos to Leonid for the
additional code and links.

Leonid Shifrin

unread,
Jul 5, 2009, 4:53:20 AM7/5/09
to
Agreed. Nice to know, thanks. Mine will work on v.5 and even v.4.1 though
(although this is probably irrelevant now), and also I think is a bit more
pedagogical since it deals explicitly with the same problem that Tally
handles internally and showcases the power of Dispatch. But for production,
certainly yours is better.

Regards,
Leonid

DrMajorBob

unread,
Jul 5, 2009, 11:16:34 PM7/5/09
to
Apparently, Tally doesn't slow down for custom SameTest functions, in the
way that Union does.

I suppose that's because Union wants an ordering, while Tally needs only a
test for equivalence.

Bobby

On Sun, 05 Jul 2009 03:46:53 -0500, Leonid Shifrin <lsh...@gmail.com>
wrote:

--
DrMaj...@bigfoot.com

0 new messages