
DeleteDuplicates is too slow?


Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF

Feb 4, 2010, 6:25:00 AM

Hello,

Suppose you have the following.

Data = RandomReal[1,{N,2}];

sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

Timing[DeleteDuplicates[data,sameQ]][[1]];

If N is a large number, this takes an ungodly amount of time.

Is there a more efficient way to delete the duplicate entries of Data?

For example:

Data = {{1.,2.},{1.,3.},{2.,3.}};

would become:

{{1.,2.},{2.,3.}};


Thanks,


Clint Zeringue

Szabolcs Horvát

Feb 5, 2010, 3:27:59 AM

Take care not to use N as a variable as it already has a built-in meaning.

If it is not necessary to keep the elements of the list in the same
order, then a different, lower-complexity algorithm can be used:

SplitBy[SortBy[data, First], First][[All, 1]]

This will be much faster, but will not remove exactly the same elements
as DeleteDuplicates because the second element of the pairs is always
ignored. DeleteDuplicates will always keep the very first occurrence of
equivalent elements. Is this important for your calculation?
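
To see that difference on a small example (toy data, not from the original
post; the kept representatives happen to agree here, but their order does
not):

data = {{3., 1.}, {1., 2.}, {1., 3.}, {2., 3.}};

DeleteDuplicates[data, First[#1] == First[#2] &]
(* {{3., 1.}, {1., 2.}, {2., 3.}}: first occurrences, in original order *)

SplitBy[SortBy[data, First], First][[All, 1]]
(* {{1., 2.}, {2., 3.}, {3., 1.}}: one representative per key, sorted *)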

Clint Zeringue

Feb 5, 2010, 3:29:24 AM

Thanks for all the great responses.

I have summarized them below.

Use Tally or, even better, GatherBy, to obtain very substantial reductions in time:

In[1]:= data=RandomInteger[{1,99},{100000,2}];

In[2]:=

sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
Out[4]= {7.987,Null}

In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
Out[5]= 0.063

In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
Out[6]= 0.016

In[7]:= t0===t1===t2
Out[7]= True

Tomas
Tomas Garza [tgar...@msn.com]

n = 100000;

data = RandomInteger[{0, 9}, {n, 2}] // N;

Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing

{1.39164,10}

Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing

{0.289784,10}


Bob Hanlon
Bob Hanlon [han...@cox.net]

Take care not to use N as a variable as it already has a built-in meaning.

If it is not necessary to keep the elements of the list in the same order, then a different, lower-complexity algorithm can be used:

SplitBy[SortBy[data, First], First][[All, 1]]

This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored. DeleteDuplicates will always keep the very first occurrence of equivalent elements. Is this important for your calculation?

Szabolcs Horvát [szho...@gmail.com]

The fastest solution was Tomas Garza's:

Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]

Daniel Lichtblau

Feb 5, 2010, 3:30:49 AM

Zeringue, Clint M Civ USAF AFMC AFRL/RDLAF wrote:
> Hello,
>
> Suppose you have the following.
>
> Data = RandomReal[1,{N,2}];
>
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> Timing[DeleteDuplicates[data,sameQ]][[1]];
>
> If N is a large number, this takes an ungodly amount of time.
>
> Is there a more efficient way to delete the duplicate entries of Data?
>
> For example:
>
> Data = {{1.,2.},{1.,3.},{2.,3.}};
>
> would become:
>
> {{1.,2.},{2.,3.}};
>
>
> Thanks,
>
> Clint Zeringue

DeleteDuplicates is not able to recognize that there might be a fast way
to use your sameQ, hence it is an O(n^2) proposition when list length is
n. The variant below will behave better.

(* h marks first coordinates already seen; Sow collects each first occurrence *)
deldupes[ll_] := Module[{h, res},
  res = Reap[
      Do[
       If[! TrueQ[h[ll[[j, 1]]]], h[ll[[j, 1]]] = True; Sow[ll[[j]]]],
       {j, Length[ll]}]
      ][[2, 1]];
  Clear[h];
  res]

Example:

data = RandomInteger[10^3, {10^4, 2}];

In[27]:= Timing[dd1 = deldupes[data];]
Out[27]= {0.029996, Null}

In[28]:= Timing[dd2 = DeleteDuplicates[data, sameQ];]
Out[28]= {6.28405, Null}

In[31]:= dd2 // Length
Out[31]= 1001

In[32]:= dd1 === dd2
Out[32]= True

Daniel Lichtblau
Wolfram Research

Leonid Shifrin

Feb 5, 2010, 3:31:00 AM

Hi,

you can use

Tally[data][[All, 1]]

although perhaps this is not the fastest. It seems like DeleteDuplicates
is slow here because it uses the comparison function to compare each
element with all the others, which takes quadratic time in the size of
the dataset. It is blazingly fast on packed arrays of numbers, however.
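
A quick illustration of that last point (a sketch; exact timings vary by
machine and version):

data = RandomReal[1, {10^6, 2}];
Developer`PackedArrayQ[data]     (* True: RandomReal produces a packed array *)
Timing[DeleteDuplicates[data];]  (* fast, since the default SameTest is used *)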

Regards,
Leonid

Bob Hanlon

Feb 5, 2010, 3:32:03 AM

n = 100000;

data = RandomInteger[{0, 9}, {n, 2}] // N;

Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing

{1.39164,10}

Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing

{0.289784,10}


Bob Hanlon

---- "Zeringue wrote:

=============

Bill Rowe

Feb 5, 2010, 3:34:31 AM

On 2/4/10 at 6:26 AM, Clint.Z...@kirtland.af.mil (Zeringue, Clint
M Civ USAF AFMC AFRL/RDLAF) wrote:

>Suppose you have the following.

>Data = RandomReal[1,{N,2}];

>sameQ[_,_]=False; sameQ[{x_,y_},{x_,z_}]=True;

>Timing[DeleteDuplicates[data,sameQ]][[1]];

>If N is a large number, this takes an ungodly amount of time.

>Is there a more efficient way to delete the duplicate entries of
>Data ?

The slowness of DeleteDuplicates comes about when a custom
compare function is used, as the following demonstrates:

In[27]:= sameQ[_, _] = False;
sameQ[x_, x_] = True;

In[29]:= data = RandomInteger[100, 2000];

In[30]:= Timing[Length[a = DeleteDuplicates[data]]]

Out[30]= {0.000025,101}

In[31]:= Timing[Length[b = DeleteDuplicates[data, sameQ@## &]]]

Out[31]= {0.215696,101}

In[32]:= a == b

Out[32]= True

The above is simply to illustrate the issue, not to solve your
specific problem. For your case, I can get the same result in
much less time using GatherBy, as illustrated below:

In[33]:= Clear[sameQ]

In[34]:= sameQ[_, _] = False;
sameQ[{x_, y_}, {x_, z_}] = True;

In[36]:= data = RandomInteger[100, {2000, 2}];

In[37]:= Timing[Length[b = DeleteDuplicates[data, sameQ@## &]]]

Out[37]= {0.246448,101}

In[38]:= Timing[Length[c = First /@ GatherBy[data, First]]]

Out[38]= {0.000957,101}

In[39]:= c == b

Out[39]= True


Tomas Garza

Feb 5, 2010, 3:35:03 AM

Use Tally or, even better, GatherBy, to obtain very substantial reductions in time:

In[1]:= data=RandomInteger[{1,99},{100000,2}];

In[2]:=


sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
Out[4]= {7.987,Null}

In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
Out[5]= 0.063

In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
Out[6]= 0.016

In[7]:= t0===t1===t2
Out[7]= True

Tomas


> Date: Thu, 4 Feb 2010 06:26:02 -0500
> From: Clint.Z...@kirtland.af.mil
> Subject: DeleteDuplicates is too slow?
> To: math...@smc.vnet.net
>
> Hello,
>
> Suppose you have the following.
>
> Data = RandomReal[1,{N,2}];
>
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> Timing[DeleteDuplicates[data,sameQ]][[1]];
>
> If N is a large number, this takes an ungodly amount of time.
>
> Is there a more efficient way to delete the duplicate entries of Data?

DrMajorBob

Feb 6, 2010, 3:23:44 AM

Notice that

sameQ@## &

is the same as (and less verbose than)

sameQ
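
A quick check of the equivalence, reusing the sameQ defined earlier in the
thread:

sameQ @ ## &[{1., 2.}, {1., 3.}]  (* evaluates sameQ[{1., 2.}, {1., 3.}]: True *)
sameQ[{1., 2.}, {1., 3.}]         (* True as well, with one less wrapper *)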

Bobby

On Fri, 05 Feb 2010 02:25:03 -0600, Bill Rowe <read...@sbcglobal.net>
wrote:

> In[31]:= Timing[Length[b = DeleteDuplicates[data, sameQ@## &]]]
>
> [rest of quoted message snipped; see Bill Rowe's post above]

--
DrMaj...@yahoo.com

DrMajorBob

Feb 7, 2010, 6:16:14 AM

Sorry... the second version is less verbose. (Obviously.)

Bobby

Leonid Shifrin

Feb 8, 2010, 3:34:25 AM

Hi Clint,

Sorry, I misinterpreted your original problem (did not notice that you only
care about the first element when testing for equality). Trying to
compensate for this: below is a solution which will give an order of
magnitude speedup, *if* you work on (positive) integer values (the case of
possible negative values can be rather easily incorporated if needed):

(* extract the stored (nonzero) elements of a SparseArray *)
Clear[sparseArrayElements];
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicates];
Options[deleteDuplicates] = {Ordered -> True, Threshold -> 1000000};
deleteDuplicates[data_List, opts___?OptionQ] :=
 Module[{fdata = data[[All, 1]], parr,
   rlen = Range[Length[data], 1, -1],
   preserveOrder = Ordered /. Flatten[{opts}] /. Options[deleteDuplicates],
   threshold = Threshold /. Flatten[{opts}] /. Options[deleteDuplicates],
   dim},
  dim = Max[fdata];
  (* positions indexed by first element; assigning in reverse order makes
     the earliest occurrence of each value win *)
  parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
  parr[[fdata[[rlen]]]] = rlen;
  parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
  data[[If[preserveOrder, Sort@parr, parr]]]]

This exploits the technique proposed recently by Norbert Pozar. If you don't
care about your items being ordered, you may set the option Ordered->False
and get some additional speed-up. The Threshold parameter defines the
maximal data value (which serves as a length of auxiliary array), up to
which the "normal" form of array is used. If the first element of any data
point exceeds the threshold, the SparseArray representation is used.

Benchmarks:

In[6]:= data = RandomInteger[{1, 100000}, {30000, 2}];

In[7]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[7]= 0.19

In[8]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[8]= 0.511

In[9]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[9]= 0.04

In[10]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[10]= 0.01

In[11]:= t1 == t2 == t3

Out[11]= True

In[12]:= data = RandomInteger[{1, 100000}, {500000, 2}];

In[13]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[13]= 2.844

In[14]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[14]= 1.983

In[15]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[15]= 0.241

In[16]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[16]= 0.18

In[17]:= t3 == t2 == t1

Out[17]= True

In[18]:= data = RandomInteger[{1, 10000000}, {100000, 2}];

In[19]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[19]= 1.242

In[20]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[20]= 2.593

In[21]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[21]= 0.281

In[22]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[22]= 0.2

In[23]:= t3 == t2 == t1

Out[23]= True


Regards,
Leonid

On Fri, Feb 5, 2010 at 12:19 AM, Clint Zeringue <in...@scienceskillz.com> wrote:

> Thanks for all the great responses.
>
> I have summarized them below.
>
> [summary snipped; see Clint Zeringue's post above]