Suppose you have the following.

Data = RandomReal[1, {N, 2}];

sameQ[_, _] = False;
sameQ[{x_, y_}, {x_, z_}] = True;

Timing[DeleteDuplicates[Data, sameQ]][[1]];

If N is a large number, this takes an ungodly amount of time.

Is there a more efficient way to delete the duplicate entries of Data? E.g.,

Data = {{1., 2.}, {1., 3.}, {2., 3.}};

would become

{{1., 2.}, {2., 3.}}

Thanks,

Clint Zeringue
Thanks for all the great responses. I have summarized them below.
Use Tally or, even better, GatherBy, to obtain very substantial reductions in time:
In[1]:= data=RandomInteger[{1,99},{100000,2}];
In[2]:=
sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;
In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
Out[4]= {7.987,Null}
In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
Out[5]= 0.063
In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
Out[6]= 0.016
In[7]:= t0===t1===t2
Out[7]= True
Tomas Garza [tgar...@msn.com]
n = 100000;
data = RandomInteger[{0, 9}, {n, 2}] // N;
Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing
{1.39164,10}
Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing
{0.289784,10}
Bob Hanlon [han...@cox.net]
Take care not to use N as a variable as it already has a built-in meaning.
If it is not necessary to keep the elements of the list in the same order, then a different, lower complexity algorithm can be used:
SplitBy[SortBy[data, First], First][[All, 1]]
This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored. DeleteDuplicates will always keep the very first occurrence of equivalent elements. Is this important for your calculation?
Szabolcs Horvát [szho...@gmail.com]
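To make the ordering caveat above concrete, here is a minimal sketch (the toy data set is made up for illustration):

toy = {{2., 9.}, {1., 8.}, {2., 7.}};

DeleteDuplicates[toy, First[#1] == First[#2] &]
(* {{2., 9.}, {1., 8.}} -- first occurrences, in original order *)

SplitBy[SortBy[toy, First], First][[All, 1]]
(* result is ordered by first element, and the representative kept
   for a repeated key need not be the first occurrence *)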
The fastest solution was Tomas Garza's:

Timing[t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Clint Zeringue [in...@scienceskillz.com]
DeleteDuplicates is not able to recognize that there might be a fast way to use your sameQ, hence it is an O(n^2) proposition when the list length is n. The variant below, which hashes on the first element of each pair, will behave better.
deldupes[ll_] := Module[{h, res},
  res = Reap[
      Do[
       (* h acts as a hash table: mark each first element the first
          time it is seen, and Sow that pair *)
       If[! TrueQ[h[ll[[j, 1]]]], h[ll[[j, 1]]] = True; Sow[ll[[j]]]],
       {j, Length[ll]}]
      ][[2, 1]];
  Clear[h];
  res]
Example:
data = RandomInteger[10^3, {10^4, 2}];
In[27]:= Timing[dd1 = deldupes[data];]
Out[27]= {0.029996, Null}
In[28]:= Timing[dd2 = DeleteDuplicates[data, sameQ];]
Out[28]= {6.28405, Null}
In[31]:= dd2 // Length
Out[31]= 1001
In[32]:= dd1 === dd2
Out[32]= True
Daniel Lichtblau
Wolfram Research
You can use

Tally[data][[All, 1]]

although perhaps this is not the fastest. It seems DeleteDuplicates is slow here because it uses the comparison function to compare each element with all the others, which takes quadratic time in the size of the dataset. It is blazingly fast on packed arrays of numbers, however.

Regards,
Leonid
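A quick way to see the packed-array point is to time the default DeleteDuplicates on packed integer data (a rough sketch; exact timings are machine-dependent):

packed = RandomInteger[100, 10^6];  (* a packed array of machine integers *)
Developer`PackedArrayQ[packed]      (* True *)
Timing[DeleteDuplicates[packed];]   (* fast: no custom SameTest involved *)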
The slowness of DeleteDuplicates comes about when a custom compare function is used, as the following demonstrates:
In[27]:= sameQ[_, _] = False;
sameQ[x_, x_] = True;
In[29]:= data = RandomInteger[100, 2000];
In[30]:= Timing[Length[a = DeleteDuplicates[data]]]
Out[30]= {0.000025,101}
In[31]:= Timing[Length[b = DeleteDuplicates[data, sameQ@## &]]]
Out[31]= {0.215696,101}
In[32]:= a == b
Out[32]= True
The above is simply to illustrate the issue, not to solve your specific problem. For your case, I can get the same result in much less time using GatherBy, as illustrated below:
In[33]:= Clear[sameQ]
In[34]:= sameQ[_, _] = False;
sameQ[{x_, y_}, {x_, z_}] = True;
In[36]:= data = RandomInteger[100, {2000, 2}];
In[37]:= Timing[Length[b = DeleteDuplicates[data, sameQ@## &]]]
Out[37]= {0.246448,101}
In[38]:= Timing[Length[c = First /@ GatherBy[data, First]]]
Out[38]= {0.000957,101}
In[39]:= c == b
Out[39]= True

Bill Rowe [read...@sbcglobal.net]
sameQ@## &
is the same as (and less verbose than)
sameQ
Bobby (DrMajorBob) [btr...@austin.rr.com]
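A minimal check of this equivalence (toy data made up for illustration):

sameQ[_, _] = False;
sameQ[{x_, y_}, {x_, z_}] = True;
toy = {{1, 2}, {1, 3}, {2, 3}};
DeleteDuplicates[toy, sameQ] === DeleteDuplicates[toy, sameQ @ ## &]
(* True *)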
Sorry, I misinterpreted your original problem (I did not notice that you only care about the first element when testing for equality). To compensate for this: below is a solution which will give an order-of-magnitude speedup, *if* you work on (positive) integer values (the case of possible negative values can be incorporated rather easily if needed):
Clear[sparseArrayElements];
(* extract the list of explicit (nonzero) values stored in a SparseArray *)
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicates];
Options[deleteDuplicates] = {Ordered -> True, Threshold -> 1000000};
deleteDuplicates[data_List, opts___?OptionQ] :=
 Module[{fdata = data[[All, 1]], parr,
   rlen = Range[Length[data], 1, -1],
   preserveOrder =
    Ordered /. Flatten[{opts}] /. Options[deleteDuplicates],
   threshold =
    Threshold /. Flatten[{opts}] /. Options[deleteDuplicates], dim},
  dim = Max[fdata];
  (* position array indexed by first elements: dense below the
     threshold, SparseArray above it *)
  parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
  (* assign positions in reverse order so that the earliest
     occurrence of each first element wins *)
  parr[[fdata[[rlen]]]] = rlen;
  parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
  data[[If[preserveOrder, Sort@parr, parr]]]]
This exploits the technique proposed recently by Norbert Pozar. If you don't care about your items staying in their original order, you may set the option Ordered -> False and get some additional speed-up. The Threshold parameter defines the maximal data value (which serves as the length of the auxiliary array) up to which the ordinary dense form of the array is used. If the first element of any data point exceeds the threshold, the SparseArray representation is used.
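To illustrate the Ordered option on a small example (toy data made up; the second elements are arbitrary):

toy = {{3, 10}, {1, 20}, {3, 30}};
deleteDuplicates[toy]                    (* {{3, 10}, {1, 20}} : original order *)
deleteDuplicates[toy, Ordered -> False]  (* {{1, 20}, {3, 10}} : ordered by first element *)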
Benchmarks:
In[6]:= data = RandomInteger[{1, 100000}, {30000, 2}];
In[7]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[7]= 0.19
In[8]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[8]= 0.511
In[9]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[9]= 0.04
In[10]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[10]= 0.01
In[11]:= t1 == t2 == t3
Out[11]= True
In[12]:= data = RandomInteger[{1, 100000}, {500000, 2}];
In[13]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[13]= 2.844
In[14]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[14]= 1.983
In[15]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[15]= 0.241
In[16]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[16]= 0.18
In[17]:= t3 == t2 == t1
Out[17]= True
In[18]:= data = RandomInteger[{1, 10000000}, {100000, 2}];
In[19]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[19]= 1.242
In[20]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[20]= 2.593
In[21]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[21]= 0.281
In[22]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[22]= 0.2
In[23]:= t3 == t2 == t1
Out[23]= True
Regards,
Leonid