I would like your advice about this problem:
I have a list of elements, consisting in phone numbers with some data:
list1= { { phone1, data1}, {phone2, data2}, {phone3, data3} .... }
And a list of phones, a subset of the phone numbers in list1
list2= {phone2, phone3, phone7... }
I'd like to extract the elements from list1 whose phone numbers are
present in list 2, that is:
result= { { phone2, data2}, {phone3, data3, {phone7, data7} .... }
I've used this with small lists and it works fine:
result = Select[list1, MemberQ[list2, #[[1]]] &];
The problem is that now I would like to use it with big lists. list1
is over 1.000.000 elements long and list2 is about 500.000 elements
long. Ordering is not a problem, I could resort the lists.
Any hint to extract this list faster? It seems to take a lot of time
(estimation is about 5 hours and I have to do it repeatedly)
Thanks!
Could do as follows.
overlapsNth[l1_,l2_,n_:1] := Module[
{h, res},
Do[h[l1[[j,n]]] = True, {j,Length[list1]}];
res = Reap[Do[If[TrueQ[h[l2[[j,n]]]], Sow[list2[[j]]]],
{j,Length[l2]}]][[2,1]];
Clear[h];
res
]
I've not thoroughly checked for correctness so you might want to do
that. As for speed, here is an example.
n = 6;
list1 = RandomInteger[10^9, {10^n,2}];
list2 = RandomInteger[10^9, {10^n,2}];
In[24]:= Timing[intersect = overlapsNth[list1,list2,1];]
Out[24]= {3.86441, Null}
In[25]:= Length[intersect]
Out[25]= 1033
Daniel Lichtblau
Wolfram Research
Block[{a = Sort@list1, b = Sort@list2, i = 1, j = 1},
Reap[While[i <= Length@a && j <= Length@b,
Which[a[[i,1]] < b[[j]], i++,
a[[i,1]] > b[[j]], j++,
True, Sow@a[[i]]; i++; j++]]][[2,1]]]
The following function:
Clear[memberPositions];
memberPositions[x_List, y_List] :=
Module[{tag},
Pick[Range[Length[x]],
Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}],
tag]]
Is a fast way to find positions in list <x> which are also in <y>, for
arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my
machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do it:
Model your lists:
In[16]:=
list1 =
Transpose[{RandomInteger[{1000000,9999999},1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];
In[17]:= list2 = RandomSample[list1[[All,1]],500000];
Get the result:
In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing
Out[18]=
{5.203,{{9506759,data1},{3854419,data2},{1093656,data3},<<526944>>,{6991722,data999997},{9278309,data1000000}}}
You get more elements here because obviously some phone numbers were
randomly generated more than
once. Be aware though that such large lists consume lots of memory - about
320 Mb here.
The phone numbers could be strings or whatever, the memberPositions function
does not assume anything
about them. With this method, you have an added advantage that the resulting
list is in the original order -
no need to resort. It can also handle repeated phone numbers.
Now, if all of the following conditions are satisfied:
1. Your phone numbers can be represented as machine-precision integers
2. Your list list2 contains only phone numbers from list1
3. The list list1 does not contain repeated phone numbers (list2 may)
Then it is possible to use the fact that Sort and Ordering are very fast on
packed arrays. Additional idea here
is to combine vectorized operations on packed arrays with Compile (this
combination often results in great
performance in Mathematica), to get a version of a binary search which finds
positions of many elements in a sorted list at once.The initial list of
phone numbers is ordered first, then positions of elements in list2 are
found by a massive binary search function, and finally the positions in the
original unsorted list are recovered. The resulting more restrictive version
of memberPositions will be about 3 times faster than the above general one
for the
lists of the size you requested:
Clear[memberPositionsInteger];
Module[{bsearchMassiveAux, bsearchMassive},
bsearchMassiveAux =
Compile[{{list, _Integer, 1}, {elems, _Integer,
1}, {ones, _Integer, 1}},
Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
m = ones,
diff = ones, un1 = ones, un2 = ones, flags = ones,
zeros = 0*ones},
n1 = Length[list]*n0;
While[flags != zeros,
m = m + (Floor[(n0 + n1)/2] - m)*Unitize[flags];
flags = list[[m]] - elems;
un1 = UnitStep[flags - 1];
un2 = UnitStep[-flags - 1];
n1 = n1*un2 + (m - 1)*un1;
n0 = n0*un1 + (m + 1)*un2;];
m]];
bsearchMassive[main_, elems_] :=
bsearchMassiveAux[main, elems, ConstantArray[1, {Length[elems]}]];
memberPositionsInteger[x_, y_] :=
With[{xpacked = Developer`ToPackedArray[x],
ypacked = Developer`ToPackedArray[y]},
With[{ord = Ordering[xpacked]},
Sort@ord[[bsearchMassive[xpacked[[ord]], ypacked]]]]]
];
Here is the timing comparison on a fresh kernel:
In[5]:= list1 =
Transpose[{RandomSample[Range[1000000,9999999],1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];
In[6]:= list2 = RandomSample[list1[[All,1]],500000];
In[7]:= list1[[memberPositionsInteger[list1[[All,1]],list2]]]//Short//Timing
Out[7]=
{2.015,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}
In[8]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing
Out[8]=
{5.485,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}
So, here we are down from your 5 hours to about 2 seconds, which looks
pretty good to me. With your Select-MemberQ method, one can not expect fast
results since its complexity is ~Length[list1]*Length[list2]. The
complexity of the memberPositionsInteger is roughly
C1*Log[Length[list1]]*Length[list2] +
C2*Log[Length[list1]]*Length[list1]+C3*Log[Length[list2]]*Length[list2], but
C2 and C3 are pretty low since they are controlled by the built-in code, and
C1 is not too high either since both Compile and packed arrays have been
fully utilized in bsearchMassiveAux. In the case of memberPositions, the
bottleneck becomes the Dispatch-based rule application, since it is
hash-table based and, although liner in Length[x1], has a relatively large
time constant. OTOH, it is much more general, and the code is simpler.
Hope this helps.
Regards,
Leonid
Hi Nacho,
you can convert list1 to a list of rules by
phonedatarules = Dispatch[Rule @@@ list1];
this needs only to be done once and needs ~11 seconds with 10^6
entries in list1 on my laptop.
Then use
thewanteddata=Transpose[{list2,list2/.phonedatarules}];
(~8 seconds with a random list of 464696 entries)
and you'll get sth. like
Take[thewanteddata,5]//InputForm
-->
{
{Subscript[phone, 764884], Subscript[data, 764884]},
{Subscript[phone, 127257], Subscript[data, 127257]},
{Subscript[phone, 113745], Subscript[data, 113745]},
{Subscript[phone, 279308], Subscript[data, 279308]},
{Subscript[phone, 478296], Subscript[data, 478296]}
}
Regards,
Peter
Good day,
Assuming both lists are sorted, you could match
them this way (an example with size n=10000):
In[1]:= n = 10000;
list1 = Table[{k,RandomInteger[{1,n}]},{k,1,n}] ;
list2 = list1[[All,1]][[Floor[n/4];; Floor[3n/4]]] ;
In[4]:= (list3 = {#,True}& /@ list2;
list4 = {#,False}& /@ Complement[list1[[All,1]],list2];
list5 = Union[list3, list4];
list6 = Transpose[{list1, list5}];
Cases[list6,{{_,_},{_,True}}][[All,1]])//Timing//Short
Out[4]//Short= {0.094,{{2500,8261},{2501,8320},<<4997>>,{7499,6791},
{7500,943}}}
In[5]:= Select[list1,MemberQ[list2,#[[1]]]&]//Timing//Short
Out[5]//Short= {9.219,{{2500,8261},{2501,8320},<<4997>>,{7499,6791},
{7500,943}}}
In[7]:= 9.219/0.094
Out[7]= 98.0745
Matching lists seems to be 100 times faster then MemberQ
hth
--
Valeri Astanoff
In[1]:= list1 = {{phone1, data1}, {phone2, data2}, {phone3, data3},
{phone4, data4}, {phone5, data5}, {phone6, data6}, {phone7, data7},
{phone8, data8}};
rules = Dispatch[Apply[#1 -> {##1} & , list1, {1}]]
Out[2]= Dispatch[{phone1 -> {phone1, data1}, phone2 -> {phone2, data2},
phone3 -> {phone3, data3}, phone4 -> {phone4, data4},
phone5 -> {phone5, data5}, phone6 -> {phone6, data6},
phone7 -> {phone7, data7}, phone8 -> {phone8, data8}}]
In[5]:= list2 = {phone2, phone3, phone7};
list2 /. rules
Out[6]= {{phone2, data2}, {phone3, data3}, {phone7, data7}}
Adriano Pascoletti
2010/9/14 Nacho <ncc17...@gmail.com>
One possibility is to use an auxiliary function. I'll call it hashed.
Clear[hashed]
SetAttributes[hashed, Listable]
With[{lhs = hashed[list1[[All, 1]]]}, lhs = list1];
Then, use:
hashed[list2]
to get your result.
Carl Woll
Wolfram Research
This is a follow-up to my previous post. Since you mentioned that the order
of elements is
not an issue, there is another general solution, both simpler and faster
than the one based on
memberPositions function, but slower than that based on
memberPositionsInteger function
from my previous post:
The lists:
In[1]:= list1 =
Transpose[{RandomSample[Range[1000000,9999999],1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];
In[2]:= list2 = RandomSample[list1[[All,1]],500000];
The solution:
In[3]:= list2/.Dispatch[Thread[#[[All,1]]->#]&[list1]]//Short//Timing
Out[3]=
{3.234,{{6059339,data697996},{6645675,data173455},<<499996>>,{5345183,data604454},{8230717,data405701}}}
The price to pay is that here the original ordering of the phone data is
lost. The other difference w.r.t the solution based on memberPositions is
that if some phones are repeated with different data attached, this one will
repeat the first entry rather than list all different entries correctly,
which is what the memberPositions-based solution does. The more specialized
solution from my previous reply is still about 1.5x faster than this one and
does preserve the original ordering.
Regards,
Leonid
More generalized:
ClearAll[find];
find[keys_, vals_, find_, hole_: $Failed] := Module[{x},
Evaluate[x /@ keys] = vals;
x[_] = hole;
x /@ find
];
Called via:
find[list1[[All,1]], list1[[All,2]], list2]
Although, if you need to do multiple lookups, you'd want to save the
mapping.
This is very quick the first time thru, but each successive call
slows by an increasing amount. Here are 5 sequential sample times
for the same two input lists: {4.84, 5.34, 6.39, 7.21, 8.95}.
What's causing the slowdown, and how can it be fixed?
I have no good explanation for this. I can confirm most of your
observations, except
that in my case the timing stabilizes around 8 seconds, which is about 2x of
the first
(fresh-kernel) result. I suspect that this has to do with some shared memory
(references).
In particular, clearing history like so:
Unprotect[Out];
DownValues[Out] = {};
Protect[Out];
Seems to help a little.
The other thing which comes to mind is that this could be due to some
internals of Dispatch.
When we use Dispatch, some hash-table is created internally. When we abandon
a local
variable which stored the Dispatched rules, presumably the hash table has to
also be destroyed.
Question is, when does this happen, and in case if it is not destroyed
timely, would it slow
down the operation of another Dispatch in some way. I did not observe such
behavior before,
but here Dispatch was used on a pretty large list, so may be this effect is
only visible for
large sets of data. OTOH, my code using Dispatch only:
In[28]:= list21/.Dispatch[Thread[#[[All,1]]->#]&[list11]]//Short//Timing
Out[28]=
{3.312,{{2707844,data121541},{4171257,data402011},<<499996>>,{9613764,data217507},{6452511,data971822}}}
Does not seem to suffer much from this problem, so it may be due to
the entanglement of
Dispatch and Module removing its local variable <tag> from the symbol table.
Yet another reason is that the way I generate (model) data, I create 1000000
symbols in the
Global` context. This seems to complicate the symbol lookup. If, for the
sake of example,
I modify the memberPositions function as:
Clear[memberPositions];
memberPositions[x_List, y_List] :=
Pick[Range[Length[x]],
Replace[x, Dispatch[Thread[Rule[Intersection[x, y], "tag"]]], {1}],
"tag"]
and also generate the data as strings rather than symbols:
list11 = Transpose[{RandomInteger[{1000000, 9999999}, 1000000],
Table["data" <> ToString[i], {i, 1, 1000000}]}];
list21 = RandomSample[list11[[All, 1]], 500000];
Then the problem is much less pronounced (you must re-start the kernel to
see it).
I suspect that with so many symbols in the Global`, some kind of re-hashing
may
happen as a result of running memberPositions (since it has a local variable
<tag>,
which temporarily enters the symbol table), and possibly these re-hashs can
accumulate
and lead to the overall slow-down.
Anyways, I subscribe to your question and hope that someone will
enlighten us.
Regards,
Leonid
Thanks a lot to all you for your nice answers. You have been very
helpful.
Finally, I decided to use the Dispatch[] version, it is fast and
elegant, I think. Easy to understand next month/year if I need to
revisit the code.
I knew about the Dispatch but I never though it was so fast! I'm
impressed.
Thanks again!
Regards.