152 views

Skip to first unread message

Sep 14, 2010, 5:14:20 AM9/14/10

to

Hi!

I would like your advice about this problem:

I have a list of elements, consisting in phone numbers with some data:

list1= { { phone1, data1}, {phone2, data2}, {phone3, data3} .... }

And a list of phones, a subset of the phone numbers in list1

list2= {phone2, phone3, phone7... }

I'd like to extract the elements from list1 whose phone numbers are

present in list 2, that is:

result= { { phone2, data2}, {phone3, data3, {phone7, data7} .... }

I've used this with small lists and it works fine:

result = Select[list1, MemberQ[list2, #[[1]]] &];

The problem is that now I would like to use it with big lists. list1

is over 1.000.000 elements long and list2 is about 500.000 elements

long. Ordering is not a problem, I could resort the lists.

Any hint to extract this list faster? It seems to take a lot of time

(estimation is about 5 hours and I have to do it repeatedly)

Thanks!

Sep 15, 2010, 4:37:29 AM9/15/10

to

Could do as follows.

overlapsNth[l1_,l2_,n_:1] := Module[

{h, res},

Do[h[l1[[j,n]]] = True, {j,Length[list1]}];

res = Reap[Do[If[TrueQ[h[l2[[j,n]]]], Sow[list2[[j]]]],

{j,Length[l2]}]][[2,1]];

Clear[h];

res

]

I've not thoroughly checked for correctness so you might want to do

that. As for speed, here is an example.

n = 6;

list1 = RandomInteger[10^9, {10^n,2}];

list2 = RandomInteger[10^9, {10^n,2}];

In[24]:= Timing[intersect = overlapsNth[list1,list2,1];]

Out[24]= {3.86441, Null}

In[25]:= Length[intersect]

Out[25]= 1033

Daniel Lichtblau

Wolfram Research

Sep 15, 2010, 4:36:34 AM9/15/10

to

Block[{a = Sort@list1, b = Sort@list2, i = 1, j = 1},

Reap[While[i <= Length@a && j <= Length@b,

Which[a[[i,1]] < b[[j]], i++,

a[[i,1]] > b[[j]], j++,

True, Sow@a[[i]]; i++; j++]]][[2,1]]]

Sep 15, 2010, 4:38:01 AM9/15/10

to

Hi,

The following function:

Clear[memberPositions];

memberPositions[x_List, y_List] :=

Module[{tag},

Pick[Range[Length[x]],

Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}],

tag]]

Is a fast way to find positions in list <x> which are also in <y>, for

arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my

machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do it:

Model your lists:

In[16]:=

list1 =

Transpose[{RandomInteger[{1000000,9999999},1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];

In[17]:= list2 = RandomSample[list1[[All,1]],500000];

Get the result:

In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing

Out[18]=

{5.203,{{9506759,data1},{3854419,data2},{1093656,data3},<<526944>>,{6991722,data999997},{9278309,data1000000}}}

You get more elements here because obviously some phone numbers were

randomly generated more than

once. Be aware though that such large lists consume lots of memory - about

320 Mb here.

The phone numbers could be strings or whatever, the memberPositions function

does not assume anything

about them. With this method, you have an added advantage that the resulting

list is in the original order -

no need to resort. It can also handle repeated phone numbers.

Now, if all of the following conditions are satisfied:

1. Your phone numbers can be represented as machine-precision integers

2. Your list list2 contains only phone numbers from list1

3. The list list1 does not contain repeated phone numbers (list2 may)

Then it is possible to use the fact that Sort and Ordering are very fast on

packed arrays. Additional idea here

is to combine vectorized operations on packed arrays with Compile (this

combination often results in great

performance in Mathematica), to get a version of a binary search which finds

positions of many elements in a sorted list at once.The initial list of

phone numbers is ordered first, then positions of elements in list2 are

found by a massive binary search function, and finally the positions in the

original unsorted list are recovered. The resulting more restrictive version

of memberPositions will be about 3 times faster than the above general one

for the

lists of the size you requested:

Clear[memberPositionsInteger];

Module[{bsearchMassiveAux, bsearchMassive},

bsearchMassiveAux =

Compile[{{list, _Integer, 1}, {elems, _Integer,

1}, {ones, _Integer, 1}},

Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,

m = ones,

diff = ones, un1 = ones, un2 = ones, flags = ones,

zeros = 0*ones},

n1 = Length[list]*n0;

While[flags != zeros,

m = m + (Floor[(n0 + n1)/2] - m)*Unitize[flags];

flags = list[[m]] - elems;

un1 = UnitStep[flags - 1];

un2 = UnitStep[-flags - 1];

n1 = n1*un2 + (m - 1)*un1;

n0 = n0*un1 + (m + 1)*un2;];

m]];

bsearchMassive[main_, elems_] :=

bsearchMassiveAux[main, elems, ConstantArray[1, {Length[elems]}]];

memberPositionsInteger[x_, y_] :=

With[{xpacked = Developer`ToPackedArray[x],

ypacked = Developer`ToPackedArray[y]},

With[{ord = Ordering[xpacked]},

Sort@ord[[bsearchMassive[xpacked[[ord]], ypacked]]]]]

];

Here is the timing comparison on a fresh kernel:

In[5]:= list1 =

Transpose[{RandomSample[Range[1000000,9999999],1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];

In[6]:= list2 = RandomSample[list1[[All,1]],500000];

In[7]:= list1[[memberPositionsInteger[list1[[All,1]],list2]]]//Short//Timing

Out[7]=

{2.015,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}

In[8]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing

Out[8]=

{5.485,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}

So, here we are down from your 5 hours to about 2 seconds, which looks

pretty good to me. With your Select-MemberQ method, one can not expect fast

results since its complexity is ~Length[list1]*Length[list2]. The

complexity of the memberPositionsInteger is roughly

C1*Log[Length[list1]]*Length[list2] +

C2*Log[Length[list1]]*Length[list1]+C3*Log[Length[list2]]*Length[list2], but

C2 and C3 are pretty low since they are controlled by the built-in code, and

C1 is not too high either since both Compile and packed arrays have been

fully utilized in bsearchMassiveAux. In the case of memberPositions, the

bottleneck becomes the Dispatch-based rule application, since it is

hash-table based and, although liner in Length[x1], has a relatively large

time constant. OTOH, it is much more general, and the code is simpler.

Hope this helps.

Regards,

Leonid

Sep 15, 2010, 4:39:31 AM9/15/10

to

Am Tue, 14 Sep 2010 09:14:20 +0000 (UTC)

schrieb Nacho <ncc17...@gmail.com>:

schrieb Nacho <ncc17...@gmail.com>:

Hi Nacho,

you can convert list1 to a list of rules by

phonedatarules = Dispatch[Rule @@@ list1];

this needs only to be done once and needs ~11 seconds with 10^6

entries in list1 on my laptop.

Then use

thewanteddata=Transpose[{list2,list2/.phonedatarules}];

(~8 seconds with a random list of 464696 entries)

and you'll get sth. like

Take[thewanteddata,5]//InputForm

-->

{

{Subscript[phone, 764884], Subscript[data, 764884]},

{Subscript[phone, 127257], Subscript[data, 127257]},

{Subscript[phone, 113745], Subscript[data, 113745]},

{Subscript[phone, 279308], Subscript[data, 279308]},

{Subscript[phone, 478296], Subscript[data, 478296]}

}

Regards,

Peter

Sep 15, 2010, 4:40:23 AM9/15/10

to

Good day,

Assuming both lists are sorted, you could match

them this way (an example with size n=10000):

In[1]:= n = 10000;

list1 = Table[{k,RandomInteger[{1,n}]},{k,1,n}] ;

list2 = list1[[All,1]][[Floor[n/4];; Floor[3n/4]]] ;

In[4]:= (list3 = {#,True}& /@ list2;

list4 = {#,False}& /@ Complement[list1[[All,1]],list2];

list5 = Union[list3, list4];

list6 = Transpose[{list1, list5}];

Cases[list6,{{_,_},{_,True}}][[All,1]])//Timing//Short

Out[4]//Short= {0.094,{{2500,8261},{2501,8320},<<4997>>,{7499,6791},

{7500,943}}}

In[5]:= Select[list1,MemberQ[list2,#[[1]]]&]//Timing//Short

Out[5]//Short= {9.219,{{2500,8261},{2501,8320},<<4997>>,{7499,6791},

{7500,943}}}

In[7]:= 9.219/0.094

Out[7]= 98.0745

Matching lists seems to be 100 times faster then MemberQ

hth

--

Valeri Astanoff

Sep 15, 2010, 4:40:46 AM9/15/10

to

Dispatch should be a good candidate:

In[1]:= list1 = {{phone1, data1}, {phone2, data2}, {phone3, data3},

{phone4, data4}, {phone5, data5}, {phone6, data6}, {phone7, data7},

{phone8, data8}};

rules = Dispatch[Apply[#1 -> {##1} & , list1, {1}]]

Out[2]= Dispatch[{phone1 -> {phone1, data1}, phone2 -> {phone2, data2},

phone3 -> {phone3, data3}, phone4 -> {phone4, data4},

phone5 -> {phone5, data5}, phone6 -> {phone6, data6},

phone7 -> {phone7, data7}, phone8 -> {phone8, data8}}]

In[5]:= list2 = {phone2, phone3, phone7};

list2 /. rules

Out[6]= {{phone2, data2}, {phone3, data3}, {phone7, data7}}

Adriano Pascoletti

2010/9/14 Nacho <ncc17...@gmail.com>

Sep 15, 2010, 4:40:56 AM9/15/10

to

On 9/14/2010 4:14 AM, Nacho wrote:

> Hi!

>

> I would like your advice about this problem:

>

> I have a list of elements, consisting in phone numbers with some data:

>

> list1= { { phone1, data1}, {phone2, data2}, {phone3, data3} .... }

>

> And a list of phones, a subset of the phone numbers in list1

>

> list2= {phone2, phone3, phone7... }

>

> I'd like to extract the elements from list1 whose phone numbers are

> present in list 2, that is:

>

> result= { { phone2, data2}, {phone3, data3, {phone7, data7} .... }

>

> I've used this with small lists and it works fine:

>

> result = Select[list1, MemberQ[list2, #[[1]]]&];

>

> The problem is that now I would like to use it with big lists. list1

> is over 1.000.000 elements long and list2 is about 500.000 elements

> long. Ordering is not a problem, I could resort the lists.

>

> Any hint to extract this list faster? It seems to take a lot of time

> (estimation is about 5 hours and I have to do it repeatedly)

>

>

> Thanks!

>

>

> Hi!

>

> I would like your advice about this problem:

>

> I have a list of elements, consisting in phone numbers with some data:

>

> list1= { { phone1, data1}, {phone2, data2}, {phone3, data3} .... }

>

> And a list of phones, a subset of the phone numbers in list1

>

> list2= {phone2, phone3, phone7... }

>

> I'd like to extract the elements from list1 whose phone numbers are

> present in list 2, that is:

>

> result= { { phone2, data2}, {phone3, data3, {phone7, data7} .... }

>

> I've used this with small lists and it works fine:

>

> result = Select[list1, MemberQ[list2, #[[1]]]&];

>

> The problem is that now I would like to use it with big lists. list1

> is over 1.000.000 elements long and list2 is about 500.000 elements

> long. Ordering is not a problem, I could resort the lists.

>

> Any hint to extract this list faster? It seems to take a lot of time

> (estimation is about 5 hours and I have to do it repeatedly)

>

>

> Thanks!

>

>

One possibility is to use an auxiliary function. I'll call it hashed.

Clear[hashed]

SetAttributes[hashed, Listable]

With[{lhs = hashed[list1[[All, 1]]]}, lhs = list1];

Then, use:

hashed[list2]

to get your result.

Carl Woll

Wolfram Research

Sep 15, 2010, 8:05:08 PM9/15/10

to

Hi,

This is a follow-up to my previous post. Since you mentioned that the order

of elements is

not an issue, there is another general solution, both simpler and faster

than the one based on

memberPositions function, but slower than that based on

memberPositionsInteger function

from my previous post:

The lists:

In[1]:= list1 =

Transpose[{RandomSample[Range[1000000,9999999],1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];

In[2]:= list2 = RandomSample[list1[[All,1]],500000];

The solution:

In[3]:= list2/.Dispatch[Thread[#[[All,1]]->#]&[list1]]//Short//Timing

Out[3]=

{3.234,{{6059339,data697996},{6645675,data173455},<<499996>>,{5345183,data604454},{8230717,data405701}}}

The price to pay is that here the original ordering of the phone data is

lost. The other difference w.r.t the solution based on memberPositions is

that if some phones are repeated with different data attached, this one will

repeat the first entry rather than list all different entries correctly,

which is what the memberPositions-based solution does. The more specialized

solution from my previous reply is still about 1.5x faster than this one and

does preserve the original ordering.

Regards,

Leonid

Sep 16, 2010, 6:00:27 AM9/16/10

to

More generalized:

ClearAll[find];

find[keys_, vals_, find_, hole_: $Failed] := Module[{x},

Evaluate[x /@ keys] = vals;

x[_] = hole;

x /@ find

];

Called via:

find[list1[[All,1]], list1[[All,2]], list2]

Although, if you need to do multiple lookups, you'd want to save the

mapping.

Sep 16, 2010, 6:01:32 AM9/16/10

to

On Sep 15, 1:38 am, Leonid Shifrin <lsh...@gmail.com> wrote:

> Hi,

>

> The following function:

>

> Clear[memberPositions];

> memberPositions[x_List, y_List] :=

> Module[{tag},

> Pick[Range[Length[x]],

> Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}],

> tag]]

>

> Is a fast way to find positions in list <x> which are also in <y>, for

> arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my

> machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do it:

>

> Model your lists:

>

> In[16]:= list1 = Transpose[

> {RandomInteger[{1000000,9999999},1000000],

> Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];

>

> In[17]:= list2 = RandomSample[list1[[All,1]],500000];

>

> Get the result:

>

> In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing

> Out[18]= {5.203,{{9506759,data1},{3854419,data2},{1093656,data3},

> <<526944>>,{6991722,data999997},{9278309,data1000000}}}

> Hi,

>

> The following function:

>

> Clear[memberPositions];

> memberPositions[x_List, y_List] :=

> Module[{tag},

> Pick[Range[Length[x]],

> Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}],

> tag]]

>

> Is a fast way to find positions in list <x> which are also in <y>, for

> arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my

> machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do it:

>

> Model your lists:

>

> In[16]:= list1 = Transpose[

> {RandomInteger[{1000000,9999999},1000000],

> Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];

>

> In[17]:= list2 = RandomSample[list1[[All,1]],500000];

>

> Get the result:

>

> In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing

> Out[18]= {5.203,{{9506759,data1},{3854419,data2},{1093656,data3},

> <<526944>>,{6991722,data999997},{9278309,data1000000}}}

This is very quick the first time thru, but each successive call

slows by an increasing amount. Here are 5 sequential sample times

for the same two input lists: {4.84, 5.34, 6.39, 7.21, 8.95}.

What's causing the slowdown, and how can it be fixed?

Sep 17, 2010, 6:43:54 AM9/17/10

to

Ray,

I have no good explanation for this. I can confirm most of your

observations, except

that in my case the timing stabilizes around 8 seconds, which is about 2x of

the first

(fresh-kernel) result. I suspect that this has to do with some shared memory

(references).

In particular, clearing history like so:

Unprotect[Out];

DownValues[Out] = {};

Protect[Out];

Seems to help a little.

The other thing which comes to mind is that this could be due to some

internals of Dispatch.

When we use Dispatch, some hash-table is created internally. When we abandon

a local

variable which stored the Dispatched rules, presumably the hash table has to

also be destroyed.

Question is, when does this happen, and in case if it is not destroyed

timely, would it slow

down the operation of another Dispatch in some way. I did not observe such

behavior before,

but here Dispatch was used on a pretty large list, so may be this effect is

only visible for

large sets of data. OTOH, my code using Dispatch only:

In[28]:= list21/.Dispatch[Thread[#[[All,1]]->#]&[list11]]//Short//Timing

Out[28]=

{3.312,{{2707844,data121541},{4171257,data402011},<<499996>>,{9613764,data217507},{6452511,data971822}}}

Does not seem to suffer much from this problem, so it may be due to

the entanglement of

Dispatch and Module removing its local variable <tag> from the symbol table.

Yet another reason is that the way I generate (model) data, I create 1000000

symbols in the

Global` context. This seems to complicate the symbol lookup. If, for the

sake of example,

I modify the memberPositions function as:

Clear[memberPositions];

memberPositions[x_List, y_List] :=

Pick[Range[Length[x]],

Replace[x, Dispatch[Thread[Rule[Intersection[x, y], "tag"]]], {1}],

"tag"]

and also generate the data as strings rather than symbols:

list11 = Transpose[{RandomInteger[{1000000, 9999999}, 1000000],

Table["data" <> ToString[i], {i, 1, 1000000}]}];

list21 = RandomSample[list11[[All, 1]], 500000];

Then the problem is much less pronounced (you must re-start the kernel to

see it).

I suspect that with so many symbols in the Global`, some kind of re-hashing

may

happen as a result of running memberPositions (since it has a local variable

<tag>,

which temporarily enters the symbol table), and possibly these re-hashs can

accumulate

and lead to the overall slow-down.

Anyways, I subscribe to your question and hope that someone will

enlighten us.

Regards,

Leonid

Sep 17, 2010, 6:40:19 AM9/17/10

to

Hi group!

Thanks a lot to all you for your nice answers. You have been very

helpful.

Finally, I decided to use the Dispatch[] version, it is fast and

elegant, I think. Easy to understand next month/year if I need to

revisit the code.

I knew about the Dispatch but I never though it was so fast! I'm

impressed.

Thanks again!

Regards.

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu