Combining data from indexed lists efficiently

Steve W. Brewer

unread,

Jan 4, 2010, 6:01:09 AM1/4/10

to

I have several lists of the format:

{ {index1, value}, {index2, value}, ... {indexN, value} }

For example:

list1 = { {"A", 1}, {"B", 2}, {"C", 3}, {"D", 4} }
list2 = { {"A", 5}, {"B", 6}, {"D", 7}, {"E", 8} }
list3 = { {"A", 9}, {"B", 10}, {"C", 11} }

The indexes are not necessarily strings; they may be any expression. (In
the specific case I'm addressing now, each index is a list representing a
date/time in the format returned by DateList[].) The lists are not
necessarily the same length. Also, while most of the indexes appear in all
lists, there are some holes (missing data).

I want to combine the lists into a single list of the format:

{ { index1, {value1, value2, ... valueN} },
{ index2, {value1, value2, ... valueN} },
...
{ indexN, {value1, value2, ... valueN} } }

Only the data points with indexes appearing in all lists should be included;
the rest should be dropped. Also, I want to include some derived values
along with the original data values.

Using the sample data above, let's say I want to include two derived values
from the functions:

f1[list1Data_, list2Data_] := list1Data + list2Data
f2[list2Data_, list3Data_] := list2Data + list3Data

The result would be:

combinedList = { { "A", {1, 5, 9, 6, 14} },
{ "B", {2, 6, 10, 8, 16} } }

I have a solution that works fine on "small" data sets. However, it's
impractically slow on the "large" data sets I really need to run it on (over
100k elements in each list).

Here's what I'm doing now:

(* This part executes pretty quickly *)

indexesToUse =
Intersection[First /@ list1, First /@ list2, First /@ list3];

valueAtIndex[index_, list_] :=
Cases[list, {index, _}, 1, 1] // First // Last;

dataAtIndex[index_] := Block[
{v1, v2, v3, vf1, vf2},

v1 = valueAtIndex[index, list1];
v2 = valueAtIndex[index, list2];
v3 = valueAtIndex[index, list3];

vf1 = f1[v1, v2];
vf2 = f2[v2, v3];

{v1, v2, v3, vf1, vf2}
];

(* This is where it bogs down *)

combinedList =
Function[{index}, {index, dataAtIndex[index]}] /@ indexesToUse;

This is all inside an enclosing Module[] along with some other code, and the
actual code is a little more complex (e.g. more than three lists, more than
two derived-value functions). The derived-value functions themselves are
mostly simple algebra; I doubt they're the source of the bottleneck, and in
any case, I can't change them. (I *can* change the way they're applied,
though, if it makes a difference.)

I *think* the bottleneck is probably in my repeated calls to Cases[] to find
particular data points, but that's just a guess.

Is there a more efficient way of doing this that would speed things up
significantly?

Thanks!

Steve W. Brewer

dh

unread,

Jan 5, 2010, 1:41:57 AM1/5/10

to

Hi Steve,

one way to do it:

1-Join the lists

2-gather all entries with the same label

3-separate label and values

The following code will do this:

t = Join[list1, list2, list3];

GatherBy[t, First] /. x : {{a_, _} ..} :> {a, x[[All, 2]] }

Daniel

Leonid Shifrin

unread,

Jan 5, 2010, 1:44:38 AM1/5/10

to

Hi Steve,

I think you correctly identified at least some of your bottlenecks.
Your code is fine as long as the number of different indices is small. For
large number of indices, using Cases to find individual data element indeed
becomes wasteful. Instead, you can use the fact that the indices you want to
keep should be in every list. This means that your final structure will be
rectangular. In the implementation below I have used that:

(* Slightly improved over yours *)
Clear[indexesToUseNew];
indexesToUseNew[lists__List] :=
Intersection @@ List[lists][[All, All, 1]];

(* Collect all results in a single run using GatherBy *)
Clear[coreValuesFromLists];
coreValuesFromLists[lists__List] :=
With[{len = Length[{lists}]},
Transpose[{#[[All, 1, 1]], #[[All, All, 2]]}] &@
Cases[GatherBy[Join[lists], First], x_ /; Length[x] == len]];

(* This is optimized assuming that your functions for derived values \
only use algebraic operations *)
Clear[addDerivedValues];
addDerivedValues[values_?MatrixQ, {}] := values;
addDerivedValues[values_?MatrixQ, functions_List] :=
Flatten[Transpose[
{values,
Transpose@Through[functions[Transpose[values]]]}], {{1}, {2, 3}}];

(* Putting everything together *)
Clear[valuesFromLists];
valuesFromLists[{lists__List}, functions : _List : {}] :=
If[# === {}, {},
Transpose[{#[[All, 1]], addDerivedValues[#[[All, 2]], functions]}]]
&@coreValuesFromLists[lists];

The logic is to first compute the core values and then add derived values
obtained through application of your functions - the latter you have to
supply in a list as a second argument. Here is how this works on your toy
example:

In[5]:= indexesToUseNew[list1, list2, list3]

Out[5]= {"A", "B"}

In[6]:= coreValuesFromLists[list1, list2, list3]

Out[6]= {{"A", {1, 5, 9}}, {"B", {2, 6, 10}}}

In[7]:= addDerivedValues[{{1, 5, 9}, {2, 6,
10}}, {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[7]= {{1, 5, 9, 6, 14}, {2, 6, 10, 8, 16}}

In[8]:= valuesFromLists[{list1, list2,
list3}, {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]

Out[8]= {{"A", {1, 5, 9, 6, 14}}, {"B", {2, 6, 10, 8, 16}}}

Here is your code which I modified somewhat so that it can work on any
number of lists:

Clear[f1, f2, indexesToUse, valueAtIndex, dataAtIndex, combinedList];

f1[list1Data_, list2Data_] := list1Data + list2Data
f2[list2Data_, list3Data_] := list2Data + list3Data

indexesToUse[lists__List] := indexesToUseNew[lists];

valueAtIndex[index_, list_] :=
Cases[list, {index, _}, 1, 1] // First // Last;

dataAtIndex[{lists__}, index_] := Block[{vf1, vf2, values},
values = valueAtIndex[index, #] & /@ {lists};
vf1 = f1[values[[1]], values[[2]]];
vf2 = f2[values[[2]], values[[3]]];
{Sequence @@ values, vf1, vf2}];

combinedList[lists__] :=
Function[{index}, {index, dataAtIndex[{lists}, index]}] /@
indexesToUse[lists];

Here is the code for a test generator to produce a test data with larger
numbers of lists and indexes:

Clear[makeTestLists];
makeTestLists[listNum_Integer, indexNum_Integer,
usedFraction : (_?NumericQ) : 0.5,
valueRange : {_, _} : {1, 100}] /; 0 < usedFraction <= 1 :=
Module[{used, indices, generateUsedIndices, generateRandomIndices,
randomPermutation, addValues,
indexLetters = Characters["ABCDEFGHIJ"]},
indices =
Flatten@Outer[StringJoin, indexLetters,
ToString /@ Range[indexNum]];
used = IntegerPart[Length[indices]*usedFraction];
randomPermutation = RandomSample[#, Length[#]] &;
addValues = Map[{#, RandomInteger[valueRange]} &, #] &;
generateUsedIndices[] := Take[indices, used];
generateRandomIndices[] :=
RandomSample[#, RandomInteger[{Length[#] - 1, Length[#]}]] &@
Drop[indices, used];
Table[randomPermutation@
addValues[
generateUsedIndices[]~Join~
generateRandomIndices[]], {listNum}]];

It produces the list of your lists, with total number of them <listNum>,
total number of different numerical indices for a fixed index letter
<indexNum>, with <usedFraction> of total indices present in all lists and
the rest probabilistically, and with the values in range <valueRange>. For
example:

In[10]:= makeTestLists[5, 1]

Out[10]= {{{"B1", 39}, {"I1", 96}, {"E1", 55}, {"G1", 30}, {"A1",
23}, {"H1", 16}, {"J1", 59}, {"C1", 6}, {"D1", 39}}, {{"B1",
13}, {"C1", 45}, {"D1", 75}, {"A1", 3}, {"J1", 26}, {"I1",
39}, {"H1", 41}, {"F1", 9}, {"G1", 39}, {"E1", 35}}, {{"J1",
78}, {"C1", 59}, {"D1", 69}, {"H1", 92}, {"A1", 13}, {"F1",
75}, {"I1", 7}, {"B1", 15}, {"G1", 43}, {"E1", 40}}, {{"I1",
61}, {"H1", 80}, {"B1", 13}, {"E1", 20}, {"F1", 56}, {"A1",
86}, {"G1", 89}, {"J1", 72}, {"D1", 12}, {"C1", 57}}, {{"J1",
54}, {"D1", 29}, {"E1", 65}, {"F1", 34}, {"A1", 18}, {"B1",
61}, {"G1", 88}, {"I1", 91}, {"C1", 87}}}

When only one index persists in all lists, your method is slightly faster:

In[11]:= test = makeTestLists[1000, 5, 0.03];

In[12]:= (res1 =
valuesFromLists[test, {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) //
Shallow // Timing

Out[12]= {0.13, {{"A1", {<<1002>>}}}}

In[13]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[13]= {0.091, {{"A1", {<<1002>>}}}}

In[14]:= Sort@res1 === Sort@res2

Out[14]= True

In most other cases, my code is faster or much faster. In the opposite case
with 5 lists and 1000 indices, for example:

In[15]:= test = makeTestLists[5, 100];

In[16]:= (res1 =
valuesFromLists[test, {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) //
Shallow // Timing

Out[16]= {0.02, {{"G3", {<<7>>}}, {"C20", {<<7>>}}, {"H27", \
{<<7>>}}, {"E3", {<<7>>}}, {"E90", {<<7>>}}, {"A5", {<<7>>}}, {"F58", \
{<<7>>}}, {"B59", {<<7>>}}, {"A13", {<<7>>}}, {"G70", {<<7>>}}, \
<<988>>}}

In[17]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[17]= {2.163, {{"A1", {<<7>>}}, {"A10", {<<7>>}}, {"A100", \
{<<7>>}}, {"A11", {<<7>>}}, {"A12", {<<7>>}}, {"A13", {<<7>>}}, \
{"A14", {<<7>>}}, {"A15", {<<7>>}}, {"A16", {<<7>>}}, {"A17", \
{<<7>>}}, <<988>>}}

In[18]:= Sort@res1 === Sort@res2

Out[18]= True

Cases like the following with 300 lists and 100 indices are somewhere in
between (Cases is still reasonably fast for lists of length ~ 100):

In[19]:= test = makeTestLists[300, 10];

In[20]:= (res1 =
valuesFromLists[test, {#[[1]] + #[[2]] &, #[[2]] + #[[3]] &}]) //
Shallow // Timing

Out[20]= {0.07, {{"A8", {<<302>>}}, {"D1", {<<302>>}}, {"D4", \
{<<302>>}}, {"B5", {<<302>>}}, {"B4", {<<302>>}}, {"E1", {<<302>>}}, \
{"C10", {<<302>>}}, {"C6", {<<302>>}}, {"D10", {<<302>>}}, {"B7", \
{<<302>>}}, <<41>>}}

In[21]:= (res2 = combinedList @@ test) // Shallow // Timing

Out[21]= {0.932, {{"A1", {<<302>>}}, {"A10", {<<302>>}}, {"A2", \
{<<302>>}}, {"A3", {<<302>>}}, {"A4", {<<302>>}}, {"A5", {<<302>>}}, \
{"A6", {<<302>>}}, {"A7", {<<302>>}}, {"A8", {<<302>>}}, {"A9", \
{<<302>>}}, <<41>>}}

In[22]:= Sort@res1 === Sort@res2

Out[22]= True

Note that in my code I implicitly used the assumption that your functions
for derived values are Listable (which is true for most numerical/algebraic
operations).

Hope this helps.

Regards,
Leonid

DrMajorBob

unread,

Jan 5, 2010, 1:46:24 AM1/5/10

to

I think your output implies you had five initial lists, but you only
listed 3.

Still... if I understand you correctly, here's a solution:

Clear[teach]
SetAttributes[teach, HoldFirst]
teach[f_Symbol, list_List] :=
Module[{x},
First@Last@
Reap[(x = First@#; Head@f@x === f && (Sow@x; f[x] = Last@#)) & /@
list]
]

list1 = {{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}};
list2 = {{"A", 5}, {"B", 6}, {"D", 7}, {"E", 8}};

list3 = {{"A", 9}, {"B", 10}, {"C", 11}};
lists = {list1, list2, list3};
functions = {f1, f2, f3};
Scan[Clear, functions]
indices = Intersection @@ teach @@@ Transpose@{functions, lists};
Transpose@Outer[Compose, functions, indices];
{#, Through[functions@#]} & /@ indices

{{"A", {1, 5, 9}}, {"B", {2, 6, 10}}}

Now f1, f2, and f3 are functions that retrieve values from the lists.

There are other ways... but this method MAY be faster on large lists.

(And maybe not.)

Bobby

On Mon, 04 Jan 2010 04:59:52 -0600, Steve W. Brewer <st...@take5.org>
wrote:

--
DrMaj...@yahoo.com

Daniel Lichtblau

unread,

Jan 5, 2010, 1:46:59 AM1/5/10

to

One can do this with pedestrian procedural code. The idea is that after
culling out the indices of interest, only loop once over the entire list
of lists. Initialize a structure that has the correct shape of the
output. Keep a set of subindices to tell you which field in each entry
of the result is ready to be set. Set the fields in the iteration. When
done with that, compute the functions of each partial result, augment
those values to the partial results.

indexesToUse[lists_List] := Apply[Intersection,Map[#[[All,1]]&,lists]]

initValueTable[indices_,len_] :=
Table[{indices[[j]],ConstantArray[0,len]}, {j,Length[indices]}]

valueTables[lists_List,funcs_List] := Module[
{indices,valtab,subindices,ll,list,listi,indexset,ipos,vtj},
ll = Length[lists];
indices = indexesToUse[lists];
valtab = initValueTable[indices,ll+Length[funcs]];
subindices = ConstantArray[0,Length[indices]];
Do[indexset[indices[[j]]] = j, {j,Length[indices]}];
Do[
list = lists[[j]];
Do[
listi = list[[i]];
ipos = indexset[listi[[1]]];
If [!IntegerQ[ipos], Continue[]];
subindices[[ipos]] += 1;
valtab[[ipos,2,subindices[[ipos]]]] = listi[[2]];
,{i,Length[list]}];
,{j,ll}];
Clear[indexset];
Do[
vtj = Take[valtab[[j,2]],ll];
Do[
valtab[[j,2,ll+i]] = funcs[[i]][vtj];
,{i,Length[funcs]}];
,{j,Length[valtab]}];
valtab
]

Here is an example that uses 100 lists of (initially) 10^5 elements in
each. I do some preprocessing so that the first elements in eah pair are
unique to each list, yet likely to be found in all lists.

In[88]:= lists = Table[{RandomInteger[10^4],RandomReal[]}, {10^2}, {10^5}];
lists2 = Map[GatherBy[#,First]&, lists];
lists3 = Table[Map[{#[[1,1]],Total[#[[All,2]]]}&, lists2[[j]]],
{j,Length[lists2]}];

In[92]:= Timing[vt = valueTables[lists3,
{#[[1]]+#[[2]]&,#[[1]]+#[[3]]&}];]
Out[92]= {10.9563, Null}

Daniel Lichtblau
Wolfram Research

Bill Rowe

unread,

Jan 5, 2010, 1:48:21 AM1/5/10

to

>For example:

If you are using version 7, then what you want can be achieve
with a lot less code by doing:

In[7]:= list1 = {{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}};

list2 = {{"A", 5}, {"B", 6}, {"D", 7}, {"E", 8}};

list3 = {{"A", 9}, {"B", 10}, {"C", 11}};

In[10]:= {#[[1, 1]], #[[All, 2]]} & /@
GatherBy[Join[list1, list2, list3], First]

Out[10]= {{"A", {1, 5, 9}}, {"B", {2, 6, 10}}, {"C", {3,
11}}, {"D", {4, 7}}, {"E", {8}}}

I don't know how well this will scale to 100K + items per list.
My guess is this will perform better than what you indicated you
had tried.

Steve W. Brewer

unread,

Jan 6, 2010, 6:00:15 AM1/6/10

to

Thanks everyone who responded!

I ended up going with Daniel's solution because it turned out to be easiest
to adapt to my specific situation. However, I did need to make a couple of
changes to Daniel's code.

One of my derived-value functions (f2[]) uses the result of another
derived-value function (f1[]). I could do something like this ...

valueTables[ lists,
{ f1[ #[[1]], #[[2]] ] &,
f2[ f1[ #[[1]], #[[2]] ], #[[3]] ] & }]

... but then f1[] is evaluated twice for each sample. By changing

vtj = Take[valtab[[j,2]],ll];

to

vtj := valtab[[j,2]];

the complete result vector is available to my derived-value functions,
including results from previously computed functions. So now I can write

valueTables[ lists,
{ f1[ #[[1]], #[[2]] ] &,
f2[ #[[4]], #[[3]] ] & }]

where #[[4]] is the result from f1.

I also discovered that some of my data had a few duplicate indices, which
didn't bother my original code but messed up the recordkeeping in the
subindices list in this approach. I added a couple of lines of code to that
Do loop to skip over any duplicate indices encountered:

Do[
list = lists[[j]];

previpos = 0;

Do[
listi = list[[i]];
ipos = indexset[listi[[1]]];

If [!IntegerQ[ipos] || ipos <= previpos, Continue[]];

subindices[[ipos]] += 1;
valtab[[ipos,2,subindices[[ipos]]]] = listi[[2]];

previpos = ipos;

,{i,Length[list]}];
,{j,ll}];

Final result: My previous code took at least a couple of hours to run on my
machine. The new code takes 19.5 seconds!

Here's the complete code with my changes:

indexesToUse[lists_List] := Apply[Intersection,Map[#[[All,1]]&,lists]]

initValueTable[indices_,len_] :=
Table[{indices[[j]],ConstantArray[0,len]}, {j,Length[indices]}]

valueTables[lists_List,funcs_List] := Module[
{indices,valtab,subindices,ll,list,listi,indexset,ipos,vtj},
ll = Length[lists];
indices = indexesToUse[lists];
valtab = initValueTable[indices,ll+Length[funcs]];
subindices = ConstantArray[0,Length[indices]];
Do[indexset[indices[[j]]] = j, {j,Length[indices]}];
Do[
list = lists[[j]];

previpos = 0;

Do[
listi = list[[i]];
ipos = indexset[listi[[1]]];

If [!IntegerQ[ipos] || ipos <= previpos, Continue[]];

subindices[[ipos]] += 1;
valtab[[ipos,2,subindices[[ipos]]]] = listi[[2]];

previpos = ipos;

,{i,Length[list]}];
,{j,ll}];
Clear[indexset];
Do[

vtj := valtab[[j,2]];

Do[
valtab[[j,2,ll+i]] = funcs[[i]][vtj];
,{i,Length[funcs]}];
,{j,Length[valtab]}];
valtab
]

Steve W. Brewer