I've worked up an alternative implementation which runs pretty
quickly, and works by partitioning component-wise, then finding
intersections of each bucket. I thought it might be worth posting to
see if anyone has encountered this issue and knows more about why it
happens...
Example implementation:
(* generate some large lists of reals *)
{x,y,z} = RandomReal[{0,1}, {3,10^6}];
xyz = Transpose[{x, y, z}];
(* define bin intervals *)
int = Range[0,1,0.1];
(* calculate buckets for each variable *)
{xBins, yBins, zBins} = Table[Flatten@Position[#, _?(int[[i]] <= # <
int[[i+1]] &)], {i, Length[int]-1}] & /@ {x, y, z};
(* get 3x3 grid of bins by taking intersections of the single-variable
bins *)
xyzBins = Table[Intersection[xBins[[i]], yBins[[j]], zBins[[k]]],
{i,Length@xBins}, {j,Length@yBins}, {k,Length@zBins}];
(* look at the length of each bins *)
Map[Length, xyzBins, {3}]
(* get the actual points in each bin *)
binContents = Map[xyz[[#]] &, xyzBins, {3}];
(* look at points from a particular bin *)
ListPlot[Transpose[xyz[[xyzBins[[1,2,3]]]]]]
(* the equivalent call to BinLists runs out of memory on my machine
(32-bit windows) *)
BinLists[xyz, {int}, {int}, {int}];
I'm not sure why BinLists doesn't use an approach like this. My
suspicion is that it may be using something like (using above
definitions):
xyzBins = Outer[Intersection, xBins, yBins, zBins];
the problem being that the Outer function will effectively cause
multiple copies of each list to be generated and thus runs out of
memory.
After encoutering similar issues, I had to roll my own. Effectively,
it's an efficient GatherBy with "replacement" (to allow for
overlapping bins) and an optional reduction immediately after each bin
production. It incrementally consumes a single local copy of the
original data, to avoid duplicating that original more than once. I
can get similar results with careful use of Nearest (thanks to WRI
technical support), but find that uselessly slow and inefficient for
my actual data volumes.
I hope WRI offers something similar in the future (a hyper-efficient
GatherBy with replacement).
Vince Virgilio
Thanks for the suggestion. We will need to take a closer look at this
for a future version.
As an important side note, when bins have regular spacing, it will
always be faster to use the {min,max,dx} spec than the {{c1,c2,...}}
cutoff spec because the dx spec can take advantage of the knowledge that
the bins are regularly spaced. In this particular example,
BinLists[xyz, {0, 1, 0.1}, {0, 1, 0.1}, {0, 1, 0.1}]
is about 11 times faster than the suggested code.
Darren Glosemeyer
Wolfram Research
This too is about 11x faster. It goes directly to the trivariate bins
and takes advantage of the bin structure. If space is a problem it
doesn't need xyz; just change xyz[[L]] to {x[[L],y[[L]],z[[L]]}.
(* generate some large lists of reals *)
{x,y,z} = RandomReal[{0,1}, {3, n = 10^6}];
xyz = Transpose[{x,y,z}];
(* define bin intervals *)
m = 10; int = N@Range[0,1,1/m];
(* calculate buckets for each variable *)
Timing[{xBins, yBins, zBins} = Table[Flatten@Position[#,
_?(int[[i]] <= # < int[[i+1]] &)], {i,m}] & /@ {x,y,z};]
{229.94 Second, Null}
(* get 3x3 grid of bins by taking intersections
of the single-variable bins *)
Timing[xyzBins = Table[Intersection[xBins[[i]],yBins[[j]],zBins[[k]]],
{i,Length@xBins},{j,Length@yBins},{k,Length@zBins}];]
{113.43 Second, Null}
Timing[c = Table[{},{m},{m},{m}];
Do[h = Sequence@@Ceiling[m*xyz[[L]]];
c[[h]] = {c[[h]],L}, {L,n}];
c = Map[Flatten,c,{3}]/.{}->Sequence[];]
{30.3 Second, Null}
c === xyzBins
True