
Counting number of numbers in a large list between two values


Lyle

Dec 5, 2010, 9:56:42 PM
Dear Listers,

I have a large (5-20 million) one-dimensional list of real numbers, and
I want to count the number of entries in the list that lie between two
specific values (x1, x2). I need to run the function for a number of
different ranges.

i.e. number of list entries (l), where x1 <= l <= x2

I've tried:

tallydata[{x1_, x2_}] := Count[data, x_ /; x1 <= x <= x2]

that takes about 3-4 seconds

and

tallydata[{x1_, x2_}] := Length[Select[data, x1 <= # <= x2 &]]

which takes a little bit longer.

The best I've managed is the following (it might be off by 1 or 2, but
that doesn't really matter to me):

sorteddata = Sort[data];
nf = Nearest[sorteddata];
tallyrange[{x1_, x2_}] :=
First[Position[sorteddata, First[nf[x2]]]] -
First[Position[sorteddata, First[nf[x1]]]]

which takes between 1 and 2 seconds, but I was hoping there might be a
faster way to do this.

Any help would be great!

Thanks,
Lyle Gordon

Northwestern University

Leonid Shifrin

Dec 6, 2010, 6:13:30 AM
Hi Lyle,

As far as I can tell, both of your solutions are incorrect, besides being
suboptimal. The first one is correct only for an ordered list, in which
case you can use binary search to get a much faster version (a sketch is
below). I fail to see how the second one can be correct (except, again,
possibly for the case of an ordered original data list), since you look
at the distance in the sorted list and make no effort to relate that to
the distance between elements in the original list.
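
For illustration, here is a minimal sketch of such a binary-search count
over pre-sorted data (the helper names are made up for this sketch and
are not from the thread):

(* index of the first element >= x in a sorted list *)
lowerIndex[sorted_List, x_] := Module[{lo = 1, hi = Length[sorted], mid},
  While[lo <= hi,
   mid = Quotient[lo + hi, 2];
   If[sorted[[mid]] < x, lo = mid + 1, hi = mid - 1]];
  lo]

(* index of the first element > x in a sorted list *)
upperIndex[sorted_List, x_] := Module[{lo = 1, hi = Length[sorted], mid},
  While[lo <= hi,
   mid = Quotient[lo + hi, 2];
   If[sorted[[mid]] <= x, lo = mid + 1, hi = mid - 1]];
  lo]

(* number of entries l of the sorted list with x1 <= l <= x2 *)
countSorted[sorted_List, {x1_, x2_}] :=
 upperIndex[sorted, x2] - lowerIndex[sorted, x1]

Being uncompiled, this is slow per element, but each query only touches
O(log n) elements, so repeated queries against one sorted copy of the
data are cheap.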

Here is a different solution (based on a couple of functions posted a
while ago by Norbert Pozar), which uses SparseArray and is both general
and much faster:

extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] :=
 {u}[[4, 2, 2]]

positionExtr[x_List, n_] :=
 Flatten@extractPositionFromSparseArray[
   SparseArray[Unitize[x - n], Automatic, 1]]

getDistanceBetweenElements[list_, el1_, el2_] :=
 With[{p1 = positionExtr[list, el1], p2 = positionExtr[list, el2]},
  First@p2 - First@p1 /; Length[p1] == Length[p2] == 1];

It will only return the result if both elements are present in the list
exactly once, which is a condition for your problem to be well-formulated.
Here we test it on a list of 20 million random numbers:

tst = RandomReal[{1, 1000}, 20000000];

x1 = tst[[10000]];
x2 = tst[[10000000]];

In[30]:= getDistanceBetweenElements[tst, x1, x2] // Timing

Out[30]= {0.578, 9990000}

Hope this helps.

Regards,
Leonid

Andy

Dec 6, 2010, 6:14:02 AM
On 12/5/2010 8:57 PM, Lyle wrote:
> tallydata[{x1_, x2_}] := Length[Select[data, x1 <= # <= x2 &]]
UnitStep can be quite fast and saves you from having to sort.

In[12]:= data = RandomVariate[NormalDistribution[], 20000000];

tallydata2[data_, {x1_, x2_}] :=
Total[UnitStep[data - x1]*UnitStep[x2 - data]]

tallydata[data_, {x1_, x2_}] := Length[Select[data, x1 <= # <= x2 &]]

tallydata2[data, {2, 3}] // AbsoluteTiming

Out[14]= {1.4374356, 426159}

tallydata[data, {2, 3}] // AbsoluteTiming

Out[15]= {18.2491824, 426159}

Andy Ross
Wolfram Research

Fred Simons

Dec 6, 2010, 6:14:55 AM
On 6-12-2010 3:57, Lyle wrote:

> Dear Listers,
>
> I have a large (5-20 million) one-dimensional list of real numbers, and
> I want to count the number of entries in the list that lie between two
> specific values (x1, x2). I need to run the function for a number of
> different ranges.
>
> i.e. number of list entries (l), where x1 <= l <= x2
>
This seems to be a pretty fast solution:

In[28]:= between[lst_, {x1_, x2_}] :=
Total[Unitize[Clip[lst, {x1, x2}, {0, 0}]]]

In[29]:= lst = RandomReal[{-100, 100}, {10^7}];
between[lst, {-50, -40}] // Timing

Out[30]= {0.125, 500249}

Regards,

Fred Simons
Eindhoven University of Technology


Albert Retey

Dec 6, 2010, 6:15:17 AM
Hi,

I think this is a perfect use case for Compile:

ctallydata = Compile[{{data, _Real, 1}, {x1, _Real}, {x2, _Real}},
  Block[{r = 0},
   Do[If[x1 <= data[[i]] <= x2, r++], {i, Length[data]}]; r]
  ]

In version 8 you can achieve an additional speedup by setting
CompilationTarget -> "C" and RuntimeOptions -> "Speed"; see the sketch
below. With some more effort you might achieve a further speedup through
parallelization or by using the GPU, but I doubt that would be worth the
effort...
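
For concreteness, a sketch of the same compiled counter with those
version-8 options added (the symbol name ctallydataC is made up here;
CompilationTarget -> "C" requires a C compiler to be installed):

ctallydataC = Compile[{{data, _Real, 1}, {x1, _Real}, {x2, _Real}},
   Block[{r = 0},
    Do[If[x1 <= data[[i]] <= x2, r++], {i, Length[data]}]; r],
   CompilationTarget -> "C", RuntimeOptions -> "Speed"];

(* usage, e.g.: ctallydataC[data, 0.2, 0.3] *)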

hth,

albert

Ray Koopman

Dec 6, 2010, 6:15:06 AM

data = RandomReal[1., 2*^6]; min = .2; max = .3;

Timing@Length@Select[data, min <= # <= max &]
{8.76, 200028}

Timing@Count[data, x_ /; min <= x <= max]
{7.99, 200028}

Timing@Tr@UnitStep[(data - min)*(max - data)]
{0.61, 200028}

Carl K. Woll

Dec 6, 2010, 6:13:09 AM

Here's one possibility:

data = RandomReal[1, 10^7];

In[206]:= Total@Unitize@Clip[data, {.2, .3}, {0, 0}] // Timing

Out[206]= {0.14, 1000695}

Carl Woll
Wolfram Research

Kevin J. McCann

Dec 7, 2010, 6:45:46 AM
The fastest I got is

data = RandomReal[UniformDistribution[{0, 1}], 10000000];
xmin = 0.5;
xmax = 0.6;

Timing[Total[UnitStep[(data - xmin)*(xmax - data)]]]

{0.25, 1001405}

Kevin J. McCann

Dec 7, 2010, 6:45:57 AM
This is the fastest I got as well.

Kevin

P.S. I never knew about Clip or Unitize before. Thanks.

Lyle Gordon

Dec 7, 2010, 6:46:41 AM
Again, thanks to everyone who responded.

After some playing around, it seems that, since I need to run the code a
number of times on the same set of data and my ranges only have a limited
degree of precision, this code might be faster:

binwidth = 10^-5;
min = Floor[Min[data], binwidth];
max = Ceiling[Max[data], binwidth];
binneddata = BinCounts[data, {min, max, binwidth}];
tallydata4[{x1_, x2_}] :=
 Total[Take[binneddata, {(x1 - min)/binwidth + 1, (x2 - min)/binwidth}]]

Thanks,
Lyle

--
Lyle Gordon
Department of Materials Science and Engineering
Northwestern University

2220 Campus Drive
Evanston, IL 60208

Tel: (847) 491-3584
Mobile: (847) 400-4071
lgo...@u.northwestern.edu


Ray Koopman

Dec 7, 2010, 6:47:14 AM

Here are some oldies, plus improvements on them. Note that the fastest
two routines that use UnitStep will give wrong answers if the data
contain any values that equal max, and that the routines that use Clip
will give wrong answers if min <= 0 <= max and the data contain any
zeros. (A small demonstration of both failure modes follows the timings.)

data = RandomReal[1.,1*^7]; min = .2; max = .3;

Total@UnitStep[(data-min)*(max-data)] //Timing
{2.64,999208}

UnitStep[data-min].UnitStep[max-data] //Timing
{2.3,999208}

Total[ UnitStep[data-min]-UnitStep[data-max] ] //Timing
{2.23,999208}

Total@UnitStep[data-min] - Total@UnitStep[data-max] //Timing
{1.73,999208}

Total@Unitize@Clip[data,{min,max},{0,0}] //Timing
{0.91,999208}

SparseArray@Clip[data,{min,max},{0,0}] /.
SparseArray[_,_,_,d_] :> d[[2,1,-1]] //Timing
{0.77,999208}
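
For illustration, a tiny example of both failure modes described above
(toy data made up for this sketch; the exact answer is 3):

toy = {-1., 0., .25, .3, .7}; min = -.5; max = .3;

Count[toy, x_ /; min <= x <= max]                  (* 3, the correct count *)
Total@UnitStep[toy-min] - Total@UnitStep[toy-max]  (* 2: misses the entry equal to max *)
Total@Unitize@Clip[toy,{min,max},{0,0}]            (* 2: misses the zero inside the range *)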

truxton spangler

Dec 7, 2010, 6:47:24 AM


Why is it that a function called Count is not the fastest way to count,
a function called BinCounts is not the fastest way to count bins, a
function called Sort is rarely the fastest way to sort, a function
called Select is rarely the fastest way to select, and so on? When
"named" built-in functions are rarely the best programming option, this
seems like a flaw in the design philosophy of the software, and it is
certainly an added barrier to Mathematica becoming an intuitive program
to learn.

While I am on this topic, the date and time handling functions in
Mathematica, even in version 8, are so slow as to be useless for real
work. This is another example of named functions not being the best
programming option. Ditto the time value of money functions. It is very
easy to get two orders of magnitude of speed improvement by writing your
own code, which calls into question the point of developing these sorts
of built-in functions in the form in which they currently exist.

Leonid Shifrin

Dec 7, 2010, 6:43:24 AM
Hi Lyle,

Sorry - I misunderstood the problem. I thought you meant the positional
arrangement of the numbers. Since you seem to be getting plenty of
good answers, I will not attempt to correct myself with more code.

Regards,
Leonid

Lyle Gordon

Dec 7, 2010, 6:45:24 AM
Thanks everyone for your responses, it seems like Carl's and Fred's method
with Total-Unitize-Clip is the fastest.

Thanks very much,
Lyle

--
Lyle Gordon
Department of Materials Science and Engineering
Northwestern University

2220 Campus Drive
Evanston, IL 60208



Daniel Lichtblau

Dec 8, 2010, 6:40:12 AM
truxton spangler wrote:
> [...]
>
> Why is it that a function called Count is not the fastest way to count,
> a function called BinCounts is not the fastest way to count bins, a
> function called Sort is rarely the fastest way to sort, a function
> called Select is rarely the fastest way to select, and so on? When
> "named" built-in functions are rarely the best programming option, this
> seems like a flaw in the design philosophy of the software, and it is
> certainly an added barrier to Mathematica becoming an intuitive program
> to learn.
>
> While I am on this topic, the date and time handling functions in
> Mathematica, even in version 8, are so slow as to be useless for real
> work. This is another example of named functions not being the best
> programming option. Ditto the time value of money functions. It is very
> easy to get two orders of magnitude of speed improvement by writing your
> own code, which calls into question the point of developing these sorts
> of built-in functions in the form in which they currently exist.

I have to punt on the second paragraph issues. Others know much more
about that stuff.

For the first, let me point out that most of the functions you claim to
be slow are meant to work in very general ways. Some work with pattern
matching and/or function predicates. These involve invocations of the
full Mathematica evaluation sequence. Every so often we recognize
special cases that need to be faster, and optimize for those cases.

Functions like Clip and Unitize work on vectors of reals, and I think
were designed with packed arrays in mind. They are thus more
specialized, and faster. But you cannot get them to do the things you
might do with e.g. Select or Count.
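
As an illustration of the packed-array point (an example constructed
here with the Developer` utilities, not taken from the post; the results
are what I would expect from a version 8-era kernel):

data = RandomReal[1, 10^6];
Developer`PackedArrayQ[data]                                  (* True *)
Developer`PackedArrayQ[Unitize@Clip[data, {.2, .3}, {0, 0}]]  (* True: stays packed *)
Developer`PackedArrayQ[Select[data, .2 <= # <= .3 &]]         (* False: Select unpacks *)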

Sort tends to be fast, by the way, if using default ordering. If you
have an example that shows otherwise, we (at WRI) would want to see it.

Daniel Lichtblau
Wolfram Research


truxton spangler

Dec 9, 2010, 6:01:33 AM

For example, Carl Woll has contributed many examples of ways to program
things that work orders of magnitude faster than the built-in functions.
Why wouldn't these algorithms be used internally within the built-in
functions? Why don't these bits of code appear in the documentation
under the relevant function, so people can easily figure out the fastest
way to do something?

I've got to tell you that one of the reasons you see 100 competitors for
every Mathematica user is that it is perceived as slow and suitable only
for prototyping, not for production. Long-term users know Mathematica
can be fast -- often by ignoring built-in functionality -- but some of
these (IMO) design oversights do not help make the case, especially when
other things about the product seem so random (example: it was more
important to develop a Renko chart in version 8 than a two-axis plot!?).

0 new messages