[Numpy-discussion] New functions.

55 views
Skip to first unread message

Charles R Harris

unread,
May 31, 2011, 9:08:18 PM5/31/11
to numpy-discussion
Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values.
    This is easy to do with heapsort and almost as easy with mergesort.

2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.

3) Fast medians.


Chuck

Robert Kern

unread,
May 31, 2011, 9:15:51 PM5/31/11
to Discussion of Numerical Python

+3

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Warren Weckesser

unread,
May 31, 2011, 9:18:28 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values.
    This is easy to do with heapsort and almost as easy with mergesort.



While you're at, how about a function that finds both the max and min in one pass?  (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)


2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.

3) Fast medians.


Chuck

Charles R Harris

unread,
May 31, 2011, 9:26:51 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 7:18 PM, Warren Weckesser <warren.w...@enthought.com> wrote:


On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values.
    This is easy to do with heapsort and almost as easy with mergesort.



While you're at, how about a function that finds both the max and min in one pass?  (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)


What should it be called? minmax?

This also brings suggests a function that returns both mean and standard deviation, something that could also be implemented using a more stable algorithm than the current one.
 

2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.

3) Fast medians.



Other suggestions are welcome. Most of these are of the low hanging fruit variety and shouldn't be too much work.

Chuck

Benjamin Root

unread,
May 31, 2011, 9:31:10 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 8:18 PM, Warren Weckesser <warren.w...@enthought.com> wrote:


On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values.
    This is easy to do with heapsort and almost as easy with mergesort.



While you're at, how about a function that finds both the max and min in one pass?  (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)



+1 from myself and probably just about anybody in matplotlib.  If both the maxs and mins are searched during the same run through an array, I would imagine that would result in a noticeable speedup with automatic range finding.

Ben Root

David

unread,
May 31, 2011, 9:33:24 PM5/31/11
to Discussion of Numerical Python

+1 for fast median as well, and more generally fast "linear" (O(kN))
order statistics would be nice.

cheers,

David

Charles R Harris

unread,
May 31, 2011, 9:34:55 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 7:33 PM, David <da...@silveregg.co.jp> wrote:
On 06/01/2011 10:08 AM, Charles R Harris wrote:
> Hi All,
>
> I've been contemplating new functions that could be added to numpy and
> thought I'd run them by folks to see if there is any interest.
>
> 1) Modified sort/argsort functions that return the maximum k values.
>      This is easy to do with heapsort and almost as easy with mergesort.
>
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a
> faster version of nansum possible.
>
> 3) Fast medians.

+1 for fast median as well, and more generally fast "linear" (O(kN))
order statistics would be nice.


OK, noob question. What are order statistics?

Chuck

Skipper Seabold

unread,
May 31, 2011, 9:36:35 PM5/31/11
to Discussion of Numerical Python

I don't know if it's one pass off the top of my head, but I've used
percentile for interpercentile ranges.

[docs]
[1]: X = np.random.random(1000)

[docs]
[2]: np.percentile(X,[0,100])
[2]: [0.00016535235312509222, 0.99961513543316571]

[docs]
[3]: X.min(),X.max()
[3]: (0.00016535235312509222, 0.99961513543316571)

Skipper

David

unread,
May 31, 2011, 9:39:36 PM5/31/11
to Discussion of Numerical Python

In statistics, order statistics are statistics based on sorted samples,
median, min and max being the most common:

http://en.wikipedia.org/wiki/Order_statistic

Concretely here, I meant a fast way to compute any rank of a given data
set, e.g. with the select algorithm. I wanted to do that for some time,
but never took the time for it,

Benjamin Root

unread,
May 31, 2011, 9:48:54 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 8:26 PM, Charles R Harris <charles...@gmail.com> wrote:


On Tue, May 31, 2011 at 7:18 PM, Warren Weckesser <warren.w...@enthought.com> wrote:


On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,

I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.

1) Modified sort/argsort functions that return the maximum k values.
    This is easy to do with heapsort and almost as easy with mergesort.



While you're at, how about a function that finds both the max and min in one pass?  (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)


What should it be called? minmax?

That's what it is called in IDL.  Don't know of any other tool that does anything similar.
 

This also brings suggests a function that returns both mean and standard deviation, something that could also be implemented using a more stable algorithm than the current one.

Yeah, it does.  An implementation I did once for a project of mine in Matlab was to have a function that took an index that indicated that I wanted first/second/third/etc - order moments and it used that to figure out that it needed to save the sum, sum of squares, sum of cubes, etc. in order to calculate the mean, std, kurtosis, and so on.  Just a suggestion on how to implement a useful function for stats.  Of course, there are all sorts of issues with summation here, but my arrays were small enough to not worry in that project.

Ben Root

Warren Weckesser

unread,
May 31, 2011, 9:53:01 PM5/31/11
to Discussion of Numerical Python


percentile() isn't one pass; using percentile like that is much slower:

In [25]: %timeit np.percentile(X,[0,100])
10000 loops, best of 3: 103 us per loop

In [26]: %timeit X.min(),X.max()
100000 loops, best of 3: 11.8 us per loop


Warren

 

Skipper Seabold

unread,
May 31, 2011, 10:00:25 PM5/31/11
to Discussion of Numerical Python

Probably should've checked that before opening my mouth. Never
actually used it for a minmax, but it is faster than two calls to
scipy.stats.scoreatpercentile. Guess I'm +1 to fast order statistics.

Charles R Harris

unread,
May 31, 2011, 10:26:50 PM5/31/11
to Discussion of Numerical Python
On Tue, May 31, 2011 at 8:00 PM, Skipper Seabold <jsse...@gmail.com> wrote:
On Tue, May 31, 2011 at 9:53 PM, Warren Weckesser
> On Tue, May 31, 2011 at 8:36 PM, Skipper Seabold <jsse...@gmail.com>
> wrote:
>> I don't know if it's one pass off the top of my head, but I've used
>> percentile for interpercentile ranges.
>>
>> [docs]
>> [1]: X = np.random.random(1000)
>>
>> [docs]
>> [2]: np.percentile(X,[0,100])
>> [2]: [0.00016535235312509222, 0.99961513543316571]
>>
>> [docs]
>> [3]: X.min(),X.max()
>> [3]: (0.00016535235312509222, 0.99961513543316571)
>>
>
>
> percentile() isn't one pass; using percentile like that is much slower:
>
> In [25]: %timeit np.percentile(X,[0,100])
> 10000 loops, best of 3: 103 us per loop
>
> In [26]: %timeit X.min(),X.max()
> 100000 loops, best of 3: 11.8 us per loop
>

Probably should've checked that before opening my mouth. Never
actually used it for a minmax, but it is faster than two calls to
scipy.stats.scoreatpercentile. Guess I'm +1 to fast order statistics.


So far the biggest interest seems to be in order statistics of various sorts, so to speak.

Order Statistics

minmax
median
k'th element
largest/smallest k elements

Other Statistics

mean/std

Nan functions

nanadd

Chuck

Bruce Southey

unread,
May 31, 2011, 10:50:48 PM5/31/11
to Discussion of Numerical Python
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Di...@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>

How about including all or some of Keith's Bottleneck package?
He has tried to include some of the discussed functions and tried to
make them very fast.

Also, this Wikipedia "Algorithms for calculating variance"
(http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) has
some basic info on calculating the variance as well as higher order
moments.However, there are probably more efficient algorithms
available.

Bruce

Charles R Harris

unread,
May 31, 2011, 11:41:46 PM5/31/11
to Discussion of Numerical Python

I don't think they are sufficiently general as they are limited to 2 dimensions. However, I think the moving filters should go into scipy, either in ndimage or maybe signals. Some of the others we can still speed of significantly, for instance nanmedian, by using the new functionality in numpy, i.e., numpy sort has worked with nans for a while now. It looks like call overhead dominates the nanmax times for small arrays and this might improve if the ufunc machinery is cleaned up a bit more, I don't know how far Mark got with that.

One bit of infrastructure that could be generally helpful is low-level support for masked arrays, but that is a larger topic.

Chuck

Keith Goodman

unread,
Jun 1, 2011, 11:17:55 AM6/1/11
to Discussion of Numerical Python

Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else
falls back to a slower, non-cython version of the function. The same
goes for int32, int64, float32, float64.

It should not be difficult to extend to higher nd and more dtypes
since everything is generated from template. The problem is that there
would be a LOT of cython auto-generated C code since there is a
separate function for each ndim, dtype, axis combination.

Each of the ndim, dtype, axis functions currently has its own copy of
the algorithm (such as median). Pulling that out and reusing it should
save a lot of trees by reducing the auto-generated C code size.

I recently added a partsort and argpartsort.

Mark Miller

unread,
Jun 1, 2011, 11:31:55 AM6/1/11
to Discussion of Numerical Python
I'd love to see something like a "count_unique" function included. The
numpy.unique function is handy, but it can be a little awkward to
efficiently go back and get counts of each unique value after the
fact.

-Mark

Craig Yoshioka

unread,
Jun 1, 2011, 11:44:20 AM6/1/11
to Discussion of Numerical Python
would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. That way

np.mean(np.zeros([4000,4000],'f4')+500)

would not equal 511.493408?

`

Neal Becker

unread,
Jun 1, 2011, 11:59:52 AM6/1/11
to numpy-di...@scipy.org
Short-circuiting find would be nice. Right now, to 'find' something you first
make a bool array, then iterate over it. If all you want is the first index
where x[i] = e, not very efficient.

What I just described is a find with a '==' predicate. Not sure if it's
worthwhile to consider other predicates.

Maybe call it 'find_first'

Robert Kern

unread,
Jun 1, 2011, 12:01:37 PM6/1/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka <cra...@me.com> wrote:
> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation.  That way
>
> np.mean(np.zeros([4000,4000],'f4')+500)
>
> would not equal 511.493408?

Yes, I object. You can set the accumulator dtype explicitly if you
need it: np.mean(arr, dtype=np.float64)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

Bruce Southey

unread,
Jun 1, 2011, 12:11:06 PM6/1/11
to numpy-di...@scipy.org
On 06/01/2011 11:01 AM, Robert Kern wrote:
> On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka<cra...@me.com> wrote:
>> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation. That way
>>
>> np.mean(np.zeros([4000,4000],'f4')+500)
>>
>> would not equal 511.493408?
> Yes, I object. You can set the accumulator dtype explicitly if you
> need it: np.mean(arr, dtype=np.float64)
>
Sure but fails to address that the output dtype of mean in this case is
np.float64 which one would naively assume is also np.float64:
>>> np.mean(np.zeros([4000,4000],'f4')+500).dtype
dtype('float64')

Thus, we have:
Tickets 465 and 518
http://projects.scipy.org/numpy/ticket/465
http://projects.scipy.org/numpy/ticket/518

Various threads as well such as:
http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html

Bruce

Robert Kern

unread,
Jun 1, 2011, 12:25:59 PM6/1/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 11:11, Bruce Southey <bsou...@gmail.com> wrote:
> On 06/01/2011 11:01 AM, Robert Kern wrote:
>> On Wed, Jun 1, 2011 at 10:44, Craig Yoshioka<cra...@me.com>  wrote:
>>> would anyone object to fixing the numpy mean and stdv functions, so that they always used a 64-bit value to track sums, or so that they used a running calculation.  That way
>>>
>>> np.mean(np.zeros([4000,4000],'f4')+500)
>>>
>>> would not equal 511.493408?
>> Yes, I object. You can set the accumulator dtype explicitly if you
>> need it: np.mean(arr, dtype=np.float64)
>>
> Sure but fails to address that the output dtype of mean in this case is
> np.float64 which one would naively assume is also np.float64:
>  >>> np.mean(np.zeros([4000,4000],'f4')+500).dtype
> dtype('float64')

Of course my statement fails to address that because that wasn't what
I was responding to.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

Craig Yoshioka

unread,
Jun 1, 2011, 12:29:50 PM6/1/11
to Discussion of Numerical Python
yes, and its probably slower to boot.  A quick benchmark on my computer shows that:

a = np.zeros([4000,4000],'f4')+500

np.mean(a)
takes 0.02 secs

np.mean(a,dtype=np.float64)
takes 0.1 secs

np.mean(a.astype(np.float64))
takes 0.06 secs

so casting the whole array is almost 40% faster than setting the accumulator type!, I would imagine having the type-casting being done on the fly during the mean computation would be even faster. 

Skipper Seabold

unread,
Jun 1, 2011, 12:32:09 PM6/1/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 11:31 AM, Mark Miller <markper...@gmail.com> wrote:
> I'd love to see something like a "count_unique" function included. The
> numpy.unique function is handy, but it can be a little awkward to
> efficiently go back and get counts of each unique value after the
> fact.
>

Does bincount do what you're after?

[~/]
[1]: x = np.random.randint(1,6,5)

[~/]
[2]: x
[2]: array([1, 3, 4, 5, 1])

[~/]
[3]: np.bincount(x)
[3]: array([0, 2, 0, 1, 1, 1])

Skipper

Mark Miller

unread,
Jun 1, 2011, 12:49:16 PM6/1/11
to Discussion of Numerical Python
Not quite. Bincount is fine if you have a set of approximately
sequential numbers. But if you don't....

>>> a = numpy.array((1,500,1000))
>>> a
array([ 1, 500, 1000])
>>> b = numpy.bincount(a)
>>> b
array([0, 1, 0, ..., 0, 0, 1])
>>> len(b)
1001

-Mark

Christopher Barker

unread,
Jun 1, 2011, 1:30:20 PM6/1/11
to Discussion of Numerical Python
On 5/31/11 6:08 PM, Charles R Harris wrote:
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition.

so:

In [53]: a
Out[53]: array([ 1., 2., nan])

In [54]: b
Out[54]: array([0, 1, 2])

In [55]: a + b
Out[55]: array([ 1., 3., nan])

and nanadd(a,b) would yield:

array([ 1., 3., 2.)

I don't see how that is particularly useful, at least not any more
useful that nanprod, nandiv, etc, etc...

What am I missing?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Robert Kern

unread,
Jun 1, 2011, 1:36:30 PM6/1/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 12:30, Christopher Barker <Chris....@noaa.gov> wrote:
> On 5/31/11 6:08 PM, Charles R Harris wrote:
>> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition.
>
> so:
>
> In [53]: a
> Out[53]: array([  1.,   2.,  nan])
>
> In [54]: b
> Out[54]: array([0, 1, 2])
>
> In [55]: a + b
> Out[55]: array([  1.,   3.,  nan])
>
> and nanadd(a,b) would yield:
>
> array([  1.,   3.,  2.)
>
> I don't see how that is particularly useful, at least not any more
> useful that nanprod, nandiv, etc, etc...
>
> What am I missing?

It's mostly useful for its .reduce() and .accumulate() form.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

josef...@gmail.com

unread,
Jun 1, 2011, 3:09:36 PM6/1/11
to Discussion of Numerical Python
My favorite missing extension to numpy functions

np.bincount with 2 (or more) dimensional weights for fast calculation
of group statistics.

Josef

David Cournapeau

unread,
Jun 1, 2011, 9:35:45 PM6/1/11
to Discussion of Numerical Python
On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller <markper...@gmail.com> wrote:
> Not quite. Bincount is fine if you have a set of approximately
> sequential numbers. But if you don't....

Even worse, it fails miserably if you sequential numbers but with a high shift.

np.bincount([100000001, 100000002]) # will take a lof of memory

Doing bincount with dict is faster in those cases.

cheers,

David

Alan G Isaac

unread,
Jun 1, 2011, 10:10:02 PM6/1/11
to Discussion of Numerical Python
> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller<markper...@gmail.com> wrote:
>> Not quite. Bincount is fine if you have a set of approximately
>> sequential numbers. But if you don't....


On 6/1/2011 9:35 PM, David Cournapeau wrote:
> Even worse, it fails miserably if you sequential numbers but with a high shift.
> np.bincount([100000001, 100000002]) # will take a lof of memory
> Doing bincount with dict is faster in those cases.


Since this discussion has turned shortcomings of bincount,
may I ask why np.bincount([]) is not an empty array?
Even more puzzling, why is np.bincount([],minlength=6)
not a 6-array of zeros?

Use case: bincount of infected individuals by number of contacts.
(In some periods there may be no infections.)

Thank you,
Alan Isaac

PS A collections.Counter works pretty nice for Mark and David's cases,
aside from the fact that the keys are not sorted.

Travis Oliphant

unread,
Jun 1, 2011, 11:06:19 PM6/1/11
to Discussion of Numerical Python

On May 31, 2011, at 8:08 PM, Charles R Harris wrote:

> Hi All,
>
> I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
>
> 1) Modified sort/argsort functions that return the maximum k values.
> This is easy to do with heapsort and almost as easy with mergesort.

+1

>
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.

+0 --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. This needs more discussion, but useful to mention in this context.

>
> 3) Fast medians.

Definitely order-statistics would be the right thing to handle generally.

-Travis

>
>
> Chuck


> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Di...@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

---
Travis Oliphant
Enthought, Inc.
olip...@enthought.com
1-512-536-1057
http://www.enthought.com

josef...@gmail.com

unread,
Jun 2, 2011, 4:42:40 AM6/2/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 9:35 PM, David Cournapeau <cour...@gmail.com> wrote:
> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller <markper...@gmail.com> wrote:
>> Not quite. Bincount is fine if you have a set of approximately
>> sequential numbers. But if you don't....
>
> Even worse, it fails miserably if you sequential numbers but with a high shift.
>
> np.bincount([100000001, 100000002]) # will take a lof of memory
>
> Doing bincount with dict is faster in those cases.

same with negative numbers, but in these cases I just subtract the min
and we are back to the fast bincount case

cheers,

Josef

David Cournapeau

unread,
Jun 2, 2011, 5:17:54 AM6/2/11
to Discussion of Numerical Python
On Thu, Jun 2, 2011 at 5:42 PM, <josef...@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 9:35 PM, David Cournapeau <cour...@gmail.com> wrote:
>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller <markper...@gmail.com> wrote:
>>> Not quite. Bincount is fine if you have a set of approximately
>>> sequential numbers. But if you don't....
>>
>> Even worse, it fails miserably if you sequential numbers but with a high shift.
>>
>> np.bincount([100000001, 100000002]) # will take a lof of memory
>>
>> Doing bincount with dict is faster in those cases.
>
> same with negative numbers, but in these cases I just subtract the min
> and we are back to the fast bincount case

Indeed, you can also deal with large numbers which are not consecutive
by using a lookup-table. All those methods are quite error-prone in
general, and reallly, there is no reason why bincount could not handle
the general case,

Skipper Seabold

unread,
Jun 2, 2011, 1:08:32 PM6/2/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac <alan....@gmail.com> wrote:
>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller<markper...@gmail.com>  wrote:
>>> Not quite. Bincount is fine if you have a set of approximately
>>> sequential numbers. But if you don't....
>
>
> On 6/1/2011 9:35 PM, David Cournapeau wrote:
>> Even worse, it fails miserably if you sequential numbers but with a high shift.
>> np.bincount([100000001, 100000002]) # will take a lof of memory
>> Doing bincount with dict is faster in those cases.
>
>
> Since this discussion has turned shortcomings of bincount,
> may I ask why np.bincount([]) is not an empty array?
> Even more puzzling, why is np.bincount([],minlength=6)
> not a 6-array of zeros?
>

Just looks like it wasn't coded that way, but it's low-hanging fruit.
Any objections to adding this behavior? This commit should take care
of it. Tests pass. Comments welcome, as I'm just getting my feet wet
here.

https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5

Skipper

Robert Kern

unread,
Jun 2, 2011, 1:11:48 PM6/2/11
to Discussion of Numerical Python
On Thu, Jun 2, 2011 at 12:08, Skipper Seabold <jsse...@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac <alan....@gmail.com> wrote:
>>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller<markper...@gmail.com>  wrote:
>>>> Not quite. Bincount is fine if you have a set of approximately
>>>> sequential numbers. But if you don't....
>>
>>
>> On 6/1/2011 9:35 PM, David Cournapeau wrote:
>>> Even worse, it fails miserably if you sequential numbers but with a high shift.
>>> np.bincount([100000001, 100000002]) # will take a lof of memory
>>> Doing bincount with dict is faster in those cases.
>>
>>
>> Since this discussion has turned shortcomings of bincount,
>> may I ask why np.bincount([]) is not an empty array?
>> Even more puzzling, why is np.bincount([],minlength=6)
>> not a 6-array of zeros?
>>
>
> Just looks like it wasn't coded that way, but it's low-hanging fruit.
> Any objections to adding this behavior? This commit should take care
> of it. Tests pass. Comments welcome, as I'm just getting my feet wet
> here.
>
> https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5

I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but
otherwise, it looks good.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

Skipper Seabold

unread,
Jun 2, 2011, 1:39:08 PM6/2/11
to Discussion of Numerical Python
On Thu, Jun 2, 2011 at 1:11 PM, Robert Kern <rober...@gmail.com> wrote:
> On Thu, Jun 2, 2011 at 12:08, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Wed, Jun 1, 2011 at 10:10 PM, Alan G Isaac <alan....@gmail.com> wrote:
>>>> On Thu, Jun 2, 2011 at 1:49 AM, Mark Miller<markper...@gmail.com>  wrote:
>>>>> Not quite. Bincount is fine if you have a set of approximately
>>>>> sequential numbers. But if you don't....
>>>
>>>
>>> On 6/1/2011 9:35 PM, David Cournapeau wrote:
>>>> Even worse, it fails miserably if you sequential numbers but with a high shift.
>>>> np.bincount([100000001, 100000002]) # will take a lof of memory
>>>> Doing bincount with dict is faster in those cases.
>>>
>>>
>>> Since this discussion has turned shortcomings of bincount,
>>> may I ask why np.bincount([]) is not an empty array?
>>> Even more puzzling, why is np.bincount([],minlength=6)
>>> not a 6-array of zeros?
>>>
>>
>> Just looks like it wasn't coded that way, but it's low-hanging fruit.
>> Any objections to adding this behavior? This commit should take care
>> of it. Tests pass. Comments welcome, as I'm just getting my feet wet
>> here.
>>
>> https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5
>
> I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but
> otherwise, it looks good.
>

Ok, thanks. Made the change and removed the old test that it fails on
empty. Pull request.

https://github.com/numpy/numpy/pull/84

Skipper

Robert Kern

unread,
Jun 2, 2011, 5:40:07 PM6/2/11
to Discussion of Numerical Python
On Wed, Jun 1, 2011 at 22:06, Travis Oliphant <olip...@enthought.com> wrote:
>
> On May 31, 2011, at 8:08 PM, Charles R Harris wrote:

>> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
>
> +0  --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling.   This needs more discussion, but useful to mention in this context.

Actually, these are completely orthogonal to Chuck's proposal. The
NA-enabled dtypes (*not* NaN-enabled dtypes) would have (x + NA) ==
NA, just like R. fadd() would be useful for other things.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

Reply all
Reply to author
Forward
0 new messages