+3
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi All,
I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
1) Modified sort/argsort functions that return the maximum k values.
This is easy to do with heapsort and almost as easy with mergesort.
2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
3) Fast medians.
Chuck
On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,
I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
1) Modified sort/argsort functions that return the maximum k values.
This is easy to do with heapsort and almost as easy with mergesort.
While you're at, how about a function that finds both the max and min in one pass? (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)
2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
3) Fast medians.
On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,
I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
1) Modified sort/argsort functions that return the maximum k values.
This is easy to do with heapsort and almost as easy with mergesort.
While you're at, how about a function that finds both the max and min in one pass? (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)
+1 for fast median as well, and more generally fast "linear" (O(kN))
order statistics would be nice.
cheers,
David
+1 for fast median as well, and more generally fast "linear" (O(kN))On 06/01/2011 10:08 AM, Charles R Harris wrote:
> Hi All,
>
> I've been contemplating new functions that could be added to numpy and
> thought I'd run them by folks to see if there is any interest.
>
> 1) Modified sort/argsort functions that return the maximum k values.
> This is easy to do with heapsort and almost as easy with mergesort.
>
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a
> faster version of nansum possible.
>
> 3) Fast medians.
order statistics would be nice.
I don't know if it's one pass off the top of my head, but I've used
percentile for interpercentile ranges.
[docs]
[1]: X = np.random.random(1000)
[docs]
[2]: np.percentile(X,[0,100])
[2]: [0.00016535235312509222, 0.99961513543316571]
[docs]
[3]: X.min(),X.max()
[3]: (0.00016535235312509222, 0.99961513543316571)
Skipper
In statistics, order statistics are statistics based on sorted samples,
median, min and max being the most common:
http://en.wikipedia.org/wiki/Order_statistic
Concretely here, I meant a fast way to compute any rank of a given data
set, e.g. with the select algorithm. I wanted to do that for some time,
but never took the time for it,
On Tue, May 31, 2011 at 7:18 PM, Warren Weckesser <warren.w...@enthought.com> wrote:
On Tue, May 31, 2011 at 8:08 PM, Charles R Harris <charles...@gmail.com> wrote:
Hi All,
I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
1) Modified sort/argsort functions that return the maximum k values.
This is easy to do with heapsort and almost as easy with mergesort.
While you're at, how about a function that finds both the max and min in one pass? (Mentioned previously in this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-June/051072.html)
What should it be called? minmax?
This also brings suggests a function that returns both mean and standard deviation, something that could also be implemented using a more stable algorithm than the current one.
Probably should've checked that before opening my mouth. Never
actually used it for a minmax, but it is faster than two calls to
scipy.stats.scoreatpercentile. Guess I'm +1 to fast order statistics.
On Tue, May 31, 2011 at 9:53 PM, Warren Weckesser
>> I don't know if it's one pass off the top of my head, but I've usedProbably should've checked that before opening my mouth. Never
>> percentile for interpercentile ranges.
>>
>> [docs]
>> [1]: X = np.random.random(1000)
>>
>> [docs]
>> [2]: np.percentile(X,[0,100])
>> [2]: [0.00016535235312509222, 0.99961513543316571]
>>
>> [docs]
>> [3]: X.min(),X.max()
>> [3]: (0.00016535235312509222, 0.99961513543316571)
>>
>
>
> percentile() isn't one pass; using percentile like that is much slower:
>
> In [25]: %timeit np.percentile(X,[0,100])
> 10000 loops, best of 3: 103 us per loop
>
> In [26]: %timeit X.min(),X.max()
> 100000 loops, best of 3: 11.8 us per loop
>
actually used it for a minmax, but it is faster than two calls to
scipy.stats.scoreatpercentile. Guess I'm +1 to fast order statistics.
How about including all or some of Keith's Bottleneck package?
He has tried to include some of the discussed functions and tried to
make them very fast.
Also, this Wikipedia "Algorithms for calculating variance"
(http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) has
some basic info on calculating the variance as well as higher order
moments.However, there are probably more efficient algorithms
available.
Bruce
Currently Bottleneck accelerates 1d, 2d, and 3d input. Anything else
falls back to a slower, non-cython version of the function. The same
goes for int32, int64, float32, float64.
It should not be difficult to extend to higher nd and more dtypes
since everything is generated from template. The problem is that there
would be a LOT of cython auto-generated C code since there is a
separate function for each ndim, dtype, axis combination.
Each of the ndim, dtype, axis functions currently has its own copy of
the algorithm (such as median). Pulling that out and reusing it should
save a lot of trees by reducing the auto-generated C code size.
I recently added a partsort and argpartsort.
-Mark
np.mean(np.zeros([4000,4000],'f4')+500)
would not equal 511.493408?
`
What I just described is a find with a '==' predicate. Not sure if it's
worthwhile to consider other predicates.
Maybe call it 'find_first'
Yes, I object. You can set the accumulator dtype explicitly if you
need it: np.mean(arr, dtype=np.float64)
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Thus, we have:
Tickets 465 and 518
http://projects.scipy.org/numpy/ticket/465
http://projects.scipy.org/numpy/ticket/518
Various threads as well such as:
http://mail.scipy.org/pipermail/numpy-discussion/2010-December/054253.html
Bruce
Of course my statement fails to address that because that wasn't what
I was responding to.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Does bincount do what you're after?
[~/]
[1]: x = np.random.randint(1,6,5)
[~/]
[2]: x
[2]: array([1, 3, 4, 5, 1])
[~/]
[3]: np.bincount(x)
[3]: array([0, 2, 0, 1, 1, 1])
Skipper
>>> a = numpy.array((1,500,1000))
>>> a
array([ 1, 500, 1000])
>>> b = numpy.bincount(a)
>>> b
array([0, 1, 0, ..., 0, 0, 1])
>>> len(b)
1001
-Mark
so:
In [53]: a
Out[53]: array([ 1., 2., nan])
In [54]: b
Out[54]: array([0, 1, 2])
In [55]: a + b
Out[55]: array([ 1., 3., nan])
and nanadd(a,b) would yield:
array([ 1., 3., 2.)
I don't see how that is particularly useful, at least not any more
useful that nanprod, nandiv, etc, etc...
What am I missing?
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
It's mostly useful for its .reduce() and .accumulate() form.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
np.bincount with 2 (or more) dimensional weights for fast calculation
of group statistics.
Josef
Even worse, it fails miserably if you sequential numbers but with a high shift.
np.bincount([100000001, 100000002]) # will take a lof of memory
Doing bincount with dict is faster in those cases.
cheers,
David
On 6/1/2011 9:35 PM, David Cournapeau wrote:
> Even worse, it fails miserably if you sequential numbers but with a high shift.
> np.bincount([100000001, 100000002]) # will take a lof of memory
> Doing bincount with dict is faster in those cases.
Since this discussion has turned shortcomings of bincount,
may I ask why np.bincount([]) is not an empty array?
Even more puzzling, why is np.bincount([],minlength=6)
not a 6-array of zeros?
Use case: bincount of infected individuals by number of contacts.
(In some periods there may be no infections.)
Thank you,
Alan Isaac
PS A collections.Counter works pretty nice for Mark and David's cases,
aside from the fact that the keys are not sorted.
> Hi All,
>
> I've been contemplating new functions that could be added to numpy and thought I'd run them by folks to see if there is any interest.
>
> 1) Modified sort/argsort functions that return the maximum k values.
> This is easy to do with heapsort and almost as easy with mergesort.
+1
>
> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
+0 --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. This needs more discussion, but useful to mention in this context.
>
> 3) Fast medians.
Definitely order-statistics would be the right thing to handle generally.
-Travis
>
>
> Chuck
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Di...@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
---
Travis Oliphant
Enthought, Inc.
olip...@enthought.com
1-512-536-1057
http://www.enthought.com
same with negative numbers, but in these cases I just subtract the min
and we are back to the fast bincount case
cheers,
Josef
Indeed, you can also deal with large numbers which are not consecutive
by using a lookup-table. All those methods are quite error-prone in
general, and reallly, there is no reason why bincount could not handle
the general case,
Just looks like it wasn't coded that way, but it's low-hanging fruit.
Any objections to adding this behavior? This commit should take care
of it. Tests pass. Comments welcome, as I'm just getting my feet wet
here.
https://github.com/jseabold/numpy/commit/133148880bba5fa3a11dfbb95cefb3da4f7970d5
Skipper
I would use np.zeros(5, dtype=int) in test_empty_with_minlength(), but
otherwise, it looks good.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Ok, thanks. Made the change and removed the old test that it fails on
empty. Pull request.
https://github.com/numpy/numpy/pull/84
Skipper
>> 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. Should make a faster version of nansum possible.
>
> +0 --- Some discussion at the data array summit led to the view that supporting nan-enabled dtypes (nanfloat64, nanfloat32, nanint64) might be a better approach for nan-handling. This needs more discussion, but useful to mention in this context.
Actually, these are completely orthogonal to Chuck's proposal. The
NA-enabled dtypes (*not* NaN-enabled dtypes) would have (x + NA) ==
NA, just like R. fadd() would be useful for other things.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco