[Numpy-discussion] bug in numpy.mean() ?

661 views
Skip to first unread message

K.-Michael Aye

unread,
Jan 24, 2012, 1:33:30 PM1/24/12
to numpy-di...@scipy.org
I know I know, that's pretty outrageous to even suggest, but please
bear with me, I am stumped as you may be:

2-D data file here:
http://dl.dropbox.com/u/139035/data.npy

Then:
In [3]: data.mean()
Out[3]: 3067.0243839999998

In [4]: data.max()
Out[4]: 3052.4343

In [5]: data.shape
Out[5]: (1000, 1000)

In [6]: data.min()
Out[6]: 3040.498

In [7]: data.dtype
Out[7]: dtype('float32')


A mean value calculated per loop over the data gives me 3045.747251076416
I first thought I still misunderstand how data.mean() works, per axis
and so on, but did the same with a flattenend version with the same
results.

Am I really soo tired that I can't see what I am doing wrong here?
For completion, the data was read by a osgeo.gdal dataset method called
ReadAsArray()
My numpy.__version__ gives me 1.6.1 and my whole setup is based on
Enthought's EPD.

Best regards,
Michael

_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Bruce Southey

unread,
Jan 24, 2012, 1:50:31 PM1/24/12
to numpy-di...@scipy.org
You have a million 32-bit floating point numbers that are in the
thousands. Thus you are exceeding the 32-bitfloat precision and, if you
can, you need to increase precision of the accumulator in np.mean() or
change the input dtype:
>>> a.mean(dtype=np.float32) # default and lacks precision
3067.0243839999998
>>> a.mean(dtype=np.float64)
3045.747251076416
>>> a.mean(dtype=np.float128)
3045.7472510764160156
>>> b=a.astype(np.float128)
>>> b.mean()
3045.7472510764160156

Otherwise you are left to using some alternative approach to calculate
the mean.

Bruce

Kathleen M Tacina

unread,
Jan 24, 2012, 1:53:27 PM1/24/12
to Discussion of Numerical Python
I have confirmed this on a 64-bit linux machine running python 2.7.2 with the development version of numpy.  It seems to be related to using float32 instead of float64.   If the array is first converted to a 64-bit float (via astype), mean gives an answer that agrees with your looped-calculation value: 3045.7472500000002.  With the original 32-bit array, averaging successively on one axis and then on the other gives answers that agree with the 64-bit float answer to the second decimal place.


In [125]: d = np.load('data.npy')

In [126]: d.mean()
Out[126]: 3067.0243839999998

In [127]: d64 = d.astype('float64')

In [128]: d64.mean()
Out[128]: 3045.747251076416

In [129]: d.mean(axis=0).mean()
Out[129]: 3045.7487500000002

In [130]: d.mean(axis=1).mean()
Out[130]: 3045.7444999999998

In [131]: np.version.full_version
Out[131]: '2.0.0.dev-55472ca'



--
-- 
--------------------------------------------------
Kathleen M. Tacina
NASA Glenn Research Center
MS 5-10
21000 Brookpark Road
Cleveland, OH 44135
Telephone: (216) 433-6660
Fax: (216) 433-5802
--------------------------------------------------

Zachary Pincus

unread,
Jan 24, 2012, 1:58:57 PM1/24/12
to Discussion of Numerical Python


I get the same result:

In [1]: import numpy

In [2]: data = numpy.load('data.npy')

In [3]: data.mean()
Out[3]: 3067.0243839999998

In [4]: data.max()
Out[4]: 3052.4343

In [5]: data.min()
Out[5]: 3040.498

In [6]: numpy.version.version
Out[6]: '2.0.0.dev-433b02a'

This on OS X 10.7.2 with Python 2.7.1, on an intel Core i7. Running python as a 32 vs. 64-bit process doesn't make a difference.

The data matrix doesn't look too strange when I view it as an image -- all pretty smooth variation around the (min, max) range. But maybe it's still somehow floating-point pathological?

This is fun too:
In [12]: data.mean()
Out[12]: 3067.0243839999998

In [13]: (data/3000).mean()*3000
Out[13]: 3020.8074375000001

In [15]: (data/2).mean()*2
Out[15]: 3067.0243839999998

In [16]: (data/200).mean()*200
Out[16]: 3013.6754000000001


Zach

Val Kalatsky

unread,
Jan 24, 2012, 2:01:40 PM1/24/12
to Discussion of Numerical Python

Just what Bruce said. 

You can run the following to confirm:
np.mean(data - data.mean())

If for some reason you do not want to convert to float64 you can add the result of the previous line to the "bad" mean:
bad_mean = data.mean()
good_mean = bad_mean + np.mean(data - bad_mean)

Val

Zachary Pincus

unread,
Jan 24, 2012, 2:05:50 PM1/24/12
to Discussion of Numerical Python
> You have a million 32-bit floating point numbers that are in the
> thousands. Thus you are exceeding the 32-bitfloat precision and, if you
> can, you need to increase precision of the accumulator in np.mean() or
> change the input dtype:
>>>> a.mean(dtype=np.float32) # default and lacks precision
> 3067.0243839999998
>>>> a.mean(dtype=np.float64)
> 3045.747251076416
>>>> a.mean(dtype=np.float128)
> 3045.7472510764160156
>>>> b=a.astype(np.float128)
>>>> b.mean()
> 3045.7472510764160156
>
> Otherwise you are left to using some alternative approach to calculate
> the mean.
>
> Bruce

Interesting -- I knew that float64 accumulators were used with integer arrays, and I had just assumed that 64-bit or higher accumulators would be used with floating-point arrays too, instead of the array's dtype. This is actually quite a bit of a gotcha for floating-point imaging-type tasks -- good to know!

Zach

K.-Michael Aye

unread,
Jan 24, 2012, 2:48:31 PM1/24/12
to numpy-di...@scipy.org

Thank you Bruce and all,

I knew I was doing something wrong (should have read the mean method doc more closely). Am of course glad that's so easy understandable.

But: If the error can get so big, wouldn't it be a better idea for the accumulator to always be of type 'float64' and then convert later to the type of the original array?

As one can see in this case, the result would be much closer to the true value.


Michael

eat

unread,
Jan 24, 2012, 6:12:06 PM1/24/12
to Discussion of Numerical Python
Hi,

Oddly, but numpy 1.6 seems to behave more consistent manner:

In []: sys.version
Out[]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
In []: np.version.version
Out[]: '1.6.0'

In []: d= np.load('data.npy')
In []: d.dtype
Out[]: dtype('float32')

In []: d.mean()
Out[]: 3045.7471999999998
In []: d.mean(dtype= np.float32)
Out[]: 3045.7471999999998
In []: d.mean(dtype= np.float64)
Out[]: 3045.747251076416
In []: (d- d.min()).mean()+ d.min()
Out[]: 3045.7472508750002
In []: d.mean(axis= 0).mean()
Out[]: 3045.7472499999999
In []: d.mean(axis= 1).mean()
Out[]: 3045.7472499999999

Or does the results of calculations depend more on the platform?


My 2 cents,
eat

Kathleen M Tacina

unread,
Jan 24, 2012, 6:21:35 PM1/24/12
to Discussion of Numerical Python
I found something similar, with a very simple example.

On 64-bit linux, python 2.7.2, numpy development version:

In [22]: a = 4000*np.ones((1024,1024),dtype=np.float32)

In [23]: a.mean()
Out[23]: 4034.16357421875

In [24]: np.version.full_version
Out[24]: '2.0.0.dev-55472ca'


But, a Windows XP machine running python 2.7.2 with numpy 1.6.1 gives:
>>>a = np.ones((1024,1024),dtype=np.float32)
>>>a.mean()
4000.0
>>>np.version.full_version
'1.6.1'

David Warde-Farley

unread,
Jan 24, 2012, 6:22:26 PM1/24/12
to Discussion of Numerical Python
On Wed, Jan 25, 2012 at 01:12:06AM +0200, eat wrote:

> Or does the results of calculations depend more on the platform?

Floating point operations often do, sadly (not saying that this is the case
here, but you'd need to try both versions on the same machine [or at least
architecture/bit-width]/same platform to be certain).

David

eat

unread,
Jan 24, 2012, 7:21:10 PM1/24/12
to Discussion of Numerical Python
Hi

On Wed, Jan 25, 2012 at 1:21 AM, Kathleen M Tacina <Kathleen...@nasa.gov> wrote:
I found something similar, with a very simple example.

On 64-bit linux, python 2.7.2, numpy development version:

In [22]: a = 4000*np.ones((1024,1024),dtype=np.float32)

In [23]: a.mean()
Out[23]: 4034.16357421875

In [24]: np.version.full_version
Out[24]: '2.0.0.dev-55472ca'


But, a Windows XP machine running python 2.7.2 with numpy 1.6.1 gives:
>>>a = np.ones((1024,1024),dtype=np.float32)
>>>a.mean()
4000.0
>>>np.version.full_version
'1.6.1'
This indeed looks very nasty, regardless of whether it is a version or platform related problem.

-eat

josef...@gmail.com

unread,
Jan 24, 2012, 11:40:02 PM1/24/12
to Discussion of Numerical Python
On Tue, Jan 24, 2012 at 7:21 PM, eat <e.anter...@gmail.com> wrote:
Hi

On Wed, Jan 25, 2012 at 1:21 AM, Kathleen M Tacina <Kathleen...@nasa.gov> wrote:
I found something similar, with a very simple example.

On 64-bit linux, python 2.7.2, numpy development version:

In [22]: a = 4000*np.ones((1024,1024),dtype=np.float32)

In [23]: a.mean()
Out[23]: 4034.16357421875

In [24]: np.version.full_version
Out[24]: '2.0.0.dev-55472ca'


But, a Windows XP machine running python 2.7.2 with numpy 1.6.1 gives:
>>>a = np.ones((1024,1024),dtype=np.float32)
>>>a.mean()
4000.0
>>>np.version.full_version
'1.6.1'
This indeed looks very nasty, regardless of whether it is a version or platform related problem.

Looks like platform specific, same result as -eat

Windows 7,
Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32

>>> a = np.ones((1024,1024),dtype=np.float32)
>>> a.mean()
1.0

>>> (4000*a).dtype
dtype('float32')
>>> (4000*a).mean()
4000.0

>>> b = np.load("data.npy")
>>> b.mean()
3045.7471999999998
>>> b.shape
(1000, 1000)
>>> b.mean(0).mean(0)
3045.7472499999999
>>> _.dtype
dtype('float64')
>>> b.dtype
dtype('float32')

>>> b.mean(dtype=np.float32)
3045.7471999999998

Josef
 

Charles R Harris

unread,
Jan 25, 2012, 12:03:49 AM1/25/12
to Discussion of Numerical Python
On Tue, Jan 24, 2012 at 4:21 PM, Kathleen M Tacina <Kathleen...@nasa.gov> wrote:
I found something similar, with a very simple example.

On 64-bit linux, python 2.7.2, numpy development version:

In [22]: a = 4000*np.ones((1024,1024),dtype=np.float32)

In [23]: a.mean()
Out[23]: 4034.16357421875

In [24]: np.version.full_version
Out[24]: '2.0.0.dev-55472ca'


But, a Windows XP machine running python 2.7.2 with numpy 1.6.1 gives:
>>>a = np.ones((1024,1024),dtype=np.float32)
>>>a.mean()
4000.0
>>>np.version.full_version
'1.6.1'



Yes, the results are platform/compiler dependent. The 32 bit platforms tend to use extended precision accumulators and the x87 instruction set. The 64 bit platforms tend to use sse2+. Different precisions, even though you might think they are the same.

<snip>

Chuck

josef...@gmail.com

unread,
Jan 25, 2012, 12:16:55 AM1/25/12
to Discussion of Numerical Python

just to confirm, same computer as before but the python 3.2 version is
64 bit, now I get the "Linux" result

Python 3.2 (r32:88445, Feb 20 2011, 21:30:00) [MSC v.1500 64 bit
(AMD64)] on win32

>>> import numpy as np
>>> np.__version__
'1.5.1'
>>> a = 4000*np.ones((1024,1024),dtype=np.float32)
>>> a.mean()
4034.16357421875
>>> a.mean(0).mean(0)
4000.0
>>> a.mean(dtype=np.float64)
4000.0

Josef

>
> <snip>
>
> Chuck

Reply all
Reply to author
Forward
0 new messages