Changing bool behaviour of pandas objects


andy hayden

unread,
Sep 10, 2013, 12:14:22 PM9/10/13
to pyd...@googlegroups.com
There's a discussion on github to change the behaviour of `__nonzero__` for pandas objects. We wanted to gauge users' feedback on proposed changes*.

Bool behaviour in pandas (and numpy) often trips up and surprises new (and experienced) users, not least because it differs from that of most python objects.

- For empty arrays it's Falsey
- For length one arrays it's bool of item (Note: bool(nan) is True)
- Otherwise it raises a ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
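A sketch of the behaviour described above, using plain numpy arrays (per the thread, empty arrays were falsey at the time; newer numpy versions restrict that case too, so it is only noted in a comment here):

```python
import numpy as np

# Length-one arrays: truthiness is the bool of the single item
assert bool(np.array([0])) is False
assert bool(np.array([np.nan])) is True   # note: bool(nan) is True

# Length > 1: ambiguous, so it raises
try:
    bool(np.array([1, 2]))
except ValueError as e:
    print(e)

# (Per the list above, empty arrays were falsey at the time of this thread.)
```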

One option (originally discussed https://github.com/pydata/pandas/issues/4633 and currently implemented in master via https://github.com/pydata/pandas/pull/4657) is to turn off bool **always**:

- raise ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

An alternative proposal being discussed is (https://github.com/pydata/pandas/pull/4738):

- For length one arrays it's bool of item (Perhaps raising on bool(Series([nan])).)
- Otherwise raise a ValueError: The truth value of an array with more than one element is ambiguous. Use a.empty, a.any() or a.all()

Note: bool of empty objects would be disallowed.

In both cases:

- not/and/or would be specifically disallowed.
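For illustration of why `not`/`and`/`or` would be disallowed: those keywords implicitly call bool() on the whole object, whereas the element-wise operators `~`, `&`, `|` (and .any()/.all()) stay unambiguous. A minimal sketch, assuming a pandas version where bool of a multi-element Series raises:

```python
import pandas as pd

s = pd.Series([True, False, True])

try:
    s and s              # `and` implicitly calls bool(s) -> ambiguous, raises
except ValueError as e:
    print(e)

# Explicit, unambiguous element-wise alternatives
print(~s)                # element-wise not
print(s & s, s | s)      # element-wise and / or
print(s.any(), s.all())  # explicit reductions
```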



*Sometime after 0.8 there was an API change from https://github.com/pydata/pandas/pull/1073, where bool(df) became df.empty; see https://github.com/pydata/pandas/issues/4633.

josef...@gmail.com

unread,
Sep 10, 2013, 12:51:02 PM9/10/13
to pyd...@googlegroups.com
On Tue, Sep 10, 2013 at 12:14 PM, andy hayden <andyh...@gmail.com> wrote:
> There's a discussion on github to change the behaviour of `__nonzero__` for
> pandas objects. We wanted to gauge users' feedback on proposed changes*.
>
> Bool behaviour in pandas (and numpy) often trips up and surprises new (and
> experienced) users, for one thing because it differs from many python
> objects.
>
> - For empty arrays it's Falsey
> - For length one arrays it's bool of item (Note: bool(nan) is True)
> - Otherwise it raises a ValueError: The truth value of an array with more
> than one element is ambiguous. Use a.any() or a.all()
>
> One option (originally discussed
> https://github.com/pydata/pandas/issues/4633 and currently implemented in
> master via https://github.com/pydata/pandas/pull/4657) is to turn off bool
> **always**:
>
> - raise ValueError: The truth value of an array is ambiguous. Use a.empty,
> a.any() or a.all().
>
> An alternative proposal being discussed is
> (https://github.com/pydata/pandas/pull/4738):
>
> - For length one arrays it's bool of item (Perhaps raising on
> bool(Series([nan])).)
> - Otherwise raise a ValueError: The truth value of an array with more than
> one element is ambiguous. Use a.empty, a.any() or a.all()

How common is the use of Series([True]) and Series([False])?
Do dataframe or series .any() .all() and similar return a Series or
a python bool?

Josef

>
> Note: bool of empty objects would be disallowed.
>
> In both cases:
>
> - not/and/or would be specifically disallowed.
>
>
>
> *sometime after 0.8 there was an API change from
> https://github.com/pydata/pandas/pull/1073, where bool(df) was df.empty see
> https://github.com/pydata/pandas/issues/4633.
>

Skipper Seabold

unread,
Sep 10, 2013, 1:02:20 PM9/10/13
to pyd...@googlegroups.com
On Tue, Sep 10, 2013 at 12:51 PM, <josef...@gmail.com> wrote:
> On Tue, Sep 10, 2013 at 12:14 PM, andy hayden <andyh...@gmail.com> wrote:
>> There's a discussion on github to change the behaviour of `__nonzero__` for
>> pandas objects. We wanted to gauge users' feedback on proposed changes*.
>>
>> Bool behaviour in pandas (and numpy) often trips up and surprises new (and
>> experienced) users, for one thing because it differs from many python
>> objects.
>>
>> - For empty arrays it's Falsey
>> - For length one arrays it's bool of item (Note: bool(nan) is True)
>> - Otherwise it raises a ValueError: The truth value of an array with more
>> than one element is ambiguous. Use a.any() or a.all()
>>
>> One option (originally discussed
>> https://github.com/pydata/pandas/issues/4633 and currently implemented in
>> master via https://github.com/pydata/pandas/pull/4657) is to turn off bool
>> **always**:
>>
>> - raise ValueError: The truth value of an array is ambiguous. Use a.empty,
>> a.any() or a.all().
>>
>> An alternative proposal being discussed is
>> (https://github.com/pydata/pandas/pull/4738):
>>
>> - For length one arrays it's bool of item (Perhaps raising on
>> bool(Series([nan])).)
>> - Otherwise raise a ValueError: The truth value of an array with more than
>> one element is ambiguous. Use a.empty, a.any() or a.all()
>
> How common is the use of Series([True]) and Series([False])?

I rely on this behavior in both numpy and pandas.

> Do dataframe or series .any() .all() and similar return a Series or
> a python bool?
>

Boolean I believe.

FWIW, I'll summarize a bit my vote and the reasoning for it here. I
think we should continue the numpy behavior but fix the wart-y
NaN-handling in numpy, because as we all know this is an area that
pandas exists to improve. I'm operating under the assumption that the
checked Series/DataFrame is the result of an indexing operation for
which one element is expected to be returned. You can't control the
container that's returned and I'd rather not have to add an .item()
everywhere in my code but pandas should keep me from doing the wrong
thing i.e., doing an ambiguous operation. Maybe it's confusing, but I
don't really see how you could shoot yourself in the foot. It just
seems drastic and unnecessary to disallow this behavior. If you want
to use .any() and .all() everywhere, then nothing is stopping you.

Behavior and reasoning:

1. Empty series raises. Maybe you screwed up your index? What is the
'correct' output of this?

if pd.isnull(pd.DataFrame([])):
    print 'this dataframe has no missing values?'

This seems ambiguous. You can't answer the question because there's no
information to evaluate the statement.

2. 1 element is fine. You know what you're doing, carry on. Also
.all() == .any() in this case, so it's not ambiguous.

3. Length > 1 raises. This is ambiguous. Ask for all, any, or empty.
Maybe you screwed up your index?
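A minimal check of the claim in (2), that .any() and .all() coincide for a single element, so its truthiness is unambiguous:

```python
import pandas as pd

for val in (True, False):
    s = pd.Series([val])
    # For length 1, any == all == the element itself
    assert bool(s.any()) == bool(s.all()) == val
```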

Skipper

Skipper Seabold

unread,
Sep 10, 2013, 1:03:31 PM9/10/13
to pyd...@googlegroups.com
On Tue, Sep 10, 2013 at 12:14 PM, andy hayden <andyh...@gmail.com> wrote:
What do you mean by "in both cases" here?

>
>
>
> *sometime after 0.8 there was an API change from
> https://github.com/pydata/pandas/pull/1073, where bool(df) was df.empty see
> https://github.com/pydata/pandas/issues/4633.
>

josef...@gmail.com

unread,
Sep 10, 2013, 1:22:54 PM9/10/13
to pyd...@googlegroups.com
If I understand correctly:

The point is that pandas should have behavior that is useful for
pandas "scalars". I think that's the issue (and the inconsistency with
numpy), not just the "bool" of a scalar or one-element array.

>>> type(np.array(['', 'a'], dtype='O')[0])
<type 'str'>
>>> type(np.array([0, 1])[0])
<type 'numpy.int32'>
>>> type(np.array([0, 1])[0].item())
<type 'int'>
>>> type(np.array([0, 1], bool)[0])
<type 'numpy.bool_'>

When indexing into a numpy array, we get scalars that we can work
with (besides small differences between the numpy scalar type and the
related python type).

So indexing into a boolean dataframe or series that returns one
element should be useful as a bool.
I assume numerical operations also work with a one-element series in
an analogous way.

(I would prefer if numpy didn't have any python bool behavior with
arrays with shape > (), and should always raise.)

Josef


> and I'd rather not have to add an .item()
> everywhere in my code but pandas should keep me from doing the wrong
> thing i.e., doing an ambiguous operation. Maybe it's confusing, but I
> don't really see how you could shoot yourself in the foot. It just
> seems drastic and unnecessary to disallow this behavior. If you want
> to use .any() and .all() everywhere, then nothing is stopping you.
>
> Behavior and reasoning:
>
> 1. Empty series raises. Maybe you screwed up your index? What is the
> 'correct' output of this?
>
> if pd.isnull(pd.DataFrame([])):
> print 'this dataframe has no missing values?'
>
> This seems ambiguous. You can't answer the question because there's no
> information to evaluate the statement.
>
> 2. 1 element is fine. You know what you're doing, carry on. Also
> .all() == .any() in this case, so it's not ambiguous.
>
> 3. Length > 1 raises . This is ambiguous. Ask for all, any, or empty.
> Maybe you screwed up your index?
>
> Skipper
>

josef...@gmail.com

unread,
Sep 10, 2013, 1:37:32 PM9/10/13
to pyd...@googlegroups.com
I don't know whether I have a more recent pandas than 0.11, so I'd better be quiet

I'm getting a numpy.bool_

>>> b
0
0 False
1 False
2 True
3 True
>>> type(b.iloc[0,0])
<type 'numpy.bool_'>
>>> bool(b.iloc[0,0])
False
>>> bool(b.iloc[2,0])
True
>>> pd.__version__
'0.11.0'
>>> type(b)
<class 'pandas.core.frame.DataFrame'>

Josef

Andy Hayden

unread,
Sep 11, 2013, 7:33:16 AM9/11/13
to pyd...@googlegroups.com
To add my thoughts:

Explicit is better than implicit, and imo using bool on a pandas object is *never* explicit (and *always* ambiguous).

I find writing code which depends on the context (the array/Series length) a strange idiom, and it's not one I use. This special case can be made completely non-ambiguous by using .item()... so why make it special? (We should add .item() to the ValueError message.)
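For illustration (not part of the proposal text itself), the .item() idiom makes the single-element case fully explicit, raising for any other length:

```python
import pandas as pd

# Exactly one element: explicit conversion to a Python scalar
assert pd.Series([True]).item() is True

# Any other length raises, so there is no ambiguity to fall into
try:
    pd.Series([True, False]).item()
except ValueError as e:
    print(e)
```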

The disallowing of `__nonzero__` *entirely*, requiring users to be explicit, seems to me a clean and sensible solution to a common hiccup/cause of bugs in pandas code. And the ValueError would give immediate feedback of what the user should do to correct their code and remove the ambiguity.

Josef: we're talking about applying bool to pandas objects e.g. DataFrame and Series: bool(df) and bool(s).

Jeff

unread,
Sep 11, 2013, 8:27:04 AM9/11/13
to pyd...@googlegroups.com, andyh...@gmail.com
FYI, master has this behavior (which makes sense and slightly sways me to Skipper's position)

In [3]: Series([]).item()
ValueError: can only convert an array of size 1 to a Python scalar

In [4]: Series([1]).item()
Out[4]: 1

In [5]: Series([1,2]).item()
ValueError: can only convert an array of size 1 to a Python scalar

Andy Hayden

unread,
Sep 11, 2013, 8:42:29 AM9/11/13
to pyd...@googlegroups.com
I agree that using item() here makes this behaviour completely explicit.

However, imo it would be surprising for bool(s) to be sugar for bool(s.item())... which (I think) is Skipper's suggestion.

Nathaniel Smith

unread,
Sep 11, 2013, 9:09:24 AM9/11/13
to pyd...@googlegroups.com, andyh...@gmail.com
On Wed, Sep 11, 2013 at 1:27 PM, Jeff <jeffr...@gmail.com> wrote:
> FYI, master has this behavior (which makes sense and slightly sways me to
> Skipper's position)
>
> In [3]: Series([]).item()
> ValueError: can only convert an array of size 1 to a Python scalar
>
> In [4]: Series([1]).item()
> Out[4]: 1
>
> In [5]: Series([1,2]).item()
> ValueError: can only convert an array of size 1 to a Python scalar

.item() is a weird and, I think, widely misunderstood method (certainly
I never understood it until getting more immersed in numpy's
internals). Logically, there are two operations involved:
- Indexing. The "pure indexing" method of course is .__getitem__, []
- Conversion from numpy-defined types to python native types.
Numpy has a "pure conversion" method, .tolist(). (This method is
misnamed, e.g. if you call .tolist() on a numpy scalar then you get a
Python scalar; type(np.int32(1).tolist()) is int, not list or
np.int32.)

.item() *combines* these two operations: arr.item(*args) is defined as
arr[args].tolist(), *except* that it has its own bizarro indexing
rules:
- if multiple index arguments are given, it's just arr[args] *except*
it is an error if the result is not a scalar.
- if one index argument is given, then the array is first flattened
and then indexed (??)
- if no index arguments are given, then it's equivalent to .item(0)
except that it's required that the flattened array have exactly 1
entry.
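A sketch checking these three indexing rules (and the native-type conversion) against numpy directly:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Multiple index arguments: plain indexing (result must be a scalar)
assert a.item(1, 2) == a[1, 2] == 5

# One index argument: the array is flattened first, then indexed
assert a.item(4) == a.ravel()[4] == 4

# No arguments: the array must have exactly one element
assert np.array([[7]]).item() == 7

# The result is always a native Python type, like .tolist() on a scalar
assert type(a.item(0)) is int
assert type(np.int32(1).tolist()) is int
```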

My guess is that this is a holdover from the old turmoil about whether
scalar types should exist at all that was a big issue around the
numeric/numarray transition, where someone invented it as a compromise
python-type-using indexing operation that has survived until now, like
a programmatic coelacanth. It's certainly not at all consistent with
modern numpy style.

-n

Jeff

unread,
Sep 16, 2013, 9:17:11 AM9/16/13
to pyd...@googlegroups.com, andyh...@gmail.com, n...@pobox.com

The PR was updated to use @cpcloud's suggested name in the error message.

Note that since we only allow a single element of bool dtype through, a single-element NaN/NaT already raises

In [1]: bool(Series([np.nan]))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.item(), a.any() or a.all().

In [2]: bool(DataFrame([]))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.item(), a.any() or a.all().
 
---------
 
@hayd maybe let's summarize and decide?
 
Pandas will differ from numpy (1):

- empty will ALWAYS raise in a boolean context

And conform (2):

- a single element Series of dtype == bool will return the bool of its element

so weighing practicality and consistency I think preserving (2) for the time being preserves backward compat.

```python
if Series([True]):
    print foo
```

will still *work*.

Maybe the best way to move this forward is to accept this PR with a deprecation message on (2), which can then be changed in a future pandas release?

Dale Jung

unread,
Sep 16, 2013, 11:21:26 AM9/16/13
to pyd...@googlegroups.com
I guess I'm the only one who made liberal use of: 

if df:
    blah blah

during that small window that it worked. :P