Problem with groupby and nth in pandas 0.18.1


Benjamin Bertrand

Jul 5, 2016, 9:04:07 AM
to PyData
Hi,

With pandas 0.17.1 I used to do the following:

import pandas as pd

df = pd.DataFrame(
    {'device': ['A', 'A', 'A', 'B', 'B', 'B'],
     'timestamp': [0, 2, 4, 1, 3, 5]})
df['start'] = df.groupby('device')['timestamp'].nth(0)

It gave:

df
   device timestamp start
0  A          0       0
1  A          2       NaN
2  A          4       NaN
3  B          1       1
4  B          3       NaN
5  B          5       NaN

With pandas 0.18.1, this is what I get:

df
   device timestamp start
0  A          0       NaN
1  A          2       NaN
2  A          4       NaN
3  B          1       NaN
4  B          3       NaN
5  B          5       NaN

In pandas 0.17.1, df.groupby('device')['timestamp'].nth(0) returns the index and timestamp column:
0  0
3  1


But in pandas 0.18.1, it returns the device and timestamp column. The index is "lost":
device
A    0
B    1


Is this the new normal behavior?
How can I achieve the same thing as what I was doing in pandas 0.17.1?

My DataFrame is sorted by device and timestamp and I want to get the first (and last) timestamp for each device.

Thanks

Benjamin

Joris Van den Bossche

Jul 5, 2016, 9:25:46 AM
to PyData
For now, to get the old result back, you can use head:

In [3]: df.groupby('device')['timestamp'].head(1)
Out[3]:
0    0
3    1
Name: timestamp, dtype: int64

In [4]: pd.__version__
Out[4]: u'0.18.1'

But of course this is not a solution if you want something other than the first element (nth(0)).
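
A minimal sketch of the head-based workaround, extended with tail(1) for the last row of each group since the original question asks for both. The use of tail and the column name 'end' are assumptions added for illustration and are not part of the reply above; exact behaviour may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {'device': ['A', 'A', 'A', 'B', 'B', 'B'],
     'timestamp': [0, 2, 4, 1, 3, 5]})

grouped = df.groupby('device')['timestamp']

# head(1) and tail(1) keep the original row labels, so assigning the
# result back aligns on the index and leaves NaN in all other rows.
df['start'] = grouped.head(1)
df['end'] = grouped.tail(1)  # 'end' is a hypothetical column name
print(df)
```

This only covers the first and last rows of each group; an arbitrary position n still needs nth.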

Given that there is no easy way to get the old result, and that it has been like that for a long time, maybe we should reconsider this.

Regards,
Joris



Benjamin Bertrand

Jul 12, 2016, 5:38:14 PM
to PyData
Thanks for the answer. Sorry I only noticed it today.

In the meantime, I found a Stack Overflow question with another solution using transform('idxmim'):

start_index = df.groupby('device')['timestamp'].transform('idxmim')
df['start'] = df.loc[start_index, 'timestamp'].values


df
Out[7]:
   device timestamp start
0  A          0       0
1  A          2       0
2  A          4       0
3  B          1       1
4  B          3       1
5  B          5       1

This gives me what I want. And I can use "idxmax" to get the end value.
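
A self-contained sketch of this approach for both the start and the end value, written with the method name idxmin as pandas defines it. The 'end' column and the variable names are assumptions for illustration, and the sketch assumes a pandas version where transform accepts the names of these reductions.

```python
import pandas as pd

df = pd.DataFrame(
    {'device': ['A', 'A', 'A', 'B', 'B', 'B'],
     'timestamp': [0, 2, 4, 1, 3, 5]})

grouped = df.groupby('device')['timestamp']

# For every row, the label of the row holding the smallest / largest
# timestamp within that row's group.
start_index = grouped.transform('idxmin')
end_index = grouped.transform('idxmax')

# .values drops the index so the looked-up timestamps are assigned
# positionally instead of being realigned on the repeated labels.
df['start'] = df.loc[start_index, 'timestamp'].values
df['end'] = df.loc[end_index, 'timestamp'].values
print(df)
```

Unlike nth(0) or head(1), this fills every row of a group with the group's value rather than leaving NaN outside the first row.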

Benjamin

Joris Van den Bossche

Jul 15, 2016, 5:50:35 PM
to PyData
I noticed that you can also have the original behaviour of 0.17 by passing as_index=False:

In [13]: df.groupby('device', as_index=False)['timestamp'].nth(0)
Out[13]:
0    0
3    1
Name: timestamp, dtype: int64
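
A short sketch of how as_index=False could be used for both the first and the last row of each group. Using nth(-1) for the last row and the variable names are assumptions added here, not something stated in this message. In more recent pandas releases nth is expected to keep the original row labels regardless of as_index, so it is worth checking the result on the installed version.

```python
import pandas as pd

df = pd.DataFrame(
    {'device': ['A', 'A', 'A', 'B', 'B', 'B'],
     'timestamp': [0, 2, 4, 1, 3, 5]})

# as_index=False keeps the group key out of the result index, so the
# selected rows retain their original row labels.
grouped = df.groupby('device', as_index=False)['timestamp']

first = grouped.nth(0)    # first row of each group
last = grouped.nth(-1)    # last row of each group (negative n counts from the end)

print(first)
print(last)
```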


Are you sure the transform('idxmin') works? I get an error when I try that (on both 0.17.1 and 0.18.1): AttributeError: 'SeriesGroupBy' object has no attribute 'idxmim'

Regards,
Joris

Joris Van den Bossche

Jul 15, 2016, 5:52:59 PM
to PyData

Whoops, there was a typo in your code, which is why it failed: idxmim of course does not work, but idxmin does :-)

Benjamin Bertrand

Jul 16, 2016, 11:16:20 AM
to PyData
Sorry for the typo :-)
I thought I did a copy/paste from my Jupyter Notebook.
But I managed to write idxmim twice...