Re: pandas.pivot_table indexing problem/bug:

Gagi

Nov 16, 2012, 7:07:53 PM
to pyd...@googlegroups.com
Here is the full code, not clipped:

import pandas as pd
import numpy as np

# Generate Long File & Test Pivot
NUM_ROWS = 1000000
df = pd.DataFrame({'A': np.random.randint(100, size=NUM_ROWS),
                   'B': np.random.randint(300, size=NUM_ROWS),
                   'C': np.random.randint(-7, 7, size=NUM_ROWS),
                   'D': np.random.randint(-19, 19, size=NUM_ROWS),
                   'E': np.random.randint(3000, size=NUM_ROWS),
                   'F': np.random.randn(NUM_ROWS)})

df_pivoted = df.pivot_table(rows=['A', 'B', 'C'], cols='E', values='F')
df_pivoted

Gagi

Nov 16, 2012, 7:13:19 PM
to pyd...@googlegroups.com
And here is the code that fails. Note that here I am pivoting on combinations of four columns (A, B, C, D), and it fails with only 1,000 rows, whereas the code above, which pivots on three columns, works on 1M rows.

import pandas as pd
import numpy as np

# Generate Long File & Test Pivot
NUM_ROWS = 1000
df = pd.DataFrame({'A': np.random.randint(100, size=NUM_ROWS),
                   'B': np.random.randint(300, size=NUM_ROWS),
                   'C': np.random.randint(-7, 7, size=NUM_ROWS),
                   'D': np.random.randint(-19, 19, size=NUM_ROWS),
                   'E': np.random.randint(3000, size=NUM_ROWS),
                   'F': np.random.randn(NUM_ROWS)})

df_pivoted = df.pivot_table(rows=['A', 'B', 'C', 'D'], cols='E', values='F')
df_pivoted
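
For readers on a recent pandas release, where `pivot_table` takes `index` and `columns` in place of the `rows` and `cols` keywords used above, the failing case might be rerun along these lines. This is only a sketch, not the original poster's code: `make_frame`, `pivot_four_keys`, and the fixed seed are illustrative additions.

```python
import numpy as np
import pandas as pd


def make_frame(num_rows, seed=0):
    """Build a long-format frame shaped like the one in the post (hypothetical helper)."""
    rng = np.random.RandomState(seed)  # fixed seed is an assumption, for repeatability
    return pd.DataFrame({
        'A': rng.randint(100, size=num_rows),
        'B': rng.randint(300, size=num_rows),
        'C': rng.randint(-7, 7, size=num_rows),
        'D': rng.randint(-19, 19, size=num_rows),
        'E': rng.randint(3000, size=num_rows),
        'F': rng.randn(num_rows),
    })


def pivot_four_keys(df):
    """Pivot on the four row keys from the failing example (hypothetical helper)."""
    # `index`/`columns` are the current names for the old `rows`/`cols` keywords.
    return df.pivot_table(index=['A', 'B', 'C', 'D'], columns='E', values='F')


df = make_frame(1000)
df_pivoted = pivot_four_keys(df)
print(df_pivoted.shape)
```

The result has one row per observed (A, B, C, D) combination and one column per observed value of E, so most cells are NaN.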

Wes McKinney

Nov 16, 2012, 9:33:06 PM
to pyd...@googlegroups.com

Hi Gagi,

I'm fairly certain this is an issue in the unstack algorithm. I'm
surprised it never came up until now. Creating a GitHub issue; I will
address as soon as I can.

http://github.com/pydata/pandas/issues/2278

Thanks,
Wes

Gagi Drmanac

Nov 17, 2012, 8:18:06 PM
to pyd...@googlegroups.com
Thanks Wes!

I'll gladly test it out once the issue has been tracked down.

Thanks,
-Gagi

Wes McKinney

Nov 22, 2012, 6:09:22 PM
to pyd...@googlegroups.com

Hi Gagi,

I fixed this issue today; unstack is also much faster now in a lot of
cases. Keep in mind that the pivot table will be of size N x K, where
N is the number of observed key-tuples in the rows and K the number in
the columns. That could potentially be very large depending on your
data set.

- Wes
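
To make the N x K point concrete, the size of the pivoted result could be estimated before pivoting by counting the distinct key-tuples on each axis. The sketch below is an illustration only: `pivot_shape` and `dense_bytes` are hypothetical helpers, not pandas functions, and the byte estimate assumes a dense float64 result.

```python
import pandas as pd


def pivot_shape(df, row_keys, col_keys):
    """Return (N, K): distinct observed key-tuples for the rows and for the columns."""
    n = len(df[row_keys].drop_duplicates())
    k = len(df[col_keys].drop_duplicates())
    return n, k


def dense_bytes(n, k, itemsize=8):
    """Approximate memory of a dense N x K table (assumes 8-byte float64 cells)."""
    return n * k * itemsize


# Tiny hand-built frame so the counts are easy to verify by eye.
small = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': [0, 0, 0, 1],
    'E': [10, 11, 10, 12],
    'F': [0.5, 1.5, 2.5, 3.5],
})

n, k = pivot_shape(small, ['A', 'B'], ['E'])
print(n, k, dense_bytes(n, k))
```

Here the row keys (A, B) take three distinct values and E takes three, so the pivot would be 3 x 3 even though the input has only four rows.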

Gagi Drmanac

Nov 22, 2012, 6:11:50 PM
to pyd...@googlegroups.com
Thanks Wes,

I'll give it a try. Is the new code available in one of the nightly
build binaries?

Thanks,
-Gagi

Wes McKinney

Nov 22, 2012, 6:23:39 PM
to pyd...@googlegroups.com

(pls bottom post if you would!)

There appears to be an issue with the nightly binaries process: they
haven't updated for the last 10 days. I will have to look into it next
week; until then you'd have to build from source to get the fix.

- Wes

Gagi

Nov 27, 2012, 1:33:03 PM
to pyd...@googlegroups.com


Thanks for looking into this. I'll check the binaries folder for the update.

-Gagi