crosstab issue dealing with int


Stefanie

Jul 17, 2015, 9:35:15 AM
to pyd...@googlegroups.com

I have a CSV table, and I tried to use crosstab to get frequency counts. It worked fine for string-type columns, but for an integer-type column it returned only a single column named "1708" instead of several columns. Any idea?





Joris Van den Bossche

Jul 17, 2015, 9:44:30 AM
to pyd...@googlegroups.com
This seems like a regression in pandas, as this worked correctly in 0.14, but not anymore in 0.16.2.

In [3]: pd.crosstab(tips.day, tips.size)
Out[3]:
size  1   2   3   4  5  6
day
Fri   1  16   1   1  0  0
Sat   2  53  18  13  1  0
Sun   0  39  15  18  3  1
Thur  1  48   4   5  1  3

In [4]: pd.__version__
Out[4]: '0.14.1'

Can you open an issue about this on the issue tracker? (https://github.com/pydata/pandas/issues)

Regards,
Joris


Joris Van den Bossche

Jul 19, 2015, 11:39:27 AM
to pyd...@googlegroups.com
I posted it as an issue here: https://github.com/pydata/pandas/issues/10629

Andrew Rosenfeld

Aug 13, 2015, 9:18:51 AM
to PyData
I think what changed between 0.14 and 0.16.2 is that Series picked up a "size" attribute via https://github.com/pydata/pandas/pull/8847

So before, tips.size was accessing a "size" column (Series) of the tips DataFrame. Now it picks up the "size" (length) of the tips DataFrame, which happens to be the value 1708.

I think pandas always runs the risk of issues like this as long as it supports the (very convenient!) feature of allowing member/"dot" access to columns (tips.size) rather than requiring getitem/square-bracket (tips['size']). If a new Series/DataFrame field/method is added with a name that was being used by old code as a column name, that old code will break.

Andrew Rosenfeld

Aug 13, 2015, 9:23:50 AM
to PyData
Sorry, hit Post too early. A few minor edits for correctness:


I think what changed between 0.14 and 0.16.2 is that DataFrame picked up a "size" attribute via https://github.com/pydata/pandas/pull/8847

So before, tips.size was accessing a "size" column (Series) of the tips DataFrame. Now it picks up the "size" of the tips DataFrame itself (its total number of elements, i.e. rows times columns), which happens to be the value 1708.

I think pandas always runs the risk of issues like this as long as it supports the (very convenient!) feature of allowing member/"dot" access to columns (tips.size) rather than requiring getitem/square-bracket (tips['size']) access. If a new DataFrame field/method is added with a name that was being used by old code as a column name, that old code will break.
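
To make the attribute-versus-column collision concrete, here is a minimal sketch. The small `tips` frame below is a made-up stand-in for the real dataset (its values are assumptions for illustration only), and it assumes a reasonably recent pandas in which `DataFrame.size` exists. It shows that dot access returns the DataFrame attribute, that square-bracket access returns the column, and one possible way to spot column names that are shadowed like this.

```python
import pandas as pd

# Hypothetical stand-in for the tips data; the values are invented for illustration.
tips = pd.DataFrame({
    "day": ["Fri", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "size": [2, 2, 3, 2, 4, 4],
})

# Dot access resolves to the DataFrame.size attribute (rows * columns),
# not to the "size" column.
attr_value = tips.size

# Square-bracket access always returns the column as a Series.
size_column = tips["size"]

# Using bracket access for both arguments gives the expected frequency table.
table = pd.crosstab(tips["day"], tips["size"])
print(table)

# One way to find column names that collide with existing DataFrame attributes.
shadowed = [col for col in tips.columns if hasattr(pd.DataFrame, col)]
print(shadowed)
```

With this toy frame, `tips.size` is a plain integer while `tips["size"]` is a Series, so only the bracket form is safe to pass to `pd.crosstab`. The `hasattr` check is just a quick heuristic and only knows about attributes present in the installed pandas version.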

Joris Van den Bossche

Aug 13, 2015, 9:37:09 AM
to PyData
Yes, of course!

Indeed, that is always the risk of using attribute access for columns. Thanks for letting us know.
