Default dtypes and memory use

John E

Jul 30, 2015, 12:49:53 PM
to PyData
It seems pandas will often default to float64 or int32 when it isn't needed (hence wasting memory).  By contrast, the creation of categoricals is very smart about this, using the smallest integer type that is sufficient.
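
For instance (a quick check on my end; the exact dtype presumably depends on the number of categories):

    import pandas as pd

    s = pd.Series(['a', 'b', 'a', 'c'] * 1000)
    cat = s.astype('category')
    # the underlying codes use the smallest integer type that fits
    print(cat.cat.codes.dtype)  # int8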

Conversely, pd.get_dummies seems to always create float64s, whereas these are by definition 0/1 variables, so int8 ought to be fine (ignoring cases with missing values).

So my simple question is whether there is a specific reason for get_dummies to output float64 instead of int8, and more generally, whether there is any reason not to store numbers as int8 when int8 is sufficient (or int16, etc.).  It seems that pandas will always be smart enough to upcast columns on the fly as needed, but maybe I am missing some other reason?
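
To illustrate what I mean (this is what I see in the pandas I'm running; output dtypes may differ in other versions):

    import pandas as pd

    s = pd.Series(['a', 'b', 'a', 'c'])

    d = pd.get_dummies(s)
    print(d.dtypes)  # float64 across the board

    # the same 0/1 indicators fit in int8 at one byte per value
    d8 = d.astype('int8')
    print(d8.dtypes)  # int8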

Jeff Reback

Jul 30, 2015, 1:03:07 PM
to pyd...@googlegroups.com
The defaults are always int64 and float64 (these are the numpy defaults), though the user can pass a different dtype.


get_dummies could easily be changed to return a smaller dtype, as it uses a categorical internally (which is then expanded) - so you can just use that dtype (the dtype of the codes).
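
Something like this sketch (not the actual internals, just the idea of expanding the codes into indicator columns):

    import numpy as np
    import pandas as pd

    s = pd.Series(['a', 'b', 'a', 'c'])
    cat = pd.Categorical(s)

    # expand the small-int codes into int8 indicator columns
    dummies = pd.DataFrame(
        np.eye(len(cat.categories), dtype=np.int8)[cat.codes],
        columns=cat.categories,
        index=s.index,
    )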

Please make an issue (and a pull request, if you would like).

John E

Jul 30, 2015, 4:56:33 PM
to PyData, jeffr...@gmail.com
Thanks Jeff, issue opened: https://github.com/pydata/pandas/issues/10708

A PR is currently beyond my abilities, but I may take a shot at it eventually if no one else does first.

Along these same lines, I wonder if there would be any interest in a new pandas command to automatically save memory by safely downcasting columns, and possibly converting strings to categoricals?

Stata, for example, has the command "compress", which just cycles through a dataset changing int32 to int16 and such (no compressing in the sparse or gzip sense).  It's a second-best way to approach things (vs doing it efficiently to begin with), but it's dead simple to use and doesn't really have any disadvantages.
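
A rough sketch of what I have in mind (compress_ints is a made-up name, and this only handles signed integer columns):

    import numpy as np

    def compress_ints(df):
        # cycle through integer columns, a la Stata's compress, downcasting
        # each to the smallest dtype whose range covers the column's values
        out = df.copy()
        for col in out.columns:
            if out[col].dtype.kind != 'i':  # signed integers only
                continue
            lo, hi = out[col].min(), out[col].max()
            for dtype in (np.int8, np.int16, np.int32):
                info = np.iinfo(dtype)
                if info.min <= lo and hi <= info.max:
                    out[col] = out[col].astype(dtype)
                    break
        return out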

Jeff Reback

Jul 30, 2015, 5:03:02 PM
to John E, PyData
Auto-conversion to categorical is already on the roadmap.

compress_dtypes could be interesting, I suppose, but you generally have to be careful with this - downcasting is really up to the user.

John E

Jul 30, 2015, 5:23:24 PM
to PyData, eil...@gmail.com, jeffr...@gmail.com
I agree downcasting ought to be up to the user, of course.  Such a command would only be for explicit use, never implicit (that's how it works in Stata).  For example:

    df = df.compress_dtype(options)

I am mainly thinking of integers, as it's pretty common to end up storing things as int64 that could be stored in far fewer bytes, and it's really easy to check that any downcasting is safe.  I would imagine the defaults would be ultra-safe, with options to save memory more aggressively, but only by explicitly overriding the defaults.  FWIW.
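
For example, the safety check for an integer column is just a bounds test (downcast_is_safe is a made-up name):

    import numpy as np

    def downcast_is_safe(series, dtype):
        # every value must fit within the target dtype's range
        info = np.iinfo(dtype)
        return series.min() >= info.min and series.max() <= info.max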