read_csv(): parse all numeric columns as 32-bit?


Gagi

Dec 7, 2012, 7:38:18 PM
to pyd...@googlegroups.com
I'm on a 64-bit system, so naturally Python, pandas, and NumPy all default to 64-bit numeric types. I often deal with very large data sets containing numeric data that would fit comfortably in 32-bit or even 16-bit floats. To save space, both in memory and on disk, what is the best way to force read_csv() to parse all numbers as 32-bit or 16-bit?

I'm thinking this would be something similar to numpy.genfromtxt's dtype parameter, or using converters.

Even better, is there some global flag I can change in NumPy to default everything to 32-bit? That would be fantastic for dealing with lots of numbers that do not require many significant digits.
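[Editor's note: a minimal sketch of the read-then-downcast approach in modern pandas/Python 3 — the thread below predates some of these idioms (select_dtypes in particular arrived later), so treat this as illustrative rather than what was available at the time:]

```python
import io

import numpy as np
import pandas as pd

data = "a,b,c\n1.5,2,3\n4.5,5,6\n7.5,8,9"

# Read at the default 64-bit widths, then downcast each numeric column.
df = pd.read_csv(io.StringIO(data))
for col in df.select_dtypes(include=[np.number]).columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype(np.float32)
    elif df[col].dtype == np.int64:
        df[col] = df[col].astype(np.int16)

print(df.dtypes)
```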

Thanks,
-Gagi

Wes McKinney

Dec 8, 2012, 5:05:49 PM
to pyd...@googlegroups.com

I'm planning to improve pandas's support for the rest of the NumPy
dtype hierarchy at some point in the future (not sure when, a function
of time and resources as always)-- there are technical challenges that
make it not completely trivial. I'm open to hearing ideas about APIs
to automatically trim the memory usage of the data set-- it may be
just as simple as downcasting float64 -> float32, and integers to the
smallest bytewidth dtype that will not lose data.
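[Editor's note: a sketch of the automatic downcasting Wes describes — pick the narrowest dtype that preserves the data. The `shrink` helper is hypothetical, written against modern NumPy/pandas, not an API from the thread:]

```python
import numpy as np
import pandas as pd

def shrink(s: pd.Series) -> pd.Series:
    """Downcast a numeric Series to a narrower dtype without losing data."""
    if s.dtype == np.float64:
        return s.astype(np.float32)
    if np.issubdtype(s.dtype, np.integer):
        # Pick the narrowest integer dtype that can hold the observed range.
        for dt in (np.int8, np.int16, np.int32):
            info = np.iinfo(dt)
            if s.min() >= info.min and s.max() <= info.max:
                return s.astype(dt)
    return s

s = pd.Series([1, 200, -5])        # fits in int16, but not int8
print(shrink(s).dtype)
```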

- Wes

Gagi

Dec 8, 2012, 7:43:04 PM
to pyd...@googlegroups.com

Thanks for getting back to me, Wes. So is the current best method for getting 32-bit numerical data into a DataFrame to read it in as 64-bit, then loop through all columns and use the DataFrame.astype method to cast down to 32-bit? I'm curious: does astype modify in place or create a copy?
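[Editor's note: astype returns a new, converted object by default rather than modifying in place — a quick check in any recent pandas:]

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])      # float64 by default
s32 = s.astype(np.float32)

print(s.dtype)    # float64: the original Series is untouched
print(s32.dtype)  # float32: astype returned a new, converted copy
```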

Thanks,
-Gagi

Gagi

Dec 12, 2012, 8:33:12 PM
to pyd...@googlegroups.com
I noticed in the documentation:

"

Specifying column data types

Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"

This is indeed a great feature, especially for numeric object codes that would otherwise be parsed as integers with their leading zeros stripped.

From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except....

In [1]:
data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})
df.dtypes
Out[1]: <-- Upcasted to 64 bit
a     object
b    float64
c      int64

The upcasting occurs even if I explicitly try to set a column to a 32-bit float/int.

In [2]:
df['b'] = df['b'].astype(np.float32)
type(df['b'][0])

Out[2]: <-- Upcasted to 64 bit even with explicit column set.
numpy.float64

However, if I start with the object type and then explicitly set the column to float16/32, everything seems to work.


In [3]:
data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})
df.dtypes
Out[3]:
a    object
b    object
c     int64 <-- upcasting during read_csv

In [4]:
print type(df['b'][0])
df['b'][0]
<type 'str'> <-- here we have the string type object not parsed

Out[4]:
'2.3456789'

In [5]:
df['b'] = df['b'].astype(np.float16) <-- explicit casting and setting of Object to float16
In [6]:
print type(df['b'][0])
print df['b'][0]
print df.dtypes
<type 'numpy.float16'>
2.3457
a     object
b    float16 <-- Yay 16 bit!
c      int64
 <-- Correctly cast object into float16, with correct truncation of the data value.

Now my next question: does this have any bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed. This may be a good workaround for me, since I often have several million low-resolution real and integer columns that add extra overhead when stored/read/parsed/written in a 64-bit-wide format.
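[Editor's note: a rough way to check the memory effect of the object-to-float16 conversion, using Series.memory_usage(deep=True) from modern pandas; the data below is illustrative, not from the thread:]

```python
import numpy as np
import pandas as pd

# A column of numeric strings, as read_csv would produce with dtype=object.
s_obj = pd.Series(["2.3456789"] * 100_000, dtype=object)
s_f16 = s_obj.astype(np.float16)   # parses the strings, 2 bytes per value

bytes_obj = s_obj.memory_usage(deep=True)
bytes_f16 = s_f16.memory_usage(deep=True)
print(bytes_obj, bytes_f16)        # the float16 column is far smaller
```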

Thanks for any input,
-Gagi




Wes McKinney

Dec 12, 2012, 9:28:06 PM
to pyd...@googlegroups.com

Can we move this discussion to GitHub? I think you can copy-paste your e-mail straight into a new issue.

- Wes