Estimating DataFrame Size in Memory


Gagi

Mar 18, 2013, 4:50:17 PM
to pyd...@googlegroups.com
Hi All,

I wrote this simple function to return how many MB are taken up by the data contained in a python DataFrame. Maybe there is a better way to extract this data and perhaps it should be a DataFrame/Series method.

def df_size(df):
    """Return the size of a DataFrame in Megabyes"""
    total = 0.0
    for col in df:
        total += df[col].nbytes
    return total/1048576
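A quick sanity check with a small all-float frame (the shape here is arbitrary, so the numbers are just illustrative):

import numpy as np
import pandas

df = pandas.DataFrame(np.random.random((1000, 10)))
# 1000 * 10 float64 values = 80,000 bytes, so roughly 0.076 MB
print(df_size(df))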


-Gagi

Andrew Giessel

Mar 18, 2013, 5:09:56 PM
to pyd...@googlegroups.com
what about just getting the size in bytes of the underlying numpy array?

In [1]: import pandas

In [2]: import numpy as np

In [3]: df = pandas.DataFrame(np.random.random((100,100)))

In [4]: df.values.nbytes
Out[4]: 80000







--
Andrew Giessel, PhD

Department of Neurobiology, Harvard Medical School
220 Longwood Ave Boston, MA 02115
ph: 617.432.7971 email: andrew_...@hms.harvard.edu

Nathaniel Smith

Mar 18, 2013, 5:36:24 PM
to pyd...@googlegroups.com

Note that these methods are not at all accurate for object arrays (which pandas uses for string data), and they also ignore the memory used by the indexes (which can be quite substantial once you include the hash table and, for MultiIndexes, the tuple store).

-n
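For what it's worth, a rough way to fold the index buffer into the estimate would be something like the sketch below (df_size_with_index is just an illustrative name; this still ignores the hash table and the Python objects behind object dtypes):

def df_size_with_index(df):
    """Column data plus the index buffer, in megabytes (still a lower bound)."""
    total = float(df.index.values.nbytes)
    for col in df:
        total += df[col].nbytes
    return total / 1048576.0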

Andrew Giessel

Mar 18, 2013, 5:43:45 PM
to pyd...@googlegroups.com
There is also sys.getsizeof(), but I'm not sure it does a great job of taking numpy arrays into account.
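One way to check how getsizeof treats numpy arrays is to compare it against nbytes for an array that owns its data and for a view (a quick experiment, not a definitive answer):

import sys
import numpy as np

a = np.zeros(1000)        # owns its 8,000-byte data buffer
print(a.nbytes)           # size of the data buffer alone
print(sys.getsizeof(a))   # compare: is the buffer included?

v = a[::2]                # a view onto the same buffer
print(v.nbytes)           # 4,000, even though no new data was allocated
print(sys.getsizeof(v))   # compare: a view should not double-count the buffer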

Gagi

Mar 25, 2013, 2:15:11 PM
to pyd...@googlegroups.com, andrew_...@hms.harvard.edu
There is a significant performance difference between accessing 'values' on an entire DataFrame and looping through the DataFrame's columns. I'm not exactly sure what pandas does when you ask for 'values' on a DataFrame with differently typed columns (probably casting, copying, and assembling them into a single numpy array) versus a single-typed frame like your example.

My example has float, int, and string columns.

In [2]: df.shape
Out[62]: (169636, 7)

In [3]: %timeit df_size(df)
10000 loops, best of 3: 22.8 us per loop

In [4]: %timeit df.values.nbytes/1048576.
10 loops, best of 3: 70.7 ms per loop


The simple looping function is much quicker and gets the same result:

In [63]: df.values.nbytes/1048576.
Out[63]: 9.059539794921875

In [64]: df_size(df)
Out[64]: 9.059539794921875


-Gagi
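A small mixed-dtype frame shows what is going on: per-column access keeps each column's native dtype, while .values has to upcast everything to one common dtype, which means building a brand-new array (toy data, just for illustration):

import numpy as np
import pandas

mixed = pandas.DataFrame({
    "a": np.arange(3),          # integer column
    "b": np.random.random(3),   # float column
    "c": ["x", "y", "z"],       # object (string) column
})

print([mixed[col].dtype for col in mixed])  # per-column dtypes are preserved
print(mixed.values.dtype)                   # object: a fresh, upcast copy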

Gagi

Mar 25, 2013, 2:18:24 PM
to pyd...@googlegroups.com, n...@pobox.com
Do you know of a way to estimate the memory taken up by an object array? If we know the string dtype we should be able to scale that by the number of elements in the array to get a rough estimate.

-Gagi
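One rough approach is to use sys.getsizeof on each element (a sketch; object_array_mb is just an illustrative name, and shared or interned strings get counted once per reference, so treat the result as an estimate):

import sys

def object_array_mb(arr):
    """Estimate memory held by an object array: the pointer array itself
    plus the Python objects those pointers refer to."""
    total = arr.nbytes  # 8 bytes per pointer on 64-bit builds
    total += sum(sys.getsizeof(x) for x in arr.ravel())
    return total / 1048576.0

For a string column this could be called as object_array_mb(df[col].values); scaling a typical string's getsizeof by the element count, as suggested above, gives much the same ballpark.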

Jeff

Mar 25, 2013, 3:04:14 PM
to pyd...@googlegroups.com, andrew_...@hms.harvard.edu
Gagi,
 
This is correct.
 
The values attribute uses a lowest-common-denominator approach to give you a single combined array for the whole frame. If you have mixed types,
you are basically copying everything (and assembling it).
 
In 0.11 (due any day now), there will be another attribute (.blocks) which will give you a dict of dtype -> DataFrame; that is very fast.
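In the meantime, a rough stand-in for the same idea is to bucket the per-column byte counts by dtype yourself (a sketch; size_by_dtype is just an illustrative name and does not use the new .blocks attribute):

from collections import defaultdict

def size_by_dtype(df):
    """Megabytes of column data, grouped by dtype (indexes excluded)."""
    totals = defaultdict(float)
    for col in df:
        totals[str(df[col].dtype)] += df[col].nbytes
    return dict((dtype, nbytes / 1048576.0) for dtype, nbytes in totals.items())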