High memory usage for MultiIndexes?

Nathaniel Smith

Aug 5, 2012, 11:06:25 AM

to pyd...@googlegroups.com

Hi all,

To set the stage: I'm working with EEG data. A basic EEG recording
setup involves a smallish number of electrodes, each of which samples
the local electric field at some high rate, and we let them run for a
while. To make this concrete, my current test file has 32 electrodes
sampling at 250 Hz for a bit over an hour, producing an array of
floats with shape (981504, 32). So the raw data is ~240 megabytes.
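
As a quick sanity check on those figures, here is a minimal sketch of the arithmetic. It assumes the samples are stored as float64 (8 bytes each), which the post does not state explicitly.

```python
import numpy as np

n_samples, n_channels = 981504, 32
sample_rate_hz = 250

# Assumption: samples are float64, i.e. 8 bytes per value.
itemsize = np.dtype(np.float64).itemsize

raw_bytes = n_samples * n_channels * itemsize
raw_mib = raw_bytes / 2**20               # size in binary megabytes
duration_s = n_samples / sample_rate_hz   # recording length in seconds

print(raw_bytes, raw_mib, duration_s / 60)
```

Under that assumption the array is about 240 MiB and covers roughly 65 minutes, consistent with "a bit over an hour".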

Now I'd like to store this in a pandas DataFrame. There are two ways
that I'd like to use pandas's cool indexing stuff here: the first is
that each electrode has a conventional name, like "MiOc" (for "midline
occipital"), "HEOG" ("horizontal electrooculographic"), etc., so using
these as column names is natural. Then on the time (row) axis, things
are a little more complicated. While I have a total of one hour of
data in this file, it actually consists of several disjoint recording
periods that have been concatenated (basically the technician turns
the machine on and off, so there are gaps of known position but
unknown length). And then I have other recording sessions from running
the same experiment on multiple different people. So on the time axis
I'd like to use a MultiIndex like:
(subject, era_index, time_within_era)
so a typical entry might look like:
("subj1", 2, 13084)

'subject' is drawn from a small set of strings. 'era_index' is a small
integer (<100). 'time_within_era' is basically an arbitrary 32-bit
integer. (I can't see how there'd be any benefit to using the explicit
timeseries representations here, since my units are always
milliseconds, and always measured as deltas against an arbitrary and
unknown epoch.)
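
For illustration, an index of this shape could be built roughly as follows. This is a toy sketch against the current pandas API (not necessarily what 0.7/0.8 offered), and the data values and the 2-era layout are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one subject, 2 eras of 3 samples each,
# 4 ms apart (i.e. 250 Hz).
subject = ["subj1"] * 6
era_index = [0, 0, 0, 1, 1, 1]
time_within_era = [0, 4, 8, 0, 4, 8]

index = pd.MultiIndex.from_arrays(
    [subject, era_index, time_within_era],
    names=["subject", "era_index", "time_within_era"],
)

# Two electrode columns filled with placeholder values 0..11.
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(6, 2),
    index=index,
    columns=["MiOc", "HEOG"],
)

row = df.loc[("subj1", 1, 4)]    # a single sample, returned as a Series
era1 = df.xs(("subj1", 1))       # all samples from one era
```

Here `row` is one time point across all electrodes, and `era1` is a frame indexed only by `time_within_era`.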

My problem is that a MultiIndex like this seems to take a really
unpleasant amount of memory. To test, I created three test data sets,
saved them to pickle files, and then measured the total resident set
size of a python process that had done nothing except load one of
these files with cPickle.load(). Results:

Raw ndarray: 253 megabytes
DataFrame with column names and the default Int64Index on the rows: 265 megabytes
DataFrame with column names and MultiIndex on the rows: 416 megabytes

The second number is very impressive. The third number is disappointing :-(.

Obviously I expect some overhead for storing such a complex data
structure. With sufficient cleverness it'd be possible to fit it into
the same space as an Int64Index, but I'd be surprised if pandas were
to figure that out on its own. But, with level interning and such I
would have expected that MultiIndex to internally look like, say, four
Int64Index's? Which would have still kept the overall data structure
below 300 megabytes. Do these numbers make sense, and any ideas what I
can do?
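
The back-of-the-envelope estimate can be spelled out as below. It is only a sketch: it takes "four Int64Index's" literally as four 8-byte-per-row arrays and ignores hash tables and any other per-object overhead.

```python
n_rows = 981504
int64_bytes = 8

# One Int64Index-sized array of n_rows entries.
per_index_bytes = n_rows * int64_bytes

# Assumption: the MultiIndex costs about four such arrays.
estimated_overhead_bytes = 4 * per_index_bytes

measured_raw_mb = 253  # resident size of the raw ndarray, from the results above
estimated_total_mb = measured_raw_mb + estimated_overhead_bytes / 1e6

print(per_index_bytes, estimated_overhead_bytes, estimated_total_mb)
```

That comes to roughly 31 megabytes on top of the raw array, so a total somewhat under 300 megabytes.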

(I'm uploading my test files here:
http://vorpus.org/~njs/pandas-eeg/
They're big, so it'll be a few minutes until they're there, but if you
want to play around with the real data there you go.)

-N

Wes McKinney

Aug 5, 2012, 11:20:53 AM

to pyd...@googlegroups.com

What version of pandas are you on? Until relatively recently,
MultiIndex also stored internally an array of tuples, which for large
datasets would chew up a huge amount of memory like you're describing.
I'll have to have a look at the data sometime next week if not.

- Wes

Nathaniel Smith

Aug 5, 2012, 11:57:54 AM

to pyd...@googlegroups.com

> What version of pandas are you on? Until relatively recently,
> MultiIndex also stored internally an array of tuples, which for large
> datasets would chew up a huge amount of memory like you're describing.
> I'll have to have a look at the data sometime next week if not.

I thought 0.8, but my interpreter says 0.7.3. Doh. I'll try 0.8 and
see if that helps...

-n

Nathaniel Smith

Aug 5, 2012, 12:25:59 PM

to pyd...@googlegroups.com

With 0.8.1, the same test gives 296 megabytes. My back-of-the-envelope
estimate was eerily correct!

Thanks for the awesome work,
-n

Nathaniel Smith

Aug 6, 2012, 10:01:24 AM

to pyd...@googlegroups.com

Unfortunately, it looks like the second number was wrong too, because
these were freshly created indexes, so they hadn't allocated any hash
table yet. As soon as I call .xs() once on the data frame (with pandas
master, 0.8.2.dev-5771612), the numbers become:

Raw ndarray: 253 megabytes
DataFrame with column names and default Int64Index on the rows: 305 megabytes
DataFrame with column names and MultiIndex on the rows: 452 megabytes

Hash tables are intrinsically memory-hungry, of course, so I'm not
expecting any miracles, but... well, I guess my question is, do you
have any miracles? :-)

Actually, my indexes will always be sorted, so I might even be happier
skipping the hash table altogether and just using binary search for
row lookup...
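
To make the binary-search idea concrete, here is a rough sketch using plain NumPy arrays. It is not how pandas does lookups; `find_row` and the toy arrays are hypothetical, and it assumes the rows are sorted by (era, time).

```python
import numpy as np

def find_row(era, time, era_key, time_key):
    """Return the row position of (era_key, time_key), or -1 if absent.

    Assumes `era` is sorted and `time` is sorted within each era.
    """
    # Binary search for the contiguous block of rows belonging to this era.
    lo = np.searchsorted(era, era_key, side="left")
    hi = np.searchsorted(era, era_key, side="right")
    # Binary search for the time within that block.
    pos = lo + np.searchsorted(time[lo:hi], time_key)
    if pos < hi and time[pos] == time_key:
        return int(pos)
    return -1

# Toy data: 2 eras of 3 samples each, 4 ms apart.
era = np.array([0, 0, 0, 1, 1, 1], dtype=np.int64)
time = np.array([0, 4, 8, 0, 4, 8], dtype=np.int64)

hit = find_row(era, time, 1, 4)
miss = find_row(era, time, 1, 5)
```

Each lookup costs O(log n) comparisons and needs no memory beyond the sorted arrays themselves.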

On a less radical note, here's the internal structure of the
MultiIndex after calling .xs():

In [15]: df.index.__dict__
Out[15]:
{'_cache': {'_engine': <pandas.lib.ObjectEngine at 0x2aed410>},
 '_tuples': array([('test', 0, 0), ('test', 0, 4), ('test', 0, 8), ...,
       ('test', 11, 105460), ('test', 11, 105464), ('test', 11, 105468)],
      dtype=object),
 'labels': [array([0, 0, 0, ..., 0, 0, 0]),
  array([ 0, 0, 0, ..., 11, 11, 11]),
  array([ 0, 1, 2, ..., 26365, 26366, 26367])],
 'levels': [Index([test], dtype=object),
  Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
  Int64Index([ 0, 4, 8, ..., 798708, 798712, 798716])],
 'name': None,
 'names': ['name', 'era', 'time'],
 'sortorder': None}

Things that jump out at me:

- What's that _tuples attribute doing there? Is there an even-newer
branch around somewhere than master? (It doesn't appear until after
calling .xs()) (Can't tell how much overhead this is, but probably
overwhelming...)

- Why are the labels using 64-bit integers? They could be 8 bits, 8
bits, and 32 bits, respectively. (This is ~18 megabytes.)

- Why are we using two separate 64-bit arrays to store the times (last
entry in the MultiIndex)? Interning integers seems kind of redundant.
I assume that having an explicit list of the levels is useful for
groupby() or such? Would it make sense to generate that list on
demand, like the hash table? (This is something like 2-5 megabytes.)
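
The label-width point can be checked with a little arithmetic, and the levels/labels split itself is easy to reproduce with NumPy. This is a sketch only: `np.unique` stands in for whatever pandas does internally, and the toy `times` array is made up.

```python
import numpy as np

n_rows = 981504

# Current layout: three int64 label arrays.
current_bytes = 3 * n_rows * np.dtype(np.int64).itemsize
# Proposed layout: int8 + int8 + int32 labels.
compact_bytes = n_rows * (
    2 * np.dtype(np.int8).itemsize + np.dtype(np.int32).itemsize
)
saved_bytes = current_bytes - compact_bytes

# Interning: unique sorted values become the level, and each row stores
# a small integer code pointing into it.
times = np.array([0, 4, 8, 0, 4, 8], dtype=np.int64)
levels, labels = np.unique(times, return_inverse=True)
labels = labels.astype(np.int8)  # 3 distinct codes fit easily in 8 bits

print(saved_bytes / 1e6, levels, labels)
```

The saving works out to about 17.7 megabytes, in line with the ~18 megabytes quoted above, and `levels[labels]` reconstructs the original values, so the level list could in principle be regenerated on demand.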

Cheers,
-n
