Unfortunately, it looks like the second number was wrong too, because
these were freshly created indexes, so they hadn't allocated any hash
table yet. As soon as I call .xs once on the data frame (with pandas
master, 0.8.2.dev-5771612), the numbers become:
Raw ndarray: 253 megabytes
DataFrame with column names and default Int64Index on the rows: 305 megabytes
DataFrame with column names and MultiIndex on the rows: 452 megabytes
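(For reference, here's roughly how measurements like these can be taken --
this is an assumption on my part about methodology, not necessarily how
the numbers above were produced: watch the process's peak resident set
size grow as each object is built.)

```python
# Hedged sketch: process-level memory measurement on Linux.
# ru_maxrss is the *peak* RSS, so deltas only capture growth, which
# is fine for measuring freshly allocated objects.
import resource

def rss_mb():
    # On Linux, ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

before = rss_mb()
big = bytearray(50 * 1024 * 1024)  # allocate ~50 MB as a demo object
after = rss_mb()
print("grew by roughly %.0f megabytes" % (after - before))
```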
Hash tables are intrinsically memory-hungry, of course, so I'm not
expecting any miracles, but... well, I guess my question is, do you
have any miracles? :-)
Actually, my indexes will always be sorted, so I might even be happier
skipping the hash table altogether and just using binary search for
row lookup...
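(To make that concrete, here's a minimal sketch of what I mean -- the
function name and setup are mine, not anything in pandas: a lookup on a
sorted level needs nothing beyond the array itself plus binary search.)

```python
# Hedged sketch: hash-free row lookup on a sorted index via binary
# search (np.searchsorted), instead of building a hash table.
import numpy as np

# A sorted level like the 'time' level in the index above: 0, 4, 8, ...
times = np.arange(0, 798720, 4)

def lookup(sorted_values, key):
    """Return the position of `key` in `sorted_values`, or raise KeyError."""
    pos = np.searchsorted(sorted_values, key)
    if pos == len(sorted_values) or sorted_values[pos] != key:
        raise KeyError(key)
    return int(pos)

print(lookup(times, 8))  # -> 2
```

O(log n) per lookup instead of O(1), but with zero extra memory.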
On a less radical note, here's the internal structure of the
MultiIndex after calling .xs():
In [15]: df.index.__dict__
Out[15]:
{'_cache': {'_engine': <pandas.lib.ObjectEngine at 0x2aed410>},
'_tuples': array([('test', 0, 0), ('test', 0, 4), ('test', 0, 8), ...,
('test', 11, 105460), ('test', 11, 105464), ('test', 11,
105468)], dtype=object),
'labels': [array([0, 0, 0, ..., 0, 0, 0]),
array([ 0, 0, 0, ..., 11, 11, 11]),
array([ 0, 1, 2, ..., 26365, 26366, 26367])],
'levels': [Index([test], dtype=object),
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
Int64Index([ 0, 4, 8, ..., 798708, 798712, 798716])],
'name': None,
'names': ['name', 'era', 'time'],
'sortorder': None}
Things that jump out at me:
- What's that _tuples attribute doing there? Is there a branch around
somewhere that's even newer than master? (It doesn't appear until after
calling .xs().) (Can't tell how much overhead this is, but it's
probably overwhelming...)
- Why are the labels using 64-bit integers? They could be 8 bits, 8
bits, and 32 bits, respectively. (This is ~18 megabytes.)
- Why are we using two separate 64-bit arrays to store the times (the
last level in the MultiIndex)? Interning integers seems kind of
redundant.
I assume that having an explicit list of the levels is useful for
groupby() or such? Would it make sense to generate that list on
demand, like the hash table? (This is something like 2-5 megabytes.)
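(Back-of-the-envelope arithmetic for the ~18 megabyte figure above --
the row count here is illustrative, chosen so three int64 label arrays
downcast to int8/int8/int32 save exactly that much:)

```python
# Hedged sketch: savings from downcasting the three MultiIndex label
# arrays. Each int64 costs 8 bytes/row; int8+int8+int32 cost 6 bytes/row
# total, so the saving is 18 bytes per row.
import numpy as np

n = 1_000_000  # illustrative row count, not the exact size of my frame

labels64 = [np.zeros(n, np.int64) for _ in range(3)]  # current layout
labels_small = [np.zeros(n, np.int8),    # 'name': 1 distinct value
                np.zeros(n, np.int8),    # 'era': 12 distinct values
                np.zeros(n, np.int32)]   # 'time': tens of thousands of values

saved = (sum(a.nbytes for a in labels64)
         - sum(a.nbytes for a in labels_small))
print(saved / 1e6)  # -> 18.0 (megabytes)
```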
Cheers,
-n