To set the stage: I'm working with EEG data. A basic EEG recording
setup involves a smallish number of electrodes, each of which samples
the local electric field at some high rate, and we let them run for a
while. To make this concrete, my current test file has 32 electrodes
sampling at 250 Hz for a bit over an hour, producing an array of
floats with shape (981504, 32). So the raw data is ~240 megabytes.
Now I'd like to store this in a pandas DataFrame. There are two ways
that I'd like to use pandas's cool indexing stuff here: the first is
that each electrode has a conventional name, like "MiOc" (for "midline
occipital"), "HEOG" ("horizontal electrooculographic"), etc., so using
these as column names is natural. Then on the time (row) axis, things
are a little more complicated. While I have a total of one hour of
data in this file, it actually consists of several disjoint recording
periods that have been concatenated (basically the technician turns
the machine on and off, so there are gaps of known position but
unknown length). And then I have other recording sessions from running
the same experiment on multiple different people. So on the time axis
I'd like to use a MultiIndex like:
(subject, era_index, time_within_era)
so a typical entry might look like:
("subj1", 2, 13084)
'subject' is drawn from a small set of strings. 'era_index' is a small
integer (<100). 'time_within_era' is basically an arbitrary 32-bit
integer. (I can't see how there'd be any benefit to using the explicit
timeseries representations here, since my units are always
milliseconds, and always measured as deltas against an arbitrary and
unknown epoch.)
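For illustration, the layout described above might be built roughly like this. This is a minimal sketch against a recent pandas (the 0.7/0.8-era API differs in places); the tiny array, the two channel names, and the 4 ms spacing are stand-ins chosen for the example, not the real recording.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-in for the real (981504, 32) recording.
channels = ["MiOc", "HEOG"]
n_samples = 6
data = np.zeros((n_samples, len(channels)))

# One subject, two eras of three samples each. Times are in milliseconds
# (4 ms apart at 250 Hz) and restart at the beginning of each era.
subject = ["subj1"] * n_samples
era_index = [0, 0, 0, 1, 1, 1]
time_within_era = [0, 4, 8, 0, 4, 8]

index = pd.MultiIndex.from_arrays(
    [subject, era_index, time_within_era],
    names=["subject", "era_index", "time_within_era"],
)
df = pd.DataFrame(data, index=index, columns=channels)

# Select every sample from one era of one subject; the result is
# indexed by time_within_era only.
era1 = df.xs(("subj1", 1), level=["subject", "era_index"])
```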
My problem is that a MultiIndex like this seems to take a really
unpleasant amount of memory. To test, I created three test data sets,
saved them to pickle files, and then measured the total resident set
size of a python process that had done nothing except load one of
these files with cPickle.load(). Results:
Raw ndarray: 253 megabytes
DataFrame with column names and the default Int64Index on the rows: 265 megabytes
DataFrame with column names and MultiIndex on the rows: 416 megabytes
The second number is very impressive. The third number is disappointing :-(.
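A measurement of this kind could be approximated as below. This is only a sketch: the message does not say how the resident set size was read, so the use of the standard-library `resource` module (Unix only, and reporting the peak rather than the current RSS) is an assumption, as is the helper name `peak_rss_after_load`. It uses Python 3's `pickle` in place of `cPickle`.

```python
import pickle
import resource
import sys


def peak_rss_after_load(path):
    """Unpickle `path`; return (object, peak RSS of this process in ~MB)."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1e6 if sys.platform == "darwin" else 1e3
    return obj, peak / divisor


# Usage, in a fresh interpreter so nothing else inflates the figure
# (the file name is hypothetical):
# obj, mb = peak_rss_after_load("multiindex.pickle")
```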
Obviously I expect some overhead for storing such a complex data
structure. With sufficient cleverness it'd be possible to fit it into
the same space as an Int64Index, but I'd be surprised if pandas were
to figure that out on its own. But, with level interning and such I
would have expected that MultiIndex to internally look like, say, four
Int64Indexes? That would still have kept the overall data structure
below 300 megabytes. Do these numbers make sense, and any ideas what I
can do?
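The back-of-the-envelope reasoning above can be written out as plain arithmetic. The sketch assumes 8-byte float64 samples and 8-byte int64 index entries, and takes the measured 265 megabyte figure as the baseline.

```python
n_rows, n_channels = 981504, 32
bytes_per_value = 8  # float64 samples and int64 index entries

raw_mib = n_rows * n_channels * bytes_per_value / 2**20
one_int64_array_mib = n_rows * bytes_per_value / 2**20
four_arrays_mib = 4 * one_int64_array_mib

print(f"raw data:          {raw_mib:.1f} MiB")              # 239.6
print(f"one int64 array:   {one_int64_array_mib:.1f} MiB")  # 7.5
print(f"four int64 arrays: {four_arrays_mib:.1f} MiB")      # 30.0
print(f"265 + four arrays: {265 + four_arrays_mib:.0f}")    # 295
```

Four extra row-length int64 arrays therefore add about 30 megabytes, which is where the "below 300 megabytes" expectation comes from.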
(I'm uploading my test files here:
http://vorpus.org/~njs/pandas-eeg/
They're big, so it'll be a few minutes until they're there, but if you
want to play around with the real data there you go.)
On Sun, Aug 5, 2012 at 11:06 AM, Nathaniel Smith <n...@pobox.com> wrote:
> My problem is that a MultiIndex like this seems to take a really
> unpleasant amount of memory. To test, I created three test data sets,
> saved them to pickle files, and then measured the total resident set
> size of a python process that had done nothing except load one of
> these files with cPickle.load(). Results:
> Raw ndarray: 253 megabytes
> DataFrame with column names and the default Int64Index on the rows:
> 265 megabytes
> DataFrame with column names and MultiIndex on the rows: 416 megabytes
What version of pandas are you on? Until relatively recently,
MultiIndex also stored internally an array of tuples, which for large
datasets would chew up a huge amount of memory like you're describing.
If that's not it, I'll have to have a look at the data sometime next week.
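To get a feel for why an array of tuples is expensive, here is a rough per-row estimate. It is a sketch under assumptions: a 64-bit CPython build, one separate int object per time value, and the subject string and small era integer shared between rows. The exact sizes depend on the interpreter version, so no particular total is claimed.

```python
import sys

n_rows = 981504
sample = ("subj1", 2, 13084)

tuple_bytes = sys.getsizeof(sample)   # the tuple object itself
pointer_bytes = 8                     # one slot in an object array (64-bit build)
int_bytes = sys.getsizeof(13084)      # assumed: a separate int object per row

per_row = tuple_bytes + pointer_bytes + int_bytes
total_mib = per_row * n_rows / 2**20
print(f"~{per_row} bytes per row, ~{total_mib:.0f} MiB for {n_rows} rows")
```

For comparison, three int64 label arrays cost 24 bytes per row.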
On Sun, Aug 5, 2012 at 4:20 PM, Wes McKinney <w...@lambdafoundry.com> wrote:
> What version of pandas are you on? Until relatively recently,
> MultiIndex also stored internally an array of tuples, which for large
> datasets would chew up a huge amount of memory like you're describing.
> I'll have to have a look at the data sometime next week if not.
I thought 0.8, but my interpreter says 0.7.3. Doh. I'll try 0.8 and
see if that helps...
On Sun, Aug 5, 2012 at 4:57 PM, Nathaniel Smith <n...@pobox.com> wrote:
> I thought 0.8, but my interpreter says 0.7.3. Doh. I'll try 0.8 and
> see if that helps...
With 0.8.1, the same test gives 296 megabytes. My back-of-the-envelope
estimate was eerily correct!
On Sun, Aug 5, 2012 at 4:06 PM, Nathaniel Smith <n...@pobox.com> wrote:
> My problem is that a MultiIndex like this seems to take a really
> unpleasant amount of memory. To test, I created three test data sets,
> saved them to pickle files, and then measured the total resident set
> size of a python process that had done nothing except load one of
> these files with cPickle.load(). Results:
> Raw ndarray: 253 megabytes
> DataFrame with column names and the default Int64Index on the rows:
> 265 megabytes
> DataFrame with column names and MultiIndex on the rows: 416 megabytes
> The second number is very impressive. The third number is disappointing :-(.
Unfortunately, it looks like the second number was wrong too, because
these were freshly created indexes, so they hadn't allocated any hash
table yet. As soon as I call .xs once on the data frame (with pandas
master, 0.8.2.dev-5771612) the numbers become:
Raw ndarray: 253 megabytes
DataFrame with column names and default Int64Index on the rows: 305 megabytes
DataFrame with column names and MultiIndex on the rows: 452 megabytes
Hash tables are intrinsically memory-hungry, of course, so I'm not
expecting any miracles, but... well, I guess my question is, do you
have any miracles? :-)
Actually, my indexes will always be sorted, so I might even be happier
skipping the hash table altogether and just using binary search for
row lookup...
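The binary-search idea could look something like the following for a single sorted level. This is a NumPy sketch, not how pandas performs lookups; `find_row` and the small `times` array are hypothetical names introduced for the example.

```python
import numpy as np

# Sorted time stamps (ms) for a single era; a hypothetical stand-in.
times = np.arange(0, 40, 4, dtype=np.int32)   # 0, 4, ..., 36


def find_row(sorted_values, key):
    """Return the row position of `key`, or -1 if it is absent."""
    pos = np.searchsorted(sorted_values, key)
    if pos < len(sorted_values) and sorted_values[pos] == key:
        return int(pos)
    return -1


print(find_row(times, 12))   # 3
print(find_row(times, 13))   # -1
```

Each lookup costs O(log n) comparisons and needs no memory beyond the sorted array itself, at the price of slower lookups than a hash table.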
On a less radical note, here's the internal structure of the
MultiIndex after calling .xs():
- What's that _tuples attribute doing there? Is there a branch around
somewhere that's even newer than master? (The attribute doesn't appear
until after calling .xs(). I can't tell how much overhead this is, but
it's probably overwhelming...)
- Why are the labels using 64-bit integers? They could be 8 bits, 8
bits, and 32 bits, respectively. (This is ~18 megabytes.)
- Why are we using two separate 64-bit arrays to store the times (last
entry in the MultiIndex)? Interning integers seems kind of redundant.
I assume that having an explicit list of the levels is useful for
groupby() or such? Would it make sense to generate that list on
demand, like the hash table? (This is something like 2-5 megabytes.)
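The ~18 megabyte figure for the labels follows from the dtype sizes. A quick check, assuming the three label arrays are currently int64 and could shrink to int8, int8, and int32 as suggested above:

```python
import numpy as np

n_rows = 981504

# Bytes per row for the three label arrays, now and with narrower dtypes.
current = 3 * np.dtype(np.int64).itemsize
compact = sum(np.dtype(t).itemsize for t in (np.int8, np.int8, np.int32))

saved_mb = n_rows * (current - compact) / 1e6
print(f"{current} -> {compact} bytes per row, {saved_mb:.1f} MB saved")
# 24 -> 6 bytes per row, 17.7 MB saved
```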