Is nesting a good idea with DataFrames or am I confused?


Lee-Ping Wang

Mar 29, 2014, 5:03:00 PM
to pystat...@googlegroups.com
Hi there,

I'm reading through a book on pandas, and it looks like quite a promising way to manage data.  I am curious about the possibility of having an entry in a DataFrame be another DataFrame, but since I'm still learning there's a chance I could be confused.  My Google search wasn't very fruitful so I'm hoping there's someone with similar experience here.

In my application, the index may correspond to a particular experiment (or a set of physical conditions for multiple experiments), and the columns correspond to measurements.  A measurement could be a single number, or it could be a whole spectrum (I could use DataFrame to represent it, but it would be an entry in the DataFrame that contains multiple experiments).

The reason why I don't want to use multilevel indexing is because the secondary index (e.g. the wavenumber, or x-axis in the spectrum) applies to the spectrum only.
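For concreteness, here is a minimal sketch of the kind of nesting I have in mind (the column names are invented), using an object-dtype column so that a single cell of the top-level DataFrame holds a whole spectrum DataFrame:

```python
import pandas as pd

# A spectrum with its own axis (wavenumber vs. intensity) -- values made up.
spectrum = pd.DataFrame({'intensity': [1.20, 1.22, 1.25]},
                        index=pd.Index([1000, 1100, 1200], name='wavenumber'))

# Top-level table: one row per experiment; scalar columns plus one
# object-dtype column whose entry is the whole spectrum DataFrame.
experiments = pd.DataFrame({'T': [298], 'P': [1], 'density': [997]},
                           index=pd.Index(['run1'], name='experiment'))
experiments['IR'] = pd.Series([spectrum], index=experiments.index, dtype=object)

# Each 'IR' cell is itself a DataFrame with its own wavenumber index.
print(type(experiments.loc['run1', 'IR']))
```

As far as I can tell this does store everything, but each nested cell is an opaque Python object: pandas can't vectorize across the nested frames, so it behaves more like a dict of tables than a real DataFrame.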

Thanks,

- Lee-Ping

Paul Hobson

Mar 30, 2014, 11:56:58 AM
to pystat...@googlegroups.com
I don't view that as a reason not to use a multilevel index. But if you are really opposed to that, use a Panel instead.

Lee-Ping Wang

Mar 30, 2014, 1:56:44 PM
to pystat...@googlegroups.com

[Table 1, Table 2, and Table 3 attached as images]

Thanks for your reply.  I'm not opposed to any method in particular; I'm just figuring out the best way to represent the data in my application.  The main issue is that some measurements are scalars, while other measurements come with indices particular to that measurement only.

If I have several sets of experimental conditions (i.e. temperature and pressure) where I have measured the density, heat capacity, and infrared spectrum of water, then the temperature / pressure / density / heat capacity columns are perfectly representable in the DataFrame.  However, the IR spectrum of water is different because each measurement is an entire series with its own axes (wavenumber vs. intensity).  Thus, the IR spectrum is the only column with a third dimension.

If I use multilevel indexing to represent all of the data in a single frame, then I will end up with something like Table 1.

Pandas will fill in NaN for the empty entries, but NaN is supposed to stand for "missing data", i.e. the experiment was not performed at those conditions.  This case is different because it simply makes no sense to fill in a density corresponding to an IR wavenumber, or an IR intensity without a wavenumber.

My proposed solution would be something like Tables 2 and 3.

Since each entry in the top-level DataFrame is itself another DataFrame, I can assign a frequency axis to the IR spectrum that is relevant only to that particular measurement, while the other entries remain scalars.  My C++ programming is a bit rusty, but I think it resembles something like a linked list.

The issue with using a Panel is that everything becomes three-dimensional, and if I understand correctly all measurements would share the same three indices. 

Thanks,

- Lee-Ping

Paul Hobson

Mar 30, 2014, 4:33:16 PM
to pystat...@googlegroups.com
That would work. Filling Table 1 with NaNs wouldn't be a bad solution either, IMO. It's hard to know the best way to store the data without knowing exactly how you'll be using it. The way I would do this is to stack/melt/unpivot Table 1 into something like this:

# raw data
# raw data
from io import StringIO

import pandas

data = StringIO("""\
run,obs,T,P,D,HC,WN,IR
1,1,298,1,997,4.18,1000,1.2
1,2,,,,,1100,1.22
1,3,,,,,1200,1.25
1,4,,,,,1300,1.30
2,1,310,1,995,4.1,1000,1.21
2,2,,,,,1100,1.22
2,3,,,,,1200,1.25
""")
df = pandas.read_csv(data, index_col=['run', 'obs'])
print(df.to_string())


           T   P    D    HC    WN    IR
run obs                                
1   1    298   1  997  4.18  1000  1.20
    2    NaN NaN  NaN   NaN  1100  1.22
    3    NaN NaN  NaN   NaN  1200  1.25
    4    NaN NaN  NaN   NaN  1300  1.30
2   1    310   1  995  4.10  1000  1.21
    2    NaN NaN  NaN   NaN  1100  1.22
    3    NaN NaN  NaN   NaN  1200  1.25

df.stack()
run  obs    
1    1    T      298.00
          P        1.00
          D      997.00
          HC       4.18
          WN    1000.00
          IR       1.20
     2    WN    1100.00
          IR       1.22
     3    WN    1200.00
          IR       1.25
     4    WN    1300.00
          IR       1.30
2    1    T      310.00
          P        1.00
          D      995.00
          HC       4.10
          WN    1000.00
          IR       1.21
     2    WN    1100.00
          IR       1.22
     3    WN    1200.00
          IR       1.25


You can then use the `xs` method of the dataframe to pull out values as needed.
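For instance (a sketch reusing the same raw data as above; the innermost, unnamed index level of the stacked Series holds the original column labels):

```python
from io import StringIO

import pandas

data = StringIO("""\
run,obs,T,P,D,HC,WN,IR
1,1,298,1,997,4.18,1000,1.2
1,2,,,,,1100,1.22
1,3,,,,,1200,1.25
1,4,,,,,1300,1.30
2,1,310,1,995,4.1,1000,1.21
2,2,,,,,1100,1.22
2,3,,,,,1200,1.25
""")
df = pandas.read_csv(data, index_col=['run', 'obs'])
stacked = df.stack()

# All wavenumber values across both runs (level 2 is the unnamed
# level carrying the original column labels).
wn = stacked.xs('WN', level=2)

# Everything recorded for run 1, observation 1.
first = stacked.xs((1, 1), level=['run', 'obs'])
print(wn)
print(first)
```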

Lee-Ping Wang

Mar 30, 2014, 5:02:08 PM
to pystat...@googlegroups.com
Hi Paul,

Thank you.  I don't plan to use this data in any kind of high-throughput fashion; I just need to read it in and keep it organized.  There will probably be less than 1 MB of total data, though there may be multiple data types.

I think your solution is interesting and it does sound like a good way to go.  When I read in IR spectra, I will know to extract the WN and IR columns; on the other hand, if I am reading in densities I will only extract the D column.  In both cases I drop the rows with NaN.  This is also general enough for cases where I have more than one "graph" data type (for example, a pair distribution function will have R / G columns with multiple R / G rows).
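That extraction step looks simple enough: with the layout from Paul's example, it should amount to selecting the relevant columns and dropping the NaN rows (a sketch with a shortened version of the same data):

```python
from io import StringIO

import pandas

data = StringIO("""\
run,obs,T,P,D,HC,WN,IR
1,1,298,1,997,4.18,1000,1.2
1,2,,,,,1100,1.22
2,1,310,1,995,4.1,1000,1.21
""")
df = pandas.read_csv(data, index_col=['run', 'obs'])

# IR spectra: keep only the wavenumber/intensity columns, drop empty rows.
spectra = df[['WN', 'IR']].dropna()

# Densities: D is only recorded on the first observation of each run.
densities = df['D'].dropna()
print(spectra)
print(densities)
```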

The format you gave above seems like a good way for the program to manage the data, but a related question is how this information is going to be stored on disk in a way that's intuitive for the user.  Here I think the user should have the option to store IR spectra as separate files in addition to putting all of the data in a single text file.  Thus, on disk the top-level data table could have file names as entries, and these would point to the sub-tables on disk.  I would need to write a parser to convert the multiple user-provided text files into our composite DataFrame, but that doesn't seem so difficult.
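A rough sketch of such a parser (file names, paths, and column names are all invented for illustration; here the spectrum file is written to a temp directory just to make the example self-contained):

```python
import os
import tempfile

import pandas

# Stand-in for a user-provided spectrum file on disk.
tmpdir = tempfile.mkdtemp()
spec_path = os.path.join(tmpdir, 'run1_ir.csv')
with open(spec_path, 'w') as f:
    f.write('wavenumber,intensity\n1000,1.20\n1100,1.22\n')

# Top-level table as the user would write it: scalars plus a file name.
top = pandas.DataFrame({'T': [298], 'P': [1], 'IR_file': [spec_path]})

# Parser step: replace each file name with the DataFrame parsed from it.
top['IR'] = pandas.Series(
    [pandas.read_csv(p, index_col='wavenumber') for p in top['IR_file']],
    index=top.index, dtype=object)
print(top.loc[0, 'IR'])
```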

Thanks,

- Lee-Ping

Richard Styron

Mar 31, 2014, 11:47:34 AM
to pystat...@googlegroups.com
Lee-Ping,

I find myself in the same position with experimental data or with Monte Carlo simulations quite frequently. In my experience, it's better to fill in all of the columns; I would argue that using NaN is inappropriate because those same conditions do apply to all rows. Then, if you slice on a particular observation (say 'WN' in your dataframe), you get all of the P-T data for that experiment tied to it. For instance, if a colleague (not knowing your data layout) queried for the P-T conditions tied to wavenumbers >1200, the query would always return NaN, which ranges from unhelpful to actually harmful if they concluded that you had not recorded that information for the experiment.
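To make that concrete, a small sketch with invented numbers: once the conditions are repeated on every row, a colleague's query carries its P-T context with it.

```python
import pandas

# Conditions repeated on every row, so each observation is self-describing.
df = pandas.DataFrame({
    'run': [1, 1, 1, 2, 2],
    'T':   [298, 298, 298, 310, 310],
    'P':   [1, 1, 1, 1, 1],
    'WN':  [1100, 1200, 1300, 1100, 1200],
    'IR':  [1.22, 1.25, 1.30, 1.22, 1.25],
})

# The colleague's query: P-T conditions for every wavenumber above 1200.
hits = df.loc[df['WN'] > 1200, ['run', 'T', 'P', 'WN']]
print(hits)
```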

I dislike filling like this on the principle that it duplicates a lot of information, but it does allow for much better usage of Pandas' capabilities (especially when querying or doing other statistical analysis of the data), and it makes it easy to export to csv or xls or something when it comes time to share with non-programming colleagues or submit to a journal.  From a practical perspective, it does take up a lot more space (often I will have like 50 columns per sample, and 40 of those columns will be metadata/experimental conditions that don't change for the rows for that sample).  I also think it's less readable than the sort of 'sparse' view that you have shown in the tables.  I have never checked but I assume that there is something in the Pandas internals that wants/needs data to be laid out this way. 

I have played around with using dictionaries and having Series and DataFrame objects as values for certain keys while the others are scalars or strings (so I can nest things as appropriate to the data's shape), and though this is less wasteful, it removes a lot of the functionality.  Sometimes I do this just to store large datasets as JSON text files.
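A minimal sketch of that mixed-dict approach (keys and values invented), serializing the nested frame through pandas' own JSON round-trip:

```python
import json
from io import StringIO

import pandas

# Scalars and a whole DataFrame side by side in one dict.
spectrum = pandas.DataFrame({'wavenumber': [1000, 1100],
                             'intensity': [1.20, 1.22]})
record = {'T': 298, 'P': 1, 'density': 997, 'IR': spectrum}

# Serialize: nested frames go through DataFrame.to_json, scalars pass through.
text = json.dumps({key: (val.to_json(orient='split')
                         if isinstance(val, pandas.DataFrame) else val)
                   for key, val in record.items()})

# Deserialize: scalars come back as-is; frames are rebuilt with read_json.
loaded = json.loads(text)
ir = pandas.read_json(StringIO(loaded['IR']), orient='split')
print(ir)
```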

HTH,
Richard

Lee-Ping Wang

Mar 31, 2014, 12:13:38 PM
to pystat...@googlegroups.com
Hi Richard,

Thanks for your insight.  If I store all of my data (including spectra) in a single DataFrame, then I agree it could make sense - from a data analysis point of view - to store the experimental conditions repeated across multiple rows of the DataFrame.  But as you say, this is not very legible when the data is printed out to a file.

Since my application mainly involves reading in the data (rather than writing it out), I think it makes sense for the user to enter data in whatever way is most intuitive for him/her - whether that is tab-delimited, fixed-width, or .csv text files.  Internally, the data may be converted to the format that you're suggesting.  From what I've read in the book, pandas has pretty good ways of forward-filling the missing entries.
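For example, with the sparse layout from earlier in the thread, the condition columns could be forward-filled within each run right after reading (a sketch; grouping by run keeps the fill from bleeding across experiments):

```python
from io import StringIO

import pandas

data = StringIO("""\
run,obs,T,P,D,HC,WN,IR
1,1,298,1,997,4.18,1000,1.2
1,2,,,,,1100,1.22
2,1,310,1,995,4.1,1000,1.21
2,2,,,,,1100,1.22
""")
df = pandas.read_csv(data, index_col=['run', 'obs'])

# Forward-fill the condition columns within each run, never across runs.
cond = ['T', 'P', 'D', 'HC']
df[cond] = df.groupby(level='run')[cond].ffill()
print(df)
```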

- Lee-Ping