DataArray from "multi-index-like" pandas dataframe


xzf...@gmail.com

Aug 18, 2015, 5:50:40 PM
to xray
Hello,

I want to create a DataArray from a pandas DataFrame (df), using every column of the df as coord labels except the rightmost one, which would contain the "data".

Like this (shown as a plain list; imagine the columns labeled x, y, v).
Note that the df is not a complete Cartesian product of all dimensions: the combination ('a', '3') is missing.
So it is a "ragged" df, in a sense.

The example is 2D, but of course this should work for an arbitrary number of dimensions.

[ ['a', '1',  1 ],
  ['a', '2',  2 ],
  ['b', '1',  3 ],
  ['b', '2',  4 ],
  ['b', '3',  5 ] ]

=>
<xray.DataArray (x: 2, y: 3)>
array([[  1.,   2.,  nan],
       [  3.,   4.,   5.]])
Coordinates:
  * x        (0) object 'a' 'b'
  * y        (1) object '1' '2' '3'
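
For anyone following along, here is a minimal sketch of the input above as an actual DataFrame, assuming the column names x, y, v mentioned earlier. The variable names (rows, full_grid, missing) are made up for illustration. It also lists which (x, y) combinations a full grid would contain but the df does not, which for this data is the single pair ('a', '3').

```python
import pandas as pd

# The example rows from the post; the last column holds the data values.
rows = [
    ["a", "1", 1],
    ["a", "2", 2],
    ["b", "1", 3],
    ["b", "2", 4],
    ["b", "3", 5],
]
df = pd.DataFrame(rows, columns=["x", "y", "v"])

# Every (x, y) pair that a complete Cartesian product would contain.
full_grid = pd.MultiIndex.from_product(
    [df["x"].unique(), df["y"].unique()], names=["x", "y"]
)

# The (x, y) pairs that are actually present in the df.
present = df.set_index(["x", "y"]).index

# Pairs in the full grid that have no row in the df.
missing = full_grid.difference(present)
print(list(missing))
```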

Sounds like an obvious task - but I could not find an API method to do this.
I rolled my own in pure Python, but it is slow (and ridiculous :)

import numpy as np
import pandas as pd
import xray

def df2xda( df , name=None ) :
    '''
    Convert a conceptually multiindex-type DataFrame into a multidimensional
    xray.DataArray.

    All columns except for the last are assumed to contain coord labels
    The last column is assumed to contain a numeric value

    The resulting array will have the dimensions of
    (num of distinct coord labels in col 1 x
    num of distinct coord labels in col 2 x ... )

    Values not provided in the df will be set to numpy.nan

    :param df: pandas.DataFrame
    :param name: optional name for the resulting array

    :return: xray.DataArray
    '''

    if not isinstance(df , pd.DataFrame) :
        raise ValueError("Expect a DataFrame")

    if not df.columns.size > 1 :
        raise ValueError("Expect a DataFrame with at least 2 columns")

    # distinct labels of each coord column, in order of first appearance
    coords = [ df[c].unique() for c in df.columns[:-1] ]
    # 'str()' in case there are no column names.
    # should not be the case in normal use
    dims = [ str(x) for x in df.columns[:-1] ]

    # start from an all-NaN array so that missing combinations stay NaN
    xt = xray.DataArray( np.full( [ len(x) for x in coords ] , np.nan ) ,
                         dims = dims ,
                         coords = coords ,
                         name = name )

    # fill in one cell per row (this per-row label lookup is the slow part)
    for i in df.index :
        r = list( df.loc[i] )
        val = r.pop()
        xt.loc[tuple(r)] = val

    return xt


How do I _actually_ do this?
Thanks!
Dmitri

Stephan Hoyer

Aug 19, 2015, 4:04:46 PM
to xzf...@gmail.com, xray
Hi Dmitri,

To convert this sort of DataFrame into a DataArray with the appropriate dimensions, first convert it into a pandas.Series with a MultiIndex and then use xray.DataArray.from_series:
http://xray.readthedocs.org/en/stable/pandas.html#dataarray-and-series

On your dataset, that would look something like:

series = df.set_index(['x', 'y'])['v']
array = xray.DataArray.from_series(series)
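
As a rough end-to-end sketch of the same idea on the example data from the question: this assumes the columns are named x, y, v as described there, and it imports xarray, the name under which xray was later released (in 2015 the import would have been `import xray`). The helper `df_to_dataarray` is a made-up name for illustration, not part of either library. It treats every column except the last as a dimension, so it is not limited to 2D.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: the renamed successor of the xray package

# The example table from the question.
df = pd.DataFrame(
    [["a", "1", 1], ["a", "2", 2], ["b", "1", 3], ["b", "2", 4], ["b", "3", 5]],
    columns=["x", "y", "v"],
)


def df_to_dataarray(frame):
    """Hypothetical helper: all columns but the last become dimensions."""
    *dims, value = frame.columns
    # Move the label columns into a (Multi)Index and keep the value column.
    series = frame.set_index(dims)[value]
    # from_series expands the index levels into dimensions; combinations
    # that have no row in the frame are filled with NaN.
    return xr.DataArray.from_series(series)


array = df_to_dataarray(df)
print(array)
```

The missing ('a', '3') combination comes out as NaN, so the integer values are promoted to floats in the result.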

Cheers,
Stephan


xzf...@gmail.com

Aug 19, 2015, 6:19:41 PM
to xray, xzf...@gmail.com
Stephan - This is perfect. Thank you!

(btw I'm not sure what happened with fontsize in my initial post; looks huge)