Hello,
I want to create a DataArray out of a pandas DataFrame (df), using the columns of the df as coord labels, except for the rightmost column, which would contain the "data".
Like this: (showing a plain list; imagine the columns labeled x, y, v)
Note that the df is not a complete cartesian product of all dimensions; one combination, ('a', '3'), is missing.
So it is a "ragged" df, in a sense.
The example is 2D, but of course this should work for an arbitrary number of dimensions.
[
['a', '1', 1 ],
['a', '2', 2 ],
['b', '1', 3 ],
['b', '2', 4 ],
['b', '3', 5 ] ]
=>
<xray.DataArray (x: 2, y: 3)>
array([[ 1., 2., nan],
[ 3., 4., 5.]])
Coordinates:
* x (0) object 'a' 'b'
* y (1) object '1' '2' '3'
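For concreteness, here is a minimal sketch of how the input above could be built as a DataFrame. The column names x, y, v are just the labels I asked you to imagine, and the row-count check is only there to show the "ragged" part.

```python
import pandas as pd

# Column names x, y, v are illustrative; they are the labels imagined above.
rows = [
    ['a', '1', 1],
    ['a', '2', 2],
    ['b', '1', 3],
    ['b', '2', 4],
    ['b', '3', 5],
]
df = pd.DataFrame(rows, columns=['x', 'y', 'v'])

# Size of the full cartesian product of the coord labels vs. rows actually present.
full_size = df['x'].nunique() * df['y'].nunique()
missing = full_size - len(df)
print(full_size, len(df), missing)  # 6 5 1
```

So the full grid would have 6 cells, but the df only supplies 5 of them.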
Sounds like an obvious task - but I could not find an API method to do this.
I rolled my own in pure Python, but it is slow (and ridiculous :)
import numpy as np
import pandas as pd
import xray


def df2xda(df, name=None):
    '''
    Convert a conceptually multiindex-type DataFrame into a multidimensional
    xray.DataArray.
    All columns except for the last are assumed to contain coord labels
    The last column is assumed to contain a numeric value
    The resulting array will have the dimensions of
    (num of distinct coord labels in col 1 x
     num of distinct coord labels in col 2 x ... )
    Values not provided in the df will be set to numpy.nan
    :param df: pandas.DataFrame
    :param name: optional name for the resulting array
    :return: xray.DataArray
    '''
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Expect a DataFrame")
    if not df.columns.size > 1:
        raise ValueError("Expect a DataFrame with at least 2 columns")
    # Distinct labels per coord column, in order of first appearance
    coords = [df[c].unique() for c in df.columns[:-1]]
    # 'str()' in case there are no column names.
    # should not be the case in normal use
    dims = [str(x) for x in df.columns[:-1]]
    # Start from an all-NaN array so combinations absent from the df stay NaN
    xt = xray.DataArray(np.full([len(x) for x in coords], np.nan),
                        dims=dims,
                        coords=coords,
                        name=name)
    # Fill cell by cell; this Python-level loop is the slow part
    for i in df.index:
        r = list(df.loc[i])  # label-based row lookup (.loc replaces the deprecated .ix)
        val = r.pop()
        xt.loc[tuple(r)] = val
    return xt
How do I _actually_ do this?
Thanks!
Dmitri