DataArray from "multi-index-like" pandas dataframe

xzf...@gmail.com

Aug 18, 2015, 5:50:40 PM

to xray

Hello,

I want to create a DataArray from a pandas DataFrame (df), using every column of the df as coord labels except the rightmost one, which contains the "data".

Like this: (showing a plain list; imagine the columns labeled x, y, v)
Note that the df is not a complete cartesian product of all dimensions: the ('a', '3') combination is missing.
So it is a "ragged" df, in a sense.

The example is 2D, but of course this should work for an arbitrary number of dimensions.

[
    ['a', '1',  1 ],
    ['a', '2',  2 ],
    ['b', '1',  3 ],
    ['b', '2',  4 ],
    ['b', '3',  5 ] ]

=> 
<xray.DataArray (x: 2, y: 3)>
array([[  1.,   2.,  nan],
       [  3.,   4.,   5.]])
Coordinates:
  * x        (0) object 'a' 'b'
  * y        (1) object '1' '2' '3'

Sounds like an obvious task - but I could not find an API method to do this.
I rolled my own in pure Python, but it is slow (and ridiculous :)

import numpy as np
import pandas as pd
import xray

def df2xda( df , name=None ) :
    '''
    Convert a conceptually multiindex-type DataFrame into a multidimensional
    xray.DataArray.

    All columns except for the last are assumed to contain coord labels
    The last column is assumed to contain a numeric value

    The resulting array will have the dimensions of
    (num of distinct coord labels in col 1 x
     num of distinct coord labels in col 2 x ... )

    Values not provided in the df will be set to numpy.nan

    :param df: pandas.DataFrame
    :param name: optional name for the resulting array

    :return: xray.DataArray
    '''

    if not isinstance(df , pd.DataFrame) :
        raise ValueError("Expect a DataFrame")

    if not df.columns.size > 1 :
        raise ValueError("Expect a DataFrame with at least 2 columns")

    label_cols = df.columns[:-1]
    coords = [ df[c].unique() for c in label_cols ]
    # 'str()' in case there are no column names.
    # should not be the case in normal use
    dims = [ str(c) for c in label_cols ]

    xt = xray.DataArray( np.full( [ len(x) for x in coords ] , np.nan ) ,
                         dims   = dims  ,
                         coords = coords ,
                         name   = name )

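    # iterate over rows directly; df.ix is deprecated (removed in pandas 1.0)
    for row in df.itertuples(index=False) :
        # all fields but the last are coord labels, the last one is the value
        xt.loc[tuple(row[:-1])] = row[-1]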

    return xt

How do I _actually_ do this?
Thanks!
Dmitri

Stephan Hoyer

Aug 19, 2015, 4:04:46 PM

to xzf...@gmail.com, xray

Hi Dmitri,

To convert this sort of DataFrame into a DataArray with the appropriate dimensions, first convert it into a pandas.Series with a MultiIndex and then use xray.DataArray.from_series:
http://xray.readthedocs.org/en/stable/pandas.html#dataarray-and-series

On your dataset, that would look something like:

series = df.set_index(['x', 'y'])['v']

array = xray.DataArray.from_series(series)
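
For anyone who wants to try this end to end, here is a minimal sketch using the example table from the question. It assumes the columns are named x, y, v as described there, and it imports the package under its later name, xarray (xray was renamed after this thread); with the old package, `import xray` and `xray.DataArray.from_series` would be the equivalent. Picking the label columns by position is my own generalization to an arbitrary number of dimensions, not part of the answer above.

```python
import numpy as np
import pandas as pd
import xarray as xr  # assumption: xray under its later name, xarray

# The ragged table from the question; the ('a', '3') combination is absent.
df = pd.DataFrame(
    [["a", "1", 1], ["a", "2", 2], ["b", "1", 3], ["b", "2", 4], ["b", "3", 5]],
    columns=["x", "y", "v"],
)

# Every column except the last becomes a MultiIndex level;
# the last column supplies the values.
label_cols = list(df.columns[:-1])
series = df.set_index(label_cols)[df.columns[-1]]

# Each index level becomes a dimension of the array.
# Label combinations that are missing from the series come out as NaN.
array = xr.DataArray.from_series(series)
print(array)
```

Because the label columns are taken by position, the same lines should work unchanged for a table with more than two label columns.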

Cheers,

Stephan

xzf...@gmail.com

Aug 19, 2015, 6:19:41 PM

to xray, xzf...@gmail.com

Stephan - This is perfect. Thank you!

(btw I'm not sure what happened with fontsize in my initial post; looks huge)
