Meta data (e.g. units) with series and/or dataframe columns


Paul Hobson

Nov 5, 2012, 12:35:30 PM
to pyd...@googlegroups.com
I'm curious whether there are any plans, or interest in general, in adding a metadata attribute to the Series object?

Basically what I'm getting at is that I'd like to be able to attach units to different columns of a DataFrame. In my particular use case, I don't need anything fancy, just a string that says 'meters' or 'mg/kg', etc.

Personally, I think the simplest and most general way to implement such a feature would be to allow the user to pass a dictionary of metadata when creating a Series or DataFrame (e.g., mySeries.meta['units'], mySeries.meta['source']).
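A minimal sketch of what such a metadata dictionary could look like, assuming a plain wrapper class rather than any change to pandas itself. The names AnnotatedFrame, meta, and units are hypothetical, and the column values and 'lab A' are placeholder data:

```python
import pandas as pd


class AnnotatedFrame(object):
    """Hypothetical container: a DataFrame plus per-column metadata."""

    def __init__(self, data, meta=None):
        self.data = pd.DataFrame(data)
        # Maps a column name to a dict such as {'units': ..., 'source': ...}.
        self.meta = dict(meta or {})

    def units(self, column):
        """Return the units string for a column, or None if none was set."""
        return self.meta.get(column, {}).get('units')


af = AnnotatedFrame(
    {'depth': [1.0, 2.5], 'lead': [12.0, 8.4]},
    meta={
        'depth': {'units': 'meters'},
        'lead': {'units': 'mg/kg', 'source': 'lab A'},
    },
)

print(af.units('depth'))
print(af.meta['lead']['source'])
```

Because the metadata lives on the wrapper and not on the DataFrame, it is not affected by pandas operations that return new objects, but it also has to be updated by hand when columns are renamed or dropped.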

Just a thought. If there's interest, I'd like to take a stab at implementing this.

Thanks,
-paul

Adam Hughes

Nov 5, 2012, 1:08:45 PM
to pyd...@googlegroups.com
I actually posted the same discussion a while ago.

I've written a few modules to let you transfer attributes (not instance methods yet) between dataframes.  Something like:

df1=DataFrame()
df1.metadata="kilometers"

df2=df1.ix[0:40]

transfer_attributes(df1, df2)

print df2.metadata

If you follow the discussion in this recent thread to the end, you can follow my link on GitHub and download the two modules. One is mainly focused on serializing/deserializing dataframes with custom attributes. The other has the transfer_attribute() method.

Let me know if these programs work for you.  They actually do their magic in a crude way, so I'm curious to see if it works for various distributions of pandas and python.

If it does/doesn't work, can you let me know either way?



Adam Hughes

Nov 5, 2012, 1:10:56 PM
to pyd...@googlegroups.com
Sorry, just to clarify, the function you want is called transfer_attr() from the module df_attrhandler.py


Tim Michelsen

Nov 5, 2012, 4:28:13 PM
to pyd...@googlegroups.com
> I'm curious if there are any plans or interest in general to adding a
> metadata attribute to the Series object?
There are a number of GitHub issues on this topic:
https://github.com/pydata/pandas/issues/search?q=metadata

I would also like to see such annotation columns, especially for units.

In the ideal case, a future checker would update the units
after operations, for example:

m * m would yield m^2

We could then even include unit logic in tests.
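A rough sketch of that kind of unit logic, in plain Python and independent of pandas. The Quantity class and its methods are hypothetical names made up for this illustration. Units are stored as a dict of exponents, so multiplying adds the exponents and adding requires matching units:

```python
class Quantity(object):
    """Hypothetical value-with-units pair; units map a name to its power."""

    def __init__(self, value, units):
        self.value = value
        # Drop units whose power is zero so that comparisons stay clean.
        self.units = dict((u, p) for u, p in units.items() if p != 0)

    def __mul__(self, other):
        # Multiplication adds the exponents of each unit.
        powers = dict(self.units)
        for unit, power in other.units.items():
            powers[unit] = powers.get(unit, 0) + power
        return Quantity(self.value * other.value, powers)

    def __add__(self, other):
        # Addition only makes sense when the units agree.
        if self.units != other.units:
            raise ValueError('cannot add quantities with different units')
        return Quantity(self.value + other.value, self.units)

    def unit_string(self):
        parts = []
        for unit, power in sorted(self.units.items()):
            parts.append(unit if power == 1 else '%s^%d' % (unit, power))
        return ' '.join(parts)


length = Quantity(3.0, {'m': 1})
width = Quantity(2.0, {'m': 1})
area = length * width

print(area.value)          # 6.0
print(area.unit_string())  # m^2
```

A test could then assert on area.unit_string() as well as on area.value, and adding area to length raises a ValueError because the units differ.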

Paul Hobson

Nov 5, 2012, 5:09:00 PM
to pyd...@googlegroups.com
On Mon, Nov 5, 2012 at 10:08 AM, Adam Hughes <hughes...@gmail.com> wrote:
>
> I actually posted the same discussion a while ago.
>
> I've written a few modules to let you transfer attributes (not instance methods yet) between dataframes. Something like:
>
> df1=DataFrame()
> df1.metadata="kilometers"
>
> df2=df1.ix[0:40]
>
> transfer_attributes(df1, df2)
>
> print df2.metadata

Adam, thanks for the heads-up. Your implementation is a bit too
complicated for my use case, I think. However, those links led me to
realize that, unlike numpy arrays, you can add attributes to a
DataFrame on the fly, e.g.,

In [36]: x_arr = np.arange(6)
In [37]: x_arr.units = 'meters'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-37-e2b027ca98ac> in <module>()
----> 1 x_arr.units = 'meters'
AttributeError: 'numpy.ndarray' object has no attribute 'units'

In [38]: x_df = pandas.DataFrame(x_arr)
In [39]: x_df.units = 'meters'
In [40]: print(x_df.units)
meters

Knowing this, my needs are currently met. Thanks again,
-paul

Adam Hughes

Nov 5, 2012, 5:23:32 PM
to pyd...@googlegroups.com

The one issue you need to be mindful of is that the attributes will disappear any time you recreate the dataframe. Almost every operation in pandas returns a new object (i.e., the operations aren't in place). Therefore, if you have some attributes on a dataframe and then, say, take the transpose, these attributes will be lost. For example:

df=DataFrame()
df.x=50

df2=df.transpose()
print df2.x
AttributeError

Also, if you save/load the dataframes, again, custom attributes are lost.
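The loss described above can be reproduced in a few lines. The sketch below also shows one simple way to carry named attributes over by hand. Note that copy_custom_attrs is a hypothetical helper written only for this illustration; it is not the transfer_attr function from df_attrhandler, and it assumes you know the attribute names in advance:

```python
import pandas as pd


def copy_custom_attrs(source, target, names):
    """Copy the named ad hoc attributes from source to target, if present."""
    for name in names:
        if hasattr(source, name):
            setattr(target, name, getattr(source, name))
    return target


df = pd.DataFrame({'a': [1, 2, 3]})
df.units = 'meters'          # ad hoc attribute on this one object

df2 = df.transpose()         # returns a new DataFrame
lost = not hasattr(df2, 'units')
print(lost)

copy_custom_attrs(df, df2, ['units'])
print(df2.units)
```

This only works for attribute names that do not collide with column names, since pandas treats assignment to an existing column name as setting that column.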

You can overcome this either by adding attributes after you've done all of your manipulations, or by using the transfer attributes function that I wrote. Although the source code may be a bit involved, using it is actually really simple. If it turns out that you do in fact need persistent attributes (i.e., need to transfer them from one dataframe to another), or you need to save your attributes when serializing the dataframe, then you should try the functions I wrote. I know the source code is a bit messy, but it should literally amount to you doing this:

from df_attrhandler import transfer_attr

df2=df.transpose()
transfer_attr(df, df2)

print df2.x
"test" 

Good luck.

 



Adam Hughes

Nov 5, 2012, 5:24:05 PM
to pyd...@googlegroups.com
Last part should read:

print df2.x
50

Not:

print df2.x
"test" 

Paul Hobson

Nov 5, 2012, 5:26:06 PM
to pyd...@googlegroups.com
Thanks, Adam. I'll check it out when deadlines aren't so pressing. It
does seem quite useful, but I've to put out this fire first :)