datreant.data cannot store empty dataframe?


Oliver Beckstein

Jul 30, 2016, 5:57:02 PM
to datreant
I tried to store an empty dataframe so that I could later append to it. However, datreant.data cannot deal with empty dfs:


import datreant.core as dtr
import datreant.data

import numpy as np
import pandas as pd

s = dtr.Treant('sequoia')
s.attach('data')

# create empty df
df = pd.DataFrame(data=None, columns=["A", "B"], dtype=np.float32)

# store
s.data["stuff"] = df
s.draw()

sequoia/
 +-- Treant.93cb5ad0-a548-4bc9-a4df-0bf8c188ae30.json
 +-- stuff/
     +-- pdData.h5

# accessing fails
s.data["stuff"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-162-5e7ac2173bad> in <module>()
----> 1 df = domain_positions(sim, chunksize=25, start=100, stop=200, step=3)

<ipython-input-161-c30406f72f4c> in domain_positions(sim, start, stop, step, chunksize, domains, label)
     32         df = pd.DataFrame(data, columns=columns)
     33         # append to storage
---> 34         sim.data[label].append(df)
     35 
     36     return sim.data[label]

/usr/local/lib/python2.7/dist-packages/datreant/data/limbs.pyc in __getitem__(self, handle)
    166                 out.append(self.retrieve(item))
    167         elif isinstance(handle, six.string_types):
--> 168             out = self.retrieve(handle)
    169 
    170         return out

/usr/local/lib/python2.7/dist-packages/datreant/data/limbs.pyc in inner(self, handle, *args, **kwargs)
    107                         datafiletype=filetype)
    108                 try:
--> 109                     out = func(self, handle, *args, **kwargs)
    110                 finally:
    111                     del self._datafile

/usr/local/lib/python2.7/dist-packages/datreant/data/limbs.pyc in retrieve(self, handle, **kwargs)
    358 
    359         """
--> 360         return self._datafile.get_data('main', **kwargs)
    361 
    362     @_write_datafile

/usr/local/lib/python2.7/dist-packages/datreant/data/core.pyc in get_data(self, key, **kwargs)
    133             self.datafile = pddata.pdDataFile(
    134                 os.path.join(self.datadir, pddata.pddatafile))
--> 135             out = self.datafile.get_data(key, **kwargs)
    136             self.datafile = None
    137         elif self.datafiletype == pydata.pydatafile:

/usr/local/lib/python2.7/dist-packages/datreant/data/pddata.pyc in get_data(self, key, **kwargs)
    108         """
    109         with self.read():
--> 110             return self.handle.select(key, **kwargs)
    111 
    112     def del_data(self, key, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    660         group = self.get_node(key)
    661         if group is None:
--> 662             raise KeyError('No object named %s in the file' % key)
    663 
    664         # create the storer and axes

KeyError: 'No object named main in the file'


Is this a bug or expected?

What is a good way to set up something like the following:

    frames = np.arange(u.trajectory.n_frames, dtype=np.int)[start:stop:step]
    n_frames = len(frames)
    n_chunks = int(np.ceil(n_frames/chunksize))
    
    #print(frames)
    #print(n_frames, chunksize, n_chunks)
    
    sim.data[label] = pd.DataFrame(data=None, columns=columns, dtype=np.float32)
    
    for i in xrange(n_chunks):
        frames_chunk = frames[i*chunksize:(i+1)*chunksize]
        data = np.zeros((len(frames_chunk), len(columns)), dtype=np.float32)
        #print(i, frames_chunk)
        # stop for the traj iterator must be increased by step or we lose a frame
        # (frames_chunk[-1] should be *included*)
        for j, ts in enumerate(u.trajectory[frames_chunk[0]:frames_chunk[-1]+step:step]):
            # Using the iterator allows the reader to implement optimization (eg when reading
            # sequentially); but should benchmark just doing u.trajectory[frame] 
            data[j, :] = compute_domain_positions_groups(ts, *groups.values())
        df = pd.DataFrame(data, columns=columns)
        # append to storage
        sim.data[label].append(df)


Thanks!
Oliver


--
Oliver Beckstein * orbe...@gmx.net
skype: orbeckst  * orbe...@gmail.com

David Dotson

Aug 2, 2016, 6:26:43 PM
to datreant
I forgot to reply to the list, so for posterity here is my response:

Hi Oliver,


Use


    sim.data.append(label, df)


in your loop. This appends `df` to the dataframe stored under the key `label`, whether or not a dataframe is already stored there, so there is no need to create an empty one first.
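
To make the pattern concrete, here is a minimal sketch of a chunked loop that only ever calls an append-by-key method. `ToyStore`, `store_in_chunks`, and the placeholder per-frame computation are hypothetical stand-ins made up for illustration; they are not datreant's API and only mimic the "create on first append" behaviour described above.

```python
import math


class ToyStore:
    """Hypothetical stand-in for a keyed data store; not datreant's API."""

    def __init__(self):
        self._tables = {}

    def append(self, key, rows):
        # Create the entry on first use, then extend it on later calls.
        self._tables.setdefault(key, []).extend(rows)

    def __getitem__(self, key):
        # Return a copy of the stored rows, not a live view.
        return list(self._tables[key])


def store_in_chunks(store, key, frames, chunksize):
    """Append one batch of rows per chunk of frames; return the chunk count."""
    # True division followed by ceil, so a partial last chunk is not dropped.
    n_chunks = math.ceil(len(frames) / chunksize)
    for i in range(n_chunks):
        chunk = frames[i * chunksize:(i + 1) * chunksize]
        # Placeholder for the real per-frame analysis.
        rows = [(frame, float(frame) ** 2) for frame in chunk]
        store.append(key, rows)
    return n_chunks


store = ToyStore()
frames = list(range(100, 200, 3))  # same start/stop/step as in the question
n_chunks = store_in_chunks(store, "stuff", frames, chunksize=25)
```

Nothing is stored under `"stuff"` before the loop starts; the first `append` call creates the entry, which is the point of the suggestion.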

As a sidenote, doing:


    sim.data[label].append(df)


does not write anything out: `sim.data[label]` hands you back a dataframe, and calling `.append(df)` on it only affects that in-memory object, never the stored data. So it does not have the effect you want here.
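
The difference can be shown with a small file-backed toy. This is only a sketch under stated assumptions: `JsonStore` is a hypothetical class that keeps each key in a JSON file, and it says nothing about how datreant.data stores things internally. It just reproduces the "lookup returns a fresh in-memory object" behaviour.

```python
import json
import tempfile
from pathlib import Path


class JsonStore:
    """Hypothetical file-backed store used only to illustrate the point."""

    def __init__(self, directory):
        self._dir = Path(directory)

    def _path(self, key):
        return self._dir / f"{key}.json"

    def __getitem__(self, key):
        # Every lookup reads the file and returns a new in-memory list.
        return json.loads(self._path(key).read_text())

    def append(self, key, rows):
        # Read what is on disk (if anything), extend it, write it back.
        path = self._path(key)
        existing = json.loads(path.read_text()) if path.exists() else []
        path.write_text(json.dumps(existing + rows))


with tempfile.TemporaryDirectory() as tmp:
    store = JsonStore(tmp)
    store.append("stuff", [[0.0, 1.0]])

    # Mutates only the list returned by this lookup; the file is untouched.
    store["stuff"].append([2.0, 3.0])
    after_in_memory_append = store["stuff"]

    # Goes through the store, so the new row reaches the file.
    store.append("stuff", [[2.0, 3.0]])
    after_store_append = store["stuff"]
```

Here `after_in_memory_append` still holds one row, while `after_store_append` holds two. As a pandas-specific aside, `DataFrame.append` returned a new dataframe rather than modifying the one it was called on, and it was removed in pandas 2.0 in favour of `pd.concat`.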


Does that help?


David
