larry vs pytables merge performance

Kyle

Jul 6, 2010, 4:16:45 PM
to labeled-array
Up until now I've been using structured arrays with a wrapper I made
for MySQLdb, but I'm starting to see major performance hits with all
the joins MySQL has to do. It sounds like a move to hdf5 may be in
order, so I'm trying to decide between larry and pytables.

pytables mentions the optimization it's done for query operations, and
larry has lots of merge operations built in. Are there any benchmarks
for which is faster, and are there any other contrasts I should be
taking into account?

My specific case is that I have lots of different datasets of
covariates, at several different levels of aggregation. For instance,
I have GDP by country-year whereas Education is by country-year-age-
sex, plus 20 other covariates. I need to be able to quickly combine
different sets of these covariates, as well as then quickly query the
resultant arrays.

Thanks for any insights
Kyle

Keith Goodman

Jul 6, 2010, 4:23:32 PM
to labele...@googlegroups.com

I see you're taking a break from kenken.

There are two great packages for saving Numpy arrays to HDF5: h5py and
pytables. I used h5py in the la package.

So if you have your own data object then you can use either h5py or
pytables to store it in HDF5. If you want to use larry as your data
object, it will be stored in HDF5 using h5py.
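
To make the "own data object" option concrete, here is a minimal sketch of storing an array together with its axis labels through h5py. The group name, dataset names, and sample values are made up for illustration; this is an assumed layout, not the layout the la package actually uses.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical GDP values indexed by (country, year); numbers are invented.
values = np.array([[10.0, 11.0], [5.0, 5.5]])
countries = ["USA", "CAN"]
years = [2000, 2001]

path = os.path.join(tempfile.mkdtemp(), "labeled.hdf5")

# Write the values and one label dataset per axis into a single group.
with h5py.File(path, "w") as f:
    grp = f.create_group("gdp")  # assumed layout, not la's on-disk format
    grp.create_dataset("values", data=values)
    # Fixed-width byte strings are the simplest string type to store in HDF5.
    grp.create_dataset("countries", data=np.array(countries, dtype="S"))
    grp.create_dataset("years", data=np.array(years))

# Read everything back and decode the byte-string labels.
with h5py.File(path, "r") as f:
    grp = f["gdp"]
    loaded_values = grp["values"][:]
    loaded_countries = [c.decode("ascii") for c in grp["countries"][:]]
    loaded_years = grp["years"][:].tolist()
```

Keeping the labels next to the values means the array can be realigned after loading without consulting any outside source.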

Kyle

Jul 6, 2010, 4:31:10 PM
to labeled-array
From what I understand ( http://code.google.com/p/h5py/wiki/FAQ#What's_the_difference_between_h5py_and_PyTables_?
and http://www.pytables.org/moin/FAQ#HowdoesPyTablescomparewiththeh5pyproject.3F
), h5py is closer to a pure implementation of HDF5 whereas PyTables
adds on a layer similar to larry that enables feats like easy merging.
So I guess my question is more about that additional layer which adds
to the usability of the underlying data.

And I have to limit myself to just the 2 KenKens above the daily
crossword in the NYT, otherwise I'd never get any work done :)

Keith Goodman

Jul 6, 2010, 4:37:00 PM
to labele...@googlegroups.com
On Tue, Jul 6, 2010 at 1:31 PM, Kyle <kylef...@gmail.com> wrote:
> From what I understand ( http://code.google.com/p/h5py/wiki/FAQ#What's_the_difference_between_h5py_and_PyTables_?
> and http://www.pytables.org/moin/FAQ#HowdoesPyTablescomparewiththeh5pyproject.3F
> ), h5py is closer to a pure implementation of HDF5 whereas PyTables
> adds on a layer similar to larry that enables feats like easy merging.
> So I guess my question is more about that additional layer which adds
> to the usability of the underlying data.

Yeah, I haven't used pytables so I'm not much help. Here's an example
of how to save a structured array using h5py:

Make a structured array:

>> import numpy as np
>> x = np.zeros((2,),dtype=('i4,f4,S10'))
>> x[:] = [(1,2.,'Hello'),(2,3.,"World")]
>> x
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])

Save it to HDF5:

>> import h5py
>> f = h5py.File('/tmp/example.hdf5', 'a')
>> f['x'] = x

Load the structured array later:

$ ipython
>> import h5py
>> f = h5py.File('/tmp/example.hdf5', 'r')
>> f['x'][:]
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '|S10')])

larry can merge quickly, but it does so outside of the archive. And
larry does not use structured arrays; it uses plain old numpy arrays.
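
As a rough illustration of what merging outside the archive with plain numpy arrays can look like, here is a sketch that lines up a country-year array with a country-year-age-sex array by label and broadcasts the coarser one. It uses only numpy, not larry's API; the helper `align_index` and all labels and values are hypothetical.

```python
import numpy as np

# Hypothetical labels and values, invented for illustration.
gdp_countries = ["USA", "CAN", "MEX"]
gdp_years = [2000, 2001]
gdp = np.array([[10.0, 11.0],
                [5.0, 5.5],
                [3.0, 3.2]])  # axes: (country, year)

edu_countries = ["CAN", "USA"]
edu_years = [2000, 2001]
ages = [20, 30]
sexes = ["F", "M"]
# axes: (country, year, age, sex)
edu = np.arange(16, dtype=float).reshape(2, 2, 2, 2)


def align_index(source_labels, target_labels):
    """Return the position in source_labels of each label in target_labels."""
    lookup = {label: i for i, label in enumerate(source_labels)}
    return np.array([lookup[label] for label in target_labels])


# Reorder and subset GDP so its rows and columns follow edu's labels.
rows = align_index(gdp_countries, edu_countries)
cols = align_index(gdp_years, edu_years)
gdp_aligned = gdp[np.ix_(rows, cols)]

# Repeat GDP across the age and sex axes, then stack the two covariates.
gdp_4d = np.broadcast_to(gdp_aligned[:, :, None, None], edu.shape)
combined = np.stack([gdp_4d, edu], axis=-1)  # last axis: [gdp, edu]
```

Once the covariates share one array, a query is an ordinary slice or boolean mask, for example `combined[edu_countries.index("USA")]` for all USA rows.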

> And I have to limit myself to just the 2 KenKens above the daily
> crossword in the NYT, otherwise I'd never get any work done :)

I'll have to start playing again, as soon as I can find a 2x2 kenken.
