MySQL vs HDF5


TD

Nov 28, 2012, 9:06:14 AM11/28/12
to pyd...@googlegroups.com
I see that there is a bias towards using HDF5 in pandas. Can someone elaborate on the design decision to have better support for HDF5?
What are the pros/cons of using HDF5 vs. SQL? 

Andreas Hilboll

Nov 28, 2012, 9:17:35 AM11/28/12
to pyd...@googlegroups.com, TD
> I see that there is a bias towards using HDF5 in pandas. Can someone
> elaborate on the design decision to have better support for HDF5?
> What are the pros/cons of using HDF5 vs. SQL?

HDF5 is a common data format in science. For scientific applications
nobody uses MySQL. Of course, this will surely be different in other
fields, but one of the general advantages of HDF5 is that it's **files**
which you can exchange. It's not (easily) possible to exchange MySQL data.

With pytables, there exists a very powerful HDF interface for Python
which (as far as I understand) offers some SQL-like query functionality.
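
For example, an in-kernel query looks roughly like this (a minimal sketch using the current PyTables method names; the table and column names are made up):

import tables

class Price(tables.IsDescription):
    ticker = tables.StringCol(8)
    close = tables.Float64Col()

with tables.open_file("demo.h5", mode="w") as h5:
    tbl = h5.create_table("/", "prices", Price)
    tbl.append([(b"AAPL", 590.0), (b"MSFT", 27.0)])
    tbl.flush()
    # the condition is evaluated inside the HDF5 file, not in Python
    rows = tbl.read_where("close > 100")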

Just my 2cts.
Andreas.

Jeff Reback

Nov 28, 2012, 1:22:36 PM11/28/12
to pyd...@googlegroups.com, TD
new docs coming in pandas v0.10.0 will explain how to use the table format (it exists in
prior versions though)

see
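
The gist is that the table format makes the store query-able on disk. A minimal sketch (keyword spellings differ a bit between pandas versions):

import pandas as pd

df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "close": [590.0, 27.0]})

with pd.HDFStore("store.h5") as store:
    # 'table' format supports on-disk where-queries; the default fixed format does not
    store.put("prices", df, format="table", data_columns=True)
    hits = store.select("prices", where="close > 100")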

Goyo

Nov 28, 2012, 1:30:45 PM11/28/12
to pyd...@googlegroups.com, TD, li...@hilboll.de


On Wednesday, November 28, 2012 at 15:17:35 UTC+1, Andreas Hilboll wrote:
 
but one of the general advantages of HDF5 is that it's **files**
which you can exchange. It's not (easily) possible to exchange MySQL data.

There's SQLite for that.
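
The whole roundtrip is a single file you can pass around (a sketch using the current pandas to_sql/read_sql API; older versions spell these differently):

import sqlite3
import pandas as pd

df = pd.DataFrame({"symbol": ["AAPL", "MSFT"], "price": [590.0, 27.0]})

conn = sqlite3.connect("data.db")   # one ordinary file, easy to exchange
df.to_sql("prices", conn, if_exists="replace", index=False)
out = pd.read_sql("SELECT * FROM prices WHERE price > 100", conn)
conn.close()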

Goyo

Adam Hughes

Nov 28, 2012, 1:40:34 PM11/28/12
to pyd...@googlegroups.com, TD, li...@hilboll.de
I am very interested in building a database around results in my research.  At first I had decided to use PySQL and attempt to store data as pickled dataframes, and then would build a small interface to allow users to read the pickles into memory upon selection.  I wasn't so much aware of pandas' built-in support for PyTables and HDF5, so now I'm leaning in that direction.  Has anyone already built a database around pandas-based research?  If so, can you share your experience, especially any tips, pointers and gotchas that I should be aware of?


TD

Nov 28, 2012, 4:41:14 PM11/28/12
to pyd...@googlegroups.com, TD, li...@hilboll.de
I am sympathetic to using HDF5; I like the functionality and performance it provides, although the downside is that it's not relational. I am going to record some transaction histories in the database, so the additional overhead of SQL for simple read access may not be necessary.

That said, I would prefer a solution where I have a centralized database and multiple users fetch data from it. I think this is where HDF5 becomes a subpar solution. There is OpenDAP, which aims to provide access to remote HDF5 sources, but the Python library support seems lacking in this arena. pydap is the OpenDAP library, and it used to have a pytables plugin which no longer works. Therefore I believe it would not be possible to use pandas with an OpenDAP server at this point, since pandas uses pytables.

Of course I might be wrong, since I have not tried this myself -- this is just an observation from what I have read so far. Anyone with experience using pandas with an OpenDAP server, please pitch in.

Peter Lubell-Doughtie

Nov 28, 2012, 8:43:24 PM11/28/12
to pyd...@googlegroups.com
The Modi Research Group at Columbia University wrote a data analysis web service that uses pandas for computation and MongoDB for serialization. It is called bamboo and is open source; the code is at https://github.com/modilabs/bamboo and the documentation is at http://bamboo.io/.

We chose MongoDB to get the benefits of schema-less and query-able storage.  For the size of our data (small, normally < 1000 rows, nothing greater than 50MB total) the performance has been OK, but it will have problems with larger datasets.  Because of this we may switch to a better-performing key-value store in the future.

Performing the translation to/from MongoDB is costly, but it is needed to support querying and updating; we are inevitably going to have an IO hit somewhere.  We considered linking to HDF5 or pickle files on disk or in a blob.  If we can get a speedup by having a MongoDB store for queries and a parallel store for DataFrames, we would do it.  We need to do more research into what is fastest given our use cases.  Bamboo is under active development and we plan to post a formal announcement soon.
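
For the curious, the translation is essentially this kind of roundtrip (a simplified sketch with pymongo, not the actual bamboo code; database, collection and field names are made up):

import pandas as pd
from pymongo import MongoClient

coll = MongoClient().bamboo_demo.observations   # local mongod, placeholder names

df = pd.DataFrame({"site": ["A", "B"], "kwh": [12.5, 7.3]})

# one document per row keeps the data query-able and updatable in MongoDB
coll.insert_many(df.to_dict(orient="records"))

# query in MongoDB, then rebuild a DataFrame from the matching documents
out = pd.DataFrame(list(coll.find({"kwh": {"$gt": 10}}, {"_id": 0})))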


Peter



Senya Seltser

Nov 28, 2012, 10:25:45 PM11/28/12
to pyd...@googlegroups.com
Peter, just out of curiosity, why not use HBase and Hive instead of MongoDB?
Finally, SciDB is a pretty strong alternative specifically for huge scientific datasets.
HDF5 is very, very cool technology, but it seems to be limited to single-server storage and does not scale as a cloud store.

pandas will work with whatever Python supports, which is all of the above.

Stanislav


Peter Lubell-Doughtie

Nov 28, 2012, 10:48:02 PM11/28/12
to pyd...@googlegroups.com
Hi Stanislav,

We chose MongoDB over HBase because we are not dealing with massive datasets and because the team was more familiar with it.  We have written Bamboo to keep the data store separate from the models wherever possible so that we can transition as our needs change.

We will certainly consider HBase as we begin working with larger datasets.  We are also interested in using hstore within PostgreSQL; this would give us access to PostGIS, which may become a requirement.


Peter



Christian Prinoth

Nov 29, 2012, 9:56:53 AM11/29/12
to pyd...@googlegroups.com
I use MySQL as a general-purpose database for financial data. One of the things I do is backtest investment strategies over a large number of securities and time points, and for this I use a pandas Panel object. Filling this object takes about 7-8 minutes, so what I do is create it automatically via a cron job in the morning, then save an HDF5 file (about 2GB in size). This file takes about 5-10 seconds to load, so quite an improvement over repopulating the Panel each time directly from MySQL.
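
The pattern is roughly the following (a sketch only -- placeholder connection URL and table names, and a plain DataFrame instead of the Panel):

import pandas as pd
from sqlalchemy import create_engine

CACHE = "prices_cache.h5"   # regenerated by cron each morning

def rebuild_cache():
    # slow step (minutes): pull from MySQL and flatten into one HDF5 file
    engine = create_engine("mysql+pymysql://user:password@host/market")  # placeholder
    df = pd.read_sql("SELECT date, ticker, close FROM prices", engine)
    df.to_hdf(CACHE, key="prices", mode="w", format="table")

def load_cache():
    # fast step (seconds): what the backtest actually calls
    return pd.read_hdf(CACHE, key="prices")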

On Wed, Nov 28, 2012 at 3:06 PM, TD <tansu...@gmail.com> wrote:
I see that there is a bias towards using HDF5 in pandas. Can someone elaborate on the design decision to have better support for HDF5?
What are the pros/cons of using HDF5 vs. SQL? 

--

Christian Prinoth

TD

Nov 29, 2012, 10:09:18 AM11/29/12
to pyd...@googlegroups.com
Christian, 

This is exactly what I was looking for, thanks a lot. This is very similar to what I want to achieve, and the timing comparison is very helpful.
You have the best of both worlds: a relational database for interactively slicing and dicing data, and canned queries to flatten data into HDF5 for fast access. I suppose the HDF5 file is like a materialized view in the SQL sense.

Tansu

Roberto De Almeida

Nov 29, 2012, 2:20:25 AM11/29/12
to pyd...@googlegroups.com
On Wed, Nov 28, 2012 at 11:41 PM, TD <tansu...@gmail.com> wrote:
Having that said, I prefer to have a solution where I have a centralized database and multiple users fetch data from it. I think this is where HDF5 becomes a subpar solution. There is OpenDAP which aims to provide access to remote HDF5 sources, but the python library support seems lacking in this arena. pydap is the opendap library, and it used to have a pytables plugin which no longer works. Therefore I believe it would not be possible to use pandas with an opendap server at this point since pandas uses pytables. 

Hi, pydap author here. If there's interest I can port the old pytables handler to the new version of pydap. Meanwhile, pydap has a generic hdf5 handler that works with files generated by pytables.

Cheers,
--Rob

 


--
Roberto De Almeida, PhD

TD

Dec 7, 2012, 7:44:16 AM12/7/12
to pyd...@googlegroups.com, rob...@dealmeida.net
Rob, dunno how I missed that. I think there would be tremendous value if the pytables handler were ported to the new version of pydap. I can see many other users going the pandas - pytables - hdf5 route, and centralizing data access seems to me the missing link to make this an enterprise solution rather than an individual toy project.

Samantha Zeitlin

Jul 21, 2014, 8:46:48 PM7/21/14
to pyd...@googlegroups.com
I've just been doing this. I have a MacBook Air with only 4GB of RAM, so it was a bit of a pressure test on a 100+MB file.

Long story short: HDF5 is MUCH faster for loading. It's a bit more complicated to do the queries since the syntax is not the same as SQL. 

After reading up a lot about this and struggling with both options, I think that's the main tradeoff. 

Queries in SQL are still fairly slow, but to get similar results with something like PyTables (even using the support for that in pandas) is a bit more involved - it's more like R, so it requires a shift in your mindset about how you're going to get what you want out of the data set. Or you have to write a bunch more code to iterate through the table. 

There are also some drawbacks if you need strings, so that might matter, depending on what's in your data set. For example, I have a bunch of columns where there are 'codes' listed as numbers. I want to treat them as strings, but PyTables wants to treat them as ints because it's faster if it can coerce everything to one type.
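
One workaround (a sketch with made-up column names) is to cast the codes to strings before writing, so they are stored and queried as strings:

import pandas as pd

df = pd.DataFrame({"code": [101, 205, 307], "value": [1.2, 3.4, 5.6]})
df["code"] = df["code"].astype(str)   # force string storage instead of int

df.to_hdf("results.h5", key="results", format="table", data_columns=["code"])

# note the where-syntax instead of SQL
subset = pd.read_hdf("results.h5", key="results", where="code == '205'")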

Hope that helps. 

Andy Hayden

Jul 21, 2014, 9:36:13 PM7/21/14
to pyd...@googlegroups.com

Jeff uses HDF5 a lot and has invested a lot of time making it work well. He can better evangelise :) See also http://stackoverflow.com/q/14262433/1240268

The SQL IO has recently moved to SQLAlchemy for broader database support, but this doesn't lend itself to speed (previously it used pure Python IIRC, so it wasn't super fast either). IMO there are few devs using this (on large enough datasets)...?

Note that Continuum have a (non-free) solution: IOPro.

Using a faster/lower-level (C) API has been mooted before, but I don't think we found one, or someone to implement it...

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

balasubram...@gmail.com

Mar 23, 2019, 11:56:45 AM3/23/19
to PyData
Hi,

I know it's been a very long time since you asked this question, but I am also trying to implement much the same thing: a database built on HDF5 files using pytables. Any pointers you can give me regarding this would be appreciated.

Regards

Sal Abbasi

Mar 23, 2019, 12:16:29 PM3/23/19
to pyd...@googlegroups.com
Take a look at pyarrow instead.  It implements fast serialization/deserialization of Python objects and is much simpler than HDF5.
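
For example, a DataFrame roundtrip through a Parquet file is just (a minimal sketch):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"code": ["101", "205"], "value": [1.2, 3.4]})

pq.write_table(pa.Table.from_pandas(df), "results.parquet")            # serialize
df2 = pq.read_table("results.parquet", columns=["code"]).to_pandas()   # read back a subset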


Best,

Sal