pandas-friendly data modeling?


Marko Loparic

Dec 9, 2015, 11:35:20 PM
to PyData
Hello,

Do you know of a standard way of expressing a traditional entity-relationship data model that allows me to get pandas objects as a result?

I would like to convince colleagues to switch from sql queries to pandas manipulations. This page

http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

is very helpful. A tool allowing the definition of entities, relationships, and cardinalities, and the generation of pandas objects, is what I still need to complete my argument.

Thanks a lot!
Marko

Andy Ray Terrel

Dec 10, 2015, 9:50:55 AM
to pyd...@googlegroups.com
Such a tool would be handy!

Perhaps a simpler approach would be to have a tool take a SQL schema and produce an appropriate object. Then one can continue to use all the wonderful ERD tools out there as well.

Unfortunately it would be pretty messy for complex entity relationships. Is there a reason to_sql and read_sql don't work well?
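The round trip Andy mentions is short in practice; here is a minimal sketch using an in-memory SQLite database as a stand-in for Postgres (the table and column names are invented for illustration):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres store.
conn = sqlite3.connect(":memory:")

# to_sql pushes a frame into the database...
people = pd.DataFrame({"name": ["Ana", "Bo"], "city": ["Oslo", "Lima"]})
people.to_sql("person", conn, index=False)

# ...and read_sql pulls a query result back out as a frame.
df = pd.read_sql("SELECT name, city FROM person", conn)
print(df.shape)  # (2, 2)
```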

-- Andy


Marko Loparic

Dec 10, 2015, 10:19:13 AM
to pyd...@googlegroups.com
Hi Andy,

Using read_sql would solve one of my problems, which is to start with
what I have now (data in postgres) and get everything into pandas. So I
could do all the work in pandas instead of using SQL for each
manipulation.

My other problem is that in our applications the data model is fixed.
I would like it to be an input. If the data model is an input, I don't
see the reason to store data in a database and use SQL at all. It
is a useless and time-expensive detour. But we still need some way to
say things like "each person is associated with one and only one
address and with any number of phone numbers"...

One of the reasons I would like to have the data model as an
input is that I want to build a more generic application, one that is
not tied to the business needs, which change all the time. Also it
would be very nice to give the user the possibility to extend an
existing data model, i.e. to add new entities in the same way that
they change the values.

I believe that the data models we have are not very complex, and
neither are the queries. So I am not afraid of being unable to do the
operations with pandas. But I am looking for a way to express the
relationships and the cardinalities.
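One lightweight way to make the model an input, sketched here with an invented (non-standard) structure, is to describe the entities and cardinalities as plain Python data that validation code can interpret later:

```python
# Hypothetical schema description -- not a standard format, just plain
# data that downstream code can walk to build and validate data frames.
MODEL = {
    "entities": {
        "person": ["person_id", "name"],
        "address": ["address_id", "person_id", "street"],
        "phone": ["phone_id", "person_id", "number"],
    },
    "relationships": [
        # each person is associated with exactly one address...
        {"parent": "person", "child": "address", "cardinality": "1:1"},
        # ...and with any number of phone numbers
        {"parent": "person", "child": "phone", "cardinality": "1:N"},
    ],
}
```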

Thanks!
Marko

Andy Ray Terrel

Dec 10, 2015, 10:30:01 AM
to pyd...@googlegroups.com
I may be wrong, but I don't think there is a way to "somehow say things like each person is associated with one and only one address and with any number of phone numbers", i.e. to express cardinalities in a data frame object.

I usually enforce such things with code around the data frame. 
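As a sketch of that "code around the data frame" idea (the column and function names are invented), the one-and-only-one-address rule could be checked like this:

```python
import pandas as pd

def check_one_to_one(people: pd.DataFrame, addresses: pd.DataFrame) -> None:
    """Raise if any person has zero or more than one address."""
    missing = set(people["person_id"]) - set(addresses["person_id"])
    if missing:
        raise ValueError(f"persons without an address: {missing}")
    counts = addresses["person_id"].value_counts()
    extra = counts[counts > 1]
    if not extra.empty:
        raise ValueError(f"persons with multiple addresses: {list(extra.index)}")

people = pd.DataFrame({"person_id": [1, 2]})
addresses = pd.DataFrame({"person_id": [1, 2], "street": ["A", "B"]})
check_one_to_one(people, addresses)  # passes silently
```

The 1:N side (phone numbers) needs no such check; only the 1:1 constraint has to be enforced in code.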

-- Andy

Marko Loparic

Dec 10, 2015, 10:36:11 AM
to pyd...@googlegroups.com
Hi,

I see. So my question is: how can I express these things in the
simplest, most convenient, pythonic way, so that I can build the code
around the data frames myself, as you say?

The code would then consist of things like data frame format
definition, data entry, and validation.

(Well, it is more a question about data modeling in python than about pandas...)

Thanks again,
Marko

lamp

Dec 10, 2015, 11:31:28 AM
to PyData
I wonder if the Blaze ecosystem can help: http://blaze.pydata.org/

Andy Ray Terrel

Dec 10, 2015, 11:40:00 AM
to pyd...@googlegroups.com
The Blaze ecosystem was not built with entity relations in mind. Datashape (the Blaze data model language) does not include entity relations, just descriptions of tabular data. We have records but not the constraints required for cardinality.

I don't know that it can currently do much better here at completely removing your SQL store, but it does hide it well. Essentially, I think that if you want to make SQL fans happy with a data frame object, you will end up building most of the database layer in your Python code. At the end of the day I push things that need entity relations into SQL, which is what Blaze would do behind the scenes, propagating a fun SQLAlchemy error.

dartdog

Dec 12, 2015, 9:41:35 PM
to PyData
Check out SQLAlchemy. It's not exactly what you are looking for, but it provides a cleaner interface between SQL and Python, and hence pandas....

Marko Loparic

Dec 14, 2015, 10:47:52 AM
to pyd...@googlegroups.com
Hello,

Thanks a lot Andy, lamp and dartdog for the interesting discussion. Here are the possible solutions I see for my need, from the most conformist to the most rebellious :-)
- Keep things as they are, in postgres, and try to use SQL: perhaps my resistance to it comes in part from the fact that I have never used it :-); so far I have always done all my computing without relying on database packages.
- Keep things in postgres but use something like pandas read_sql or SQLAlchemy to translate them as early as possible into pandas or another friendly environment.
- Use some kind of ERD tool to generate schemas, and implement my own routines for data validation and pandas data frame format definition based on its output.
- Avoid entity-relationship modeling and define a tabular data model, which can be mapped directly into data frames.
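The last option can be sketched quickly (the schema contents are invented): a plain tabular schema from which matching empty, typed frames are built.

```python
import pandas as pd

# Hypothetical tabular schema: table name -> {column: dtype}.
SCHEMA = {
    "person": {"person_id": "int64", "name": "object"},
    "phone": {"phone_id": "int64", "person_id": "int64", "number": "object"},
}

def empty_frame(table: str) -> pd.DataFrame:
    """Build an empty data frame with the columns and dtypes of the schema."""
    cols = SCHEMA[table]
    return pd.DataFrame({c: pd.Series(dtype=t) for c, t in cols.items()})

person = empty_frame("person")
print(list(person.columns))  # ['person_id', 'name']
```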

Thanks,
Marko

Rob Schneider

Dec 16, 2015, 10:24:15 PM
to PyData
So far I've been pleased with my use of Python calls to Django for data, with pandas doing its thing. The Django model connected to the database (MySQL) provides a "python-centric" way to get data, which I find much easier than continuing to rely on (and remember) SQL to do it all.
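For reference, the Django-to-pandas handoff is also short; since Model.objects.values() yields dicts, a plain list of dicts stands in here for a real queryset (no Django setup assumed, names invented):

```python
import pandas as pd

# Stand-in for something like Person.objects.values("id", "name").
rows = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Bo"},
]

# from_records turns the dicts straight into a data frame.
df = pd.DataFrame.from_records(rows)
print(df.shape)  # (2, 2)
```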