custom data warehouse in python vs. out-of-the-box ETL tool

107 views
Skip to first unread message

snfctech

unread,
Sep 22, 2009, 4:00:38 PM9/22/09
to
Does anyone have experience building a data warehouse in python? Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go. e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.

Sean DiZazzo

unread,
Sep 22, 2009, 5:59:29 PM9/22/09
to

I have done some small/medium sized stuff using SQLAlchemy,
Turbogears, and Flex. I have never used a commercial product, but I
imagine getting it set up to work with your data is the hardest part
of the job anyway, and the solution you end up with will most likely
limit you to applying their api to your data. If you build it
yourself, you have complete control, and know exactly where to go when
you have a problem, or to add a feature.

I'm no expert, but I think I would try to find a way to consolidate
the data into one data source. We handle the giant amount of data we
are collecting by preprocessing it into another DB anyway, so I
imagine you could do both things at the same time.

This could very probably be handled in a different way if you are a
DBA. I'm just a MySQL hack. :)

~Sean

Martin P. Hellwig

unread,
Sep 22, 2009, 6:40:15 PM9/22/09
to

My experience is that if you enjoy hacking around databases and are
proficient in Python, than for small scale solutions it is preferable to
do it yourself. If you need a large scale solutions with advanced
requirements, building it yourself is mostly the only way.

I have build a rather complex datawarehouse system in the past (well
actually more like a centralised synchronisation hub, having input and
reporting databases as satellite clients), shoving data from around 500
databases (typically 5 Gb in size each) spread over the world.

The only commercial solutions I reviewed was Business Objects Data
Integrator and Oracle Warehouse Builder.

These tools where quite flexible and if you already have a license deal
which includes these tools I would definitely recommend to have more
than just a look at it.

If not and you are comfortably with using python to shovel data from A
to B and transform it at the same time (moving relational data
automatically into a EAV model and back again, for example) than
building your own solution will probably save you money and time (as
opposed to learn how to use that ETL tool).

This will require you to have at least interest in the following subjects:
- Authorization, may everybody use all data or should it be limited to a
subset on the data depending on the data?
(My solution was one centralised hub which contains all data but is only
accessible to special 'client' servers strictly maintained by me which
only sync the data relevant to them).

- Authenticity, if you have different values for the same thing, which
one should be considered authoritative and if yes may it be pushed back
to the un-authoritative?

-Synchronisation, you really don't want to push/pull all of the database
content over every x times, so how can you delta it and is there a way
to do this only when the data changes (push vs pull)?

-ATOMIC, how long may the data be out of date and is it allowed to
partially update

-Using and maintaining multiple databases, hopefully spread over
multiple systems. I had a server for each production DB, a server that
mirrored that production DB with some added columns per table for
external synchronization purposes and a master synchronisation server
(so in essence all data was copied three times, not very efficient but
good if you like to play it on the safe side).


--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'

snfctech

unread,
Sep 22, 2009, 8:23:24 PM9/22/09
to
Thanks for your replies, Sean and Martin.

I agree that the ETL tools are complex in themselves, and I may as
well spend that learning curve on a lower-level tool-set that has the
added value of greater flexibility.

Can you suggest a good book or tutorial to help me build a data
warehouse in python? Bill Inmon's "Building the Data Warehouse" is 17
years old, and I've been cautioned against Kimball.

Thanks.


On Sep 22, 3:40 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> MPHhttp://blog.dcuktec.com

M.-A. Lemburg

unread,
Sep 23, 2009, 5:02:21 AM9/23/09
to snfctech, pytho...@python.org

You might want to look at this solution for doing cross-database JOINs:

EasySoft ODBC Join-Engine:

http://www.easysoft.com/products/data_access/odbc_odbc_join_engine/index.html

and then use our mxODBC to access EasySoft's Engine:

http://www.egenix.com/products/python/mxODBC/

or mxODBC Connect, if you have a client-server setup:

http://www.egenix.com/products/python/mxODBCConnect/

Some database engines also allow integrating external ODBC
data sources - the external tables then look like normal
database tables and can be used in JOINs as well. I know that DB2
and Oracle support this kind of setup. You can access those using
mxODBC as well.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Sep 23 2009)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

Martin P. Hellwig

unread,
Sep 23, 2009, 6:15:44 AM9/23/09
to
snfctech wrote:
> Thanks for your replies, Sean and Martin.
>
> I agree that the ETL tools are complex in themselves, and I may as
> well spend that learning curve on a lower-level tool-set that has the
> added value of greater flexibility.
>
> Can you suggest a good book or tutorial to help me build a data
> warehouse in python? Bill Inmon's "Building the Data Warehouse" is 17
> years old, and I've been cautioned against Kimball.
>
> Thanks.
<cut>

Data warehouse isn't something magical, it is just another database,
albeit containing multiple datasets gathered from foreign resources in
possibly multiple formats.

Depending on your purpose of what you want, you design your tables the
way you usually do. For example if you only want reporting, you might
want to build your tables in such a way so it makes your life easier to
build the actual report.

Now you have an empty database containing the fields you wish for the
report and have filled database(s) containing data from the user
application. Now you use Python to fill the empty database, tada, you
have a Data warehouse and used Python for ETL processing.

So if you already have some insights in creating tables in a database,
you are all set. Most likely you will go through a number of iterations
before you are happy with the result though.

There is no book substitute for applying theory, experience and common
sense to a problem you want to solve, unless you write it yourself for
that specific situation.

--
MPH

nn

unread,
Sep 23, 2009, 11:45:45 AM9/23/09
to

I use both Python and a Data-warehouse tool (Datastage) from IBM that
is similar to Informatica. The main difference with Python is
throughput. The tool has good sort and join routines of multithreaded
C code that handles data bigger than what fits in RAM. It also has
good native drivers for the DB2 database. For data conversions and
other transformations every row gets processed on a different CPU. You
can really put a 16 core machine to good use with this thing.

In your case you probably won't have enough data to justify the cost
of buying a tool. They are quite expensive.

snfctech

unread,
Sep 23, 2009, 12:35:14 PM9/23/09
to
@Martin: I originally thought that there was nothing "magical" about
building a data warehouse, but then I did a little research and
received all sorts of feedback about how data warehouse projects have
notorious failure rates, that data warehouse design IS different than
normal RDBMS - and then there's the whole thing about data marts vs.
warehouses, Kimball vs. Inmon, star schemas, EAV tables, and so on.
So I started to think that maybe I needed to get a little better read
on the subject.

On Sep 23, 3:15 am, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> MPHhttp://blog.dcuktec.com

snfctech

unread,
Sep 23, 2009, 12:39:49 PM9/23/09
to
@Lemburg: Thanks for the suggestion. I'm sure you make a fine
product, but my development platform is Linux, and I don't want any
additional Windows servers to deal with (than the ones I'm already
stuck with.)

On Sep 23, 2:02 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> snfctech wrote:
> > Does anyone have experience building a data warehouse in python?  Any
> > thoughts on custom vs using an out-of-the-box product like Talend or
> > Informatica?
>
> > I have an integrated system Dashboard project that I was going to
> > build using cross-vendor joins on existing DBs, but I keep hearing
> > that a data warehouse is the way to go.  e.g. I want to create orders
> > and order_items with relations to members (MS Access DB), products
> > (flat file) and employees (MySQL).
>
> > Thanks in advance for any tips.
>
> You might want to look at this solution for doing cross-database JOINs:
>
> EasySoft ODBC Join-Engine:
>

>    http://www.easysoft.com/products/data_access/odbc_odbc_join_engine/in...

Martin P. Hellwig

unread,
Sep 23, 2009, 3:28:45 PM9/23/09
to
snfctech wrote:
> @Martin: I originally thought that there was nothing "magical" about
> building a data warehouse, but then I did a little research and
> received all sorts of feedback about how data warehouse projects have
> notorious failure rates, that data warehouse design IS different than
> normal RDBMS - and then there's the whole thing about data marts vs.
> warehouses, Kimball vs. Inmon, star schemas, EAV tables, and so on.
> So I started to think that maybe I needed to get a little better read
> on the subject.

Yes failure rate for data warehouse projects is quite high, so are other
IT projects without data warehouses.

Data warehouse design is not that much different than 'normal' RDBMS,
you are following the same decisions for example:
- Do I rather have multiple copies of data than slow or complicated
access? If yes how do I ensure integrity?
- How do I do access control on value level?
- Do I need fail over or load balancing, if yes how much of it can I do
on the application level?

The thing is if you never have designed a database from a database point
of view because you used abstractions that hide these ugly details then
yes, Data warehouses are different than normal RDBMS.

Yes you can make it all sound very complicated by throwing PHB words in
like meta schema's, dimensional approach, subject orientated, etc.
I like to see it more like I have a couple of data sources, I want to
combine them in a way that I can do neat stuff with it. The 'combine
them' part is the data warehouse design, the 'neat stuff' part are the
clients that use that data warehouse. If you wish you can start making
it more complicated by saying but my data sources are also my clients,
but that is another step.

I guess you can sum up all this by saying a data warehouse is a method
of gathering already existing data and present them in a different way,
the concept is simple, the details can be complicated if you want/need
it to be.

--
MPH

Tony Schmidt

unread,
Sep 23, 2009, 4:20:41 PM9/23/09
to
@Martin: Thanks for your great feedback.

So do you think it would be very beneficial for me to start with an
Inman or Kimball book? Or do you think it would be just leisure
reading and not very practical at best - fill my head with needless
jargon and inflexible dogmas, at worst?

I took a database class in college, understand the basic principals of
normalisation, and have built a few complicated RDBMS schemas from the
ground up.

On Sep 23, 12:28 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:


> snfctech wrote:
> > @Martin:  I originally thought that there was nothing "magical" about

> > building a datawarehouse, but then I did a little research and
> > received all sorts of feedback about how datawarehouseprojects have
> > notorious failure rates, that datawarehousedesign IS different than


> > normal RDBMS - and then there's the whole thing about data marts vs.
> > warehouses, Kimball vs. Inmon, star schemas, EAV tables, and so on.
> > So I started to think that maybe I needed to get a little better read
> > on the subject.
>

> Yes failure rate for datawarehouseprojects is quite high, so are other


> IT projects without data warehouses.
>

> Datawarehousedesign is not that much different than 'normal' RDBMS,


> you are following the same decisions for example:
> - Do I rather have multiple copies of data than slow or complicated
> access? If yes how do I ensure integrity?
> - How do I do access control on value level?
> - Do I need fail over or load balancing, if yes how much of it can I do
> on the application level?
>
> The thing is if you never have designed a database from a database point
> of view because you used abstractions that hide these ugly details then
> yes, Data warehouses are different than normal RDBMS.
>
> Yes you can make it all sound very complicated by throwing PHB words in
> like meta schema's, dimensional approach, subject orientated, etc.
> I like to see it more like I have a couple of data sources, I want to
> combine them  in a way that I can do neat stuff with it. The 'combine

> them' part is the datawarehousedesign, the 'neat stuff' part are the
> clients that use that datawarehouse. If you wish you can start making


> it more complicated by saying but my data sources are also my clients,
> but that is another step.
>

> I guess you can sum up all this by saying a datawarehouseis a method


> of gathering already existing data and present them in a different way,
> the concept is simple, the details can be complicated if you want/need
> it to be.
>
> --

> MPHhttp://blog.dcuktec.com

Martin P. Hellwig

unread,
Sep 23, 2009, 4:39:10 PM9/23/09
to
Tony Schmidt wrote:

> So do you think it would be very beneficial for me to start with an
> Inman or Kimball book? Or do you think it would be just leisure
> reading and not very practical at best - fill my head with needless
> jargon and inflexible dogmas, at worst?

You have an unique opportunity here ;-)
I would recommend that you start by writing down what you want and how
it should look like, don't be bothered yet with implementation details
like how tables should be organised, just how you want to access your
data for the end-result.

Than make a detailed list of caveats which you can already see;
security, distribution, centralisation, real-time sync, etc.

Use that list, go to a tech library and flick through all the books and
look for signs that they actually have a solution for your problems,
read those books and take the things that make sense.

After you done this a couple of times you should see some patterns
starting to form and see that some solutions which you used in the past,
although working are better handled differently. But the main thing is
that you should have something working, which is the proof of the pudding.

<cut rest>
--
MPH

M.-A. Lemburg

unread,
Sep 24, 2009, 7:10:54 AM9/24/09
to snfctech, pytho...@python.org
snfctech wrote:
> @Lemburg: Thanks for the suggestion. I'm sure you make a fine
> product, but my development platform is Linux, and I don't want any
> additional Windows servers to deal with (than the ones I'm already
> stuck with.)

Strange, EasySoft used to support their product on Linux as well...

http://download.freshmeat.net/projects/easysoftsqlengine

Looks like they stopped with supporting it on Linux and now only
have Windows downloads available.

mxODBC and mxODBC Connect work just fine on Linux, so you could use
them to build a custom joining solution in Python.

It really depends on whether you need to run those joins on
an on-demand basis or not:

AFAIK, Talend can only run jobs which
then apply the JOINs and put the resulting data somewhere, e.g.
into a file or another database table.

With Python, you could do the join straight away and use the
data directly without the Python script - without having
to go through an external temporary resource.

Anyway, just a suggestion.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Sep 24 2009)

Tony Schmidt

unread,
Sep 24, 2009, 12:15:42 PM9/24/09
to
Hi, Marc-Andre - well, so far you seem to be the only one suggesting
that cross-database joins is the way to go - everyone else has been
telling me to build a warehouse. I initially was trying to avoid the
warehouse idea to "avoid going through the external temporary
resource", as you say. But then the situation came up where the data
warehouse would give me another benefit by enabling power users with
tools like OOBase to do their own reporting. So I started to re-
consider it, and since then, no one has advised the direct approach
with connections/joins on multiple databases.

Have you had much luck circumventing the need for a data warehouse in
this way?

Thanks.

On Sep 24, 4:10 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> snfctech wrote:
> > @Lemburg: Thanks for the suggestion.  I'm sure you make a fine
> > product, but my development platform is Linux, and I don't want any
> > additional Windows servers to deal with (than the ones I'm already
> > stuck with.)
>
> Strange, EasySoft used to support their product on Linux as well...
>
> http://download.freshmeat.net/projects/easysoftsqlengine
>
> Looks like they stopped with supporting it on Linux and now only
> have Windows downloads available.
>
> mxODBC and mxODBC Connect work just fine on Linux, so you could use
> them to build a custom joining solution in Python.
>
> It really depends on whether you need to run those joins on
> an on-demand basis or not:
>
> AFAIK, Talend can only run jobs which
> then apply the JOINs and put the resulting data somewhere, e.g.
> into a file or another database table.
>
> With Python, you could do the join straight away and use the
> data directly without the Python script - without having
> to go through an external temporary resource.
>
> Anyway, just a suggestion.
>
>
>
> > On Sep 23, 2:02 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> >> snfctech wrote:

> >>> Does anyone have experience building a datawarehousein python?  Any


> >>> thoughts on custom vs using an out-of-the-box product like Talend or
> >>> Informatica?
>
> >>> I have an integrated system Dashboard project that I was going to
> >>> build using cross-vendor joins on existing DBs, but I keep hearing

> >>> that a datawarehouseis the way to go.  e.g. I want to create orders

Jim

unread,
Sep 28, 2009, 5:45:47 AM9/28/09
to
On this post, you are talking about Talend and Informatica like both
are
comparable... indeed both tools are comparable in power and
functionality.
But keep in mind Informatica is a proprietary solution, with no access
to the code. You have a license fee + other costs associated to the
use
of Infa.
Talend is an open source ETL able to perform data migration and
synchronization. It's code is open source. You can download the free
version of Talend, Talend Open Studio on the website.
Being an open source tool, you can adapt the software to your
organization and therefore spend less time and money on the
development
of your companies' own software.
Download it here: http://www.talend.com/download.php

Just my 2 cents about Talend and Informatica...

M.-A. Lemburg

unread,
Oct 2, 2009, 5:55:52 AM10/2/09
to Tony Schmidt, pytho...@python.org
Tony Schmidt wrote:
> Hi, Marc-Andre - well, so far you seem to be the only one suggesting
> that cross-database joins is the way to go - everyone else has been
> telling me to build a warehouse. I initially was trying to avoid the
> warehouse idea to "avoid going through the external temporary
> resource", as you say. But then the situation came up where the data
> warehouse would give me another benefit by enabling power users with
> tools like OOBase to do their own reporting. So I started to re-
> consider it, and since then, no one has advised the direct approach
> with connections/joins on multiple databases.
>
> Have you had much luck circumventing the need for a data warehouse in
> this way?

Depending on what your data sources are, you will end up
using a data warehouse sooner or later.

We've worked with clients from the middle office banking world
and there you typically have to aggregate data, polish it up,
transform it and then send it off for others to process.

Cross-database joins don't work in such a scenario because
the data sources are too diverse. Logging and auditing
requirements can also lead to having to use a warehouse
approach.

However, if you just to put together data from a few local
sources without too much data to work on, I'd suggest to
try the direct Python approach first. If you have enough
RAM to work with, Python can easily handle millions
of records and if you'd rather use a SQL based approach,
it's also possible to temporarily store the data in a SQLite
database as on-demand data warehouse.

Regards,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 02 2009)

Reply all
Reply to author
Forward
0 new messages