I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go. e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).
Thanks in advance for any tips.
I have done some small/medium sized stuff using SQLAlchemy,
Turbogears, and Flex. I have never used a commercial product, but I
imagine getting it set up to work with your data is the hardest part
of the job anyway, and the solution you end up with will most likely
limit you to applying their api to your data. If you build it
yourself, you have complete control, and know exactly where to go when
you have a problem, or to add a feature.
I'm no expert, but I think I would try to find a way to consolidate
the data into one data source. We handle the giant amount of data we
are collecting by preprocessing it into another DB anyway, so I
imagine you could do both things at the same time.
This could very probably be handled in a different way if you are a
DBA. I'm just a MySQL hack. :)
~Sean
My experience is that if you enjoy hacking around databases and are
proficient in Python, than for small scale solutions it is preferable to
do it yourself. If you need a large scale solutions with advanced
requirements, building it yourself is mostly the only way.
I have build a rather complex datawarehouse system in the past (well
actually more like a centralised synchronisation hub, having input and
reporting databases as satellite clients), shoving data from around 500
databases (typically 5 Gb in size each) spread over the world.
The only commercial solutions I reviewed was Business Objects Data
Integrator and Oracle Warehouse Builder.
These tools where quite flexible and if you already have a license deal
which includes these tools I would definitely recommend to have more
than just a look at it.
If not and you are comfortably with using python to shovel data from A
to B and transform it at the same time (moving relational data
automatically into a EAV model and back again, for example) than
building your own solution will probably save you money and time (as
opposed to learn how to use that ETL tool).
This will require you to have at least interest in the following subjects:
- Authorization, may everybody use all data or should it be limited to a
subset on the data depending on the data?
(My solution was one centralised hub which contains all data but is only
accessible to special 'client' servers strictly maintained by me which
only sync the data relevant to them).
- Authenticity, if you have different values for the same thing, which
one should be considered authoritative and if yes may it be pushed back
to the un-authoritative?
-Synchronisation, you really don't want to push/pull all of the database
content over every x times, so how can you delta it and is there a way
to do this only when the data changes (push vs pull)?
-ATOMIC, how long may the data be out of date and is it allowed to
partially update
-Using and maintaining multiple databases, hopefully spread over
multiple systems. I had a server for each production DB, a server that
mirrored that production DB with some added columns per table for
external synchronization purposes and a master synchronisation server
(so in essence all data was copied three times, not very efficient but
good if you like to play it on the safe side).
--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'
I agree that the ETL tools are complex in themselves, and I may as
well spend that learning curve on a lower-level tool-set that has the
added value of greater flexibility.
Can you suggest a good book or tutorial to help me build a data
warehouse in python? Bill Inmon's "Building the Data Warehouse" is 17
years old, and I've been cautioned against Kimball.
Thanks.
On Sep 22, 3:40 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:
> MPHhttp://blog.dcuktec.com
You might want to look at this solution for doing cross-database JOINs:
EasySoft ODBC Join-Engine:
http://www.easysoft.com/products/data_access/odbc_odbc_join_engine/index.html
and then use our mxODBC to access EasySoft's Engine:
http://www.egenix.com/products/python/mxODBC/
or mxODBC Connect, if you have a client-server setup:
http://www.egenix.com/products/python/mxODBCConnect/
Some database engines also allow integrating external ODBC
data sources - the external tables then look like normal
database tables and can be used in JOINs as well. I know that DB2
and Oracle support this kind of setup. You can access those using
mxODBC as well.
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Sep 23 2009)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
::: Try our new mxODBC.Connect Python Database Interface for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
Data warehouse isn't something magical, it is just another database,
albeit containing multiple datasets gathered from foreign resources in
possibly multiple formats.
Depending on your purpose of what you want, you design your tables the
way you usually do. For example if you only want reporting, you might
want to build your tables in such a way so it makes your life easier to
build the actual report.
Now you have an empty database containing the fields you wish for the
report and have filled database(s) containing data from the user
application. Now you use Python to fill the empty database, tada, you
have a Data warehouse and used Python for ETL processing.
So if you already have some insights in creating tables in a database,
you are all set. Most likely you will go through a number of iterations
before you are happy with the result though.
There is no book substitute for applying theory, experience and common
sense to a problem you want to solve, unless you write it yourself for
that specific situation.
--
MPH
I use both Python and a Data-warehouse tool (Datastage) from IBM that
is similar to Informatica. The main difference with Python is
throughput. The tool has good sort and join routines of multithreaded
C code that handles data bigger than what fits in RAM. It also has
good native drivers for the DB2 database. For data conversions and
other transformations every row gets processed on a different CPU. You
can really put a 16 core machine to good use with this thing.
In your case you probably won't have enough data to justify the cost
of buying a tool. They are quite expensive.
On Sep 23, 3:15 am, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:
> MPHhttp://blog.dcuktec.com
On Sep 23, 2:02 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> snfctech wrote:
> > Does anyone have experience building a data warehouse in python? Any
> > thoughts on custom vs using an out-of-the-box product like Talend or
> > Informatica?
>
> > I have an integrated system Dashboard project that I was going to
> > build using cross-vendor joins on existing DBs, but I keep hearing
> > that a data warehouse is the way to go. e.g. I want to create orders
> > and order_items with relations to members (MS Access DB), products
> > (flat file) and employees (MySQL).
>
> > Thanks in advance for any tips.
>
> You might want to look at this solution for doing cross-database JOINs:
>
> EasySoft ODBC Join-Engine:
>
> http://www.easysoft.com/products/data_access/odbc_odbc_join_engine/in...
Yes failure rate for data warehouse projects is quite high, so are other
IT projects without data warehouses.
Data warehouse design is not that much different than 'normal' RDBMS,
you are following the same decisions for example:
- Do I rather have multiple copies of data than slow or complicated
access? If yes how do I ensure integrity?
- How do I do access control on value level?
- Do I need fail over or load balancing, if yes how much of it can I do
on the application level?
The thing is if you never have designed a database from a database point
of view because you used abstractions that hide these ugly details then
yes, Data warehouses are different than normal RDBMS.
Yes you can make it all sound very complicated by throwing PHB words in
like meta schema's, dimensional approach, subject orientated, etc.
I like to see it more like I have a couple of data sources, I want to
combine them in a way that I can do neat stuff with it. The 'combine
them' part is the data warehouse design, the 'neat stuff' part are the
clients that use that data warehouse. If you wish you can start making
it more complicated by saying but my data sources are also my clients,
but that is another step.
I guess you can sum up all this by saying a data warehouse is a method
of gathering already existing data and present them in a different way,
the concept is simple, the details can be complicated if you want/need
it to be.
--
MPH
So do you think it would be very beneficial for me to start with an
Inman or Kimball book? Or do you think it would be just leisure
reading and not very practical at best - fill my head with needless
jargon and inflexible dogmas, at worst?
I took a database class in college, understand the basic principals of
normalisation, and have built a few complicated RDBMS schemas from the
ground up.
On Sep 23, 12:28 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:
> snfctech wrote:
> > @Martin: I originally thought that there was nothing "magical" about
> > building a datawarehouse, but then I did a little research and
> > received all sorts of feedback about how datawarehouseprojects have
> > notorious failure rates, that datawarehousedesign IS different than
> > normal RDBMS - and then there's the whole thing about data marts vs.
> > warehouses, Kimball vs. Inmon, star schemas, EAV tables, and so on.
> > So I started to think that maybe I needed to get a little better read
> > on the subject.
>
> Yes failure rate for datawarehouseprojects is quite high, so are other
> IT projects without data warehouses.
>
> Datawarehousedesign is not that much different than 'normal' RDBMS,
> you are following the same decisions for example:
> - Do I rather have multiple copies of data than slow or complicated
> access? If yes how do I ensure integrity?
> - How do I do access control on value level?
> - Do I need fail over or load balancing, if yes how much of it can I do
> on the application level?
>
> The thing is if you never have designed a database from a database point
> of view because you used abstractions that hide these ugly details then
> yes, Data warehouses are different than normal RDBMS.
>
> Yes you can make it all sound very complicated by throwing PHB words in
> like meta schema's, dimensional approach, subject orientated, etc.
> I like to see it more like I have a couple of data sources, I want to
> combine them in a way that I can do neat stuff with it. The 'combine
> them' part is the datawarehousedesign, the 'neat stuff' part are the
> clients that use that datawarehouse. If you wish you can start making
> it more complicated by saying but my data sources are also my clients,
> but that is another step.
>
> I guess you can sum up all this by saying a datawarehouseis a method
> of gathering already existing data and present them in a different way,
> the concept is simple, the details can be complicated if you want/need
> it to be.
>
> --
> MPHhttp://blog.dcuktec.com
> So do you think it would be very beneficial for me to start with an
> Inman or Kimball book? Or do you think it would be just leisure
> reading and not very practical at best - fill my head with needless
> jargon and inflexible dogmas, at worst?
You have an unique opportunity here ;-)
I would recommend that you start by writing down what you want and how
it should look like, don't be bothered yet with implementation details
like how tables should be organised, just how you want to access your
data for the end-result.
Than make a detailed list of caveats which you can already see;
security, distribution, centralisation, real-time sync, etc.
Use that list, go to a tech library and flick through all the books and
look for signs that they actually have a solution for your problems,
read those books and take the things that make sense.
After you done this a couple of times you should see some patterns
starting to form and see that some solutions which you used in the past,
although working are better handled differently. But the main thing is
that you should have something working, which is the proof of the pudding.
<cut rest>
--
MPH
Strange, EasySoft used to support their product on Linux as well...
http://download.freshmeat.net/projects/easysoftsqlengine
Looks like they stopped with supporting it on Linux and now only
have Windows downloads available.
mxODBC and mxODBC Connect work just fine on Linux, so you could use
them to build a custom joining solution in Python.
It really depends on whether you need to run those joins on
an on-demand basis or not:
AFAIK, Talend can only run jobs which
then apply the JOINs and put the resulting data somewhere, e.g.
into a file or another database table.
With Python, you could do the join straight away and use the
data directly without the Python script - without having
to go through an external temporary resource.
Anyway, just a suggestion.
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Sep 24 2009)
Have you had much luck circumventing the need for a data warehouse in
this way?
Thanks.
On Sep 24, 4:10 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> snfctech wrote:
> > @Lemburg: Thanks for the suggestion. I'm sure you make a fine
> > product, but my development platform is Linux, and I don't want any
> > additional Windows servers to deal with (than the ones I'm already
> > stuck with.)
>
> Strange, EasySoft used to support their product on Linux as well...
>
> http://download.freshmeat.net/projects/easysoftsqlengine
>
> Looks like they stopped with supporting it on Linux and now only
> have Windows downloads available.
>
> mxODBC and mxODBC Connect work just fine on Linux, so you could use
> them to build a custom joining solution in Python.
>
> It really depends on whether you need to run those joins on
> an on-demand basis or not:
>
> AFAIK, Talend can only run jobs which
> then apply the JOINs and put the resulting data somewhere, e.g.
> into a file or another database table.
>
> With Python, you could do the join straight away and use the
> data directly without the Python script - without having
> to go through an external temporary resource.
>
> Anyway, just a suggestion.
>
>
>
> > On Sep 23, 2:02 am, "M.-A. Lemburg" <m...@egenix.com> wrote:
> >> snfctech wrote:
> >>> Does anyone have experience building a datawarehousein python? Any
> >>> thoughts on custom vs using an out-of-the-box product like Talend or
> >>> Informatica?
>
> >>> I have an integrated system Dashboard project that I was going to
> >>> build using cross-vendor joins on existing DBs, but I keep hearing
> >>> that a datawarehouseis the way to go. e.g. I want to create orders
Just my 2 cents about Talend and Informatica...
Depending on what your data sources are, you will end up
using a data warehouse sooner or later.
We've worked with clients from the middle office banking world
and there you typically have to aggregate data, polish it up,
transform it and then send it off for others to process.
Cross-database joins don't work in such a scenario because
the data sources are too diverse. Logging and auditing
requirements can also lead to having to use a warehouse
approach.
However, if you just to put together data from a few local
sources without too much data to work on, I'd suggest to
try the direct Python approach first. If you have enough
RAM to work with, Python can easily handle millions
of records and if you'd rather use a SQL based approach,
it's also possible to temporarily store the data in a SQLite
database as on-demand data warehouse.
Regards,
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Oct 02 2009)