Merging multiple data sources

25 views
Skip to first unread message

Eric Budd

unread,
Nov 28, 2012, 5:12:41 PM11/28/12
to activewareh...@googlegroups.com
Hi all,

    I'm just getting started with activewarehouse-etl, and I'm running across an issue trying to build our Customer dimension.

    From the primary transactional database, we get main customer data, and we have a second server that contains their service subscription data.  There's effectively a 1:1 relationship between these two databases, but they have different data about the customer.

    In my ctl file, I've tried adding two sources.  This works, inasmuch as both sources are read, but it adds two records for each customer, one with the data from the transactional database, and one with data from the subscription service.

    How can I merge the fields in these records before writing them to the database?

    Thanks very much.

- Eric

Thibaut Barrère

unread,
Nov 29, 2012, 3:43:41 AM11/29/12
to ActiveWarehouse Discuss
Hello Eric!

Welcome onboard :-)

> In my ctl file, I've tried adding two sources.  This works, inasmuch as
> both sources are read, but it adds two records for each customer, one with
> the data from the transactional database, and one with data from the
> subscription service.
> How can I merge the fields in these records before writing them to the
> database?

Here are a number of techniques you could use in that case:

1/ if you want to keep a single table as the target

- I would create a secondary table for the "child" table (most likely,
the subscription service table) which I would load in a first control
file.
- then I would create a second control file that would process the
transactional database, and do look-ups on the secondary table to
retrieve the fields

In this pattern, the secondary table could be actually anywhere it
fits you: a table in the target store, a table in a local sqlite
database which would just act as a cache, a redis/mongodb instance
etc: the idea is just to keep the secondary data handy to access it as
needed when creating the rows.

If the secondary data is small, you could even load it completely in
memory - I do that at times when the secondary dataset is of limited
size (and not growing over time), but I suspect this won't work for
you here.

2/ you could also keep two target tables

- a first control file would load the child table
- then the second would load the primary table with look-ups for
surrogate keys to the child table
- potentially, you could even create a view joining the two tables as
your final dimensions (depending on your dataset size and which
datastore you are using)

I hope this properly answered your question - let me know if it
doesn't!

Thibaut
--
http://www.logeek.fr
Reply all
Reply to author
Forward
0 new messages