Check Uniq after transforms have been done.

stellard

unread,

Apr 3, 2012, 6:36:10 AM4/3/12

to ActiveWarehouse Discuss

Is there any way to check the uniqueness after other transforms have
been completed?

How do you know the order of execution of things?

Thanks!
-scott

Thibaut Barrère

unread,

Apr 3, 2012, 10:55:49 AM4/3/12

to activewareh...@googlegroups.com

Hi Scott!

Is there any way to check the uniqueness after other transforms have been completed?

There are plenty of situations and plenty of ways to do that actually!

Can you elaborate a bit more on which result you'd like to achieve?

Ex:

- the row should be removed from the pipeline if there is already a record in the database with the same value of a given field

- the row should be removed from the pipeline if I already processed a row with the same value for a given field

- etc

Let me know and I'll give you an accurate answer.

How do you know the order of execution of things?

This is something that really needs a "guide" and this will be addressed (ie: control file lifecycle, what is executed when etc).

Roughly:

- there's a first pass (declaration time, when the control file is loaded) where the sources, transforms and destinations are declared (see https://github.com/activewarehouse/activewarehouse-etl/blob/master/lib/etl/control/control.rb)

- then there's a second pass handled by the "engine" which will fetch rows from the sources, go through the transforms etc, then to the destinations (see https://github.com/activewarehouse/activewarehouse-etl/blob/master/lib/etl/engine.rb#L308)

In short here's an approximate lifecycle of a control file:

- file is interpreted

- pre_process processors are invoked

- the engine iterates through each row of each source

- each row then goes across:

- after_read processors

- transforms

- before_write processors

- destination(s)

- screens are executed

- post_process processors are called

- after post process screens are executed

The key thing to understand is that: inside each of these sections, the order matters (ie: a transform declared before another one will always occur before). But after_read processors will always occur before transforms.

Is it somewhat clearer?

Let me know - I will use this as a basis for the coming guide on the topic.

Thibaut

--

http://www.logeek.fr

stellard

unread,

Apr 4, 2012, 7:27:06 AM4/4/12

to ActiveWarehouse Discuss

Hi Thibaut,

Thanks for this great answer!

This is the case that I was looking for

> - the row should be removed from the pipeline if I already processed a row
> with the same value for a given field

But the other one interests me as well if you can provide an example
of that.

To answer your other question...

Yes! That is much clearer, thank you. I was playing around with the
order of things and it was confusing in that it would only change the
outcome in some cases.

This is a great project btw. It is extremely useful and I cannot
believe its not more well known. Its useful for more that just a DW,
in some cases I am just using it to import large csv files or
migrating from a legacy database.

Cheers!

On Apr 3, 3:55 pm, Thibaut Barrère <thibaut.barr...@gmail.com> wrote:
> Hi Scott!
>
> Is there any way to check the uniqueness after other transforms have been
>
> > completed?
>
> There are plenty of situations and plenty of ways to do that actually!
>
> Can you elaborate a bit more on which result you'd like to achieve?
>
> Ex:
> - the row should be removed from the pipeline if there is already a record
> in the database with the same value of a given field
> - the row should be removed from the pipeline if I already processed a row
> with the same value for a given field
> - etc
>
> Let me know and I'll give you an accurate answer.
>
> > How do you know the order of execution of things?
>
> This is something that really needs a "guide" and this will be addressed
> (ie: control file lifecycle, what is executed when etc).
>
> Roughly:
> - there's a first pass (declaration time, when the control file is loaded)

> where the sources, transforms and destinations are declared (seehttps://github.com/activewarehouse/activewarehouse-etl/blob/master/li...
> )

> - then there's a second pass handled by the "engine" which will fetch rows
> from the sources, go through the transforms etc, then to the destinations

> (seehttps://github.com/activewarehouse/activewarehouse-etl/blob/master/li...
> )

Thibaut Barrère

unread,

Apr 4, 2012, 4:22:56 PM4/4/12

to activewareh...@googlegroups.com

Hi Scott!

This is the case that I was looking for

> - the row should be removed from the pipeline if I already processed a row
> with the same value for a given field

Ok! Then this is the part where you want to dive in to see the available processors.

- the check exist processor (also grep for "check exist" in the wiki) will remove rows that are already in the database

- the check unique processor (also in the current wiki) will remove rows that have been already be seen by your script

But note that (and there will be a guide dedicated to "hackability") you could perfectly create a tailor-made code to do it in whatever way suits you, here with an imaginary scenario relying on redis for caching processed rows, which could be shared between ETL sessions etc:

after_read do |row|

# returning nil will remove the row from the pipeline

if redis.get("processed-#{row[:key]}")

nil

else

row

end

# also writable as

redis.get("processed-#{row[:key]}") ? nil : row

end

To answer your other question...
Yes! That is much clearer, thank you. I was playing around with the
order of things and it was confusing in that it would only change the
outcome in some cases.

It *is* confusing at first, and this is why the "control lifecycle" guide will be written. I've been confused as well :)

This is a great project btw. It is extremely useful and I cannot
believe its not more well known. Its useful for more that just a DW,
in some cases I am just using it to import large csv files or
migrating from a legacy database.

I do agree, which is why I stepped in to maintain in and bring it back to life a while back.

A key point is that most people do not realize they are actually doing ETL, so many people that could use it to help them do not risk to stumble upon it by mistake.

I will make sure this changes :)

By the way, I use activewarehouse for datawarehousing but also for a wide variety of data munging tasks, such as:

- geocoding, cleaning and pushing data to a web app

- checking IBAN accounts validity on a dataset

- synchronizing data with a COBOL mainframe

- change data capture on a Javaish CRM

(I could go on :-).

Having Ruby there, despite its relative slowness, allows to a lot of hackability and a low maintenance overhead.

hope this helps!

Thibaut

--

http://www.logeek.fr

Reply all

Reply to author

Forward