Insert/Updates only

Emmanuel

unread,

Oct 17, 2012, 2:10:36 PM10/17/12

to activewareh...@googlegroups.com

Hi

So I am trying to run an hourly task that would for example pull the list of accounts. Some may be new, some may be updated.. How do I go about it with the activewarehouse-etl?

is there a good example/good practice ? I was using the bulk import which for now does not deal with update so only insert. Works fine for a first load but after that I need something a bit smarter..

Is there special destination that deals with that automatically?

Thanks

Emmanuel

Thibaut Barrère

unread,

Oct 17, 2012, 2:47:42 PM10/17/12

to activewareh...@googlegroups.com

Hi Emmanuel,

So I am trying to run an hourly task that would for example pull the list of accounts. Some may be new, some may be updated.. How do I go about it with the activewarehouse-etl?

is there a good example/good practice ? I was using the bulk import which for now does not deal with update so only insert. Works fine for a first load but after that I need something a bit smarter..
Is there special destination that deals with that automatically?

There are multiple solutions, depending on your specific setup.

Can you tell us:

- what datastore do you need to use as a target?

- what is the volume (number of rows, roughly) you need to upsert per hour?

I'll provide the most appropriate solution based on what you reply.

cheers,

Thibaut

--

http://www.logeek.fr

Emmanuel

unread,

Oct 17, 2012, 3:47:43 PM10/17/12

to activewareh...@googlegroups.com

Hi Thibaut,

So I am mostly interesting in dimension updates. Some are just account/users stuff that rarely changes and less than 20 000 rows for sure.

Some are aggregated number of a dimension table that is not that big either less than 1000 rows but aggregate hourly to update some numbers

I don think I have anything yet bigger than 100 000 rows that requires update/insert logic in my dimensions

For fact table it is just insert so I can use the bulk upload

I am using postgre for the backend

thanks

Emmanuel

Thibaut Barrère

unread,

Oct 17, 2012, 4:00:18 PM10/17/12

to activewareh...@googlegroups.com

Hi,

given that you use PG as a datastore, my suggestion is to try out the upsert gem:

https://github.com/seamusabshere/upsert

Be sure to read the gotchas! And I would use batch mode probably.

I would just use it not batched and without destination, using something like:

before_write do |row|

Table.upsert(xxx)

nil # remove the row from the pipeline

end

If you need to do some batching, then you can either create a custom destination (have a look at the code) or use a mix of before_write and post_process (to finalize the last batch).

You could also use a few additional tricks:

- before bulk loading on immutable tables, be sure to detect the last id imported to avoid re-importing it

- or you can remove rows that are already present in the destination, for instance using:

after_read do |row|

Target.where(id: row[:some_id]).count > 0 ? nil : row

end

etc etc.

Hope this helps, let me know if you need more help later on!

In particular if the upsert gem does not work for you, we'll find another solution.

Thibaut

--

http://www.logeek.fr

Emmanuel

unread,

Oct 17, 2012, 7:08:02 PM10/17/12

to activewareh...@googlegroups.com

Cool! I ll take a look at that . Once I have more time next week or so, I ll take a look at SCD 2. Really need to get his working because it would be useful :)

Too bad we don t have a good example! For now upsert might do the trick for me

Emmanuel

unread,

Oct 18, 2012, 12:47:22 PM10/18/12

to activewareh...@googlegroups.com

Quick question: Do you create a different ETL for the initial import vs update/insert later on?

I am trying to see how you first fill your db, and then only maintain updates? Or just always pull everything and just filter from there? or is that premature optimization from me?

Emmanuel

Andrew Kuklewicz

unread,

Oct 18, 2012, 1:12:07 PM10/18/12

to activewareh...@googlegroups.com

I didn't use different tasks for initial load, and would avoid the extra hassle unless the process is very different.

There were a few fact tables I broke up into different files for load, but the ctl files were the same.

Cheers,

Andrew Kuklewicz

--
You received this message because you are subscribed to the Google Groups "ActiveWarehouse Discuss" group.
To view this discussion on the web visit https://groups.google.com/d/msg/activewarehouse-discuss/-/B2DT6ZZk6cwJ.

To post to this group, send email to activewareh...@googlegroups.com.
To unsubscribe from this group, send email to activewarehouse-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/activewarehouse-discuss?hl=en.

Emmanuel

unread,

Oct 18, 2012, 1:53:40 PM10/18/12

to activewareh...@googlegroups.com

my problem is more on dimension tables. You have to originally load them, but you might not want to keep getting all of the data everytime just for efficiency? like process only the one that were updated?

For fact it is easy because I only add mostly on the one I have

Cheers,

Andrew Kuklewicz

To unsubscribe from this group, send email to activewarehouse-discuss+unsub...@googlegroups.com.

Thibaut Barrère

unread,

Oct 18, 2012, 2:34:38 PM10/18/12

to activewareh...@googlegroups.com

Hello folks!

my problem is more on dimension tables. You have to originally load them, but you might not want to keep getting all of the data everytime just for efficiency? like process only the one that were updated? For fact it is easy because I only add mostly on the one I have

I second what Andrew said: I prefer to write idempotent ETL jobs, so that you can re-run the whole thing, or only a delta if you prefer, with the same process. It's a bit harder to do properly, but then it is much more flexible, and much less error prone.

On your dimension tables: do rows already inserted later see their columns changed? Or is their data immutable? (unlikely, usually).

There's a tradeoff between development time, processing time, and complexity to be found based on your requirements:

- if you prefer a very simple mechanism, at the cost of re-extracting and re-importing the whole dimension (brute force), then you can do it that way (I had jobs that take 10 hours, each night, and it was perfectly fine)

- if you prefer something more elaborated, and your dimension data do not change, then you'll need to track the last dimension id inserted, query the source for only what you need

- if you want to only insert or update what changed, you'll have to track down a timestamp in your source system to extract only what changed (eg: updated_at > ? OR created_at > ?) and upsert based on that

- if you want to track deletes, you'll want to keep a list of deleted ids, or add a "deleted paranoid column"

If processing time is something you want to lower, you can:

- extract less (extract only what changed) to diminish processing time on extraction

- upsert less with strategies like computing and storing a checksum for each row in the target, to diminish processing time on upsert

- use some caching strategy in the upsert process (like keeping the checksums in Redis etc)

So this is really dependent on the setup and the tradeoffs you are ready to make.

With so many different situations, that's also why activewarehouse-etl does not attempt to answer them directly, but instead provide a more general framework :-)

hope this helps!

Thibaut

--

http://www.logeek.fr

Emmanuel

unread,

Oct 18, 2012, 2:48:15 PM10/18/12

to activewareh...@googlegroups.com

The problem I have having is because of the bulk import I use originally to load those dimension (only does insert)... But then I need to not run that later on and do the upcert when I keep getting changes? (if I could have SCD working that would have been a big help already)

1) So does that mean you do not use the bulk import in your case for those situation?

2) how do you use any of the other database control such as https://github.com/activewarehouse/activewarehouse-etl/blob/master/lib/etl/control/destination/insert_update_database_destination.rb

Emmanuel

Thibaut Barrère

unread,

Oct 18, 2012, 3:06:13 PM10/18/12

to activewareh...@googlegroups.com

The problem I have having is because of the bulk import I use originally to load those dimension (only does insert)... But then I need to not run that later on and do the upcert when I keep getting changes? (if I could have SCD working that would have been a big help already)

1) So does that mean you do not use the bulk import in your case for those situation?

You could use bulk import, but then you'd have to target a temporary table, copy of your target table, then use an upsert statement (specific to your DB, like MERGE with SQLServer), to upsert the temp table into the target table.

Otherwise you cannot use bulk import because it will try to insert existing rows in the target.

On my cases, when I need to upsert, I do not use bulk load.

Or for immutable data, you could rely on :check_exist, too: https://github.com/activewarehouse/activewarehouse-etl-sample/blob/master/advanced/etl/prepare_user_dimension.ctl#L29

2) how do you use any of the other database control such as https://github.com/activewarehouse/activewarehouse-etl/blob/master/lib/etl/control/destination/insert_update_database_destination.rb

I have not used this specific one. Use (based on what I read in the source) probably goes like:

destination :out, {

:type => :insert_update_database,

:target => xxx,

:table => yyy,

:primarykey => 'id'

}

(untested!)

hope this helps,

Thibaut

--

http://www.logeek.fr

Emmanuel

unread,

Oct 18, 2012, 3:10:23 PM10/18/12

to activewareh...@googlegroups.com

On Thursday, October 18, 2012 12:06:35 PM UTC-7, Thibaut Barrère wrote:

The problem I have having is because of the bulk import I use originally to load those dimension (only does insert)... But then I need to not run that later on and do the upcert when I keep getting changes? (if I could have SCD working that would have been a big help already)

1) So does that mean you do not use the bulk import in your case for those situation?

You could use bulk import, but then you'd have to target a temporary table, copy of your target table, then use an upsert statement (specific to your DB, like MERGE with SQLServer), to upsert the temp table into the target table.

Otherwise you cannot use bulk import because it will try to insert existing rows in the target.

Make sense

On my cases, when I need to upsert, I do not use bulk load.

I guess we should write an upset destination? rather than using the way you showed me to do upset (in a transform or before_write)

Or for immutable data, you could rely on :check_exist, too: https://github.com/activewarehouse/activewarehouse-etl-sample/blob/master/advanced/etl/prepare_user_dimension.ctl#L29

Yep for that one I use that too :)

2) how do you use any of the other database control such as https://github.com/activewarehouse/activewarehouse-etl/blob/master/lib/etl/control/destination/insert_update_database_destination.rb

I have not used this specific one. Use (based on what I read in the source) probably goes like:

destination :out, {
:type => :insert_update_database,

:target => xxx,
:table => yyy,
:primarykey => 'id'
}

Make sense. I guess I ll try that one otherwise I ll create a destination for upsert :)

Andrew Kuklewicz

unread,

Oct 18, 2012, 3:09:12 PM10/18/12

to activewareh...@googlegroups.com

I do bulk updates from files -> mysql, and use the insert or replace capability in mysql.

http://dev.mysql.com/doc/refman/5.6/en/load-data.html

I don't think that helps you with postgres though.

Cheers,

Andrew Kuklewicz

--

You received this message because you are subscribed to the Google Groups "ActiveWarehouse Discuss" group.

To view this discussion on the web visit https://groups.google.com/d/msg/activewarehouse-discuss/-/70v9zPfU048J.

To post to this group, send email to activewareh...@googlegroups.com.

To unsubscribe from this group, send email to activewarehouse-d...@googlegroups.com.

Reply all

Reply to author

Forward