So I am trying to run an hourly task that would for example pull the list of accounts. Some may be new, some may be updated.. How do I go about it with the activewarehouse-etl?
is there a good example/good practice ? I was using the bulk import which for now does not deal with update so only insert. Works fine for a first load but after that I need something a bit smarter..
Is there special destination that deals with that automatically?
So I am trying to run an hourly task that would for example pull the list
> of accounts. Some may be new, some may be updated.. How do I go about it
> with the activewarehouse-etl?
> is there a good example/good practice ? I was using the bulk import which
> for now does not deal with update so only insert. Works fine for a first
> load but after that I need something a bit smarter..
> Is there special destination that deals with that automatically?
There are multiple solutions, depending on your specific setup.
Can you tell us:
- what datastore do you need to use as a target?
- what is the volume (number of rows, roughly) you need to upsert per hour?
I'll provide the most appropriate solution based on what you reply.
On Wednesday, October 17, 2012 11:48:05 AM UTC-7, Thibaut Barrère wrote:
> Hi Emmanuel,
> So I am trying to run an hourly task that would for example pull the list >> of accounts. Some may be new, some may be updated.. How do I go about it >> with the activewarehouse-etl?
>> is there a good example/good practice ? I was using the bulk import which >> for now does not deal with update so only insert. Works fine for a first >> load but after that I need something a bit smarter.. >> Is there special destination that deals with that automatically?
> There are multiple solutions, depending on your specific setup.
> Can you tell us:
> - what datastore do you need to use as a target?
> - what is the volume (number of rows, roughly) you need to upsert per hour?
> I'll provide the most appropriate solution based on what you reply.
Be sure to read the gotchas! And I would use batch mode probably.
I would just use it not batched and without destination, using something
like:
before_write do |row|
Table.upsert(xxx)
nil # remove the row from the pipeline
end
If you need to do some batching, then you can either create a custom
destination (have a look at the code) or use a mix of before_write and
post_process (to finalize the last batch).
You could also use a few additional tricks:
- before bulk loading on immutable tables, be sure to detect the last id
imported to avoid re-importing it
- or you can remove rows that are already present in the destination, for
instance using:
after_read do |row|
Target.where(id: row[:some_id]).count > 0 ? nil : row
end
etc etc.
Hope this helps, let me know if you need more help later on!
In particular if the upsert gem does not work for you, we'll find another
solution.
Cool! I ll take a look at that . Once I have more time next week or so, I ll take a look at SCD 2. Really need to get his working because it would be useful :)
Too bad we don t have a good example! For now upsert might do the trick for me
> Be sure to read the gotchas! And I would use batch mode probably.
> I would just use it not batched and without destination, using something > like:
> before_write do |row|
> Table.upsert(xxx)
> nil # remove the row from the pipeline
> end
> If you need to do some batching, then you can either create a custom > destination (have a look at the code) or use a mix of before_write and > post_process (to finalize the last batch).
> You could also use a few additional tricks:
> - before bulk loading on immutable tables, be sure to detect the last id > imported to avoid re-importing it
> - or you can remove rows that are already present in the destination, for > instance using:
> after_read do |row|
> Target.where(id: row[:some_id]).count > 0 ? nil : row
> end
> etc etc.
> Hope this helps, let me know if you need more help later on!
> In particular if the upsert gem does not work for you, we'll find another > solution.
Quick question: Do you create a different ETL for the initial import vs update/insert later on?
I am trying to see how you first fill your db, and then only maintain updates? Or just always pull everything and just filter from there? or is that premature optimization from me?
On Wednesday, October 17, 2012 4:08:02 PM UTC-7, Emmanuel wrote:
> Cool! I ll take a look at that . Once I have more time next week or so, I > ll take a look at SCD 2. Really need to get his working because it would be > useful :)
> Too bad we don t have a good example! For now upsert might do the trick > for me
> On Wednesday, October 17, 2012 1:00:41 PM UTC-7, Thibaut Barrère wrote:
>> Hi,
>> given that you use PG as a datastore, my suggestion is to try out the >> upsert gem:
>> Be sure to read the gotchas! And I would use batch mode probably.
>> I would just use it not batched and without destination, using something >> like:
>> before_write do |row|
>> Table.upsert(xxx)
>> nil # remove the row from the pipeline
>> end
>> If you need to do some batching, then you can either create a custom >> destination (have a look at the code) or use a mix of before_write and >> post_process (to finalize the last batch).
>> You could also use a few additional tricks:
>> - before bulk loading on immutable tables, be sure to detect the last id >> imported to avoid re-importing it
>> - or you can remove rows that are already present in the destination, for >> instance using:
>> after_read do |row|
>> Target.where(id: row[:some_id]).count > 0 ? nil : row
>> end
>> etc etc.
>> Hope this helps, let me know if you need more help later on!
>> In particular if the upsert gem does not work for you, we'll find another >> solution.
On Thu, Oct 18, 2012 at 12:47 PM, Emmanuel <epina...@talenttech.com> wrote:
> Quick question: Do you create a different ETL for the initial import vs
> update/insert later on?
> I am trying to see how you first fill your db, and then only maintain
> updates? Or just always pull everything and just filter from there? or is
> that premature optimization from me?
> Emmanuel
> On Wednesday, October 17, 2012 4:08:02 PM UTC-7, Emmanuel wrote:
>> Cool! I ll take a look at that . Once I have more time next week or so, I
>> ll take a look at SCD 2. Really need to get his working because it would be
>> useful :)
>> Too bad we don t have a good example! For now upsert might do the trick
>> for me
>> On Wednesday, October 17, 2012 1:00:41 PM UTC-7, Thibaut Barrère wrote:
>>> Hi,
>>> given that you use PG as a datastore, my suggestion is to try out the
>>> upsert gem:
>>> Be sure to read the gotchas! And I would use batch mode probably.
>>> I would just use it not batched and without destination, using something
>>> like:
>>> before_write do |row|
>>> Table.upsert(xxx)
>>> nil # remove the row from the pipeline
>>> end
>>> If you need to do some batching, then you can either create a custom
>>> destination (have a look at the code) or use a mix of before_write and
>>> post_process (to finalize the last batch).
>>> You could also use a few additional tricks:
>>> - before bulk loading on immutable tables, be sure to detect the last id
>>> imported to avoid re-importing it
>>> - or you can remove rows that are already present in the destination,
>>> for instance using:
>>> after_read do |row|
>>> Target.where(id: row[:some_id]).count > 0 ? nil : row
>>> end
>>> etc etc.
>>> Hope this helps, let me know if you need more help later on!
>>> In particular if the upsert gem does not work for you, we'll find
>>> another solution.
> To post to this group, send email to
> activewarehouse-discuss@googlegroups.com.
> To unsubscribe from this group, send email to
> activewarehouse-discuss+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/activewarehouse-discuss?hl=en.
my problem is more on dimension tables. You have to originally load them, but you might not want to keep getting all of the data everytime just for efficiency? like process only the one that were updated?
For fact it is easy because I only add mostly on the one I have
On Thursday, October 18, 2012 10:12:38 AM UTC-7, Andrew Kuklewicz wrote:
> I didn't use different tasks for initial load, and would avoid the extra > hassle unless the process is very different.
> There were a few fact tables I broke up into different files for load, but > the ctl files were the same.
> Cheers,
> Andrew Kuklewicz
> On Thu, Oct 18, 2012 at 12:47 PM, Emmanuel <epin...@talenttech.com<javascript:>
> > wrote:
>> Quick question: Do you create a different ETL for the initial import vs >> update/insert later on?
>> I am trying to see how you first fill your db, and then only maintain >> updates? Or just always pull everything and just filter from there? or is >> that premature optimization from me?
>> Emmanuel
>> On Wednesday, October 17, 2012 4:08:02 PM UTC-7, Emmanuel wrote:
>>> Cool! I ll take a look at that . Once I have more time next week or so, >>> I ll take a look at SCD 2. Really need to get his working because it would >>> be useful :)
>>> Too bad we don t have a good example! For now upsert might do the trick >>> for me
>>> On Wednesday, October 17, 2012 1:00:41 PM UTC-7, Thibaut Barrère wrote:
>>>> Hi,
>>>> given that you use PG as a datastore, my suggestion is to try out the >>>> upsert gem:
>>>> Be sure to read the gotchas! And I would use batch mode probably.
>>>> I would just use it not batched and without destination, using >>>> something like:
>>>> before_write do |row|
>>>> Table.upsert(xxx)
>>>> nil # remove the row from the pipeline
>>>> end
>>>> If you need to do some batching, then you can either create a custom >>>> destination (have a look at the code) or use a mix of before_write and >>>> post_process (to finalize the last batch).
>>>> You could also use a few additional tricks:
>>>> - before bulk loading on immutable tables, be sure to detect the last >>>> id imported to avoid re-importing it
>>>> - or you can remove rows that are already present in the destination, >>>> for instance using:
>>>> after_read do |row|
>>>> Target.where(id: row[:some_id]).count > 0 ? nil : row
>>>> end
>>>> etc etc.
>>>> Hope this helps, let me know if you need more help later on!
>>>> In particular if the upsert gem does not work for you, we'll find >>>> another solution.
>> To post to this group, send email to activewareh...@googlegroups.com<javascript:>
>> .
>> To unsubscribe from this group, send email to >> activewarehouse-discuss+unsubscribe@googlegroups.com <javascript:>.
>> For more options, visit this group at >> http://groups.google.com/group/activewarehouse-discuss?hl=en.
my problem is more on dimension tables. You have to originally load them,
> but you might not want to keep getting all of the data everytime just for
> efficiency? like process only the one that were updated? For fact it is
> easy because I only add mostly on the one I have
I second what Andrew said: I prefer to write idempotent ETL jobs, so that
you can re-run the whole thing, or only a delta if you prefer, with the
same process. It's a bit harder to do properly, but then it is much more
flexible, and much less error prone.
On your dimension tables: do rows already inserted later see their columns
changed? Or is their data immutable? (unlikely, usually).
There's a tradeoff between development time, processing time, and
complexity to be found based on your requirements:
- if you prefer a very simple mechanism, at the cost of re-extracting and
re-importing the whole dimension (brute force), then you can do it that way
(I had jobs that take 10 hours, each night, and it was perfectly fine)
- if you prefer something more elaborated, and your dimension data do not
change, then you'll need to track the last dimension id inserted, query the
source for only what you need
- if you want to only insert or update what changed, you'll have to track
down a timestamp in your source system to extract only what changed (eg:
updated_at > ? OR created_at > ?) and upsert based on that
- if you want to track deletes, you'll want to keep a list of deleted ids,
or add a "deleted paranoid column"
If processing time is something you want to lower, you can:
- extract less (extract only what changed) to diminish processing time on
extraction
- upsert less with strategies like computing and storing a checksum for
each row in the target, to diminish processing time on upsert
- use some caching strategy in the upsert process (like keeping the
checksums in Redis etc)
So this is really dependent on the setup and the tradeoffs you are ready to
make.
With so many different situations, that's also why activewarehouse-etl does
not attempt to answer them directly, but instead provide a more general
framework :-)
The problem I have having is because of the bulk import I use originally to load those dimension (only does insert)... But then I need to not run that later on and do the upcert when I keep getting changes? (if I could have SCD working that would have been a big help already)
1) So does that mean you do not use the bulk import in your case for those situation?
On Thursday, October 18, 2012 11:34:59 AM UTC-7, Thibaut Barrère wrote:
> Hello folks!
> my problem is more on dimension tables. You have to originally load them, >> but you might not want to keep getting all of the data everytime just for >> efficiency? like process only the one that were updated? For fact it is >> easy because I only add mostly on the one I have
> I second what Andrew said: I prefer to write idempotent ETL jobs, so that > you can re-run the whole thing, or only a delta if you prefer, with the > same process. It's a bit harder to do properly, but then it is much more > flexible, and much less error prone.
> On your dimension tables: do rows already inserted later see their columns > changed? Or is their data immutable? (unlikely, usually).
> There's a tradeoff between development time, processing time, and > complexity to be found based on your requirements:
> - if you prefer a very simple mechanism, at the cost of re-extracting and > re-importing the whole dimension (brute force), then you can do it that way > (I had jobs that take 10 hours, each night, and it was perfectly fine)
> - if you prefer something more elaborated, and your dimension data do not > change, then you'll need to track the last dimension id inserted, query the > source for only what you need
> - if you want to only insert or update what changed, you'll have to track > down a timestamp in your source system to extract only what changed (eg: > updated_at > ? OR created_at > ?) and upsert based on that
> - if you want to track deletes, you'll want to keep a list of deleted ids, > or add a "deleted paranoid column"
> If processing time is something you want to lower, you can:
> - extract less (extract only what changed) to diminish processing time on > extraction
> - upsert less with strategies like computing and storing a checksum for > each row in the target, to diminish processing time on upsert
> - use some caching strategy in the upsert process (like keeping the > checksums in Redis etc)
> So this is really dependent on the setup and the tradeoffs you are ready > to make.
> With so many different situations, that's also why activewarehouse-etl > does not attempt to answer them directly, but instead provide a more > general framework :-)
> The problem I have having is because of the bulk import I use originally
> to load those dimension (only does insert)... But then I need to not run
> that later on and do the upcert when I keep getting changes? (if I could
> have SCD working that would have been a big help already)
> 1) So does that mean you do not use the bulk import in your case for those
> situation?
You could use bulk import, but then you'd have to target a temporary table,
copy of your target table, then use an upsert statement (specific to your
DB, like MERGE with SQLServer), to upsert the temp table into the target
table.
Otherwise you cannot use bulk import because it will try to insert existing
rows in the target.
On my cases, when I need to upsert, I do not use bulk load.
On Thursday, October 18, 2012 12:06:35 PM UTC-7, Thibaut Barrère wrote:
> The problem I have having is because of the bulk import I use originally >> to load those dimension (only does insert)... But then I need to not run >> that later on and do the upcert when I keep getting changes? (if I could >> have SCD working that would have been a big help already)
>> 1) So does that mean you do not use the bulk import in your case for >> those situation?
> You could use bulk import, but then you'd have to target a temporary > table, copy of your target table, then use an upsert statement (specific to > your DB, like MERGE with SQLServer), to upsert the temp table into the > target table.
> Otherwise you cannot use bulk import because it will try to insert > existing rows in the target.
> Make sense
> On my cases, when I need to upsert, I do not use bulk load.
I guess we should write an upset destination? rather than using the way you showed me to do upset (in a transform or before_write)
On Thu, Oct 18, 2012 at 2:48 PM, Emmanuel <epina...@talenttech.com> wrote:
> The problem I have having is because of the bulk import I use originally
> to load those dimension (only does insert)... But then I need to not run
> that later on and do the upcert when I keep getting changes? (if I could
> have SCD working that would have been a big help already)
> 1) So does that mean you do not use the bulk import in your case for those
> situation?
> On Thursday, October 18, 2012 11:34:59 AM UTC-7, Thibaut Barrère wrote:
>> Hello folks!
>> my problem is more on dimension tables. You have to originally load them,
>>> but you might not want to keep getting all of the data everytime just for
>>> efficiency? like process only the one that were updated? For fact it is
>>> easy because I only add mostly on the one I have
>> I second what Andrew said: I prefer to write idempotent ETL jobs, so that
>> you can re-run the whole thing, or only a delta if you prefer, with the
>> same process. It's a bit harder to do properly, but then it is much more
>> flexible, and much less error prone.
>> On your dimension tables: do rows already inserted later see their
>> columns changed? Or is their data immutable? (unlikely, usually).
>> There's a tradeoff between development time, processing time, and
>> complexity to be found based on your requirements:
>> - if you prefer a very simple mechanism, at the cost of re-extracting and
>> re-importing the whole dimension (brute force), then you can do it that way
>> (I had jobs that take 10 hours, each night, and it was perfectly fine)
>> - if you prefer something more elaborated, and your dimension data do not
>> change, then you'll need to track the last dimension id inserted, query the
>> source for only what you need
>> - if you want to only insert or update what changed, you'll have to track
>> down a timestamp in your source system to extract only what changed (eg:
>> updated_at > ? OR created_at > ?) and upsert based on that
>> - if you want to track deletes, you'll want to keep a list of deleted
>> ids, or add a "deleted paranoid column"
>> If processing time is something you want to lower, you can:
>> - extract less (extract only what changed) to diminish processing time on
>> extraction
>> - upsert less with strategies like computing and storing a checksum for
>> each row in the target, to diminish processing time on upsert
>> - use some caching strategy in the upsert process (like keeping the
>> checksums in Redis etc)
>> So this is really dependent on the setup and the tradeoffs you are ready
>> to make.
>> With so many different situations, that's also why activewarehouse-etl
>> does not attempt to answer them directly, but instead provide a more
>> general framework :-)
> To post to this group, send email to
> activewarehouse-discuss@googlegroups.com.
> To unsubscribe from this group, send email to
> activewarehouse-discuss+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/activewarehouse-discuss?hl=en.