Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Insert/Updates only
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  13 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Emmanuel  
View profile  
 More options Oct 17 2012, 2:10 pm
From: Emmanuel <epina...@talenttech.com>
Date: Wed, 17 Oct 2012 11:10:36 -0700 (PDT)
Local: Wed, Oct 17 2012 2:10 pm
Subject: Insert/Updates only

Hi

So I am trying to run an hourly task that would for example pull the list
of accounts. Some may be new, some may be updated.. How do I go about it
with the activewarehouse-etl?

is there a good example/good practice ? I was using the bulk import which
for now does not deal with update so only insert. Works fine for a  first
load but after that I need something a bit smarter..

Is there  special destination that deals with that automatically?

Thanks

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Thibaut Barrère  
View profile  
 More options Oct 17 2012, 2:48 pm
From: Thibaut Barrère <thibaut.barr...@gmail.com>
Date: Wed, 17 Oct 2012 20:47:42 +0200
Local: Wed, Oct 17 2012 2:47 pm
Subject: Re: Insert/Updates only

Hi Emmanuel,

So I am trying to run an hourly task that would for example pull the list

> of accounts. Some may be new, some may be updated.. How do I go about it
> with the activewarehouse-etl?
> is there a good example/good practice ? I was using the bulk import which
> for now does not deal with update so only insert. Works fine for a  first
> load but after that I need something a bit smarter..
> Is there  special destination that deals with that automatically?

There are multiple solutions, depending on your specific setup.

Can you tell us:
- what datastore do you need to use as a target?
- what is the volume (number of rows, roughly) you need to upsert per hour?

I'll provide the most appropriate solution based on what you reply.

cheers,

Thibaut
--
http://www.logeek.fr


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 17 2012, 3:47 pm
From: Emmanuel <epina...@talenttech.com>
Date: Wed, 17 Oct 2012 12:47:43 -0700 (PDT)
Local: Wed, Oct 17 2012 3:47 pm
Subject: Re: Insert/Updates only

Hi Thibaut,

So I am mostly interesting in dimension updates. Some are just
account/users stuff that rarely changes and less than 20 000 rows for sure.

Some are aggregated number of  a dimension table that is not that big
either less than 1000 rows but aggregate hourly to update some numbers

I don think I have anything yet bigger than 100 000 rows that requires
update/insert logic in my dimensions

For fact table it is just insert so I can use the bulk upload

I am using postgre for the backend

thanks

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Thibaut Barrère  
View profile  
 More options Oct 17 2012, 4:00 pm
From: Thibaut Barrère <thibaut.barr...@gmail.com>
Date: Wed, 17 Oct 2012 22:00:18 +0200
Local: Wed, Oct 17 2012 4:00 pm
Subject: Re: Insert/Updates only

Hi,

given that you use PG as a datastore, my suggestion is to try out the
upsert gem:

https://github.com/seamusabshere/upsert

Be sure to read the gotchas! And I would use batch mode probably.

I would just use it not batched and without destination, using something
like:

before_write do |row|
  Table.upsert(xxx)
  nil # remove the row from the pipeline
end

If you need to do some batching, then you can either create a custom
destination (have a look at the code) or use a mix of before_write and
post_process (to finalize the last batch).

You could also use a few additional tricks:
- before bulk loading on immutable tables, be sure to detect the last id
imported to avoid re-importing it
- or you can remove rows that are already present in the destination, for
instance using:

after_read do |row|
  Target.where(id: row[:some_id]).count > 0 ? nil : row
end

etc etc.

Hope this helps, let me know if you need more help later on!

In particular if the upsert gem does not work for you, we'll find another
solution.

Thibaut
--
http://www.logeek.fr


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 17 2012, 7:08 pm
From: Emmanuel <epina...@talenttech.com>
Date: Wed, 17 Oct 2012 16:08:02 -0700 (PDT)
Local: Wed, Oct 17 2012 7:08 pm
Subject: Re: Insert/Updates only

Cool! I ll take a look at that . Once I have more time next week or so, I
ll take a look at SCD 2. Really need to get his working because it would be
useful :)

Too bad we don t have a good example! For now upsert might do the trick for
me


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 18 2012, 12:47 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 18 Oct 2012 09:47:22 -0700 (PDT)
Local: Thurs, Oct 18 2012 12:47 pm
Subject: Re: Insert/Updates only

Quick question: Do you create a different ETL for the initial import vs
update/insert later on?  

I am trying to see how you first fill your db, and then only maintain
updates? Or just always pull everything and just filter from there? or is
that premature optimization from me?

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andrew Kuklewicz  
View profile  
 More options Oct 18 2012, 1:12 pm
From: Andrew Kuklewicz <kooks...@gmail.com>
Date: Thu, 18 Oct 2012 13:12:07 -0400
Local: Thurs, Oct 18 2012 1:12 pm
Subject: Re: Insert/Updates only

I didn't use different tasks for initial load, and would avoid the extra
hassle unless the process is very different.

There were a few fact tables I broke up into different files for load, but
the ctl files were the same.

Cheers,

Andrew Kuklewicz


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 18 2012, 1:53 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 18 Oct 2012 10:53:40 -0700 (PDT)
Local: Thurs, Oct 18 2012 1:53 pm
Subject: Re: Insert/Updates only

my problem is more on dimension tables. You have to originally load them,
but you might not want to keep getting all of the data everytime just for
efficiency? like process only the one that were updated?

For fact it is easy because I only add mostly on the one I have


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Thibaut Barrère  
View profile  
 More options Oct 18 2012, 2:34 pm
From: Thibaut Barrère <thibaut.barr...@gmail.com>
Date: Thu, 18 Oct 2012 20:34:38 +0200
Local: Thurs, Oct 18 2012 2:34 pm
Subject: Re: Insert/Updates only

Hello folks!

my problem is more on dimension tables. You have to originally load them,

> but you might not want to keep getting all of the data everytime just for
> efficiency? like process only the one that were updated?  For fact it is
> easy because I only add mostly on the one I have

I second what Andrew said: I prefer to write idempotent ETL jobs, so that
you can re-run the whole thing, or only a delta if you prefer, with the
same process. It's a bit harder to do properly, but then it is much more
flexible, and much less error prone.

On your dimension tables: do rows already inserted later see their columns
changed? Or is their data immutable? (unlikely, usually).

There's a tradeoff between development time, processing time, and
complexity to be found based on your requirements:
- if you prefer a very simple mechanism, at the cost of re-extracting and
re-importing the whole dimension (brute force), then you can do it that way
(I had jobs that take 10 hours, each night, and it was perfectly fine)
- if you prefer something more elaborated, and your dimension data do not
change, then you'll need to track the last dimension id inserted, query the
source for only what you need
- if you want to only insert or update what changed, you'll have to track
down a timestamp in your source system to extract only what changed (eg:
updated_at > ? OR created_at > ?) and upsert based on that
- if you want to track deletes, you'll want to keep a list of deleted ids,
or add a "deleted paranoid column"

If processing time is something you want to lower, you can:
- extract less (extract only what changed) to diminish processing time on
extraction
- upsert less with strategies like computing and storing a checksum for
each row in the target, to diminish processing time on upsert
- use some caching strategy in the upsert process (like keeping the
checksums in Redis etc)

So this is really dependent on the setup and the tradeoffs you are ready to
make.

With so many different situations, that's also why activewarehouse-etl does
not attempt to answer them directly, but instead provide a more general
framework :-)

hope this helps!

Thibaut
--
http://www.logeek.fr


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 18 2012, 2:48 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 18 Oct 2012 11:48:15 -0700 (PDT)
Local: Thurs, Oct 18 2012 2:48 pm
Subject: Re: Insert/Updates only

The problem I have having is because of the bulk import I use originally to
load those dimension (only does insert)... But then I need to not run that
later on and do the upcert when I keep getting changes? (if I could have
SCD working that would have been a big help already)

1) So does that mean you do not use the bulk import in your case for those
situation?

2) how do you use any of the other database control such
as https://github.com/activewarehouse/activewarehouse-etl/blob/master/li...

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Thibaut Barrère  
View profile  
 More options Oct 18 2012, 3:06 pm
From: Thibaut Barrère <thibaut.barr...@gmail.com>
Date: Thu, 18 Oct 2012 21:06:13 +0200
Local: Thurs, Oct 18 2012 3:06 pm
Subject: Re: Insert/Updates only

> The problem I have having is because of the bulk import I use originally
> to load those dimension (only does insert)... But then I need to not run
> that later on and do the upcert when I keep getting changes? (if I could
> have SCD working that would have been a big help already)

> 1) So does that mean you do not use the bulk import in your case for those
> situation?

You could use bulk import, but then you'd have to target a temporary table,
copy of your target table, then use an upsert statement (specific to your
DB, like MERGE with SQLServer), to upsert the temp table into the target
table.

Otherwise you cannot use bulk import because it will try to insert existing
rows in the target.

On my cases, when I need to upsert, I do not use bulk load.

Or for immutable data, you could rely on :check_exist, too:
https://github.com/activewarehouse/activewarehouse-etl-sample/blob/ma...

> 2) how do you use any of the other database control such as
> https://github.com/activewarehouse/activewarehouse-etl/blob/master/li...

I have not used this specific one. Use (based on what I read in the source)
probably goes like:

destination :out, {
  :type => :insert_update_database,
  :target => xxx,
  :table => yyy,
  :primarykey => 'id'

}

(untested!)

hope this helps,

Thibaut
--
http://www.logeek.fr


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 18 2012, 3:10 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 18 Oct 2012 12:10:23 -0700 (PDT)
Local: Thurs, Oct 18 2012 3:10 pm
Subject: Re: Insert/Updates only

I guess we should write an upset destination? rather than using the way you
showed me to do upset (in a transform or before_write)

> Or for immutable data, you could rely on :check_exist, too:
> https://github.com/activewarehouse/activewarehouse-etl-sample/blob/ma...

Yep for that one I use that too :)

Make sense. I guess I ll try that one otherwise I ll create a destination
for upsert :)


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andrew Kuklewicz  
View profile  
 More options Oct 18 2012, 3:11 pm
From: Andrew Kuklewicz <kooks...@gmail.com>
Date: Thu, 18 Oct 2012 15:09:12 -0400
Local: Thurs, Oct 18 2012 3:09 pm
Subject: Re: Insert/Updates only

I do bulk updates from files -> mysql, and use the insert or replace
capability in mysql.

http://dev.mysql.com/doc/refman/5.6/en/load-data.html

I don't think that helps you with postgres though.

Cheers,

Andrew Kuklewicz


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »