Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Question around SCD
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  7 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Emmanuel  
View profile  
 More options Oct 8 2012, 6:04 am
From: Emmanuel <epina...@talenttech.com>
Date: Mon, 8 Oct 2012 03:04:23 -0700 (PDT)
Local: Mon, Oct 8 2012 6:04 am
Subject: Question around SCD

Hi,

I am playing the framework and I am trying to understand a bit the  SCD
part of it.

My plan is to export from a mysql to a File. Then do the transform from a
File to another File , and finally write to a postgresql (using bulk
upload).

1) Does Bulk upload manages SCD? or this is not even an option in this case?

If it does manage:

I have the following output destination

destination :out, {
  :file => "../tmp/t_board_accounts.csv",
  :append => false

},

{
  :order => [:key, :id, :crawling_enabled, :max_candidates_per_day],
  :virtual => {
    :key =>  ETL::Generator::SurrogateKeyGenerator.new(
      :query => 'SELECT MAX(board_account_key) FROM board_accounts',
      :target => :datawarehouse
    )
  }

}

2) But I see duplicate everytime I run it, is there something I need to do
to prevent duplicate?

3) when I specificy the following directive, I see nothing outputed in the
file nor in the database ever (even after truncating the final destination
and rerunning it)....
destination :out, {
  :file => "../tmp/t_board_accounts.csv",
  :append => false,
  :natural_key => [ :id ],
  :scd_fields => [ :crawling_enabled,
                   :max_candidates_per_day],
  :scd => {
    :dimension_target => :datawarehouse,
    :dimension_table => "board_accounts",
    :type => 2
  }

},

{
  :order => [:key, :id, :crawling_enabled, :max_candidates_per_day],
  :virtual => {
    :key =>  ETL::Generator::SurrogateKeyGenerator.new(
      :query => 'SELECT MAX(board_account_key) FROM board_accounts',
      :target => :datawarehouse
    )
  }

}

Thanks

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Thibaut Barrère  
View profile  
 More options Oct 8 2012, 2:47 pm
From: Thibaut Barrère <thibaut.barr...@gmail.com>
Date: Mon, 8 Oct 2012 20:47:23 +0200
Local: Mon, Oct 8 2012 2:47 pm
Subject: Re: Question around SCD

Hey folks,

after some discussion with Emmanuel and Andrew (
https://github.com/activewarehouse/activewarehouse-etl/issues/115), I
wonder: is there anyone here using at least some of the SCD features?

If so, which scenarios are you using?

I don't expect much answers given how specific this is, but in case...

thanks,

Thibaut
--
http://www.logeek.fr


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 25 2012, 6:03 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 25 Oct 2012 15:03:07 -0700 (PDT)
Local: Thurs, Oct 25 2012 6:03 pm
Subject: Re: Question around SCD

So I took a look at the code for SCD

SCD based on http://en.wikipedia.org/wiki/Slowly_changing_dimension and
this
book http://www.amazon.com/Star-Schema-The-Complete-Reference/dp/007174432...

The code here is interesting because it assumes that instead of doing an
update and insert, to make good use of bulkupload, it does a delete and
create 2 new rows :)
one for the old record and marked as expired, and one for the updated
version.

This means that you get 1 delete and 2 insert. So the consequence is it
needs table where you can write your own SK otherwise that will break .
Also I hope AR can deal with this correctly because I recall AR not being a
happy camper when setting an id yourself...

More to come as I debug it


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 25 2012, 6:19 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 25 Oct 2012 15:19:51 -0700 (PDT)
Local: Thurs, Oct 25 2012 6:19 pm
Subject: Re: Question around SCD

So it does work with bulk upload and got it to fully work on a small
example. Just gotta watch out for the fact that you get a delete , then 2
insert

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 25 2012, 6:27 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 25 Oct 2012 15:27:00 -0700 (PDT)
Subject: Re: Question around SCD

Ok , I found what is not working and why I could not see it working fully

The old row gets added with the incorrect SK because I have a virtual key
setup.. and it uses the max(id) from my table. Something about that is not
working correctly when generating the rows and virtual key are used.

Not sure yet how to fix it but I can reproduce this issue now

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 25 2012, 6:38 pm
From: Emmanuel <epina...@talenttech.com>
Date: Thu, 25 Oct 2012 15:38:42 -0700 (PDT)
Local: Thurs, Oct 25 2012 6:38 pm
Subject: Re: Question around SCD

I got it to fully work correctly but there seems to be a specific naming
that has to be correct between the way you specify the virtual field using
the SurrogateKeyGenerator and the way you save it. Then the ID comes back
correct and work with bulkload :)

Bottom line it works and allows me now to use the bulk load .

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Emmanuel  
View profile  
 More options Oct 26 2012, 1:40 pm
From: Emmanuel <epina...@talenttech.com>
Date: Fri, 26 Oct 2012 10:40:36 -0700 (PDT)
Local: Fri, Oct 26 2012 1:40 pm
Subject: Re: Question around SCD

More things to watch out for. If you put a field from the input that is not
coming from the database output. Also boolean fields.. Between mysql to
postgre it looks different and caused constant changes.

The one downside of the current implementation is that it delete all the
records and read them the same way if unchanged . So during the transform
lap till load , you have no records or missing records . Just a gotcha to
watch out for

Normally it should be an update and insert which then ensure that no record
 are ever missing though I understand it is slower because we cannot use
the bulk load

Emmanuel


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »