Question around SCD


Emmanuel

Oct 8, 2012, 6:04:23 AM
to activewareh...@googlegroups.com
Hi, 

I am playing with the framework and I am trying to understand the SCD part of it.

My plan is to export from MySQL to a file, then transform from one file to another file, and finally write to PostgreSQL (using bulk upload).

1) Does bulk upload manage SCD, or is that not even an option in this case?

If it does:

I have the following output destination:

destination :out, {
  :file => "../tmp/t_board_accounts.csv",
  :append => false
},
{
  :order => [:key, :id, :crawling_enabled, :max_candidates_per_day],
  :virtual => {
    :key =>  ETL::Generator::SurrogateKeyGenerator.new(
      :query => 'SELECT MAX(board_account_key) FROM board_accounts',
      :target => :datawarehouse
    )
  }
}

2) But I see duplicates every time I run it. Is there something I need to do to prevent the duplicates?


3) When I specify the following directives, I never see anything output to the file or the database (even after truncating the final destination and rerunning it):
destination :out, {
  :file => "../tmp/t_board_accounts.csv",
  :append => false,
  :natural_key => [ :id ],
  :scd_fields => [ :crawling_enabled,
                   :max_candidates_per_day],
  :scd => {
    :dimension_target => :datawarehouse,
    :dimension_table => "board_accounts",
    :type => 2
  }
},
{
  :order => [:key, :id, :crawling_enabled, :max_candidates_per_day],
  :virtual => {
    :key =>  ETL::Generator::SurrogateKeyGenerator.new(
      :query => 'SELECT MAX(board_account_key) FROM board_accounts',
      :target => :datawarehouse
    )
  }
}

Thanks

Emmanuel

Thibaut Barrère

Oct 8, 2012, 2:47:23 PM
to activewareh...@googlegroups.com
Hey folks,

After some discussion with Emmanuel and Andrew (https://github.com/activewarehouse/activewarehouse-etl/issues/115), I wonder: is there anyone here using at least some of the SCD features?

If so, in which scenarios are you using them?

I don't expect many answers given how specific this is, but just in case...

thanks,

Thibaut
--

Emmanuel

Oct 25, 2012, 6:03:07 PM
to activewareh...@googlegroups.com
So I took a look at the code for SCD


The code here is interesting: instead of doing an update and an insert, it does a delete and creates 2 new rows, to make good use of bulk upload :)
one for the old record, marked as expired, and one for the updated version.

This means that you get 1 delete and 2 inserts. The consequence is that it needs a table where you can write your own SK, otherwise that will break. Also, I hope AR can deal with this correctly, because I recall AR not being a happy camper when you set an id yourself...
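To make that concrete, here is roughly what a type 2 change ends up looking like in SQL terms. This is only a sketch against my example table: the key values, dates and expiry column names are made up, and the connection handle assumes ActiveRecord is already connected to the warehouse.

require 'active_record'

conn = ActiveRecord::Base.connection  # assumes establish_connection was done elsewhere

# 1 delete: the current version of the changed row goes away...
conn.execute("DELETE FROM board_accounts WHERE id = 42")

# ...then 2 inserts: the old values again, marked as expired,
conn.execute(<<-SQL)
  INSERT INTO board_accounts
    (board_account_key, id, crawling_enabled, max_candidates_per_day, effective_date, end_date)
  VALUES (17, 42, true, 100, '2012-01-01', '2012-10-25')
SQL

# and the updated values as the new current row, with a fresh surrogate key.
conn.execute(<<-SQL)
  INSERT INTO board_accounts
    (board_account_key, id, crawling_enabled, max_candidates_per_day, effective_date, end_date)
  VALUES (18, 42, false, 250, '2012-10-25', '9999-12-31')
SQL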

More to come as I debug it

Emmanuel

Oct 25, 2012, 6:19:51 PM
to activewareh...@googlegroups.com
So it does work with bulk upload, and I got it to fully work on a small example. Just got to watch out for the fact that you get a delete, then 2 inserts.

Emmanuel

Emmanuel

Oct 25, 2012, 6:27:00 PM
to activewareh...@googlegroups.com
Ok, I found what is not working and why I could not see it working fully.

The old row gets added with the incorrect SK because I have a virtual key set up, and it uses the max(id) from my table. Something about that is not working correctly when the rows are generated and a virtual key is used.
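As a mental model (this is not the gem's code, just a toy version of what I assume a MAX()-seeded surrogate key generator does):

# Toy surrogate key generator: seed it from SELECT MAX(board_account_key) ...,
# then hand out key + 1, key + 2, ... for every row that gets written.
class ToySurrogateKeyGenerator
  def initialize(starting_key)
    @key = starting_key || 0
  end

  def next
    @key += 1
  end
end

gen = ToySurrogateKeyGenerator.new(17)   # pretend MAX(board_account_key) was 17
gen.next   # => 18, handed to the re-inserted "expired" copy of the old row
gen.next   # => 19, handed to the new current row

If that is what happens, the expired copy of the old row gets a brand new key instead of keeping its original board_account_key, which would explain what I am seeing.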

Not sure yet how to fix it but I can reproduce this issue now

Emmanuel

Emmanuel

Oct 25, 2012, 6:38:42 PM
to activewareh...@googlegroups.com
I got it to work correctly, but there seems to be a specific naming that has to match between the way you specify the virtual field using the SurrogateKeyGenerator and the way you save it. Then the ID comes back correct and works with bulk load :)
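If it helps anyone, this is roughly the shape of the destination that works for me now. I am not certain which part of the naming is the critical one, but naming the virtual field after the real surrogate key column (board_account_key) instead of a generic :key is what I believe did it:

destination :out, {
  :file => "../tmp/t_board_accounts.csv",
  :append => false,
  :natural_key => [ :id ],
  :scd_fields => [ :crawling_enabled, :max_candidates_per_day ],
  :scd => {
    :dimension_target => :datawarehouse,
    :dimension_table => "board_accounts",
    :type => 2
  }
},
{
  :order => [:board_account_key, :id, :crawling_enabled, :max_candidates_per_day],
  :virtual => {
    :board_account_key => ETL::Generator::SurrogateKeyGenerator.new(
      :query => 'SELECT MAX(board_account_key) FROM board_accounts',
      :target => :datawarehouse
    )
  }
}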

Bottom line: it works and now allows me to use the bulk load.

Emmanuel

Emmanuel

Oct 26, 2012, 1:40:36 PM
to activewareh...@googlegroups.com
More things to watch out for: putting a field from the input that is not coming back from the database output, and also boolean fields. Between MySQL and PostgreSQL booleans look different, which caused constant changes.
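One way to work around the boolean mismatch (just a sketch, not something provided by the gem) is to normalize boolean-ish values before they are compared, so that MySQL's 0/1 and PostgreSQL's 't'/'f' end up identical:

# Normalize boolean-ish values so rows coming from MySQL (0/1) and rows read
# back from PostgreSQL ('t'/'f', true/false) compare equal during change detection.
def normalize_bool(value)
  case value
  when true, 1, '1', 't', 'true'   then true
  when false, 0, '0', 'f', 'false' then false
  else value
  end
end

normalize_bool(1)    # => true
normalize_bool('f')  # => false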

The one downside of the current implementation is that it deletes all the records and re-inserts them the same way even if they are unchanged. So between the transform step and the load, you have no records or missing records. Just a gotcha to watch out for.

Normally it should be an update and an insert, which would ensure that no records are ever missing, though I understand it is slower because we cannot use the bulk load.
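For comparison, the update-and-insert version of the same type 2 change would look something like this (same caveats as before: a sketch against my example table, with made-up values and expiry column names):

require 'active_record'

conn = ActiveRecord::Base.connection  # assumes an existing connection to the warehouse

# Expire the current row in place instead of deleting it, so it never disappears...
conn.execute(<<-SQL)
  UPDATE board_accounts
     SET end_date = '2012-10-26'
   WHERE id = 42 AND end_date = '9999-12-31'
SQL

# ...then insert only the new current version.
conn.execute(<<-SQL)
  INSERT INTO board_accounts
    (board_account_key, id, crawling_enabled, max_candidates_per_day, effective_date, end_date)
  VALUES (19, 42, false, 250, '2012-10-26', '9999-12-31')
SQL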

Emmanuel