Beginner questions!


Emmanuel

Oct 8, 2012, 6:23:03 AM
to activewareh...@googlegroups.com
Hi,

I am pretty new to this framework and trying to understand how it fully works, but I am not yet able to get everything working correctly. Before digging into the code, I would rather ask questions and see if I am missing something.

So I have a CSV file that I transform into another CSV, and then use the bulk import. My expectation was that it would import the first time, but that on the second run it would detect duplicates and not import those rows again.

Maybe my design is incorrect or I am using it incorrectly so feel free to let me know how else I should approach it.

Here are some questions I still have:
 
1) Does bulk import manage insert/update? If so, how do I configure it? For now I see that it keeps appending to my final database destination, so I have duplicates even though the intermediate transform looks correct.

2) How does SCD work? I cannot get it to work at all. I get no rows output in my intermediate CSV file (see my previous post about it).

3) Is there an example of how to take multiple sources and transform them into a single one (mostly in the case of dimensions being aggregated)?

4) How do you get fact tables filled in, since they use dimension surrogate keys? Do I need to use the resolver in those cases? Is there an example somewhere I could look at?

Thanks

Emmanuel


Thibaut Barrère

Oct 8, 2012, 3:04:49 PM
to activewareh...@googlegroups.com
Hello Emmanuel,

Here is what I can say - this is a bit of a braindump, so beware!

1) Does bulk import manage insert/update? If so, how do I configure it? For now I see that it keeps appending to my final database destination, so I have duplicates even though the intermediate transform looks correct.

The bulk processor itself relies on the "insert only" bulk features of the databases (see adapter_extensions on GitHub) and makes no attempt to upsert instead of insert.

If you need to upsert, here are a couple of possibilities:
- slow but easy: map an ActiveRecord model to your destination and rely on first_or_create or similar
- much faster and still easy: use https://github.com/zdennis/activerecord-import (e.g. with ON DUPLICATE KEY UPDATE if you target MySQL: https://github.com/zdennis/activerecord-import/wiki/MySQL:-On-Duplicate-Key-Update-Support - not sure about PG as a target, though)
- build custom SQL using your store's upsert capabilities (e.g. http://vibhorkumar.wordpress.com/2011/10/26/upsertmerge-using-writable-cte-in-postgresql-9-1/)
- you could also mix the "update_database_destination" with bulk insert (using bulk for inserts only, and the slower update destination for updates), but this would probably require two passes
- there is an "insert_update_database_destination" too, though it is probably a bit slow
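To make the "upsert" semantics concrete, here is a small plain-Ruby simulation of what a duplicate-key update does (the model and column names in the comment are made up for illustration, not from your project):

```ruby
# With activerecord-import on MySQL, an upsert would look something like
# (hypothetical model and columns):
#
#   Order.import rows, on_duplicate_key_update: [:amount, :updated_at]
#
# The behaviour it gives you can be simulated in plain Ruby with a hash
# keyed by the natural key:
def upsert(store, rows, key:)
  rows.each do |row|
    k = row.fetch(key)
    if store.key?(k)
      store[k].merge!(row)   # duplicate key: update the existing row
    else
      store[k] = row.dup     # new key: plain insert
    end
  end
  store
end

db = {}
upsert(db, [{ id: 1, amount: 10 }, { id: 2, amount: 20 }], key: :id)
# running the same feed again with changed data updates instead of
# appending a duplicate:
upsert(db, [{ id: 1, amount: 15 }], key: :id)
```

This is exactly the behaviour a plain bulk insert does not give you, which is why the second run of your job appends duplicates.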

Basically, it really depends on the upsert capabilities of your target datastore (if any).

Sidenote on handling upserts (recalling a note you sent me privately): here are common patterns for handling incremental loading:
- if your input data is immutable (it doesn't change once you have received it), you can just store an incremental id or timestamp and ask the source store for records more recent than that id/timestamp; bulk insert will do marvels here
- if your input data is mutable, you'll have to rely on modified_at > x plus created_at > x or similar, and use proper upsert processing
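The two patterns above can be sketched in plain Ruby (the record layout and checkpoint values are made up for illustration):

```ruby
# Hypothetical record shape: an id plus created/modified timestamps
# (plain integers here to keep the sketch self-contained).
Record = Struct.new(:id, :created_at, :modified_at, keyword_init: true)

# Immutable input: remember the highest id (or timestamp) seen so far
# and ask only for anything newer - bulk insert handles the rest.
def new_records(source, last_seen_id)
  source.select { |r| r.id > last_seen_id }
end

# Mutable input: anything created OR modified since the checkpoint has
# to go through upsert processing.
def changed_records(source, since)
  source.select { |r| r.created_at > since || r.modified_at > since }
end

rows = [
  Record.new(id: 1, created_at: 10, modified_at: 25),  # old row, edited since
  Record.new(id: 2, created_at: 15, modified_at: 15),  # old row, untouched
  Record.new(id: 3, created_at: 30, modified_at: 30)   # brand-new row
]

new_records(rows, 2)        # only the brand-new row
changed_records(rows, 20)   # the new row plus the edited one
```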

Be extra careful to handle errors properly here, i.e. you'll want to restart the whole process if something goes wrong, to make sure you don't miss any record or record update.
 
2) How does SCD work? I cannot get it to work at all. I get no rows output in my intermediate CSV file (see my previous post about it).

No idea on this one - I have no knowledge of this specific part of the gem. 

3) Is there an example of how to take multiple sources and transform them into a single one (mostly in the case of dimensions being aggregated)?

I don't think there is such an example available, but here is a common pattern:
- create one .ctl file per source/format, and make sure it outputs data in a common, conformed format as its destination (to a file, for instance)
- then create a "global" .ctl that performs the shared processing of those commonly formatted files and inserts/upserts into the target store
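From memory, the two-stage pattern above might look roughly like this in the control-file DSL (file names, field lists, and options are all made up for illustration - double-check against the gem's docs):

```ruby
# source_a.ctl - hypothetical per-source control file: read one input
# format and write it out in the common, conformed format.
source :in, { file: 'source_a.csv', parser: :csv }, [:id, :name, :amount]

# ... any source-specific transforms go here ...

destination :out, { file: 'conformed/source_a.csv' },
  order: [:id, :name, :amount]
```

```ruby
# load.ctl - hypothetical "global" control file: shared processing of
# the conformed files, then load into the target store.
source :in, { file: 'conformed/source_a.csv', parser: :csv },
  [:id, :name, :amount]

# ... shared transforms for all conformed sources go here ...

destination :out, { file: 'final.csv' }, order: [:id, :name, :amount]

# From memory, the bulk load into the warehouse is a post_process step;
# verify the exact options against the gem before relying on this.
post_process :bulk_import, {
  file: 'final.csv',
  target: :warehouse,
  table: 'my_dimension'
}
```

The point is that each per-source file only worries about normalizing its own input, and the global file only ever sees one conformed format.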
 
4) How do you get fact tables filled in, since they use dimension surrogate keys? Do I need to use the resolver in those cases? Is there an example somewhere I could look at?

Each fact table record will have to have its dimension references transformed into surrogate keys using a resolver.
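What a resolver does can be sketched in plain Ruby (the class and method names below are illustrative, not the gem's actual API):

```ruby
# Illustrative sketch only: a resolver maps a natural key on the fact
# row to the surrogate key assigned when the dimension was loaded.
class DimensionResolver
  # dimension: a natural key => surrogate key mapping, e.g. built by
  # querying the dimension table after it has been populated.
  def initialize(dimension)
    @dimension = dimension
  end

  # Replace the natural-key field on a fact row with the surrogate key.
  def resolve(fact_row, natural_field, surrogate_field)
    row = fact_row.dup
    row[surrogate_field] = @dimension.fetch(row.delete(natural_field))
    row
  end
end

customers = { 'ACME' => 42 }   # natural key => surrogate key
resolver  = DimensionResolver.new(customers)
fact = resolver.resolve({ customer_code: 'ACME', amount: 99 },
                        :customer_code, :customer_key)
```

Note the consequence for job ordering: the dimension has to be loaded before the fact job runs, or the lookup has nothing to resolve against.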

Hope this helps!

Thibaut
--
