I second what Andrew said: I prefer to write idempotent ETL jobs, so that you can re-run the whole thing, or only a delta, with the same process. It's a bit harder to do properly, but it ends up much more flexible and much less error-prone.
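To make the idea concrete, here is a minimal sketch of an idempotent load in Python with sqlite3 (table and column names are made up for illustration): rows are keyed on a natural key, so re-running the whole job, or any subset of rows, leaves the target in the same state.

```python
import sqlite3

# Hypothetical dimension table keyed on a natural key (sku).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (sku TEXT PRIMARY KEY, name TEXT)")

def load(rows):
    # Upsert keyed on sku: inserting the same rows twice is a no-op,
    # so the job can be re-run safely (requires SQLite >= 3.24).
    conn.executemany(
        """INSERT INTO dim_product (sku, name) VALUES (?, ?)
           ON CONFLICT(sku) DO UPDATE SET name = excluded.name""",
        [(r[0], r[1]) for r in rows],
    )

load([("A1", "Widget"), ("B2", "Gadget")])
load([("A1", "Widget"), ("B2", "Gadget")])  # re-run: no duplicates
print(conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0])  # 2
```

The same `load` works for a full re-run or a delta, which is the flexibility mentioned above.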
On your dimension tables: can rows that were already inserted later have their columns changed, or is their data immutable? (Immutable is unlikely, usually.)
There's a tradeoff to be found between development time, processing time, and complexity, depending on your requirements:
- if you prefer a very simple mechanism, at the cost of re-extracting and re-importing the whole dimension (brute force), then you can do it that way (I've had jobs that took 10 hours each night, and it was perfectly fine)
- if you prefer something more elaborate, and your dimension data do not change once inserted, then you can track the last dimension id inserted and query the source only for rows beyond it
- if you want to insert or update only what changed, you'll have to rely on a timestamp in your source system to extract only the changed rows (e.g. updated_at > ? OR created_at > ?) and upsert based on that
- if you want to track deletes, you'll want to keep a list of deleted ids, or add a paranoid-style "deleted" flag column (soft delete) in the source
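The timestamp-based strategy above can be sketched like this (a toy example with sqlite3; the table, columns, and watermark value are all hypothetical): keep a watermark of the last successful run, and extract only rows created or updated after it.

```python
import sqlite3

# Hypothetical source table with created_at / updated_at audit columns.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE products "
    "(id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT, created_at TEXT)"
)
src.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    (1, "unchanged", "2024-01-01", "2024-01-01"),
    (2, "changed",   "2024-02-01", "2024-02-01"),
])

def extract_delta(watermark):
    # Pull only rows touched since the last successful run; the watermark
    # would be persisted between runs (file, table, etc.).
    return src.execute(
        "SELECT id, name, updated_at FROM products "
        "WHERE updated_at > ? OR created_at > ?",
        (watermark, watermark),
    ).fetchall()

print(extract_delta("2024-01-15"))  # only the row changed after the watermark
```

The extracted delta then feeds the upsert step, so re-running with an old watermark simply re-applies the same rows.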
If processing time is something you want to lower, you can:
- extract less (only what changed), to reduce extraction time
- upsert less, with strategies like computing and storing a checksum for each row in the target and skipping rows whose checksum hasn't changed
- use a caching strategy in the upsert process (like keeping the checksums in Redis, etc.)
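A rough sketch of the checksum idea, assuming an in-memory dict as a stand-in for Redis or a checksum column in the target (all names here are illustrative):

```python
import hashlib

# Stand-in for Redis or a checksum column on the target table.
checksums = {}

def row_checksum(row):
    # Stable digest of the row's content.
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

def upsert_if_changed(key, row, apply_upsert):
    digest = row_checksum(row)
    if checksums.get(key) == digest:
        return False  # content unchanged: skip the expensive upsert
    apply_upsert(key, row)
    checksums[key] = digest
    return True

writes = []
upsert_if_changed("A1", ("A1", "Widget"), lambda k, r: writes.append(r))
upsert_if_changed("A1", ("A1", "Widget"), lambda k, r: writes.append(r))     # skipped
upsert_if_changed("A1", ("A1", "Widget v2"), lambda k, r: writes.append(r))  # changed
print(len(writes))  # 2
```

Only rows whose content actually changed hit the database, which is where most of the upsert time usually goes.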
So this is really dependent on the setup and the tradeoffs you are ready to make.
With so many different situations, that's also why activewarehouse-etl does not attempt to answer them directly, but instead provides a more general framework :-)
Hope this helps!
Thibaut
--