Hi
I am not active ActiveWarehouse-ETL user but I am interested in multithreaded ETL in Ruby therefore wanted to share experience in this area.
I am now experimenting with multithreaded ETL process in JRuby using standard Thread and Queue classes:
- Create input and output Queue objects (in my case I use SizedQueue to block producers of messages if consumers are not fast enough)
- Create pool of worker threads which all pop messages from input queue and put results in output queue. Each worker thread has its own database connection.
- Create results thread which processes output queue (as it is single thread it can reduce results to something else, e.g. sum or count)
- Main thread pushes new work for processing to input queue
- Main thread after pushing all work pushes end-of-work messages to input queue which will finish worker threads
- Main thread waits while all worker threads finish
Currently it is more or less working but still does not scale linearly because of some blocking events
- When doing upsert to dimension tables I need to ensure that it is not done in parallel in two different threads for the same dimension entry (e.g. if I extract customer dimension entry from flat input row and the same customer can appear in two different rows). Currently I create Mutex for each dimension and when doing upsert in that dimension then do it in synchronize block for this Mutex to ensure that at the same time just one thread can do upsert for this dimension. But this of course introduces waits on this Mutex lock.
- Have not found yet the best way how to tune bulk inserts in fact table. E.g. if in MySQL I do large transactions with many inserts in fact table from many parallel threads then I start to get "Deadlock found when trying to get lock" errors.
Other option that I think would be worth to explore is to create separate actor (thread which sequentially processes messages from other actors) for each dimension and each fact table and also each actor would have a separate database connection. And then all inserts/updates to one dimension or fact table would go through just through this actor and as a result all updates/inserts in database for one table will come just from one database connection. I want to check if this approach could reduce waits on locks at the database side.
Anyone else can recommend best practices how to do multithreaded ETL (in Ruby :)) ?
Kind regards,
Raimonds