multithread ETL


Oren Mazor

Feb 13, 2013, 11:37:45 AM
to activewareh...@googlegroups.com
Hey guys, 

In building the engine, was the idea of multithreading row processing ever considered? I.e. having 4-10 workers crunching through your source.each_with_index?

I started a naive implementation of this here: https://github.com/orenmazor/activewarehouse-etl/tree/zeromq_workers

The high-level overview: use a ZeroMQ pipeline with a distributor funneling [row, index] pairs to each worker, which ends up calling process_row. You also have to move the writing to destinations into the sink, so that only one thread writes to the output file, for sanity's sake.
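To make the shape concrete, here is the same distributor -> workers -> single-sink pipeline sketched with plain Ruby Threads and Queues rather than the ZeroMQ sockets the branch actually uses; source, process_row and the output path are stand-ins for the engine's real objects:

require 'thread'
require 'csv'

WORKER_COUNT = 4
work_q   = SizedQueue.new(1_000)  # distributor -> workers
result_q = SizedQueue.new(1_000)  # workers -> sink

# Worker pool: each worker transforms rows independently.
workers = WORKER_COUNT.times.map do
  Thread.new do
    loop do
      job = work_q.pop
      break if job == :done
      row, index = job
      result_q.push([process_row(row), index])  # process_row stands in for the real transform
    end
  end
end

# Single sink: the only thread that touches the output file.
sink = Thread.new do
  CSV.open('output.csv', 'w') do |csv|
    loop do
      msg = result_q.pop
      break if msg == :done
      transformed, _index = msg  # index is carried along in case the sink wants to reorder
      csv << transformed
    end
  end
end

# Distributor: the main thread funnels [row, index] pairs to the workers.
source.each_with_index { |row, index| work_q.push([row, index]) }
WORKER_COUNT.times { work_q.push(:done) }
workers.each(&:join)
result_q.push(:done)
sink.join

Note that under MRI the GIL means this mainly pays off when process_row is doing IO (lookups, external calls); on JRuby the workers can actually run on separate cores.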

Thoughts? I have a strong suspicion that there are things in the engine that are not going to play well with this idea.

Raimonds Simanovskis

Mar 17, 2013, 5:20:56 AM
to activewareh...@googlegroups.com
Hi

I am not an active ActiveWarehouse-ETL user, but I am interested in multithreaded ETL in Ruby, so I wanted to share my experience in this area.
I am the author of the mondrian-olap library (https://github.com/rsim/mondrian-olap) and I have my own custom-built ETL in Ruby that I use in my https://eazybi.com application.

I am now experimenting with a multithreaded ETL process in JRuby using the standard Thread and Queue classes (a rough sketch follows the list):
- Create input and output Queue objects (in my case I use SizedQueue to block producers of messages if consumers are not fast enough)
- Create a pool of worker threads which all pop messages from the input queue and put results in the output queue. Each worker thread has its own database connection.
- Create a results thread which processes the output queue (as it is a single thread, it can reduce results to something else, e.g. a sum or count)
- The main thread pushes new work for processing to the input queue
- After pushing all work, the main thread pushes end-of-work messages to the input queue, which will finish the worker threads
- The main thread waits until all worker threads finish
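A minimal sketch of that setup (the ActiveRecord connection handling and the process_message / rows names are assumptions on my side, not the actual eazyBI code):

require 'thread'

WORKER_COUNT = 4
input_q  = SizedQueue.new(500)   # SizedQueue blocks the producer if workers fall behind
output_q = SizedQueue.new(500)

# Worker pool: each thread checks out its own database connection
# (assuming ActiveRecord here; substitute whatever the ETL actually uses).
workers = WORKER_COUNT.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      loop do
        message = input_q.pop
        break if message == :end_of_work
        output_q.push(process_message(message))  # placeholder transform/load step
      end
    end
  end
end

# Results thread: a single consumer, so it can safely reduce results,
# e.g. keep a running count of loaded rows.
results = Thread.new do
  total = 0
  loop do
    result = output_q.pop
    break if result == :end_of_results
    total += result  # assuming process_message returns a row count
  end
  total
end

rows.each { |row| input_q.push(row) }            # main thread pushes the work
WORKER_COUNT.times { input_q.push(:end_of_work) }
workers.each(&:join)                             # wait for all workers to finish
output_q.push(:end_of_results)
puts "loaded #{results.value} rows"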

Currently it is more or less working, but it still does not scale linearly because of some blocking events:
- When doing upserts to dimension tables I need to ensure that the same dimension entry is not upserted in parallel by two different threads (e.g. I extract the customer dimension entry from a flat input row, and the same customer can appear in two different rows). Currently I create a Mutex for each dimension and, when doing an upsert into that dimension, do it in a synchronize block for this Mutex, so that at any time only one thread can upsert into this dimension (sketched after this list). But this of course introduces waits on this Mutex lock.
- I have not yet found the best way to tune bulk inserts into the fact table. E.g. in MySQL, if I do large transactions with many inserts into the fact table from many parallel threads, then I start to get "Deadlock found when trying to get lock" errors.
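For reference, the per-dimension Mutex approach looks roughly like this (just a sketch; the dimension names and the upsert_dimension_row helper are made up):

# One Mutex per dimension, created up front so the hash itself
# never has to be mutated from multiple threads.
DIMENSION_MUTEXES = %w(customers products time).each_with_object({}) do |dim, mutexes|
  mutexes[dim] = Mutex.new
end

def upsert_dimension(dimension, attributes)
  # Only one thread at a time may upsert into a given dimension, otherwise
  # two input rows mentioning the same customer could race each other.
  DIMENSION_MUTEXES[dimension].synchronize do
    upsert_dimension_row(dimension, attributes)  # hypothetical insert-or-update helper
  end
end

The lock is per dimension, so upserts into different dimensions still run in parallel; only concurrent upserts into the same dimension wait on each other.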

Another option that I think would be worth exploring is to create a separate actor (a thread which sequentially processes messages from other actors) for each dimension and each fact table, with each actor having its own database connection. All inserts/updates to one dimension or fact table would then go only through that actor, so all updates/inserts for one table would come from a single database connection. I want to check whether this approach could reduce waits on locks on the database side.
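Sketched in Ruby, such an actor could just be a thread plus a queue that owns its own connection (open_connection and apply_message are placeholders, and the table names are examples):

require 'thread'

# A minimal actor: a queue and a thread that drains it sequentially,
# using a database connection that no other thread ever touches.
class TableActor
  def initialize(table_name)
    @queue  = SizedQueue.new(1_000)
    @thread = Thread.new do
      connection = open_connection(table_name)  # placeholder: one connection per actor
      loop do
        message = @queue.pop
        break if message == :stop
        apply_message(connection, message)      # placeholder: perform the insert/upsert
      end
    end
  end

  def tell(message)
    @queue.push(message)
  end

  def stop
    @queue.push(:stop)
    @thread.join
  end
end

# One actor per dimension and fact table; all writes to a table are routed
# through its actor, so each table is written from a single connection.
actors = %w(customers products sales_facts).each_with_object({}) do |table, hash|
  hash[table] = TableActor.new(table)
end

actors['customers'].tell({ op: :upsert, values: { name: 'ACME' } })
actors.each_value(&:stop)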

Can anyone else recommend best practices for doing multithreaded ETL (in Ruby :))?

Kind regards,
Raimonds

Oren Mazor

Apr 13, 2013, 12:36:34 AM
to activewareh...@googlegroups.com
Hey Raimonds,

Small world! We used your mondrian-olap library for a while, but then had to step away from JRuby for performance reasons.

If I understand your problems correctly, why not create a separate thread just for writing to your destination? That way your extraction/transformation workers are not blocked and you don't have duplication issues (plus, no mutexes per dimension).

My transformations always output to CSV so that I can do a bulk load from the command line. This part is not multithreaded for me.
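A minimal sketch of that arrangement, assuming the transformation workers push finished rows onto a shared queue (the queue, file name and bulk-load step below are placeholders):

require 'thread'
require 'csv'

row_queue = SizedQueue.new(1_000)

# The only thread that touches the destination: it drains the queue into a
# CSV file that is bulk-loaded afterwards from the command line (e.g. with
# MySQL's LOAD DATA INFILE or PostgreSQL's COPY, whichever fits your database).
writer = Thread.new do
  CSV.open('transformed.csv', 'w') do |csv|
    loop do
      row = row_queue.pop
      break if row == :done
      csv << row
    end
  end
end

# ... transformation workers call row_queue.push(transformed_row) ...

row_queue.push(:done)
writer.join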

Great job on that OLAP query builder object, by the way!