Help with Concurrency Design


Vishnu Iyengar

Mar 30, 2015, 5:21:44 PM3/30/15
to concurr...@googlegroups.com
Hi everyone, 
  I have an app right now that just uses the built-in Ruby Queue and threads. However, I've come across a new issue and have to rethink how to achieve this. Since concurrent-ruby provides a host of options, I'm not sure of the best way to do it.

My app is a script that, when triggered, scans a bunch of files in a directory to calculate their hashes, communicates with a server to get some additional data about each file, and then renames or moves these files. Since scanning is I/O-intensive and disk-bound, and since the external API I'm using restricts usage to a single logged-in session at any time, I designed each of those stages to run on its own single thread. So my current design is as follows (rough sketch after the list):
0) Generate a global count of the number of items to process (say n)
1) Thread 1 obtains a list of files to scan and scans them serially and dumps the result into queue A
2) Thread 2 picks up items from queue A until it has picked up n, communicates with the back end server and once it has obtained the data, dumps the result into queue B
3) Thread 3 picks up items from queue B until it has picked up n and processes the files. 
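
Here's a minimal sketch of that pipeline using only the standard library; scan_dir, fetch_metadata and process_file are placeholders for the actual scan directory, server call and rename/move logic, not names from my code:

require 'digest'

scan_dir = ARGV.fetch(0, '.')                      # placeholder: directory to scan
files    = Dir.glob(File.join(scan_dir, '**', '*')).select { |f| File.file?(f) }
n        = files.size                              # 0) global count of items to process
queue_a  = Queue.new
queue_b  = Queue.new

scanner = Thread.new do                            # 1) scan serially, dump results into queue A
  files.each { |path| queue_a << [path, Digest::SHA256.file(path).hexdigest] }
end

api_client = Thread.new do                         # 2) pick up n items, enrich via the server
  n.times do
    path, hash = queue_a.pop
    queue_b << [path, hash, fetch_metadata(hash)]  # placeholder: the single logged-in API session
  end
end

processor = Thread.new do                          # 3) pick up n items and rename/move the files
  n.times do
    path, hash, meta = queue_b.pop
    process_file(path, hash, meta)                 # placeholder: rename/move/duplicate handling
  end
end

[scanner, api_client, processor].each(&:join)

The fixed count n is what lets each downstream thread know when to stop, which is exactly the property the resubmission idea below would break.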

Currently, whenever Thread 3 discovers that a file it is handling is a duplicate, it simply moves the new file to a special duplicate location with a unique name for post-processing by a human. However, since this has occurred often enough, it now makes sense to see whether some automatic remediation can happen when there is a duplicate. To do so, the file it duplicates has to be scanned and identified, and then my system can use the extended information to mediate between the two files when enough information is available. I'm looking for recommendations for a redesign that will let me handle this. One approach would be a model where Thread 3 resubmits a job to the beginning of the pipeline for the original of the duplicate file, and then handles the response as a remediation task when it comes back. However:
1) I no longer have an easy way to identify the end of the work stream on each of these threads.
2) I open myself up to a race condition when there are two duplicates in the work stream.



Chris Seaton

Mar 31, 2015, 12:08:01 AM3/31/15
to Vishnu Iyengar, concurr...@googlegroups.com
Maybe you need some kind of priority work queue - you’ve got a work list that might have many items in it and may be growing, and you want to add a work item to that queue but have it handled before the existing items.

That could be something to implement inside the library. In fact a general framework for worklists or streaming might be useful.

In the meantime, how about two work queues - high and low priority. The thread tries to take from high first; if there’s nothing there, it takes from low.

As for duplicates, maintain a set of completed work?
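
Something like this, roughly, using the stdlib Queue and Set (the names and the job shape are illustrative, not an existing concurrent-ruby API):

require 'set'

high  = Queue.new                  # resubmitted "find the original" jobs go here
low   = Queue.new                  # the normal work stream
done  = Set.new                    # hashes of completed work
mutex = Mutex.new                  # Set isn't thread-safe, so guard it

def next_job(high, low)
  high.pop(true)                   # non-blocking pop; raises ThreadError if empty
rescue ThreadError
  low.pop                          # nothing high priority, block on the low queue
end

worker = Thread.new do
  loop do
    job = next_job(high, low)
    break if job == :shutdown
    seen_before = mutex.synchronize { !done.add?(job[:hash]) }   # assumes jobs carry a :hash key
    next if seen_before            # duplicate of completed work, skip it
    handle(job)                    # placeholder for the real processing
  end
end

One caveat with the naive two-queue take: once the worker blocks on the low queue, a high-priority item that arrives later has to wait until the next iteration. A timed pop, or a single genuinely prioritised queue, avoids that.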

Chris


Vishnu Iyengar

Apr 1, 2015, 9:32:44 PM4/1/15
to concurr...@googlegroups.com, pat...@gmail.com
I thought about using a priority queue (and there seems to be one inside concurrent-ruby), but going down this road requires more than just a priority queue to pre-empt work:
1) I'll also need a new workflow so that work from different sources can be requested (by adding a job to the queue) and the result can then be returned to the corresponding source, which will mean introducing promises, for example.
2) I also no longer have a way to determine when all the work is done and my scheduler can exit (see the sketch below).
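
One way I might handle point 2 is to count outstanding jobs with Concurrent::AtomicFixnum: increment on every submit or resubmit, decrement only after a job is fully handled, and let the scheduler exit once input is exhausted and the counter returns to zero. A rough sketch, with initial_jobs and handle as placeholders:

require 'concurrent'

outstanding = Concurrent::AtomicFixnum.new(0)
input_done  = Concurrent::AtomicBoolean.new(false)
queue       = Queue.new

submit = lambda do |job|
  outstanding.increment
  queue << job
end

worker = Thread.new do
  loop do
    job = queue.pop
    break if job == :shutdown
    handle(job)                 # placeholder; may call submit.call(...) to resubmit a duplicate's original
    outstanding.decrement       # decrement after handle, so a resubmit can't let the count hit zero early
  end
end

initial_jobs.each { |j| submit.call(j) }   # placeholder: the initial scan results
input_done.make_true

sleep 0.1 until input_done.true? && outstanding.value.zero?
queue << :shutdown
worker.join

Because any resubmission happens inside handle, before the matching decrement, the counter can only reach zero when nothing is queued and nothing is in flight.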

Thanks for the ideas though :)