Hi everyone,
I have an app that currently just uses Ruby's built-in Queue and Thread classes, but I've run into a new requirement and have to rethink how to achieve this. Since concurrent-ruby provides a host of options, I'm not sure of the best way to do it.
My app is a script that, when triggered, scans a bunch of files in a directory to calculate their hashes, communicates with a server to get some additional data about each file, and then renames or moves the files. Since scanning is IO intensive and highly disk dependent, and since the external API I'm using restricts usage to a single logged-in session at any time, I designed each of these stages to run on a single thread. So my current design is as follows.
0) Generate a global count of the number of items to process (say n)
1) Thread 1 obtains a list of files to scan and scans them serially and dumps the result into queue A
2) Thread 2 picks up items from queue A until it has picked up n, communicates with the back-end server, and once it has obtained the data, dumps the result into queue B
3) Thread 3 picks up items from queue B until it has picked up n and processes the files.
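To make the current design concrete, here is a rough sketch of the pipeline using only the built-in Queue and Thread classes; scan_file, fetch_metadata, and process_file are hypothetical stand-ins for the real scanning, API, and rename/move work:

```ruby
# Sketch of the current three-stage design. The three stand-in methods
# below are placeholders, not my real implementations.
def scan_file(path)
  path.sum.to_s(16) # placeholder for the real (expensive) hash calculation
end

def fetch_metadata(hash)
  { hash: hash } # placeholder for the single-session API call
end

def process_file(item)
  item # placeholder for the rename/move step
end

files = %w[a.mkv b.mkv c.mkv] # stand-in for the directory listing
n = files.size                # 0) global count of items to process

queue_a = Queue.new
queue_b = Queue.new
results = Queue.new

# 1) Thread 1 scans the files serially and dumps results into queue A
scanner = Thread.new do
  files.each { |path| queue_a << { path: path, hash: scan_file(path) } }
end

# 2) Thread 2 picks up n items from queue A, enriches them via the API,
#    and dumps the results into queue B
enricher = Thread.new do
  n.times do
    item = queue_a.pop
    queue_b << item.merge(meta: fetch_metadata(item[:hash]))
  end
end

# 3) Thread 3 picks up n items from queue B and processes the files
processor = Thread.new do
  n.times { results << process_file(queue_b.pop) }
end

[scanner, enricher, processor].each(&:join)
```

Each thread knows it is done once it has handled exactly n items, which is what breaks down once jobs can be resubmitted.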
Currently, whenever Thread 3 discovers that a file it is handling is a duplicate, it simply moves the new file to a special duplicate location with a unique name, for post-processing by a human. This has now occurred often enough that it makes sense to see whether some automatic remediation can happen when there is a duplicate. To do so, the file it is a duplicate of has to be scanned and identified, and then my system can use the extended information to mediate between the two files when enough information is available. I'm looking for recommendations for a redesign that will let me handle this. One approach would be a model where Thread 3 can resubmit a job to the beginning of the pipeline for the original of the duplicate file, and then, when it receives the response, handle it as a remediation task. However, 1) I no longer have an easy way to identify the end of the work stream on each of these threads, and 2) I open myself up to a race condition when there are two duplicates of the same file in the work stream.
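To make the two problems concrete, here is a rough sketch of the bookkeeping I imagine the resubmission approach would need, using only the stdlib (Mutex plus Set); WorkTracker and all the method names here are made up, not anything from concurrent-ruby:

```ruby
require "set"

# Idea: replace the fixed count n with an in-flight counter (problem 1),
# and use a claim set so that only the first duplicate of a given original
# triggers a resubmission (problem 2). Names are hypothetical.
class WorkTracker
  def initialize
    @mutex       = Mutex.new
    @in_flight   = 0
    @resubmitted = Set.new
  end

  # Called whenever a job enters the pipeline (initial scan or resubmission).
  def submit
    @mutex.synchronize { @in_flight += 1 }
  end

  # Called when a job leaves the pipeline; true means all work is finished,
  # so the threads could be told to shut down (e.g. via a sentinel value).
  def finish
    @mutex.synchronize { (@in_flight -= 1).zero? }
  end

  # True only for the first resubmission of a given original file; a second
  # duplicate of the same file in the work stream would get false and could
  # fall back to the manual duplicate location.
  def claim_resubmit(original_hash)
    @mutex.synchronize { !@resubmitted.add?(original_hash).nil? }
  end
end
```

With something like this, Thread 3 would call claim_resubmit before pushing the original back to Thread 1, and every stage would call finish as jobs terminate instead of counting to n; but I'm not sure whether this is the right direction or whether concurrent-ruby has a better primitive for it.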