The problem you describe is a general one.
You have items to be processed.
You have clients to process these items.
And you have #FAILure in the nature of the world :) That is: clients may fail.
And if a client fails after it popped the element, who will process it?!
You also require all of this to be atomic: if a client gets an element
from the queue and then fails, the element must reappear in the queue
atomically so that only the next client will get it.
> When designing software to use Amazon SQS you must take into account
> your requesting clients may see queue items more than once.
Yep, this may happen if a client takes too much time to complete the
task but is not actually dead, I guess.
> The question...??? In the same way you are offering EXPIRE on keys is
> there a simple solution to add a timer to a LIST item suspending it
> for a set time, if we don't return to delete the LIST item on
> completion of the task (ie. client failure) then have it reappear in
> the same position so a new client can collect and process the task..
Here EXPIRE is unfortunately not the solution, since the granularity is
the whole list.
And we currently don't have LOCK/UNLOCK, so it is not possible to
inspect the top element, add it to a temp queue, and then actually pop
it: it's not atomic.
If for your use case the order of processing does not matter, you can
use a whole Redis 'DB' as the queue.
Every time I want to add a new item I select DB0 and increment a
counter to get a unique key, so:
INCR myUniqueId => 100
SET item_100 "... value to be processed ..."
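For concreteness, here is a minimal producer sketch in Python, assuming
a redis-py style client; the add_task helper name is mine, the key
names are the ones used above:

    import redis

    r = redis.Redis(db=0)    # the whole DB0 acts as the queue

    def add_task(payload):
        task_id = r.incr('myUniqueId')           # INCR myUniqueId => 100
        r.set('item_%d' % task_id, payload)      # SET item_100 "... value to be processed ..."
        return task_id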
Then a client that wants to process something inside the queue (which
is actually the whole DB0) uses RANDOMKEY to get a key at random. Now
we have the name of a key:
RANDOMKEY => item_100
We now use RENAME, which is an atomic operation, to attach time and
status information to our entry:
RENAME item_100 processing_100_<current_unix_time>
If RENAME fails (the source key no longer exists) then we know that
another client already took this item, so GOTO RANDOMKEY ;) select a
new key, and so on. If instead RENAME succeeded we are happy.
We obtained a form of "locking".
We can now process the entry. You may wonder what happens if RANDOMKEY
returns a key in the form processing_...; we will see that in a
moment.
When our processing is done we can just remove
processing_100_<unixtime> and reiterate to process more stuff.
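Putting the happy path together, a hedged sketch of such a worker in
the same Python style as above; process() is just a placeholder, and
the processing_ entries are handled in the RECOVERY part below:

    import time
    import redis

    r = redis.Redis(db=0)

    def process(value):
        pass    # placeholder for the real task logic

    def work_once():
        key = r.randomkey()                      # RANDOMKEY => item_100 (or some other key)
        if key is None:
            return                               # the DB (our queue) is empty
        key = key.decode()
        if not key.startswith('item_'):
            return                               # e.g. a processing_ entry: see RECOVERY below
        item_id = key.split('_', 1)[1]
        lock_key = 'processing_%s_%d' % (item_id, int(time.time()))
        try:
            r.rename(key, lock_key)              # atomic: fails if another client renamed it first
        except redis.ResponseError:
            return                               # somebody else took it, just pick another key
        process(r.get(lock_key))                 # do the actual work
        r.delete(lock_key)                       # done: remove the entry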
RECOVERY
What happens if our client #Fails?
It can fail only in three different stages:
Fail #1: just after RANDOMKEY. Not a problem at all...
Fail #2: just after the RENAME. We will have processing_100_<unixtime>
in the DB forever.
Fail #3: just after the entry was processed, so not in time to issue
the DEL on the key.
These failures are simple to handle. Basically, when RANDOMKEY returns
processing_100_<unixtime> the client checks that this entry is not too
old. If it is too old, we try to move it into
processing_100_<new-unix-time> with RENAME. If that RENAME fails,
another client arrived before us. And so forth.
A client must be prepared to see the same item multiple times, like in
the case of Amazon, because in Fail #3 the key was actually processed
but not deleted.
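A sketch of that recovery check in the same style; the 60 second
timeout is an arbitrary value of mine, not something fixed by the
scheme:

    import time
    import redis

    r = redis.Redis(db=0)
    TIMEOUT = 60    # seconds; arbitrary, pick something larger than your longest task

    def try_recover(key):
        # key is a decoded string like 'processing_100_<unixtime>'
        _, item_id, started = key.split('_')
        if time.time() - int(started) < TIMEOUT:
            return None                          # probably still being worked on, leave it alone
        new_key = 'processing_%s_%d' % (item_id, int(time.time()))
        try:
            r.rename(key, new_key)               # atomically "steal" the stale entry
        except redis.ResponseError:
            return None                          # another client recovered it before us
        return new_key                           # we own it now: process it and then DEL it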
Unfortunately this does not solve the ordering problem, that is, you
have a "key space" and not a "queue". Sometimes that is nice; other
times you are happier processing the keys in a FIFO or LIFO fashion.
It depends on the kind of problem.
I think it is actually possible to find a way to get the ordering too;
I'll think a bit about it. Otherwise we can add some new primitive, as
general as possible, to allow this kind of thing to be done.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
Have a look at beanstalkd for a worker queue pretty similar to SQS.
(I like redis, but am not sure it's a good replacement for worker queues.)
Hello Jeremy,
I agree, there are systems specifically conceived to be worker queues
that will work better out of the box. But on the other hand I think
that Redis should export all the primitives to build such a queue if
the user needs one. This is useful since in many environments there
can be Redis servers holding data, and some need for worker queues,
and maintaining multiple systems can be hard.
The queue I described, even if it appears to use tricks, actually has
provable time and space complexity characteristics, is persistent
(automatically), can be replicated, and so on. But the fact that it's
not able to process elements in the order received is indeed not a
great thing. I absolutely want to add a primitive to let this happen,
or to find a way to implement one exploiting only things already
available.
Probably the simplest primitive that solves all these problems is:
LPOPSTORE and RPOPSTORE (placeholder names):
LPOPSTORE key destkey
pop an element from the list at key and store the value at destkey.
This primitive is probably enough to turn the proposed schema into one
with ordering. It is also faster, since adding a new task requires
only a single LPUSH/RPUSH call. For a worker, taking an element is a
bit more complex: it has to generate a unique key with INCR, then use
LPOPSTORE, process the key, and then DEL it. A special worker, or from
time to time any normal worker, will LPUSH timed-out keys back into
the queue if they are not processed in time.
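Purely to illustrate the proposal, a worker using this primitive might
look like the sketch below. LPOPSTORE does not exist in Redis, so this
is hypothetical and will not run against a stock server; the 'tasks'
list and the claimCounter key are names I made up:

    import time
    import redis

    r = redis.Redis(db=0)

    def take_task():
        claim_id = r.incr('claimCounter')                    # unique id for the destination key
        dest = 'processing_%d_%d' % (claim_id, int(time.time()))
        r.execute_command('LPOPSTORE', 'tasks', dest)        # proposed primitive, not in stock Redis
        value = r.get(dest)
        if value is None:
            return                                           # the 'tasks' list was empty
        # ... process value here ...
        r.delete(dest)                                       # done; a cleanup worker would LPUSH
                                                             # stale dest keys back onto 'tasks'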
Hello!
Sorry for the delay, I did a big release of a new web service in the
last couple of days and it was pretty hard... but tomorrow I can
study your proposal and reply. Sorry again!
Btw, this new site, which will have pretty high traffic, is using
Redis as a cache with a lot of EXPIREs, so in the next days we'll
start to get a feeling for the performance in a production
environment.
Regards,