For a node to shut down, it merely needs to check that the maximum
queue size in the global queue size zset is 0.
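In Jedis, that check could look something like this (untested sketch;
the "queue:sizes" key name is made up, and I'm assuming the zset maps
queue name -> current queue size):

    import redis.clients.jedis.Jedis;

    class ShutdownCheck {
        // Safe to shut down when no queue has a recorded size above 0,
        // i.e. the maximum score in the sizes zset is 0.
        static boolean safeToShutDown(Jedis jedis) {
            return jedis.zcount("queue:sizes", "(0", "+inf") == 0;
        }
    }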
- Josiah
With the method I described, yes.
> I wanted to avoid this polling and let the global termination controller do
> the necessary communication because each processing node already has
> sufficient tasks to do.
You've got a global controller that is pinging every queue to
determine how much work they have to do? Even if you remember to poll
all of your queues multiple times, you may be stopping queues
prematurely due to race conditions, etc. (the "C" part of CAP). Is
there any particular reason why you are using a queue on every box
instead of a single queue that they all make requests from
(something like queue items being large, so slow to transfer across
the network multiple times)?
- Josiah
Since the TC needs to ask for a restart, you don't need to worry about
timestamps. When the queue processor starts up, it inserts its
process id into Redis at a known key. When the TC checks, all it needs
to do is to verify that the queue processor is still running (checking
the pid on the box), and if not, whether there are any items in the
local queue. If there are items, it starts the local queue processor
back up.
No timestamps necessary, no proxy necessary.
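A rough sketch of both sides (the key names and the Java 9+
ProcessHandle liveness check are my own choices; any way of checking
the pid on the box works):

    import redis.clients.jedis.Jedis;

    class Watchdog {
        // Queue processor side: on startup, record our pid at a known key.
        static void registerPid(Jedis jedis) {
            long pid = ProcessHandle.current().pid();
            jedis.set("processor:pid", String.valueOf(pid));
        }

        // TC side: restart only if the processor died with work still
        // queued. Assumes the TC runs on the same box as the processor.
        static void checkAndRestart(Jedis jedis, Runnable restart) {
            String pid = jedis.get("processor:pid");
            boolean alive = pid != null
                && ProcessHandle.of(Long.parseLong(pid))
                       .map(ProcessHandle::isAlive).orElse(false);
            if (!alive && jedis.llen("queue:local") > 0) {
                restart.run();
            }
        }
    }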
>>> Is there any particular reason why you are using a queue on every box
> Data needs to spread across the cluster for scalability reasons and so each
> node maintains that data in the form of queues.
"Scalability reasons" is pretty ambiguous. A few thousand chunks of
data a second? Multi-kilobyte/megabyte chunks of data being passed
around? Too much data to be stored on a single node at once?
I only ask because having another piece of software controlling the
queue processors (starting them back up if necessary, if I understand
your architecture correctly) is yet another place where little bugs
can creep in.
Regards,
That is interesting. You have 600k queues? Why so many? Does each
queue processor take 1 queue, or do they handle many queues? Why not
just 1 queue per host? Can you tag your data so that you don't have as
many queues? Again, how much data is being passed through? Thousands
of items a second? How big is each item to be processed?
Even at 600k/20, that's 30k queues per host. That's still a lot of
keys to be checking for queue items. I'm sure there's a way to cut
that down.
- Josiah
When I say "queue", I mean a single list in Redis (or a zset, if you
do prioritization).
When I say "queue item", "item", or "data", I mean a string that has
information to process that piece of information. Something like the
json string: "{'method':'aggregate', 'name':'counter_1', 'value':24}",
which may, for example, mean that a counter needs to be incremented by
24.
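In other words, a queue plus its items is nothing more than (key name
made up):

    import redis.clients.jedis.Jedis;

    class QueueBasics {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost", 6379);
            // Producer: push a work item onto the tail of the list.
            jedis.rpush("queue:work",
                "{'method':'aggregate', 'name':'counter_1', 'value':24}");
            // Consumer: pop the oldest item from the head of the list.
            String item = jedis.lpop("queue:work");
            System.out.println("processing: " + item);
        }
    }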
On Tue, Apr 19, 2011 at 10:26 AM, Raghava Mutharaju
<m.vijay...@gmail.com> wrote:
> One of the big datasets we have would generate that number of queues. The
> goal is to handle even bigger datasets.
In everything that I've used Redis (and queues in general) for, the
only time I've ever needed to create more than one queue is in the
case where I was pushing so many items on the queues that one box
couldn't handle it. That said, I have moved 1000+ items per second
through a single Redis queue sustained for weeks (over 1 billion items
processed without issue). When I did use multiple instances of Redis,
it was because I needed to move 50k items/second.
>>> Does each queue processor take 1 queue, or do they handle many queues?
> It (application process running on each node) handles many queues -- all the
> queues in that particular node. In the case of 600k/20, it should process
> 30k queues, i.e. there would be 30k keys and each key would have several
> values (in 100s or 1000s).
> Also note that not all the queues are for processing. Some of them are
> helper queues -- so their elements need not be processed. I haven't actually
> calculated the ratio of processing vs helper queues but processing queues
> could be 60%-70% of it.
I am still confused as to why you need so many queues. Do you have
priorities? Do you have one queue for each different type of item to
be processed?
>>> That's still a lot of keys to be checking for queue items
> Do you think that this is a lot of load on one node? The number of nodes
> could be expanded using third party cloud services (Amazon EC2 etc).
That many queues means that you have to iterate over a lot of keys
just to pull items out. That seems like a waste to me.
>>> Why not just 1 queue per host?
> That would be too little load on 1 host (node), wouldn't it?
Not if you take all of the items in all of the queues and put them in the 1 queue.
>>> Can you tag your data so that you don't have as many queues?
> I am already using some tags. In Jedis, there is this feature called
> key-tags which I am making use of, so that related queues would be assigned
> to one node.
That's a different thing. What I mean is that if you have a json work
item like "{'name':'counter_1', 'count':24}" and you would typically
place it in the "queue:aggregate" queue, you could alter the data to
be "{'queue':'aggregate', 'name':'counter_1', 'count':24}" and put
that work item along with all of the others in a single queue.
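Sketched out (untested; the "queue:all" key and the counter handling
are just my reading of the example above, using org.json to parse the
item back out):

    import org.json.JSONObject;
    import redis.clients.jedis.Jedis;

    class SingleQueueDispatch {
        static void consumeOne(Jedis jedis) {
            String raw = jedis.lpop("queue:all");
            if (raw == null) return;  // nothing queued
            JSONObject item = new JSONObject(raw);
            // Each item carries its own routing information.
            if ("aggregate".equals(item.getString("queue"))) {
                jedis.incrBy("counter:" + item.getString("name"),
                             item.getLong("count"));
            }
            // ... dispatch on other queue names here
        }
    }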
>>> Again, how much data is being passed through?
> Is it the data that is passed around among the queues of different nodes?
Single node, all of your nodes, either one, as long as you also say
which number you are giving. Bytes per item, number of items, desired
throughput (number of items processed per second), the amount of
processing time (ignoring queue time) necessary per item, etc. With
that information, there is some balancing that can be done to get you
where you need to be.
>>> I'm sure there's a way to cut that down.
> That would certainly help. We are looking into it, on how to cut down the
> number of queues.
Regards
This is where the confusion was stemming from. I see "queues" as a way
of organizing tasks to be processed only. In the case of Redis, that
means using RPUSH and LPOP on the LIST structure. You are actually
using SETs to model relationships (paternal and maternal ancestry) as a
sort of directed graph, while at the same time wanting their members
to be processed like queue items.
Different terminology, and different use-cases.
> There is certain kind of relationship between the values and keys. During
> the processing of queues, I would need something like for a person p2, get
> me all the values. If I mix up all the values and put them in one very large
> queue then the relationships are lost. Although, this wouldn't be the case if
> I tag each item with the key but I wouldn't be able to query in the way I
> specified above (get all values for p2).
Looking back on your earlier problem description, you do the following:
1. pull all related items for a person, maybe inserting other items in
other sets
2. when all to-be-processed sets have been processed across all nodes,
shut down your processors
Question: how does each of your nodes discover which sets need to be
processed, and how do you know what order to process them in?
Regards,
- Josiah
> During processing, values could be inserted back into the queues. So
> multiple instances of my application across many nodes would be accessing
> only 1 queue.
>>> When I did use multiple instances of Redis, it was because I needed to
>>> move 50k items/second.
> That is a very good number. How did you achieve that? Is it because of high
> bandwidth of the network or something else?
We were in EC2. It only took a couple boxes, as our queue items were small.
> there are some triggers (processing tasks) that get fired based on the
> type of item inserted.
I'm curious how the trigger fires, which leads me to want to read more
on the pub/sub feature and why it is going to be deprecated.
To go deeper, 600k sets really isn't all that many. There are some
people who have millions of sets in a single Redis, and others with
millions of hashes. The only real question is whether 1. you have
enough memory, and 2. you have enough processors to process all of
your data.
You stated earlier that you had something like:
p1: {f1, f2, f3, f4, f5}
p2: {m1, m2, m3, m4, m5}
If you wanted to reduce your number of sets, you can easily use a sorted set:
pf -> {1_f1 -> ..., 1_f2 -> ..., 1_f3 -> ..., ...}
pm -> {2_m1 -> ..., 2_m2 -> ..., 2_m3 -> ...}
With the scores being derived from the "id" portion of p<id> that was
turned into pf -> {<id>_item -> <score>, ...} in the way that I
describe in this thread:
https://groups.google.com/forum/#!topic/redis-db/zCv6C8RI3I4 . If you
can encode your ids into a binary string, that should be good for 256
billion unique sets (if necessary), and could reduce your number of
sets from 30k/host to presumably less than a dozen (put all of your
father information in the same zset, all of your mother information in
a different zset, etc.)
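The merge itself is mechanical (sketch only; this assumes the original
per-person sets are plain SETs named p<id>):

    import redis.clients.jedis.Jedis;

    class MergeSets {
        // Fold person <id>'s set into the single "pf" zset, keeping
        // <id> as both the member prefix and the score.
        static void mergeIntoFathers(Jedis jedis, long id) {
            for (String member : jedis.smembers("p" + id)) {
                jedis.zadd("pf", id, id + "_" + member);
            }
        }

        // Retrieval, e.g. "get all values for p2": ZRANGEBYSCORE over
        // the score range [id, id] pulls the whole group back out.
        static Iterable<String> valuesFor(Jedis jedis, long id) {
            return jedis.zrangeByScore("pf", id, id);
        }
    }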
One thing to note: sorted sets use significantly more memory, so
before switching to them, you should verify that you have enough.
- Josiah
I'm not suggesting it; it's just an alternate method if your primary
consideration is "reduce the number of sets". You had mentioned the hassle of
checking many different sets for size, which I originally took as
"check thousands of sets on every host". If you have a single set that
lists the items that still need to be processed, then you are down to
1 per host already, and I'd say that you have a sufficient solution
(though I generally use sorted sets if I want unique items with the
ability to pull one item at a time).
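(For reference, "pull one item at a time" from a sorted set is just a
ZRANGE plus a ZREM, something like the following with a made-up
"pending" key; ZREM returning 1 is what makes the claim exclusive when
several processors race for the same item.)

    import redis.clients.jedis.Jedis;

    class PullOne {
        // Returns an item we exclusively claimed, or null if the set
        // is empty (or another client beat us to the head item).
        static String pullOne(Jedis jedis) {
            for (String item : jedis.zrange("pending", 0, 0)) {
                if (jedis.zrem("pending", item) == 1) {
                    return item;
                }
            }
            return null;
        }
    }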
> 2) ID1_2 --> {1_val1, 1_val2, 1_val3, 2_val1, 2_val2, 2_val3}
> Here the 2 sets are merged but the values are differentiated using value IDs
> and their scores.
> Retrieval: I have to use ZRANGE, ZRANGEBYSCORE
> The number of sets would definitely get reduced since we are merging sets
> but would the processing speed improve? If anything, use of
> zrange/zrangebyscore seems to be more expensive. In 2), I would also have the
> additional overhead of calculating scores.
Any latency differences are more than likely imperceptible unless you
were connecting over a unix domain socket. And even then, the
differences would probably be measured in microseconds per pull.
Regards,