Re: Tracking tuple tree using xor in acker

729 views

Skip to first unread message

Sean

unread,

Aug 3, 2012, 12:08:22 AM8/3/12

to storm...@googlegroups.com

The acking system is one of the most ingenious features of Storm, in my opinion. I am planning on writing a longer blog post on it to explain how it works, and will post that here. In short, the system relies on the fact that a bitwise-xor applied to any list where every number appears twice (or an even number of times) always produces 0. In this way the acker can know when the tuple has been completed. A mistake could be made if the random 64-bit number causes the xor to produce 0 before the tree is completed. However, I'm not sure myself if 5e7 years is reasonable even at 10K/sec, since through a birthday attack, collisions could occur sooner. I will go into this in the post.

However, I haven't been able to find the part of the code on the bolt-side where the bolts choose a random number as an ack value for the tuples they emit (only for the acker component itself). Could someone point me to this?

On Thursday, August 2, 2012 2:59:16 AM UTC-4, JQ wrote:

In https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing, it is documented that:
1) "An acker task stores a map from a spout tuple id to a pair of values"
2) One of the value, ack val, "is simply the xor of all tuple ids that have been created and/or acked in the tree"
3) "When an acker task sees that an "ack val" has become 0, then it knows that the tuple tree is completed"

It is claimed that, "since tuple ids are random 64 bit numbers, the chances of an "ack val" accidentally becoming 0 is extremely small. If you work the math, at 10K acks per second, it will take 50,000,000 years until a mistake is made".

I don't think I get this, and would appreciate if someone in the group can elaborate more on this.

Nathan Marz

unread,

Aug 3, 2012, 1:15:30 AM8/3/12

to storm...@googlegroups.com

The birthday problem doesn't come up here, as there are no sets involved in the algorithm. Each comparison is one random long vs. another random long, so each comparison has a 1/2^64 chance of making a mistake. A comparison is done on every call to "ack".

Here's where bolts generate new random numbers: https://github.com/nathanmarz/storm/blob/master/src/clj/backtype/storm/daemon/executor.clj#L500

--
Twitter: @nathanmarz
http://nathanmarz.com

Martin Kleppmann

unread,

Aug 3, 2012, 1:40:18 AM8/3/12

to storm...@googlegroups.com, storm...@googlegroups.com

I think the calculation is correct. The birthday paradox does not apply here, because it describes the probability of two random variables having the same value, whatever the value may be (any day in the year can be the birthday on which the two people coincide). By contrast, in Storm, a fault only occurs if the value randomly reaches zero (and no other value) when it shouldn't.

Imagine an extreme case: every tuple belongs to the same tuple tree, and the tuple tree is infinite. So we have just one accumulator, and every time a tuple is emitted or acked, the value of the accumulator changes (it's XORed with the tuple ID). If the value goes zero when a tuple is emitted, we know that the tuple tree can't be done (after all, we just emitted a tuple), so we can ignore zeros on emitting. If the value goes zero when a tuple is acked, we assume the tuple tree is done, even though it's not (we said it's infinite). So a zero value on ack is a fault.

How likely is it that you get a zero value when you shouldn't? Well, the tuple IDs are uniformly distributed random numbers, and XORing them doesn't change the distribution, so the observed sequence of accumulator values is itself a uniformly distributed random integer between 0 and 2^64-1. The probability of any one of those values being zero is 2^-64. That's a pretty small number.

How long can you go until you observe a fault? You can regard each ack as a Bernoulli trial with p=2^-64. Then the distribution of the number of trials until you get a fault is a Geometric distribution, and its expected value (i.e. the 'average' number of acks you need to process before you see one fault) is 1/p, that is 2^64. If you're processing 10k tuples per second, 2^64/10^4/86400/365 = 58 million years.

Of course, in practice, your tuple trees are not infinite. But for the mean time to failure, the size of your tuple trees doesn't make a difference, because the distribution is memoryless (it doesn't matter whether you're processing the first tuple or the 10 millionth tuple, the probabilities are the same).

Finally it's worth noting that all of this only matters if tuples have been lost and need to be replayed. As long as tuples are getting processed fine, it doesn't matter if the accumulator spuriously goes to zero. So actually this is based on the number of tuples you're *losing* when averaged over a long time, and that should be vastly less than 10k per second.

Hope that helps,

Martin

Sean

unread,

Aug 3, 2012, 2:29:12 PM8/3/12

to storm...@googlegroups.com

I see now--I was thinking of the ack value in the acker as a set of random numbers, but realized that a collision between two of those numbers wouldn't be enough to create an xor of 0 (there would have to be an even number repetitions for every number in the set). However, the 2^-64 chance occurs with every ack (including ack_init, but excluding the last ack), so for a tuple that is anchored n times ("sees" n+1 components), the chance that there will be an error for that tuple is 1 - (1 - 2^-64)^n, so it would take 1/(1 - (1 - 2^-64)^n)/10^4/86400/365 years (about (5.8e7)/n years) at 10K/sec. Still higher than anyone should care about, I agree.

However, the birthday problem does play a role in choosing a spout tuple id (root id), since those are modded and sent to the same acker no matter the spout (and then kept in a HashMap). This is extremely dependent on number of tuples outstanding; a topology with 1000 tuples outstanding has a 2.7e-14 probability of failure at any given point in time; one with 1,000,000 tuples pending has a 2.7e-8 chance at any point in time (considering tuples pending on all spouts in a topology). Another variable here is the rate of tuple processing (how many new tuple ids are generated per time); at 10K/sec, there is an expected failure every 58,000 years (with 1000 outstanding); at 1M/sec, one every 5 years with 100,000 outstanding. I have as a formula an expected collision expected every 1/[R(1 - (1 - 2^-64)^N)], where R is the rate of tuple processing, and N the number pending. In this case, a failure would mean one tuple would not be tracked, and another would be incorrectly failed, since it gets acks from the prior tuple, and the xor value won't go to 0. This does not depend on any tuple failing in the topology, unlike the other case.

Or am I missing something here? You're right--nothing to worry about for most use cases I could imagine.

On Friday, August 3, 2012 1:40:18 AM UTC-4, Martin Kleppmann wrote:

I think the calculation is correct. The birthday paradox does not apply here, because it describes the probability of two random variables having the same value, whatever the value may be (any day in the year can be the birthday on which the two people coincide). By contrast, in Storm, a fault only occurs if the value randomly reaches zero (and no other value) when it shouldn't.

Imagine an extreme case: every tuple belongs to the same tuple tree, and the tuple tree is infinite. So we have just one accumulator, and every time a tuple is emitted or acked, the value of the accumulator changes (it's XORed with the tuple ID). If the value goes zero when a tuple is emitted, we know that the tuple tree can't be done (after all, we just emitted a tuple), so we can ignore zeros on emitting. If the value goes zero when a tuple is acked, we assume the tuple tree is done, even though it's not (we said it's infinite). So a zero value on ack is a fault.

How likely is it that you get a zero value when you shouldn't? Well, the tuple IDs are uniformly distributed random numbers, and XORing them doesn't change the distribution, so the observed sequence of accumulator values is itself a uniformly distributed random integer between 0 and 2^64-1. The probability of any one of those values being zero is 2^-64. That's a pretty small number.

How long can you go until you observe a fault? You can regard each ack as a Bernoulli trial with p=2^-64. Then the distribution of the number of trials until you get a fault is a Geometric distribution, and its expected value (i.e. the 'average' number of acks you need to process before you see one fault) is 1/p, that is 2^64. If you're processing 10k tuples per second, 2^64/10^4/86400/365 = 58 million years.

Of course, in practice, your tuple trees are not infinite. But for the mean time to failure, the size of your tuple trees doesn't make a difference, because the distribution is memoryless (it doesn't matter whether you're processing the first tuple or the 10 millionth tuple, the probabilities are the same).

Finally it's worth noting that all of this only matters if tuples have been lost and need to be replayed. As long as tuples are getting processed fine, it doesn't matter if the accumulator spuriously goes to zero. So actually this is based on the number of tuples you're *losing* when averaged over a long time, and that should be vastly less than 10k per second.

Hope that helps,
Martin

Nathan Marz

unread,

Aug 6, 2012, 4:15:41 AM8/6/12

to storm...@googlegroups.com

On Fri, Aug 3, 2012 at 11:29 AM, Sean <sts...@gmail.com> wrote:

I see now--I was thinking of the ack value in the acker as a set of random numbers, but realized that a collision between two of those numbers wouldn't be enough to create an xor of 0 (there would have to be an even number repetitions for every number in the set). However, the 2^-64 chance occurs with every ack (including ack_init, but excluding the last ack), so for a tuple that is anchored n times ("sees" n+1 components), the chance that there will be an error for that tuple is 1 - (1 - 2^-64)^n, so it would take 1/(1 - (1 - 2^-64)^n)/10^4/86400/365 years (about (5.8e7)/n years) at 10K/sec. Still higher than anyone should care about, I agree.

However, the birthday problem does play a role in choosing a spout tuple id (root id), since those are modded and sent to the same acker no matter the spout (and then kept in a HashMap). This is extremely dependent on number of tuples outstanding; a topology with 1000 tuples outstanding has a 2.7e-14 probability of failure at any given point in time; one with 1,000,000 tuples pending has a 2.7e-8 chance at any point in time (considering tuples pending on all spouts in a topology). Another variable here is the rate of tuple processing (how many new tuple ids are generated per time); at 10K/sec, there is an expected failure every 58,000 years (with 1000 outstanding); at 1M/sec, one every 5 years with 100,000 outstanding. I have as a formula an expected collision expected every 1/[R(1 - (1 - 2^-64)^N)], where R is the rate of tuple processing, and N the number pending. In this case, a failure would mean one tuple would not be tracked, and another would be incorrectly failed, since it gets acks from the prior tuple, and the xor value won't go to 0. This does not depend on any tuple failing in the topology, unlike the other case.

Wow you're right. At least in this case it will just cause message failures and replay, which is better than message loss (and still respects Storm's guarantees). This is actually fixable by making spouts be their own ackers, so when they choose a random id they can check to see if it already exists. I opened up an issue for this: https://github.com/nathanmarz/storm/issues/282

Or am I missing something here? You're right--nothing to worry about for most use cases I could imagine.

On Friday, August 3, 2012 1:40:18 AM UTC-4, Martin Kleppmann wrote:

I think the calculation is correct. The birthday paradox does not apply here, because it describes the probability of two random variables having the same value, whatever the value may be (any day in the year can be the birthday on which the two people coincide). By contrast, in Storm, a fault only occurs if the value randomly reaches zero (and no other value) when it shouldn't.

Imagine an extreme case: every tuple belongs to the same tuple tree, and the tuple tree is infinite. So we have just one accumulator, and every time a tuple is emitted or acked, the value of the accumulator changes (it's XORed with the tuple ID). If the value goes zero when a tuple is emitted, we know that the tuple tree can't be done (after all, we just emitted a tuple), so we can ignore zeros on emitting. If the value goes zero when a tuple is acked, we assume the tuple tree is done, even though it's not (we said it's infinite). So a zero value on ack is a fault.

How likely is it that you get a zero value when you shouldn't? Well, the tuple IDs are uniformly distributed random numbers, and XORing them doesn't change the distribution, so the observed sequence of accumulator values is itself a uniformly distributed random integer between 0 and 2^64-1. The probability of any one of those values being zero is 2^-64. That's a pretty small number.

How long can you go until you observe a fault? You can regard each ack as a Bernoulli trial with p=2^-64. Then the distribution of the number of trials until you get a fault is a Geometric distribution, and its expected value (i.e. the 'average' number of acks you need to process before you see one fault) is 1/p, that is 2^64. If you're processing 10k tuples per second, 2^64/10^4/86400/365 = 58 million years.

Of course, in practice, your tuple trees are not infinite. But for the mean time to failure, the size of your tuple trees doesn't make a difference, because the distribution is memoryless (it doesn't matter whether you're processing the first tuple or the 10 millionth tuple, the probabilities are the same).

Finally it's worth noting that all of this only matters if tuples have been lost and need to be replayed. As long as tuples are getting processed fine, it doesn't matter if the accumulator spuriously goes to zero. So actually this is based on the number of tuples you're *losing* when averaged over a long time, and that should be vastly less than 10k per second.

Hope that helps,
Martin

On 2 Aug 2012, at 21:08, Sean wrote:

The acking system is one of the most ingenious features of Storm, in my opinion. I am planning on writing a longer blog post on it to explain how it works, and will post that here. In short, the system relies on the fact that a bitwise-xor applied to any list where every number appears twice (or an even number of times) always produces 0. In this way the acker can know when the tuple has been completed. A mistake could be made if the random 64-bit number causes the xor to produce 0 before the tree is completed. However, I'm not sure myself if 5e7 years is reasonable even at 10K/sec, since through a birthday attack, collisions could occur sooner. I will go into this in the post.

However, I haven't been able to find the part of the code on the bolt-side where the bolts choose a random number as an ack value for the tuples they emit (only for the acker component itself). Could someone point me to this?

On Thursday, August 2, 2012 2:59:16 AM UTC-4, JQ wrote:
In https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing, it is documented that:

1) "An acker task stores a map from a spout tuple id to a pair of values"
2) One of the value, ack val, "is simply the xor of all tuple ids that have been created and/or acked in the tree"

3) "When an acker task sees that an "ack val" has become 0, then it knows that the tuple tree is completed"

It is claimed that, "since tuple ids are random 64 bit numbers, the chances of an "ack val" accidentally becoming 0 is extremely small. If you work the math, at 10K acks per second, it will take 50,000,000 years until a mistake is made".

I don't think I get this, and would appreciate if someone in the group can elaborate more on this.

Reply all

Reply to author

Forward

0 new messages