Vert.x clustered multi-node. Two different timers get the same timerID


bytor99999

unread,
Sep 27, 2014, 5:07:13 PM9/27/14
to ve...@googlegroups.com
In our application, we have to implement special code to try to make sure that two different timers don't get the same timerID. This happens when two timers are started at about the same time on two different Vert.x instances in a cluster. I was kind of surprised that vertx.setTimer isn't a cluster-wide call, where no matter which node calls it, the timerID is unique across the cluster.

What solution do you guys propose for making sure timerIDs are unique?

Thanks

Mark

Jordan Halterman (kuujo)

unread,
Sep 28, 2014, 12:19:13 AM9/28/14
to ve...@googlegroups.com
This is a fundamental problem of distributed systems. Probably the simplest logical solution for resolving conflicts across a distributed system is to assign one node as the "master" and require all other nodes to query the master for consistency checks. So, for example, to resolve timer ID conflicts, a slave sets a timer and then sends a message to the master asking whether the ID is unique. If the master already has that timer ID stored it says "no," and if not it stores the timer ID and says "yes." This method basically amounts to "first write wins": the first node to set a timer with the given ID gets to keep its timer. You can also achieve the same thing using a database with some sort of atomic putIfAbsent type command.
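
For illustration, here's a minimal sketch of what that master-side check could look like (this isn't anything Vert.x provides; the "timer.reserve" address and the in-memory set are my own assumptions, and I'm using the Vert.x 3.x event bus API):

import io.vertx.core.AbstractVerticle;
import java.util.HashSet;
import java.util.Set;

// Deployed only on the node acting as "master". Other nodes send the timer ID
// they just generated to "timer.reserve" and keep their timer only if the
// reply is true ("first write wins").
public class TimerIdMasterVerticle extends AbstractVerticle {

    // IDs already claimed somewhere in the cluster. A plain HashSet is fine
    // because this consumer's handler always runs on the same event loop.
    private final Set<String> claimed = new HashSet<>();

    @Override
    public void start() {
        vertx.eventBus().<String>consumer("timer.reserve", msg -> {
            // add() returns false if the ID was already present.
            msg.reply(claimed.add(msg.body()));
        });
    }
}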

Note, though, that this type of master-slave system doesn't handle faults well. However, if timer IDs are stored in a highly available persistent store, Vert.x's failover mechanism can make sure the master verticle gets restarted on another node if necessary. Otherwise, fault tolerance and consistency can be achieved by other distributed algorithms that get quite a bit more complex and which I assume are outside of the scope of your question.

Alternatively, if you're willing to use something like Hazelcast you can use things like distributed locks or counters to achieve these types of semantics.
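
For example, here's a rough sketch of handing out cluster-unique IDs from a Hazelcast distributed counter (assuming the Hazelcast 3.x API; the counter name is just an example):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;

// Every node that calls nextId() gets a value no other node will ever see,
// regardless of how closely in time the calls happen.
public class ClusterIdGenerator {

    private final IAtomicLong counter;

    public ClusterIdGenerator(HazelcastInstance hazelcast) {
        this.counter = hazelcast.getAtomicLong("timer-id-counter");
    }

    public long nextId() {
        return counter.incrementAndGet();
    }
}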

Why can't you just make the timer IDs unique by prepending some sort of node ID to them? I.e. Each node is assigned a UUID at startup and that UUID is prepended to timer IDs for timers started on that node.

Stream Liu

unread,
Sep 28, 2014, 6:04:21 AM9/28/14
to ve...@googlegroups.com

> Why can't you just make the timer IDs unique by prepending some sort of node ID to them? I.e. Each node is assigned a UUID at startup and that UUID is prepended to timer IDs for timers started on that node.

That looks simple enough: use the nodeID as a prefix of the timerID, so a timerID would look like B6994147-4AC2-414C-9669-9DC3E4BE8F88:4467366177.

Isn't that timerID too long?

bytor99999

unread,
Sep 28, 2014, 4:24:06 PM9/28/14
to ve...@googlegroups.com
Thanks guys. Prepending the nodeID and using Hazelcast is our actual current solution, now that we realize Vert.x doesn't guarantee unique timerIDs across nodes. Maybe this could become a feature in Vert.x 3.0: since we now have cluster-wide sharedData, that could be the source Vert.x uses to generate unique timerIDs.

Another solution would be for Vert.x to use UUIDs instead of longs for timerIDs.

Thanks

Mark

Tim Fox

unread,
Sep 28, 2014, 5:24:57 PM9/28/14
to ve...@googlegroups.com
Why do you want the timerID to be unique across the cluster? What's the big picture here?

Nick Scavelli

unread,
Sep 29, 2014, 11:02:33 AM9/29/14
to ve...@googlegroups.com
Yea same question here. I'm sort of surprised you would expect vertx.setTimer to be some distributed operation.

bytor99999

unread,
Sep 29, 2014, 12:32:02 PM9/29/14
to ve...@googlegroups.com
Because we could set a timer on one node and need it to be cancelled by another.

Here is an example. We are building a Texas No-Limit poker application. When it is someone's turn to act/bet, we set an act timer, say 12 seconds for the player to act. If it expires, our closure folds the player; if they act in time, we have to cancel the timer. So, going in order: a player acts, and that player has a socket connection to one node in the cluster. After they act, on their node we pick the next player and start their act timer. That player might be connected (socket) to a different node in the cluster, so when they act, that different node in the cluster now needs to cancel the timer that was started on the other node.

This is all over the place in our code, because 1) you can't tie a game to one server, since players from all the nodes might join that game/table, and 2) players should see all tables regardless of which node their socket is connected to.

Thanks

Mark

bytor99999

unread,
Sep 29, 2014, 12:36:21 PM9/29/14
to ve...@googlegroups.com
Also, the solution we talked about is fine for us to know about the timer, but cancelling the timer on a specific node without being on that node is still not solved. Even though we have the nodeID, I haven't seen a specific way to send a message to that particular node so that we can cancel the timer.

Thanks

Mark

Tim Fox

unread,
Sep 29, 2014, 12:41:16 PM9/29/14
to ve...@googlegroups.com
I'm not sure I fully understand the use case (card games have always confused me ;), but....

In Vert.x 3.0 you could store the timer ID in a distributed map so it would be accessible in any node.

In Vert.x 2.x you could have a handler listening on each node against an address, e.g. "timerCancels" - when you want to cancel a timerID from any other node, you publish that timerID so it's picked up by all the nodes and the owning node can cancel it.
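
Roughly like this sketch (shown with the Vert.x 3.x lambda API for brevity; only the "timerCancels" address comes from the suggestion above, the rest is illustrative), deployed on every node:

import io.vertx.core.AbstractVerticle;

// cancelTimer() is a no-op (returns false) for IDs this node doesn't know,
// so every node can safely receive the publish.
public class TimerCancelListener extends AbstractVerticle {

    @Override
    public void start() {
        vertx.eventBus().<Long>consumer("timerCancels", msg -> {
            boolean cancelled = vertx.cancelTimer(msg.body());
            if (cancelled) {
                System.out.println("cancelled local timer " + msg.body());
            }
        });
    }
}

Any node then cancels with vertx.eventBus().publish("timerCancels", timerId). Of course, if two nodes happen to hold the same timer ID this cancels both, which is where the node ID prefix discussed earlier in the thread comes in.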

Tim Fox

unread,
Sep 29, 2014, 12:42:37 PM9/29/14
to ve...@googlegroups.com
On 29/09/14 09:36, bytor99999 wrote:
Also, the solution we talked about is fine for us to know about the timer, but cancelling the timer on a specific node without being on that node is still not solved. Even though we have the nodeID, I haven't seen a specific way to send a message to that particular node so that we can cancel the timer.

If you know the "nodeID" and have a handler on that node registered on that nodeID, surely you can just send a message to that nodeID containing the timerID?

bytor99999

unread,
Sep 29, 2014, 3:57:36 PM9/29/14
to ve...@googlegroups.com
My co-worker used a nodeID as an address.

Regarding the other solutions, another problem with non clusterable timerIDs is that you could have two timers with the exact same timerID running at the exact same time on two different nodes. So when you tried to cancel one, it cancelled the other, or both.

Really, if timerIDs were unique across all nodes in the cluster, we wouldn't have all these problems we have right now. We had to write a whole verticle, manager, API and addressing scheme to handle not having clusterable timerIDs.

The card game part is actually irrelevant here. It is a game and players are on different nodes. Timers are set, usually started on one node because of one player's action, and another player's action on another node needs to cancel that timer. Simple. These aren't delay/wait timers; they give the player x amount of time to do whatever their action is on the client side.

Mark

Nick Scavelli

unread,
Sep 29, 2014, 4:35:38 PM9/29/14
to ve...@googlegroups.com


On Monday, September 29, 2014 3:57:36 PM UTC-4, bytor99999 wrote:
My co-worker used a nodeID as an address.

Regarding the other solutions, another problem with non clusterable timerIDs is that you could have two timers with the exact same timerID running at the exact same time on two different nodes. So when you tried to cancel one, it cancelled the other, or both.

Really, if timerIDs were unique across all nodes in the cluster, we wouldn't have all these problems we have right now. We had to write a whole verticle, manager, API and addressing scheme to handle not having clusterable timerIDs.

I think you are taking a generic distributed systems problem and applying it to timers just because you happen to be using the timerID as a distributed value. As Tim has pointed out, in 3.0 there is more support for distributed data structures (map, counter, lock) to help solve these problems. But I currently don't see any reason why we would make timerIDs unique across a cluster.

Jordan Halterman

unread,
Sep 29, 2014, 4:41:23 PM9/29/14
to ve...@googlegroups.com


On Sep 29, 2014, at 12:57 PM, bytor99999 <bytor...@gmail.com> wrote:

My co-worker used a nodeID as an address.

Regarding the other solutions, another problem with non clusterable timerIDs is that you could have two timers with the exact same timerID running at the exact same time on two different nodes. So when you tried to cancel one, it cancelled the other, or both.

I don't see how that's true. Timer IDs are inherently local to a Vert.x instance, so canceling a timer in node A that was set in node A will work properly regardless of whether node B has a timer with the same ID. It seems like something like this could only happen as a result of user code telling the wrong node to cancel the wrong timer.

Really, if timerIDs were unique across all nodes in the cluster, we wouldn't have all these problems we have right now. We had to write a whole verticle, manager, API and addressing scheme to handle not having clusterable timerIDs.

The card game part is actually irrelevant here. It is a game and players are on different nodes. Timers are set, usually started on one node because of one player's action, and another player's action on another node needs to cancel that timer. Simple. These aren't delay/wait timers; they give the player x amount of time to do whatever their action is on the client side.

It still seems like this is easily solvable by prepending a "node ID" to each timer and registering an event bus handler on each node using the "node ID" as the address. Then, to cancel a timer, send a message to the appropriate node based on the "node ID" prefix. That node then cancels the timer using the appropriate timer ID suffix.
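
For illustration, a minimal sketch of that pattern (the "nodeId:timerId" handle format comes from earlier in the thread; the method names and the use of the Vert.x 3.x API are my own assumptions):

import io.vertx.core.AbstractVerticle;
import java.util.UUID;

// One instance per node. Timers started here are handed out as "<nodeId>:<timerId>";
// any node in the cluster can cancel one by sending the local part to the owner's address.
public class NodeTimerVerticle extends AbstractVerticle {

    private final String nodeId = UUID.randomUUID().toString();

    @Override
    public void start() {
        // Cancel requests addressed directly to this node.
        vertx.eventBus().<Long>consumer(nodeId, msg -> vertx.cancelTimer(msg.body()));
    }

    // Start a timer locally and return a cluster-wide handle for it.
    public String startActTimer(long delayMs, Runnable onExpiry) {
        long timerId = vertx.setTimer(delayMs, id -> onExpiry.run());
        return nodeId + ":" + timerId;
    }

    // Works from any node: routes the cancel to whichever node owns the timer.
    public void cancel(String handle) {
        int sep = handle.indexOf(':');
        vertx.eventBus().send(handle.substring(0, sep), Long.parseLong(handle.substring(sep + 1)));
    }
}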

Jordan Halterman

unread,
Sep 29, 2014, 4:45:28 PM9/29/14
to ve...@googlegroups.com

On Sep 29, 2014, at 1:35 PM, Nick Scavelli <nick.s...@gmail.com> wrote:



On Monday, September 29, 2014 3:57:36 PM UTC-4, bytor99999 wrote:
My co-worker used a nodeID as an address.

Regarding the other solutions, another problem with non clusterable timerIDs is that you could have two timers with the exact same timerID running at the exact same time on two different nodes. So when you tried to cancel one, it cancelled the other, or both.

Really, if timerIDs were unique across all nodes in the cluster, we wouldn't have all these problems we have right now. We had to write a whole verticle, manager, API and addressing scheme to handle not having clusterable timerIDs.

I think you are taking a generic distributed systems problem and applying it to timers just because you happen to be using the timerID as a distributed value. As Tim has pointed out, in 3.0 there is more support for distributed data structures (map, counter, lock) to help solve these problems. But I currently don't see any reason why we would make timerIDs unique across a cluster.
 
+1 I don't see any good reason for timer IDs to be unique across the cluster. Also second that this is a general distributed systems problem, not an issue with Vert.x. It would only really make sense to make timer IDs unique across the cluster if they could be cancelled remotely (i.e. timers were distributed, which they aren't)

bytor99999

unread,
Sep 29, 2014, 7:02:10 PM9/29/14
to ve...@googlegroups.com
OK, I think you guys are missing what I am saying. 

No matter what, in our system, a timer might be set on one node that another node has to cancel. 

This is because each player at the game table might be socket connected to totally different nodes. 

So I might have 9 players at the poker table all on 9 different nodes in the cluster. 

Also, we might have 100s if not hopefully 1000s of tables running at the same time. 

If one node tries to cancel timerID number 123, and there are now 10 nodes that happen to all have a timer running with id of 123, we have no idea if that call happened on the same node that created the timer. (We will never know) So we might cancel the wrong timer. 

(I know this is a fundamental problem of distributed systems, but one of the benefits of Vert.x was to try to make it so that the developer doesn't have to worry about those problems, like message delivery.)

We have to use this whole nodeID addressing and message delivery system every time we have to start or stop a timer. And if you look at our Hazelcast map, where we currently store the key as (nodeID + timerID), we sometimes have 10,000 timers running simultaneously.

With unique timerIDs we don't have to worry about this. Instead of long/ints they could be UUIDs.

Mark

Jordan Halterman

unread,
Sep 29, 2014, 7:12:04 PM9/29/14
to ve...@googlegroups.com
Why do you have to create it every time you start or stop a timer? Just start one verticle on each node that handles stopping *all* timers for that node. Assign that one verticle per node a unique address to which other nodes can send "cancel" requests to cancel timers belonging to that node. I still don't see a problem with this approach, and the overhead is just one always-running verticle (or just an event bus handler) per node.

Also, you can use a MultiMap (rather than a Map) to cut down on storage. NodeID -> timerIDs
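
Something along these lines, e.g. with Hazelcast's MultiMap since you're already talking to Hazelcast directly (the map name and class here are illustrative, assuming the Hazelcast 3.x API):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MultiMap;
import java.util.Collection;

// One key per node rather than one map entry per timer.
public class TimerRegistry {

    private final MultiMap<String, Long> timersByNode;

    public TimerRegistry(HazelcastInstance hazelcast) {
        this.timersByNode = hazelcast.getMultiMap("timers-by-node");
    }

    public void registered(String nodeId, long timerId) {
        timersByNode.put(nodeId, timerId);
    }

    public void cancelled(String nodeId, long timerId) {
        timersByNode.remove(nodeId, timerId);
    }

    public Collection<Long> timersOf(String nodeId) {
        return timersByNode.get(nodeId);
    }
}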

With unique timerIDs we don't have to worry about this. Instead of long/ints they could be UUIDs.

Even if they were UUIDs, because Vert.x timers are inherently local, they would still have to be sent to the correct node in order to be cancelled. I fail to see how unique timer IDs would solve your problem, but maybe I'm misunderstanding.

bytor99999

unread,
Sep 29, 2014, 8:21:17 PM9/29/14
to ve...@googlegroups.com
We are already doing what you suggest; we already created another verticle for timers, etc. If timers were clusterable and timerIDs unique, we wouldn't have to write all this extra code, which is buggy and difficult to test. If it were built in (and most Vert.x applications will probably need -cluster), it would be a no-code thing: starting and cancelling timers would be a simple one-liner.

That is what I am suggesting, because I can see this issue happening to a lot of people who then have to code this up, and it is easy to make mistakes. Maybe clustered timers should be a "module" for Vert.x 2.x and an add-on for Vert.x 3.x.

Mark

bytor99999

unread,
Feb 25, 2015, 6:10:11 PM2/25/15
to ve...@googlegroups.com
So this is still causing us huge issues. We have already set things up so that we keep a node id with the timer id, so that we can send the cancel to the correct node.

But the issue here is about more than a timer id: it seems that the timers are not distributed timers. We find that sometimes timers never fire.

So we had to create a Timer watchdog to look for Timers that weren't fired and force fire them. It is tough to explain.

So I will skip that and just ask, are the timers distributed across the cluster or are the timers only on that Vert.x instance in a cluster?

Thanks

Mark

Tim Fox

unread,
Feb 26, 2015, 1:58:15 AM2/26/15
to ve...@googlegroups.com
A timer in Vert.x is simply an event that fires every x milliseconds, it doesn't know anything about clustering.

bytor99999

unread,
Feb 26, 2015, 2:55:36 PM2/26/15
to ve...@googlegroups.com
OK. One more simple quick question. Is it ever possible for a timer that expires to not be fired?

Thanks

Mark

bytor99999

unread,
Feb 26, 2015, 2:58:19 PM2/26/15
to ve...@googlegroups.com


On Wednesday, February 25, 2015 at 10:58:15 PM UTC-8, Tim Fox wrote:
A timer in Vert.x is simply an event that fires every x milliseconds, it doesn't know anything about clustering.


Well unless it is a one off timer, not a repeating timer. ;)

Thanks

Mark

Tim Fox

unread,
Feb 27, 2015, 1:21:22 AM2/27/15
to ve...@googlegroups.com
On 26/02/15 19:55, bytor99999 wrote:
OK. One more simple quick question. Is it ever possible for a timer that expires to not be fired?

Yes, if the context is blocked preventing it from being executed (e.g. something is blocking the event loop or a worker)
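
A tiny sketch of that failure mode (assuming the Vert.x 3.x API; never block the event loop like this in real code) - the timer is due in 100 ms but its handler can't run until the event loop it belongs to is free again:

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;

public class BlockedTimerDemo extends AbstractVerticle {

    @Override
    public void start() {
        long scheduledAt = System.currentTimeMillis();
        vertx.setTimer(100, id ->
                System.out.println("fired after " + (System.currentTimeMillis() - scheduledAt) + " ms"));

        // Blocks the event loop this verticle (and its timer handler) runs on.
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Prints "fired after ~5000 ms", not ~100 ms.
        Vertx.vertx().deployVerticle(new BlockedTimerDemo());
    }
}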

Tim Fox

unread,
Feb 27, 2015, 1:21:46 AM2/27/15
to ve...@googlegroups.com
+1 ;)

dgo...@squaredfinancial.com

unread,
Feb 27, 2015, 12:57:26 PM2/27/15
to ve...@googlegroups.com


On Friday, 27 February 2015 06:21:22 UTC, Tim Fox wrote:
On 26/02/15 19:55, bytor99999 wrote:
OK. One more simple quick question. Is it ever possible for a timer that expires to not be fired?

Yes, if the context is blocked preventing it from being executed (e.g. something is blocking the event loop or a worker)


Assuming it unblocks eventually and you don't just mean it'll never run if it never ever unblocks: I thought the runnable was enqueued regardless, it could just be significantly late, so long as the timer is not actively cancelled in the interim? I may well be wrong.

I think e.g. the back-of-the-envelope "too busy" check that yoke does relies on that though - the later the timer actually fires after its initially projected target time, the more server load it assumes there is.

bytor99999

unread,
Feb 28, 2015, 8:44:42 PM2/28/15
to ve...@googlegroups.com
I guess I could see that a timer set on a node that goes down before the timer expires is lost and will never fire (a showstopper for us, as it will kill games being played).

But even without nodes going down, we definitely see timers that expire but never fire. The bummer, Tim, is that there is no way we could even come up with a small sample to show this - only our huge game, whose code we can't post. :D

So now we have had to find a way to make Vert.x timers cluster-able and distributed. It is a bit convoluted and very error-prone on our end. But we have a Watchdog, a Timer Verticle and Hazelcast maps to try to track all the timers and run "cron" jobs to find timers that expired but never fired and run the code. However, our current solution will still not handle a timer on a node that goes down before it expires. We will have to solve that somehow.

Thanks

Mark

Tim Fox

unread,
Mar 1, 2015, 3:03:32 AM3/1/15
to ve...@googlegroups.com
On 27/02/15 17:57, dgo...@squaredfinancial.com wrote:


On Friday, 27 February 2015 06:21:22 UTC, Tim Fox wrote:
On 26/02/15 19:55, bytor99999 wrote:
OK. One more simple quick question. Is it ever possible for a timer that expires to not be fired?

Yes, if the context is blocked preventing it from being executed (e.g. something is blocking the event loop or a worker)


Assuming it unblocks eventually and you don't just mean it'll never run if it never ever unblocks:

I don't think you can make any assumptions about when, if ever, it unblocks. This completely depends on how the context is blocked.


I thought the runnable was enqueued regardless, it could just be significantly late, so long as the timer is not actively cancelled in the interim? I may well be wrong.

I think e.g. the back-of-the-envelope "too busy" check that yoke does relies on that though

Vert.x 3 has a blocked thread checker that uses its own timer and does not rely on Vert.x timers at all.

- the later the timer actually fires after its initially projected target time, the more server load it assumes there is.

dgo...@squaredfinancial.com

unread,
Mar 2, 2015, 11:10:04 AM3/2/15
to ve...@googlegroups.com
On Sunday, 1 March 2015 08:03:32 UTC, Tim Fox wrote:

I don't think you can make any assumptions about when, if ever, it unblocks. This completely depends on how the context is blocked.


I suppose not, my bad. What I was getting at was just that the run is not, e.g., skipped altogether if a deadline isn't met.

dgo...@squaredfinancial.com

unread,
Mar 2, 2015, 11:33:22 AM3/2/15
to ve...@googlegroups.com

But we have a Watchdog, a Timer Verticle and Hazelcast maps to try to track all the timers and run "cron" jobs to find timers that expired but never fired and run the code.

Anyway, whatever this code is, it clearly doesn't actually have to run in the vertx context of the verticle that originally set the timer?

While a full distributed scheduled executor service might be nice (tm) (and it looks like Hazelcast is planning one at the Hazelcast level), it sounds to me like you could try a different architecture for now: instead of considering your watchdog just the fallback, actually have a dedicated "cron service" module that looks after all the timers, and have your other verticles register scheduled tasks with it /instead of/ using the local and lightweight timer facility within them.

The cron service module could use a Hazelcast distributed backing store (IIRC you were already hitting Hazelcast directly...) so it can itself fail over without loss. You might be concerned about the scaling and timer accuracy of that, but I'd suggest trying it first - dismissing it may be premature optimisation, particularly if you only need accuracy to seconds. I.e. consider something akin to the vertx work queue module but exposing a scheduled task service API.



bytor99999

unread,
Mar 4, 2015, 2:17:38 AM3/4/15
to ve...@googlegroups.com
These aren't cron jobs; these are single-run timers in a game. For instance: it is your move, you have 12 seconds to make it, and if you don't we move to the next player. We move to the next player and start a new 12-second timer for them. If they make a move, we cancel that timer - except this timer was created on the server/node where the first timer was started, while the second player has a socket connection to a different node, which now has to cancel a timer that lives on another, totally different node.

Anyway, explanations aside: our application needs guaranteed, distributable timers. And in the end we have a complex, convoluted way to hopefully do that successfully, because there is no other solution. :D

Thanks.

Mark

Tim Fox

unread,
Mar 4, 2015, 3:17:57 AM3/4/15
to ve...@googlegroups.com
On 04/03/15 07:17, bytor99999 wrote:
These aren't cron jobs; these are single-run timers in a game. For instance: it is your move, you have 12 seconds to make it, and if you don't we move to the next player. We move to the next player and start a new 12-second timer for them. If they make a move, we cancel that timer - except this timer was created on the server/node where the first timer was started, while the second player has a socket connection to a different node, which now has to cancel a timer that lives on another, totally different node.

Anyway, explanations aside: our application needs guaranteed, distributable timers. And in the end we have a complex, convoluted way to hopefully do that successfully, because there is no other solution. :D

I think the problem here is you don't yet know what is actually causing the issue, so you're forced to apply mitigating approaches to work around the issue, but this is not an ideal approach.

If it were me doing this, I think I would try to spend some time forensically pinpointing exactly what is going on - do some debugging, add some logging, take some stack traces, and trace what happened to those timer events. Once you know exactly what is going on you might have an "aha" moment and be able to apply a fix that solves the real issue once and for all.



Thanks.

Mark

On Monday, March 2, 2015 at 8:33:22 AM UTC-8, dgo...@squaredfinancial.com wrote:

But we have a Watchdog, a Timer Verticle and Hazelcast maps to try to track all the timers and run "cron" jobs to find timers that expired but never fired and run the code.

Anyway, whatever this code is, it clearly doesn't actually have to run in the vertx context of the verticle that originally set the timer?

While a full distributed scheduled executor service might be nice (tm) (and it looks like Hazelcast is planning one at the Hazelcast level), it sounds to me like you could try a different architecture for now: instead of considering your watchdog just the fallback, actually have a dedicated "cron service" module that looks after all the timers, and have your other verticles register scheduled tasks with it /instead of/ using the local and lightweight timer facility within them.

The cron service module could use a Hazelcast distributed backing store (IIRC you were already hitting Hazelcast directly...) so it can itself fail over without loss. You might be concerned about the scaling and timer accuracy of that, but I'd suggest trying it first - dismissing it may be premature optimisation, particularly if you only need accuracy to seconds. I.e. consider something akin to the vertx work queue module but exposing a scheduled task service API.



bytor99999

unread,
Mar 4, 2015, 12:36:17 PM3/4/15
to ve...@googlegroups.com


I think the problem here is you don't yet know what is actually causing the issue, so you're forced to apply mitigating approaches to work around the issue, but this is not an ideal approach.

If it were me doing this, I think I would try to spend some time forensically pinpointing exactly what is going on - do some debugging, add some logging, take some stack traces, and trace what happened to those timer events. Once you know exactly what is going on you might have an "aha" moment and be able to apply a fix that solves the real issue once and for all.



Thanks Tim. Yes that is great advice. 

Before we created the Watchdog and TimerVerticle and such, we spent about two weeks doing just that and never found anything, except that our first assumption - that we could take the timerID we had and cancel those timers - stopped being correct when we added more nodes to the cluster. We also couldn't figure out why we had a good number of log entries showing timers that expired but whose handler code never ran, and again we didn't see anything else in our logs (and we did have a large amount of logging in those areas).

So after a couple of weeks, and with the impossible deadlines we always seem to be facing, we had to come up with a different approach. I know it is coding by fire from above, leading to mistakes and hacks; we have all seen it.

Hopefully, sometime in the future, after we get our hack working completely, the root cause might show up and we'll get that aha moment.

Thanks

Mark

dgo...@squaredfinancial.com

unread,
Mar 4, 2015, 2:12:57 PM3/4/15
to ve...@googlegroups.com

On Wednesday, 4 March 2015 07:17:38 UTC, bytor99999 wrote:
These aren't cron jobs; these are single-run timers in a game.

Okay, cron may have been a bad word choice, as it suggested repetition/periodic jobs, but that's not really the important part. So let's say non-repeating "at" jobs instead, registered with a non-distributed (but potentially HA-failover, with an hz distributed backing store as previously discussed) "at service provider" verticle/module.

Verticle instance X asks "at service provider verticle" instance AT to do some job Q after now+N seconds (and perhaps to report back to X when Q is complete - perhaps Q could be null if you just wanted a "remind me later" - or cancelled). AT returns a job id JQ to X immediately (not when Q runs); X can, if it wants, publish JQ where, say, a verticle instance Y could find it, and thus either X or Y can ask AT to cancel the scheduled job Q with id JQ at any point before Q is actually run. As AT itself isn't distributed in that pattern, it can get away with using non-distributed vertx-provided timers internally.
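
For what it's worth, a rough sketch of such an AT verticle (the addresses, JSON field names and the in-memory job map are all just assumptions here, HA deployment and the hz backing store are left out, and I'm using the Vert.x 3.x API):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.JsonObject;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// The single owner of every scheduled job, so it can use plain local Vert.x
// timers internally. Deploy one instance (optionally HA) for the whole cluster.
public class AtServiceVerticle extends AbstractVerticle {

    private final Map<String, Long> jobs = new HashMap<>(); // job id JQ -> local timer id

    @Override
    public void start() {
        // X sends {"delayMs": 12000, "fireAddress": "game.1234.fold"} and gets JQ back immediately.
        vertx.eventBus().<JsonObject>consumer("at.schedule", msg -> {
            String jobId = UUID.randomUUID().toString();
            long delay = msg.body().getLong("delayMs");
            String fireAddress = msg.body().getString("fireAddress");
            long timerId = vertx.setTimer(delay, id -> {
                jobs.remove(jobId);
                vertx.eventBus().publish(fireAddress, jobId);
            });
            jobs.put(jobId, timerId);
            msg.reply(jobId);
        });

        // X or Y can cancel job JQ at any time before it fires.
        vertx.eventBus().<String>consumer("at.cancel", msg -> {
            Long timerId = jobs.remove(msg.body());
            if (timerId != null) {
                vertx.cancelTimer(timerId);
            }
        });
    }
}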

Anyway, up to you.
