Consistency of IMap.putIfAbsent and IMap.tryLock

real...@gmail.com

Oct 23, 2017, 8:23:50 AM

to Hazelcast

Hi all,

I have implemented a super-simple locking mechanism by doing:

String previousValue = myIMap.putIfAbsent(key, "locked", 5, TimeUnit.MINUTES);

if (previousValue != null && previousValue.equals("locked")) {
   // already locked, skip processing
   return;
}
else {
   // I hold the lock, so continue processing
}

My assumption was that only one node in a cluster can get this lock, which would give me "at-most-once" behavior. However, this only works most of the time; sometimes both nodes in my two-node cluster get past the lock and continue processing.

I have also tried IMap.tryLock, but that had the same problem. Sometimes it works, sometimes more than one node acquires the lock.

Are my assumptions wrong here or what are the consistency guarantees of IMap?

Kind regards,

Ulrich

Mehmet Dogan

Oct 23, 2017, 10:39:49 AM

to Hazelcast

Hi,

(Assuming your processing under the lock completes in significantly less than 5 minutes, that is, before the TTL expires.)

In the absence of network partitions, Hazelcast remains consistent (linearizable). But when a partition happens (due to network failures, long GC pauses, OS freezes, etc.), Hazelcast remains available while sacrificing consistency.

So the partitioned sides of the cluster operate independently, and hence you can observe the same lock being acquired concurrently on multiple members.

You can find more details in the reference manual here: http://docs.hazelcast.org/docs/latest-development/manual/html/Consistency_and_Replication_Model.html

Also you can read following blog posts related to this subject:

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/2e58d562-378f-452b-8006-1c4d4cdeda7f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Mehmet Dogan

real...@gmail.com

Oct 23, 2017, 6:49:24 PM

to Hazelcast

I would assume that in case of a network partition I would see some log entries from Hazelcast, but there were none. Going back in the log I can see that the cluster (just two nodes) formed successfully on both ends.

My processing time is also significantly under 5 minutes; I am just sending a confirmation mail. What I am seeing is that node #2 sends the mail 37 milliseconds after node #1. This is only possible if both nodes acquired the lock.

So I can only assume that an OS freeze or a long GC pause was the culprit. How long would those have to be for Hazelcast to assume a network split?

Ulrich

Mehmet Dogan

Oct 24, 2017, 1:50:05 AM

to haze...@googlegroups.com

If there were a network split (for whatever reason), you would see the two members complaining about and removing each other from the cluster. By default, the heartbeat timeout is 5 minutes (as of version 3.9, it is down to 1 minute). So a pause would have to be longer than the heartbeat timeout to cause a split.
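
For reference, a minimal sketch of how that timeout could be set from Java. It assumes the property is named `hazelcast.max.no.heartbeat.seconds` and that it is picked up as a JVM system property when the member starts; the class and method names are made up for illustration, so please check the property name and default against the reference manual for your version.

```java
// Hypothetical helper, not part of Hazelcast.
class HeartbeatTimeoutSketch {

    // Assumed property name; verify it against the manual for your version.
    static final String MAX_NO_HEARTBEAT_SECONDS = "hazelcast.max.no.heartbeat.seconds";

    // Intended to be called before the Hazelcast member is started
    // (assumption: the property is read at startup).
    static void setHeartbeatTimeout(int seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        System.setProperty(MAX_NO_HEARTBEAT_SECONDS, Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setHeartbeatTimeout(60);
        System.out.println(MAX_NO_HEARTBEAT_SECONDS + "="
                + System.getProperty(MAX_NO_HEARTBEAT_SECONDS));
    }
}
```

The same value can be passed on the command line with `-D`, which avoids any ordering concerns inside the application.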

You are not just checking for the existence of the entry but also doing an equality check: previousValue.equals("locked"). What are the other possible values? Is it possible to observe a different value? How do you unlock (remove) the key: manually or by relying on TTL expiration?

--

Mehmet Dogan

real...@gmail.com

Oct 24, 2017, 2:15:02 PM

to Hazelcast

Hi Mehmet,

There was definitely no network split. I have observed those in the past and they are unmissable in the log. There was also most definitely no 5-minute pause.

If this appears to be a real issue and not just me not understanding how to use Hazelcast correctly, then I can of course prepare a small, self-contained example. Meanwhile, the code I wrote is here:

https://github.com/realulim/Bornemisza/blob/master/java/Users/src/main/java/de/bornemisza/users/subscriber/AbstractConfirmationMailListener.java

This code is fully unit tested, so I'm fairly confident it does what I intend it to do. If you look at the method "onMessage", this is where a message from a ReliableTopic comes in and the processing starts. The whole point is to send a confirmation mail to a user who has just registered for my service. When the user clicks on the "register" button, the message is put onto the ReliableTopic and all nodes will receive it. But only one node should send a mail. I am going for an at-most-once scheme here.

Ulrich

Lukáš Herman

Oct 24, 2017, 2:25:51 PM

to Hazelcast

Hi Ulrich,

I have seen similar behavior recently under load, where putIfAbsent signaled that a unique value was already in the map, although that value had not been put there before. Very strange, not always reproducible, but always logged in the application log.

Another case: putIfAbsent on a non-persistent map succeeded when it should not have, because (I guess due to memory pressure) Hazelcast prematurely dropped values from the map (TTL 3600 s, time between invocations 10 s).

This definitely deserves some attention.

regards

Lukas

On Tuesday, October 24, 2017 at 20:15:02 UTC+2, real...@gmail.com wrote:

Mehmet Dogan

Oct 25, 2017, 4:03:54 AM

to haze...@googlegroups.com

Hi Ulrich,

If I'm not missing something, your `onMessage(msg)` method is not safe to be called more than once.

Assume there are just two nodes.

- 1st node calls `userIdMap.putIfAbsent(user.getId(), ...)` and putIfAbsent returns null.

- 1st node assumes it has the lock and does its processing.

- 1st node does `userIdMap.put(user.getId(), ...)` which means releasing the lock.

- 2nd node calls `userIdMap.putIfAbsent(user.getId(), ...)` and putIfAbsent returns the uuid inserted by 1st node.

- 2nd node assumes it has the lock (because `!previousValue.equals("locked")` is true) and continues processing.

In the execution flow above, both nodes acquire the lock, not concurrently but sequentially.

I think this should work as intended if you remove the `!previousValue.equals("locked")` condition.

```
public void onMessage(Message<User> msg) {
    User user = msg.getMessageObject();
    [...]
    // Now we know that we have to send an email unless the message is already being worked on
    String previousValue = this.userIdMap.putIfAbsent(user.getId(), "locked", 5, TimeUnit.MINUTES);
    String uuid = UUID.randomUUID().toString();
    if (previousValue == null || !previousValue.equals("locked")) {
        [...]
        this.userIdMap.put(user.getId(), uuid, 24, TimeUnit.HOURS);
    }
    else {
        Logger.getAnonymousLogger().info("Skipping Request Handling, it is already being worked on.");
    }
```
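
The sequential flow described above can be reproduced in a single JVM. This is only a sketch under assumptions: a `ConcurrentHashMap` stands in for the `IMap` (its `putIfAbsent` has the same "return the previous value" contract), TTLs are left out, and the class and method names are invented for the demonstration rather than taken from the linked code.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Single-JVM stand-in for the two-node scenario; not Hazelcast code.
class SequentialLockDemo {

    // Returns true if the caller decides to send the mail.
    // strictCheck = false uses the condition from the listener above;
    // strictCheck = true proceeds only when the key was absent.
    static boolean handle(ConcurrentMap<String, String> userIdMap,
                          String userId, boolean strictCheck) {
        String previousValue = userIdMap.putIfAbsent(userId, "locked");
        boolean proceed = strictCheck
                ? previousValue == null
                : previousValue == null || !previousValue.equals("locked");
        if (proceed) {
            // ... the confirmation mail would be sent here ...
            // Overwriting "locked" with a uuid is what releases the lock.
            userIdMap.put(userId, UUID.randomUUID().toString());
        }
        return proceed;
    }

    // Runs node 1 to completion, then node 2, and counts the mails sent.
    static int mailsSent(boolean strictCheck) {
        ConcurrentMap<String, String> userIdMap = new ConcurrentHashMap<>();
        int sent = 0;
        if (handle(userIdMap, "user-1", strictCheck)) sent++; // node 1
        if (handle(userIdMap, "user-1", strictCheck)) sent++; // node 2, right after
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("original condition:  " + mailsSent(false) + " mail(s)");
        System.out.println("null-only condition: " + mailsSent(true) + " mail(s)");
    }
}
```

With the original condition the second call sees the uuid, which is not "locked", and proceeds as well. With the null-only condition the second call is skipped for as long as the entry exists, which in the real map would be until its TTL expires.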

--

Mehmet Dogan

real...@gmail.com

Oct 25, 2017, 12:19:33 PM

to Hazelcast

Hi Mehmet,

I believe you're right. My unit tests don't cover all race conditions. They assume that the two nodes either come in concurrently (in which case the locking will work) or a long time apart (in which case re-sending the mail is ok). The case you describe is when the two nodes come in right after one another, so the lock has just been released. In that case I don't want to send another mail, but the second node cannot know that one was just sent.

Anyway, thank you very much for looking at this, it seems I have correctly understood Hazelcast, but not my code :)

cheers,

Ulrich

real...@gmail.com

Oct 25, 2017, 12:27:23 PM

to Hazelcast

I should say that I thought I had in fact covered the case you mentioned with the first while-loop, which returns if the incoming data equals the existing data, as would be the case for the second node coming in after the first. However, if the second node is already past this check but has not yet reached the point where it tries to acquire the lock, then the race condition happens.
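
To make that window concrete, here is a rough single-JVM sketch that forces the interleaving by hand. Everything in it is an assumption for illustration: a `ConcurrentHashMap` replaces the `IMap`, `alreadyHandled` is a simplified stand-in for the early check (it treats any non-"locked" value as "already handled"), and the stored value is a fixed placeholder instead of a real uuid.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Forces the "check first, lock later" interleaving in one thread.
class CheckThenLockDemo {

    // Simplified stand-in for the early check: true if a result is already stored.
    static boolean alreadyHandled(ConcurrentMap<String, String> map, String userId) {
        String value = map.get(userId);
        return value != null && !value.equals("locked");
    }

    // Lock step with the original condition; returns true if a mail is sent.
    static boolean lockAndSend(ConcurrentMap<String, String> map, String userId) {
        String previousValue = map.putIfAbsent(userId, "locked");
        if (previousValue == null || !previousValue.equals("locked")) {
            map.put(userId, "uuid-placeholder"); // releases the lock
            return true;
        }
        return false;
    }

    // node2ChecksEarly = true: node 2 passes the check before node 1 starts.
    // node2ChecksEarly = false: node 2 checks only after node 1 has finished.
    static int mailsSent(boolean node2ChecksEarly) {
        ConcurrentMap<String, String> map = new ConcurrentHashMap<>();
        int sent = 0;
        boolean node2Skips = node2ChecksEarly && alreadyHandled(map, "user-1");
        // Node 1 runs both steps without interruption.
        if (!alreadyHandled(map, "user-1") && lockAndSend(map, "user-1")) sent++;
        // Node 2 either resumes with its stale check result or checks now.
        if (!node2ChecksEarly) node2Skips = alreadyHandled(map, "user-1");
        if (!node2Skips && lockAndSend(map, "user-1")) sent++;
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("stale check: " + mailsSent(true) + " mail(s)");
        System.out.println("fresh check: " + mailsSent(false) + " mail(s)");
    }
}
```

The check and the lock acquisition are two separate map operations, so a result obtained in the first can be out of date by the time the second runs.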

Ulrich

real...@gmail.com

Oct 25, 2017, 12:27:52 PM

to Hazelcast

for-loop, not while-loop, sorry.

Ulrich
