[Hazelcast 3.1.5] Entry TTL no longer honored after shutdown/start of Hazelcast nodes in cluster

404 views

Vadim Yaremchuk

Jun 30, 2014, 2:56:52 PM
to haze...@googlegroups.com
Hi All,

I am experiencing the following issue:

1. Start two Hazelcast nodes (NodeA and NodeB).
2. Create an entry on NodeA with IMap.put(entryKey, entryValue, 100, TimeUnit.SECONDS);
3. Check that the entry exists on NodeB by calling IMap.get(key);
4. I expect the entry to be evicted after 100 seconds.
5. During these 100 seconds I shut down and then bring up the nodes one by one (one node is always available, so entries are not lost):
- shut down NodeA, wait till cluster re-partitioning is finished on NodeB (log messages: "Re-partitioning cluster data...", "Migration queue size: 135 All migration tasks has been completed, queues are empty")
- bring NodeA back up, wait till cluster re-partitioning is finished
- shut down NodeB, wait till cluster re-partitioning is finished
- bring NodeB back up, wait till cluster re-partitioning is finished
- shut down NodeA, wait till cluster re-partitioning is finished
6. Check after 100 seconds have passed that the entry has disappeared from NodeB: it is still present in Hazelcast.
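For reference, steps 1-4 can be condensed into a short sketch. This is an illustration only (not run here, since it needs hazelcast.jar and spawns two cluster members); the map name "test" is my own placeholder:

```java
// Two members form a cluster on one box (multicast join by default).
HazelcastInstance nodeA = Hazelcast.newHazelcastInstance();
HazelcastInstance nodeB = Hazelcast.newHazelcastInstance();

// Step 2: put with a per-entry TTL of 100 seconds on NodeA.
IMap<String, String> mapA = nodeA.getMap("test");
mapA.put("entryKey", "entryValue", 100, TimeUnit.SECONDS);

// Step 3: the entry is visible from NodeB; after 100 seconds it
// should be evicted cluster-wide, including from any backups.
IMap<String, String> mapB = nodeB.getMap("test");
System.out.println(mapB.get("entryKey"));
```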

I can't reproduce it consistently, but the flow described above gives a high chance of replicating it.

If the cluster nodes are not disturbed (no shutdown/start), the entry is successfully evicted when the TTL expires.

Is this a known bug? It looks like it happens during data re-partitioning/migration inside the cluster, and the entry TTL falls back to the default (presumably the value of <time-to-live-seconds> in the Hazelcast config)?

Thanks

Regards,
Vadym

Vadim Yaremchuk

Jul 1, 2014, 2:44:34 PM
to haze...@googlegroups.com
I tried the latest Hazelcast version, 3.2.3, and I see the same issue I described in my previous message.

Btw, I have a feeling that the eviction time was extended not to the <time-to-live-seconds> from the Hazelcast config but by the original TTL that was provided during the entry put. I think so because I set <time-to-live-seconds> to 400 seconds, but the entry was evicted in less time than that, yet more than the TTL provided with the original put.

Can anybody help resolve this?

Thanks.

Vadim Yaremchuk

Jul 8, 2014, 1:33:32 PM
to haze...@googlegroups.com
Hi,

The issue is not reproducible with version 3.3-RC1.

Daniel Neugebauer

Jul 21, 2014, 12:40:03 PM
to haze...@googlegroups.com
Hi!

I seem to have a similar problem on 3.2.3, but with max-idle instead of an explicit TTL. max-idle is set to 300 seconds, but the last access time can go much higher (multiple hours) and the map entry still isn't cleared. I tried to reproduce it locally by rebooting nodes the same way Vadim did, but the issue seems to occur only in our production environment. The only difference in configuration is that we don't use multicast but directly configured TCP connections; all other parameters are the same.

The way I use that map is quite special. I need to run periodic maintenance jobs in our application (not distributed jobs in Hazelcast terms), but they should only be run by one node at a time. So I thought I could coordinate access using a map entry in Hazelcast. Each node peeks at the last value using an EntryView, to avoid triggering an access. The value has to match the local node's ID if it exists; otherwise no jobs are run. If it matches or does not exist yet, the key is accessed with putIfAbsent and the return value is checked again. If it matches the local node's ID, the jobs are run. I'm relying on max-idle as a presumably safe way to remove that map entry when a node goes down. So approximately 5 minutes (300s) after the last job execution, Hazelcast should remove the entry, thus allowing any node to re-register as job executor.
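The decision logic described above can be modeled in plain Java, with ConcurrentMap.putIfAbsent standing in for the Hazelcast IMap. The class and method names here are mine, not from the thread; max-idle eviction is simulated by simply removing the key, and the real code would peek via an EntryView rather than map.get:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Model of the coordination scheme: each node tries to claim the
// "job executor" role by publishing its ID under a well-known key.
class JobCoordinator {
    static final String KEY = "job-executor";

    private final ConcurrentMap<String, String> map;
    private final String localNodeId;

    JobCoordinator(ConcurrentMap<String, String> map, String localNodeId) {
        this.map = map;
        this.localNodeId = localNodeId;
    }

    /** Returns true if this node holds (or just acquired) the executor role. */
    boolean shouldRunJobs() {
        String current = map.get(KEY); // real code peeks via EntryView to avoid an access
        if (current != null && !current.equals(localNodeId)) {
            return false;              // another node owns the role
        }
        // Key matches us or is absent: try to claim it and re-check the winner.
        String previous = map.putIfAbsent(KEY, localNodeId);
        return previous == null || previous.equals(localNodeId);
    }
}
```

In the real setup, max-idle eviction is what eventually deletes the key after the owning node stops touching it, letting a surviving node win the next putIfAbsent.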

I thought maybe the entry had become locked, so it wouldn't get purged from the map after the max-idle time, but according to LocalMapStats on both nodes it isn't. Another difference in our production environment is that we let our proxy + Nagios monitor a status page that shows map.size(), cluster members, multiple map statistics, and the EntryView's value. That status page is accessed about once per second. Maybe accessing map.size() or the EntryView repeatedly also triggers an access or temporarily suspends deletion? However, I was unable to reproduce that on my local machine with approx. 2-3 reloads per second around the max-idle timeout. Deletion still happened about 0 to 10 seconds after the max-idle timeout.

If there is a bug, it would be good to know. I can implement workarounds, but I think it should work out of the box, unless there really is something about using EntryView that is related to this issue (such as purging timed-out entries requiring some seconds since the last EntryView access). If the next release fixes it, I won't need to put any more time into workarounds/testing.

I see that 3.3 RC3 appears to be in preparation; did anyone hit critical bugs since RC2, or would it be "safe" to try RC2 on our production servers? (I would rather test it locally, but as I can only reproduce the issue in our production environment, I need to test it there or wait for a final release, unless I figure out why it only occurs there. I'm quite out of ideas at the moment.)

@Vadim: Did you do anything similar to what I do or can you reproduce your issue on the vanilla standalone server?

Thanks,
Daniel

Vadim Yaremchuk

Jul 21, 2014, 1:05:46 PM
to haze...@googlegroups.com
Hi Daniel,

Unfortunately I wasn't doing anything similar to your case, and for my case you need at least two Hazelcast instances (either on one box or on multiple). The issue I experienced can definitely be reproduced on 3.1.5 or 3.2.3.

Regarding your issue, I would think that polling the EntryView's value from Nagios could cause it: you configure max-idle-time, but the entry is never idle long enough, since Nagios keeps pulling its value. Can you try disabling the retrieval of the EntryView's value by Nagios and see how it behaves?
It's my guess, but maybe the Hazelcast maintainers can give more details and help you.

Regards,
Vadim

Daniel Neugebauer

Jul 23, 2014, 5:40:08 AM
to haze...@googlegroups.com
Hi!

I just noticed that 3.2.4 is out, and there was at least one change that looks like it could have been your (and our?) issue: https://github.com/hazelcast/hazelcast/pull/2942

I just updated our production deployment to 3.2.4, and max-idle was effective without any changes to Nagios or proxy checks; eviction happened about 13 seconds after the earliest timeout. Maybe we're lucky and it's fixed. I'll keep an eye on it when we deploy again next time. :)

Bye,
Daniel

daniel...@gmail.com

Nov 18, 2014, 3:02:22 AM
to haze...@googlegroups.com
Hi,

I'm using Hazelcast version 3.3.3 and experiencing the same problem as described in the first post.
Maybe it's because backup-count > 0 ... then the item is not fully evicted: "the latest previous value" is still present in the map and never gets evicted.

My workaround is to set backup count to 0 for given map.
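A minimal sketch of that workaround via the programmatic config (the map name "refresh" is illustrative; note it trades away backup redundancy for the map):

```java
Config cfg = new Config();
MapConfig mapConfig = new MapConfig("refresh");
mapConfig.setBackupCount(0); // workaround: no backup replicas for this map
cfg.addMapConfig(mapConfig);
// HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg);
```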

Daniel Marek

Ahmet Mircik

Nov 19, 2014, 6:10:04 AM
to haze...@googlegroups.com
Hi, 

How are you checking the size of the map? If you are using map.size(), it may not give the exact size at that moment, but it should eventually return the correct size.

BTW, I tried the stated scenario but could not reproduce your case. Is it possible to share a simple reproducer test?




--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/e5cf1a56-bf9e-4692-8148-a1bafa450f31%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

daniel...@gmail.com

Nov 21, 2014, 4:23:18 AM
to haze...@googlegroups.com
Hi, 

I'm not checking the map size, only calling containsKey on the map (in the attached simulation test, calling get() on the map and testing the value ...).
How to simulate: run the given simulation test on two nodes; after some warm-up there will be "doing refresh of some data ..." on the first or second node.

After some time, stop the first node, then start it again; stop the second, start it again ... no refresh will be performed (the value for the key "someTest" will still be stored in refreshMap, and no eviction will take place).
Try the same with backup count set to 0, and try to monitor the given entry set in refreshMap via JMX.

Daniel.
sandbox-hz-test.zip

Ahmet Mircik

Nov 21, 2014, 7:32:34 AM
to haze...@googlegroups.com
Hi Daniel,

After a quick look at your test code, I can immediately say that a locked key cannot be evicted. That seems to be the reason for what you saw.

daniel...@gmail.com

Nov 21, 2014, 8:47:59 AM
to haze...@googlegroups.com
Hi, 

It seems to me that locking is not the issue (I'm not using map.lock, only an ILock to prevent parallel execution of a refresh task in the cluster). Try this one without locking; it shows the same problem with missed eviction during node restarts:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.Date;
import java.util.concurrent.TimeUnit;

Config cfg = new Config();
MapConfig mapConfig = new MapConfig("refresh");
mapConfig.setBackupCount(2);
cfg.addMapConfig(mapConfig);

HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg);
IMap<String, Long> refreshMap = hz.getMap("refresh");
String keyName = "someTest";

while (true) {
    Long value = refreshMap.get(keyName);
    if (value == null) {
        System.out.println("doing refresh of some data ...");
        refreshMap.set(keyName, System.currentTimeMillis(), 30, TimeUnit.SECONDS);
    } else {
        System.out.println("last value is " + value + " [" + new Date(value) + "]");
    }
    sleepQuietly(4000); // helper from the attached test
}

Ahmet Mircik

Nov 21, 2014, 12:43:19 PM
to haze...@googlegroups.com
Hi Daniel,

You are right, this is a bug. Thanks for reporting it.

Fix will be available in 3.3.4.



Lukas Blunschi

Nov 21, 2014, 1:14:46 PM
to haze...@googlegroups.com
Hi Ahmet,

Did you already open a GitHub issue for this? I only see this commit:

Thanks and best,
Lukas

Ahmet Mircik

Nov 21, 2014, 4:32:11 PM
to haze...@googlegroups.com
Hi Lukas,

I sent a fix directly since it is a small thing, but I will also add a test for it.
You can follow the issue here: https://github.com/hazelcast/hazelcast/issues/4144
