Cache invalidation problems in a multi-node cluster


Kirill Kosolapov

Jun 11, 2024, 9:55:30 AM
to Keycloak User
Hi everyone,
I have a production cluster of Keycloak deployed in k8s with two connected deployments:
1. keycloak - accepts ingress traffic and contains my custom plugins, so it is updated quite often, but holds no data for the distributed caches.
2. keycloak-ispn - doesn't accept ingress traffic, is practically never updated, and holds all of the main data for the distributed caches.
Both are connected to one headless k8s service, and its FQDN is used in -Djgroups.dns.query.
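
For reference, the transport wiring is roughly the following on both deployments (simplified sketch - the service name below is just an example, not our exact value):

  <cache-container name="keycloak">
      <!-- the built-in "kubernetes" JGroups stack discovers peers via
           dns.DNS_PING, which resolves -Djgroups.dns.query, so every pod
           of both deployments joins the same cluster -->
      <transport stack="kubernetes" lock-timeout="60000"/>
  </cache-container>

started with something like -Djgroups.dns.query=keycloak-headless.my-namespace.svc.cluster.local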

Historically we had just a single deployment, but its rollouts were very slow, and since we were developing a lot of plugins we needed faster and more stable deployments, so we decided to split it into two parts. We use two Infinispan configs; the main difference between them is the following:
1. keycloak - <cache-container name="keycloak" zero-capacity-node="true">
2. keycloak-ispn - <cache-container name="keycloak" statistics="true">
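
Written out a bit more fully, the difference is only the container attributes - the distributed caches themselves are declared the same on both sides (simplified sketch; cache name and owners below are just examples):

  <!-- keycloak (ingress) nodes: join the cluster but own no cache segments -->
  <cache-container name="keycloak" zero-capacity-node="true">
      <distributed-cache name="sessions" owners="2"/>
      ...
  </cache-container>

  <!-- keycloak-ispn nodes: hold all of the distributed cache data -->
  <cache-container name="keycloak" statistics="true">
      <distributed-cache name="sessions" owners="2"/>
      ...
  </cache-container>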

Everything was okay - the distributed caches all lived on the keycloak-ispn nodes - but we faced a major problem with invalidation of the local caches: realms, users, authorization, keys. For example, when we change a realm's settings on one node, the change is applied locally and in Postgres, but it is never seen on the other nodes of the keycloak (non-ispn, ingress) deployment.
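
For reference, those are all local caches in the default Keycloak cache config (roughly as below; exact limits may differ between versions), so every node keeps its own copy and depends on cluster-wide invalidation messages to stay in sync:

  <local-cache name="realms">
      <memory max-count="10000"/>
  </local-cache>
  <local-cache name="users">
      <memory max-count="10000"/>
  </local-cache>
  <local-cache name="authorization">
      <memory max-count="10000"/>
  </local-cache>
  <local-cache name="keys">
      <expiration max-idle="3600000"/>
      <memory max-count="1000"/>
  </local-cache>
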
Any help is appreciated. 

Alexander Schwartz

Jun 11, 2024, 10:07:03 AM
to Kirill Kosolapov, Keycloak User
Hi Kirill,

First of all, we've been working to make the redeployment of Keycloak a lot more stable, especially for Keycloak 24 and later. So I'd suggest you try again with a standard deployment, using a StatefulSet to allow for rolling updates. If you run into problems with rolling updates, create an issue in the Keycloak GitHub issue tracker and we'll investigate.

The reason invalidations are not working in your setup is that the work cache isn't populated with new entries. Each node listens for new entries appearing in its work cache and then triggers the invalidation. On a zero-capacity node, I assume the work cache stays empty, so no invalidations are received.
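
For context, the work cache is declared as a replicated cache in the default config, roughly:

  <!-- every regular node stores a full copy and listens on it to trigger
       local invalidations; a zero-capacity node presumably never stores
       the entries, so its listener has nothing to react to -->
  <replicated-cache name="work"/>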

If your setup runs a lot of sessions, you might be interested in the persistent sessions feature, which is new in Keycloak 25 and might reduce your memory usage. See https://github.com/keycloak/keycloak/discussions/28271 for the discussion.

Best
Alexander




--

Alexander Schwartz, RHCE

He/Him

Principal Software Engineer, Keycloak Maintainer

Red Hat - Germany remote

asch...@redhat.com   


Kirill Kosolapov

Jun 12, 2024, 5:50:33 AM
to Keycloak User
Hi Alexander,

Thanks for the answer. We will try to update to version 25; we are currently on version 21, so it will be a long road. Another reason for us to move to the two-deployment setup was that our memory usage was very high, as we had lots of sessions, and nodes sometimes went offline because of OOMs, so I hope persistent sessions will resolve that.
One thing I don't understand is what the use of a zero-capacity node is then. If it doesn't contain data and cannot be used as an ingress target, what's the point of even having it?
Speaking of the work cache being empty: we are collecting metrics for it with <replicated-cache name="work" statistics="true"> and we see that it is non-zero (in the hundreds) on every node via the vendor_cache_manager_keycloak_cache_work_cluster_cache_stats_approximate_entries metric.

Alexander Schwartz

Jun 14, 2024, 4:43:52 AM
to Kirill Kosolapov, Keycloak User
Hi Kirill,

thank you for looking at the work cache statistics. IMHO this metric reports the overall number of entries in the cluster. For the listener that Keycloak uses, those entries need to be present on the individual Keycloak node.

Zero-capacity nodes might be a good thing in other Infinispan setups. I can't think of a use for one in the context of Keycloak, and it has never been tested or advocated for.

This might be a question for the Infinispan community chat. https://infinispan.org/community/

Best,
Alexander
