Unexpected Downtime When Running Keycloak HA in K8S


Tishan Chanaka

Aug 14, 2021, 10:17:26 AM
to Keycloak User

Hi,

I am running a Keycloak cluster in Kubernetes with HA configuration and JGroups KUBE_PING (most of the time my setup has 3 cluster nodes). This setup works fine in my UAT clusters, but in my production setup, if one pod goes down (because of a new deployment, a pod deletion, or a liveness probe failure), all the other pods also go into the unready state, resulting in an application downtime of about 3 minutes.

Can someone please help me to identify the issue here? I can provide more information if required.

PS: I am using keycloak-6.0.1 for the setup (I know it is an older version and I am planning to upgrade). I am also using /auth/realms/master as the readiness probe and /auth/ as the liveness probe, each with a 10s timeout.
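
For reference, the probe section of my container spec looks roughly like the following. The paths and the 10s timeout are exactly as described above; the port, period and failure threshold here are from memory, so please treat this as a sketch rather than my exact manifest:

        # readiness: pod is taken out of the Service endpoints when this fails
        readinessProbe:
          httpGet:
            path: /auth/realms/master
            port: 8080
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 3
        # liveness: pod is restarted when this keeps failing
        livenessProbe:
          httpGet:
            path: /auth/
            port: 8080
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 3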


Thanks and Regards,

Tishan.

dc...@prosentient.com.au

Aug 17, 2021, 2:59:28 AM
to Tishan Chanaka, Keycloak User

I’ve been having issues with Keycloak HA as well, where a cluster member unexpectedly leaving the cluster causes Keycloak to stop functioning properly.

 

I’m using JDBC_PING though.

 

David Cook

Senior Software Engineer

Prosentient Systems

Suite 7.03

6a Glen St

Milsons Point NSW 2061

Australia

 

Office: 02 9212 0899

Online: 02 8005 0595


Phil Fleischer

Aug 21, 2021, 12:36:43 PM
to dc...@prosentient.com.au, Tishan Chanaka, Keycloak User
One thing that comes to mind is that perhaps you have distributed sync enabled on one or more caches, and the node that goes down is one of the cache owners, which causes a rebalance. That rebalance then steals resources from the remaining owner node and the new leader, and overwhelms the API application.

We decided that in most cases the benefit of having just a couple of Keycloak nodes be the owners of a cache (saving memory usage on certain nodes) was not helpful, and switched to replicated caches.

This may not be the case, but it's the first thing that comes to mind.  Infinispan *sigh*

— Phil

dc...@prosentient.com.au

Aug 22, 2021, 11:11:42 PM
to Phil Fleischer, Tishan Chanaka, Keycloak User

I did notice that there seems to be a bug with JDBC_PING (at least in older versions of Keycloak) where nodes that leave the cluster don’t clean up after themselves: the database connection appears to be closed before the jgroups table can be updated, which has left the jgroups table containing a lot of invalid entries and caused a lot of performance problems. Using “remove_old_coords_on_view_change” and “remove_all_data_on_view_change” works around that issue though, as the jgroups table gets rewritten whenever there is a new cluster view.
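
For anyone wanting to try the same workaround, this is roughly where the two properties go in the jgroups subsystem of standalone-ha.xml. The stack name and the KeycloakDS datasource JNDI name are just what my setup happens to use, so adjust them for yours, and I have only tried this on a test instance so far:

    <stack name="tcp">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <protocol type="JDBC_PING">
            <!-- discovery entries live in the jgroups table of this datasource -->
            <property name="datasource_jndi_name">java:jboss/datasources/KeycloakDS</property>
            <!-- rewrite the table on every new cluster view, so rows left behind
                 by nodes that died without cleaning up get cleared out -->
            <property name="remove_all_data_on_view_change">true</property>
            <property name="remove_old_coords_on_view_change">true</property>
        </protocol>
        <!-- remaining protocols (MERGE3, FD_SOCK, pbcast.GMS, ...) left as shipped -->
    </stack>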

 

Regarding Infinispan, we mostly use the out-of-the-box “Standalone Clustered Configuration”; I haven’t delved deep enough into the Infinispan configuration (yet). Looking now… it seems that there is a range of local, distributed (most with 1 owner, one with 2 owners), invalidation, and replicated caches out of the box. I have been thinking about experimenting with making more of them replicated caches, something like the sketch below.
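
Concretely, the experiment I have in mind is in the "keycloak" cache-container of standalone-ha.xml, where the session caches ship as distributed with a single owner. Completely untested on my side, and the exact cache names may differ between versions, but the change would be along these lines:

    <!-- as shipped: one owner per entry, so entries on a node that dies are simply lost -->
    <distributed-cache name="sessions" owners="1"/>

    <!-- option 1: keep it distributed but hold a second copy of each entry -->
    <distributed-cache name="sessions" owners="2"/>

    <!-- option 2: every node holds a full copy, at the cost of more memory
         and more replication traffic on writes -->
    <replicated-cache name="sessions"/>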

 

The Keycloak clustering leaves a bit to be desired so far.

 

David Cook


Tatsiana Tupeka

Aug 25, 2021, 10:08:17 AM
to Keycloak User
We are having a similar issue with JDBC_PING and stale cluster data. We do have “remove_all_data_on_view_change” enabled, but I am not sure why there would also be a need for “remove_old_coords_on_view_change”? It seems like it should be one or the other.

dc...@prosentient.com.au

Aug 25, 2021, 8:32:51 PM
to Tatsiana Tupeka, Keycloak User

Thanks for sharing that. It definitely looks like an either/or. I’d only tried the properties on a test instance, so I hadn’t looked into them too deeply yet; I was just basing it on what I read at https://www.keycloak.org/2019/08/keycloak-jdbc-ping.

 

Are you still having problems even with “remove_all_data_on_view_change”?

 

David Cook

