Hello Keycloak team,
When Keycloak starts the offline-users and offline-client sessions are loaded from the database in incremental batches into the infinispan cache .
The batch size seems to be determined by the "sessionsPerSegment" setting used by the org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider.
By default, "sessionsPerSegment" is set to 64 in org.keycloak.models.sessions.infinispan.InfinispanUserSessionProviderFactory#getSessionsPerSegment.
This can lead to many queries if there is a big amount of offline sessions in the database, which considerably slows down the server startup of the first Keycloak server instance of a cluster.
In my case, there are around 400k offline user/client sessions loaded from a decently scaled remote database (PostgreSQL, 4 CPU, 16 GB memory) on Azure from a Keycloak Kubernetes cluster running in AKS.
Reading that data from the database currently takes around 14mins and is the most dominant factor contributing to downtime of around 20 mins for a full cluster restart.
Note, that in this setup the user-offline-sessions and client-offline-sessions are configured as replicated caches in infinispan, by default those caches are distributed-caches.
This was necessary to guarantee that every Keycloak instance can reliably read the same offline-session information as fast as possible, at the cost of a slower cache write time.
The offline-sessions are loaded after hibernate was initialized, which can be observed in the logs:
06:21:07,015 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 68) HHH000397: Using ASTQueryTranslatorFactory
// offline-session loading happens in between these two log messages...
06:35:00,301 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 68) RESTEASY002225: Deploying javax.ws.rs.core.Application: class
To reduce the number of queries being sent to the database I tried to increase the batch size which is returned, by increasing the sessionsPerSegment from 64 in power of two increments.
I learned that currently there is no configurable element in the keycloak-subsystem configuration to adjust that setting. The following SPI configuration creates a new configurable element in the keycloak-server subsystem.
### UserSessions SPI Configuration
echo SETUP: Customize Keycloak UserSessions SPI configuration
# Add dedicated userSession config element to allow configuring elements.
if (outcome == failed) of /subsystem=keycloak-server/spi=userSessions/:read-resource
echo SETUP: Add missing userSessions SPI
/subsystem=keycloak-server/spi=userSessions:add()
echo
end-if
echo SETUP: Configure built-in "infinispan" UserSessions loader
if (outcome == failed) of /subsystem=keycloak-server/spi=userSessions/provider=infinispan/:read-resource
echo SETUP: Add missing "infinispan" provider
/subsystem=keycloak-server/spi=userSessions/provider=infinispan:add(enabled=true)
/subsystem=keycloak-server/spi=userSessions/provider=infinispan:write-attribute(name=properties.sessionsPerSegment,value=${env.KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT:64})
/subsystem=keycloak-server/spi=userSessions:write-attribute(name=default-provider,value=infinispan)
echo
end-if
Which yields the following new SPI configuration in standalone-ha.xml
<spi name="userSessions">
<default-provider>infinispan</default-provider>
<provider name="infinispan" enabled="true">
<properties>
<property name="sessionsPerSegment" value="${env.KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT:64}"/>
</properties>
</provider>
</spi>
Now sessionsPerSegment can be configured via the environment variable
"KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT".
In my local test setup (3.8Ghz, 6 CPU, 32GB memory, SSD, PostgreSQL database) I created a realm with ~250k users with 250k offline-sessions / offline-client sessions for a single client. Between my testruns, I restarted Keycloak, but kept the database running.
Based on my test setup I can observe the following results:
Scenario 1: sessionsPerSegment: 64 (default, base-line)
Run 1 - Hibernate init duration: ~53s
11:21:07,085 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 68) HHH000397: Using ASTQueryTranslatorFactory
11:22:00,601 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 68) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Run 2 - Hibernate init duration: ~52s
11:23:43,417 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 63) HHH000397: Using ASTQueryTranslatorFactory
11:24:35,662 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 63) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Scenario 2 - sessionsPerSegment: 256
Run 1 - Hibernate init duration: 15s
11:26:29,017 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 61) HHH000397: Using ASTQueryTranslatorFactory
11:26:45,315 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 61) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Run 2 - Hibernate init duration: 16s
11:27:35,152 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory
11:27:51,788 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Scenario 3 - sessionsPerSegment: 512
Run 1 - Hibernate init duration: 12s
11:29:15,707 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory
11:29:27,147 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Run 2 - Hibernate init duration: 12s
11:30:10,827 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory
11:30:22,433 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Scenario 4 - sessionsPerSegment: 1024
Run 1 - Hibernate init duration: 13s
11:31:16,411 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory
11:31:29,190 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Run 2 - Hibernate init duration: 12s
11:32:09,167 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 60) HHH000397: Using ASTQueryTranslatorFactory
11:32:21,452 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 60) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Scenario 5 - sessionsPerSegment: 2048
Run 1 - Hibernate init duration: 10s
11:33:28,147 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 70) HHH000397: Using ASTQueryTranslatorFactory
11:33:38,758 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 70) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
Run 2 - Hibernate init duration: 10s
11:34:21,224 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 69) HHH000397: Using ASTQueryTranslatorFactory
11:34:31,741 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 69) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
...
As one can see, increasing the "sessionsPerSegment" can significantly speed up the load-time of offline sessions. However, increasing the "sessionsPerSegment" also seems to increase the memory footprint, probably because of internal book-keeping by infinispan.
I also tested with 8192 but then I got OOME exceptions in the Infinispan subsystem.
Currently, I lean toward raising the default "sessionsPerSegment" from 64 to at least 256, as it only uses a small amount of additional memory but increases the load-times by 3-4x, which leads to a much faster cluster restart and thus smaller downtimes.
@Keycloak team do you see a problem with this?
Note, that this is a cluster-wide setting which must be the same for all nodes in the cluster! This means that the change requires a cluster restart.
Cheers,
Thomas