Faster offline session loading by increasing Infinispan sessionsPerSegment


Thomas Darimont

8. 5. 2020, 4:13:28
to: Keycloak Dev
Hello Keycloak team,

When Keycloak starts, the offline user sessions and offline client sessions are loaded from the database into the Infinispan cache in incremental batches.
The batch size seems to be determined by the "sessionsPerSegment" setting used by org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider.

 By default, "sessionsPerSegment" is set to 64 in org.keycloak.models.sessions.infinispan.InfinispanUserSessionProviderFactory#getSessionsPerSegment.
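For reference, here is a minimal sketch of how the factory resolves that value from the SPI provider config. This only illustrates the usual Config.Scope lookup pattern used by Keycloak provider factories; the class and constant names below are illustrative, not the exact upstream source:

import org.keycloak.Config;

// Sketch only: illustrates the Config.Scope lookup pattern, not the actual Keycloak class.
class SessionsPerSegmentConfig {

    // assumed default, matching the 64 mentioned above
    private static final int DEFAULT_SESSIONS_PER_SEGMENT = 64;

    private final Config.Scope config; // scope of spi=userSessions/provider=infinispan

    SessionsPerSegmentConfig(Config.Scope config) {
        this.config = config;
    }

    int getSessionsPerSegment() {
        // reads the "sessionsPerSegment" property from the provider configuration
        return config.getInt("sessionsPerSegment", DEFAULT_SESSIONS_PER_SEGMENT);
    }
}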

This can lead to many queries if there is a large number of offline sessions in the database, which considerably slows down the startup of the first Keycloak server instance of a cluster.
In my case, around 400k offline user/client sessions are loaded from a decently scaled remote database (PostgreSQL, 4 CPU, 16 GB memory) on Azure by a Keycloak Kubernetes cluster running in AKS.
Reading that data from the database currently takes around 14 minutes and is the dominant factor in the roughly 20 minutes of downtime for a full cluster restart.

Note that in this setup the user offline-sessions and client offline-sessions are configured as replicated caches in Infinispan; by default those caches are distributed caches.
This was necessary to guarantee that every Keycloak instance can reliably read the same offline-session information as fast as possible, at the cost of slower cache writes.
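For completeness, the cache change in the Infinispan subsystem of standalone-ha.xml looks roughly like this in my setup, assuming the standard cache names offlineSessions and offlineClientSessions (the stock configuration declares them as <distributed-cache ... owners="1"/> inside the "keycloak" cache container, so please double-check against your distribution):

<cache-container name="keycloak">
    <!-- ... other caches unchanged ... -->
    <!-- replicated instead of the stock distributed caches, as described above -->
    <replicated-cache name="offlineSessions"/>
    <replicated-cache name="offlineClientSessions"/>
</cache-container>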

The offline sessions are loaded after Hibernate has been initialized, which can be observed in the logs:
06:21:07,015 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 68) HHH000397: Using ASTQueryTranslatorFactory 
// offline-session loading happens in between these two log messages...
06:35:00,301 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 68) RESTEASY002225: Deploying javax.ws.rs.core.Application: class 

To reduce the number of queries sent to the database, I tried to increase the returned batch size by raising sessionsPerSegment from 64 in power-of-two increments.

I learned that there is currently no configurable element in the keycloak-server subsystem configuration to adjust that setting. The following SPI configuration (jboss-cli) creates a new configurable element in the keycloak-server subsystem:

### UserSessions SPI Configuration

echo SETUP: Customize Keycloak UserSessions SPI configuration
# Add dedicated userSession config element to allow configuring elements.
if (outcome == failed) of /subsystem=keycloak-server/spi=userSessions/:read-resource
  echo SETUP: Add missing userSessions SPI
  /subsystem=keycloak-server/spi=userSessions:add()
  echo
end-if

echo SETUP: Configure built-in "infinispan"  UserSessions loader
if (outcome == failed) of /subsystem=keycloak-server/spi=userSessions/provider=infinispan/:read-resource
  echo SETUP: Add missing "infinispan" provider
  /subsystem=keycloak-server/spi=userSessions/provider=infinispan:add(enabled=true)
  /subsystem=keycloak-server/spi=userSessions/provider=infinispan:write-attribute(name=properties.sessionsPerSegment,value=${env.KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT:64})
  /subsystem=keycloak-server/spi=userSessions:write-attribute(name=default-provider,value=infinispan)
  echo
end-if

This yields the following new SPI configuration in standalone-ha.xml:

<spi name="userSessions">
    <default-provider>infinispan</default-provider>
    <provider name="infinispan" enabled="true">
        <properties>
            <property name="sessionsPerSegment" value="${env.KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT:64}"/>
        </properties>
    </provider>
</spi>

Now sessionsPerSegment can be configured via the environment variable
"KEYCLOAK_INFINISPAN_SESSIONS_PER_SEGMENT".

In my local test setup (3.8 GHz, 6 CPU, 32 GB memory, SSD, PostgreSQL database) I created a realm with ~250k users and ~250k offline sessions / offline client sessions for a single client. Between my test runs I restarted Keycloak but kept the database running.

Based on my test setup I can observe the following results:

Scenario 1 - sessionsPerSegment: 64 (default, baseline)
Run 1 - Hibernate init duration: ~53s
11:21:07,085 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 68) HHH000397: Using ASTQueryTranslatorFactory 
11:22:00,601 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 68) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Run 2 - Hibernate init duration: ~52s
11:23:43,417 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 63) HHH000397: Using ASTQueryTranslatorFactory 
11:24:35,662 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 63) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Scenario 2 - sessionsPerSegment: 256 
Run 1 - Hibernate init duration: 15s
11:26:29,017 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 61) HHH000397: Using ASTQueryTranslatorFactory 
11:26:45,315 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 61) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Run 2 - Hibernate init duration: 16s
11:27:35,152 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory 
11:27:51,788 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Scenario 3 - sessionsPerSegment: 512 
Run 1 - Hibernate init duration: 12s
11:29:15,707 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory 
11:29:27,147 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Run 2 - Hibernate init duration: 12s
11:30:10,827 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory 
11:30:22,433 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication


Scenario 4 - sessionsPerSegment: 1024 
Run 1 - Hibernate init duration: 13s
11:31:16,411 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 58) HHH000397: Using ASTQueryTranslatorFactory 
11:31:29,190 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 58) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication

Run 2 - Hibernate init duration: 12s
11:32:09,167 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 60) HHH000397: Using ASTQueryTranslatorFactory 
11:32:21,452 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 60) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Scenario 5 - sessionsPerSegment: 2048 
Run 1 - Hibernate init duration: 10s
11:33:28,147 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 70) HHH000397: Using ASTQueryTranslatorFactory 
11:33:38,758 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 70) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication 

Run 2 - Hibernate init duration: 10s
11:34:21,224 INFO [org.hibernate.hql.internal.QueryTranslatorFactoryInitiator] (ServerService Thread Pool -- 69) HHH000397: Using ASTQueryTranslatorFactory 
11:34:31,741 INFO [org.jboss.resteasy.resteasy_jaxrs.i18n] (ServerService Thread Pool -- 69) RESTEASY002225: Deploying javax.ws.rs.core.Application: class org.keycloak.services.resources.KeycloakApplication
...

As one can see, increasing "sessionsPerSegment" can significantly speed up the loading of offline sessions. However, increasing it also seems to increase the memory footprint, probably because of internal book-keeping in Infinispan.

I also tested with 8192, but then I got OutOfMemoryErrors (OOME) in the Infinispan subsystem.
Currently, I lean toward raising the default "sessionsPerSegment" from 64 to at least 256, as it only uses a small amount of additional memory but speeds up loading by 3-4x, which leads to a much faster cluster restart and thus shorter downtimes.

@Keycloak team do you see a problem with this?

Note that this is a cluster-wide setting which must be the same for all nodes in the cluster! This means that the change requires a full cluster restart.

Cheers,
Thomas

Thomas Darimont

8. 5. 2020, 4:27:12
to: Keycloak Dev
Forgot to add:
- The mentioned load times are always in the context of a single Keycloak instance.

Pedro Igor Craveiro e Silva

8. 5. 2020, 7:31:10
to: Thomas Darimont, Keycloak Dev
Hi Thomas,

I don't see a problem as long as we don't hit limits of certain databases, such as Oracle's ORA-01795 (maximum number of expressions in a list is 1000).

Another helpful solution for this particular problem is adding indexes to the offline session table. See https://issues.redhat.com/browse/KEYCLOAK-11019. Marek suggested this solution and it does help.

Recently, I was involved in a case where there were more than 3M offline sessions. I also did some tests locally using 1M.

When loading sessions from the database, if there are more nodes in the cluster, the work of loading these sessions is distributed among them. So you should also see some improvement if the server instances are not competing for resources.

One improvement we could also make in this area is how this work is distributed among the nodes in the cluster. Currently, there is a calculation that takes into account the number of CPUs and the number of nodes in the cluster. IMO, this could be configurable so that you can tune it to whatever fits your environment best.

In any case, all that should be addressed by the new storage.

Regards.
Pedro Igor


Marek Posolda

11. 5. 2020, 9:25:36
to: Pedro Igor Craveiro e Silva, Thomas Darimont, Keycloak Dev
+1 to what Pedro mentioned.

It seems that adding the index can help a lot based on the feedback from users; however, that would also require adding https://issues.redhat.com/browse/KEYCLOAK-11908 .

Hopefully the sessionsPerSegment change can be done too, and thanks for sharing your numbers, Thomas. But we need to be careful due to the Oracle DB limits...

Finally, the new store should rewrite this part entirely, so any improvement we make now is probably only for the next few years (maybe even just a year or so).

Marek

Thomas Darimont

14. 5. 2020, 13:09:39
to: Keycloak Dev
Thanks Marek and Pedro,

I gave the index a try in my test setup, but the performance gains were barely noticeable. Nevertheless, such an index could be useful to have - perhaps one could add it as a Liquibase migration?
Also, I still see a 3-4x improvement when I change sessionsPerSegment from 64 to 256. How about changing the Keycloak default to this?

@Pedro: Do you still have the setup with the 1 million offline sessions accessible? I'd be interested to see how the load-times would change on your side with sessionsPerSegment=64|256|512.

Cheers,
Thomas



Pedro Igor Craveiro e Silva

19. 5. 2020, 14:02:03
to: Thomas Darimont, Keycloak Dev
On Thu, May 14, 2020 at 2:09 PM 'Thomas Darimont' via Keycloak Dev <keyclo...@googlegroups.com> wrote:
@Pedro: Do you still have the setup with the 1 million offline sessions accessible? I'd be interested to see how the load-times would change on your side with sessionsPerSegment=64|256|512.

I have it locally. What I did was run the `KeycloakServer` in CLI mode, like this: https://github.com/keycloak/keycloak/blob/master/testsuite/utils/src/main/java/org/keycloak/testsuite/KeycloakServer.java#L249.

Another tip from Marek.
 

Todor Kazakov

6. 11. 2020, 21:23:41
to: Keycloak Dev
I have a Keycloak cluster in Kubernetes with 4 nodes for testing purposes and ~200k offline sessions in the database. When I start the 4 pods at once, it looks like only one pod starts loading the sessions, while the other three can't even join the cluster due to the increased network traffic. Is there a way to postpone the loading process until the Keycloak nodes have formed a cluster?

Thanks,
Todd
