Keycloak high CPU usage


Neeraj

May 27, 2020, 8:18:46 AM
to Keycloak User

Hi All,


We have a Keycloak HA setup with 3 pods running in a Kubernetes environment. We ran UMA traffic with 10,000 users at about 400 requests/second for around 10 hours. CPU usage increased over the run, with spikes at every 30-minute interval whenever the refresh tokens were used to acquire new access tokens (30-minute access token lifespan). In one run, we stopped the traffic at around 9 hours, but CPU usage had risen to more than 1500 millicores and stayed at that level even after we stopped traffic, whereas usage before the run was well below 500 millicores. In another run, after 10 hours (at the session time-out instant), CPU usage spiked above 2000 millicores and the pods started crashing. We suspect a possible leak in CPU usage (I have attached a CPU usage screenshot for reference).
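For context, the token refresh that the load test performs every 30 minutes per user is roughly the following (a sketch only against the standard Keycloak token endpoint; the host, realm name, client id, and secret below are placeholders, not our actual configuration):

```shell
# Hypothetical refresh-token grant issued per user every 30 minutes.
# keycloak.example.com, myrealm, my-client, my-secret are placeholders.
curl -s -X POST \
  "https://keycloak.example.com/auth/realms/myrealm/protocol/openid-connect/token" \
  -d "grant_type=refresh_token" \
  -d "client_id=my-client" \
  -d "client_secret=my-secret" \
  -d "refresh_token=${REFRESH_TOKEN}"
```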

Current CPU and memory configuration:
limits:
  cpu: "2"
  memory: 2Gi
requests:
  cpu: 800m
  memory: 768Mi

I raised an issue for this (https://issues.redhat.com/browse/KEYCLOAK-14269) and was advised to bring it to the mailing list.

Could you please check and suggest what can be done to reduce CPU usage, especially after the traffic run ends?

Similar issues observed in the past:
https://issues.redhat.com/browse/KEYCLOAK-13911
https://issues.redhat.com/browse/KEYCLOAK-13180
https://keycloak.discourse.group/t/cpu-and-memory-growing-linearly-over-time-is-there-a-leak/909

Screenshot_2020-05-05 A A resource usage - Grafana.png

Stian Thorgersen

May 27, 2020, 8:44:05 AM
to Neeraj, Keycloak User
Is your test perhaps creating a new session for each request? If so, that would generate a large number of sessions that need to be cleaned up, which may be the spikes you are seeing.

--
You received this message because you are subscribed to the Google Groups "Keycloak User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to keycloak-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/keycloak-user/bedb50bc-ffb4-4ef7-90d5-6a1f173bd594%40googlegroups.com.

Neeraj Ravindra

May 27, 2020, 9:06:37 AM
to st...@redhat.com, Keycloak User
No Stian, the number of sessions in Keycloak was constant at 10,000. We create a new session only on user login, and the session is maintained using refresh tokens.

Pedro Igor Craveiro e Silva

May 27, 2020, 10:19:57 AM
to Neeraj Ravindra, st...@redhat.com, Keycloak User
Would you be able to provide thread dumps, in particular while a spike is happening?

I had the same suspicion as Stian, because there are some scheduled tasks running, but you said sessions are constant...

Neeraj Ravindra

May 28, 2020, 5:18:16 AM
to Pedro Igor Craveiro e Silva, st...@redhat.com, Keycloak User
Sure @Pedro Igor, I'll reproduce the issue and provide the thread dumps shortly.
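In case it's useful to others, the dumps can be captured from the pods along these lines (a sketch only; it assumes the container image ships jcmd or jstack, that the Java process is PID 1, and keycloak-0 is a placeholder pod name):

```shell
# Capture a thread dump from a Keycloak pod (keycloak-0 is a placeholder).
kubectl exec keycloak-0 -- jcmd 1 Thread.print > tdpod1.txt
# Alternative if only jstack is available in the image:
kubectl exec keycloak-0 -- jstack -l 1 > tdpod1.txt
```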

Neeraj

Jun 3, 2020, 12:22:08 PM
to Keycloak User
I have attached thread dumps from all 3 pods. tdpod1.txt was taken during a spike. I also notice that memory usage gradually increases over the duration of the traffic run. Please take a look.

Regards,
Neeraj
tdpod1.txt
tdpod2.txt
tdpod3.txt

Phil Fleischer

Jun 3, 2020, 1:49:35 PM
to Neeraj, Keycloak User
This is a shot in the dark, but we spent a lot of time tweaking our JVM settings to play nicely with Infinispan. JVM garbage collection was doing "stop the world" pauses, which caused the distributed cache to retry endlessly, and that stole the CPU.

We're in the process of upgrading because this was supposedly fixed, but I'd be sure to check that your Infinispan cluster has the recommended settings (the doc says CMS, but we used G1).
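For what it's worth, the kind of settings we ended up with looked roughly like this (a sketch only; the heap sizes, metaspace bounds, and pause target are placeholders you'd need to tune against your own pod limits, not our exact values):

```shell
# Hypothetical JAVA_OPTS for a container-constrained Keycloak/WildFly:
# explicit heap bounds plus G1 instead of the default collector.
JAVA_OPTS="-Xms512m -Xmx1536m \
  -XX:MetaspaceSize=96m -XX:MaxMetaspaceSize=256m \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
export JAVA_OPTS
```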

— Phil


Pedro Igor Craveiro e Silva

Jun 4, 2020, 1:53:30 PM
to Phil Fleischer, Neeraj, Keycloak User
I had a look at your thread dumps and I'm also seeing a lot of GC happening. The same goes for C1 and C2 compilations.

Were these dumps taken during a high load? Or when the server was not getting any load as you mentioned earlier?

For GC, are you using CMS?

I also noted a "common-services-auth" that seems to be an EJB. It looks like a provider deployed into the server. Recently, we had really weird issues when investigating problems for a customer that were related to how they were using the SPI. Things like:

* Too much use of session notes
* Integration with external services and databases without considering timeouts/circuit breakers
* Overuse of user attributes
* Concurrency issues due to state shared between providers that were singletons
* Excessive logging

Among others. Looking at your resource limits, the dump may explain why you are seeing so much GC and compilation. Under high load you should expect a huge number of objects in memory, some short-lived (like authentication sessions) and others longer-lived (like user sessions), plus the metadata the JVM needs to store them and perform optimizations. This may be causing both GC and compilation to happen more often.

I agree with Phil that this is a shot in the dark. But considering your thread dump and the behavior you mentioned, where the server continues to show these spikes over a period, I would suspect the JVM is consuming your resources while doing housekeeping.

Regards.
Pedro Igor

Neeraj

Jun 5, 2020, 2:18:33 AM
to Keycloak User
Hi Pedro Igor,

The thread dumps posted earlier were taken during the traffic run, but even after the traffic ended, CPU usage remained high. I have now attached a thread dump taken after the end of traffic. Please take a look.

Thanks,
Neeraj

tdtrafficend.txt

Neeraj

Jun 11, 2020, 2:21:11 AM
to Keycloak User
There is one more problem due to this CPU spike: at the end of the SSO session max duration, the pods exceed the CPU limit and crash, especially when running sustained traffic for a long time. Please provide any workaround or suggestions to overcome this.

One more question: for garbage collection, which collector would you suggest if not G1?


Pedro Igor Craveiro e Silva

Jun 12, 2020, 10:04:46 AM
to Neeraj, Keycloak User
G1 may give higher overhead since you have limited CPU. Tuning and choosing JVM settings is a science and requires testing different setups according to the resources you have available and how these tests are run.

I would suggest you investigate this problem by steps:

* How does the system behave when running a single node?
* How does it behave when running a single node with no HA?

The two above should eliminate JGroups/Infinispan from the picture and tell you whether you need to do something more specific to make them work better.

* Check GC logs and see whether they show long pauses or many frequent collections over time.
* Also check how the JVM is compiling code, so you know code is being cached properly.
* How do other collectors behave for you? Do they reduce CPU utilization after the test finishes?
* Did you try tuning G1 (https://www.oracle.com/technical-resources/articles/java/g1gc.html)? Maybe review the number of threads?

Those steps should give you a feeling for whether or not you are reaching JVM limits in terms of heap or metaspace, as well as how GC is impacting your tests.
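To check the GC logs mentioned above, logging can be enabled with flags along these lines (a sketch only; the log file path is a placeholder, and the right form depends on your JDK version):

```shell
# JDK 8 and earlier:
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/gc.log"
# JDK 9+ unified logging equivalent:
JAVA_OPTS="$JAVA_OPTS -Xlog:gc*:file=/tmp/gc.log:time"
```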

I suspect that you are suffering from long GC runs (there are quite a few threads waiting for a monitor).
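As a quick way to quantify that from the dumps already attached in this thread, the thread states can be tallied with a one-liner (tdpod1.txt being the dump file name used earlier):

```shell
# Rough triage of a jstack/kill -3 thread dump: count threads per state.
# tdpod1.txt is the dump file attached earlier in this thread.
grep 'java.lang.Thread.State:' tdpod1.txt 2>/dev/null \
  | awk '{print $2}' \
  | sort | uniq -c | sort -rn
```

A large BLOCKED or WAITING count relative to RUNNABLE would support the monitor-contention suspicion.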
