Low CPU/RAM usage, but high latency under load?

ed.le...@googlemail.com

Aug 24, 2020, 4:48:40 AM
to Keycloak User
Hi,

Does anyone know what would cause Keycloak to exhibit very high response latencies under load (e.g. >10s at the 99th percentile), even though CPU/RAM usage on both the Keycloak and DB hosts is <10%?

Background:

We use the Codecentric Helm chart to deploy Keycloak in clustered mode with 2 Infinispan owners talking to a Google CloudSQL database. Each pod is allocated 325mCPU and 2Gi RAM. We have written our own Horizontal Pod Autoscaler that monitors CPU & RAM and confirmed that it creates and removes pods as needed. However, after a new game release last week, we noticed that under the resulting load spike, 99th-percentile response latency increased massively while CPU and RAM usage on every pod stayed below 25%. This prevented the autoscaler from detecting the increased load and spinning up more replicas, so I had to babysit it manually.

I know having just 1% of requests take a long time doesn't sound like a huge deal, but because we're also using AppEngine for other systems that interact with Keycloak, any request that takes >10s is automatically killed, preventing some users from logging in or registering.

I am planning to look into creating a custom scaling metric that uses the response latency, but I would love to know what I'm missing.

Could it be that I need to increase the connection pool size?

Thanks in advance,
Ed

ed.le...@googlemail.com

Aug 24, 2020, 4:59:59 AM
to Keycloak User
P.S: In case it's relevant, the same behaviour can be observed when we scale back the replica count. 

Eg: going from 7 replicas to 6:
CPU usage: ~0.026 
RAM usage:  840Mi
Network: 100ki
95th Percentile Latency: 12.2s
99th Percentile Latency: 27.3s

This lasts for about a minute while (I assume!) the Infinispan cache rebalances itself, then goes back down to <0.5s for both.

ed.le...@googlemail.com

Aug 25, 2020, 9:41:17 AM
to Keycloak User
Update: I think I've cracked it!

It turns out that the Codecentric Helm chart sets
javaToolOptions: -XX:+UseContainerSupport -XX:MaxRAMPercentage=50.0
which prevents Keycloak from using more than 50% of the RAM allocated to the container. This meant that all the replicas were memory-constrained and constantly garbage collecting, but could never hit the 80% threshold that the HPA needed to increase the replica count.
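
For anyone following along, the arithmetic behind the cap is simple. Here's a back-of-the-envelope sketch (numbers are illustrative, matching our 2Gi pods) of how -XX:MaxRAMPercentage turns container RAM into a heap ceiling:

```shell
# Illustrative only: how -XX:MaxRAMPercentage derives the heap cap from
# the container's RAM allocation. Values match the 2Gi pods described above.
container_ram_mib=2048
max_ram_percentage=50          # the chart's default
heap_cap_mib=$((container_ram_mib * max_ram_percentage / 100))
echo "Heap capped at ${heap_cap_mib} MiB of ${container_ram_mib} MiB"
```

So with the default, the JVM never sees more than half the pod's memory, and the HPA's 80% container-memory threshold is unreachable.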

Increasing the MaxRAMPercentage to 95.0 resolved our latency issues (<500ms 95th %ile, <4s 99th %ile). We haven't benchmarked properly, but 2 replicas are now comfortably handling 10 logins/s.

Phil Fleischer

Aug 26, 2020, 11:00:37 AM
to ed.le...@googlemail.com, Keycloak User
I definitely recommend monitoring the JVM heap, especially if you’re using embedded infinispan.

In addition to setting custom min and max heap sizes, we also tweaked the garbage collector (CMS, then G1), so you might want to get ahead of the curve there as well.


ed.le...@googlemail.com

Aug 28, 2020, 4:14:16 AM
to Keycloak User
Sorry, Philip, I'm not sure what CMS and G1 refer to. Could you elaborate, please?

Thanks :)

ed.le...@googlemail.com

Sep 1, 2020, 8:12:38 AM
to Keycloak User
Ok, I'm really struggling to make sense of this. Changing -XX:MaxRAMPercentage only seemed to work because the pods restarted and cleared their heaps. I still can't get the pods to make good use of their available RAM.

E.g. I have 2 pods, each with 2.260GiB RAM allocated via the K8s requests mechanism, but Java/Keycloak refuses to use more than 1.210GiB on each pod, even with -XX:MaxRAMPercentage=100 and -XX:InitialRAMPercentage=100. This completely scuppers any memory-based scaling, since usage never tops 50%.
[Attachment: Screenshot 2020-09-01 at 11.02.42.png]

After lowering the RAM allocated to each pod to 1.25GiB, the Max JVM memory usage is still 1.21GiB, implying that there is some hard limit preventing it from using more than that amount.
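
In case it helps anyone reproduce this, the kind of check I've been running to see which heap limit the JVM actually resolves looks like this (a sketch; the version below degrades gracefully when java isn't available on the PATH):

```shell
# Sketch: ask the JVM which heap limit it actually resolved. Inside the pod
# you'd run something like:
#   java -XX:+UseContainerSupport -XX:MaxRAMPercentage=100 \
#        -XX:+PrintFlagsFinal -version | grep -E 'MaxHeapSize|MaxRAMPercentage'
resolved=""
if command -v java >/dev/null 2>&1; then
  resolved=$(java -XX:+PrintFlagsFinal -version 2>/dev/null \
             | grep -E 'MaxHeapSize|MaxRAMPercentage' || true)
fi
[ -n "$resolved" ] || resolved="could not query JVM flags here; run the command inside the Keycloak container"
echo "$resolved"
```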

Mark Phippard

Sep 1, 2020, 4:17:43 PM
to Keycloak User
I have not gotten to the point of trying to tweak these settings, but a JVM app is always going to be limited by its max heap size; it does not matter how much RAM you give the pod. Unless Keycloak has a startup shell script that looks at the available RAM and sets the heap size dynamically, the pod allocation is irrelevant. Most JVM-based apps have some way to set an option like -Xmx2048m to whatever you need. Here is an article I found:


Mark

Phil Fleischer

Sep 1, 2020, 4:27:42 PM
to Mark Phippard, Keycloak User
Running out of heap space in the JVM will have an impact on CPU, as the garbage collector will run continuously, which in some cases will steal compute.

Definitely recommend monitoring within the heap with JMX or jstat (we use New Relic, but there are other tools). Embedded Infinispan will use a lot more old-gen memory for sessions (especially offline sessions), so as your user and session counts increase, your minimum required memory will increase. We try to always keep 50% of our maximum heap free, which is a lot, but our instance is pretty critical to our operation… if you're just using Keycloak for a couple of apps, you could probably get away with 75%.

Choosing a better garbage collector (that is, CMS or G1) will keep collections from impacting CPU usage and transactions as the heap becomes full. With Infinispan, we've seen that errors and delays can occur during collections with the standard collectors (sometimes referred to as "stop the world" pauses).



ed.le...@googlemail.com

Sep 2, 2020, 7:06:21 AM
to Keycloak User
Thank you both for your replies.

Regarding allocating more space to the heap, my impression was that -XX:+UseContainerSupport combined with -XX:MaxRAMPercentage=100 and -XX:MinRAMPercentage=100 would force the JVM to use all of the pod's available RAM. I also tried setting -Xms2048m and -Xmx2560m. Neither set of options increased the RAM used by the JVM from the value seen above (1.21GiB).

On the subject of garbage collection, the telltale sawtooth memory-usage graph suggests that the garbage collector is active, but even with only 325mCPU allocated to each pod, CPU usage is <5% and only ever spikes when changing the number of replicas (presumably due to Infinispan syncing/rebalancing).

The whole thing has me pretty stumped, to be honest.

Mark Phippard

Sep 2, 2020, 7:27:43 AM
to Keycloak User
On Wednesday, September 2, 2020 at 7:06:21 AM UTC-4 ed.le wrote:
Thank you both for your replies.

Regarding allocating more space to the heap, my impression was that -XX:+UseContainerSupport combined with -XX:MaxRAMPercentage=100 and -XX:MinRAMPercentage=100 would force the JVM to use all of the pod's available RAM. I also tried setting -Xms2048m and -Xmx2560m. Neither set of options increased the RAM used by the JVM from the value seen above (1.21GiB).

It just shows my Java age that these new JVM options flew right by me. I am not familiar with them, but I guess they must be related to the Docker improvements added in recent versions of Java? It seems like they should work. I would probably head to the JVM docs before looking too hard at the Java app. The JVM typically makes its own decisions about memory management; maybe it tries to be conservative before allocating all the available memory to the heap? Just speculating. I hope you will post back here if you find an answer; this thread has already been helpful in reminding me of these new JVM options.

Mark

Phil Fleischer

Sep 2, 2020, 7:58:19 AM
to ed.le...@googlemail.com, Keycloak User
+1 on the container options; we're not using k8s but wish we were.

My guess is that your start script isn't really picking up your options.

Maybe try to run some remote commands on the container to be sure… other than that, this stumps me as well.


Schuster Sebastian (IOC/PDL22)

Sep 2, 2020, 12:02:06 PM
to Phil Fleischer, ed.le...@googlemail.com, Keycloak User

That would also be my guess. JAVA_OPTS is not picked up by the JVM directly; it is just passed as a command-line parameter when the JVM is started in docker-entrypoint.sh. You need to be sure your additions to JAVA_OPTS are really in the environment of the process. In our case, we modified the Docker image to do a

“source our_script_where_java_opts_get_modified.sh” instead of it being run in a sub-shell with a different environment.

 

More specifically, we are doing this in our derived Keycloak image:

 

# patch autorun script so it sources our startup script instead of running it in another bash

# this way we can add environment variables

sed -i -e 's/^\([ ]*\)\("$f"\)$/\1. \2/g' /opt/jboss/tools/autorun.sh

sed -i -e 's/\/opt\/jboss\/tools\/autorun.sh/. &/g' /opt/jboss/tools/docker-entrypoint.sh

 

I know it’s a dirty hack, but it works. ;)
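
To check whether the options actually made it into the running process, you can also inspect its environment. A sketch (the pod name in the comment is hypothetical, and the demo below uses a synthetic environ string rather than a live process):

```shell
# /proc/<pid>/environ is NUL-separated; this helper reformats it and picks
# out the Java option variables. Inside the pod you'd run something like:
#   kubectl exec <keycloak-pod> -- sh -c "tr '\0' '\n' < /proc/1/environ | grep JAVA"
show_java_env() {
  tr '\0' '\n' | grep -E 'JAVA_(TOOL_)?OPT(ION)?S'
}
# Demo with a synthetic environ dump:
found=$(printf 'PATH=/usr/bin\0JAVA_TOOL_OPTIONS=-XX:MaxRAMPercentage=95.0\0' | show_java_env)
echo "$found"
```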

 

Best regards,

Sebastian


Mark Phippard

Sep 2, 2020, 12:16:42 PM
to Keycloak User
It would be a nice enhancement to the Keycloak Server Info page if it would display more of the JVM options. That would make this sort of troubleshooting a lot easier.

Mark

Ed Lepedus

Jul 30, 2021, 5:11:33 PM
to Keycloak User
Apologies for reviving this old thread, but we finally figured it out, and I thought I'd post an update in case someone else has similar issues.

It turns out the issue wasn't Keycloak itself. We have a registration system that calls out to Keycloak via the admin API. The original developer of this system didn't notice that a URL to the newly-created user is returned in the Location header of the register response, so it would issue a call to /users?em...@address.com to find the newly created user. By default, this search endpoint has a limit of 100 and would perform a full table scan on 1.7M rows, pinning our database at 100% for tens of seconds at a time. Fixing this reduced DB load 100x and brought latency down to <1s, even with 10x the request volume.
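
For anyone with the same pattern: the fix was to read the Location header from the create-user response instead of searching afterwards. A rough sketch (the URL below is made up for illustration):

```shell
# Keycloak's admin API returns the new user's URL in the Location header of
# the POST /admin/realms/{realm}/users response. The ID is the last path
# segment, so no follow-up /users?search=... call is needed.
# Header value below is illustrative.
location="https://kc.example.com/auth/admin/realms/myrealm/users/8f7a1c2e-0000-4000-8000-000000000000"
user_id="${location##*/}"
echo "new user id: ${user_id}"
```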

TL;DR: Check your DB queries; don't use the search endpoint without understanding its performance implications.

[Attachment: Screenshot 2021-07-30 220824.png]
