Database and scalability issues

386 views
Skip to first unread message

Julien.Pliss...@cdiscount.com

unread,
Feb 25, 2021, 12:59:26 PM2/25/21
to keyclo...@googlegroups.com
Hello,

We planned to bench a HA keycloak setup (dual-site clusters with
replication) with around 100k test clients and as many test users but
we ran into severe issues just creating the test clients and users, so
we had to scale both counts down to 1k. On a sandbox cluster we had to
stop client creation just short of 10k as it impacted the availability
of the users REST API on all realms. We are still on Keycloak 8.0.2 for
these tests, we plan to upgrade to the latest 12.x in the following
weeks.

For the context, our target setup would have to manage in the tens to
hundreds millions of users, and in the tens to hundreds thousands of
clients.

A few JIRA issues around the subject were closed due to lack of
progress or update: KEYCLOAK-8275, KEYCLOAK-8283, KEYCLOAK-8589 (esp.
subtask KEYCLOAK-8271). This one is still open though: KEYCLOAK-14329.

Upon investigation, I discovered that:
- a single POST to create a client generates 73 SQL queries
- those queries are all run synchronously / sequentially and thus are
all on the critical path
- on our site that is remote to the active database instance there is a
8 ms penalty on each query, that's 584 ms alone due to distance and
number of queries
- most of those queries are not strictly necessary: some retrieve
result sets that are definitely supposed to be empty, some are
duplicate, some undo work done by previous queries
- there is a select on USER_ENTITY that is duplicated 4 times (exact
same query and input parameters with no other queries in between) and
apparently lacks proper indexation
- one of the queries in the way reloads the whole list of clients for
the realm and processes the entire result set
- batch operations are not used though that would greatly help to
reduce the amount of queries.

On the user creation POST I counted 18 SQL queries (10 selects, 6
inserts, 1 delete, 1 update) where 3 inserts would be enough.

It was not clear from the traces whether the JDBC prepared statements
cache is actually used, though that is probably necessary to reduce the
pressure on the SQL server. It was also not clear to me whether SQL
transactions are actually used, it looks like queries fall back on
autocommit, but maybe I missed something.

Exporting our sandbox realm using keycloak.migration.action with a bit
less than 9.7k clients and users took 64 minutes on the site that is
local to the active database instance, and about 60 minutes were spent
in a single transaction for the clients. That's about 375 ms per
client. Users were exported in about 45 sec, at a rate of about 216
users/sec (batch size 100) on a VM setup with limited CPU resources.
That's 2 orders of magnitude better.

I tried to remove the test realm using the admin UI but that resulted
in a timeout, forcing my session to close, and then I could not login
again (timeout after entering the password). I waited a bit more before
forcing a reload of the service. Then I cleaned up the JSONs and tried
the import with OVERWRITE_EXISTING but it failed because realms were
already existing. Ended up extracting client UUIDs from the exported
JSON and running a script to remove them individually, expecting that
to take maybe the whole night.

Of course, these performance issues are fairly disappointing and
worrying.

I've read some comments on this list or Jira issues about improvements
to come on the storage layer, is there some roadmap or ongoing
discussion about the current plans and priorities?


--
Julien Plissonneau Duquène

Reply all
Reply to author
Forward
0 new messages