Hey,
We have a similar workload (~3M offline sessions / 6 months inactivity) and the most important setting is to use G1GC; otherwise the GC will kill your cluster quite easily.
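For anyone who wants to double-check which collector a node is really running with, here is a minimal sketch using only the standard JMX beans. `-XX:+UseG1GC` is the usual HotSpot switch for G1; where you set it (for example in `JAVA_OPTS`) depends on your setup and is an assumption here. The class `GcCheck` is a hypothetical helper and not part of Keycloak or Infinispan.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Keycloak or Infinispan.
final class GcCheck {

    // Returns the names of the collectors the running JVM reports via JMX.
    static List<String> activeCollectors() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    // G1 registers collectors whose names start with "G1",
    // e.g. "G1 Young Generation" and "G1 Old Generation".
    static boolean usesG1(List<String> collectorNames) {
        for (String name : collectorNames) {
            if (name.startsWith("G1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = activeCollectors();
        System.out.println("Collectors: " + names);
        System.out.println(usesG1(names) ? "G1 is active" : "G1 is NOT active");
    }
}
```

Running it once with the same JVM options as the server (e.g. `java -XX:+UseG1GC GcCheck.java` on a JDK that supports single-file launch) should show whether the flag actually took effect.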
We’re running it as one single Infinispan cluster, but we’ve split the workloads. Only a few selected nodes are behind the load balancer, and these nodes hold no data. So there are some nodes that hold all the Infinispan data and a few nodes that only serve frontend requests. This also makes it easier to deploy some of our plugins, as we basically only need to deploy them on these “frontend” nodes. This might not be strictly necessary, but it gives us a better feeling and better control over what’s happening in the cluster.
Unfortunately, we’ve recently also run into timeouts, as described by Phil. We don’t know yet what causes them, but they are concerning and seem to happen at random. If we replace a node, it either joins successfully or it doesn’t, and then we suddenly have a single node without any cluster knowledge. Fortunately this has only happened to our “backend” nodes so far and customer impact is limited, though sometimes the cluster gets “unstable” and takes a while to settle before the error rate is back to 0.
Fortunately these issues have only been transient so far and didn’t require a full cluster restart (which is another pain point with that many sessions, as a cold start takes ~15 minutes). It is still very painful that whenever there’s a major version bump, you need to cold start the cluster, which takes quite some time for the database migrations and then the aforementioned 15 minutes to fetch all sessions from the database. Even when all of this works successfully, we’ve had a 50% failure rate during startup, as sometimes it just suddenly decides to time out on Infinispan requests and you have to try again...
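To put the cold start into perspective, here is a rough back-of-the-envelope sketch based on the numbers above (~3M sessions fetched in ~15 minutes). It assumes the load time scales linearly with the session count, which is only an assumption and not something we measured; `ColdStartEstimate` is a hypothetical helper.

```java
// Hypothetical helper for illustration; assumes load time grows linearly with session count.
final class ColdStartEstimate {

    // Average preload rate in sessions per second.
    static double sessionsPerSecond(long sessions, long minutes) {
        return sessions / (minutes * 60.0);
    }

    // Estimated minutes to preload a given number of sessions at a fixed rate.
    static double minutesFor(long sessions, double ratePerSecond) {
        return sessions / ratePerSecond / 60.0;
    }

    public static void main(String[] args) {
        // ~3M sessions in ~15 minutes, as reported above.
        double rate = sessionsPerSecond(3_000_000L, 15);
        System.out.printf("Preload rate: ~%.0f sessions/s%n", rate);
        // Under the linear assumption, doubling the sessions doubles the wait.
        System.out.printf("6M sessions: ~%.0f minutes%n", minutesFor(6_000_000L, rate));
    }
}
```

That works out to roughly 3,300 sessions per second, so every additional million sessions would add about five minutes to a cold start if the rate stays constant.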
FTR: We’re still on 9.0, as we’re running redhat-sso and not vanilla Keycloak.
We have mostly debugged these issues on our own, as Red Hat support requires quite a lot of data, and collecting that data usually helps us find the cause ourselves. They were very helpful, though, with a recent JGroups issue which caused some Infinispan issues as well.
Cheers,
Christian