Re: [cas-user] very slow ticket delivery on CAS 6.6 & redis ticket registry


Jérôme LELEU

Oct 28, 2022, 8:30:55 AM
to CAS Developer
Hi,

Moving the discussion to the dev mailing list.

I think the userId suffix was added to improve performance when searching for user sessions.
Instead of scanning all existing keys to check whether each one is a TGT of the user, the idea was to scan only the tickets created for that user.

Regarding the solutions, I don't think we should remove the userId suffix, as we would lose that improvement.
I think a new cache would make sense to improve performance, but I would store it directly in Redis to make things easier.
In fact, we would have a double-key indexing, maybe something like: CAS_TICKET_USER:ticketId:userId => (ticket) + CAS_TICKET:ticketId => (CAS_TICKET_USER:ticketId:userId)
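
Roughly, with that layout, reading a ticket by its id becomes two direct GETs instead of a SCAN. A minimal sketch (illustrative only, assuming the registry's existing redisTemplate and decodeTicket helpers):

public Ticket getTicket(final String ticketId) {
    // CAS_TICKET:ticketId => "CAS_TICKET_USER:ticketId:userId" (a pointer to the real key)
    val userKey = (String) redisTemplate.boundValueOps("CAS_TICKET:" + ticketId).get();
    if (userKey == null) {
        return null;
    }
    // CAS_TICKET_USER:ticketId:userId => serialized ticket
    return decodeTicket(redisTemplate.boundValueOps(userKey).get());
}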

@Misagh: what do you think of this problem and solutions?

Thanks.
Best regards,
Jérôme


On Fri, Oct 28, 2022 at 2:05 PM, Pascal Rigaux <pascal...@univ-paris1.fr> wrote:
Solutions I can think of:

- add a memory cache [ ticketId => redisKey ] (a rough sketch follows below)
   (it should help a lot, even if it will still be slower than before in case of load balancing)

- revert suffixing redis key with userid
   (easy change in RedisTicketRegistry.java)

   - and possibly add userid suffix in a UniqueTicketIdGenerator, the way HostNameBasedUniqueTicketIdGenerator suffixes with hostname
     (but it may be hard to do...)
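
A rough sketch of the first option (illustrative names, not actual CAS code; it assumes a Caffeine cache plus the registry's existing redisTemplate, getKeysStream and decodeTicket helpers):

private final Cache<String, String> ticketIdToRedisKey = Caffeine.newBuilder()
    .maximumSize(50_000)
    .expireAfterWrite(Duration.ofMinutes(10))
    .build();

public Ticket getTicket(final String ticketId) {
    val knownKey = ticketIdToRedisKey.getIfPresent(ticketId);
    if (knownKey != null) {
        val cached = redisTemplate.boundValueOps(knownKey).get();
        if (cached != null) {
            return decodeTicket(cached);              // fast path: direct GET, no SCAN
        }
        ticketIdToRedisKey.invalidate(ticketId);      // stale entry: ticket expired or was deleted
    }
    // slow path: SCAN for the userid-suffixed key, then remember it for next time
    return getKeysStream("CAS_TICKET:" + ticketId + ":*")
        .peek(key -> ticketIdToRedisKey.put(ticketId, key))
        .map(key -> redisTemplate.boundValueOps(key).get())
        .filter(Objects::nonNull)
        .map(this::decodeTicket)
        .findFirst()
        .orElse(null);
}

As said above, this only helps a node that has already seen the ticket; behind a load balancer, the first hit on each node still pays the SCAN cost.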

cu

On 28/10/2022 11:13, Jérôme LELEU wrote:
> Hi,
>
> Thanks for raising the point.
>
> It's always hard to find a good balance between a generic design and performance.
>
> It seems to me that performing scans to get a ticket is not the best thing to do in terms of performance.
>
> The Redis ticket registry is commonly used and we should try to avoid any performance degradation.
>
> I have a few ideas in mind, but I'm not a Redis specialist: what do you propose?
>
> Thanks.
> Best regards,
> Jérôme
>
>
> On Thu, Oct 27, 2022 at 7:59 PM, Pascal Rigaux <pascal...@univ-paris1.fr> wrote:
>
>     Hi,
>
>     In 6.6.x Redis ticket registry key is suffixed with userid (since 6.6.0-RC4)
>
>     This is great to know who owns a TGT or a ST.
>
>     Alas, this means getting a TGT from Redis now requires a "SCAN"... which is much more costly.
>     Example: a full "SCAN" is ~100 times slower than a "GET" on our production Redis (dbsize ~100k, because we have 1-month rememberMe TGTs)
>
>
>     For the record, getting a ST triggers
>     - on 5.3 : 8 redis "GET" on the TGT
>     - on 6.5 : 17 redis "GET" on the TGT
>     - on 6.6 : 15 redis "SCAN" + "GET" on the TGT on a small redis db
>
>
>
>     PS: "cas.ticket.registry.core.enable-locking=false" fails on redis ticket registry with error
>       > Could not find a destroy method named 'destroy' on bean with name 'casTicketRegistryRedisLockRegistry'


Misagh

Nov 4, 2022, 1:57:52 PM
to CAS Developer
I will take a second look to see if this can be improved, though I am
intrigued by your "double key indexing" solution. Do you think you
could put together some sort of POC to demonstrate this idea? The
other solutions are non-starters. I'll also poke around to see what
can be done to speed things up.

Jérôme LELEU

Nov 7, 2022, 1:35:32 AM
to Misagh, CAS Developer
Hi,

I think we were in very good shape on 6.5.x: only GETs for common operations.

Our main problem was the full SCAN of all tickets to check whether a ticket is a TGT and whether the principal is the right one ("count/get SSO sessions").
For this, I would create a "double key indexing", big words for a simple thing ;-)
On 6.5.x, we stored tickets this way: key=CAS_TICKET:ticketId => VALUE=serialized ticket
I propose that the "add ticket" operation check whether the ticket is a TGT: in that case, I would also add to Redis: key=CAS_TICKET_USER:ticketId:userId => VALUE=nothing.
This way, we would "only" SCAN the keys of TGT sessions to find the right user (CAS_TICKET_USER:*:userId) and retrieve all the TGT identifiers for a given user.
Then, a multi-GET on these identifiers would find the SSO sessions of the user.

The new key should be updated and deleted when the related TGT is updated or deleted.
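
In code, it could look roughly like this (just a sketch with illustrative names, not a real implementation; it assumes a RedisTemplate<String, Object> plus the registry's existing encode/decode/getKeysStream helpers, and it leaves out TTLs and the update/delete paths mentioned above):

private static final String TICKET_PREFIX = "CAS_TICKET:";
private static final String USER_INDEX_PREFIX = "CAS_TICKET_USER:";

public void addTicket(final Ticket ticket) {
    redisTemplate.boundValueOps(TICKET_PREFIX + ticket.getId()).set(encodeTicket(ticket));
    if (ticket instanceof TicketGrantingTicket) {
        val tgt = (TicketGrantingTicket) ticket;
        val userId = tgt.getAuthentication().getPrincipal().getId();
        // empty marker value: the key alone carries the ticketId and the userId
        redisTemplate.boundValueOps(USER_INDEX_PREFIX + tgt.getId() + ":" + userId).set("");
    }
}

public Collection<Ticket> getSsoSessionsFor(final String userId) {
    // SCAN only the per-user index keys, then multi-GET the real TGTs
    val ticketKeys = getKeysStream(USER_INDEX_PREFIX + "*:" + userId)
        .map(indexKey -> TICKET_PREFIX + indexKey.split(":")[1])   // [1] is the ticketId (no ':' expected inside it)
        .collect(Collectors.toList());
    return redisTemplate.opsForValue().multiGet(ticketKeys).stream()
        .filter(Objects::nonNull)
        .map(this::decodeTicket)
        .collect(Collectors.toList());
}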

Thanks.
Best regards,
Jérôme



Misagh

Nov 7, 2022, 12:30:36 PM
to Jérôme LELEU, CAS Developer
> Our main problem was the full SCAN of all tickets to check whether a ticket is a TGT and whether the principal is the right one ("count/get SSO sessions").
> For this, I would create a "double key indexing", big words for a simple thing ;-)
> On 6.5.x, we stored tickets this way: key=CAS_TICKET:ticketId => VALUE=serialized ticket
> I propose that the "add ticket" operation check whether the ticket is a TGT: in that case, I would also add to Redis: key=CAS_TICKET_USER:ticketId:userId => VALUE=nothing.
> This way, we would "only" SCAN the keys of TGT sessions to find the right user (CAS_TICKET_USER:*:userId) and retrieve all the TGT identifiers for a given user.
> Then, a multi-GET on these identifiers would find the SSO sessions of the user.

That's quite clever. But it's not without complications. These are not
strictly blockers, but we should likely take these into account:

Doing the double-indexing for a TGT would also imply that the same
thing would/could be done for OIDC codes, access tokens and refresh
tokens. For example, think of operations such as "get me all the
access tokens issued to user X", or "all the refresh tokens issued to
user Y", etc. This would mean the registry would somehow have to be
tied to the modules that present those extra ticket types, though I
imagine this can be somewhat solved with the ticket catalog concept.
And of course, the registry size grows: 10 TGTs for 10 unique users
would actually mean 20 entries, not to mention that every
update/remove operation would issue double queries. So it's double the
index and double the number of operations. At scale, I am not so sure
this would actually be all that better, but I have not run any
conclusive tests.

I would also be interested in seeing an actual test that showcases the
slowness. For example, I ran a test against a basic local Redis
cluster, 7.0.5. My test added 1000 TGTs to the registry, fetched them
all, and then looped through the result set, asking the registry again
for each ticket that was fetched. This operation completed in
approximately 13 seconds for non-encrypted tickets and 14 seconds for
encrypted tickets. Then I reverted the key pattern back to what 6.5
used to do, ran the same test, and more or less saw the same execution
time. I double-checked to make sure there are no obvious mistakes. Is
this also what you see? And if so, can you share some sort of test
that actually demonstrates the problem?
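
For reference, the loop was shaped roughly like this (a pseudo-ish sketch; newTestTgt() is just a placeholder for however the test TGTs get built, and ticketRegistry is the Redis registry under test):

public void runLookupBenchmark() throws Exception {
    val ids = new ArrayList<String>();
    for (int i = 0; i < 1_000; i++) {
        val tgt = newTestTgt(i);                 // placeholder: build a TGT for a unique test user
        ticketRegistry.addTicket(tgt);
        ids.add(tgt.getId());
    }
    val start = System.currentTimeMillis();
    val all = ticketRegistry.getTickets();       // fetch everything once
    for (val id : ids) {
        ticketRegistry.getTicket(id);            // then ask the registry for each ticket again
    }
    LOGGER.info("Fetched [{}] tickets and re-read [{}] of them in [{}] ms",
        all.size(), ids.size(), System.currentTimeMillis() - start);
}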

Jérôme LELEU

Nov 8, 2022, 1:57:11 AM
to Misagh, CAS Developer
Hi,

Yes, double indexing is harder than simple indexing, as the second operation may fail and you would then need to revert the first one (the transactional aspect).
If we did that for all tickets, we would double the number of keys, but not the size of the database.
And maybe we should use two different databases for better performance.
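
For the transactional aspect, one option would be to wrap the two writes in a Redis MULTI/EXEC so they succeed or fail together. A minimal sketch (illustrative names only; assumes a Spring Data RedisTemplate<String, Object>):

redisTemplate.execute(new SessionCallback<List<Object>>() {
    @Override
    @SuppressWarnings("unchecked")
    public <K, V> List<Object> execute(final RedisOperations<K, V> operations) {
        val ops = (RedisOperations<String, Object>) operations;
        ops.multi();
        ops.opsForValue().set("CAS_TICKET:" + ticketId, encodedTicket);
        ops.opsForValue().set("CAS_TICKET_USER:" + ticketId + ":" + userId, "");
        return ops.exec();   // both SETs are applied together, or not at all
    }
});

That would avoid having to manually revert the first write when the second one fails.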

I will make a test to check the problem.

Thanks.
Best regards,
Jérôme

Jérôme LELEU

Nov 9, 2022, 5:39:18 AM
to Misagh, CAS Developer
Hi,

I have run a Redis performance test comparing v6.5.9 and v7.0.0-RC1, and the figures are quite telling.

In both cases, I have a CAS server using a Redis ticket registry (Docker image) and a custom authentication handler that accepts any username starting with "jleleu".
I have a script which performs a certain number of logins (without any service) for jleleuRANDOMVALUE.
I have overridden the RedisTicketRegistry class to add time counting in v6.5:

// timing instrumentation: accumulate per-call latency and log a running average
private static AtomicLong getTime = new AtomicLong();
private static AtomicInteger nbGet = new AtomicInteger();

@Override
public Ticket getTicket(final String ticketId, final Predicate<Ticket> predicate) {
    val t0 = System.currentTimeMillis();
    try {
        // in 6.5 the Redis key is fully derived from the ticketId, so a single direct GET is enough
        val redisKey = getTicketRedisKey(encodeTicketId(ticketId));
        val t = this.client.boundValueOps(redisKey).get();
        if (t != null) {
            val result = decodeTicket(t);
            if (predicate.test(result)) {
                return result;
            }
            LOGGER.trace("The condition enforced by [{}] cannot successfully accept/test the ticket id [{}]",
                ticketId, predicate.getClass().getSimpleName());
            return null;
        }
    } catch (final Exception e) {
        LOGGER.error("Failed fetching [{}]", ticketId);
        LoggingUtils.error(LOGGER, e);
    } finally {
        val t1 = System.currentTimeMillis();
        val time = t1 - t0;
        val t = getTime.addAndGet(time);
        val n = nbGet.incrementAndGet();
        LOGGER.info("### GET time: {} ms | Average time: {} ms", time, t / n);
    }
    return null;
}

And in v7:

@Override
public Ticket getTicket(final String ticketId, final Predicate<Ticket> predicate) {
    val t0 = System.currentTimeMillis();
    try {
        // in 6.6/7 the key carries a userId suffix, so the lookup goes through a key-pattern SCAN
        val redisKey = RedisCompositeKey.builder().id(encodeTicketId(ticketId)).build().toKeyPattern();
        return getKeysStream(redisKey)
            .map(key -> redisTemplate.boundValueOps(key).get())
            .filter(Objects::nonNull)
            .map(this::decodeTicket)
            .filter(predicate)
            .findFirst()
            .orElse(null);
    } catch (final Exception e) {
        LOGGER.error("Failed fetching [{}]", ticketId);
        LoggingUtils.error(LOGGER, e);
    } finally {
        // same timing instrumentation as the 6.5 version
        val t1 = System.currentTimeMillis();
        val time = t1 - t0;
        val t = getTime.addAndGet(time);
        val n = nbGet.incrementAndGet();
        LOGGER.info("### GET time: {} ms | Average time: {} ms", time, t / n);
    }
    return null;
}

Then I performed 1,000 and 10,000 logins (with my script) and checked my logs for the average time:

v6.5.9:
1000 logins -> Average time: 3 ms
10000 logins -> Average time: 3 ms

v7.0.0-RC1:
1000 logins -> Average time: 22 ms
10000 logins -> Average time: 195 ms

So indeed, I notice a big performance issue.

Do you need more information?

Thanks.
Best regards,
Jérôme

Misagh

Nov 10, 2022, 4:04:05 AM
to CAS Developer
Thank you Jérôme. I'll take a look. 

-- Misagh

Misagh

Nov 10, 2022, 11:27:30 AM
to CAS Developer

Jérôme LELEU

Nov 10, 2022, 11:47:52 AM
to Misagh, CAS Developer
Sure. I will test that on Monday (tomorrow is a day off in France :-)


Jérôme LELEU

Nov 14, 2022, 7:58:28 AM
to Misagh, CAS Developer
Hi,

I have made new tests.

With the new implementation, I experienced some Redis crashes, but I'm not sure this is meaningful.
In any case, I have upgraded to Redis v7 with 500 MB of memory.

I have launched my previous scenario again (10 000 logins).
CAS v6.5: Average time: 2 ms
CAS v7.0.0 fix REDIS: Average time: 0 ms

Things are now blazing fast with the new implementation, but I see you have added a memory cache, so this is expected on a single node.

So I have created a two-node scenario with 10,000 login?service + service ticket validation flows, each call (GET /login, POST /login, GET /serviceValidate) being performed on a different node than the previous one (round robin).

CAS v6.5 :
Average time node 1: 1 ms
Average time node 2: 1 ms

CAS v7.0.0 fix REDIS :
Average time node 1: 2 ms
Average time node 2: 2 ms

While it performs better on CAS v6.5, it now performs very well on CAS v7 as well.

Did you change something else in addition to the cache?

Thanks.
Best regards,
Jérôme

Misagh

Nov 14, 2022, 10:54:44 PM
to CAS Developer
On Mon, Nov 14, 2022, 4:58 PM Jérôme LELEU <lel...@gmail.com> wrote:
> Hi,
>
> I have made new tests.
>
> With the new implementation, I experienced some Redis crashes, but I'm not sure this is meaningful.
> In any case, I have upgraded to Redis v7 with 500 MB of memory.

I ran into something similar. I think this is mainly due to the large number of operations and tickets, and to the fact that the Redis setup is not exactly tuned to handle the load.



> CAS v6.5 :
> Average time node 1: 1 ms
> Average time node 2: 1 ms
>
> CAS v7.0.0 fix REDIS :
> Average time node 1: 2 ms
> Average time node 2: 2 ms
>
> While it performs better on CAS v6.5, it now performs very well on CAS v7 as well.
>
> Did you change something else in addition to the cache?

Yes, I am experimenting with the ticket pattern lookup so that it does not use scanning. This seems to be good enough even without the cache. If you disable the cache altogether on a single node by forcing its capacity to 0 (i.e. never cache anything), you should see comparable performance numbers. This should fit the scope of 6.6, if we were to backport.

I'd like to keep the cache changes in master and continue testing. Cache invalidation can be very tricky here: updates and changes to a ticket on one node have to be correctly found and processed on another. Given that the current caching model is incredibly fast, I'd like to stick to this strategy and work out the other possible issues with clustered setups and event sinks. If I cannot make it work reliably, then I would consider either removing the cache or changing its structure. It would be slower than it is now, but still very, very fast.

And if this technique works out OK, it might be something to extend to other registry plugins as the need comes up.
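
Conceptually, the moving parts would look something like this (a rough sketch with illustrative names, not the actual CAS code; it assumes Caffeine for the local cache and Redis pub/sub for cross-node eviction):

// local second-level cache sitting in front of Redis
private final Cache<String, Ticket> ticketCache = Caffeine.newBuilder()
    .maximumSize(100_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .build();

// whenever this node deletes or updates a ticket, evict locally and tell the other nodes
private void publishInvalidation(final String ticketId) {
    ticketCache.invalidate(ticketId);
    redisTemplate.convertAndSend("cas-ticket-invalidation", ticketId);
}

// each node subscribes and evicts its own copy when another node changes a ticket
public RedisMessageListenerContainer invalidationListener(final RedisConnectionFactory factory) {
    val container = new RedisMessageListenerContainer();
    container.setConnectionFactory(factory);
    container.addMessageListener(
        (message, pattern) -> ticketCache.invalidate(new String(message.getBody(), StandardCharsets.UTF_8)),
        new ChannelTopic("cas-ticket-invalidation"));
    return container;
}

The tricky part is exactly that invalidation traffic: every update/remove has to reach every node before a stale copy gets served.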




Jérôme LELEU

Nov 15, 2022, 1:48:36 AM
to Misagh, CAS Developer
EXCELLENT!


leleuj

Nov 18, 2022, 12:33:34 PM
to CAS Developer, leleuj, CAS Developer, Misagh Moayyed
Hi,

Some follow-up on this matter.
I have run new performance tests between v6.6.2 and v6.6.3-SNAPSHOT to evaluate the backport from v7.

For 5000 logins and service ticket validations:
6.6.2 :
Average time: 221 ms
6.6.3-SNAPSHOT:
Average time: 4 ms

Performance is now very good for the upcoming 6.6.3 release.

Thanks.
Best regards,
Jérôme

Misagh

Nov 18, 2022, 12:54:35 PM
to CAS Developer
very nice!

Could I ask you to also verify master (or v7 RC2) with a clustered
setup, whenever you find the chance? I was able to make the cache work
across the cluster, and I think you should see the average time be
closer to 0.

Pascal Rigaux

Nov 18, 2022, 1:33:41 PM
to cas...@apereo.org
Hi,

Commit "support a second-level inmemory cache for redis" implements a memory cache [ ticketId => Ticket ]
=> nice speed-up since getting a new ST requires a lot of calls to "getTicket" to get the TGT (~8 calls on CAS 5.3, ~15 calls on CAS 6.5)
(hence the 0ms vs 2ms with one java node)

But I wonder what happens if you cache tickets in memory when using multiple java nodes.
Is the cache shared between nodes? Otherwise what happens if:

- get ST-1 on nodeA,
  logout on nodeB,
  get ST-2 on nodeA
  => still allowed since cached TGT is not invalidated!?

- get ST-1 on nodeA,
  get ST-2 on nodeB,
  get ST-3 on nodeA
  => the "services" attr of TGT will not have ST-2!? => SLO will be half broken

- get ST on nodeA,
  validate ST on nodeB,
  validate ST on nodeA
  => second validation allowed!?

I was afraid of these issues; that's why I suggested a simpler memory cache [ ticketId => redisKey ].
But hopefully I missed something :-)

cu

On 14/11/2022 13:58, Jérôme LELEU wrote:
> [...]
> I have launched my previous scenario again (10 000 logins).
> CAS v6.5: Average time: 2 ms
> CAS v7.0.0 fix REDIS: Average time: 0 ms
>
> Things are now blazing fast with the new implementation but I see you have added a memory cache so this is expected on a single node.
>
> So I have created a 2 nodes scenario with 10 000 login?service + service ticket validation, each call (GET /login, POST /login, GET /serviceValidate) being performed on a different node than the previous call (round robin).
>
> CAS v6.5 :
> Average time node 1: 1 ms
> Average time node 2: 1 ms
>
> CAS v7.0.0 fix REDIS :
> Average time node 1: 2 ms
> Average time node 2: 2 ms
>
> While it performs better on CAS v6.5, it now performs very well on CAS v7 as well.
> [...]

Pascal Rigaux

Nov 18, 2022, 2:35:19 PM
to cas...@apereo.org
NB: my mail was stuck somewhere for a few days. It is obsolete, since Misagh has implemented a Redis-pub/sub-synchronised memory cache for the tickets :-)
--
Pascal Rigaux

Expert in application development and deployment
DSIUN-PAS (Pôle Applications et Services numériques)
Université Paris 1 Panthéon-Sorbonne - Centre Pierre Mendès France (PMF)
B 04 08 - 90, rue de Tolbiac - 75634 PARIS CEDEX 13 - FRANCE
Tel: 01 44 07 86 59 - 06 74 55 57 67

Jérôme LELEU

Nov 21, 2022, 2:22:20 AM
to Misagh, CAS Developer
Hi,

I made a new test with 7.0.0-RC2 in a two-node environment with round robin.
Indeed, the average time is 0 ms. Performance is excellent!
Thanks.
Best regards,
Jérôme

