Weird 404 REST responses on concurrent ST generation requests

Y G

Aug 9, 2024, 5:21:49 PM

to CAS Community

Hello all,

I'm seeing a weird issue while doing light load testing against a CAS server with Hazelcast enabled.

The steps I took:

1. Get a TGT for one user (with an HTTP POST to /cas/v1/tickets with username and password, as explained here).

2. Using that one TGT, repeatedly call the ST generation REST endpoint (HTTP POST to /cas/v1/tickets/{{TGT}}?service={{MY-VALID-SERVICE-URL}}, as explained here) with a small HTTP load testing tool called oha (I couldn't make ab work), using these bash commands:

export TGT={{TGT-VALUE-FROM-FIRST-STEP}}

# -m POST     : send POST requests
# --insecure  : allow insecure SSL
# -H          : add a request header
# -n 1000     : total number of requests
# -q 75       : queries per second
# -c 500      : number of concurrent requests
./oha -m POST \
  --insecure \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Accept: text/plain" \
  -n 1000 \
  -q 75 \
  -c 500 \
  --latency-correction \
  "https://localhost:8443/cas/v1/tickets/$TGT?service=https://localhost:8443"
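For anyone who would rather reproduce this from Java than from oha, here is a minimal sketch of the same idea using the JDK's java.net.http client (Java 11+). The class and method names (ConcurrentStRequests, stUri, tally) are made up for illustration, the request count of 200 is arbitrary, there is no rate limiting like -q, and it assumes the JVM already trusts the server certificate, since there is no equivalent of --insecure here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper, not part of CAS or oha.
final class ConcurrentStRequests {

    // Builds the ST endpoint URI for one TGT; the service URL is percent-encoded.
    static URI stUri(String casBase, String tgt, String service) {
        String encoded = URLEncoder.encode(service, StandardCharsets.UTF_8);
        return URI.create(casBase + "/v1/tickets/" + tgt + "?service=" + encoded);
    }

    // Counts how many responses came back per HTTP status code.
    static Map<Integer, Integer> tally(List<Integer> statuses) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int status : statuses) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String tgt = System.getenv("TGT"); // same variable as in the bash commands; must be set
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(stUri("https://localhost:8443/cas", tgt, "https://localhost:8443"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "text/plain")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        // Fire all requests without waiting, so they hit the same TGT concurrently.
        List<CompletableFuture<Integer>> inFlight = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            inFlight.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                    .thenApply(HttpResponse::statusCode));
        }

        // join() rethrows connection failures; only completed responses are tallied.
        List<Integer> statuses = new ArrayList<>();
        for (CompletableFuture<Integer> future : inFlight) {
            statuses.add(future.join());
        }
        System.out.println(tally(statuses));
    }
}
```

The printed map groups responses by status code, which is enough to see whether any 404s show up among the 200s.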

After a few runs with 100% successful HTTP 200 responses, I started seeing HTTP 404 responses that return:

"TGT-1-********4ij6IiQ-myhostname could not be found or is considered invalid"

I checked the logs and only see INVALID_TICKET and REST_API_SERVICE_TICKET_FAILED audit entries; I couldn't get any more information even after setting the log level to `trace`.

I tried debugging this in the cas-overlay project. All I could find is that an InvalidTicketException is thrown, but I couldn't find where it was thrown, and I can't see the code beyond the CentralAuthenticationService.grantServiceTicket method (I couldn't find the implementation class of this; does anybody know it?).

I even tested this on a 3-node CAS cluster with embedded Hazelcast properly set up and got the same problem. I tried setting cas.ticket.registry.core.enable-locking to false (docs), and the only change I saw was that the 404s turned into timeouts (I did not configure a long timeout, but nonetheless).

I read about https://apereo.github.io/cas/6.6.x/ticketing/Ticket-Registry-Locking.html

but I need more knowledge about this topic, so here are my questions:

1. What's your opinion of this behaviour? Why do you think this happens? Is this issue really about registry locking?

2. For a memory-backed ticket registry, the docs say (and I checked, it works) that the default ticket registry implementation takes care of this, and it works on a single node under load with no problems. But what about a clustered, Hazelcast-backed ticket registry? Has it been implemented?

3. I tried debugging the cas-overlay project but could not properly walk the stack trace. Can anybody show me the implementation of this interface method: CentralAuthenticationService.grantServiceTicket? What does this code do, where does it live, and how can I debug it?

Thanks for your patience and have a nice weekend.

Yusuf

Ray Bon

Aug 9, 2024, 10:49:54 PM

to cas-...@apereo.org

Yusuf,

How long did the test run before the 404's?

Do you experience this when only one cas server is running?

Is that really 75 queries per second?

You might want to try the JMeter load tests included with the CAS project, https://github.com/apereo/cas/tree/6.6.x/etc/loadtests

This way you can have more than one user and service.

Ray

Y G

Aug 10, 2024, 5:24:41 AM

to CAS Community, Ray Bon

Hello Ray,

First of all, thank you for your response.

After sending this mail, I set up and ran the command with 200 concurrent requests at 75 queries per second with the tool I mentioned. The first few runs produced no 404s, but on the 3rd or 4th run 404s started appearing on the display and were reported at the end of the test. As for the queries-per-second setting of this tool, it shows the achieved queries-per-second figure when the run completes, so I can safely assume that it honours the setting (I can't say whether it actually reaches the given rate, though).

I chose oha for convenience and for its display of progress and results, but after your reply I will try JMeter as described in the docs. I think I'm testing a really niche case (using one user's TGT to send concurrent requests to the ST generation endpoint).

I'll check and get back on this.

Thanks and have a good day

YG

On Saturday, August 10, 2024 at 05:49:54 UTC+3, Ray Bon wrote:

Y G

Jun 7, 2025, 10:22:22 AM

to CAS Community, Y G, Ray Bon

Hello again,

I wanted to fill in anyone who is interested in this topic; here are my findings.

To recap, I tried to stress-test one instance of CAS 6.6.7 (in my setup, Hazelcast stores the tickets) and found that when a user makes concurrent ST generation requests with one TGT, some of those requests return HTTP 404 with "given TGT is not found" even though the TGT exists.

Here is the Apache JMeter test plan for this scenario (the first thread group gets a TGT, saves it to a parameter named TGT, and uses Groovy to copy it into a global variable named sharedTGT; the second thread group uses it to send concurrent ST generation HTTP requests):

https://yusufgunduz.tr/cas/ExampleConcurrentStGenerationTestPlan.jmx

https://yusufgunduz.tr/cas/ss1.png

https://yusufgunduz.tr/cas/ss2.png

At first I thought my testing was wrong, so I ran more standardized load tests on my local machine and looked at the results. I recently found out that this is not the case at all. The cause, as I now understand it, is what CAS does when generating STs, which is to update the TGT state (updating the last usage and expiration times, for example). Here are some trace logs from generating an ST:

2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <Before updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
2025-06-07T12:47:13.531128863Z Previous time used: [null]
2025-06-07T12:47:13.531137061Z Last time used: [2025-06-07T12:47:05.217906789Z]
2025-06-07T12:47:13.531142758Z Usage count: [0]>
2025-06-07T12:47:13.531148293Z 2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <After updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
2025-06-07T12:47:13.531153905Z Previous time used: [2025-06-07T12:47:05.217906789Z]
2025-06-07T12:47:13.531159254Z Last time used: [2025-06-07T12:47:13.530785797Z]
2025-06-07T12:47:13.531164391Z Usage count: [1]>
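To make the "before" and "after" values in that trace easier to follow, here is a simplified stand-in for the update. This is not the code of org.apereo.cas.ticket.AbstractTicket; the class TicketUsageSketch and its fields are hypothetical and only mirror the three values the log prints. The point is that the update is a read-modify-write of several fields, which is why concurrent callers have to be serialized somehow.

```java
import java.time.Instant;

// Hypothetical, simplified model of the ticket state shown in the trace log.
final class TicketUsageSketch {
    private Instant previousTimeUsed; // null until the ticket is used for the first time
    private Instant lastTimeUsed;
    private int usageCount;

    TicketUsageSketch(Instant creationTime) {
        this.lastTimeUsed = creationTime;
    }

    // Shifts "last" into "previous", stamps the new time and bumps the counter.
    // Three separate writes: two threads interleaving here could lose an update.
    void update(Instant now) {
        previousTimeUsed = lastTimeUsed;
        lastTimeUsed = now;
        usageCount++;
    }

    Instant previousTimeUsed() { return previousTimeUsed; }
    Instant lastTimeUsed() { return lastTimeUsed; }
    int usageCount() { return usageCount; }

    public static void main(String[] args) {
        // Timestamps copied from the trace log above.
        TicketUsageSketch tgt = new TicketUsageSketch(Instant.parse("2025-06-07T12:47:05.217906789Z"));
        tgt.update(Instant.parse("2025-06-07T12:47:13.530785797Z"));
        System.out.println(tgt.previousTimeUsed() + " / " + tgt.lastTimeUsed() + " / " + tgt.usageCount());
    }
}
```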

With concurrent ST requests for one TGT, this state update (and subsequent ones) is guarded by locks (a ReentrantLock in this case, which is fine within one JVM but not for distributed deployments; this is configurable, though). To guard concurrent TGT updates, CAS creates a pool of locks, and if a request cannot acquire its lock within a given time period (3s is the default), that request returns HTTP 404 "TGT not found".
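The "give up after a timeout and return nothing" behaviour can be shown with plain JDK classes. The sketch below is not CAS's lock repository; TimedLockSketch and secondCallerGetsResult are hypothetical names, it uses a single lock instead of a pool keyed by ticket id, and the timeouts are shortened so that it runs quickly.

```java
import java.util.Optional;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Illustration only: one JVM-local lock with a bounded wait.
final class TimedLockSketch {
    private final Lock lock = new ReentrantLock();
    private final long timeoutMillis;

    TimedLockSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the work under the lock, or returns Optional.empty() if the wait times out.
    <T> Optional<T> execute(Supplier<T> work) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return Optional.empty();
            }
            return Optional.ofNullable(work.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    // One thread holds the lock for holdMillis while a second caller tries to get in.
    static boolean secondCallerGetsResult(long holdMillis, long timeoutMillis) {
        TimedLockSketch repository = new TimedLockSketch(timeoutMillis);
        CountDownLatch holding = new CountDownLatch(1);
        Thread holder = new Thread(() -> repository.execute(() -> {
            holding.countDown(); // signal that the lock is now held
            try {
                Thread.sleep(holdMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "ST-1";
        }));
        holder.start();
        try {
            holding.await();
            boolean gotResult = repository.execute(() -> "ST-2").isPresent();
            holder.join();
            return gotResult;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondCallerGetsResult(400, 50));  // holder outlasts the timeout
        System.out.println(secondCallerGetsResult(20, 2000)); // timeout outlasts the holder
    }
}
```

With these numbers the first call should print false and the second true: the caller that times out gets an empty result even though nothing is wrong with the ticket itself, which is how a valid TGT can end up reported as "not found".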

For multi-node deployments, this per-JVM approach may not be a good idea for consistency (at least that is what my searches suggest), so I tried to override this behaviour with distributed locking, which Hazelcast already provides under the name FencedLock. After reading the CAS docs (my hosted copy), I implemented it like this in my overlay project:

package com.example.cas.config;

import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apereo.cas.util.lock.LockRepository;
import org.springframework.stereotype.Component;
import com.example.cas.rest.exception.TooManyStGenerationRequestsException;

@Slf4j
@RequiredArgsConstructor
@Component("hazelcastLockRepository")
public class HazelcastLockRepository implements LockRepository {

    private final HazelcastInstance hazelcastInstance;
    private static final long LOCK_TIMEOUT_SECONDS = 3;

    @Override
    public <T> Optional<T> execute(Object lockKey, Supplier<T> supplier) {
        // One cluster-wide FencedLock per lock key, taken from the CP subsystem.
        String key = lockKey.toString();
        FencedLock lock = hazelcastInstance.getCPSubsystem().getLock(key);
        boolean acquired = false;

        try {
            // Wait a bounded time for the lock instead of blocking indefinitely.
            acquired = lock.tryLock(LOCK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
            if (acquired) {
                LOGGER.debug("Acquired lock for key: {}", key);
                return Optional.ofNullable(supplier.get());
            } else {
                // Signal contention explicitly rather than returning Optional.empty().
                LOGGER.warn("Too many concurrent requests - lock timeout for key: {}", key);
                throw new TooManyStGenerationRequestsException("Too many concurrent requests for key: " + key);
            }
        } finally {
            // Only unlock what this call actually locked.
            if (acquired) {
                try {
                    lock.unlock();
                    LOGGER.debug("Released lock for key: {}", key);
                } catch (Exception e) {
                    LOGGER.warn("Failed to unlock key: {} - {}", key, e.getMessage(), e);
                }
            }
        }
    }
}

and registered it like this, as explained in the docs:

/**
 * During ST generation the TGT state is updated (for example, the TGT expiration time), and
 * CAS has a lock mechanism in place to prevent possible stale data on concurrent ST generation
 * requests. By default it uses {@link org.apereo.cas.util.lock.DefaultLockRepository} with
 * {@link org.springframework.integration.support.locks.DefaultLockRegistry}, which suits
 * single-node use. Under heavy load, when the 1024 configured
 * {@link java.util.concurrent.locks.ReentrantLock}s are exhausted, waiting on a locked TGT
 * returns empty ({@link java.util.Optional#empty}) instead of raising an error, and because
 * the TGT is then not found, the user gets a 404 "TGT not found" error.
 * Instead, I customized this here to use Hazelcast's FencedLock, which is already in use and
 * is better suited to distributed deployments (e.g. CAS running as multiple pods on k8s), so
 * that CAS uses it while generating STs and, when contention happens, returns HTTP 429
 * Too Many Requests to indicate that the user is sending ST generation requests too often.
 *
 * @param hazelcastInstance the Hazelcast instance that CAS configures internally.
 */
@Bean
public LockRepository casTicketRegistryLockRepository(HazelcastInstance hazelcastInstance) {
    return new HazelcastLockRepository(hazelcastInstance);
}

With this configuration, instead of per-JVM locks I can use cluster-wide distributed locks and throw a custom exception (TooManyStGenerationRequestsException) for concurrent requests that can't acquire the lock (for better consistency, configure the cas.ticket.registry.hazelcast.cluster.core.cp-member-count setting and use the CP subsystem on your embedded Hazelcast). On its own, this makes such a request return HTTP 500 (instead of the default 404). You can instead mirror the default implementation and return Optional.empty() rather than throwing, which results in HTTP 404 "TGT not found", or handle the exception as shown below.

I also updated the ST generation endpoint (in ServiceTicketResource) in my overlay project to catch this exception and return 429 Too Many Requests instead of 500, like this:

...

catch (final TooManyStGenerationRequestsException e) {
    return new ResponseEntity<>("Too many concurrent ST generation requests for: " + StringEscapeUtils.escapeHtml4(tgtId), HttpStatus.TOO_MANY_REQUESTS);
...
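The three outcomes described above (empty result, uncaught exception, caught exception) can be summarised in a small self-contained sketch. It does not use Spring or CAS classes: StStatusSketch, statusFor and the nested TooManyRequests exception are hypothetical stand-ins for the endpoint and for TooManyStGenerationRequestsException, and the numbers are the HTTP statuses mentioned in this thread.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for the endpoint's decision logic; not CAS code.
final class StStatusSketch {

    // Plays the role of TooManyStGenerationRequestsException in this sketch.
    static final class TooManyRequests extends RuntimeException {
        TooManyRequests(String message) {
            super(message);
        }
    }

    // Maps the outcome of a (locked) ST grant attempt to an HTTP status code.
    static int statusFor(Supplier<Optional<String>> grantAttempt) {
        try {
            Optional<String> serviceTicket = grantAttempt.get();
            return serviceTicket.isPresent()
                    ? 200   // ST issued
                    : 404;  // lock repository gave up silently: "TGT not found"
        } catch (TooManyRequests e) {
            return 429;     // contention reported explicitly and handled
        } catch (RuntimeException e) {
            return 500;     // any other unhandled failure
        }
    }

    public static void main(String[] args) {
        System.out.println(statusFor(() -> Optional.of("ST-1")));
        System.out.println(statusFor(Optional::empty));
        System.out.println(statusFor(() -> { throw new TooManyRequests("lock timeout"); }));
    }
}
```

The design choice is mostly about what the client learns: a 404 suggests the TGT is gone, while a 429 tells the client that the same request may succeed if it is retried.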

Here are screenshots of the failed ST generation requests from the same test plan as above:

https://yusufgunduz.tr/cas/ss3.png

https://yusufgunduz.tr/cas/ss4.png

In summary, my question was about the cause of these 404 responses on concurrent ST generation with the same TGT, and the answer, I think, is the way CAS updates the TGT state when generating STs and how it protects that update against concurrent requests.

For newer CAS versions, I hope that with Hazelcast ticketing enabled this kind of configuration can be auto-configured and used instead of the single-JVM protection, and that there will be more documentation on it.

As another hopeful suggestion, I hope CAS will be able to connect to an external Hazelcast cluster instead of auto-configuring an embedded one at startup, because of the memory that embedded Hazelcast consumes in production environments.

Thank you and have a nice day,

YG

On Saturday, August 10, 2024 at 12:24:41 UTC+3, Y G wrote:
