Hello again,
I wanted to follow up for anyone interested in this topic; here are my findings.
To recap, I stress-tested a single instance of CAS 6.6.7 (in my setup, Hazelcast stores the tickets) and found that when a user makes concurrent ST generation requests with one TGT, some of those requests return HTTP 404 with "given TGT is not found" even though the TGT exists.
Here's the Apache JMeter test case file for this scenario (the first thread group gets a TGT, saves it to a parameter named TGT, and uses Groovy to copy it into a global variable named sharedTGT; the second thread group uses that variable to send concurrent ST generation HTTP requests):
At first I thought my testing was wrong, so I ran more standardized load tests on my local machine and looked at the results. I recently found out that this is not the case at all. The reason, as I later saw, is what CAS does when generating STs: it updates the TGT state (the last-used and expiration times, for example). Here are some trace logs from generating an ST:
2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <Before updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
2025-06-07T12:47:13.531128863Z Previous time used: [null]
2025-06-07T12:47:13.531137061Z Last time used: [2025-06-07T12:47:05.217906789Z]
2025-06-07T12:47:13.531142758Z Usage count: [0]>
2025-06-07T12:47:13.531148293Z 2025-06-07 12:47:13,530 TRACE [org.apereo.cas.ticket.AbstractTicket] - <After updating ticket [TGT-1-G4YHsg40VoltRu4EIbo8Iu7oq-O9Qjjvo7u2GvtrQqiG--fYm1BDcto5avX7ySkh0ew-cas1]
2025-06-07T12:47:13.531153905Z Previous time used: [2025-06-07T12:47:05.217906789Z]
2025-06-07T12:47:13.531159254Z Last time used: [2025-06-07T12:47:13.530785797Z]
2025-06-07T12:47:13.531164391Z Usage count: [1]>
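To make the log above concrete, here is a minimal sketch of the kind of state update it shows. The class and field names are my own stand-ins (this is not the CAS AbstractTicket source); it only mirrors the three values printed in the trace. Each write depends on a value read just before it, so two unsynchronized concurrent calls could both read the same old values and lose an update, which is presumably why CAS guards this step.

```java
import java.time.Instant;

/** Hypothetical stand-in for the TGT fields seen in the trace log; not CAS code. */
class TicketState {
    Instant previousTimeUsed;   // null until the first update, as in the "Before" log entry
    Instant lastTimeUsed;
    int usageCount;

    TicketState(Instant createdAt) {
        this.lastTimeUsed = createdAt;
    }

    /** Read-modify-write: shifts the last-used time back and bumps the counter. */
    void update(Instant now) {
        previousTimeUsed = lastTimeUsed;
        lastTimeUsed = now;
        usageCount++;
    }
}
```

Calling update once on a fresh ticket reproduces the "Before"/"After" pair in the log: the previous time takes the old last-used value and the usage count goes from 0 to 1.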
With concurrent ST requests for a single TGT, this state update (and subsequent ones) is guarded by locks (a ReentrantLock in this case, which is fine within one JVM but not for distributed deployments; it is configurable, though). CAS keeps a pool of locks, and if a request cannot acquire its lock within a given time period (3s is the default), that request returns HTTP 404 "TGT not found".
For multi-node deployments, this one-JVM approach may not be a good idea for consistency (at least that's what my searches suggest), so I tried to override this behaviour with a distributed lock that Hazelcast already provides, named FencedLock. So after reading the CAS docs (my hosted one), I tried implementing it like this in my overlay project:
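The timed-lock behaviour described above can be sketched roughly like this. It is a simplified illustration using only the JDK, not the actual CAS DefaultLockRepository; the class name, the per-key map, and the contend helper are assumptions made for the example. The point is that a lock timeout comes back as an empty Optional, which the caller then cannot tell apart from "ticket not found".

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Simplified per-JVM lock repository (hypothetical, for illustration only). */
class TimedLockRepository {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    TimedLockRepository(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the supplier under the key's lock; returns empty if the lock times out. */
    <T> Optional<T> execute(Object lockKey, Supplier<T> supplier) {
        ReentrantLock lock = locks.computeIfAbsent(lockKey.toString(), k -> new ReentrantLock());
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return Optional.empty();   // the caller sees "no result", i.e. a 404
            }
            return Optional.ofNullable(supplier.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    /** Demo: one thread holds the lock for holdMillis while a second call competes for it. */
    static Optional<String> contend(long timeoutMillis, long holdMillis) {
        TimedLockRepository repo = new TimedLockRepository(timeoutMillis);
        CountDownLatch holding = new CountDownLatch(1);
        Thread holder = new Thread(() -> repo.execute("TGT-1", () -> {
            holding.countDown();           // signal that the lock is now held
            pause(holdMillis);
            return "ST-1";
        }));
        holder.start();
        try {
            holding.await();
            return repo.execute("TGT-1", () -> "ST-2");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            try {
                holder.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a short timeout and a long hold (for example contend(50, 500)) the second request gets an empty Optional even though the "ticket" exists; with a generous timeout (contend(2000, 50)) it simply waits and succeeds.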
package com.example.cas.config;

import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apereo.cas.util.lock.LockRepository;
import org.springframework.stereotype.Component;

import com.example.cas.rest.exception.TooManyStGenerationRequestsException;

@Slf4j
@RequiredArgsConstructor
@Component("hazelcastLockRepository")
public class HazelcastLockRepository implements LockRepository {

    private static final long LOCK_TIMEOUT_SECONDS = 3;

    private final HazelcastInstance hazelcastInstance;

    @Override
    public <T> Optional<T> execute(Object lockKey, Supplier<T> supplier) {
        String key = lockKey.toString();
        // Cluster-wide lock from the Hazelcast CP subsystem, one per lock key.
        FencedLock lock = hazelcastInstance.getCPSubsystem().getLock(key);
        boolean acquired = false;
        try {
            acquired = lock.tryLock(LOCK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
            if (acquired) {
                LOGGER.debug("Acquired lock for key: {}", key);
                return Optional.ofNullable(supplier.get());
            } else {
                // Timed out waiting: fail loudly instead of returning Optional.empty().
                LOGGER.warn("Too many concurrent requests, lock timeout for key: {}", key);
                throw new TooManyStGenerationRequestsException("Too many concurrent requests for key: " + key);
            }
        } finally {
            if (acquired) {
                try {
                    lock.unlock();
                    LOGGER.debug("Released lock for key: {}", key);
                } catch (Exception e) {
                    LOGGER.warn("Failed to unlock key: {} - {}", key, e.getMessage(), e);
                }
            }
        }
    }
}
and registered it like this, as explained in the docs:
/**
 * During ST generation the TGT state is updated (e.g. the TGT expiration time), and CAS
 * sets up a lock mechanism to prevent possible stale data on concurrent ST generation
 * requests. By default it uses
 * {@link org.apereo.cas.util.lock.DefaultLockRepository}, which is suited to single-node use,
 * together with {@link org.springframework.integration.support.locks.DefaultLockRegistry};
 * under heavy load, when the 1024 configured {@link java.util.concurrent.locks.ReentrantLock}s
 * are exhausted, instead of raising an error while waiting for the locked TGT it returns
 * empty ({@link java.util.Optional#empty}), and because the TGT is then not found the user
 * gets a 404 "TGT not found" error.
 * Instead, I customized it here to use Hazelcast's FencedLock, which is already in use and
 * better suited to distributed deployments (e.g. CAS running as multiple pods on k8s), so
 * that CAS uses it while generating STs and, if this situation occurs, returns
 * HTTP 429 Too Many Requests to tell the user that ST generation requests are being sent
 * too frequently.
 *
 * @param hazelcastInstance the Hazelcast instance that CAS sets up internally.
 */
@Bean
public LockRepository casTicketRegistryLockRepository(HazelcastInstance hazelcastInstance) {
    return new HazelcastLockRepository(hazelcastInstance);
}
With these configurations, instead of per-JVM locks I can use cluster-wide distributed locks and throw a custom exception (TooManyStGenerationRequestsException) for the concurrent requests that can't acquire a lock (for better consistency, configure the cas.ticket.registry.hazelcast.cluster.core.cp-member-count setting and use the CP subsystem on your embedded Hazelcast). This config makes such a request return 500 (instead of the default 404 behaviour). You can instead do what the default one does and return Optional.empty() rather than throwing an exception, which results in HTTP 404 TGT not found, or handle the exception as below.
I also updated the ST generation endpoint (in ServiceTicketResource) in my overlay project to catch this exception and return a 429 Too Many Requests instead of 500, like this:
...
catch (final TooManyStGenerationRequestsException e) {
return new ResponseEntity<>("Too many concurrent ST generation requests for: " + StringEscapeUtils.escapeHtml4(tgtId), HttpStatus.TOO_MANY_REQUESTS);
...
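To summarize the possible outcomes discussed so far, here is a small JDK-only sketch of how the result of the locked ST generation step maps to an HTTP status. The names (StStatusMapper, TooManyRequestsException) are hypothetical stand-ins, not CAS or Spring classes, and the 200 for the success case is my assumption about the endpoint's normal response.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical stand-in for the custom exception thrown on lock timeout. */
class TooManyRequestsException extends RuntimeException {
    TooManyRequestsException(String message) {
        super(message);
    }
}

class StStatusMapper {
    /** Maps the outcome of an ST generation attempt to an HTTP status code. */
    static int statusFor(Supplier<Optional<String>> stGeneration) {
        try {
            // Empty result: the default lock-timeout behaviour, surfaced as "TGT not found".
            return stGeneration.get().isPresent() ? 200 : 404;
        } catch (TooManyRequestsException e) {
            // Explicitly handled lock timeout.
            return 429;
        } catch (RuntimeException e) {
            // Anything the endpoint does not catch ends up as a server error.
            return 500;
        }
    }
}
```

The design choice here is only about signalling: a 429 tells the client to back off and retry with the same TGT, whereas a 404 suggests the TGT is gone and the user has to log in again.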
Here are screenshots of the failed ST generation requests, using the same test case as above:
In summary, my question was about the reason for these 404 responses on concurrent ST generation with the same TGT, and the answer, I think, is the way CAS updates the TGT state when generating STs and its protection against concurrent requests.
For newer CAS versions, I hope that with Hazelcast ticketing enabled, this kind of configuration can be auto-configured and used instead of the single-JVM protection, with more documentation on it.
As another hopeful suggestion, I hope it becomes possible to connect CAS to an external Hazelcast cluster instead of auto-configuring an embedded one at startup, because of the memory an embedded Hazelcast consumes in production environments.
Thank you and have a nice day,
YG
On Saturday, 10 August 2024 at 12:24:41 UTC+3, Y G wrote: