CAS v7.2.x recurring outage issue, unresponsive, no logs


Ocean Liu

Oct 13, 2025, 7:05:32 PM
to CAS Community
Hello everyone,

We are tracking a recurring issue with our CAS service and are wondering if anyone in the community has experienced similar behavior.

Our environment is a single local Linux server. We originally deployed CAS v7.2.2 in May, and the system ran stably with no incidents until September. The issue has now occurred 4 times since September.

Our CAS service will, on occasion, become completely unresponsive. Here are the characteristics we've noticed:
- The outage consistently occurs during periods of low user activity (typically nights or weekends).
- When it happens, the application stops responding to any requests, and no new entries are written to the application log file.
- Once the system is in this state, a standard restart of the CAS service often gets stuck and does not complete successfully.
- The only successful workaround we have found is to block all incoming HTTP traffic before attempting the service restart.
- There are no obvious spikes in server resources (CPU, memory, disk, or network) when the incident occurs.

We are actively investigating this issue with our Unicon consultant.

Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.

Thank you!

Richard Frovarp

Oct 14, 2025, 12:53:45 PM
to cas-...@apereo.org

I don't know about the unresponsive part. But yes, I have issues where traffic needs to be blocked in order to start CAS. There is a race condition that can lead to a deadlock on startup. The HTTP port becomes live before the rest of the app has started, and my guess is that incoming HTTP traffic triggers some of the same startup code. We have unfortunately run into it a few times. Here's my post with more details:

https://groups.google.com/a/apereo.org/g/cas-user/c/9i32dWR0Z3g/m/OBaGCvIPBgAJ

Since it is a deadlock, everything just stops. You don't get any additional logging. The only way to find it is with a jstack call on the pid.

The "fix" is to put your single instance into a load balancer of some sort (HTTPD has one built in, NGINX probably does too), and pull the node during restarts.
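If the node is pulled from the balancer (or inbound traffic is blocked) during a restart, something has to decide when to put it back. Below is a minimal sketch in Java 11+, assuming that an HTTP 200 from the login page (a URL like `http://127.0.0.1:8080/cas/login` is an assumption, not taken from the posts) is an acceptable readiness signal: poll until it answers 200 or a deadline passes, and only then re-enable the node. `ReadinessProbe`, `waitUntilReady` and `selfTest` are hypothetical names. Note that the probe is itself an HTTP request, so if a single early request can trigger the startup race described above, start polling only after a delay.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ReadinessProbe {

    /** Polls url until it answers HTTP 200 or timeout elapses; returns true on a 200. */
    public static boolean waitUntilReady(String url, Duration timeout, Duration interval)
            throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    return true;
                }
            } catch (IOException e) {
                // Connection refused, timeout or TLS failure: treat as "not ready yet".
            }
            Thread.sleep(interval.toMillis());
        }
        return false;
    }

    /** Runs the probe against a local stand-in server (assumed path /cas/login) instead of CAS. */
    public static boolean selfTest() throws IOException, InterruptedException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/cas/login", exchange -> {
            exchange.sendResponseHeaders(200, -1); // 200 with an empty body
            exchange.close();
        });
        server.start();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/cas/login";
            return waitUntilReady(url, Duration.ofSeconds(10), Duration.ofMillis(200));
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ready=" + selfTest());
    }
}
```

`selfTest()` uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in so the sketch runs without CAS; against a real node you would call `waitUntilReady` with your own URL and timeouts. With HTTPS and a self-signed certificate, the handshake failure surfaces as an `IOException`, so the probe just keeps polling until it times out.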

I would suggest that when it becomes unresponsive, you run jstack against the process before restarting. You may find a deadlock. The one I found occurs specifically at startup, but you never know.
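To show what jstack is looking for, here is a minimal, self-contained Java sketch of a lock-ordering deadlock of the general kind described above: a hypothetical "startup-init" thread holds lock A and wants lock B, while a hypothetical "http-request" thread holds B and wants A. It is an illustration only, not the actual CAS code path. The JVM's `ThreadMXBean.findDeadlockedThreads()` reports such a cycle, which is the same condition `jstack -l <pid>` prints under "Found one Java-level deadlock". The class, method and thread names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {

    /** Starts a daemon thread that locks 'first', waits until both threads hold their first lock, then locks 'second'. */
    private static void lockInOrder(String name, Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // guarantees the opposite lock is already taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: each thread now waits on a monitor the other one holds.
                }
            }
        }, name);
        t.setDaemon(true); // blocked daemon threads do not keep the JVM alive
        t.start();
    }

    /** Polls the JVM for monitor/synchronizer deadlocks for up to timeoutMillis; empty array if none. */
    public static ThreadInfo[] findDeadlocks(long timeoutMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = mx.findDeadlockedThreads(); // null when there is no deadlock
            if (ids != null) {
                return mx.getThreadInfo(ids);
            }
            Thread.sleep(50);
        }
        return new ThreadInfo[0];
    }

    /** Builds a two-thread lock-ordering deadlock and returns the names of the deadlocked threads. */
    public static Set<String> demo() throws InterruptedException {
        Object initLock = new Object();    // hypothetical "startup" lock
        Object requestLock = new Object(); // hypothetical "request path" lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        lockInOrder("startup-init", initLock, requestLock, bothHoldFirst);
        lockInOrder("http-request", requestLock, initLock, bothHoldFirst);

        Set<String> names = new TreeSet<>();
        for (ThreadInfo info : findDeadlocks(5_000)) {
            if (info == null) continue; // thread ended between the two MXBean calls
            names.add(info.getThreadName());
            System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + demo());
    }
}
```

Running `main` prints each blocked thread with the lock it waits for and that lock's owner. Against a hung CAS process, running `jstack -l <pid>` from outside remains the practical option, since it needs no changes to the CAS build.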

Thank you!

Pascal Rigaux

Oct 14, 2025, 12:53:46 PM
to cas-...@apereo.org
On 14/10/2025 01:00, Ocean Liu wrote:

> Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.

On this subject, see the message "Deadlock on startup": https://www.mail-archive.com/cas-...@apereo.org/msg17421.html

We switched from the internal (embedded) Tomcat to an external Tomcat, and this issue is gone :-)

cu

Ocean Liu

Oct 14, 2025, 5:17:22 PM
to CAS Community, Pascal Rigaux, Richard Frovarp
Hi Richard and Pascal,

Thank you for the help! We will explore the external Tomcat option.
