CAS v7.2.x recurring outage issue, unresponsive, no logs


Ocean Liu

Oct 13, 2025, 7:05:32 PM
to CAS Community
Hello everyone,

We are tracking a recurring issue with our CAS service and are wondering if anyone in the community has experienced similar behavior.

Our environment is a single local Linux server. We originally deployed CAS v7.2.2 in May, and the system ran stably with no incidents until September. The issue has now occurred 4 times since September.

Our CAS service will, on occasion, become completely unresponsive. Here are the characteristics we've noticed:
- The outage consistently occurs during periods of low user activity (typically nights or weekends).
- When it happens, the application stops responding to any requests, and no new entries are written to the application log file.
- Once the system is in this state, a standard restart of the CAS service often gets stuck and does not complete successfully.
- The only successful workaround we have found is to block all incoming HTTP traffic before attempting the service restart (roughly as sketched after this list).
- There are no obvious spikes in server resources (CPU, memory, disk, or network) when the incident occurs.
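
For clarity, the manual workaround looks roughly like this (firewalld syntax shown only as an example, and the service unit name is a placeholder; adjust to your firewall and connector port):

sudo firewall-cmd --remove-port=8443/tcp    # stop new inbound traffic
sudo systemctl restart cas                  # wait for startup to finish
sudo firewall-cmd --add-port=8443/tcp       # re-open once CAS responds again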

We are actively investigating this issue with our Unicon consultant.

Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.

Thank you!

Richard Frovarp

Oct 14, 2025, 12:53:45 PM
to cas-...@apereo.org

I don't know about the unresponsive part. But yes, I have issues where traffic needs to be blocked in order to start CAS. There is a race condition that can lead to a deadlock on startup. The HTTP port becomes live before the rest of the app starts, and my guess is that HTTP traffic triggers some of the same startup code. We have unfortunately run into it a few times. Here's my post with more details:

https://groups.google.com/a/apereo.org/g/cas-user/c/9i32dWR0Z3g/m/OBaGCvIPBgAJ

Since it is a deadlock, everything just stops. You don't get any additional logging. The only way to find it is with a jstack call on the pid.

The "fix" is to put your single instance into a load balancer of some sort (HTTPD has one built in, NGINX probably does too), and pull the node during restarts.

I would suggest that when it becomes unresponsive, you run jstack on the process before restarting. You may find a deadlock. The one I found is very specifically on startup. But you never know.

Thank you!

Pascal Rigaux

Oct 14, 2025, 12:53:46 PM
to cas-...@apereo.org
On 14/10/2025 01:00, Ocean Liu wrote:

> Has anyone encountered this specific behavior, particularly the need to block inbound traffic to achieve a successful restart? Any shared experiences or guidance would be greatly appreciated.

On this subject, see msg "Deadlock on startup" https://www.mail-archive.com/cas-...@apereo.org/msg17421.html

We switched from the embedded Tomcat to an external Tomcat, and this issue is gone :-)

cu

Ocean Liu

Oct 14, 2025, 5:17:22 PM
to CAS Community, Pascal Rigaux, Richard Frovarp
Hi Richard and Pascal,

Thank you for the help! We will explore the external Tomcat option.

Karol Zajac

Oct 20, 2025, 10:17:55 AM
to CAS Community, Ocean Liu, Pascal Rigaux, Richard Frovarp
Hello,

We have the same issue on 7.3.0. Unfortunately, I don't know how to reproduce it or what is causing it.

Richard Frovarp

Oct 20, 2025, 12:32:49 PM
to CAS Community

If you can, jstack the process when it goes unresponsive. If there is a deadlock, it will tell you where it is.

Run it as the same user that the service is running as:

jstack <pid>

If a deadlock is detected, it will say so at the end of the dump.
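
For example (the pid is a placeholder, and the service account name here is only illustrative):

sudo -u cas jstack <pid> > /tmp/cas-threads.txt
grep -A 20 "Found one Java-level deadlock" /tmp/cas-threads.txt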


Ocean Liu

Oct 20, 2025, 12:32:59 PM
to CAS Community, Karol Zajac, Ocean Liu, Pascal Rigaux, Richard Frovarp
Hello Karol,

Thank you for confirming that you are seeing this issue on v7.3.0 as well. Unfortunately, we also do not have steps to reproduce it yet.

We had two more incidents just this morning, October 20th, around 7:00 AM and 8:00 AM PDT.
Our current working hypothesis is that these CAS issues might be related to the widely reported AWS issues that occurred this morning, potentially impacting the availability of our service providers' SAML metadata.

Have you noticed any correlation between your incidents and any external cloud service provider outages?

Thanks again for sharing!

Ocean Liu

Oct 23, 2025, 2:12:35 PM
to CAS Community, Richard Frovarp
Hi Richard,

Thank you for your response! We have made some progress on the diagnostics and have a strong new working theory.

We ran two initial `jstack` thread dumps and confirmed there are no signs of deadlocks among the standard platform threads.
However, the system's behavior still strongly suggests a deadlock condition, leading us to suspect the newer virtual threads.
We found this article from Netflix highly relevant to our suspicion: https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d
Our next step is to use `jcmd` to capture thread dumps in JSON format (`jcmd <pid> Thread.dump_to_file -format=json <filename>`) so we can specifically inspect the status of the virtual threads.
We will also capture a heap dump.
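
Concretely (the pid and file paths are placeholders; run as the same user the CAS JVM runs as):

jcmd <pid> Thread.dump_to_file -format=json /tmp/cas-threads.json
jcmd <pid> GC.heap_dump /tmp/cas-heap.hprof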

We discovered a key correlation that points to the root cause:
During the AWS outage on Monday morning (10/20), our CAS service repeatedly became unresponsive every 15-30 minutes. We knew that Instructure (Canvas) was down.
Once we switched the Instructure SAML metadata source from the external URL to a local backup copy, the unresponsiveness immediately stopped and has not recurred since.

Based on this evidence, our strong working theory is that the unresponsiveness is directly related to a SAML metadata fetching failure during periods of external network instability, likely causing a virtual thread deadlock.

Thank you for your suggestions, and we will keep you updated once we have analyzed the jcmd and heap dump results.

Derek Badge

Nov 11, 2025, 4:37:24 PM
to CAS Community, Ocean Liu, Richard Frovarp
My issues were definitely related to the virtual threads. Intermittently (frequently), my CAS would fail to start on reboot/restart of the service. Similarly, there were no "deadlocks" for me, just threads waiting forever. Like Richard, blocking traffic during startup would help.

Disabling these has completely fixed my issues (knock on wood; I've had about 10 restarts now with no hangs, and it was a 50% or greater chance of hanging before this), although I suspect the eager setting is unneeded:
spring.cloud.refresh.scope.eager-init=false
spring.threads.virtual.enabled=false

Ocean Liu

Nov 11, 2025, 10:40:04 PM
to CAS Community, Derek Badge, Ocean Liu, Richard Frovarp
Thanks for sharing your experience, Derek!

We did consider disabling Virtual Threads but initially held off due to performance concerns.
We are now confident we've found the root cause without having to revert that feature.

Working with our Unicon consultant and analyzing jcmd thread dumps (which include Virtual Thread status), we determined the core issue was a Virtual Thread deadlock triggered during SAML SP metadata fetching as part of the Single Logout (SLO) process.

By default, CAS enables SLO and aggressively fetches SAML SP metadata from external URLs without using the local cache.

We implemented the following changes (a rough configuration sketch follows the list):
- SLO Disabled: We globally disabled Single Logout.
- Metadata Cache Priority: We configured CAS to prioritize and utilize the local metadata cache.
- Targeted Local Files: We manually switched several critical SAML SP metadata sources (like RStudio) from external URLs to local files.
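
On the configuration side, this looked roughly like the following; the property name should be double-checked against your CAS version's documentation, and the file path is only an example:

cas.slo.disabled=true

and, in each affected SAML service definition, pointing the metadata at the local copy instead of the remote URL:

"metadataLocation" : "file:/etc/cas/saml/rstudio-sp-metadata.xml"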

These steps have kept our CAS service stable since implementation.

We also monitored the `CLOSE_WAIT` TCP sockets on our server, which provided a key metric for success (counted with the command shown after this list):
- Before Changes: We saw spikes of 40–60 `CLOSE_WAIT` TCP sockets coinciding with SSO session timeouts.
- After Changes: The count is consistently low, hovering around 2 CLOSE_WAIT TCP sockets.
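
For reference, the count comes from ss, assuming the connector listens on 8443:

ss -tan state close-wait '( sport = :8443 )' | tail -n +2 | wc -l    # tail drops the header line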

We hope this helps.

Ocean Liu

Nov 13, 2025, 1:18:09 PM
to Derek Badge, CAS Community
That's interesting! Glad you figured it out!

I am curious, did you only experience the hanging during CAS restarts, or was there also unresponsiveness while CAS was already running?


--

Ocean Liu | Enterprise Web Developer | Whitman College
WCTS Building 105F - 509.527.4973

Derek Badge

Nov 13, 2025, 1:18:09 PM
to CAS Community, Ocean Liu, Derek Badge, Richard Frovarp
Interesting, that sounds very different from what we had in our dumps, which all seemed related to beans, but it gives me something else to check into. We already had SLO off. I was just using jstack, but I guess I should learn jcmd as well. The frames we kept seeing looked like this:
        at org.springframework.cloud.context.scope.GenericScope$BeanLifecycleWrapper.getBean(GenericScope.java:373)
        - locked <0x00000006c8588aa8> (a java.lang.String)
        at org.springframework.cloud.context.scope.GenericScope.get(GenericScope.java:177)


Derek Badge

Nov 13, 2025, 1:18:09 PM
to CAS Community, Ocean Liu, Derek Badge, Richard Frovarp

Thanks for the help, it definitely put me on the right track. I went back and re-enabled Virtual Threads, and (conveniently, in this case) the service immediately failed to start.

I had a large number of CLOSE_WAIT connections on 8443 from our load balancer, but what I missed earlier was that they were all IPv6 addresses. Since we don't actively use IPv6, this led me to suspect a network stack conflict.

I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit, and that most likely has resolved the issue. The service is now starting reliably (at least on the test servers, after 10 or so restarts).
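
However your unit ends up launching the JVM, the net effect is just that flag on the java command line, e.g. (the WAR path here is only an example):

java -Djava.net.preferIPv4Stack=true -jar /opt/cas/cas.war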

This also explains our previous workaround: blocking port 8443 with the firewall was preventing the load balancer's IPv6-mapped connections from hitting the service during the race-sensitive startup, which is why it worked.

It seems the other paths we were investigating were likely red herrings.



Derek Badge

Nov 13, 2025, 3:41:57 PM
to Ocean Liu, CAS Community
Only during restarts in our case. The only outage we have had while running was the /var/run space filling up to 100% due to logs over a very long period of uptime.