I don't know about the unresponsive part. But yes, I have issues where traffic needs to be blocked in order to start CAS. There is a race condition that can lead to a deadlock on startup. The HTTP port becomes live before the rest of the app starts, and my guess is that HTTP traffic triggers some of the same startup code. We have unfortunately run into it a few times. Here's my post with more details:
https://groups.google.com/a/apereo.org/g/cas-user/c/9i32dWR0Z3g/m/OBaGCvIPBgAJ
Since it is a deadlock, everything just stops. You don't get any additional logging. The only way to find it is with a jstack call on the pid.
The "fix" is to put your single instance into a load balancer of some sort (HTTPD has one built in, NGINX probably does too), and pull the node during restarts.
I would suggest that when it becomes unresponsive you run jstack on the process before restarting. You may find a deadlock. The one I found is very specifically on startup. But you never know.
Thank you!
If you can, jstack the process when it goes unresponsive. If there is a deadlock, it will tell you where it is.
Run it as the same user the process is running as:
jstack <pid>
If a deadlock is detected, it will say so at the end of the thread dump.
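For example, assuming the service runs under a dedicated cas account (adjust the user and PID for your setup):

    sudo -u cas jstack <pid> > /tmp/cas-threads.txt

The deadlock report, if there is one, shows up at the bottom of that file.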
Hello Karol,
Thank you for confirming that you are seeing this issue on v7.3.0 as well. Unfortunately, we also do not have steps to reproduce it yet.
We had two more incidents just this morning, October 20th, around 7:00 AM and 8:00 AM PDT.
We have a working hypothesis that we are investigating: these CAS issues might be related to the widely reported AWS outage this morning, which could have impacted the availability of our service providers' SAML metadata.
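One quick way to test that (a sketch; the URL is just a placeholder for one of your SPs' metadata endpoints) is to time a metadata fetch during an incident:

    curl -m 10 -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://sp.example.org/saml/metadata

If that hangs or times out while CAS is struggling, metadata resolution is a plausible culprit.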
Have you noticed any correlation between your incidents and any external cloud service provider outages?
Thanks again for sharing!
Thanks for the help, it definitely put me on the right track. I went back and re-enabled Virtual Threads, and (conveniently, in this case) the service immediately failed to start.
I had a large number of CLOSE_WAIT connections on 8443 from our load balancer, but what I missed earlier was that they were all IPv6 addresses. Since we don't actively use IPv6, this led me to suspect a network stack conflict.
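For anyone else chasing this, something like the following lists those connections (adjust the port if yours differs); the IPv6-mapped peer addresses were the giveaway:

    ss -tn state close-wait '( sport = :8443 )'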
I added -Djava.net.preferIPv4Stack=true to the cas.service systemd unit, and that most likely has resolved the issue. The service is now starting reliably (at least on the test servers, after 10 or so restarts).
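In case it helps someone else, this is roughly what the change looks like as a systemd drop-in (a sketch; it assumes the unit's ExecStart expands $JAVA_OPTS, yours may pass JVM flags differently):

    # /etc/systemd/system/cas.service.d/override.conf
    [Service]
    Environment="JAVA_OPTS=-Djava.net.preferIPv4Stack=true"

followed by systemctl daemon-reload and a restart of the service.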
This also explains our previous workaround: blocking port 8443 with the firewall was preventing the load balancer's IPv6-mapped connections from hitting the service during the race-sensitive startup, which is why it worked.
It seems the other paths we were investigating were likely red herrings.