CAS 6.3 Loading SP metadata file multiple times for each SAML2 auth request.

178 views
Skip to first unread message

Noemi Valle

unread,
Jul 6, 2021, 5:28:38 PM7/6/21
to CAS Community

We recently upgraded CAS from version 5.3 to 6.3. After the upgrade we noticed a drastic increase in CPU usage and by looking at logs we noticed that every time a SAML2 authentication request is processed, the applications performs the action of loading the SP metadata file multiple times (more than 20). Here is one example of the log entry:

2021-07-06 17:04:48,552 INFO [org.apereo.cas.support.saml.services.idp.metadata.SamlRegisteredServiceServiceProviderMetadataFacade] - <Resolved metadata chain from [/etc/cas/saml/sp_metadata/10000836-everfi-metadata.xml]. Filtering the chain by entity ID [https://admin.fifoundry.net/floridaiu/saml/sp]>

Have anybody experienced a similar issue ?

Thanks in advance!
Posting...
CAS 6.3 Loading SP tetadata files multiple times for each authentication request

Jason B. Rappaport

unread,
Jul 7, 2021, 8:39:44 AM7/7/21
to cas-...@apereo.org

Good morning.  Related to increase in CPU usage post upgrade, yes we have indeed seen this.  Yesterday we upgraded to 6.3.5. 

 

From an experience perspective:

  • Login page took longer to load, data supporting this:
    • Synthetic transitions were failing for multiple services…all the result of timeouts (waiting for the login page to load)
    • We plot metrics (totalPrincipals, Unexpiredsts, and unexpiredtgts) via Nagios, they too were timing out resulting in many alarms
    • Some group runs what they call is a robot, basically it is a program by UI Path Orchestration that clicks through some site to automate something that couldn’t be automated via another process….well that too was timing out due waiting for the login page to load

 

Quantitatively:

  • CPU usage is substantially higher than it has been (this occurs only on our on-prem CAS servers, the off-prem CAS servers do not show a CPU increase post upgrade)

  • Memory is up, but this started prior to our upgrade (interestingly, our off-prem CAS servers dropped their memory load on July 2nd where as the on-prem servers did not)

 

 

Has anyone turned off HTTP options?  When looking into this, we saw two hosts that combined at 70k calls to the CAS servers using HTTP options.  We are considering disabling this. 

 

We are still investigating all of this.   

 

 

Thanks, Jay

 

________________________________

Jason Rappaport (he/him)

Identity and Access Management Analyst

Office of Information Technology

Email:  jaso...@princeton.edu

Office:  609-258-8464

--
- Website: https://apereo.github.io/cas
- Gitter Chatroom: https://gitter.im/apereo/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cas-user+u...@apereo.org.
To view this discussion on the web visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/75e422d7-d51f-4a69-8911-7d5005d4cc12n%40apereo.org.

image001.png
image002.png

Juan Quintanilla

unread,
Jul 7, 2021, 1:42:10 PM7/7/21
to CAS Community
Hi,

We are running CAS 6.3 with tomcat 9 and Java 11, and have SAML2 and oauth dependencies installed with hazelcast as ticket registry and json files for service registry.  We have noticed that after a few days of running the CPU usage for tomcat spikes to above 100% and requires a restart for it to come back down. When we check the load on the server there isn't to many authentications happening. We had a similar tomcat configuration when running CAS 5.3 with tomcat 8.5 and didn't really see this issue. What we notice in the CAS logs is the metadata being loaded on several occasions over 10 times for a single authentication. [org.apereo.cas.support.saml.services.idp.metadata.SamlRegisteredServiceServiceProviderMetadataFacade] - <Resolved metadata chain from [/etc/cas/saml/sp_metadata

In our tomcat configuration we are using nio connector together with the opensslimplementation, we have also tried switching to nio2 but we see similar behavior.  We have even bumped up the CPU on the server but still encounter the same problem. Doing various thread dumps the issue doesn't seem to be with GC, or hazelcast.  Below is a snippet of one of the threads that was showing up with high cpu during a thread dump.

"https-openssl-nio-443-exec-63" #340 daemon prio=5 os_prio=0 cpu=783837.57ms elapsed=335561.54s tid=0x00007f44bc289800 nid=0x7c5f runnable [0x00007f4468f5d000]
java.lang.Thread.State: RUNNABLE
at java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames(java...@11.0.11/Native Method)
at java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames(java...@11.0.11/StackStreamFactory.java:386)
at java.lang.StackStreamFactory$AbstractStackWalker.getNextBatch(java...@11.0.11/StackStreamFactory.java:322)
at java.lang.StackStreamFactory$AbstractStackWalker.peekFrame(java...@11.0.11/StackStreamFactory.java:263)
at java.lang.StackStreamFactory$AbstractStackWalker.hasNext(java...@11.0.11/StackStreamFactory.java:351)
at java.lang.StackStreamFactory$StackFrameTraverser.tryAdvance(java...@11.0.11/StackStreamFactory.java:593)
at java.util.stream.ReferencePipeline.forEachWithCancel(java...@11.0.11/ReferencePipeline.java:127)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(java...@11.0.11/AbstractPipeline.java:502)
at java.util.stream.AbstractPipeline.copyInto(java...@11.0.11/AbstractPipeline.java:488)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(java...@11.0.11/AbstractPipeline.java:474)
at java.util.stream.FindOps$FindOp.evaluateSequential(java...@11.0.11/FindOps.java:150)
at java.util.stream.AbstractPipeline.evaluate(java...@11.0.11/AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.findFirst(java...@11.0.11/ReferencePipeline.java:543)
at org.apache.logging.log4j.util.StackLocator.lambda$getCallerClass$6(StackLocator.java:55)
at org.apache.logging.log4j.util.StackLocator$$Lambda$163/0x0000000800376840.apply(Unknown Source)
at java.lang.StackStreamFactory$StackFrameTraverser.consumeFrames(java...@11.0.11/StackStreamFactory.java:534)
at java.lang.StackStreamFactory$AbstractStackWalker.doStackWalk(java...@11.0.11/StackStreamFactory.java:306)
at java.lang.StackStreamFactory$AbstractStackWalker.callStackWalk(java...@11.0.11/Native Method)
at java.lang.StackStreamFactory$AbstractStackWalker.beginStackWalk(java...@11.0.11/StackStreamFactory.java:370)
at java.lang.StackStreamFactory$AbstractStackWalker.walk(java...@11.0.11/StackStreamFactory.java:243)
at java.lang.StackWalker.walk(java...@11.0.11/StackWalker.java:498)
at org.apache.logging.log4j.util.StackLocator.getCallerClass(StackLocator.java:54)
at org.apache.logging.log4j.util.StackLocatorUtil.getCallerClass(StackLocatorUtil.java:67)
at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:51)
at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:48)
at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:30)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:354)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:379)
at org.opensaml.core.xml.AbstractXMLObject.<init>(AbstractXMLObject.java:48)
at org.opensaml.saml.saml2.metadata.impl.EndpointImpl.<init>(EndpointImpl.java:59)
at org.opensaml.saml.saml2.metadata.impl.SingleSignOnServiceImpl.<init>(SingleSignOnServiceImpl.java:40)
at org.opensaml.saml.saml2.metadata.impl.SingleSignOnServiceBuilder.buildObject(SingleSignOnServiceBuilder.java:49)
at org.opensaml.saml.saml2.metadata.impl.SingleSignOnServiceBuilder.buildObject(SingleSignOnServiceBuilder.java:31)
at org.opensaml.core.xml.AbstractXMLObjectBuilder.buildObject(AbstractXMLObjectBuilder.java:58)
at org.opensaml.core.xml.AbstractXMLObjectBuilder.buildObject(AbstractXMLObjectBuilder.java:73)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.buildXMLObject(AbstractXMLObjectUnmarshaller.java:182)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.unmarshall(AbstractXMLObjectUnmarshaller.java:104)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.unmarshallChildElement(AbstractXMLObjectUnmarshaller.java:337)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.unmarshall(AbstractXMLObjectUnmarshaller.java:128)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.unmarshallChildElement(AbstractXMLObjectUnmarshaller.java:337)
at org.opensaml.core.xml.io.AbstractXMLObjectUnmarshaller.unmarshall(AbstractXMLObjectUnmarshaller.java:128)
at org.opensaml.saml.metadata.resolver.impl.DOMMetadataResolver.initMetadataResolver(DOMMetadataResolver.java:68)
at org.apereo.cas.support.saml.idp.metadata.locator.SamlIdPMetadataResolver.initMetadataResolver(SamlIdPMetadataResolver.java:64)
at org.opensaml.saml.metadata.resolver.impl.AbstractMetadataResolver.doInitialize(AbstractMetadataResolver.java:289)
at net.shibboleth.utilities.java.support.component.AbstractInitializableComponent.initialize(AbstractInitializableComponent.java:65)
- locked <0x00000005850f62c0> (a org.apereo.cas.support.saml.idp.metadata.locator.SamlIdPMetadataResolver)
..


Has anyone noticed something similar?

Thanks!

___________________

Juan Quintanilla

Jason B. Rappaport

unread,
Jul 7, 2021, 2:23:46 PM7/7/21
to cas-...@apereo.org

Good afternoon.  Just to add a bit more information to this. 

 

Today we doubled the CPU and RAM for on our on-prem CAS servers; they are now at 4 CPUS and 16 GB of RAM.  They are now stable (timeouts are gone for various services and the metrics endpoints are responding).

Our off-prem CAS servers are running fine with 2 CPUS and 8 GB of RAM; no change was made to them. 

 

Juan – you mentioned Hazelcast, we use that as well for replicating information from our on-prem CAS servers to the off-prem CAS servers.  We have also encountered several instances where our off-prem CAS servers CPU is pegged in both our QA and PROD environments.  We have little to no traffic using our QA CAS servers and what is interesting is that both environments (QA and PROD) have pegged CPUs on the same days.  When we investigated, we found that the Hazelcast cluster was constantly being reestablished.  I posted a message on the Hazelcast support community https://groups.google.com/g/hazelcast/c/UmB1VzOBm-4 and then talked to the folks at Hazelcast.  Basically what they said was that unless your virtual machine CAS servers are in the same datacenter, do not use the Hazelcast version that comes with CAS.  The Hazelcast folks indicated that using TCP/IP to maintain session information on CAS severs that are too distant (like having CAS servers off-prem and some on-prem) would likely cause issues.  They recommended purchasing their Hazelcast enterprise edition (which is really interesting and has a ton of cool features, but is also very expensive) that uses message queuing (MQ) technology instead of relying on TCP/IP to maintain session information.  In your logs, look for “ Initialized new cluster connection between” we had 20k messages in one day that the CPUs were pegged. 

 

We asked our networking team about the stability between on campus and off campus cloud provider and they indicated the connection was stable enough that we would not notice any glitches; which doesn’t explain why QA and PROD saw pegged CPUs on the same day as those hosts don’t talk to each other. 

 

So for now, doubling the CPU and RAM on our on-prem CAS servers (which only handle ½ the authentication traffic as our off-prem CAS servers) seems to keep us stable…..for now. 

 

Attached is a screenshot showing our CPU usage.  We upgraded CAS on 7.6 (yesterday), which can be seen on the chart where the CPU is averaging about 50%.  Today around 12:00pm, we doubled CPU and RAM, rebooted, and now CPU is at 1%.  The gap in the data…happens sometimes with our data collection tools. 

 

 

Thanks, Jay

 

 

________________________________

Jason Rappaport (he/him)

Identity and Access Management Analyst

Office of Information Technology

Email:  jaso...@princeton.edu

Office:  609-258-8464

 

 

From: cas-...@apereo.org <cas-...@apereo.org> On Behalf Of Juan Quintanilla
Sent: Wednesday, July 7, 2021 1:42 PM
To: CAS Community <cas-...@apereo.org>

--

- Website: https://apereo.github.io/cas
- Gitter Chatroom: https://gitter.im/apereo/cas
- List Guidelines: https://goo.gl/1VRrw7
- Contributions: https://goo.gl/mh7qDG
---
You received this message because you are subscribed to the Google Groups "CAS Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cas-user+u...@apereo.org.

image001.png

Jason B. Rappaport

unread,
Jul 8, 2021, 10:15:42 AM7/8/21
to cas-...@apereo.org

Related, I just got the latest shib-cas-authn plugin working on Shibboleth IDP 4.1.2 so we can delegate authentication to CAS (6.3.5).  When I do this and try to authenticate, I see the following log message (45 times) and the response time from CAS is so long that our IDP timesout the seeion.

 

^XJul  8 10:12:53 105W user Loading  SAML metadata from [/etc/cas/saml/metadata/sp_metadata.xml]

Jul  8 10:12:53 105W user No  metadata signature location is defined for [/etc/cas/saml/metadata/sp_metadata.xml], so SignatureValidationFilter will not be invoked

Jul  8 10:12:53 105W user Initialized  metadata resolver from [/etc/cas/saml/metadata/sp_metadata.xml]

Jul  8 10:12:53 105W user SAML  metadata resolver [org.opensaml.saml.metadata.resolver.ChainingMetadataResolver] obtained from the cache is unable to produce/resolve valid metadata [/etc/cas/saml/metadata/sp_metadata.xml]. Metadata resolver cache entry with key [f2317……] has been invalidated. Retry attempt: [1]

 

Anyone ever seen this before?

image001.png
Reply all
Reply to author
Forward
0 new messages