Server load w/ 4.2.6


Tom Poage

Oct 13, 2016, 5:20:05 PM
to CAS Community
Afternoon,

On moving from 4.2.1 to 4.2.6, our apparent system load increased dramatically.

Run queue went from as high as 4 to nearly 30, with (Linux) load average jumping from a max of 0.2 to about 15 for a user base (TGT count) of 46k.

A code diff doesn’t seem to show much, except perhaps the addition of a synchronous ticketTransactionManager. The only other likely candidates are the bump in the Hazelcast version, or the fact that we went from 3 to 4 (single-CPU) VMs in the cluster (point-to-point instead of multicast). CPU increased from a high of about 20% (usually 5-8%) to the 50% range, on all nodes. Ironically, response time doesn’t seem all that bad, though it is a bit sluggish.

Anyone else experience something similar?

Thanks!
Tom.

Tom Poage

Oct 14, 2016, 1:14:42 PM
to CAS Community
Disabling the fourth node doesn't change anything.

Profiling shows the highest CPU/time is spent in Hazelcast. Whether this is a result of the updated Hazelcast version or the new synchronous CAS code remains to be seen.

Is it possible to downgrade the Hazelcast version (say, to 3.6) on CAS 4.2.6, i.e. were any Hazelcast version-specific changes made between roughly 4.2.[12] and 4.2.6?

Thanks.
Tom.

Misagh Moayyed

Oct 14, 2016, 1:45:03 PM
to Tom Poage, CAS Community

You can exclude the hazelcast dependency from the relevant module, and provide your own exact version. 
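For a Maven overlay, that would look something like the sketch below. The module coordinates are from memory and the 3.6 version is only an example, so adjust them to whatever your overlay actually declares:

    <dependency>
      <groupId>org.jasig.cas</groupId>
      <artifactId>cas-server-integration-hazelcast</artifactId>
      <version>${cas.version}</version>
      <exclusions>
        <!-- keep CAS from pulling in its default Hazelcast version -->
        <exclusion>
          <groupId>com.hazelcast</groupId>
          <artifactId>hazelcast</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- then pin the exact Hazelcast version you want -->
    <dependency>
      <groupId>com.hazelcast</groupId>
      <artifactId>hazelcast</artifactId>
      <version>3.6</version>
    </dependency>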


-- 
Misagh


Tom Poage

Oct 15, 2016, 2:23:22 PM
to CAS Community
This email I sent looks like it got stuck in Google yesterday for nearly 2-1/2 hours before delivery (cf. Received lines in mail header). List maintainers: Two followup emails I sent yesterday mid-day on this topic still have not been delivered.

In summary:

The RegistryCleaner was moved/refactored in 4.2.5 out of the default (in-memory) ticket registry into a TicketRegistryCleaner and set to auto-start by default on all registry types (including Hazelcast, Ehcache, etc.).

On deploying 4.2.6 (cf. the most recent CAS security announcement), all servers in our pool started competing for access to the ticket cache (for us, 50k-60k entries), walking the entire cache every two minutes in chunks of 500 tickets and using what looks like an exclusive lock.

Disabling the registry cleaner brought the load average on our (4) servers back down to normal levels (0.01-0.20), a twenty-fold decrease. We're currently using Hazelcast, which has its own TTL eviction mechanism, so this cleaner is not necessary. The same holds for Ehcache (which we used previously).


cas.properties:


ticket.registry.cleaner.startdelay=-1


(value could have been zero, but -1 seemed more mnemonic of the intent)
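For reference, the TTL eviction mentioned above is ordinary Hazelcast map configuration. A minimal hazelcast.xml sketch; the map name and timeouts here are illustrative, not our production values:

    <hazelcast xmlns="http://www.hazelcast.com/schema/config">
      <map name="tickets">
        <!-- entries expire on their own, so no separate cleaner pass is needed -->
        <time-to-live-seconds>28800</time-to-live-seconds>
        <max-idle-seconds>7200</max-idle-seconds>
      </map>
    </hazelcast>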


If someone does need to use this ticket cleaner, it seems to make sense to run it on only a single node, assuming global cache semantics for the get-all-entries method (I don't recall the exact name at the moment).
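One way to express that, per node, is sketched below; the 20000 is just an illustrative positive start delay (check the units against your deployment), not a value we've verified:

    # cas.properties on the single node that runs the cleaner
    ticket.registry.cleaner.startdelay=20000

    # cas.properties on every other node
    ticket.registry.cleaner.startdelay=-1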


Tom.

Tom Poage

Oct 17, 2016, 4:12:49 AM
to CAS Community
Disabling the registry cleaner brought load average on our (4) servers down to 0.01-0.20 (from 4.0-15.0).

cas.properties:

ticket.registry.cleaner.startdelay=-1

(value could have been zero, but -1 seemed more mnemonic of the intent)

Tom.


Tom Poage

Oct 17, 2016, 4:12:49 AM
to CAS Community

Looks like we found the source of the load issue.

Best we can tell, somewhere around 4.2.5 the RegistryCleaner embedded in the DefaultTicketRegistry was refactored into a TicketRegistryCleaner that’s now automatically picked up and started for all registry types (*). This cleaner walks the entire cache map, by default every two minutes, in chunks and with an exclusive lock. Multiply that by 50k entries and several servers all competing to do the same thing and it’s no wonder there’s some load. :-)

The question now is whether to disable it on all nodes, or enable it on only one node in the cluster. Caches like Hazelcast and Ehcache have a time-to-live eviction policy, so it seems to me the registry cleaner is unnecessary for this type of cache.

The CAS code suggests the cleaner can be disabled, albeit somewhat indirectly, by setting the “ticket.registry.cleaner.startdelay” property to less than or equal to zero.

Tom.

* https://github.com/apereo/cas/commit/c1cbde11c5722e1930357d3dc3bdb6d4cffa8214

dkopy...@unicon.net

Oct 17, 2016, 7:35:55 AM
to CAS Community, Tom Poage
+1

And IMHO, the explicit cleaner is not such a good idea for distributed registry implementations that employ their own strategies for cache invalidation.

D.

Tom Poage

Oct 17, 2016, 11:15:01 AM
to CAS Community

On Oct 15, 2016, at 11:23 AM, Tom Poage <tfp...@gmail.com> wrote:

This email I sent looks like it got stuck in Google yesterday for nearly 2-1/2 hours before delivery (cf. Received lines in mail header). List maintainers: Two followup emails I sent yesterday mid-day on this topic still have not been delivered.

Example of the email hold-up: Friday 2347Z to Monday 0812Z. The Received trace suggests the servers on the 10.x network belong to Google:

X-Received: by 10.157.0.4 with SMTP id 4mr6139581ota.80.1476691968295;
        Mon, 17 Oct 2016 01:12:48 -0700 (PDT)
X-BeenThere: cas-...@apereo.org
Received: by 10.157.15.174 with SMTP id d43ls12332979otd.11.gmail; Mon, 17 Oct
 2016 01:12:47 -0700 (PDT)
X-Received: by 10.157.50.165 with SMTP id u34mr6693969otb.45.1476691967039;
        Mon, 17 Oct 2016 01:12:47 -0700 (PDT)
Received: by 10.202.244.67 with SMTP id s64msoih;
        Fri, 14 Oct 2016 16:51:47 -0700 (PDT)
X-Received: by 10.99.113.25 with SMTP id m25mr18200910pgc.173.1476489107284;
        Fri, 14 Oct 2016 16:51:47 -0700 (PDT)
Received: from smtp3.ucdavis.edu (smtp3.ucdavis.edu. [128.120.32.129])
        by mx.google.com with ESMTPS id i84si20253377pfi.299.2016.10.14.16.51.47
        for <cas-...@apereo.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 14 Oct 2016 16:51:47 -0700 (PDT)

