[cas-user] Observed/known memory leaks?


Tom Poage

Sep 10, 2013, 1:42:35 PM
to cas-...@lists.jasig.org
Hello,

Looking for observations of (known?) memory leaks in CAS 3.5.2. Have incomplete information (servers were restarted before we could get in to see what was going on), but had an aberrant issue yesterday with a single user of a typically busy service looping on ST request/validation after authentication (only one TGT). Other clients to this service got in fine.

Along with the typical Monday morning surge in activity, we believe ticket replication eventually ground to a halt. Even after restart, two of three nodes have all but maxed out old-generation memory, so are spending relatively large amounts of time in garbage collection; the third node is using about 50%. Permanent and Eden memory use are comparable across all three.

Anyhow, shooting in the dark for the time being until we can obtain JVM dumps.

Thanks.
Tom.

Scott Battaglia

Sep 10, 2013, 1:48:47 PM
to cas-...@lists.jasig.org
What ticket storage mechanism are you using?

Also, have you enabled:
-XX:+HeapDumpOnOutOfMemoryError
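If not, it's typically set via CATALINA_OPTS (e.g. in Tomcat's setenv.sh); the dump path below is only an example, point it anywhere with enough space:

  CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/cas-heapdumps"

Without -XX:HeapDumpPath the .hprof file lands in the JVM's working directory.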




Tom Poage

Sep 10, 2013, 2:11:03 PM
to cas-...@lists.jasig.org
On Sep 10, 2013, at 10:48 AM, Scott Battaglia <scott.b...@GMAIL.COM> wrote:
> What ticket storage mechanism are you using?
>
> Also, have you enabled:
> -XX:+HeapDumpOnOutOfMemoryError

Ehcache (RMI/multicast; TGT overflow to disk, ST in-memory only). No issues observed in extended load testing prior to rollout, though we didn't test for the particular ST request/validation issue observed.

Servers are relatively modest: three 2-core, 2 GB VMs. The TGT and ST in-memory caches are sized to 10k entries each.

Will see about adding the heap dump parameter to the servers, though we apparently didn't hit an out-of-memory condition. May decide to add ST overflow to disk as a safety valve.
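For reference, the ST cache definition looks roughly like this (reconstructed from memory, so the cache name, TTL, and replicator properties are approximations rather than our exact config):

  <cache name="serviceTicketsCache"
         maxElementsInMemory="10000"
         eternal="false"
         timeToLiveSeconds="300"
         overflowToDisk="false"
         memoryStoreEvictionPolicy="LRU">
    <cacheEventListenerFactory
        class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
        properties="replicateAsynchronously=false"/>
  </cache>

The TGT cache is the same idea, but with overflowToDisk="true".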

Thanks.
Tom.


Scott Battaglia

Sep 10, 2013, 2:16:51 PM
to cas-...@lists.jasig.org
What error did you hit?  The GC Overhead Limit?



Marvin S. Addison

Sep 10, 2013, 2:17:38 PM
to cas-...@lists.jasig.org
> had an aberrant issue yesterday with a
> single user of a typically busy service looping on ST
> request/validation after authentication (only one TGT).

What does the audit log say about the validations? If they were
successful then it's almost certainly a problem with the CAS client
and/or user agent. In almost every case of a redirect loop we've seen,
the root cause was not the CAS server.

M

Tom Poage

Sep 10, 2013, 5:10:12 PM
to cas-...@lists.jasig.org
On 09/10/2013 11:17 AM, Marvin S. Addison wrote:
>> had an aberrant issue yesterday with a
>> single user of a typically busy service looping on ST
>> request/validation after authentication (only one TGT).
>
> What does the audit log say about the validations? If they were
> successful then it's almost certainly a problem with the CAS client
> and/or user agent. In almost every case of a redirect loop we've seen,
> the root cause was not the CAS server.

Validations during looping were successful for a while--for almost two
hours. At one point I think I counted 12 ST requests per second (request
on server A, validation on server B). Then ST validations started failing.
So, agreed, I suspect it's the CAS client, and am working with that team.
So far we've been unable to identify what's different about this user.

That said, we are observing ST replication errors. Seems to go in
bursts. Haven't checked carefully, but seems to correlate with ST
validation failures (SERVICE_TICKET_VALIDATE_FAILED).

> Sep 10, 2013 1:44:11 PM org.apache.catalina.core.StandardWrapperValve invoke
> INFO: 2013-09-10 13:44:11,773 ERROR [net.sf.ehcache.distribution.RMISynchronousCacheReplicator] - <Exception on replication of putNotification. null. Continuing...>
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:894)
> at java.util.HashMap$EntryIterator.next(HashMap.java:934)
> at java.util.HashMap$EntryIterator.next(HashMap.java:932)
> at java.util.HashMap.writeObject(HashMap.java:1098)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)

Given that using VMs for CAS is new to us, we'll have to assess whether
(1) we're underpowered with respect to memory, (2) the instances aren't
responsive enough (i.e. affected by demand from sibling VMs), or (3)
multicast+RMI replication is insufficient for our needs (something that
didn't show up early on), etc.

Need to figure out what is an acceptable/baseline/historical number of
SERVICE_TICKET_VALIDATE_FAILED log entries.
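As a crude first cut I'll probably just count them per node per day, along
the lines of the following (assuming the inspektr audit trail is going to
catalina.out on our boxes; the path is a guess):

  grep -c SERVICE_TICKET_VALIDATE_FAILED /var/log/tomcat/catalina.out

and compare that against days we know were healthy.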

Tom.

Tom Poage

Sep 10, 2013, 6:22:26 PM
to cas-...@lists.jasig.org
On 09/10/2013 02:10 PM, Tom Poage wrote:
> That said, we are observing ST replication errors. Seems to go in
> bursts. Haven't checked carefully, but seems to correlate with ST
> validation failures (SERVICE_TICKET_VALIDATE_FAILED).

So it appears we might be busting our ST cache. It's currently
configured for 10k entries (prior load analysis showed this to be
sufficient), with no overflow to disk and LRU eviction.

Given the ST looping, it seems plausible that cache "flooding" could be
behind our problems.

Plan to turn on overflow-to-disk for the ST cache to see if it solves
the issue.
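If it helps anyone else following along, the change should amount to
flipping the relevant ehcache.xml attributes on the ST cache and making
sure a diskStore is defined, roughly (ehcache 2.x attribute names; the
path and disk limit below are just examples, not our final values):

  <diskStore path="java.io.tmpdir"/>
  ...
  overflowToDisk="true"
  maxElementsOnDisk="100000"

leaving maxElementsInMemory at 10000 so the disk store only absorbs
whatever spills past the in-memory limit.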

Trenton D. Adams

Sep 10, 2013, 6:45:30 PM
to cas-...@lists.jasig.org, Tom Poage

On 13-09-10 04:22 PM, Tom Poage wrote:
> On 09/10/2013 02:10 PM, Tom Poage wrote:
>> That said, we are observing ST replication errors. Seems to go in
>> bursts. Haven't checked carefully, but seems to correlate with ST
>> validation failures (SERVICE_TICKET_VALIDATE_FAILED).
>
> So it appears we might be busting our ST cache. It's currently
> configured for 10k entries (prior load analysis showed this to be
> sufficient), with no overflow to disk and LRU eviction.
>
> Given the ST looping, it seems plausible that cache "flooding" could be
> behind our problems.
>
> Plan to turn on overflow-to-disk for the ST cache to see if it solves
> the issue.

Are you able to possibly replicate it in a test environment? jvisualvm
is nice for looking into what's using memory.

--
Trenton D. Adams
Senior Systems Analyst/Web Software Developer
Navy Penguins at your service!
Athabasca University
(780) 675-6195
:wq!


Tom Poage

Sep 10, 2013, 7:06:22 PM
to cas-...@lists.jasig.org
On 09/10/2013 03:45 PM, Trenton D. Adams wrote:
>> Plan to turn on overflow-to-disk for the ST cache to see if it solves
>> the issue.
>
> Are you able to possibly replicate it in a test environment? jvisualvm
> is nice for looking into what's using memory.

Beat me to it. Yes, that's one of my next steps.

I also use jvisualvm to look at cache info (number of entries, etc.).

Tom.