heap space and posting large numbers of users


paulbr

Feb 22, 2010, 1:50:42 AM
to Sakai Kernel
Do we have recommendations on java memory config for nakamura?

Up until now I've just been using the Java defaults, but I've been
looking at pushing large numbers of users in through the POST
interface at /system/userManager/user.create.html with HTTP basic
auth.
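
Each POST in my batch is roughly the following (user name and passwords are made up, and the parameter names are the standard Sling user manager ones as far as I can tell):

curl -u admin:admin -F ":name=testuser0001" -F "pwd=testpass" -F "pwdConfirm=testpass" http://localhost:8080/system/userManager/user.create.html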

Today I ran out of heap on a Windows instance built 18 Feb after a few
hundred POSTs. I've started using:

java -Xms256m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=256m -jar
org.sakaiproject.nakamura.app-0.3-SNAPSHOT.jar

I now get to around 1500 users before running out of heap. There's a
pause at around 800 users - I think nakamura may be building ACE
entries for the users and consuming enough processor to block
requests. Once that request times out (my batch script is
single-threaded), POSTing resumes until nakamura starts shutting down
bundles at around 1500 users.

Windows Task Manager suggests nakamura is only using around 560 MB at
most. It quickly gets to around 350 MB and slowly grows into the 500s.

This is running Java 5 on Windows XP.

The good news is that with the higher memory configuration it's
processing around 4 POSTs per second, where last week it was only 1
per second.

Ian Boston

Feb 22, 2010, 3:40:04 AM
to sakai-...@googlegroups.com
This sounds like there is a memory leak.

Can you try running a script that exercises content and see if there are any leaks there?
e.g. add a user, and then as that user:
1. Mount a WebDAV drive, upload content and run
find /Volumes/localhost -exec cat {} > /dev/null \;
in a loop.

2. Do the same but this time
cd /Volumes/localhost
find . -exec curl http://localhost:8080/{} > /dev/null \;


The first will test the WebDAV servlet and Jackrabbit, identifying issues with Jackrabbit; the second will test the Sling default servlet, identifying issues with Sling. After that we should look at the userManager area.
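
i.e. run them in a shell loop along these lines (the mount point is just what it happens to be on my machine, and -type f simply skips directories):

# test 1: read every file back through the WebDAV mount, over and over
while true; do
  find /Volumes/localhost -type f -exec cat {} > /dev/null \;
done

# test 2: fetch the same files over HTTP through the Sling default servlet
cd /Volumes/localhost
while true; do
  find . -type f -exec curl http://localhost:8080/{} > /dev/null \;
done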

If you can attach a profiler (YourKit - the Foundation has licenses), that might identify any leak.

Ian


paulbr

Feb 22, 2010, 8:21:31 PM
to Sakai Kernel
My guess is that the memory leak comes from:

23.02.2010 12:16:13.329 *WARN* [Finalizer] org.apache.jackrabbit.core.SessionImpl Unclosed session detected. The session was opened here:
java.lang.Exception: Stack Trace
    at org.apache.jackrabbit.core.SessionImpl.<init>(SessionImpl.java:249)
    at org.apache.jackrabbit.core.SessionImpl.<init>(SessionImpl.java:271)
    at org.apache.jackrabbit.core.XASessionImpl.<init>(XASessionImpl.java:105)
    at org.apache.jackrabbit.core.RepositoryImpl.createSessionInstance(RepositoryImpl.java:1517)
    at org.apache.jackrabbit.core.RepositoryImpl.createSession(RepositoryImpl.java:964)
    at org.apache.jackrabbit.core.SessionImpl.createSession(SessionImpl.java:443)
    at org.apache.jackrabbit.core.DynamicSecurityManager.getUserManager(DynamicSecurityManager.java:338)
    at org.apache.jackrabbit.core.SessionImpl.getUserManager(SessionImpl.java:785)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.sling.jcr.base.SessionProxyHandler$SessionProxyInvocationHandler.invoke(SessionProxyHandler.java:109)
    at $Proxy4.getUserManager(Unknown Source)
    at org.apache.sling.jcr.base.util.AccessControlUtil.getUserManager(AccessControlUtil.java:116)
    at org.sakaiproject.nakamura.user.servlet.CreateSakaiUserServlet.handleOperation(CreateSakaiUserServlet.java:283)
    at org.apache.sling.jackrabbit.usermanager.impl.post.AbstractAuthorizablePostServlet.doPost(AbstractAuthorizablePostServlet.java:127)
    at org.apache.sling.api.servlets.SlingAllMethodsServlet.mayService(SlingAllMethodsServlet.java:143)
    at org.apache.sling.api.servlets.SlingSafeMethodsServlet.service(SlingSafeMethodsServlet.java:338)
    at org.apache.sling.api.servlets.SlingSafeMethodsServlet.service(SlingSafeMethodsServlet.java:369)
    at org.apache.sling.engine.impl.request.RequestData.service(RequestData.java:525)
    at org.apache.sling.engine.impl.SlingMainServlet.processRequest(SlingMainServlet.java:448)
    at org.apache.sling.engine.impl.filter.RequestSlingFilterChain.render(RequestSlingFilterChain.java:48)
    at org.apache.sling.engine.impl.filter.AbstractSlingFilterChain.doFilter(AbstractSlingFilterChain.java:64)
    at org.apache.sling.engine.impl.debug.RequestProgressTrackerLogFilter.doFilter(RequestProgressTrackerLogFilter.java:59)
    at org.apache.sling.engine.impl.filter.AbstractSlingFilterChain.doFilter(AbstractSlingFilterChain.java:60)
    at org.sakaiproject.nakamura.cluster.ClusterTrackingFilter.doFilter(ClusterTrackingFilter.java:87)
    at org.apache.sling.engine.impl.filter.AbstractSlingFilterChain.doFilter(AbstractSlingFilterChain.java:60)
    at org.sakaiproject.nakamura.persistence.TransactionManagerFilter.doFilter(TransactionManagerFilter.java:95)
    at org.apache.sling.engine.impl.filter.AbstractSlingFilterChain.doFilter(AbstractSlingFilterChain.java:60)
    at org.apache.sling.engine.impl.SlingMainServlet.service(SlingMainServlet.java:326)
    at org.apache.sling.engine.impl.SlingMainServlet.service(SlingMainServlet.java:207)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:389)
    at org.ops4j.pax.web.service.internal.HttpServiceServletHandler.handle(HttpServiceServletHandler.java:64)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.ops4j.pax.web.service.internal.HttpServiceContext.handle(HttpServiceContext.java:111)
    at org.ops4j.pax.web.service.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:64)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:535)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:880)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

and the pause at 851 users may be associated with (it happens at the same time as):

23.02.2010 12:16:32.595 *INFO* [IndexMerger]
org.apache.jackrabbit.core.query.lucene.IndexMerger merged 28022
documents in 672 ms into _m.

paulbr

Feb 23, 2010, 1:58:40 AM
to Sakai Kernel
Well I've kind of done 1 and 2...

Windows makes it so easy - you can't mount WebDAV unless it's on port
80 and you can't mount the root, so I'm running WebDAV through Apache,
proxying localhost:8080 to localhost:80/S3. You have to edit the
registry and reboot for basic auth, install a hotfix, reboot... and
write perl scripts to replicate the find functionality.

On running the perl scripts there was no huge memory growth, but there
was some slowness on [2] - I'll investigate more.

I ran the populate script again up to 2000 users. Memory grew to
732 MB but didn't run out; I wasn't running Outlook so the machine had
a lot more resources. There was no refusal to serve requests when
Lucene indexed, and no unclosed sessions. But 5 minutes later the java
process is still at 732 MB.

Paul Bristow

Feb 23, 2010, 4:14:48 AM
to sakai-...@googlegroups.com
For anyone interested, my perl script for the two tests suggested by Ian is:

use strict;
use warnings;
use File::Find;

my $dir = '.';

# Walk the mounted tree repeatedly to keep the server busy.
for my $i (1 .. 100) {
    find(\&process_file, $dir);
}

sub process_file {
    my $file = $_;
    return if -d $file;              # skip directories
    return if -l $file;              # skip links
    return unless -f $file;          # only plain files from here on

    print "$file is file\n";

    ### test 1: read the file back through the WebDAV mount ###
    #open(my $in, '<', $file) or return;
    #while (<$in>) { }
    #close $in;

    ### test 2: fetch the same file over HTTP from Sling ###
    my $url = $File::Find::name;     # relative path; File::Find prepends ./
    $url =~ s/^\.\///;               # strip the leading ./
    $url = 'http://localhost:8080/' . $url;
    print "$url\n";
    `curl $url`;                     # output discarded
}
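
I run it from the top of the mounted WebDAV drive, e.g. (the drive letter and script name here are just placeholders for whatever you used):

cd /cygdrive/z       # or however the mounted drive shows up in your shell
perl walk_test.pl    # walk_test.pl = whatever you saved the script above as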


Ian Boston

Feb 23, 2010, 4:18:01 AM
to sakai-...@googlegroups.com, Sakai Kernel
There are 2 types of OOM we see. The first is heap and the second is
perm space. Heap exhaustion comes from Jackrabbit sizing up its
caches; these reach steady state in the default config at around
512 MB. Perm space is related to the classes, strings and code, and to
a certain extent the number of terms in Lucene. IIRC on 32-bit 128 MB
is plenty; on 64-bit you will need to double that.

You should enable JMX and connect jconsole to see what is really
happening, as the process size will only tell you the high-water mark.
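
On the command line that means something like this (flags as I remember them; the port is arbitrary and auth is off, so test boxes only):

java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9999 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -Xms256m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=256m \
     -jar org.sakaiproject.nakamura.app-0.3-SNAPSHOT.jar

jconsole localhost:9999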

Ian

Sent from my iPhone

Ian Boston

Feb 23, 2010, 5:02:23 AM
to sakai-...@googlegroups.com, Sakai Kernel
Could you raise a JIRA please; this needs investigation. It may be
connected to a listener, but if not it might be a leak.


Ian

Sent from my iPhone

paulbr

Feb 24, 2010, 12:43:13 AM
to Sakai Kernel
From jconsole, all the memory ends up in the tenured heap when running
the add-user script:

permanent stays around 41 MB
code cache has grown to 13 MB
eden peaks around 30-45 MB
survivor is up to around 5.5 MB
tenured is around 600 MB; GC takes it down to around 400 MB (grown to
500 in the last 5 minutes)

permanent and code cache are basically flat lines; eden and survivor
get garbage collected to approximately 0.

Task Manager shows memory use at around the pre-GC maximum.

On killing the user-add script:
permanent and code cache are virtually unchanged
tenured flatlined at 450 MB; survivor and eden emptied
Windows doesn't show the memory being returned
after 5 minutes tenured is down to 380 MB; Windows Task Manager still
shows no change

Ian Boston

Feb 24, 2010, 5:54:55 AM
to sakai-...@googlegroups.com
That looks good. If you keep jconsole running, does it go OOM, and when it does, what do the graphs look like?

The fact that code cache and perm are staying flat and low indicates that we have no string growth or aggressive byte code manipulation going on. It's also well within the default settings for the JVM - ~60% of perm space used.

In the graphs it's the trough of each tenured cycle after full GC that is interesting. If that consistently goes up, then it will cause more rapid GCs and a finer sawtooth, eventually leading to OOM.
Also, if the GC activity is rapid then the whole app is in a bad state and needs tuning (JVM settings or code), as it will be spending more time looking for free memory than doing useful work. Sakai 2 was very bad under load with the DOM XML serializers in this area (IMVHO).

So in your tests do you see the minimum tenured space on an upward trend? If not, then it doesn't look like a memory leak, but it could be some other resource leak.
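
GC logging will show the same trend without keeping jconsole attached, e.g. (Sun JVM flags, added to the java command line you are already using):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -Xms256m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=256m \
     -jar org.sakaiproject.nakamura.app-0.3-SNAPSHOT.jar

Each Full GC line in gc.log includes the tenured occupancy after collection, which is the number to watch for an upward trend.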

Ian

BTW, one thing that won't show up is file handles. I do regular checks with lsof | grep <processid of the app> | wc -l to check if there is growth there.
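
i.e. something along these lines, watching whether the count keeps creeping up:

# count open file descriptors once a minute; <pid> is the app's process id
while true; do
  lsof -p <pid> | wc -l
  sleep 60
done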

Paul Bristow

Feb 24, 2010, 6:11:31 AM
to sakai-...@googlegroups.com
On 24/02/10 9:54 PM, Ian Boston wrote:
> That looks good, If you keep jconsole running does it go OOM, and when it does where are the graphs.

I stopped the user load at 2500 users and a 730 MB footprint. No errors
(except warnings about the root not having a parent on the perl test 2).

> In the graphs its the trough of each tenured cycle after full GC that interesting. If that consistently goes up, then it will cause more rapid GC's and a finer sawtooth eventually leading to OOM.
> Also if the GC activity is rapid, then the whole app is in a bad state and needs tuning (JVM settings or code) as it will be spending more time looking for free memory than doing useful work. Sakai 2 was very bad under load with the DOM XML serializers in this area. (IMVHO).

My recollection is that tenured GC was about every 60-90 seconds (from
memory of shape vs scale). I'd say tenured troughs trended up at around
10-20 MB a minute. There'd occasionally be a deeper trough but not
enough to buck the trend.

> BTW, one thing that wont show up is file handles. I do regular checks with lsof | grep<processid of the app> | wc -l to check if there is growth there.

Not sure of a Windows equivalent ;)

I was thinking of trying Java 6 tomorrow, which would give me jmap
(which wasn't available for Windows on Java 5) and improved GC.
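
If the Windows build supports it, I'm expecting to be able to do something like (pid being the java process id):

jstat -gcutil <pid> 5000      # GC pool utilisation every 5 seconds
jmap -histo:live <pid>        # live object counts by class (forces a full GC first)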

If you like I'll run the full 50000 user load first and see if I go OOM.

I haven't tried it against the solaris x86 instance yet either.

Paul

Ian Boston

Feb 24, 2010, 6:24:04 AM
to sakai-...@googlegroups.com

On 24 Feb 2010, at 11:11, Paul Bristow wrote:

>>
>
> My recollection is that tenured GC was about every 60-90 seconds (from memory of shape vs scale). I'd say tenured troughs trended up at around 10-20 MB a minute. There'd occasionally be a deeper trough but not enough to buck the trend.


Ok, that means there is a memory leak we need to find. Comparing snapshots in a profiler should find what the objects are. It might be Lucene related, with search terms, or it could be the LRU caches in Jackrabbit filling up. I will try and do a test today.

Ian


>
>> BTW, one thing that wont show up is file handles. I do regular checks with lsof | grep<processid of the app> | wc -l to check if there is growth there.
>
> Not sure of a windows equivalent ;)

Me neither; don't worry, this won't be OS dependent so I can check here.

>
> I was thinking of trying java 6 tomorrow which would give me jmap (which wasn't available for windows on java 5) and improved GC.

A leak is a leak; I have a feeling that J6 won't make any difference.

>
> If you like I'll run the full 50000 user load first and see if I go OOM.

Could you please, thanks.
Ian
