[Note: I'm sorry for the delayed reply. The company I work for has been
blocking Google Groups since last week. Very frustrating. Fortunately,
I saved my reply and can now finally post it.]
Hi Igor.
I really appreciate your reply. It's been very interesting to read.
> When we have confluence running in a cluster and there is a spike in
> the logging in activity, confluence goes down because it's tries to
> pull of our users from the db and synchronize this data between the
> nodes over and over again.
I read the links and the first one is a bit worrying. They've designed
the cluster in a very awkward way.
>
> Then there are some random gc or network related delays that make
> confluence unhappy.
The first link clearly indicated that. Did you have problems with
stop-the-world GC pauses and network latency?
>
> The worst of all is that if there is something wrong, confluence will
> shut down *all* of its nodes, rather than only those that are
> suspected to be failing. This caused us completely unnecessary
> outages, when a restart of the affected node would be good enough.
I was thinking about a way to put it more eloquently, but the only
thing that sprang to mind was "that is exceptionally poor".
>
> Another potential issue is http://jira.atlassian.com/browse/CONF-15233,
> which we haven't experienced in cluster because we were
> not running cluster at that time any more, but I can imagine
> possibility of cluster going down in this case too.
Oh, my. That could bring down the cluster. I'm getting less impressed
by the minute.
Any examples or additional cautions would be very much appreciated.
I'd like to pass on these concerns to people here before we go live
with the cluster.
> We hit problems only after we had certain amount of users in our db.
>
> With us, we do SSO and can't do user management asynchronously because
> we have millions of users in our main identity system + user roles
> there change dynamicaly. For this reason only synchronous user
> management during login time is the only way we can do user
> provisioning for confluence securely and effectively.
We may only have 22,000 users, but they're all in the internal database,
so I wouldn't be too surprised if we run into performance issues with
users and groups as well.
> Interesting, sounds similar to what I did for confluence using
> Varnish. Except that I took it one step further and enabled caching of
> all content distributed to anonymous users with expiration set to a
> few minutes. I also fixed some broken caching instructions that
> confluence sends to clients, which improved the efficiency of static
> assets files in browsers.
I did meddle with some headers:
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
...but nothing more than that, really. I'll do some reading up on that
and Varnish.
> > We've also done a lot of system changes, especially moving the
> > different storage areas around, e.g. index, temp attachments and so
> > on.
>
> Can you elaborate? I haven't heard of anyone doing this.
It's probably mostly because I couldn't get hold of proper hardware. I
ran the system on a blade with 2.5" disks. They were so slow they made
the system grind to a halt. Before we identified the problem with the
usage stat plugin, the writes to the index were horrific, so we mapped
it to a tmpfs (memory-backed) filesystem. We then, amazingly enough,
gained further performance by moving all variable data directories to
NFS(!).
> You have Xms and Xmx there twice, I hope it's just a typo in the email.
Ha! Brilliant! I had forgotten to update the setenv.sh. Thanks for
spotting a bug for me. I owe you a beer.
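
A duplicated -Xms/-Xmx pair like this is easy to miss by eye in a long
option string. The following is a minimal sketch of a check for it, assuming
the options are available as one whitespace-separated string (as in a
JAVA_OPTS line from setenv.sh). The function names and the sample option
string are hypothetical and are not taken from the original configuration.

```python
import re
from collections import Counter
from typing import List

# Options such as -Xms512m or -Xmx2g carry their value directly after the name.
_SIZE_FLAG = re.compile(r"^-(Xms|Xmx|Xmn|Xss)")


def flag_name(option: str) -> str:
    """Return the part of a JVM option that identifies it, without its value."""
    match = _SIZE_FLAG.match(option)
    if match:
        return "-" + match.group(1)
    # For -Dkey=value and -XX:Key=value the name is everything before '='.
    return option.split("=", 1)[0]


def duplicate_flags(java_opts: str) -> List[str]:
    """List option names that appear more than once in a JAVA_OPTS string."""
    counts = Counter(flag_name(opt) for opt in java_opts.split())
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical option string, made up for illustration only.
    opts = "-Xms1024m -Xmx2048m -Djava.awt.headless=true -Xms2048m -Xmx2048m"
    print(duplicate_flags(opts))
```

The sketch only compares option names, so it flags a repeated option even
when both copies carry the same value.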
>
> I also found that increasing the young generation makes things better
> as it prevents some short-lived object from being promoted to the old
> generation.
Ah, that is very interesting. I'll talk to one of the JVM Teflon heads
here. That makes sense.
>
>
>
> > ...and the Confluence specific are:
> > -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true
> > -Djava.awt.headless=true -Djira.jelly.on=true
> > -Dconfluence.optimize.index.modulo=30
> > -Datlassian.indexing.contentbody.maxsize=131072 -
>
> any docs/comments on these two?
Sorry, but which ones? I'm not being difficult, but it'll take some
time to dig up the details about all of them. Off the top of my head:
-Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true
Limiting of Tomcat (JSP) caching (stability precaution)
-Djava.awt.headless=true
Recommended by Atlassian. Runs AWT in headless mode (no display needed).
-Djira.jelly.on=true
Can't remember!
-Dconfluence.optimize.index.modulo=30
Gah! I know this one, but I can't remember it now and I have to dash
shortly.
-Datlassian.indexing.contentbody.maxsize=131072
The maximum size an indexed document can have. If a document is above
this value, it won't be indexed.
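
Going only by the description above, the size rule can be pictured as the
small sketch below. It is not Confluence's actual code: the function name is
hypothetical, and treating the limit as a byte count is an assumption, since
the unit is not stated here.

```python
# Value taken from the flag above; the unit (bytes) is an assumption.
MAX_BODY_SIZE = 131072


def should_index(body: bytes, max_size: int = MAX_BODY_SIZE) -> bool:
    """Return False for bodies above the limit, which are left out of the index."""
    return len(body) <= max_size
```

Under this reading, a body of exactly 131072 bytes is still indexed and one
byte more is skipped.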
I'll elaborate on any of them if you want me to, but it'll have to be
next week.
Regards,
Dan