Hi all,

I know that we are moving away from the Gerrit DBMS (thanks G) but right now we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses to GerritHub.io, and this caused a waterfall effect on the number of active threads pending, and then a subsequent collapse of the Jetty HTTP connection pool ... result = around 20 minutes of outage :-(

I know that by far the DB is the *weakest link* of Gerrit resilience, but I was wondering if there is a way (in the meantime) to prevent this waterfall effect from happening again, should future DB spikes come whilst we are still busy "ripening" NoteDb :-)

Any feedback is highly appreciated :-)

Luca.
On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:

On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
> This morning (5:22 AM PST) we had a spike of DB accesses to GerritHub.io,
> and this caused a waterfall effect on the number of active threads pending

Pending for a DB connection, or pending on a running DB query/statement?

If pending for a DB connection, then you may have a DB connection pool smaller than the number of worker threads (SSH + HTTP).

If pending on a DB query/statement to finish, then your DB might need some tuning. Is your DB cache large enough? Have you traced for long-running SQL queries in your DB?
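Concretely, the sizing relationship Saša describes is controlled in gerrit.config. A minimal sketch, assuming the stock 2.x option names (the values are purely illustrative, not recommendations):

    [database]
      poolLimit = 80    # keep >= sshd.threads + httpd.maxthreads
    [sshd]
      threads = 30
    [httpd]
      maxThreads = 50

If poolLimit is below the worker-thread total, a DB spike leaves the excess workers parked waiting for a pooled connection, which is exactly the pile-up described above.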
Hi Saša,

see my feedback below.

On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:
> Pending for a DB connection, or pending on a running DB query/statement?
> If pending for a DB connection, then you may have a DB connection pool
> smaller than the number of worker threads (SSH + HTTP).

This is what we had. However, when the pool is full there will be lots of threads waiting for a connection to become active ... and as the incoming traffic is huge, you'll be running out of threads in seconds :-(

> If pending on a DB query/statement to finish, then your DB might need some tuning.

Yes, this is a possibility :-) However, my point was: *if* the DB is getting slow for any reason ... can I prevent an outage and have some "service degradation" instead?

> Is your DB cache large enough? Have you traced for long-running SQL queries in your DB?

Yes, the cache is full ... but there are still SQL queries that could be avoided :-( I've pushed a couple of changes; one has been merged, but I'm sure there is more. However, relying too much on caching may grow the JVM heap too much, resulting in quite a waste of CPU for GC. I remember Shawn mentioned they've shut down the cache altogether because of the huge CPU usage.
On 3 Sep 2015, at 09:48, Saša Živkov <ziv...@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:26 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
> This is what we had. However, when the pool is full there will be lots of
> threads waiting for a connection to become active ... and as the incoming
> traffic is huge, you'll be running out of threads in seconds :-(

How were those threads named?

If you have configured more SSH+HTTP worker threads than the connections in the DB connection pool, then I can understand what you describe. However, if the DB connection pool is equal to or larger than the number of SSH+HTTP threads, then I would like to know which threads (their names) you have seen waiting.

> Yes, this is a possibility :-) However, my point was: *if* the DB is getting
> slow for any reason ... can I prevent an outage and have some "service
> degradation" instead?

If all SSH + HTTP worker threads are busy, then a new request should time out after some time and the response should be something like "Service Unavailable" ... at least this is what I would expect :-)

> I remember Shawn mentioned they've shut down the cache altogether because
> of the huge CPU usage.

I was talking about the DB cache, which is not in Gerrit's JVM.
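One way to answer the thread-name question is a thread dump: jstack <pid> from outside, or programmatically from within Gerrit's JVM (e.g. from a plugin). A minimal sketch; the "HTTP" prefix is an assumption based on the pool name Gerrit sets, shown later in the thread:

    // Lists worker threads that are parked waiting (e.g. for a pooled
    // DB connection). Must run inside the same JVM as Gerrit.
    static void dumpWaitingHttpThreads() {
      for (Thread t : Thread.getAllStackTraces().keySet()) {
        Thread.State s = t.getState();
        if (t.getName().startsWith("HTTP")
            && (s == Thread.State.WAITING || s == Thread.State.TIMED_WAITING)) {
          System.out.println(t.getName() + " -> " + s);
        }
      }
    }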
Hi Saša,

see my feedback below.

On 3 Sep 2015, at 09:48, Saša Živkov <ziv...@gmail.com> wrote:
> How were those threads named?

They were all incoming HTTP connection threads. SSH traffic was absolutely fine.

Looking at Jetty's QueuedThreadPool.execute(), this is the code path that rejected the new jobs:
public void execute(Runnable job)
{
    if (!isRunning() || !_jobs.offer(job))
    {
        LOG.warn("{} rejected {}", this, job);
        throw new RejectedExecutionException(job.toString());
    }
    else
    {
        // Make sure there is at least one thread executing the job.
        if (getThreads() == 0)
            startThreads(1);
    }
}
This means that either the QueuedThreadPool wasn't running (not very likely, as it was working before the fault) or the jobs queue (_jobs) rejected the new incoming item.
The only reason the _jobs BlockingQueue cannot accept a new element is that the queue has reached its maximum capacity.
The problem is: org.eclipse.jetty.util.BlockingArrayQueue is supposed to auto-grow when its current capacity is reached. How come it wasn't able to add a new element? Has the auto-grow mechanism stopped somehow?
There are two conditions for the BlockingArrayQueue to fail to add a new element (see the sketch after the list):
1. It has already reached the *absolute* max capacity
2. It tried to grow but failed to do so
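A minimal sketch of condition 1, using Jetty's BlockingArrayQueue(capacity, growBy, maxCapacity) constructor with the same parameters Gerrit passes by default (shown just below); the class name is mine:

    import org.eclipse.jetty.util.BlockingArrayQueue;

    public class QueueCapacityDemo {
      public static void main(String[] args) {
        // Gerrit's defaults: initial capacity 5, grow by 5, hard limit 50.
        BlockingArrayQueue<Runnable> q = new BlockingArrayQueue<>(5, 5, 50);
        int accepted = 0;
        while (q.offer(() -> {})) {
          accepted++;
        }
        // offer() keeps succeeding (growing the queue in steps of 5) until
        // the *absolute* maxCapacity is hit; prints "accepted: 50".
        System.out.println("accepted: " + accepted);
      }
    }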
Gerrit configures the QueuedThreadPool in this way:
private ThreadPool threadPool(Config cfg) {
  int maxThreads = cfg.getInt("httpd", null, "maxthreads", 25);
  int minThreads = cfg.getInt("httpd", null, "minthreads", 5);
  int maxQueued = cfg.getInt("httpd", null, "maxqueued", 50);
  int idleTimeout = (int) MILLISECONDS.convert(60, SECONDS);
  int maxCapacity = maxQueued == 0
      ? Integer.MAX_VALUE
      : Math.max(minThreads, maxQueued);
  QueuedThreadPool pool = new QueuedThreadPool(
      maxThreads,
      minThreads,
      idleTimeout,
      new BlockingArrayQueue<Runnable>(
          minThreads, // capacity
          minThreads, // growBy
          maxCapacity // maxCapacity
      ));
  pool.setName("HTTP");
  return pool;
}
The growth factor is set to minThreads (5 by default), so the queue should grow when needed. The only reason we get the error, then, is that the queue has reached its maxCapacity, which is the maximum of httpd.maxqueued and httpd.minthreads (50 with the defaults).
So I guess that, to allow a "graceful failure" instead of just "closing the door", I should let more incoming connections be queued. By default Gerrit allows no more than 50 queued connections, which is a bit low.
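For example, a possible interim mitigation in gerrit.config (the value is purely illustrative):

    [httpd]
      maxQueued = 200

And, per the threadPool() code above, setting httpd.maxqueued to 0 makes the queue effectively unbounded (Integer.MAX_VALUE), at the cost of letting requests pile up in memory for as long as a DB slowdown lasts.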
Luca.