Gerrit resiliency: how to prevent the DB from screwing up the thread pool?


lucamilanesio

Sep 3, 2015, 9:04:21 AM
to Repo and Gerrit Discussion
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads and then a subsequent collapse of the Jetty HTTP connection pool ... result = around 20 minutes of outage :-(

I know that the DB is by far the *weakest link* of Gerrit resilience, but I was wondering if there is a way (in the meantime) to prevent this waterfall effect from happening again, should future DB spikes come whilst we are still busy "ripening" NoteDB :-)

Any feedback is highly appreciated :-)

Luca.

Saša Živkov

Sep 3, 2015, 11:54:12 AM
to lucamilanesio, Repo and Gerrit Discussion
On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads
 
Pending for a DB connection, or pending on a running DB query/statement?
If pending for a DB connection, then your DB connection pool may be smaller than the number of worker threads (SSH + HTTP).
If pending on a DB query/statement to finish, then your DB might need some tuning.
Is your DB cache large enough? Have you traced long-running SQL queries in your DB?
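On the pool-sizing point above, a minimal gerrit.config sketch of the relationship (the option names are real Gerrit settings; the numbers are purely illustrative, not a recommendation):

    [sshd]
      threads = 16
    [httpd]
      minThreads = 5
      maxThreads = 25
    [database]
      # keep the DB pool at least as large as the SSH + HTTP workers
      # that may each hold one connection, plus some headroom
      poolLimit = 48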

and then a subsequent collapse of the Jetty HTTP connection pool ... result = around 20 minutes of outage :-(

I know that the DB is by far the *weakest link* of Gerrit resilience, but I was wondering if there is a way (in the meantime) to prevent this waterfall effect from happening again, should future DB spikes come whilst we are still busy "ripening" NoteDB :-)

Any feedback is highly appreciated :-)

Luca.


Luca Milanesio

Sep 3, 2015, 12:27:04 PM
to Saša Živkov, Repo and Gerrit Discussion
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads
 
Pending for a DB connection or pending on a running DB query/statement? 
If pending for a DB connection then you may have the DB connection pool smaller than the number of the worker threads (SSH + HTTP).

This is what we had; however, when the pool is full there will be lots of threads waiting for a connection to become available … and as the incoming traffic is huge, you'll run out of threads in seconds :-(

If pending on a DB query/statement to finish then your DB might need some tuning.

Yes, this is a possibility :-) However, my point was: "IF" the DB is getting slow for any reason … can I prevent an outage and have some "service degradation" instead?

Is your DB cache large enough? Have you traced for long running SQL queries in your DB?

Yes, the cache is full … but there are still SQL queries that could be avoided :-(
I've pushed a couple of changes; one has been merged, but I'm sure there is more.

However, relying too much on caching may grow the JVM heap too much, resulting in quite a waste of CPU on GC.
I remember that Shawn mentioned they shut down the cache altogether because of the huge CPU usage.

Did you guys have this problem in the past? How does Gerrit behave when the DB slows down for any reason?

Saša Živkov

Sep 3, 2015, 12:49:26 PM
to Luca Milanesio, Repo and Gerrit Discussion
On Thu, Sep 3, 2015 at 6:26 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads
 
Pending for a DB connection or pending on a running DB query/statement? 
If pending for a DB connection then you may have the DB connection pool smaller than the number of the worker threads (SSH + HTTP).

This is what we had, however when the pool is full there will be lots of threads waiting for a connection to be active

How were those threads named?
If you have configured more SSH+HTTP worker threads than connections in the DB connection pool, then I can understand what you describe.
However, if the DB connection pool is equal to or larger than the number of SSH+HTTP threads, then I would like to know which threads (their names) you have seen waiting.

 
… and as the incoming traffic is huge, you’ll be running out of threads in seconds :-(

If pending on a DB query/statement to finish then your DB might need some tuning.

Yes, this is a possibility :-) However my point was “IF” the DB is getting slow for any reason … can I prevent an outage and have some “service degradation” ?
 
If all SSH + HTTP worker threads are busy, then a new request should time out after some time and the response should be something like "Service Unavailable" ... At least this is what I would expect :-)



Is your DB cache large enough? Have you traced for long running SQL queries in your DB?

Yes, the cache is full … but there are still SQL queries that could be avoided :-(
I’ve pushed a couple of changes, one has been merged but I’m sure that there is more.

However relying too much on caching may increase too much the JVM heap and thus resulting in quite a waste of CPU for GC.
I remember that Shawn mentioned they’ve shutdown cache altogether because of the huge CPU usage.

I was talking about the DB cache, which is not in Gerrit's JVM.

Martin Fick

Sep 3, 2015, 12:58:13 PM
to repo-d...@googlegroups.com, Saša Živkov, Luca Milanesio
On Thursday, September 03, 2015 06:48:40 PM Saša Živkov wrote:
There are more threads in Gerrit than just the user-facing
ones, so the count should actually be higher. There are
also the indexing threads, the merge queue (when it
existed), and the replication threads. There may be more
that I am forgetting (any background processes, the plugin
loader? ...). Also, I think both the HTTP and SSHD have
acceptor threads that can easily be forgotten.

-Martin





--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Luca Milanesio

Sep 3, 2015, 1:06:50 PM
to Saša Živkov, Repo and Gerrit Discussion
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 09:48, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 6:26 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads
 
Pending for a DB connection or pending on a running DB query/statement? 
If pending for a DB connection then you may have the DB connection pool smaller than the number of the worker threads (SSH + HTTP).

This is what we had, however when the pool is full there will be lots of threads waiting for a connection to be active

How were those threads named?

They were all incoming HTTP connection threads. SSH traffic was absolutely fine.

If you have configured more SSH+HTTP worker threads than connections in the DB connection pool, then I can understand what you describe.
However, if the DB connection pool is equal to or larger than the number of SSH+HTTP threads, then I would like to know which threads (their names) you have seen waiting.

Yes, that could be an option: increase the DB connection pool. I then need to understand how to soft-fail the incoming HTTP connections without giving the impression of a service failure.
(If all new incoming HTTP connections are failing, the site appears to be down anyway.)


 
… and as the incoming traffic is huge, you’ll be running out of threads in seconds :-(

If pending on a DB query/statement to finish then your DB might need some tuning.

Yes, this is a possibility :-) However my point was “IF” the DB is getting slow for any reason … can I prevent an outage and have some “service degradation” ?
 
If all SSH + HTTP worker threads are busy then a new request should timeout after some time and the response should
be something like "Service Unavailable" ... At least this is what I would expect :-)

Cool, let me try to reproduce the situation later today and I'll tell you what the user experience is :-)

Possibly a different strategy could be implemented where you just "kill" the sessions that are monopolising the DB too much.
Or, alternatively, provide a "degraded" service through a new dedicated HTTP thread pool: maybe a set of services that do not touch the DB at all :-)




Is your DB cache large enough? Have you traced for long running SQL queries in your DB?

Yes, the cache is full … but there are still SQL queries that could be avoided :-(
I’ve pushed a couple of changes, one has been merged but I’m sure that there is more.

However relying too much on caching may increase too much the JVM heap and thus resulting in quite a waste of CPU for GC.
I remember that Shawn mentioned they’ve shutdown cache altogether because of the huge CPU usage.

I was talking about the DB cache which is not in Gerrit's JVM.

Ah, understood. Will check the DB cache, thanks for the hint :-)

Luca Milanesio

Sep 3, 2015, 1:07:52 PM
to Martin Fick, repo-d...@googlegroups.com, Saša Živkov
Thanks Martin for your feedback.
Will need to make some calculations to take all of the threads into account.
Oh yes, true.

Saša Živkov

Sep 4, 2015, 5:23:27 AM
to Martin Fick, repo-d...@googlegroups.com, Luca Milanesio
Correct... I wasn't really precise in mentioning all the thread pools
which could use DB connections. But the principle is clear: the DB
connection pool must be larger than the number of all threads in
Gerrit which consume DB connections (assuming each thread
consumes at most one connection at a time, which is mostly the case).
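As a rough, illustrative budget in gerrit.config terms (real option names, made-up numbers; background pools such as indexing and replication are only counted as headroom here):

    [sshd]
      threads = 16          # SSH workers
    [httpd]
      maxThreads = 25       # HTTP workers
    [database]
      # 16 + 25 = 41 user-facing workers; add headroom for index,
      # replication and other background threads Martin listed
      poolLimit = 64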

Saša Živkov

Sep 4, 2015, 7:52:00 AM
to Luca Milanesio, Repo and Gerrit Discussion
On Thu, Sep 3, 2015 at 7:06 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 09:48, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 6:26 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
Hi Saša,
see my feedback below.

On 3 Sep 2015, at 08:53, Saša Živkov <ziv...@gmail.com> wrote:



On Thu, Sep 3, 2015 at 3:04 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Hi all,
I know that we are moving away from Gerrit DBMS (thanks G) but right now, we still have some stuff on it :-(

This morning (5:22 AM PST) we had a spike of DB accesses on GerritHub.io, and this caused a waterfall effect on the number of pending active threads
 
Pending for a DB connection or pending on a running DB query/statement? 
If pending for a DB connection then you may have the DB connection pool smaller than the number of the worker threads (SSH + HTTP).

This is what we had, however when the pool is full there will be lots of threads waiting for a connection to be active

How were those threads named?

They were all incoming HTTP connection threads. SSH traffic was absolutely fine.

Then it might be something in the Jetty configuration... because if the DB connection pool were exhausted, then the SSH requests would be blocked as well.

Luca Milanesio

Sep 4, 2015, 8:39:50 AM
to Saša Živkov, Repo and Gerrit Discussion
Hi Saša,
I double-checked the DB connection pool and it wasn’t actually exhausted :-O

There were 27 active DB connections whilst the pool was set to 32, and the number of active threads was 23 (4 threads were possibly just waiting for their SQL queries to finish).
The DB was possibly slow but healthy: I did not need to do anything to recover / restart it, it was only a temporary slowdown.

There was a "jetty.io.SelectorManager:659 - Could not process key for channel java.nio.channels.SocketChannel" in the logs … so possibly Jetty had some trouble keeping up, which seems weird.
I need to simulate a similar situation again, as there could possibly be a problem in Jetty itself rather than in Gerrit.

Will keep the mailing list posted :-)

Luca.

lucamilanesio

Sep 4, 2015, 8:49:59 AM
to Repo and Gerrit Discussion, ziv...@gmail.com
There we go [1], someone else had exactly the same problem :-(
It seems that it comes up when the Jetty connection pool is under stress ...

There is no "gerrit" in the stack trace, so it is a Jetty-specific issue, I guess.

Luca.



lucamilanesio

Sep 4, 2015, 9:39:32 AM
to Repo and Gerrit Discussion, ziv...@gmail.com
The point where Jetty throws that exception is:

    public void execute(Runnable job)
    {
        if (!isRunning() || !_jobs.offer(job))
        {
            LOG.warn("{} rejected {}", this, job);
            throw new RejectedExecutionException(job.toString());
        }
        else
        {
            // Make sure there is at least one thread executing the job.
            if (getThreads() == 0)
                startThreads(1);
        }
    }


This means that either the QueuedThreadPool wasn't running (not very likely, as it was working before the fault) or the jobs queue (_jobs) refused to accept the new incoming item.


The only reason why the _jobs BlockingQueue cannot accept a new element is ... the queue has reached its maximum capacity.


The problem is: org.eclipse.jetty.util.BlockingArrayQueue is supposed to auto-grow when its current capacity is reached. How come it wasn't able to add a new element? Has the auto-grow mechanism stopped somehow?


There are two conditions for the BlockingArrayQueue to fail to add a new element:

1. It has already reached the *absolute* max capacity 

2. It tried to grow but failed to do so


Gerrit configures the QueuedThreadPool in this way:


  private ThreadPool threadPool(Config cfg) {
    int maxThreads = cfg.getInt("httpd", null, "maxthreads", 25);
    int minThreads = cfg.getInt("httpd", null, "minthreads", 5);
    int maxQueued = cfg.getInt("httpd", null, "maxqueued", 50);
    int idleTimeout = (int) MILLISECONDS.convert(60, SECONDS);
    int maxCapacity = maxQueued == 0
        ? Integer.MAX_VALUE
        : Math.max(minThreads, maxQueued);
    QueuedThreadPool pool = new QueuedThreadPool(
        maxThreads,
        minThreads,
        idleTimeout,
        new BlockingArrayQueue<Runnable>(
            minThreads, // capacity
            minThreads, // growBy
            maxCapacity // maxCapacity
    ));
    pool.setName("HTTP");
    return pool;
  }


The growth factor is set to minThreads (5 by default), so the queue should grow if needed. The only reason we then get the error is that the queue has reached its maxCapacity, which is the maximum of httpd.maxqueued and httpd.minthreads.


So I guess that, in order to resolve the problem and allow a "graceful failure", I should just allow more incoming connections to be queued instead of "closing the door". By default Gerrit allows no more than 50 queued connections, which is a bit low.
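With the defaults above, maxCapacity = max(minThreads, maxQueued) = max(5, 50) = 50, so the 51st queued request is rejected. A minimal gerrit.config sketch of the tuning I have in mind (values illustrative, not a recommendation):

    [httpd]
      minThreads = 5
      maxThreads = 25
      # raises the BlockingArrayQueue maxCapacity from 50 to 200, so bursts
      # queue up instead of being rejected with RejectedExecutionException
      maxQueued = 200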


Luca.

lucamilanesio

Sep 4, 2015, 9:53:59 AM
to Repo and Gerrit Discussion, ziv...@gmail.com
I was wondering if it makes sense to reserve a number of threads for a "degraded" experience in case of huge load.

A "degraded" experience could:
- only read data already in memory
- not allow changes

As we know the DB is the weakest link; we shouldn't need the DB for authentication and access control when the cache is fully populated.
Everything else should already be indexed in Lucene, shouldn't it?

Luca.

Saša Živkov

Sep 4, 2015, 10:06:29 AM
to lucamilanesio, Repo and Gerrit Discussion
Indeed.

I just checked the httpd.maxQueued parameter on one of our production servers and found that it was
increased to 200, with the following commit message:

    Increase httpd.maxQueued to 200.

    This is an attempt to address the following issue, from the error_log:

    [2015-01-07 14:57:22,924] WARN  org.eclipse.jetty.io.SelectorManager :
    Could not process key for channel java.nio.channels.SocketChannel[connected local=/10.66.148.128:8080 remote=/10.68.32.204:59801]
    java.util.concurrent.RejectedExecutionException: org.eclipse.jetty.io.AbstractConnection$1@43c70442
            at org.eclipse.jetty.util.thread.QueuedThreadPool.execute(QueuedThreadPool.java:361)
            at org.eclipse.jetty.io.AbstractConnection$FillingState.onEnter(AbstractConnection.java:344)
            at org.eclipse.jetty.io.AbstractConnection.next(AbstractConnection.java:238)
            at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:528)
            at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:82)
            at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:186)
            at org.eclipse.jetty.io.AbstractConnection$1.run(AbstractConnection.java:505)
            at org.eclipse.jetty.io.AbstractConnection$FillingState.onEnter(AbstractConnection.java:346)
            at org.eclipse.jetty.io.AbstractConnection.next(AbstractConnection.java:238)
            at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:528)
            at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:82)
            at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:109)
            at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:571)
            at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:542)
            at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:484)
            at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
            at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
            at java.lang.Thread.run(Thread.java:812) 




luca.mi...@gmail.com

Sep 4, 2015, 10:36:51 AM
to Saša Živkov, Repo and Gerrit Discussion
Yep, maybe I'll post a change to increase the default value, as it seems that more people are having this issue and the Jetty exception is not obvious :-)

Luca

Sent from my iPhone

lucamilanesio

Sep 4, 2015, 11:05:04 AM
to Repo and Gerrit Discussion, ziv...@gmail.com
Change pushed at [1]; I raised the default value to 200.
Hopefully fewer people in the future will need to interpret the Jetty exception stack trace ;-)

Luca.


