Pool Initialization question


Gregory Haase

Jul 13, 2020, 12:59:41 PM
to HikariCP
Hello,

We are currently trying to replace a vendor-supplied pooling solution with HikariCP (the vendor solution has a memory leak, and although they have given us a patch, they have yet to port it to the most recent versions).

My goal was to swap and optimize the HikariCP pool settings at the same time; however, management has asked that we roll out with the existing pool configuration to minimize risk and accelerate the migration. Our current settings are maxConnections: 200, minIdle: 25.
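For reference, here is a minimal sketch of how those carried-over settings might be written with HikariCP's property names (maximumPoolSize, minimumIdle, connectionTimeout). It only builds a java.util.Properties object. The assumption is that this object is later passed to HikariCP's HikariConfig(Properties) constructor along with a jdbcUrl or dataSourceClassName, which is left out here. The class and method names are hypothetical.

```java
import java.util.Properties;

final class PoolSettings {
    // Settings carried over from the vendor pool, using HikariCP property names.
    // Assumption: the result is handed to new HikariConfig(Properties) elsewhere,
    // together with a jdbcUrl or dataSourceClassName (not shown).
    static Properties carriedOver() {
        Properties p = new Properties();
        p.setProperty("maximumPoolSize", "200");     // vendor "maxConnections"
        p.setProperty("minimumIdle", "25");          // vendor "minIdle"
        p.setProperty("connectionTimeout", "5000");  // milliseconds (5 s)
        return p;
    }

    public static void main(String[] args) {
        carriedOver().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```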

We have frequent (about once every two weeks) issues where we lose customer concurrency and have a subsequent mass login event, which stresses the pool for a brief period. With the vendor-based solution, we see the active connections immediately grow to 200 and then, over a few minutes, taper back down to ~15-20 active connections. During the growth window, there are no timeouts (connection timeout = 5s, query timeout = 2s).

When we load-tested HikariCP, we noted that, again, connections immediately grow to 200 and then taper back down to ~15-20 active connections. During the growth window, there are multiple timeouts, e.g. "Connection is not available, request timed out after 5002ms" (connection timeout/query timeout in Hikari is set to 5s).

Since the steady state is ~15-20 active connections, I attempted to alleviate the issue by using a fixed pool size of 25 connections. Unfortunately, the system exhibited the same issues/timeouts.

I wanted to check my understanding of pool initialization here. I did a bunch of research and read a bunch of past issues, but I'm not 100% sure. My assumption is that although the pool is expected to have 25 minimum connections, those connections are not established until they are actually needed. Since we have a flood of incoming requests all at once, there is a large number of requests for pool connections all at once. Since physical connection requests are blocking, the requests stack up and eventually time out. My further assumption is that the blocking portion includes not only the getConnection, but the actual query execution itself (I agree this is potentially an application issue, and we have plans to fix it, but loading user data for our system is fairly complex and this task is pretty far out).
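To make the "requests stack up and eventually time out" hypothesis concrete, here is a small sketch that models a fixed-size pool as a fair Semaphore. It is not HikariCP code and does not model connection creation. Each caller waits up to a timeout for a permit and then holds it for the duration of a simulated query. All names and numbers are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BurstDemo {
    /** Returns how many of the simultaneous callers give up waiting for a permit. */
    static int timeouts(int poolSize, int requests, long holdMillis, long timeoutMillis)
            throws InterruptedException {
        Semaphore pool = new Semaphore(poolSize, true);  // fair: waiters are served in order
        AtomicInteger timedOut = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService callers = Executors.newFixedThreadPool(requests);
        for (int i = 0; i < requests; i++) {
            callers.submit(() -> {
                try {
                    start.await();                       // release all callers at once
                    if (!pool.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        timedOut.incrementAndGet();      // analogous to "request timed out"
                        return;
                    }
                    try {
                        Thread.sleep(holdMillis);        // simulated query holding the connection
                    } finally {
                        pool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        callers.shutdown();
        callers.awaitTermination(30, TimeUnit.SECONDS);
        return timedOut.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 2 permits, 6 callers, 400 ms per query, 600 ms timeout.
        System.out.println("timeouts: " + timeouts(2, 6, 400, 600));
    }
}
```

With these numbers, the first two callers are served immediately and the next two after about 400 ms. The last two would not be served until about 800 ms, so they should hit the 600 ms timeout. The point is that timeouts in such a model depend on how long each caller holds a connection, not only on how quickly connections are handed out.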

Finally, is this a scenario that would benefit from setting the undocumented "com.zaxxer.hikari.blockUntilFilled" to true?
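For anyone wanting to try it, a sketch of how the flag could be set, assuming it is read as a JVM system property (as its name suggests) when the pool is constructed. The equivalent command-line form would be -Dcom.zaxxer.hikari.blockUntilFilled=true. As far as I can tell from the source, it only takes effect when initializationFailTimeout is greater than 1, but treat that as an assumption to verify. The class name is hypothetical.

```java
final class BlockUntilFilledFlag {
    static final String FLAG = "com.zaxxer.hikari.blockUntilFilled";

    /** Sets the system property; must run before the pool is created (assumption). */
    static boolean enable() {
        System.setProperty(FLAG, "true");
        // Boolean.getBoolean reads a system property and parses it as a boolean.
        return Boolean.getBoolean(FLAG);
    }

    public static void main(String[] args) {
        System.out.println(FLAG + " enabled: " + enable());
    }
}
```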

Sorry for the wall of text here, and thanks in advance,

Greg Haase

Kaartic Sivaraam

Jul 14, 2020, 12:27:51 AM
to Gregory Haase, HikariCP
Hi,
To the best of my understanding, that's not the case. Ideally, the pool would ensure that it always has `minimumIdle` connections. The description of `minimumIdle` in the README [1] seems to support my statement.

"This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than maximumPoolSize, HikariCP will make a best effort to add additional connections quickly and efficiently. However, for maximum performance and responsiveness to spike demands, we recommend not setting this value and instead allowing HikariCP to act as a fixed size connection pool. Default: same as maximumPoolSize"

The only exception is the brief period after the pool initialization.

[1]: https://github.com/brettwooldridge/HikariCP

> Since we have a flood of incoming requests all at once, there is a large number of requests for pool connections all at once. Since physical connection requests are blocking, the requests stack up and eventually time out.

This is likely the cause, as HikariCP tries to honor the `connectionTimeout` contract strictly [2]. You could try increasing `connectionTimeout`, which might help.

[2]: https://github.com/brettwooldridge/HikariCP/wiki/Bad-Behavior:-Handling-Database-Down

> My further assumption is that the blocking portion includes not only the getConnection, but the actual query execution itself (I agree this is potentially an application issue, and we have plans to fix it, but loading user data for our system is fairly complex and this task is pretty far out).
>
> Finally, is this a scenario that would benefit from setting the undocumented "com.zaxxer.hikari.blockUntilFilled" to true?

I don't think so. It would help if you get spike demand immediately after the pool initialization. Even if that's the case for you, I don't think it would help due to the huge gap between `minimumIdle` and the spike demand load.

Hope this helps,
Sivaraam


Michael K

Jul 14, 2020, 12:25:38 PM
to HikariCP
I'm new here :) but I'm going to ask if you close everything (ResultSet, Statement/PreparedStatement, Connection) when you're done with it. Do you have some code that doesn't use the pool that isn't closing things? That's how I've seen that "Connection is not available" before. Also what happens if you don't set maxConnections/maximumPoolSize and minIdle and use defaults? 
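A minimal sketch of what "closing everything" can look like with try-with-resources, using only the standard java.sql and javax.sql interfaces. The table, column, class, and method names are hypothetical. With a pooled DataSource, closing the Connection returns it to the pool rather than tearing down the physical connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

final class UserLookup {
    // Hypothetical query; the point is the resource handling, not the SQL.
    static String findName(DataSource pool, long userId) throws SQLException {
        String sql = "SELECT name FROM users WHERE id = ?";
        // Resources are closed in reverse order of declaration, even on exceptions.
        try (Connection conn = pool.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setLong(1, userId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

If any code path borrows a connection without a matching close, that connection stays checked out until it is reclaimed, which is one way to end up with "Connection is not available" under load.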

I see my MySQL fill with the default 10 connections for the pool as soon as I start/deploy the webapp (watching in MySQL Workbench, Client Connections). I actually have 3 webapps in one Tomcat and I see all 3 establish their 10 connections if I restart Tomcat. If you are watching connections, you should be able to see when they are established, right? Since they show up on the MySQL side, I'd say they are actually established at initialization. And you're not initializing your pool right when this login event happens unless you have very bad luck. I haven't looked to see what Hikari is actually doing, just a user. 

Gregory Haase

Jul 14, 2020, 12:28:32 PM
to HikariCP
Thank you for your response.

I understand your points; however, I don't think extending the timeout period is actually a good idea in our case. We have 1000s of requests stacking up behind the getConnection calls. If we extend the timeout, that number is actually going to climb significantly. At that point, the application itself is going to start spewing errors because the thread queues start to back up. It would probably be wiser to shorten the timeout, so those backed-up queries will fail fast.
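A rough way to see why a longer timeout makes the backlog worse: by Little's law, the number of callers blocked in getConnection is roughly bounded by arrival rate times timeout, since no caller waits longer than the timeout. The sketch below just does that arithmetic. The arrival rate of 2,000 requests per second is a made-up figure for illustration, not a measurement from our system, and the class name is hypothetical.

```java
final class BacklogBound {
    /** Upper bound on callers blocked at once: arrival rate x maximum wait. */
    static long maxBlockedCallers(double arrivalsPerSecond, double timeoutSeconds) {
        return (long) Math.ceil(arrivalsPerSecond * timeoutSeconds);
    }

    public static void main(String[] args) {
        double rate = 2000.0;  // hypothetical requests per second during a login storm
        System.out.println("5 s timeout: " + maxBlockedCallers(rate, 5.0));  // prints 10000
        System.out.println("2 s timeout: " + maxBlockedCallers(rate, 2.0));  // prints 4000
    }
}
```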

I think you glossed over a few of the finer points, such as the subsequent steady state under the same load after connections in the pool are established, and the fact that we performed a second test with only maxConnections set at 25.

I just looked back at the metrics implementation in the Hikari source code, and compared that to our metrics. We are using Codahale metrics by supplying the pool config with a MetricRegistry. Our data shows that the pool had 25 active connections at the start of our load test. The getTotalConnections method in HikariPool just reports connectionBag.size(). That method is used not only for metrics, but also by several methods in that class that deal with pool population, including the 'blockUntilFilled' conditional.

We have two ways of counting timeouts: searching the application logs explicitly for connection errors, and additional dashboards built from application metrics.

From the data I realize that the pool does indeed have minIdle connections before we initiate our test. What I still don't understand is:
* Why we initially see a huge number of timeouts and a connection burst before pool usage drops to a steady state below 25 and errors go away
* Why the vendor-supplied pool implementation, although still having the connection burst at onset, does not experience any timeouts (set to 2s as opposed to Hikari's 5s)


The great news is, once the pool usage enters a steady state for both pool implementations, the execution times on the Hikari pool are noticeably faster. 

Sergey Panko

Jul 16, 2020, 1:00:31 AM
to HikariCP

Hi Gregory, I had a similar issue on my project.

I'd recommend experimenting with the following properties:
- connectionTimeout
- idleTimeout
- maxLifetime

Enabling debug logs would also be very helpful.

In my experience, the connection timeout is usually not caused by HikariCP itself; it typically appears because of a slow API response. Maybe your old connection pool was silent about this issue.
Anyway, the debug logs are very detailed and will be very helpful in investigating the issue.
On Tuesday, July 14, 2020 at 19:28:32 UTC+3, Gregory Haase wrote:

Gregory Haase

Jul 27, 2020, 7:34:56 PM
to HikariCP
I haven't completely figured this out yet, but I didn't want to leave the thread hanging. I'll be doing some more tests in the near future and I've also done quite a bit of additional research.

I think I have this narrowed down to an issue of comparing apples to oranges when it comes to Hikari and the vendor-supplied pool. It turns out that the vendor's timeout setting is ONLY active if there are no available connections AND the pool has reached its max. So a thread could be waiting MUCH longer than 2s to get a connection while the pool goes from 25 to 200. We occasionally see RejectedExecutionException from ThreadPoolExecutor hitting its max queue size, but we never could explain them before. We assumed we had tuned things so that we would see connection timeouts and/or slow queries before the thread pool was full. Now we believe this occurs while the pool is being filled. When we were running our tests with Hikari, we didn't see any ThreadPool issues. That indicates that the connectionTimeout property works as intended and we are failing at the point we expected.

On the subject of filling the pool... I do not doubt that Hikari's single thread takes longer to fill the pool than the vendor-supplied pool. I also suspect our connection times are long, which means adding those additional 175 connections would take tens of seconds if not longer (10s seems like an eternity when you have 10,000 concurrent requests waiting). However, I also believe that some of the unexplained errors we saw with the vendor were due to race conditions and/or thread-safety issues. Unfortunately, we can't look at the code.
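A back-of-the-envelope sketch of the two effects described above: how long it takes to add 175 connections when they are opened one at a time, and how long a caller can wait when the timeout clock runs for the whole request versus only once the pool is at max. The 100 ms connect time is a hypothetical figure, not something we measured, and the class and method names are made up.

```java
final class PoolFillEstimate {
    /** Time to open the given number of connections, creatorThreads at a time. */
    static long fillMillis(int connections, long connectMillis, int creatorThreads) {
        long rounds = (connections + creatorThreads - 1) / creatorThreads;  // ceiling division
        return rounds * connectMillis;
    }

    /** Timeout clock starts when the caller asks for a connection. */
    static long worstWaitFromRequest(long timeoutMillis) {
        return timeoutMillis;
    }

    /** Timeout clock starts only once the pool has reached its maximum size. */
    static long worstWaitAtMaxOnly(long timeoutMillis, long fillMillis) {
        return fillMillis + timeoutMillis;
    }

    public static void main(String[] args) {
        long fill = fillMillis(175, 100, 1);  // 25 -> 200 connections, one creator thread
        System.out.println("fill time: " + fill + " ms");                                         // 17500 ms
        System.out.println("wait, 5 s from request: " + worstWaitFromRequest(5000) + " ms");      // 5000 ms
        System.out.println("wait, 2 s at max only: " + worstWaitAtMaxOnly(2000, fill) + " ms");   // 19500 ms
    }
}
```

Under these assumptions, a 2s timeout that only applies at max pool size can allow a much longer wait than a 5s timeout measured from the request, which would be consistent with seeing thread-pool rejections instead of connection timeouts.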

Thanks,

-G



--
You received this message because you are subscribed to the Google Groups "HikariCP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hikari-cp+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hikari-cp/e913ee41-0088-44c0-bc6b-b0475810e16cn%40googlegroups.com.

Gregory Haase

Jul 29, 2020, 3:02:16 PM
to HikariCP
Just wanted to completely close this thread. We conducted a bunch of tests over the past few days and concluded that Hikari meets, and in some cases exceeds, the performance of our existing pool implementation.

Thanks,

-G

Brett Wooldridge

Aug 2, 2020, 5:41:42 AM
to HikariCP
Hi,

Brett here, author of HikariCP. I’m glad you got things stable, but I did want to ask, what is the source of the mass extinction event that is/was occurring?

-Brett

Gregory Haase

Aug 2, 2020, 10:26:46 AM
to HikariCP
Unfortunately, the concurrency drop isn't really within our control. We are a shared service within a much larger organization, and every so often one of the many upstream services has a failure. They see all sorts of issues... DDoS, brief network outages, hardware failures, unscheduled maintenance, developer-induced trauma...
