Performance issues in a high demand environment


Sunny Ahuwanya

Mar 12, 2014, 4:24:07 PM
to npgsq...@googlegroups.com
Hello all,

I recently noticed that Npgsql didn't perform well in a high demand environment (a site that has many users and huge spikes in traffic every now and then).
I discovered that the cause of the issue was that the connection pool distributed connectors in no particular order.

Say you have two web requests that arrive at about the same time, and each request makes 4 DB calls. I'll call the DB calls 1A, 1B, 1C and 1D for the first web request, and 2A, 2B, 2C and 2D for the second web request.

Let's say the DB operations try to grab connectors from the pool in the order 1A, 2A, 1B, 2B, 1C, 2C, 1D, 2D. The connectors might actually end up being distributed in the order 1A, 1B, 1C, 2A, 1D, 2B, 2C, 2D.
In the above example, 1B and 1C got connectors before 2A got one.

In a high demand environment where many threads (web requests) are simultaneously making DB calls, 2A might end up waiting a very long time while other threads are handed connectors, even though 2A made its request before those other threads did. This often leads to an artificial timeout.
What's worse, all successful DB calls made within a web request are wasted if a subsequent DB call times out: the web request only succeeds when every DB call within it succeeds.

The fix to this problem is to queue up requests for connectors and process them one by one.
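The queued-requests idea can be sketched in a few lines. Below is a language-agnostic illustration in Python (Npgsql itself is written in C#, so all names here are hypothetical): a free connector is handed out immediately only when nobody is already waiting; otherwise the request joins a FIFO queue and is served strictly in arrival order, so no request can be starved by later arrivals.

```python
import threading
from collections import deque

class FifoPool:
    """Toy connector pool that hands out connectors in strict request order.
    Hypothetical sketch; not the actual Npgsql implementation."""

    def __init__(self, connectors):
        self._lock = threading.Lock()
        self._free = list(connectors)
        self._waiters = deque()  # FIFO queue of waiting requests

    def acquire(self):
        with self._lock:
            # Hand out a free connector only if nobody arrived before us.
            if self._free and not self._waiters:
                return self._free.pop()
            ev = threading.Event()
            self._waiters.append(ev)
        ev.wait()  # block until release() hands us a connector
        return ev.connector

    def release(self, connector):
        with self._lock:
            if self._waiters:
                # Oldest waiting request is served first.
                ev = self._waiters.popleft()
                ev.connector = connector
                ev.set()
            else:
                self._free.append(connector)
```

With this scheme, the 1A 2A 1B 2B... arrival order from the example above is preserved: 2A is guaranteed a connector before 1B, because 1B queues up behind it.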

Another issue is that I saw my server application run into a StackOverflowException and go down.
It doesn't happen very often (it has only happened three times), but it's still a cause for concern.
I'm not sure what caused it, but I suspect the recursion in GetPooledConnector() might be to blame.
Even if that's not the cause, it's safer to convert the recursion into an iterative loop, since all sorts of code depend on this library.

I've created Pull Requests 178 and 182 to resolve these issues.

Thanks,
Sunny

Olivier MATROT

Apr 16, 2014, 3:31:42 AM
to npgsq...@googlegroups.com
Hi Sunny,

I have a few questions for you.

Is there a reason you need to open and close a connection 4 times on the same thread, instead of using only one connection?
What is the pool size in your connection string? In particular, what is the minimum requested size?

I'm sure the locking mechanism in the pool is not that perfect for a high-concurrency environment.
I'm also not sure that queuing the requests for connectors is a good idea, because it may not be suitable for all contexts.

Kind Regards,
Olivier.

Sunny Ahuwanya

Apr 16, 2014, 2:35:44 PM
to npgsq...@googlegroups.com
Hi Olivier,

See my responses inline:


On Wednesday, April 16, 2014 3:31:42 AM UTC-4, Olivier MATROT wrote:
Hi Sunny,

I have a few questions for you.

Is there a reason you need to open and close a connection 4 times on the same thread, instead of using only one connection?

I have a data layer that I make calls to from a business layer that returns a model filled with data fetched from the DB. This is a common pattern.
I make a call to the data layer to fetch some data and based on the content of that data and other factors I could make another call to fetch some other data and so forth.
Combine this with DB calls that are common to web applications, such as user authentication and logging, and you can easily get more than 6 DB calls per thread.

I do not want to pass a connection object to each data layer call because the business layer shouldn't care about the implementation detail of the data layer.
Moreover, the main advantage of pooling is that we don't have to keep track of connection objects. That's automatically handled by the client library.
The client library will pool open connections and only close them when they get stale or have errors, so multiple DB calls per thread are still relatively efficient.
 
What is the pool size in your connection string? In particular, what is the minimum requested size?

When I experienced this issue, I was using a Postgres-as-a-service provider that had a maximum connection limit of 20. I had a maximum pool size of 20 (the highest I could go) and a minimum pool size of 1. I applied the patch, and that brought the error count down significantly.
I've since switched to a Heroku Postgres plan with a limit of 200 connections and adjusted the connection string to a max pool size of 200. I hardly run into any more errors (and I'm still using the patch).

It's worth noting that the default max pool size is 20, so users who do not set it in the connection string will run into this issue during a sudden spike in traffic, even if they are using an internal server with an effectively unlimited connection limit (barring port exhaustion and memory).

Also, even with a large max pool size in the connection string (say 300), the client can still perform terribly without the queued-requests patch during a huge spike: say 10,000 users visit the site over a period of 1 minute, which translates to roughly 167 users per second, or about 833 DB calls per second (at an average of 5 DB calls per user).
Each connection in the pool then has an average of about 2.8 operations to perform per second, which is fine (on a fast network with a fast DB). The queued-requests patch will handle this situation smoothly, whereas the existing code handles it well only until the 301st request, when new requests start waiting for old requests to complete.
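The arithmetic above can be checked directly:

```python
users = 10_000          # visitors during the spike
window_seconds = 60     # spike duration
calls_per_user = 5      # average DB calls per visit
pool_size = 300         # max pool size in the connection string

users_per_second = users / window_seconds                # ≈ 166.7
db_calls_per_second = users_per_second * calls_per_user  # ≈ 833.3
calls_per_connection = db_calls_per_second / pool_size   # ≈ 2.8
```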

Although such traffic is unlikely for 90% of all websites out there, you can imagine a big sports website with a huge DB server running into this issue.
 

I'm sure the locking mechanism in the pool is not that perfect for a high-concurrency environment.
I'm also not sure that queuing the requests for connectors is a good idea, because it may not be suitable for all contexts.

One other way I thought of was to have a dedicated worker thread to service the requests queue but that seemed more complex than having the requesting threads service the queue.
Do you by chance know how the SQL Server client deals with this issue?

Best Regards,
Sunny
 

Olivier MATROT

Apr 17, 2014, 3:24:41 AM
to npgsq...@googlegroups.com
Hi Sunny,

I was not aware that a thread grabbing a connection has to sleep for 1 second before trying again if none is available.
I think your fix will do; I'll try it in our stress test environment.

Thanks for sharing the details of your software architecture.
I'm using only one connection per thread in a WCF service context, and I never experienced your problem after tweaking the min/max pool size in the connection string. Fortunately, I'm not limited in the number of connections to the database.


On Wednesday, March 12, 2014 9:24:07 PM UTC+1, Sunny Ahuwanya wrote:

Sunny Ahuwanya

Apr 17, 2014, 11:56:48 AM
to npgsq...@googlegroups.com
You're welcome Olivier,
Let me know if you run into any issues.

Olivier MATROT

Jun 2, 2014, 4:37:21 AM
to npgsq...@googlegroups.com
Hi Sunny,

I'm back on this subject.
On May 30th, one of our customers experienced a BSOD on their PostgreSQL 9.2 database server (yes, I'm running PostgreSQL on Windows, and it works fine).
I don't know whether what we experienced (see below) is tied to your modifications, but I wanted to share it with you anyway.

The connector pool's resilience logic started removing faulty connections from the pool. Because no timeout is specified for the connection test NpgsqlConnector.IsValid(), the default of 20 seconds applied. I have 20 connections in the pool, so it took about 7 minutes to recover the pool. I don't know why it took the full timeout to see that a connection was broken...

At a minimum, we should make sure a 1-second timeout is used to check that a connection is still available.
And we probably need to recover broken connections on a per-request basis instead of cleaning the entire pool.
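A per-request recovery strategy might look like the following sketch (Python, with hypothetical names; `is_valid` stands in for a short-timeout version of NpgsqlConnector.IsValid()). With a 1-second probe, 20 broken connections cost at most roughly 20 seconds in total, instead of the roughly 7 minutes that a 20-second timeout implies:

```python
def checkout(free_connections, is_valid, timeout_s=1.0):
    """Hand out the next healthy connection, discarding broken ones as they
    are encountered, instead of sweeping the entire pool at once.
    `is_valid(conn, timeout_s)` is a hypothetical probe that must honour
    the short timeout rather than a long default."""
    while free_connections:
        conn = free_connections.pop(0)
        if is_valid(conn, timeout_s):
            return conn
        # Broken connection: drop it and try the next one.
    return None  # caller opens a fresh connection instead
```

This way, a healthy pool pays nothing extra, and after an outage each request absorbs at most the cost of probing the broken connections it happens to encounter.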

What do you think ?

On Wednesday, March 12, 2014 9:24:07 PM UTC+1, Sunny Ahuwanya wrote:

Sunny Ahuwanya

Oct 14, 2014, 2:22:35 PM
to npgsq...@googlegroups.com
Hi Olivier,

I'm sorry for taking so long to respond. I wanted to respond right away but got caught up with so much stuff and completely forgot.
I don't think this is related to my changes.
Yes, a 1-second timeout seems suitable for detecting broken connections, but it may be too short for a slow connection.

Does Postgres send out a heartbeat signal periodically?
Were you able to figure out the cause of the issue?
 
I'm still running my fork in production without any issues.

My apologies again.
-- Sunny