Dns.BeginGetHostName() giving problems. Make it possible to use sync version as well?


Francisco Figueiredo Jr.

unread,
Jul 9, 2014, 5:00:57 PM7/9/14
to npgsq...@googlegroups.com

Hi all!

Miłosz Kubański sent me a mail about problems he is having with async DNS resolve calls.

Previously we used the synchronous version to resolve DNS names. This was changed by Glen Parker when he was working on the connection timeout optimizations.

Miłosz said that after he reverted the change and used the non-async version, his problems were solved.

I think the async DNS resolve may not work for everybody, so I'd like to propose a connection string setting which would allow the user to specify whether Npgsql should use async or sync DNS resolution. We would need to warn the user that with the sync version, the connection timeout may not be honored if the DNS lookup takes too long.
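To make the proposal concrete, here is a rough sketch of what such a switch could look like (the helper shape and the flag name are hypothetical illustrations, not actual Npgsql code):

```csharp
using System;
using System.Net;

static class DnsResolver
{
    // Hypothetical sketch: pick sync or async resolution based on a
    // connection-string style flag. Not actual Npgsql code.
    public static IPAddress[] Resolve(string host, bool useAsyncDns, int timeoutMs)
    {
        if (!useAsyncDns)
        {
            // Sync path: simple, but the connection timeout cannot be
            // enforced if the DNS server is slow to answer.
            return Dns.GetHostAddresses(host);
        }

        // Async path: lets us give up after timeoutMs. (In a real
        // implementation the still-pending operation should eventually
        // be observed with EndGetHostAddresses.)
        IAsyncResult ar = Dns.BeginGetHostAddresses(host, null, null);
        if (!ar.AsyncWaitHandle.WaitOne(timeoutMs))
            throw new TimeoutException("DNS resolution timed out for " + host);
        return Dns.EndGetHostAddresses(ar);
    }
}
```

With a setting like this, the async path keeps the timeout enforceable, while the sync path trades that guarantee for simplicity.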

What do you all think?



--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://gplus.to/franciscojunior
http://fxjr.blogspot.com
http://twitter.com/franciscojunior

Josh Cooley

unread,
Jul 9, 2014, 9:39:36 PM7/9/14
to fran...@npgsql.org, npgsq...@googlegroups.com
I'd be curious what sort of problems he encountered. I'd prefer to see those fixed rather than just having people turn it off if they encounter the bug.  Interestingly, that method is marked obsolete in the MSDN docs.  Perhaps it's related to the reason it was obsoleted.

Josh
--
You received this message because you are subscribed to the Google Groups "Npgsql Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to npgsql-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Glen Parker

unread,
Jul 9, 2014, 9:48:20 PM7/9/14
to Josh Cooley, fran...@npgsql.org, npgsq...@googlegroups.com
Obsolete?  I guess that's my screw up.  Looks like there are plenty of async alternatives though, so a good start would be to use one that's not obsolete.  I may be able to get that done this weekend, although it's a lot harder to find time for this stuff now that they make me do stuff at work :)

And I definitely agree with Josh, we should drill into this and find the actual problem, not just allow it to be turned off.

-Glen

Shay Rojansky

unread,
Jul 10, 2014, 4:35:00 AM7/10/14
to Glen Parker, Josh Cooley, fran...@npgsql.org, npgsq...@googlegroups.com
Agree with Josh and Glen, this doesn't seem like an option that should be exposed to users - what would they base their decision on? There should definitely be a safe way to asynchronously resolve DNS in .NET, and that's what we should use. BTW, while some methods in the System.Net.Dns class are obsolete, others aren't, so this should be fine.

My work on async actually intersects with this as well - this morning I was just working on implementing the timeout *asynchronously*. So again, not having a thread block with WaitOne() as a timeout implementation, but using a system timer which invokes a callback. I should be able to show this work later today or tomorrow. If you guys agree, it makes sense to me to wait on async DNS until we see where the other async work leads?

Miłosz Kubański

unread,
Jul 10, 2014, 9:31:29 AM7/10/14
to npgsq...@googlegroups.com, fran...@npgsql.org
Hello Everybody.

We've encountered a random (but quite frequent, about 1 in ~12 requests) problem with slow host name resolution. Npgsql throws a connection timeout exception. At first I thought it was a problem with our company's DNS hardware/software and talked with our admin to check that. The network tools don't show any issues.

After that, with Francisco's help, I made a few tests in which I measured the async method's execution time. Indeed there is a problem. Sometimes it takes around 2x longer than the sync method, and sometimes it takes > 15 s (yes, seconds :/) to resolve.

I've tried to reproduce the bug in a test environment, but according to Murphy's Law everything was just fine :). In our system we make heavy use of the parallel extensions as well as native threads. Maybe a combination of boundary conditions causes the problem.

Currently I've made a switch in my local code which allows changing between the sync and async methods, and sent it to the test team (in sync mode) to check the whole system.

If you have any other ideas or code to test, please don't hesitate to contact me.



Best Regards,
Miłosz

Josh Cooley

unread,
Jul 10, 2014, 4:15:16 PM7/10/14
to Miłosz Kubański, npgsq...@googlegroups.com, fran...@npgsql.org
This sounds like the thread pool is full and the completed callback can't be called.  This is a hazard when writing async code.  It's possible the new async work will help with this. Particularly if it removes a wait call.  But when the thread pool is full of work, there's not much you can do to avoid this type of problem short of holding onto any thread you've got.
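A tiny, self-contained repro of the hazard Josh describes (illustrative only; the 2-second sleeps stand in for whatever work is hogging the pool):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class PoolStarvationDemo
{
    // Illustrative only: saturate the worker pool with blocking items, then
    // measure how long a freshly queued callback waits for a free thread.
    public static long MeasureCallbackDelayMs()
    {
        ThreadPool.GetMinThreads(out int minWorkers, out _);

        // Occupy the "warm" worker threads with blocking sleeps.
        for (int i = 0; i < minWorkers * 2; i++)
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(2000));

        var sw = Stopwatch.StartNew();
        using (var done = new ManualResetEvent(false))
        {
            ThreadPool.QueueUserWorkItem(_ => { sw.Stop(); done.Set(); });
            done.WaitOne();
        }
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // The callback typically waits noticeably, because the pool injects
        // new threads only gradually -- mirroring a delayed async DNS
        // completion callback.
        Console.WriteLine("Callback delayed by {0} ms", MeasureCallbackDelayMs());
    }
}
```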

Josh

Shay Rojansky

unread,
Jul 10, 2014, 5:42:36 PM7/10/14
to Josh Cooley, Miłosz Kubański, npgsq...@googlegroups.com, fran...@npgsql.org
I agree with Josh's diagnosis - sounds like the thread pool is full, the system tries to run the async callback in the threadpool, and you get delays because no thread is available. If this is what's happening, it's a dangerous situation that should really be avoided (I think there are all kinds of things that depend on a responsive threadpool in .NET, not just async operations). This could be caused either by an Npgsql bug (holding onto threadpool threads) or by the application.

There's one way to understand what's going on in a definite way: a thread dump while the hanging occurs. This would allow us to understand exactly where all the threads are. Miłosz, if you really can't reproduce this in a test environment, there's a way to use Sysinternals' Process Explorer to view the .NET threads (and I think also to dump them for analysis in Visual Studio). Take a look here: http://blogs.msdn.com/b/dotnet/archive/2012/06/26/new-net-diagnostic-info-added-to-process-explorer.aspx. But be careful in production: I've seen situations where this destabilized the application.


Francisco Figueiredo Jr.

unread,
Jul 10, 2014, 10:18:33 PM7/10/14
to Shay Rojansky, Josh Cooley, Miłosz Kubański, npgsq...@googlegroups.com
In this article: http://msdn.microsoft.com/en-us/library/ms973903.aspx, the author talks about the thread pool, and in a section about deadlocks he uses code to check how many threads are available in the thread pool:

static void ShowAvailableThreads()
{
    int workerThreads, completionPortThreads;

    ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
    Console.WriteLine("WorkerThreads: {0}, CompletionPortThreads: {1}",
        workerThreads, completionPortThreads);
}

I used this code and under mono on OSX, I got:

WorkerThreads: 400, CompletionPortThreads: 16

On Windows 7, I got:

WorkerThreads: 1023, CompletionPortThreads: 1000


Miłosz, try to run this code and see how many threads are available on your system. Thread pool exhaustion would also explain why you only get this problem in your production system: maybe a lot of applications are already using all the threads from the pool.

In the same article, the author says ASP.NET can be configured to change those max values.
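Outside of ASP.NET configuration, the pool limits can also be inspected and adjusted in code; a minimal sketch (the doubling of the minimum is an arbitrary example value):

```csharp
using System;
using System.Threading;

static class PoolConfig
{
    public static void Main()
    {
        // Inspect the current limits...
        ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIo);
        ThreadPool.GetMinThreads(out int minWorkers, out int minIo);
        Console.WriteLine("Max: {0}/{1}, Min: {2}/{3}",
            maxWorkers, maxIo, minWorkers, minIo);

        // ...and raise the minimum so the pool creates threads immediately
        // under a burst of work instead of throttling thread injection.
        bool ok = ThreadPool.SetMinThreads(minWorkers * 2, minIo);
        Console.WriteLine("SetMinThreads succeeded: {0}", ok);
    }
}
```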

I hope it helps.

Francisco Figueiredo Jr.

unread,
Jul 16, 2014, 4:58:45 PM7/16/14
to npgsq...@googlegroups.com, ro...@roji.org, jbco...@tuxinthebox.net, milo...@gmail.com, fran...@npgsql.org


Just another thought:

Given that there may be contention problems with the thread pool and async calls, as Miłosz seems to be having, I think it would be good if there were a choice to use async or sync for DNS queries.

What do you think?
Shay Rojansky

unread,
Jul 17, 2014, 1:55:51 AM7/17/14
to npgsq...@googlegroups.com, fran...@npgsql.org

Miłosz Kubański

unread,
Sep 17, 2014, 11:56:34 AM9/17/14
to npgsq...@googlegroups.com, ro...@roji.org, jbco...@tuxinthebox.net, milo...@gmail.com, fran...@npgsql.org
Hi Francisco.

I've run this code just after the exception was thrown.

WorkerThreads: 32760, CompletionPortThreads: 1000
Windows 8.

So it seems that it's not thread exhaustion. :/
Unfortunately, the problem also occurs in the newest official version.


Best Regards,
Miłosz

Miłosz Kubański

unread,
Sep 18, 2014, 7:22:23 AM9/18/14
to npgsq...@googlegroups.com, ro...@roji.org, jbco...@tuxinthebox.net, milo...@gmail.com, fran...@npgsql.org
I've created a proof-of-concept test program. In my case the async call problem occurs once every few executions of the test program. This proof of concept was written very quickly, but that's not the point:

Just run the following code a few times. I've set the timeout to 1 second to increase the chance of the problem occurring.


Regards,
Miłosz

string host = "google.pl";
int millisecondsTimeout = 1000;
int asyncFails = 0;
int syncFails = 0;
long asyncTime = 0;
long syncTime = 0;
bool breakAfterError = false;
int loopCount = 500;

// --- Async test ---
Console.WriteLine("Async start");

try
{
    bool hasTimeoutOccured = false;

    Parallel.For(0, loopCount, x =>
    {
        var sw = new Stopwatch();

        for (int i = 0; i < loopCount && !(hasTimeoutOccured && breakAfterError); i++)
        {
            sw.Restart();

            IAsyncResult result = Dns.BeginGetHostAddresses(host, null, null);

            if (!result.AsyncWaitHandle.WaitOne(millisecondsTimeout, true))
            {
                // The timeout was used up attempting the DNS lookup.
                ++asyncFails;
                hasTimeoutOccured = true;
            }

            // Note: blocks until the resolve actually completes, even after
            // a timeout above.
            Dns.EndGetHostAddresses(result);

            sw.Stop();

            // NB: shared counters are updated without synchronization here;
            // good enough for a quick test, not for exact numbers.
            asyncTime += sw.ElapsedMilliseconds;

            if (i % 400 == 0)
                Console.Write('.');
        }
    });

    if (hasTimeoutOccured)
        throw new TimeoutException(":/");
}
catch (Exception)
{
    Console.WriteLine("Async :/");
}

// --- Sync test ---
Console.WriteLine("Sync start");

try
{
    bool hasTimeoutOccured = false;

    Parallel.For(0, loopCount, x =>
    {
        var sw = new Stopwatch();

        for (int i = 0; i < loopCount && !(hasTimeoutOccured && breakAfterError); i++)
        {
            sw.Restart();

            Dns.GetHostAddresses(host);

            sw.Stop();

            syncTime += sw.ElapsedMilliseconds;

            if (sw.ElapsedMilliseconds > millisecondsTimeout)
            {
                // The lookup took longer than the allowed timeout.
                ++syncFails;
                hasTimeoutOccured = true;
            }

            if (i % 400 == 0)
                Console.Write('.');
        }
    });

    if (hasTimeoutOccured)
        throw new TimeoutException("Sync");
}
catch (Exception)
{
    Console.WriteLine(":/");
}

Console.WriteLine("\r\n\r\nAsync Fails: {0},\r\nSync fails: {1}", asyncFails, syncFails);

if (breakAfterError == false)
{
    Console.WriteLine("\r\n\r\nAsync Time: \t{0},\r\nSync Time: \t{1}", asyncTime, syncTime);
}

Miłosz Kubański

unread,
Sep 18, 2014, 7:26:22 AM9/18/14
to npgsq...@googlegroups.com, ro...@roji.org, jbco...@tuxinthebox.net, milo...@gmail.com, fran...@npgsql.org
resolveNameResult.JPG

Josh Cooley

unread,
Sep 18, 2014, 1:24:28 PM9/18/14
to Miłosz Kubański, npgsq...@googlegroups.com, Shay Rojansky, Francisco Figueiredo Jr.
This is definitely a thread-pool issue.  If I run the test a second time (after the thread-pool has increased in size) I get no errors. I also get no errors if I run the test in a single thread (just a for loop and not parallel for).  The async is also faster for me in about half the tests (at least once I put a lock around the time counters).

I also put calls into GetAvailableThreads, but the function is not returning a value that represents real threads.  I can look at the process and see that it doesn't have the number of threads claimed.  The timeouts are occurring because it takes some time to realize another thread must be created to service the request.

Shay Rojansky

unread,
Sep 18, 2014, 2:28:34 PM9/18/14
to npgsq...@googlegroups.com, milo...@gmail.com, ro...@roji.org, fran...@npgsql.org
I agree with Josh's analysis. Firing up 500 parallel open attempts which do async I/O is bound to have some slowdown in the beginning...

But here's a crazy thought. If I understand correctly, the use of async for DNS is to implement the timeout. Is npgsql really supposed to manage something like a timeout? Npgsql absolutely *is* supposed to allow the user to configure the backend timeout, and also the socket timeout. But I'm not sure what I think about the idea that we're supposed to monitor every operation we do internally and impose a timeout on it. The same thing goes for how we split the total timeout between the different IPs that get resolved - that seems like quite an extreme and maybe unnecessary complication.

IMHO, if what the user wants is to put a timeout on the *entire* connection open process, they can do that like any timeout on a sync process - by running it in the thread pool and waiting on it with a timeout. If there's some specific problem with DNS, I'd say let the user handle it themselves by resolving outside Npgsql and passing us an IP. Every other library out there that does networking simply resolves internally and makes no attempt to guarantee timeouts for such things...

What do you think?

Josh Cooley

unread,
Sep 18, 2014, 3:09:11 PM9/18/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
I'm split on this.  I currently work on a team that is dissatisfied with the database clients they use (not for postgresql).  There is a lot of work that goes on in fixing up the client to meet expectations.  As a library developer it's easy to agree with your points, but as a consumer of the library it is quite unexpected for Open to take significantly longer than the specified timeout.  If the user has to solve the timeout issue, how different is the solution going to be between different consumers?  If it is largely the same, then why aren't we solving that problem for them?

Glen Parker

unread,
Sep 18, 2014, 3:40:07 PM9/18/14
to Josh Cooley, Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
I'm not split on this at all.  If a timeout is provided, it absolutely should be honored as accurately as possible, which is why I spent the time to re-write the connection code last year to do a better job in that area.  If a timeout is provided, there's very good reason to expect it to mean something.

The way the remaining time is split between the IP addresses we receive is a compromise.  Either you give all remaining time to the current address, or you split it up evenly.  Francisco and I went back and forth on it and neither way seemed absolutely correct, so we essentially flipped a coin over it :)
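For illustration, the even-split compromise Glen describes might be sketched like this (a hypothetical helper, not the actual Npgsql implementation):

```csharp
using System;
using System.Net;
using System.Net.Sockets;

static class ConnectHelper
{
    // Hypothetical sketch of dividing a total timeout evenly across the
    // addresses a hostname resolves to; not the actual Npgsql code.
    public static Socket Connect(IPAddress[] addresses, int port, int totalTimeoutMs)
    {
        int perAddressTimeout = Math.Max(1, totalTimeoutMs / addresses.Length);
        foreach (IPAddress addr in addresses)
        {
            var socket = new Socket(addr.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
            IAsyncResult ar = socket.BeginConnect(addr, port, null, null);
            if (ar.AsyncWaitHandle.WaitOne(perAddressTimeout))
            {
                socket.EndConnect(ar);
                return socket;
            }
            // This address used up its slice; abort the attempt and move on.
            socket.Close();
        }
        throw new TimeoutException("Could not connect within " + totalTimeoutMs + " ms");
    }
}
```

The alternative (giving each address the full remaining timeout) is what stretches a 60-second timeout toward 15 minutes in the 15-IP example below.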

-Glen

Shay Rojansky

unread,
Sep 18, 2014, 3:47:06 PM9/18/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
I see what you mean. At the end of the day, I think npgsql is the first library I've come across that actually tries to deal with stuff like DNS taking too long; any other networking library just does a DNS lookup (I think). Think of it from the other way around: if DNS timeout is an actual issue we should be dealing with in npgsql, it means every library out there that opens a socket must implement DNS timeout as well, since npgsql isn't special in any way here. And they would all run into the lack of a synchronous DNS client in .NET that supports timeouts...

The same goes with the multiple responses we may get for the DNS query. Are we really meant to split the "total timeout" between the hosts? That means that if the client's DNS server happens to have many IPs defined for the same hostname, the timeout value is reduced (this seems to me like an actual bug in npgsql). The only way to guarantee that the Open() method returns within a given timeout, is to run it in a separate thread and wait on that thread.

To summarize, I think it's more an issue of making clear to users exactly what our timeout does and what it doesn't. I may be a minimalist here, but here's what I think. We definitely can and should support query timeouts, implemented both backend- and client-side. But we can make it very clear that if they want DNS timeouts, they need to do it on their own, and make up their own mind as to what happens when resolving gives multiple results. In other words, our timeouts should clearly be on Postgresql backend operations, and not on Npgsql function calls.

That's my 2c anyway...

Shay

Josh Cooley

unread,
Sep 18, 2014, 3:47:24 PM9/18/14
to Glen Parker, Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
Your argument is the reason I'm split.  The counter to that is that a DNS lookup is not a connection to the database.  There's a strong argument that the connection timeout applies to connecting to the database, and not the additional work that goes into connecting to the database.  Should we time out if code execution, JIT time, or garbage collection takes up too much time and we don't even make it to the network connection?  I believe the answer is no, but that is the direction you head when you argue that the timeout applies to more than the connection to the database.  I think the middle ground is not that clear, but it does need some sort of decision.

Shay Rojansky

unread,
Sep 18, 2014, 3:54:02 PM9/18/14
to npgsq...@googlegroups.com, glen...@gmail.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
May I just add a suggestion... It may make sense to look at other database clients to see what they're doing in this respect. Not that it's binding, but it does give a good idea on what users expect.

Glen Parker

unread,
Sep 18, 2014, 4:04:06 PM9/18/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
I can see how this discussion would have merit if the timeout functionality wasn't already written, but why would anyone want to remove good functionality and reduce accuracy in the process?  If we discover that other libraries do a worse job of honoring timeouts than we do, then what, we make our code worse to match??  That's 180° from the direction we should be going.

If the async DNS call is actually broken, then (after we prove it) we can work around it by starting our own task to do a sync DNS call and putting a timeout on the Wait().  But at this point it sounds like we don't actually know what's going wrong, and I have a really hard time believing the async call is really the problem.
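A sketch of that workaround (illustrative only; note that the task keeps running in the background after a timeout):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

static class SyncDnsWithTimeout
{
    // Sketch of the suggested workaround: run the synchronous resolver on a
    // task and bound the wait. The orphaned task still runs to completion
    // after a timeout, which is the main cost of this approach.
    public static IPAddress[] Resolve(string host, int timeoutMs)
    {
        Task<IPAddress[]> task = Task.Run(() => Dns.GetHostAddresses(host));
        if (!task.Wait(timeoutMs))
            throw new TimeoutException("DNS lookup exceeded " + timeoutMs + " ms");
        return task.Result;
    }
}
```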

-Glen



Shay Rojansky

unread,
Sep 18, 2014, 4:17:30 PM9/18/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
I don't think the async DNS code is broken in any way (after all it is the standard .NET resolver).

For me it's an issue of what npgsql guarantees and what it doesn't. Right now our timeout value guarantees very little - after the DNS timeout, Open() also sends a startup packet and then has an authentication back-and-forth process. These things take time, and they're not covered by any timeout mechanism (how could they be, except for executing the whole thing in a separate thread). Call it a personal preference, but I believe a library should make very clear guarantees on the features it provides; it's not a matter of providing a "more accurate" or "less accurate" timeout, it's a matter of defining exactly what it is that timeout does. If the timeout is clearly defined as a database timeout, we're 100% accurate with respect to what we provide.

The split-timeout-across-IPs thing I really consider a bug, or at the very least complicated behavior that's bound to lead to nasty bugs. Just define 15 IPs for your Postgresql server for any reason and boom, you'll start to get connection timeouts. The Npgsql connection process shouldn't depend on a totally arbitrary and external factor such as the number of IPs some hostname resolves to. The other option, which is to have one timeout for all the connection attempts, also seems problematic. The point is that there's no good *generic* solution here, so it should be left up to the user...

Glen Parker

unread,
Sep 18, 2014, 4:47:42 PM9/18/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
Good points guys :)

Shay, my response to the comment that connection startup has other overhead not included in the timeout enforcement would be, that ideally we would include it to improve accuracy.  As far as the splitting of timeout between IP's, I think you're seeing what I mean about no 100% right way to do it.  But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with.  If accurate timeout logic causes timeouts, then I think pretty clearly the actual problem is with the timeout value chosen by the user.  If you give each IP the full timeout value, that could result in a connect attempt failing after 15 times as long as the caller expects (60 second timeout goes to 15 minutes).  That's pretty bad IMO.

As far as the point about wrapping async logic within a sync call...  That ship has already sailed, because we already document a timeout for the sync connect call.  And that's a good thing.  ALL network-dependent APIs should offer a timeout IMO, async or not.

But here's a question.  If the Dns.BeginGetHostAddresses() call times out because of a thread pool issue, why wouldn't the Socket.BeginConnect() calls time out for the exact same reason?  Moreover, if the thread pool is as starved as it sounds like in this case, how can anything else work reliably?

-Glen


Shay Rojansky

unread,
Sep 18, 2014, 5:23:07 PM9/18/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
It's definitely an interesting discussion,


> Shay, my response to the comment that connection startup has other overhead not included in the timeout enforcement would be, that ideally we would include it to improve accuracy.

How? IMO we're getting to the crux of the matter. If I follow your argument and we want to provide a 100% accurate timeout, the *only* thing we can do is something along the following:

public void Open() {
    Task.Run(() => { // ... all the actual connection logic }).Wait(the_timeout);
}

In other words, run everything in a different thread and wait on that thread with a timeout. Anything else will leave operations out of the timeout and is inaccurate.

At this point I find myself wrapping all my API methods in this fashion, because Open() isn't really special: other methods do multiple back-and-forths with the server.

So we end up with a whole API full of thread-pool wrappers. Aside from the performance impact, the strain on the thread pool and the general yuckiness, any user who needs a timeout of this sort could simply do the task wrapping *outside* npgsql, like this:

Task.Run(() => { conn.Open() }).Wait(the_timeout);

In other words, we're providing almost zero added value by internally wrapping all of npgsql's code in Task.Run(): any user who needs it can do it themselves just as well. It's just a bit of syntactic sugar.

It all comes down to what you're saying you're providing: database timeouts, which can be cleanly and accurately implemented, or timeouts on npgsql methods, which make no sense to me...


> But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with

How would you suggest a user deal with it? Do they need to do a DNS query outside npgsql just in order to know how many IPs are resolved, and then multiply their timeout by that number? Or should they change their DNS setup because of npgsql's timeout mechanism? I think there is simply no way out here, this seems like a classical situation where a minimalist approach of "not handling it" works better than forcing a compromise on the user...

> ALL network-dependent APIs should offer a timeout IMO, async or not

That's a strange statement, in all my years I've yet to see a library that attempts to do this, when *multiple* underlying I/O operations are involved... can you provide an example? You're right that all network-dependent APIs should allow you to set the *socket* timeout, but that timeout is valid only for a *single* socket I/O operation. The moment a library provides a method that does more than one operation, the socket timeout no longer corresponds with the API operation. But I might be misunderstanding what you're suggesting...

> But here's a question.  If the Dns.BeginGetHostAddresses() call times out because of a thread pool issue, why wouldn't the Socket.BeginConnect() calls time out for the exact same reason?  Moreover, if the thread pool is as starved as it sounds like in this case, how can anything else work reliably?

Socket.BeginConnect() absolutely would show the same phenomenon, if what Josh said is true (and I think it is): it's a matter of trying to do too many operations in parallel and (temporarily) putting enough strain on the thread pool (when it's still small) and we miss the timeout. It's a rather extreme scenario, and if you really do have to spin up 500 connections in parallel it's probably fair to ask you to wait a little longer by setting a higher timeout...

Shay

Miłosz Kubański

unread,
Sep 19, 2014, 4:07:50 AM9/19/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
Hi All.

I think the same. We should support async as well as a timeout, but we shouldn't "throw the baby out with the bathwater".

Usually resolving a host name should take < 1 ms. Of course, because of a bad network configuration it can take longer the first time, but once resolved it will be cached by the OS as well as by many routers and DNS servers along the way. So creating another thread/task for such a small operation is not cost-effective (in most cases, except the first call on a badly configured network). I think we should still help people find out why the query is executing too long. But for the DNS timeout situation, the better solution is to log the problem ("Hey buddy, your DNS name resolution took 60 s, something is really wrong with your network configuration") or throw the exception as in previous versions, rather than creating an artificial async operation only for DNS name resolution. Probably most npgsql users won't see the problem which occurred in my case until they need to build an application handling a really large amount of data simultaneously, and we won't fix the DNS timeout problem by making the operation async.
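The log-instead-of-async approach might look something like this (the threshold and the message are, of course, placeholders):

```csharp
using System;
using System.Diagnostics;
using System.Net;

static class LoggingResolver
{
    // Sketch of the "log it, don't async it" idea: resolve synchronously and
    // warn when the lookup is suspiciously slow. Threshold is arbitrary.
    public static IPAddress[] Resolve(string host, int warnAfterMs = 1000)
    {
        var sw = Stopwatch.StartNew();
        IPAddress[] addresses = Dns.GetHostAddresses(host);
        sw.Stop();

        if (sw.ElapsedMilliseconds > warnAfterMs)
            Console.Error.WriteLine(
                "DNS resolution of '{0}' took {1} ms; check your network configuration.",
                host, sw.ElapsedMilliseconds);

        return addresses;
    }
}
```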

BTW, yesterday I found some information (I didn't have time to check it, but I think it could be true) that async DNS resolution is not a true async operation like, e.g., reading stream data. It just creates a thread and calls the sync DNS resolver.

Of course data transfer and other operations which take a while should be async, as you've made them.


Best Regards,
MiloszeS 

Glen Parker

unread,
Sep 19, 2014, 6:44:14 PM9/19/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
Holy wall of text Batman!



On Thu, Sep 18, 2014 at 2:23 PM, Shay Rojansky <ro...@roji.org> wrote:
> It's definitely an interesting discussion,

>> Shay, my response to the comment that connection startup has other overhead not included in the timeout enforcement would be, that ideally we would include it to improve accuracy.

> How? IMO we're getting to the crux of the matter. If I follow your argument and we want to provide a 100% accurate timeout, the *only* thing we can do is something along the following:


I did not mean to imply that 100% accuracy is a reasonable expectation for timeout behavior.  Sorry if that's how it came across.  But I don't think it's unreasonable to expect a timeout to occur within maybe 10% of the timeout value supplied (10% longer, never shorter).

 

> public void Open() {
>     Task.Run(() => { // ... all the actual connection logic }).Wait(the_timeout);
> }
>
> In other words, run everything in a different thread and wait on that thread with a timeout. Anything else will leave operations out of the timeout and is inaccurate.
>
> At this point I find myself wrapping all my API methods in this fashion, because Open() isn't really special: other methods do multiple back-and-forths with the server.


Isn't that because timeouts aren't accurate enough?  If they all provided +10% accuracy, would you still do this type of wrapper?  In a perfect world everything would be 100% accurate.  I think it's generally a good idea to keep perfection in mind at all times, as kind of the unobtainable goal that we should always try to move toward, but 100% accuracy in a timeout is very much unnecessary.

The trouble with doing timeouts this way is that the task can continue to run even after you quit waiting.  So, if our 15 IP example leads to a fifteen-minute run time, and you stop waiting after one minute, the task is going to continue to run for 14 more minutes.  That's not good at all!

This is exactly why I implemented statement timeouts using PG's statement_timeout setting.   If a query gets stuck in the backend and there's no statement_timeout, we quit waiting, but the connection remains useless until the query finishes.  I had tons of trouble with that and implemented the same timeout mechanism something like 12 years ago in our in-house middleware software (which was old c++ code using ODBC).

The above is an extreme example, I grant you.

 

> So we end up with a whole API full of thread-pool wrappers. Aside from the performance impact, the strain on the thread pool and the general yuckiness, any user who needs a timeout of this sort could simply do the task wrapping *outside* npgsql, like this:
>
> Task.Run(() => { conn.Open() }).Wait(the_timeout);
>
> In other words, we're providing almost zero added value by internally wrapping all of npgsql's code in Task.Run(): any user who needs it can do it themselves just as well. It's just a bit of syntactic sugar.
>
> It all comes down to what you're saying you're providing: database timeouts, which can be cleanly and accurately implemented, or timeouts on npgsql methods, which make no sense to me...

>> But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with
>
> How would you suggest a user deal with it? Do they need to do a DNS query outside npgsql just in order to know how many IPs are resolved, and then multiply their timeout by that number? Or should they change their DNS setup because of npgsql's timeout mechanism? I think there is simply no way out here, this seems like a classical situation where a minimalist approach of "not handling it" works better than forcing a compromise on the user...


The guy writing the software to connect to his company's RDBMS is the guy who needs to know about the 15 IP's, so of course he's the one to deal with it.  He should be able to expect some level of accuracy from Npgsql's timeout mechanism, and set his timeout values according to the conditions at hand.

 
> ALL network dependent APIs should offer a timeout IMO, async or not

That's a strange statement, in all my years I've yet to see a library that attempts to do this, when *multiple* underlying I/O operations are involved... can you provide an example? You're right that all network-dependent APIs should allow you to set the *socket* timeout, but that timeout is valid only for a *single* socket I/O operation. The moment a library provides a method that does more than one operation, the socket timeout no longer corresponds with the API operation. But I might be misunderstanding what you're suggesting...


It's just my opinion.  No examples needed :)

 

> But here's a question.  If the Dns.BeginGetHostAddresses() call times out because of a thread pool issue, why wouldn't the Socket.BeginConnect() calls timeout for the exact same reason?  Moreover, if the thread pool is as starved as it sounds like in this case, how can anything else work reliably?

Socket.BeginConnect() absolutely would show the same phenomenon, if what Josh said is true (and I think it is): it's a matter of trying to do too many operations in parallel and (temporarily) putting enough strain on the thread pool (while it's still small) that we miss the timeout. It's a rather extreme scenario, and if you really do have to spin up 500 connections in parallel it's probably fair to ask you to wait a little longer by setting a higher timeout...

THIS is the crux of the issue.  All this talk about how timeouts should work and their accuracy is a side show.

It appears that we're discussing the removal of perfectly good code in a strange (and ultimately likely to be ineffectual) attempt to band-aid somebody's system performance issues.

-Glen



 

Shay Rojansky

unread,
Sep 19, 2014, 8:20:05 PM9/19/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
On Saturday, September 20, 2014 1:44:14 AM UTC+3, glenebob wrote:
Holy wall of text Batman!

Didn't mean to be this wordy, I just got a little excited :)
 
On Thu, Sep 18, 2014 at 2:23 PM, Shay Rojansky <ro...@roji.org> wrote:
It's definitely an interesting discussion,

> Shay, my response to the comment that connection startup has other overhead not included in the timeout enforcement would be, that ideally we would include it to improve accuracy.

How? IMO we're getting to the crux of the matter. If I follow your argument and we want to provide a 100% accurate timeout, the *only* thing we can do is something along the following:


I did not mean to imply that 100% accuracy is a reasonable expectation for timeout behavior.  Sorry if that's how it came across.  But I don't think it's unreasonable to expect a timeout to occur within maybe 10% of the timeout value supplied (10% longer, never shorter).

Maybe I took it to the extreme with 100%, no problem. But see my comment in the bottom regarding the 10% precision.
 
public void Open() {
    Task.Run(() => { /* ... all the actual connection logic */ }).Wait(the_timeout);
}

In other words, run everything in a different thread and wait on that thread with a timeout. Anything else will leave operations out of the timeout and is inaccurate.

At this point I find myself wrapping all my API methods in this fashion, because Open() isn't really special: other methods do multiple back-and-forths with the server.


Isn't that because timeouts aren't accurate enough?  If they all provided +10% accuracy, would you still do this type of wrapper?  In a perfect world everything would be 100% accurate.  I think it's generally a good idea to keep perfection in mind at all times, as kind of the unobtainable goal that we should always try to move toward, but 100% accuracy in a timeout is very much unnecessary.

Well, all the timeouts I know are pretty much 100% accurate: socket timeout, Postgresql backend timeout. The point I'm trying to make here (and below) is that most Npgsql operations are complex, in the sense that they contain multiple operations. Each *single* operation can (usually) provide 100% accuracy (e.g. via a socket timeout). But what does a timeout mean when two or more operations are involved? What can anybody do except wrap the whole thing in a Task?
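To make the multi-operation problem concrete, here is a minimal sketch (an illustration, not Npgsql's actual code) of what a timeout across several sequential operations has to look like without Task.Run(): one deadline, with each step handed only the budget that remains. The step delegates are hypothetical stand-ins for DNS lookup, connect, startup and so on.

```csharp
using System;
using System.Diagnostics;

// Sketch: carry a single deadline across several sequential operations
// by handing each step only the time that remains.
static class DeadlineBudget
{
    public static void RunWithDeadline(TimeSpan total, params Action<TimeSpan>[] steps)
    {
        var clock = Stopwatch.StartNew();
        foreach (var step in steps)
        {
            var remaining = total - clock.Elapsed;
            if (remaining <= TimeSpan.Zero)
                throw new TimeoutException("deadline exceeded between steps");
            step(remaining); // each step must honor the sub-timeout it is given
        }
    }
}
```

The sequence as a whole finishes within `total` only if every individual step honors the budget passed to it; a step that blocks without its own timeout (a synchronous DNS call, say) silently breaks the guarantee, which is the heart of the disagreement here.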

One additional small argument here, is that Postgresql's default behavior is no timeout whatsoever. I personally believe npgsql should be as "pass-through" as possible to Postgresql, which is why I think it would be better if the backend timeout were not on by default (but it's definitely important for it to be supported).

The trouble with doing timeouts this way is that the task can continue to run even after you quit waiting.  So, if our 15 IP example leads  to a fifteen minute run time, and you stop waiting after one minute, the task is going to continue to run for 14 more minutes.  That's not good at all!

It's not amazing, but it's not a catastrophe either - we're talking about a single thread sitting around waiting on I/O for 14 minutes in an extreme situation...

This is exactly why I implemented statement timeouts using PG's statement_timeout setting.   If a query gets stuck in the backend and there's no statement_timeout, we quit waiting, but the connection remains useless until the query finishes.  I had tons of trouble with that, and implemented the same timeout mechanism something like 12 years ago in our in-house middleware software (which was old C++ code using ODBC).

Here I'm in total agreement with you - it's perfectly useful and a great idea to provide backend timeouts - these are clear and effective, and also 100% accurate with respect to what they provide.
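For reference, the backend mechanism Glen describes boils down to a single SET on the session. A minimal sketch against the abstract ADO.NET classes (with Npgsql you would pass an open NpgsqlConnection; the helper names are mine, not from any library):

```csharp
using System.Data.Common;

// Sketch: enable PostgreSQL's server-side statement_timeout for the
// current session. The value is in milliseconds; 0 means "no timeout",
// which is the server's default.
static class BackendTimeout
{
    public static string BuildSetCommand(int milliseconds)
    {
        return "SET statement_timeout = " + milliseconds;
    }

    public static void Apply(DbConnection openConnection, int milliseconds)
    {
        using (var cmd = openConnection.CreateCommand())
        {
            cmd.CommandText = BuildSetCommand(milliseconds);
            // From here on, the server cancels any statement on this session
            // that runs longer, raising SQLSTATE 57014 (query_canceled).
            cmd.ExecuteNonQuery();
        }
    }
}
```

Because the server does the cancelling, this timeout is accurate regardless of how the client is waiting, and the connection stays usable afterwards.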

The above is an extreme example, I grant you.

 

So we end up with a whole API full of thread-pool wrappers. Aside from the performance impact, the strain on the thread pool and the general yuckiness, any user who needs a timeout of this sort could simply do the task wrapping *outside* npgsql, like this:

Task.Run(() => { conn.Open(); }).Wait(the_timeout);

In other words, we're providing almost zero added value by internally wrapping all of npgsql's code in Task.Run(): any user who needs it can do it themselves just as well. It's just a bit of syntactic sugar.

It all comes down to what you're saying you're providing: database timeouts, which can be cleanly and accurately implemented, or timeouts on npgsql methods, which make no sense to me...

> But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with

How would you suggest a user deal with it? Do they need to do a DNS query outside npgsql just in order to know how many IPs are resolved, and then multiply their timeout by that number? Or should they change their DNS setup because of npgsql's timeout mechanism? I think there is simply no way out here, this seems like a classical situation where a minimalist approach of "not handling it" works better than forcing a compromise on the user...


The guy writing the software to connect to his company's RDBMS is the guy who needs to know about the 15 IP's, so of course he's the one to deal with it.  He should be able to expect some level of accuracy from Npgsql's timeout mechanism, and set his timeout values according to the conditions at hand.

The assumption about the guy writing the software knowing about the DNS, well... I really don't feel it's our job to speculate on this kind of thing. In large organizations the app programmer, the DB admin and the system/network admin are very different people; environments are complex and so forth. I'm not used to libraries varying their behavior based on such an arbitrary and unrelated factor.

> ALL network dependent APIs should offer a timeout IMO, async or not

That's a strange statement, in all my years I've yet to see a library that attempts to do this, when *multiple* underlying I/O operations are involved... can you provide an example? You're right that all network-dependent APIs should allow you to set the *socket* timeout, but that timeout is valid only for a *single* socket I/O operation. The moment a library provides a method that does more than one operation, the socket timeout no longer corresponds with the API operation. But I might be misunderstanding what you're suggesting...


It's just my opinion.  No examples needed :)

No problem, but I still don't understand how anyone is supposed to manage timeouts in the case of complex operations... 
 
> But here's a question.  If the Dns.BeginGetHostAddresses() call times out because of a thread pool issue, why wouldn't the Socket.BeginConnect() calls timeout for the exact same reason?  Moreover, if the thread pool is as starved as it sounds like in this case, how can anything else work reliably?

Socket.BeginConnect() absolutely would show the same phenomenon, if what Josh said is true (and I think it is): it's a matter of trying to do too many operations in parallel and (temporarily) putting enough strain on the thread pool (while it's still small) that we miss the timeout. It's a rather extreme scenario, and if you really do have to spin up 500 connections in parallel it's probably fair to ask you to wait a little longer by setting a higher timeout...

THIS is the crux of the issue.  All this talk about how timeouts should work and their accuracy is a side show.

It appears that we're discussing the removal of perfectly good code in a strange (and ultimately likely to be ineffectual) attempt to band-aid somebody's system performance issues.

Well, I'll try to sum it up to make sure I'm understood... My motivation for removing the timeout isn't at all to band-aid the user's performance issue. IMO the timeout doesn't improve accuracy, but rather hurts npgsql behavior and what users can expect from it. My problem is that the timeout we're providing right now isn't accurate at all. If the DNS record has 3 IPs it's divided by 3. If DNS, startup and authentication are slow it's multiplied by 3. So we're letting the user configure a timeout that, in practice, means almost nothing. What can users rely on here, how would this timeout even be documented?

Reading over what I wrote I think I've come across a bit aggressively, I'm sorry about that (too late/tired to go back and rewrite). If we really disagree on this, maybe we should have a vote between the core devs?
 
-Glen 

Josh Cooley

unread,
Sep 19, 2014, 9:00:47 PM9/19/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
On Fri, Sep 19, 2014 at 7:20 PM, Shay Rojansky <ro...@roji.org> wrote:
On Saturday, September 20, 2014 1:44:14 AM UTC+3, glenebob wrote:
Holy wall of text Batman!

Didn't mean to be this wordy, I just got a little excited :)

I removed a bunch of text so I could reply to the points I had something to say about. 
 
One additional small argument here, is that Postgresql's default behavior is no timeout whatsoever. I personally believe npgsql should be as "pass-through" as possible to Postgresql, which is why I think it would be better if the backend timeout were not on by default (but it's definitely important for it to be supported).

I can't agree that we should be as pass through as possible.  We should be as good a .net citizen as possible.  Often that means we should do some additional work so that the ADO.NET provider behaves the way expected.  Unprepared statements with parameters is a prime example here.
 

The trouble with doing timeouts this way is that the task can continue to run even after you quit waiting.  So, if our 15 IP example leads  to a fifteen minute run time, and you stop waiting after one minute, the task is going to continue to run for 14 more minutes.  That's not good at all!

It's not amazing, but it's not a catastrophe either - we're talking about a single thread sitting around waiting on I/O for 14 minutes in an extreme situation...

This is not your call to make.  Npgsql is not a theoretical exercise, it's a library meant to be used in real world applications.  The severity of the problem is not for us to decide, but the applications.  In implementing connection timeouts, we don't need to be considering what the postgresql back end will do with these values.  We need to be considering what the caller expects these values to do.  In the work I do, this kind of violation of the contract would mean that Npgsql could not be used without a wrapper.  This defeats the ability to use the abstract base classes, and would make the consumers generally untrusting of other contracts in the library.
 

> But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with

How would you suggest a user deal with it? Do they need to do a DNS query outside npgsql just in order to know how many IPs are resolved, and then multiply their timeout by that number? Or should they change their DNS setup because of npgsql's timeout mechanism? I think there is simply no way out here, this seems like a classical situation where a minimalist approach of "not handling it" works better than forcing a compromise on the user...


The guy writing the software to connect to his company's RDBMS is the guy who needs to know about the 15 IP's, so of course he's the one to deal with it.  He should be able to expect some level of accuracy from Npgsql's timeout mechanism, and set his timeout values according to the conditions at hand.

The assumption about the guy writing the software knowing about the DNS, well... I really don't feel it's our job to speculate on this kind of thing. In large organizations the app programmer, the DB admin and the system/network admin are very different people; environments are complex and so forth. I'm not used to libraries varying their behavior based on such an arbitrary and unrelated factor.

I agree that those jobs can be greatly separated.  That said, the developer will either have expectations for how the timeout will work or the connection string settings will be externalized.  It doesn't seem reasonable that these three roles would work to build a system where the repeated timeout is the expected behavior without that being part of some stated contract and the actual behavior in some released version.
 

THIS is the crux of the issue.  All this talk about how timeouts should work and their accuracy is a side show.

It appears that we're discussing the removal of perfectly good code in a strange (and ultimately likely to be ineffectual) attempt to band-aid somebody's system performance issues.

Well, I'll try to sum it up to make sure I'm understood... My motivation for removing the timeout isn't at all to band-aid the user's performance issue. IMO the timeout doesn't improve accuracy, but rather hurts npgsql behavior and what users can expect from it. My problem is that the timeout we're providing right now isn't accurate at all. If the DNS record has 3 IPs it's divided by 3. If DNS, startup and authentication are slow it's multiplied by 3. So we're letting the user configure a timeout that, in practice, means almost nothing. What can users rely on here, how would this timeout even be documented?

I have to say that I agree with Glen's sentiment if not the description of it as "perfectly good".  There are some problems with it, but I don't see the problem you are describing.  If the following description is not what is intended, can you describe what you mean in more detail?

It looks to me like we have a timeout that should limit the time used between requesting a connection and establishing the tcp connection.  I don't see where the multiplication comes in.  There's been a suggestion that the timeout extend beyond this and include the startup and authentication, but I still don't understand where the multiplication comes in.  I don't think the discussed behavior is impossible to document, if this is what we think it is currently doing.

Josh
 

Glen Parker

unread,
Sep 19, 2014, 10:27:18 PM9/19/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
On Fri, Sep 19, 2014 at 5:20 PM, Shay Rojansky <ro...@roji.org> wrote:

 
public void Open() {
    Task.Run(() => { /* ... all the actual connection logic */ }).Wait(the_timeout);
}

In other words, run everything in a different thread and wait on that thread with a timeout. Anything else will leave operations out of the timeout and is inaccurate.

At this point I find myself wrapping all my API methods in this fashion, because Open() isn't really special: other methods do multiple back-and-forths with the server.


Isn't that because timeouts aren't accurate enough?  If they all provided +10% accuracy, would you still do this type of wrapper?  In a perfect world everything would be 100% accurate.  I think it's generally a good idea to keep perfection in mind at all times, as kind of the unobtainable goal that we should always try to move toward, but 100% accuracy in a timeout is very much unnecessary.

Well, all the timeouts I know are pretty much 100% accurate: socket timeout, Postgresql backend timeout. The point I'm trying to make here (and below) is that most Npgsql operations are complex, in the sense that they contain multiple operations. Each *single* operation can (usually) provide 100% accuracy (e.g. via a socket timeout). But what does a timeout mean when two or more operations are involved? What can anybody do except wrap the whole thing in a Task?

One additional small argument here, is that Postgresql's default behavior is no timeout whatsoever. I personally believe npgsql should be as "pass-through" as possible to Postgresql, which is why I think it would be better if the backend timeout were not on by default (but it's definitely important for it to be supported).


Hmm, I can't tell if I agree with that or not. I do like options, the more the merrier, but one thing is for sure: I would never use Npgsql with timeouts turned off.  It's hard to imagine anyone wanting to do that.

 

The trouble with doing timeouts this way is that the task can continue to run even after you quit waiting.  So, if our 15 IP example leads  to a fifteen minute run time, and you stop waiting after one minute, the task is going to continue to run for 14 more minutes.  That's not good at all!

It's not amazing, but it's not a catastrophe either - we're talking about a single thread sitting around waiting on I/O for 14 minutes in an extreme situation...

This is exactly why I implemented statement timeouts using PG's statement_timeout setting.   If a query gets stuck in the backend and there's no statement_timeout, we quit waiting, but the connection remains useless until the query finishes.  I had tons of trouble with that, and implemented the same timeout mechanism something like 12 years ago in our in-house middleware software (which was old C++ code using ODBC).

Here I'm in total agreement with you - it's perfectly useful and a great idea to provide backend timeouts - these are clear and effective, and also 100% accurate with respect to what they provide.

But your argument certainly has merit, since working with a default closer to PG's own does seem like a valid idea.



The above is an extreme example, I grant you.

 

So we end up with a whole API full of thread-pool wrappers. Aside from the performance impact, the strain on the thread pool and the general yuckiness, any user who needs a timeout of this sort could simply do the task wrapping *outside* npgsql, like this:

Task.Run(() => { conn.Open(); }).Wait(the_timeout);

In other words, we're providing almost zero added value by internally wrapping all of npgsql's code in Task.Run(): any user who needs it can do it themselves just as well. It's just a bit of syntactic sugar.

It all comes down to what you're saying you're providing: database timeouts, which can be cleanly and accurately implemented, or timeouts on npgsql methods, which make no sense to me...

> But, as you say, if the server you're connecting to has 15 IP's and that's the cause of timeouts, then that is up to the user to deal with

How would you suggest a user deal with it? Do they need to do a DNS query outside npgsql just in order to know how many IPs are resolved, and then multiply their timeout by that number? Or should they change their DNS setup because of npgsql's timeout mechanism? I think there is simply no way out here, this seems like a classical situation where a minimalist approach of "not handling it" works better than forcing a compromise on the user...


The guy writing the software to connect to his company's RDBMS is the guy who needs to know about the 15 IP's, so of course he's the one to deal with it.  He should be able to expect some level of accuracy from Npgsql's timeout mechanism, and set his timeout values according to the conditions at hand.

The assumption about the guy writing the software knowing about the DNS, well... I really don't feel it's our job to speculate on this kind of thing. In large organizations the app programmer, the DB admin and the system/network admin are very different people; environments are complex and so forth. I'm not used to libraries varying their behavior based on such an arbitrary and unrelated factor.

I'm not following here.  The only presumption I wish to make is that we're the least qualified to make any decisions about our users' environments, and that they are more qualified than us.  I don't understand what you mean by speculating...  Unless you mean that I'm speculating that a user probably does not want a 1 minute timeout to result in a 15 minute duration.

 

> ALL network dependent APIs should offer a timeout IMO, async or not

That's a strange statement, in all my years I've yet to see a library that attempts to do this, when *multiple* underlying I/O operations are involved... can you provide an example? You're right that all network-dependent APIs should allow you to set the *socket* timeout, but that timeout is valid only for a *single* socket I/O operation. The moment a library provides a method that does more than one operation, the socket timeout no longer corresponds with the API operation. But I might be misunderstanding what you're suggesting...


It's just my opinion.  No examples needed :)

No problem, but I still don't understand how anyone is supposed to manage timeouts in the case of complex operations... 


It's hard!  And, timeouts are kind of an extra-credit feature, if you will.  They are very commonly not done all that well.  It's as simple as that.  But the fact that something isn't done very well in a lot of software doesn't mean we shouldn't try to do it well.  It shouldn't be a high priority, but if someone wants to spend the time to improve accuracy I don't see that as a bad thing.

All I'm saying is that we should try, if we can, to do what the caller asks within the limits requested.  I'm not saying we have to create some perfect panacea of timeout precision or die in the attempt.

 
 
> But here's a question.  If the Dns.BeginGetHostAddresses() call times out because of a thread pool issue, why wouldn't the Socket.BeginConnect() calls timeout for the exact same reason?  Moreover, if the thread pool is as starved as it sounds like in this case, how can anything else work reliably?

Socket.BeginConnect() absolutely would show the same phenomenon, if what Josh said is true (and I think it is): it's a matter of trying to do too many operations in parallel and (temporarily) putting enough strain on the thread pool (while it's still small) that we miss the timeout. It's a rather extreme scenario, and if you really do have to spin up 500 connections in parallel it's probably fair to ask you to wait a little longer by setting a higher timeout...

THIS is the crux of the issue.  All this talk about how timeouts should work and their accuracy is a side show.

It appears that we're discussing the removal of perfectly good code in a strange (and ultimately likely to be ineffectual) attempt to band-aid somebody's system performance issues.

Well, I'll try to sum it up to make sure I'm understood... My motivation for removing the timeout isn't at all to band-aid the user's performance issue. IMO the timeout doesn't improve accuracy, but rather hurts npgsql behavior and what users can expect from it. My problem is that the timeout we're providing right now isn't accurate at all. If the DNS record has 3 IPs it's divided by 3. If DNS, startup and authentication are slow it's multiplied by 3. So we're letting the user configure a timeout that, in practice, means almost nothing. What can users rely on here, how would this timeout even be documented?


Well again, that was a compromise.  As I recall, I was applying the full remaining timeout to each socket.  But if the first one times out, you're completely out of time, so you never get the chance to try the next one.  Francisco suggested going the way the code is now.  But why isn't it accurate?  You spend up to the timeout trying to get IP's and connecting to one.  If there's more than one IP, none of them get the full time slice, but in total only the expected time is spent.

Before I made that change, there was no timeout on DNS.  Granted, it is somewhat difficult to get DNS service to misbehave, so the timeout will almost never be hit, but it is possible.  After the DNS call completed, each IP was given the full timeout value.  If they all time out, the call would take (timeout * numberOfIPs) duration to fail.  That's certainly not accurate.

A function is a black box.  If a function has a timeout parameter, the only logical assumption from the outside is that the timeout value itself will be honored, not some unknown multiple of it.
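The two budgeting regimes being compared can be reduced to arithmetic. These helpers are illustrative only, not Npgsql's code:

```csharp
// Sketch of the two timeout-budgeting strategies under discussion.

// Old behavior Glen describes: every IP got the full timeout, so the
// worst case to fail grows with the number of addresses DNS returns.
static class TimeoutBudgets
{
    public static int WorstCaseFullTimeoutPerIp(int timeoutSeconds, int ipCount)
    {
        return timeoutSeconds * ipCount;
    }

    // Current behavior: the user's budget is split across the addresses,
    // so the total is bounded but each attempt gets only a slice.
    public static int SlicePerIp(int timeoutSeconds, int ipCount)
    {
        return timeoutSeconds / ipCount;
    }
}
```

With the ADO.NET default of 15 seconds and 3 resolved IPs, the old scheme could take up to 45 seconds to fail, while the current one gives each address only 5 seconds: the "multiplied by 3" and "divided by 3" cases in Shay's summary.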

Which gets us back to the user.  If a user is experiencing timeouts because he's trying to connect to a server with 15 IP's using a 10 second timeout (which is actually being honored, not ignored), how on Earth is that our problem, more than it is his?  We don't even know he and his 15 IP server exist.

But, back to the crux :)  What makes us think that going back to sync DNS won't just push the error condition down to the IP connect loop? Why is it better to die of thread starvation there than 5 lines of code before?  What are we fixing?



 

Reading over what I wrote I think I've come across a bit aggressively, I'm sorry about that (too late/tired to go back and rewrite). If we really disagree on this, maybe we should have a vote between the core devs?


Meh, no need to be sorry.  And ultimately yes, of course it needs to be agreed upon in some way, maybe with a vote.  But FIRST, we need to be able to demonstrate that we're actually fixing anything by doing it, and I don't think we are.  I think the fix is whatever causes the system to not be so starved for threads that completion ports don't work, and I'm pretty sure Npgsql isn't the problem there.
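One application-side mitigation for that kind of cold-pool starvation, a sketch rather than something anyone proposed in the thread, is to raise the thread pool's minimum size so async completions aren't queued while the pool grows one thread at a time:

```csharp
using System;
using System.Threading;

// Sketch: raise the thread pool's minimum size so that a burst of async
// work isn't delayed by the pool's slow ramp-up from a cold start.
static class PoolWarmup
{
    public static bool EnsureMinThreads(int workers, int ioCompletionPorts)
    {
        int curWorkers, curIo;
        ThreadPool.GetMinThreads(out curWorkers, out curIo);
        // Never lower a minimum that is already higher.
        return ThreadPool.SetMinThreads(
            Math.Max(workers, curWorkers),
            Math.Max(ioCompletionPorts, curIo));
    }
}
```

This belongs in the application, not in Npgsql, which fits Glen's point that the fix lies with whatever is starving the system of threads.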

-Glen

Shay Rojansky

unread,
Sep 20, 2014, 2:52:14 AM9/20/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, fran...@npgsql.org
Thanks for the comments guys, it's good to have a serious and in-depth argument like this one. I think I've understood the situation better (after a night's sleep).

> Josh: I can't agree that we should be as pass through as possible.  We should be as good a .net citizen as possible.  Often that means we should do some additional work so that the ADO.NET provider behaves the way expected.  Unprepared statements with parameters is a prime example here.

Of course you're right, what I meant was that we should be as passthrough as possible while being good .NET and ADO.NET citizens. I also agree with the latter being of higher importance than the former.

I admit that I wasn't aware of the ConnectionTimeout contract, and even less of 15 seconds being its default. This makes me feel kind of ridiculous for this rant, if you guys had simply sent a link to the ConnectionTimeout MSDN page I would have shut my mouth much earlier.

A totally unrelated sidenote: the passthrough principle is what's making me want to remove our special setting of extra_float_digits (#324) and lc_monetary (#266), even if (the latter) means a bit more trouble for us - they're definitely not mandated by any .NET/ADO contract AFAIK. I hope you guys agree there.

> Glen: Hmm, I can't tell if I agree with that or not. I do like options, the more the merrier, but one thing is for sure: I would never use Npgsql with timeouts turned off.  It's hard to imagine anyone wanting to do that.

I agree about options always being good. Keep in mind though that, again, Postgresql has no timeout by default, so it seems the Postgresql people (and maybe some of their users) aren't necessarily as fond of timeouts as you :) But the discussion is moot: the Postgresql default is in conflict with the ADO.NET default, and I agree it's the latter that wins. So 15 seconds by default it is.

----

So you guys are right. Now here's an idea how to make the timeout mechanism better and a few other suggestions/comments.

* Rather than dividing the connection timeout between the different IPs we get from DNS, we have the option of doing them in parallel using async IO. In other words, if DNS gives us 4 IPs, we use Socket.ConnectAsync() and wait for the first completion. This is not a flawless idea: it may mean a bit more load when doing 500 in parallel, but gives you a truly predictable timeout. I'm not 100% sure this is a good idea: the load increases the more IPs there are, but I wanted to put it out there - what do you think?
* Regarding the DNS query itself - searching around yielded no results for sync-with-timeout lookups. I'd say we stay with the current mechanism - I'm not sure I see an actual issue here, as Josh explained in his very first reply (thread pool needs to scale up etc.). What I can do is rewrite the async DNS mechanism to use the more modern async/await idiom (now that we're .NET 4 and above), which is functionally the same but cleaner code.
* I'll also open an issue on extending the connection timeout over the later connection phases, i.e. startup and authentication. That shouldn't be too hard
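The first bullet can be sketched like this (illustrative only; error handling is simplified, and cleaning up attempts still pending when the timeout fires is left out):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Sketch of the "race all resolved IPs" idea: start a connect to every
// address, return the first socket that succeeds, and have each losing
// attempt close its own socket when it eventually completes.
static class ParallelConnect
{
    public static async Task<Socket> ConnectAnyAsync(
        IPAddress[] addresses, int port, TimeSpan timeout)
    {
        var pending = addresses.Select(addr => AttemptAsync(addr, port)).ToList();
        var deadline = Task.Delay(timeout);

        while (pending.Count > 0)
        {
            var finished = await Task.WhenAny(
                pending.Cast<Task>().Concat(new[] { deadline }));
            if (finished == deadline)
                break;

            var attempt = (Task<Socket>)finished;
            pending.Remove(attempt);
            if (attempt.Status == TaskStatus.RanToCompletion)
            {
                // We have a winner; losers clean up after themselves.
                foreach (var loser in pending)
                    loser.ContinueWith(t =>
                    {
                        if (t.Status == TaskStatus.RanToCompletion)
                            t.Result.Close();
                    });
                return attempt.Result;
            }
            // A faulted attempt already closed its socket; keep waiting.
        }
        // A fuller version would distinguish "all attempts failed" from
        // "the timeout elapsed" and tidy up still-pending attempts here.
        throw new TimeoutException("no address connected in time");
    }

    static async Task<Socket> AttemptAsync(IPAddress addr, int port)
    {
        var socket = new Socket(addr.AddressFamily,
            SocketType.Stream, ProtocolType.Tcp);
        try
        {
            await Task.Factory.FromAsync(
                socket.BeginConnect(addr, port, null, null), socket.EndConnect);
            return socket;
        }
        catch
        {
            socket.Close();
            throw;
        }
    }
}
```

With Task.WhenAny the user-visible timeout holds regardless of how many addresses DNS returned, which is the "truly predictable timeout" part; the cost is the extra simultaneous connection attempts.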

Shay

Glen Parker

unread,
Sep 20, 2014, 1:53:53 PM9/20/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.

* Rather than dividing the connection timeout between the different IPs we get from DNS, we have the option of doing them in parallel using async IO. In other words, if DNS gives us 4 IPs, we use Socket.ConnectAsync() and wait for the first completion. This is not a flawless idea: it may mean a bit more load when doing 500 in parallel, but gives you a truly predictable timeout. I'm not 100% sure this is a good idea: the load increases the more IPs there are, but I wanted to put it out there - what do you think?
* Regarding the DNS query itself - searching around yielded no results for sync-with-timeout lookups. I'd say we stay with the current mechanism - I'm not sure I see an actual issue here, as Josh explained in his very first reply (thread pool needs to scale up etc.). What I can do is rewrite the async DNS mechanism to use the more modern async/await idiom (now that we're .NET 4 and above), which is functionally the same but cleaner code.
* I'll also open an issue on extending the connection timeout over the later connection phases, i.e. startup and authentication. That shouldn't be too hard


Well, you realize, the normal case could result in all connection attempts succeeding, leaving us to have to pick (random?), and disconnect the rest.

Actually, it's worse than that.  I think we'd have to code it such that each task is able to close the socket it just got upon finding it is one of the losers.  We could get a connection quickly and start doing queries, while the other tasks were still running in the background cleaning up the mess.  "Hello, PG?  Yeah, I need to talk to you.  Oh, uh, never mind." :)

This sounds like all kinds of bad to me.

=Glen





Shay Rojansky

unread,
Sep 20, 2014, 3:33:53 PM9/20/14
to Glen Parker, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
On Sat, Sep 20, 2014 at 8:53 PM, Glen Parker <glen...@gmail.com> wrote:

* Rather than dividing the connection timeout between the different IPs we get from DNS, we have the option of doing them in parallel using async IO. In other words, if DNS gives us 4 IPs, we use Socket.ConnectAsync() and wait for the first completion. This is not a flawless idea: it may mean a bit more load when doing 500 in parallel, but gives you a truly predictable timeout. I'm not 100% sure this is a good idea: the load increases the more IPs there are, but I wanted to put it out there - what do you think?
* Regarding the DNS query itself - searching around yield no result for sync-with-timeout lookups. I'd say we stay with the current mechanism - I'm not sure I see an actual issue here, as Josh explained in his very first reply (thread pool needs to scale up etc.). What I can do is rewrite the async DNS mechanism to use the more modern async/await idiom (now that we're .NET 4 and above), which is functionally the same but cleaner code.
* I'll also open an issue on extending the connection timeout over the later connection phases, i.e. startup and authentication. That shouldn't be too hard


Well, you realize, the normal case could result in all connection attempts succeeding, leaving us to have to pick (random?), and disconnect the rest.

That's right, although it seems to make sense to just pick the first one that connects (rather than random). The others get asynchronously cancelled (if not yet connected) or closed (if already connected).
 
Actually, it's worse than that.  I think we'd have to code it such that each task is able to close the socket it just got upon finding it is one of the losers.  We could get a connection quickly and start doing queries, while the other tasks were still running in the background cleaning up the mess.  "Hello, PG?  Yeah, I need to talk to you.  Oh, uh, never mind." :)

I'm not sure I see the issue... True, we'd have the "I need to talk to you. Oh, uh, never mind" scenario, but is that a problem? I don't see anything wrong in going ahead with queries on the successful connection, while the others get asynchronously dumped in the background (except, again, for slightly more load).

This sounds like all kinds of bad to me.

I'm still not sure why... and it would provide an actual, predictable timeout.
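To make the parallel-connect proposal concrete, here is a rough sketch (illustrative only - not actual or proposed Npgsql code, and the class and method names are invented), racing `TcpClient.ConnectAsync()` attempts against each other and against a single timeout:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

static class ParallelConnectSketch
{
    // Race connection attempts against all resolved IPs and a timeout;
    // the first successful attempt wins, the losers are disposed.
    public static async Task<TcpClient> ConnectFirstAsync(IPAddress[] ips, int port, TimeSpan timeout)
    {
        var clients = ips.Select(ip => new TcpClient(ip.AddressFamily)).ToArray();
        var attempts = ips.Select((ip, i) => clients[i].ConnectAsync(ip, port)).ToArray();
        var timeoutTask = Task.Delay(timeout);
        var pending = attempts.Cast<Task>().ToList();

        while (pending.Count > 0)
        {
            var done = await Task.WhenAny(pending.Concat(new[] { timeoutTask }));
            if (done == timeoutTask)
                break;                      // the full timeout applies to every IP at once
            var winner = Array.IndexOf(attempts, done);
            if (done.Status == TaskStatus.RanToCompletion)
            {
                // "Never mind": clean up the losers in the background.
                for (var i = 0; i < clients.Length; i++)
                    if (i != winner)
                    {
                        var loser = clients[i];
                        attempts[i].ContinueWith(_ => loser.Close());
                    }
                return clients[winner];
            }
            pending.Remove(done);           // this IP failed; keep waiting on the rest
        }
        foreach (var c in clients) c.Close();
        throw new TimeoutException("Could not connect to any IP within the timeout");
    }
}
```

The full timeout applies to every IP simultaneously, which is the "predictable timeout" property being argued for; the cost is the extra "never mind" connections that must be cleaned up in the background.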

Glen Parker

unread,
Sep 20, 2014, 4:09:46 PM9/20/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
You keep saying that, about the timeout.  I think you're reading the connect code wrong or something.  It's quite predictable now, from the perspective of the caller.

Anyway, this seems like something that would raise a db admin's hackles if he saw how many "never mind" connections are being made to his server.  I don't know what the startup overhead is like in PG, but I'm pretty sure it involves a new process just to get ready for the handshake.

I suppose I could see this having merit in a pooled environment, where the extra connections could actually be built up and added to the pool rather than dropped right away...

-Glen


Shay Rojansky

unread,
Sep 20, 2014, 5:26:20 PM9/20/14
to Glen Parker, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
On Sat, Sep 20, 2014 at 11:09 PM, Glen Parker <glen...@gmail.com> wrote:


On Sat, Sep 20, 2014 at 12:33 PM, Shay Rojansky <ro...@roji.org> wrote:
On Sat, Sep 20, 2014 at 8:53 PM, Glen Parker <glen...@gmail.com> wrote:

* Rather than dividing the connection timeout between the different IPs we get from DNS, we have the option of doing them in parallel using async IO. In other words, if DNS gives us 4 IPs, we use Socket.ConnectAsync() and wait for the first completion. This is not a flawless idea: it may mean a bit more load when doing 500 in parallel, but gives you a truly predictable timeout. I'm not 100% sure this is a good idea: the load increases the more IPs there are, but I wanted to put it out there - what do you think?
* Regarding the DNS query itself - searching around yield no result for sync-with-timeout lookups. I'd say we stay with the current mechanism - I'm not sure I see an actual issue here, as Josh explained in his very first reply (thread pool needs to scale up etc.). What I can do is rewrite the async DNS mechanism to use the more modern async/await idiom (now that we're .NET 4 and above), which is functionally the same but cleaner code.
* I'll also open an issue on extending the connection timeout over the later connection phases, i.e. startup and authentication. That shouldn't be too hard


Well, you realize, the normal case could result in all connection attempts succeeding, leaving us to have to pick (random?), and disconnect the rest.

That's right, although it seems to make sense to just pick the first one that connects (rather than random). The others gets asynchronously cancelled (if not yet connected) or closed (if connected).
 
Actually, it's worse than that.  I think we'd have to code it such that each task is able to close the socket it just got upon finding it is one of the losers.  We could get a connection quickly and start doing queries, while the other tasks were still running in the background cleaning up the mess.  "Hello, PG?  Yeah, I need to talk to you.  Oh, uh, never mind." :)

I'm not sure I see the issue... True, we'd have the "I need to talk to you. Oh, uh, never mind" scenario, but is that a problem? I don't see anything wrong in going ahead with queries on the successful connection, while the others get asynchronously dumped in the background (except, again, for slightly more load).

This sounds like all kinds of bad to me.

I'm still not sure why... and it would provide an actual, predictable timeout.


You keep saying that, about the timeout.  I think you're reading the connect code wrong or something.  It's quite predictable now, from the perspective of the caller.

I'm sorry if I'm misunderstanding it or something, I'm not trying to be annoying. The way it's currently implemented, if the user sets the connection timeout to 10 seconds and there are 2 IPs, the per-IP timeout gets cut down to 5 instead. This is unexpected behavior, and unpredictable if the user isn't aware of DNS specifics. I know I'm repeating myself, but it's hard to understand how this isn't at least recognized as a potential issue.

But I guess this discussion has gone on long enough, let's drop it.

Glen Parker

unread,
Sep 20, 2014, 8:09:25 PM9/20/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
On Sat, Sep 20, 2014 at 2:26 PM, Shay Rojansky <ro...@roji.org> wrote:


On Sat, Sep 20, 2014 at 11:09 PM, Glen Parker <glen...@gmail.com> wrote:


On Sat, Sep 20, 2014 at 12:33 PM, Shay Rojansky <ro...@roji.org> wrote:
On Sat, Sep 20, 2014 at 8:53 PM, Glen Parker <glen...@gmail.com> wrote:

* Rather than dividing the connection timeout between the different IPs we get from DNS, we have the option of doing them in parallel using async IO. In other words, if DNS gives us 4 IPs, we use Socket.ConnectAsync() and wait for the first completion. This is not a flawless idea: it may mean a bit more load when doing 500 in parallel, but gives you a truly predictable timeout. I'm not 100% sure this is a good idea: the load increases the more IPs there are, but I wanted to put it out there - what do you think?
* Regarding the DNS query itself - searching around yield no result for sync-with-timeout lookups. I'd say we stay with the current mechanism - I'm not sure I see an actual issue here, as Josh explained in his very first reply (thread pool needs to scale up etc.). What I can do is rewrite the async DNS mechanism to use the more modern async/await idiom (now that we're .NET 4 and above), which is functionally the same but cleaner code.
* I'll also open an issue on extending the connection timeout over the later connection phases, i.e. startup and authentication. That shouldn't be too hard


Well, you realize, the normal case could result in all connection attempts succeeding, leaving us to have to pick (random?), and disconnect the rest.

That's right, although it seems to make sense to just pick the first one that connects (rather than random). The others gets asynchronously cancelled (if not yet connected) or closed (if connected).
 
Actually, it's worse than that.  I think we'd have to code it such that each task is able to close the socket it just got upon finding it is one of the losers.  We could get a connection quickly and start doing queries, while the other tasks were still running in the background cleaning up the mess.  "Hello, PG?  Yeah, I need to talk to you.  Oh, uh, never mind." :)

I'm not sure I see the issue... True, we'd have the "I need to talk to you. Oh, uh, never mind" scenario, but is that a problem? I don't see anything wrong in going ahead with queries on the successful connection, while the others get asynchronously dumped in the background (except, again, for slightly more load).

This sounds like all kinds of bad to me.

I'm still not sure why... and it would provide an actual, predictable timeout.


You keep saying that, about the timeout.  I think you're reading the connect code wrong or something.  It's quite predictable now, from the perspective of the caller.

I'm sorry if I'm misunderstanding it or something, I'm not trying to be annoying. The way it's currently implemented, if the user sets CommandTimeout to 10 seconds and there are 2 IPs, it gets cut down to 5 instead. This is unexpected behavior, and unpredictable if the user isn't aware of DNS specifics. I know I'm repeating myself, but it's hard to understand how this isn't at least recognized as a potential issue.


You're not being annoying, it's just that I'd rather be on the same page than just drop it.  Maybe I'm the one misunderstanding something. 

Currently, in your example, the first IP will time out after 5 seconds.  Then the second IP will be attempted, also with a 5 second timeout.  That's a total of 10 seconds, which is what the caller expects.

Now, if the first IP fails for some other reason after only 1 second, the entire remaining time slice (9 seconds) will be given to the second IP, so if it ends up timing out, again we have a total of 10 seconds.

If you're seeing something that causes what I just described to be false, then we do indeed have a bug, and it should be fixed.  But I looked at it again this morning, and it looks to me like it's working as expected.
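The remaining-time bookkeeping described here can be sketched as follows (a hypothetical helper, not the actual connector code):

```csharp
using System;

static class TimeoutSplit
{
    // Each attempt gets an equal share of whatever time remains, so the
    // caller never waits longer than the original connection timeout.
    public static TimeSpan NextAttemptTimeout(TimeSpan remaining, int ipsLeft)
    {
        return TimeSpan.FromTicks(remaining.Ticks / ipsLeft);
    }
}

// 10s total, 2 IPs: the first attempt gets 5s. If the first attempt
// instead fails fast after only 1s, the second attempt gets the full
// 9s that remain - still at most 10s total.
```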
 
-Glen

Shay Rojansky

unread,
Sep 21, 2014, 2:28:53 AM9/21/14
to Glen Parker, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
You keep saying that, about the timeout.  I think you're reading the connect code wrong or something.  It's quite predictable now, from the perspective of the caller.

I'm sorry if I'm misunderstanding it or something, I'm not trying to be annoying. The way it's currently implemented, if the user sets CommandTimeout to 10 seconds and there are 2 IPs, it gets cut down to 5 instead. This is unexpected behavior, and unpredictable if the user isn't aware of DNS specifics. I know I'm repeating myself, but it's hard to understand how this isn't at least recognized as a potential issue.


You're not being annoying, it's just that I'd rather be on the same page than just drop it.  Maybe I'm the one misunderstanding something. 

Currently. in your example, the first IP will time out after 5 seconds.  Then the second IP will be attempted, also with a 5 second timeout.  That's a total of 10 seconds, which is what the caller expects.

Now, if the first IP fails for some other reason after only 1 second, the entire remaining time slice (9 seconds) will be given to the second IP, so if it ends up timing out, again we have a total of 10 seconds.

If you're seeing something that causes what I just described to be false, then we do indeed have a bug, and it should be fixed.  But I looked at it again this morning, and it looks to me like it's working as expected.

Your description isn't false. But I strongly believe that when a user sets the connection timeout to 10, giving up on a given IP after 5 is very unintuitive behavior. And since the user doesn't necessarily know the number of IPs, it's unpredictable.

I'll give another argument for the user not knowing: cloud situations. Amazon provides Amazon RDS (a managed PostgreSQL), Amazon Redshift (a non-PostgreSQL datastore that can be accessed with the PostgreSQL protocol), etc. Users can't possibly be expected to know the number of IPs in advance, and moreover, it may change as Amazon changes their infrastructure. The only option remaining to these users is to do a DNS lookup of their own, outside Npgsql, and give what I consider to be a "fake timeout" to make our mechanism function as they want. This seems like a strange burden to impose.

I'm also really bothered by the fact that many IPs may reduce the per-IP timeout to such a small value that attempts start failing. Maybe the right word here isn't "accurate" or "correct", but you have to agree it's problematic.

I do agree that the alternative of giving 10 seconds to each IP is worse (the total timeout being 20 seconds), so between those two options you clearly chose the better one. I'm just saying there's the 3rd option of executing them in parallel, providing a "perfect" timeout with respect to the issues raised above. There are indeed downsides (increased load on backend due to unused connections), which is why I agree we shouldn't necessarily implement what I proposed, at least not right away. I was looking more for recognition as to the problematic sides of the current implementation, which my proposal does solve.

Let's wait and see if anyone complains. At the end of the day there probably aren't many setups with multiple IPs for Postgresql backend hosts (but who really knows, especially with the cloud scenarios). If we get complaints, we can always add a connection string option to choose between the two behaviors...

Josh Cooley

unread,
Sep 21, 2014, 6:32:14 PM9/21/14
to Shay Rojansky, Glen Parker, npgsq...@googlegroups.com, Miłosz Kubański, Francisco Figueiredo Jr.
Perhaps what is needed here is an additional connection string setting specifying what to do when you have multiple IP addresses for the given name.

I think treating the given value as a maximum time limit is a good goal, but we may want to let the developer or "connection string admin" set the desired behavior.  Possibly limiting the number of servers to attempt to connect to would limit the division of the time.

Josh

Miłosz Kubański

unread,
Sep 22, 2014, 3:54:17 AM9/22/14
to npgsq...@googlegroups.com, glen...@gmail.com, milo...@gmail.com, fran...@npgsql.org
"...I'm not sure I see the issue... True, we'd have the "I need to talk to you. Oh, uh, never mind" scenario, but is that a problem? I don't see anything wrong in going ahead with queries on the successful connection, while the others get asynchronously dumped in the background (except, again, for slightly more load)..."

Well, as someone wrote a few messages ago, Npgsql users (especially corporations) can have complex environments. For example, my client has > 500 applications which export data to their clients. So it's possible there are other users with many more client applications, and they may also have a few IPs behind a host name. Their applications may work correctly right now, but could develop problems if we take this approach (if I understood correctly, opening more connections than we need): we'd have a lot of connections on the server side, consuming a lot of memory and CPU unnecessarily.

Regards,
MiloszeS

Miłosz Kubański

unread,
Sep 22, 2014, 4:28:30 AM9/22/14
to npgsq...@googlegroups.com, glen...@gmail.com, milo...@gmail.com, fran...@npgsql.org
Another problem with unnecessary connections could be cost. For example, Microsoft has charged (I'm not sure how it works now) for system handle usage in their cloud DB. We will probably also see (maybe in the future) similar pricing for PostgreSQL-based systems...

I'm wondering what you'd like to do about the async name resolution. As has been shown, it isn't a perfect fit for every application. It would be much better if we didn't use the default thread pool. I think the best solution would be to resolve synchronously, or to give the user the choice of whether to resolve synchronously or asynchronously. Of course anyone can grab the code, change it and compile it for their own purposes, but that's not a good solution.


Best Regards

Shay Rojansky

unread,
Sep 22, 2014, 7:34:45 AM9/22/14
to Miłosz Kubański, npgsq...@googlegroups.com, glen...@gmail.com, Francisco Figueiredo Jr.
For the sake of making the discussion as clear as possible, here's a summary of the discussion at the moment:

Question #1: Sync vs. async DNS resolution
  1. Sync DNS resolution with the built-in resolver. This will not support any sort of timeout.
  2. Async DNS resolution (the current implementation) supports timeouts, but has some scalability issues when attempting a large number of connections in parallel, because of strain on the thread pool.
  3. Sync DNS with an external implementation. There may be an external DNS implementation out there that supports a timeout (e.g. http://www.codeproject.com/Articles/23673/DNS-NET-Resolver-C), but it seems like huge overkill to add an external dependency just for this. If we can compile the source code for the DNS resolver into Npgsql it may be an option (but licensing may be an issue).
I'm not aware of a viable #3 (although I haven't checked very well). Between #1 and #2 I think #2 is clearly the better one. It's worth noting that if you scale enough you'll start to get performance issues (and therefore also timeouts) regardless of your DNS implementation. Async APIs are in use in many different applications across the world; and the extreme load scenario described above, although valid for them too, doesn't cause a mass abandonment of async...

I'd recommend the user implement a batching solution, where only X connections are attempted at once, etc.

Question #2: ConnectionTimeout meaning when multiple IPs are involved
  1. Apply the (remaining) timeout to each IP separately. This means the total connection time could be longer than ConnectionTimeout.
  2. Divide the (remaining) timeout among the IPs (current implementation). This shortens the individual wait based on the number of IPs, and could lead to failed connections if the number of IPs is large.
  3. Connect to all IPs in parallel. This provides the full time to each IP (like #1) but without making the total connection time longer. On the downside there's increased server load due to unused connections.
There's obviously no perfect solution here, so I'd say leave it at #2 until we get any sort of complaint from users. At that point we can add a connection string parameter to allow the user to choose between #2 and #3, as Josh suggested.
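Option #2 from Question #1 can be sketched with the Task-based DNS API available since .NET 4.5 (the helper is illustrative, not Npgsql's implementation):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

static class DnsSketch
{
    // Async DNS raced against a delay. Note the lookup itself still runs
    // on a thread-pool thread internally, which is exactly the
    // scalability concern discussed in this thread.
    public static async Task<IPAddress[]> ResolveWithTimeoutAsync(string host, TimeSpan timeout)
    {
        var lookup = Dns.GetHostAddressesAsync(host);
        var first = await Task.WhenAny(lookup, Task.Delay(timeout));
        if (first != lookup)
            throw new TimeoutException("DNS resolution of '" + host + "' timed out");
        return await lookup; // propagates any resolution error
    }
}
```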


Shay Rojansky

unread,
Oct 12, 2014, 5:35:07 AM10/12/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Note virtually identical discussion on github: https://github.com/npgsql/npgsql/issues/376

The workaround mentioned by the user is either to preconfigure the thread pool to allocate more threads, corresponding to the anticipated load burst, or in an extreme case to use a custom pool (I haven't investigated this yet).
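The first workaround might be sketched like this (the helper name and the minimum value are illustrative; `ThreadPool.SetMinThreads` is the standard .NET API involved):

```csharp
using System.Threading;

static class PoolWarmup
{
    // Raise the thread pool's minimum so a burst of async DNS work doesn't
    // wait for the pool's slow thread-injection heuristic to catch up.
    public static void RaiseMinWorkerThreads(int min)
    {
        int worker, io;
        ThreadPool.GetMinThreads(out worker, out io);
        if (worker < min)
            ThreadPool.SetMinThreads(min, io); // returns false if min exceeds the max
    }
}
```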

Miłosz Kubański

unread,
Oct 13, 2014, 3:31:11 AM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Hi.

I think allocating unnecessary threads is the wrong approach, because you're using far more resources than you need. You'll use much more memory (if I remember correctly, each thread allocates around 1MB for its stack), and you'll also burn CPU on context switching. If you run a few applications which use Npgsql, you'll simply run out of memory just starting them up, and may crush system performance.
If you want to stay with async name resolution, the better option is to create a separate thread pool. I talked about this with Francisco via email and forgot to mention it on this group. IMHO this thread pool should be reserved for DNS name resolution only and shouldn't be shared with, for example, socket read/write, because we could have similar problems (more complicated operations would starve the simple ones).


Best Regards,
Miłosz

Shay Rojansky

unread,
Oct 13, 2014, 3:59:46 AM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Hi Miłosz.

Nobody is happy about using async for the DNS lookup - it's being done only because the sync version has no timeout. If you can find a good DNS client implementation that supports a timeout, I think we'll gladly use it.

Having said that, we're not exactly talking about allocating unnecessary threads just for this task; the thread pool is a standard part of .NET and is used in many operations across many applications. It's natural that large load/stress circumstances should require a larger pool. Sure, the context switch incurs a bit more CPU, but we're talking about a single switch that happens when performing a database connection - which is by nature an expensive, lengthy operation (roundtrips, authentication...).

One more workaround to consider is simply to preallocate connections on application startup. If you want to support 100 connections, open them at your leisure on startup and keep them around in Npgsql's pool. Then when requests actually come they will already be open.
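That pool-warming workaround might look like this (a sketch only; the connection string is a placeholder, it requires a live PostgreSQL server, and it assumes Npgsql's default pooling is enabled):

```csharp
using System.Collections.Generic;
using Npgsql;

static class ConnectionWarmup
{
    // Open the expected number of connections at startup, then return them
    // to Npgsql's pool by closing them; later Open() calls reuse them.
    public static void Warm(string connString, int count)
    {
        var opened = new List<NpgsqlConnection>();
        for (var i = 0; i < count; i++)
        {
            var conn = new NpgsqlConnection(connString); // e.g. "...;Pooling=true;MaxPoolSize=100"
            conn.Open();   // DNS, TCP and auth all happen here, at our leisure
            opened.Add(conn);
        }
        foreach (var conn in opened)
            conn.Close();  // returns the physical connection to the pool
    }
}
```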

Despite all the above, Glen, Josh: with all the complaints about this, maybe we should just add a connstring param called SyncDns, through which the user acknowledges that they may potentially get stuck for longer than the timeout? This way high-load users who don't care about the DNS timeout can be happy.

Shay

Miłosz Kubański

unread,
Oct 13, 2014, 6:15:47 AM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Hi Shay.


I think the SyncDns switch idea would be best for developers (like me) who need to create high-load applications. It resolves all my efficiency problems.


Sorry, I used a mental shortcut :). Starting from the beginning: you're right that the best approach would be to extend the connection pool per application. If someone has only a few applications, that probably solves the problem. The problem arises when you have many more applications which need to use the same DB, some high-load and some not. Because good practice is to keep max connections on the server side at up to 300-500 (I don't want to complicate things by talking about system configuration, DB replication, balancing, etc.), you cannot allow too many applications to use too many connections to the database. Besides, if all operations are handled by the general pool, you need to create the connections just before the application comes under high load. That's simple when you can, and want to, use the minconn parameter, but it becomes more problematic when you want dynamically allocated connections that must be created while the app is under high load. In that case resolution may never complete, because the thread resolving DNS can be starved by threads loading large amounts of data, or by a method which does a lot of work and forgets to give the CPU a breather so other threads can also run. So I think putting very quick operations in the general-purpose TP could be a problem for people designing applications, because they need to know this issue exists. When you add, for example, an ORM persistence layer or some strange 3rd-party libraries into the equation, it can be very hard to track down slow queries, because at the beginning you don't know exactly (at a low level) how all the features are implemented in 3rd-party code. People who don't know about this can lose a lot of time finding the real problem (if they don't find the appropriate help on groups/web sites).
From my experience, it's not hard to run into this problem (under heavy load). If you have a WCF service, it's enough for it to become much more popular. In other, more complicated applications, a colleague can optimize some mechanism using many threads, and the issue may only reveal itself a few months after the changes are published, when some specific data comes into the system.

If we want to use a separate TP, we should also ensure it is used only for tasks we're sure execute very quickly, roughly as quickly as hostname resolution. Otherwise we can still have thread-starvation problems.


I'm sorry if you think I grumble too much. I think you've done, and are still doing, a great job with Npgsql. 3-4 years ago we compared Npgsql with some commercial driver, and Npgsql had a great advantage over the commercial one (I don't remember its name now) - and that was before the many optimisations made since then. So thanks for a great job :)

I'm just trying to give you another perspective. I don't want to do what my colleagues from other companies do - maintaining their own forks and carrying their changes across different Npgsql versions - because that doesn't make sense.


OK, I've written too much and perhaps dragged out this discussion unnecessarily.


Summarizing, from my point of view I think that:

  • a switch which disables async DNS is a very good idea, regardless of how async ends up being implemented.
  • a separate thread pool for very quick tasks (provided we're sure they really are quick) is also a good idea.

Emil Lenngren

unread,
Oct 13, 2014, 7:51:28 AM10/13/14
to Miłosz Kubański, glen...@gmail.com, npgsq...@googlegroups.com, fran...@npgsql.org

Just a thought: the user could enter an IP address to skip the DNS lookup. Or, if our method of resolving isn't good enough, the user can do the DNS lookup manually and pass the IP address to Npgsql...

--

Miłosz Kubański

unread,
Oct 13, 2014, 8:01:36 AM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Static IP configuration is not an option in a big virtualized environment. An admin will kill you for such an approach - or your clients will, when the admin changes the configuration ;).

A manual DNS lookup is also not a solution if you're using a persistence layer.

Francisco Figueiredo Jr.

unread,
Oct 13, 2014, 8:18:04 AM10/13/14
to Shay Rojansky, npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com


On Monday, October 13, 2014, Shay Rojansky <ro...@roji.org> wrote:
Hi Miłosz.

Nobody is happy about using async for the DNS lookup - it's being done only because the sync version has no timeout. If you can find a good DNS client implementation that supports a timeout I think we'll glad use it.

Having said that, we're not exactly talking about allocating unnecessary threads just for this task; the .thread pool is a standard part of .NET and is used in many operations across many applications. It's natural that large load/stress circumstances should require a larger pool. Sure, the context switch incurs a bit more CPU, but we're talking about a single switch that happens when performing a database connection - which is supposed by its nature to be an expensive, lengthy operation (roundtrips, authentication...).

One more workaround to consider, is simply to preallocate connections on application startup. If you want to support 100 connections, open them at your leisure on startup and keep them around in Npgsql's pool. Then when requests actually come they will already be open.

Despite all the above, Glen, Josh, with all the complaints about this maybe we should just add a connstring param called SyncDns, through which the user acknowledges that they will potentially get stuck for more than the timeout? This way high-load users which don't care about the DNS timeout can be happy?



+1 to add a connection string parameter to allow the user to specify if she wants a sync or async DNS lookup. 

Another idea I remember seeing in the async connection discussion on GitHub was to use a timeout-able block of code for the sync DNS path. This way, if the DNS takes too long, the code times out, aborting the sync DNS query, and we continue to honor the user's timeout settings.

I think Milosz won't be the only person with this problem, and the connection string setting or the timeout-able approach would help solve it.

I know we would like the async approach to work flawlessly all the time, as it was supposed to, but unfortunately we found this thread pool limitation which makes it fail sometimes. And enlarging the thread pool - again, per the GitHub discussion - isn't a trivial configuration change, and may not be possible in all server deployments which use Npgsql.

Another point is that one of the objectives of Npgsql is to be enjoyable to work with. It should be part of the solution, not of the problem. And if we can help avoid this type of problem, or avoid requiring changes to uncommon settings and configurations, I think we should try to do it.

I'm sorry if I'm being too simplistic or talking nonsense.


 
Shay

On Monday, October 13, 2014 9:31:11 AM UTC+2, Miłosz Kubański wrote:
Hi.

I thing the allocation unnecessary threads is very wrong attitude, because you're using much more resources than you need. You'll use much more memory (If I remember correctly each thread allocates around 1MB for a stack) as well as you'll use to much CPU for the context switching. If you run a few applications which uses the npgsq, you'll simply run out of memory just for the application start and may crush the system performance. 
If you want to stay with async resolve name the better option is to create the separate thread pool. I talk about this with Francisco via email and forgot to mention it on this group. IMHO this thread pool should be separated only for a dns resolve name and shouldn't be shared for example with socket read/write, because we can have the similar problems (more complicated operations will cause the simple operations to starve). 


Best Regards,
Miłosz


On Sunday, October 12, 2014 at 11:35:07 AM UTC+2, Shay Rojansky wrote:
Note virtually identical discussion on github: https://github.com/npgsql/npgsql/issues/376

Workaround mentioned by the user is to either preconfigure the thread pool to allocate more threads, to correspond with the load burst anticipated, or in an extreme case to use a custom pool (haven't investigated this yet).


Shay Rojansky

unread,
Oct 13, 2014, 12:17:23 PM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Hi Miłosz, no problem for the long email and thanks for the positive feedback.

I ultimately agree with almost everything you say. Your options are currently:
  • Pre-open *database* connections to match burst requirements. As you say this may impose too much load on the database itself.
  • Configure the .NET TP for minimum threads (as cbenien proposed in #376). This will make new connections open quickly under burst, since threads will be available for the async DNS. The downside here is thread-related resources, which frankly doesn't seem like a huge deal to me (but I admit it's a factor). An additional point is that since TP threads are a common resource, they may be in use just when you need them, so this makes connection opening less deterministic.
  • Increase the Npgsql connection timeout. Not an amazing solution, since under burst connections will open quite slowly (although without exceptions).
I'm guessing nobody will be against adding the SyncDns option as well - I'll open an issue for that and wait for Glen's and Josh's feedback.
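The second option above can be sketched as follows (the figure 100 is purely illustrative; size it to the anticipated burst):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Sketch of the workaround proposed in npgsql/npgsql#376: raise the
        // ThreadPool's minimum thread count so a burst of async DNS work
        // items is served immediately instead of waiting for the pool's
        // gradual thread injection.
        int workerMin, ioMin;
        ThreadPool.GetMinThreads(out workerMin, out ioMin);
        Console.WriteLine("defaults: worker={0}, io={1}", workerMin, ioMin);

        // 100 is an illustrative value, not a recommendation.
        ThreadPool.SetMinThreads(100, ioMin);
    }
}
```

As noted above, this trades thread-related resources for more deterministic connection-open latency under burst.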

Emil: Passing an IP rather than a hostname circumvents the whole problem, but I agree with Miłosz that it's not feasible in many situations.

Francisco, you wrote:

> Another idea I remember I saw in the async connection discussion in the github was to use a timeout-able block of code for the sync DNS code? This way, if the DNS takes too long, the code times out aborting any sync DNS query and we continue to honor the user timeout settings. 

The problem with the timeout-able block of code, is that... it requires a thread :) If you want to run a code block with a timeout, the way to do this is to send it to the thread pool (or allocate your own new thread), and we're back at the original problem. I agree with everything else.
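To illustrate why: the only way to bound a synchronous Dns call with a timeout is to run it on some other thread and abandon it when the deadline passes. A sketch (the method and parameter names are invented for illustration):

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

class DnsTimeoutSketch
{
    // Bounding a synchronous DNS call still consumes a thread for the
    // duration of the lookup; on timeout the lookup keeps running,
    // orphaned, until the OS resolver gives up.
    static async Task<IPAddress[]> ResolveWithTimeout(string host, TimeSpan timeout)
    {
        var resolveTask = Task.Run(() => Dns.GetHostAddresses(host)); // needs a thread
        var finished = await Task.WhenAny(resolveTask, Task.Delay(timeout));
        if (finished != resolveTask)
            throw new TimeoutException("DNS resolution exceeded the connection timeout");
        return await resolveTask;
    }
}
```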

Shay Rojansky

unread,
Oct 13, 2014, 12:22:12 PM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org

Francisco Figueiredo Jr.

unread,
Oct 13, 2014, 12:36:55 PM10/13/14
to npgsq...@googlegroups.com, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org


On Monday, October 13, 2014 1:17:23 PM UTC-3, Shay Rojansky wrote:
Hi Miłosz, no problem for the long email and thanks for the positive feedback.

I ultimately agree with almost everything you say. Your options are currently:
  • Pre-open *database* connections to match burst requirements. As you say this may impose too much load on the database itself.
  • Configure the .NET TP for minimum threads (as cbenien proposed in #376). This will make new connections open quickly under burst, since threads will be available for the async DNS. The downside here is thread-related resources, which frankly doesn't seem like a huge deal to me (but I admit it's a factor). An additional point is that since TP threads are a common resource, they may be in use just when you need them, so this makes connection open less deterministic.
  • Increase the Npgsql connection timeout. Not an amazing solution, since under burst connections will open quite slowly (although without exceptions).
I'm guessing nobody will be against adding the SyncDns option as well - I'll open an issue for that and wait for Glen's and Josh's feedback.


+1

 
Emil: Passing an IP rather than a hostname circumvents the whole problem, but I agree with Miłosz that it's not feasible in many situations


I think we should add a note in the documentation about that. This way, if the user is facing any problem with DNS resolution, the note can serve as a reminder to use a pre-resolved address.


 
Francisco, you wrote:

> Another idea I remember I saw in the async connection discussion in the github was to use a timeout-able block of code for the sync DNS code? This way, if the DNS takes too long, the code times out aborting any sync DNS query and we continue to honor the user timeout settings. 

The problem with the timeout-able block of code, is that... it requires a thread :) If you want to run a code block with a timeout, the way to do this is to send it to the thread pool (or allocate your own new thread), and we're back at the original problem. I agree with everything else.


I see. I was only thinking about avoiding the TP and using our own new thread to make the DNS calls. And probably creating a new thread only for this purpose is overkill :)


 

Josh Cooley

unread,
Oct 13, 2014, 1:43:06 PM10/13/14
to Shay Rojansky, npgsq...@googlegroups.com, Miłosz Kubański, Glen Parker, Francisco Figueiredo Jr.
I was initially opposed to a connection string option since I thought we had a bug to fix.  I'm not excited about this feature, but I have begun to think it is a necessary change.  Given that the current solution relies on the thread pool for async DNS, and the newer async call appears to use the I/O completion port pool, it should work.  Ideally, we shouldn't be using the regular thread pool unless asked, but there's currently no option to do this differently.

Josh


Shay Rojansky

unread,
Oct 13, 2014, 6:23:44 PM10/13/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Note the very interesting comments by emill here: https://github.com/npgsql/npgsql/issues/376#issuecomment-58952568, and especially the informative article he links to about the DNS client.

I really don't want to start the whole discussion again, but as emill says, in light of the Windows DNS timeout mechanism (and its configurability in the registry) it may really be an option to say that the whole DNS timeout concern is external to Npgsql and let the user tweak it as they see fit. It would mean the connection timeout isn't fully managed within Npgsql (since it depends on the external DNS client), but that may be the least bad option here.

Otherwise we can add the SyncDNS flag.

Emil Lenngren

unread,
Oct 13, 2014, 7:24:32 PM10/13/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
I've checked how other database providers do it.
All the ones I checked (the PostgreSQL JDBC driver, the MySQL .NET Connector, and Mono's SqlClient) use the system's library function to synchronously resolve DNS names without any explicit timeout. The timeout is only applied to the TCP connection itself.

.NET's BeginGetHostAddresses() is "fake-async", i.e. it simply offloads a sync GetHostAddresses() to a worker thread.

Since offloading DNS resolution to a worker thread solely for timeout purposes seems to introduce bugs and trouble, I think these are the only options:

The ADO.NET docs don't give any details about the connection timeout parameter's effect on DNS resolution; they simply say "Indicates how long to wait while establishing a connection before terminating the attempt and generating an error."
I would vote for either 1 or 3. Option 2 seems too risky, and it feels like overkill to use a custom DNS resolver only to be able to supply accurate DNS timeouts.
Since a PostgreSQL application usually connects to only one distinct hostname (at least I think so), the thread pool can probably be very small. DNS entries can also be cached.

Miłosz Kubański

unread,
Oct 14, 2014, 5:29:12 AM10/14/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
Hi all.

Currently my only option was to fork the code and change the DNS resolution to sync. My environment is much more complicated than I described. One of my team's products is a DB access library (a facade that automates many things) which is used internally by around 500 export applications. I don't know exactly where or how many apps run at the same time (there are business rules about when they run and how they behave), but in the PostgreSQL server status I see a lot of connections. That's why in my situation I cannot just add more threads, add too many connections to a pool, etc., and I need to be very careful about efficiency with each change. We are also using an ORM (LLBLGen), which further complicates the picture in this case (though of course it simplifies a lot of other things). So for the DNS tweak, in my case the fork is the only option for a while.

The following text is just me thinking out loud. You don't need to agree or disagree ;)
I'm not sure the whole DNS timeout is a game worth the candle. As Emil wrote in his last message (and as was said a few times in this discussion), the resolved name is cached by every DNS server along the way, as well as by the operating system (specifically, the DNS Client service), including servers that didn't originally know the host name. So the first call can take ~20 ms to resolve, or in my case, for a strange name, at most ~220 ms. But the second call took <1 ms: the stopwatch reported 0 ms. I've tested many times how this works, and it works perfectly. A few months ago I changed Npgsql a little in my fork, with Francisco's help, to log name-resolution times, and they were 0 ms all the time. Maybe this is the reason why other DB connectors don't bother.
If you really do need to add the TP, I think one thread in that pool is enough. As I said before, resolution should take ~0 ms most of the time. You can run tests with sync name resolution and check the average resolution time (as well as the min and max) to verify this.

But as I said, in my opinion the game is not worth the candle. Usually applications and DBs are on the same internal (sometimes virtual) network, so the DNS server that can resolve the name will probably be the first one encountered (and after the first call the resolved name will be available in the local cache until it is flushed or simply expires). So the resolution time will certainly be very small. IMHO it would be enough to check whether the timeout constraint was exceeded during DNS resolution and inform the user that something is probably wrong with their network; after that, it's an administrator's job to fix it. In that case, as someone said earlier in this thread, it would also be good to state in the documentation that the timeout parameter doesn't cover DNS name resolution, and everything should be clear.


These were just my thoughts, and I'll be more than happy if I get synchronous host name resolution somehow, no matter whether it's settable by some switch or not.



Have a nice day :).
Miłosz

Miłosz Kubański

unread,
Oct 14, 2014, 5:41:48 AM10/14/14
to npgsq...@googlegroups.com, ro...@roji.org, milo...@gmail.com, glen...@gmail.com, fran...@npgsql.org
I forgot to include the test code showing that the DNS name is cached.


Stopwatch sw = new Stopwatch();

// First lookup: goes out over the network (or hits the OS cache if warm),
// so it can take tens or hundreds of milliseconds.
sw.Start();
string hostName = "abracadabra.com";
var s = Dns.GetHostEntry(hostName);
sw.Stop();
var el = sw.ElapsedMilliseconds;

// Second lookup for the same name: served from the DNS Client cache,
// so it should report ~0 ms.
sw.Restart();
s = Dns.GetHostEntry(hostName);
sw.Stop();
var el2 = sw.ElapsedMilliseconds;

Console.WriteLine(el);
Console.WriteLine(el2);
Console.ReadKey();

I also found a small thing in the Npgsql code: you're calculating elapsed time using the DateTime class. If I'm correct (this was the case in older .NET Framework versions; I haven't checked newer ones), there is an optimization in that class which leads to inaccuracies when measuring millisecond intervals. The suggested class for such operations (counting milliseconds) is Stopwatch (http://msdn.microsoft.com/en-us/library/System.Diagnostics.Stopwatch(v=vs.110).aspx). But this is only a suggestion :)
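A short illustration of the difference: DateTime.Now reads the system clock, which on Windows typically advances in coarse ticks (around 10-15 ms), while Stopwatch uses the high-resolution performance counter. The sketch below is illustrative; the exact numbers printed will vary by machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimerResolutionDemo
{
    static void Main()
    {
        // Measuring a ~1 ms sleep with DateTime can report 0 ms or ~15 ms,
        // because the system clock advances in coarse ticks.
        DateTime before = DateTime.Now;
        Thread.Sleep(1);
        double coarse = (DateTime.Now - before).TotalMilliseconds;

        // Stopwatch is backed by the high-resolution performance counter
        // and tracks short intervals accurately.
        Stopwatch sw = Stopwatch.StartNew();
        Thread.Sleep(1);
        sw.Stop();
        double precise = sw.Elapsed.TotalMilliseconds;

        Console.WriteLine("DateTime:  {0} ms", coarse);
        Console.WriteLine("Stopwatch: {0} ms", precise);
    }
}
```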


Best Regards,
Miłosz