2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use in backend: error code 10038
and for each message printed, a new postgres process is created. To make
things worse, those processes do not die when I stop the service.
I use sysinternals tcpview to monitor my sockets. I know that no other
process is using 1880. Each started postgres process will occupy two
seemingly random ports that apparently form a loop somehow. This is a
typical entry:
<non-existent>:3136 TCP 127.0.0.1:1554 127.0.0.1:1555 ESTABLISHED
<non-existent>:3136 TCP 127.0.0.1:1555 127.0.0.1:1554 ESTABLISHED
The weird thing is that there is no process with pid 3136 (hence the
name <non-existent>). There is a postgres process with another pid in my
process listing. If I kill that, the <non-existent> entries go away.
Looks like pid 3136 is talking to itself. A pipe() followed by failure
to start the new process perhaps?
Regards,
Thomas Hallgren
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Do you by any chance run any antivirus or firewall software? If so, can
you try removing it (note! actual uninstall, not just disabling it!)
//Magnus
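Nope, no anti-virus and no firewall (other than the box that
fronts my home-network to the outside world).

- thomas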
That's from postmaster.c:write_inheritable_socket(). Error 10038 is
WSAENOTSOCK. Very odd, time to get out the debugger? Get a backtrace at
least.
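For readers who want to decode such numbers in a log, here is a minimal sketch of a lookup table. The numeric values are the standard Winsock error codes; the helper name wsa_error_name and the table layout are hypothetical and are not part of PostgreSQL.

```c
#include <stddef.h>

/* One entry per Winsock error code we care to recognise. */
struct wsa_error_entry {
    int code;
    const char *name;
};

/* A small, deliberately incomplete selection of Winsock error codes. */
static const struct wsa_error_entry wsa_error_table[] = {
    {10009, "WSAEBADF"},
    {10022, "WSAEINVAL"},
    {10024, "WSAEMFILE"},
    {10038, "WSAENOTSOCK"},
    {10093, "WSANOTINITIALISED"},
};

/* Hypothetical helper: symbolic name for a code, or "unknown". */
const char *wsa_error_name(int code)
{
    size_t i;

    for (i = 0; i < sizeof(wsa_error_table) / sizeof(wsa_error_table[0]); i++)
        if (wsa_error_table[i].code == code)
            return wsa_error_table[i].name;
    return "unknown";
}
```

Calling wsa_error_name(10038) with the code from the log line yields the WSAENOTSOCK name mentioned above, i.e. the handle passed in was not recognised as a socket.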
Hope this helps,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Anyway. The netstat indicates that the pipe() call works. The order is
pretty much:
parent: create socket pair, connected to each other.
parent: Duplicate socket [this is what fails]
parent: close own copy of socket
child: recreate socket from structure [this is never called, thus the
new socket is never "attached" to a process]
Now *why* it's doing this, I have no idea.
Questions:
1) Does it actually work? ;-) And just logs the error anyway?
2) Does this happen on *every* connection?
3) Can you reproduce this on a different machine, or just one?
//Magnus
> -----Original Message-----
> From: Thomas Hallgren [mailto:th...@mailblocks.com]
> Sent: Thursday, September 29, 2005 9:48 AM
> To: Magnus Hagander
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Socket problem using beta2 on Windows-XP
>
> Nope, no anti-virus and no firewall (other than the box that
> fronts my home-network to the outside world).
>
> - thomas
>
> Magnus Hagander wrote:
>
> >>Hi,
> >>I've installed PostgreSQL 8.1-beta2 as a service on my Windows-XP box.
> >>It runs fine but I get repeated messages like this in the log:
> >>
> >> 2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use
> >>in backend: error code 10038
It sounds like you know where it happens. Martijn requested a
stacktrace. Do you still need that? If you do, I'll try to get some time
over this weekend.
Regards,
Thomas Hallgren
> 2. It happens while the postmaster is idle. If I leave it idle for a
> while and then come back, I'll have a whole bunch of new processes in my
> task-manager and zombies in tcpview.
Hmm ... how many processes? Did you enable autovacuum perchance? If
so, does the number of processes correspond approximately to the
"autovacuum_naptime"?
--
Alvaro Herrera http://www.advogato.org/person/alvherre
"La espina, desde que nace, ya pincha" (Proverbio africano)
IIRC, the win32 installer will enable autovacuum by default. And yes,
autovacuum was my first thought as well after Thomas's last mail - that
would be a good explanation for why it happens when the postmaster is
idle.
//Magnus
>IIRC, the win32 installer will enable autovacuum by default. And yes,
>autovacuum was my first thought as well after Thomas's last mail - that
>would be a good explanation for why it happens when the postmaster is
>idle.
>
>
I used the win32 installer defaults so autovacuum is probably a safe
assumption.
- thomas
Right. Please try turning it off and see if the problem goes away.
//Magnus
>Right. Please try turning it off and see if the problem goes away.
>
>
It does (go away).
- thomas
>Right. Please try turning it off and see if the problem goes away.
>
>
No, wait! It does *not* go away. Do I need to do anything more than
setting this in my postgresql.conf file:
autovacuum = false # enable autovacuum subprocess?
and restart the service?
The two zombie entries occur directly when I start the service, then
there are two new entries popping up every minute.
- thomas
Yes, that should be enough.
Hmm. Weird!
If you can get a backtrace from the point where the error msg shows up,
that certainly would help - this means it's not coming from where we
thought it was coming from :-(
//Magnus
If it's two zombies per minute, then I bet it's the stat collector and
stat bufferer. They are restarted by the postmaster if not found to be
running.
The weird thing is that the postmaster _should_ call wait() for them if
it detects that they died (when receiving a SIGCHLD signal AFAIR). If
it doesn't, maybe it indicates there's a problem with the signal
handling on Win32.
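To illustrate what "call wait() for them" means, here is a minimal POSIX-only sketch, not postmaster code. It forks a few children that exit at once and then reaps each one with waitpid(), which is what keeps them from lingering as zombies. The helper name spawn_and_reap is hypothetical, a real server would do the reaping from its SIGCHLD handling rather than in a blocking loop, and nothing here reproduces how the Win32 port emulates signals.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork n children that exit immediately, then reap
 * every child of this process with waitpid(). Returns the number of
 * children reaped, or -1 if fork() fails.
 */
int spawn_and_reap(int n)
{
    int reaped = 0;
    int i;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);       /* child: terminate at once */
    }

    /* Parent: collect exit statuses until no children remain. */
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);

        if (pid > 0)
            reaped++;
        else if (errno == EINTR)
            continue;       /* interrupted by a signal, try again */
        else
            break;          /* ECHILD: nothing left to wait for */
    }
    return reaped;
}
```

If the parent skipped the waitpid() loop, the exited children would stay in the process table until the parent itself exited, which is the kind of leftover being speculated about here.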
--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 17.7", W 73º 14' 26.8"
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)
That would make some sense, because the stat processes need to set up new
sockets (for the pipe between them). The autovacuum theory didn't hold
any water in my eyes because autovacuum doesn't create any new sockets.
However, why two zombies? That would mean that the grandchild process
started, which should mean that the pipe was already created ...
Does Windows have any equivalent of strace whereby we could watch what's
happening during stats process launch?
regards, tom lane
First of all, I won't be able to dig into this any more until next week
- sorry about that. But others are always free to :-)
There is no strace equivalent built in, but you can get an add-on from
http://www.bindview.com/Services/RAZOR/Utilities/Windows/strace_readme.cfm.
Don't put it on a production box permanently, though; it tends to
cause BSODs in some cases.
//Magnus
>However, why two zombies? That would mean that the grandchild process
>started, which should mean that the pipe was already created ...
>
>
To clarify, I am talking about the tcpview window and its connections,
and thus zombie connections. They both belong to the same pid and seem to
point to each other. The actual process no longer exists (it can't be
viewed anywhere).
Regards,
Thomas Hallgren
>On Thu, Sep 29, 2005 at 08:50:30AM +0200, Thomas Hallgren wrote:
>>Hi,
>>I've installed PostgreSQL 8.1-beta2 as a service on my Windows-XP box.
>>It runs fine but I get repeated messages like this in the log:
>>
>> 2005-09-29 00:41:09 FATAL: could not duplicate socket 1880 for use
>>in backend: error code 10038
>
>That's from postmaster.c:write_inheritable_socket(). Error 10038 is
>WSAENOTSOCK. Very odd, time to get out the debugger? Get a backtrace at
>least.
I finally managed to debug the postmaster and I'm now pretty sure the
message is not from the postmaster itself. I put a breakpoint where the
message is printed (postmaster.c:3762) and in errstart() where elevel >=
ERROR (elog.c:152) but I never get there although the message is
printed. I know that my debugger works because if I put a break on
elog.c:194 it stops for other messages.
Regards,
Thomas Hallgren
I added some traces to the code. I know that the following happens
when I start a postmaster.
StartupDatabase will call internal_fork_exec, which calls
write_inheritable_socket 4 times and succeeds.
During the first iteration of ServerLoop:
StartBackgroundWriter will call internal_fork_exec and succeed.
pgstat_forkexec will call internal_fork_exec and succeed.
In the second iteration of ServerLoop, pgstat_forkexec will again call
internal_fork_exec. This time it fails.
According to the log it fails on line:
write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid);
i.e. on its second call to write_inheritable_socket. The failure is in a
postgres.exe process, not postmaster.exe (and that's why I can't debug
properly on Windoz).
Hope this helps.
Regards,
Thomas Hallgren
<snip>
> In the second iteration of ServerLoop, pgstat_forkexec will again call
> internal_fork_exec. This time it fails.
> According to the log it fails on line:
>
> write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid);
Well, pgStatSock is the only SOCK_DGRAM socket, all the others are
SOCK_STREAM, maybe that's the difference? It's also connected to
itself, although for DGRAM sockets that's not that special.
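As a minimal sketch of what "connected to itself" means for a datagram socket, the following POSIX code binds a UDP socket to a loopback port chosen by the kernel, connects it to its own address, and sends one byte that it then reads back. This is an illustration under assumptions, not the pgstat code: the helper name udp_self_loop_works is hypothetical, and the Windows build would go through Winsock rather than these headers.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a UDP socket on the loopback interface,
 * connect it to its own address, and pass one byte through it.
 * Returns 1 if the byte comes back, 0 on any failure.
 */
int udp_self_loop_works(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct timeval tv = {2, 0};     /* do not block forever in recv() */
    char out = 'x';
    char in = 0;
    int ok = 0;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel choose a free port */

    /* bind, learn the assigned port, then connect to that same address */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *) &addr, &len) == 0 &&
        connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0 &&
        send(fd, &out, 1, 0) == 1 &&
        recv(fd, &in, 1, 0) == 1 &&
        in == out)
        ok = 1;

    close(fd);
    return ok;
}
```

Unlike a SOCK_STREAM connection there is no peer socket and no handshake here; connect() on a datagram socket only fixes the default destination, which is why a single socket can talk to itself.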
The documentation isn't totally clear about this. Also, the error thrown
should terminate the process, yet it obviously doesn't. Very odd.
Any Windows programmers with ideas?
Regards,
Thomas Hallgren
Thomas Hallgren wrote:
> I added some traces to the code. I know that the following happens
> when I start a postmaster.
>
> StartupDatabase will call internal_fork_exec, it calls
> write_inheritable_socket 4 times and succeeds.
>
> During the first iteration of ServerLoop:
> StartBackgroundWriter will call internal_fork_exec and succeed.
> pgstat_forkexec will call internal_fork_exec and succeed.
>
> In the second iteration of ServerLoop, pgstat_forkexec will again
> call internal_fork_exec. This time it fails.
> According to the log it fails on line:
>
> write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid);
>
> i.e. on its second call to write_inheritable_socket. The failure is in
> a postgres.exe process, not postmaster.exe (and that's why I can't
> debug properly on Windoz).
>
> Hope this helps.
>
> Regards,
> Thomas Hallgren
> With great help from Magnus, who advised me to use lspfix from cexx.org
> to list my LSPs, I found that I had gapsp.dll, "Neoteris DNS Provider"
> installed. An uninstall of the Neoteris software made this problem go away.
I guess the question is, why is "DNS Provider" software blocking
socket creation? Is there a way we could work around that?
--
Alvaro Herrera Architect, http://www.EnterpriseDB.com
"El destino baraja y nosotros jugamos" (A. Schopenhauer)
It's just another version of the "Broken LSP" that we've been having
problems with before. But before, it's only been AV and firewall stuff.
I guess they somehow put an LSP in there to intercept DNS packets or
something. Completely broken design IMHO, but that's a different thing
;-) And they apparently don't support socket inheritance. The only way
we can work around them breaking the concept of socket inheritance is to
stop using it. Which would mean going multithreaded instead of
multiprocess, which isn't very likely...
To reiterate the basic point: The broken LSP breaks a fundamental
promise in the sockets API that we absolutely require. The bug is
completely within the LSP.
//Magnus
We used to have this, but we removed it when we added the code that fixed
the problem in 95% of the cases. It's probably a good idea to bring it
back :-(
ISTM that maybe what we have here is a documentation shortcoming.
I'm thinking that our Windows FAQ ought to suggest troubleshooting
socket-related problems by removing LSPs one at a time.
regards, tom lane