[GENERAL] FATAL: lock file "postmaster.pid" already exists

deepak

May 7, 2012, 6:34:02 PM
Hi,

On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.

I tried rebooting a few times and even force shutting down the server, and it started up fine.
It seems to be a race-condition of sorts in the code that detects whether the process with PID
in the file is running or not.

Does anyone have this same problem?  Any way to fix it besides removing the PID file
manually each time the server complains about this?



Thanks,
Deepak

Alban Hertroys

May 8, 2012, 3:09:35 AM
On 8 May 2012, at 24:34, deepak wrote:

> Hi,
>
> On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
>
> I tried rebooting a few times and even force shutting down the server, and it started up fine.
> It seems to be a race-condition of sorts in the code that detects whether the process with PID
> in the file is running or not.

No, it means that postgres wasn't shut down properly when Windows shut down. Removing the pid-file is one of the last things the shut-down procedure does. The file is used to prevent 2 instances of the same server running on the same data-directory.

If it's a race-condition, it's probably one in Microsoft's shutdown code. I've seen similar problems with Outlook mailboxes on a network directory; Windows unmounts the remote file-systems before Outlook finished updating its files under that mount point, so Outlook throws an error message and Windows doesn't shut down because of that.

I don't suppose that pid-file is on a remote file-system?

> Does anyone have this same problem? Any way to fix it besides removing the PID file
> manually each time the server complains about this?


You could probably script removal of the pid file if its creation date is before the time the system started booting up.
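A minimal sketch of that idea in Python, not a tested fix. The function names are hypothetical, the boot time has to be supplied by the caller (for example from a third-party helper such as psutil.boot_time()), and the file's modification time stands in for its creation date. Since a running postmaster refuses to start precisely because this file exists, only run something like this when you are sure no postgres process is alive.

```python
import os


def is_stale_pid_file(pid_path, boot_time):
    """Return True if pid_path exists and was last modified before boot_time.

    boot_time is seconds since the epoch; obtaining it is left to the caller.
    """
    try:
        modified = os.path.getmtime(pid_path)
    except OSError:
        return False  # no file (or unreadable): nothing to clean up
    return modified < boot_time


def remove_if_stale(pid_path, boot_time):
    """Delete the PID file only when it predates the last boot."""
    if is_stale_pid_file(pid_path, boot_time):
        os.remove(pid_path)
        return True
    return False
```

A file written after the machine came up is left alone, so a postmaster that started normally is not affected.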

Alban Hertroys

--
The scale of a problem often equals the size of an ego.



--
Sent via pgsql-general mailing list (pgsql-...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

deepak

May 8, 2012, 12:13:22 PM
On Tue, May 8, 2012 at 3:09 AM, Alban Hertroys <hara...@gmail.com> wrote:
> On 8 May 2012, at 24:34, deepak wrote:
>
> > Hi,
> >
> > On Windows 2008, sometimes the server fails to start due to an existing "postmaster.pid" file.
> >
> > I tried rebooting a few times and even force shutting down the server, and it started up fine.
> > It seems to be a race-condition of sorts in the code that detects whether the process with PID
> > in the file is running or not.
>
> No, it means that postgres wasn't shut down properly when Windows shut down. Removing the pid-file is one of the last things the shut-down procedure does. The file is used to prevent 2 instances of the same server running on the same data-directory.
>
> If it's a race-condition, it's probably one in Microsoft's shutdown code. I've seen similar problems with Outlook mailboxes on a network directory; Windows unmounts the remote file-systems before Outlook finished updating its files under that mount point, so Outlook throws an error message and Windows doesn't shut down because of that.
>
> I don't suppose that pid-file is on a remote file-system?

No, it's local.

> > Does any one have this same problem?  Any way to fix it besides removing the PID file
> > manually each time the server complains about this?
>
> You could probably script removal of the pid file if its creation date is before the time the system started booting up.

Thanks. It looks like the code already overwrites an old pid file if no other process is using it (if I understand the code correctly, it just echoes a byte onto a pipe to detect this).

Still, I can't see under what conditions this occurs. I have seen it happen a couple of times, but I don't know how to reproduce the problem predictably.


--
Deepak

deepak

May 16, 2012, 11:34:52 AM
Hi!

We could reproduce the start-up problem on Windows 2003. After a reboot, the postmaster cleans up old temporary files as part of its start-up sequence, and this step took several minutes (a little over 4 minutes), delaying the writing of line 6 onwards into the PID file. This delay caused pg_ctl to time out, leaving behind an orphaned postgres.exe process (which eventually forks off many other postgres.exe processes). But since pg_ctl itself isn't running after the timeout, Windows thinks the service isn't running. A subsequent attempt to start the service using pg_ctl then complains that the existing lock file is still in use by one of the postgres.exe processes spawned earlier.

We have observed conclusively that the file system cache comes into play. We tested a scenario in which a reboot was followed by navigating the file system under the data directory with the Cygwin "find" command. After that, pg_ctl did not time out and the server started up fine, suggesting that the cleanup is much faster when the file system is cached.
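For anyone without Cygwin, the same "touch everything under the data directory" experiment can be approximated in Python. This is only a sketch of what we did with find; the function name is made up, and whether it warms the cache enough to help will depend on the system.

```python
import os


def warm_directory_cache(data_dir):
    """Walk data_dir and stat every entry, returning how many were visited."""
    visited = 0
    for root, dirs, files in os.walk(data_dir):
        for name in dirs + files:
            try:
                os.stat(os.path.join(root, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            visited += 1
    return visited
```

It would have to run after the reboot and before the service is started.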

Any ideas on fixing this start-up delay in postmaster? 

Could the cleanup task move elsewhere, specifically to somewhere after the writing of the PID file is complete, so that pg_ctl doesn't time out?

Any other suggestions for working around this problem?


Thanks,

Deepak

Tom Lane

May 21, 2012, 10:55:22 PM
deepak <deep...@gmail.com> writes:
> We could reproduce the start-up problem on Windows 2003. After a reboot,
> postmaster, in its start-up sequence cleans up old temporary files, and
> this step used to take several minutes (a little over 4 minutes), delaying
> the writing of line 6 onwards into the PID file. This delay caused pg_ctl
> to timeout, leaving behind an orphaned postgres.exe process (which
> eventually forks off many other postgres.exe processes).

Hmm. It's easy enough to postpone temp file cleanup till after the
postmaster's PID file is completely written, so I've committed a patch
for that. However, I find it mildly astonishing that such cleanup could
take multiple minutes. What are you using for storage, a man with an
abacus?

regards, tom lane

deepak

May 23, 2012, 12:03:33 PM
Thanks, I have asked one of the other developers working on this issue to comment.

--
Deepak

Tom Lane

May 23, 2012, 12:50:13 PM
Mark Dilger <markd...@yahoo.com> writes:
> I tried moving the call to RemovePgTempFiles until
> after the PID file is fully written, but it did not help.

I wonder whether you correctly identified the source of the slowness.
The thing I would have suspected is identify_system_timezone(), which
will attempt to read every file in the timezone-database directory tree,
of which there are about 600. It's not unusual for that to take several
seconds on a cold-started machine that doesn't have any of that tree in
filesystem cache. It's still a stretch to believe that it'd take
several minutes on any storage system more advanced than a floppy disk;
but at least we'd only be trying to pin about one order of magnitude
slowdown on the filesystem, rather than several orders.

If that is what is causing it, there is a very simple workaround, which
is to set the timezone setting explicitly in postgresql.conf instead of
leaving the postmaster to try to figure it out from the environment.

(9.2 will use a better answer, which is for initdb to do this once and
store the result in postgresql.conf.)

Tom Lane

May 23, 2012, 2:17:30 PM
Mark Dilger <markd...@yahoo.com> writes:
> Prior to posting to the mailing list, we made some
> changes in postmaster.c to identify where time was
> being spent.  Based on the elog(NOTICE,...) lines
> we put in the file, we determined the time was spent
> inside RemovePgTempFiles.

> I then altered RemovePgTempFiles to take a starttime
> parameter and, while recursing, to check if more than
> 5 seconds has passed since it started.  I did not want
> to add the complexity of setting an alarm and catching
> the signal, so I just made the code check the wallclock
> time at each step of the recursion.  When more than
> 5 seconds has passed, it does not recurse further.
> After making this change, we have not been able to
> reproduce the slowness.

OK, so we're back to the original question: how could this possibly be
taking that long? Have you got thousands of tablespaces (and if so why)?
Does your system have a habit of crashing at times when there are
thousands of temp files? Maybe you're using IP over avian carriers to
access your SAN? It just doesn't make any sense given the information
you've provided.
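For readers following along, the time-limited recursion described in the quoted text can be sketched roughly as below. This is an illustration in Python, not the C change that was actually tested. The function name, the deadline parameter (used here in place of a start time) and the injectable clock are all hypothetical, and the "pgsql_tmp" file-name prefix is an assumption about how temporary files are named.

```python
import os
import time


def remove_temp_files(path, deadline, prefix="pgsql_tmp", clock=time.monotonic):
    """Recursively delete files whose names start with prefix.

    Stops descending into further subdirectories once clock() passes
    deadline. Returns the number of files removed.
    """
    removed = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            # Wall-clock check at each step of the recursion, no alarm/signal.
            if clock() > deadline:
                continue
            removed += remove_temp_files(full, deadline, prefix, clock)
        elif name.startswith(prefix):
            os.remove(full)
            removed += 1
    return removed
```

A call such as remove_temp_files(data_dir, time.monotonic() + 5.0) gives the five-second budget mentioned above. Anything in subdirectories not reached before the deadline is simply left in place.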

Tom Lane

May 23, 2012, 3:23:42 PM
Mark Dilger <markd...@yahoo.com> writes:
> We do not use tablespaces at all.

[ scratches head... ] If you aren't using any tablespaces, there should
be only *one* pgsql_tmp directory, which makes this even more confusing.

(Unless you're using a pre-8.3 release, in which case there would be one
per database, so maybe if you've got hundreds/thousands of databases in
the cluster that would explain it. But I sure hope you're not still
using pre-8.3, especially not on Windows.)

Tom Lane

May 23, 2012, 4:54:43 PM
Mark Dilger <markd...@yahoo.com> writes:
> We only use one database, not counting the
> built-in template databases.  The server is
> running 9.1.3.  We were running 9.1.1 until
> fairly recently.

OK. I had forgotten that in recent versions, RemovePgTempFiles doesn't
only iterate through the pgsql_tmp directories; it scans the regular
database directories too, looking for possibly orphaned temp relations.
So if you had lots and lots of files in your regular database
directories, possibly scanning those could be slow. Still, it's only
looking at the file names, not attempting to stat() them or anything,
so it would be a pretty shoddy filesystem that would take a really long
time for that.
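To make "only looking at the file names" concrete, here is a rough Python sketch of a name-only scan. It is not the server's code. It assumes temp relation files are named t<backend id>_<file node>, optionally with a fork suffix and a segment number; the regular expression and the function name are illustrative only.

```python
import os
import re

# Assumed naming scheme: t<digits>_<digits>, then an optional fork suffix
# such as _fsm, then an optional .<segment number>.
TEMP_REL_NAME = re.compile(r"^t\d+_\d+(_[a-z]+)?(\.\d+)?$")


def find_temp_relation_files(db_dir):
    """Return names in db_dir that look like temp relation files."""
    return sorted(n for n in os.listdir(db_dir) if TEMP_REL_NAME.match(n))
```

Nothing here asks for per-file metadata explicitly; the cost is whatever the platform's directory enumeration itself charges.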

Tom Lane

May 23, 2012, 7:25:58 PM
Mark Dilger <markd...@yahoo.com> writes:
> I am running this code on Windows 2003.  It
> appears that postgres has in src/port/dirent.c
> a port of readdir() that internally uses the
> WIN32_FIND_DATA structure, and the function
> FindNextFile() to iterate through the directory.
> Looking at the documentation, it seems that
> this function does collect file creation time,
> last access time, last write time, file size, etc.,
> much like performing a stat.

> In my case, the code is iterating through roughly
> 56,000 files. Apparently, this is doing the
> equivalent of a stat on each of them.

That would explain it all right. I think you're basically screwed here,
because so far as I can see Windows doesn't provide any means to
enumerate a directory's contents without fetching that info; at least
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364232(v=vs.85).aspx
doesn't seem to offer any substitutes for FindFirstFile/FindNextFile.

It's barely possible that using FindFirstFileEx with fInfoLevelId =
FindExInfoBasic would save enough to be useful, except that that option
doesn't exist on Windows 2003 anyway.

Consider using another operating system ...

Magnus Hagander

May 24, 2012, 6:58:18 AM
On Thu, May 24, 2012 at 12:47 AM, Mark Dilger <markd...@yahoo.com> wrote:
> I am running this code on Windows 2003.  It
> appears that postgres has in src/port/dirent.c
> a port of readdir() that internally uses the
> WIN32_FIND_DATA structure, and the function
> FindNextFile() to iterate through the directory.
> Looking at the documentation, it seems that
> this function does collect file creation time,
> last access time, last write time, file size, etc.,
> much like performing a stat.
>
> In my case, the code is iterating through roughly
> 56,000 files.  Apparently, this is doing the
> equivalent of a stat on each of them.

How did you end up with 56,000 files? Lots and lots and lots of tables?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Magnus Hagander

May 24, 2012, 7:00:39 AM
On Thu, May 24, 2012 at 2:42 AM, Mark Dilger <markd...@yahoo.com> wrote:
> FindFirstFile can take a wildcard filename
> pattern.  It appears that we are effectively
> calling FindFirstFile without a pattern, getting
> all 56000 file names with complete stat
> information, doing a poor-man's regex on
> those names, and matching just the temporary
> files.
>
> If RemovePgTempFiles were modified to
> pass a filter, this code might perform better
> on Windows.  I'll look into this.

It might in that case be worthwhile looking at using scandir() on
platforms that support that as well, so that other platforms can
benefit from an optimization as well. Though I'm not sure how much
that would actually help - ISTM that one actually scans the whole
directory anyway, just you don't have to do it yourself...
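A small Python sketch of that concern, modelling a scandir()-style selector where the filtering happens in user space. The function name and the counter are hypothetical and exist only to show how many entries get examined. It says nothing about whether a wildcard passed to FindFirstFile is applied any earlier than this.

```python
import fnmatch
import os


def scan_with_filter(path, pattern):
    """Return (matches, examined): names matching pattern, and entries seen."""
    examined = 0
    matches = []
    for name in os.listdir(path):
        examined += 1  # every entry is visited, whether or not it matches
        if fnmatch.fnmatchcase(name, pattern):
            matches.append(name)
    return sorted(matches), examined
```

With a pattern like "pgsql_tmp*" the list of matches may be short, but examined still equals the number of entries in the directory.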

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
