Transaction-Overflow

pingu...@web.de

Aug 7, 2007, 9:28:48 AM

Hi,

I have a problem with my PostgreSQL database (v8.1.5-13 on openSUSE 10.2). The database seems to be
running into its transaction ID limit, and Postgres restarts.
I run a daily "vacuum verbose analyze" over the whole database, so what am I doing wrong? The DB
restarted itself, apparently to prevent a transaction ID wraparound.

The log output (translated from German) says:
...
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2007-08-07 13:01:19 CEST
LOG: checkpoint record is at 41/4D06D224
LOG: redo record is at 41/4D049AF0; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 114926807; next OID: 230334262
LOG: next MultiXactId: 1; next MultiXactOffset: 0
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 41/4D049AF0
LOG: record with zero length at 41/4DCF8A48
LOG: redo done at 41/4DCF8A20
LOG: database system is ready
LOG: transaction ID wrap limit is 1187023047, limited by database "isohost"
...

The vacuum before this looks fine. The only thing is a max_fsm_pages warning:

...
INFO: free space map contains 210922 pages in 32 relations
DETAIL: A total of 200000 page slots are in use (including overhead).
391152 page slots are required to track all free space.
Current limits are: 200000 page slots, 100 relations, using 1180 KB.
NOTICE: number of page slots needed (391152) exceeds max_fsm_pages (200000)
HINT: Consider increasing the configuration parameter "max_fms_pages" to a value over 391152.
VACUUM
...

Is this my fault? Is the solution to set max_fsm_pages to around 450000?
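As a rough sanity check on that value, the numbers from the VACUUM output above can be plugged into a small calculation. This is only a sketch: the 15% headroom factor and the function name `suggest_max_fsm_pages` are assumptions made for illustration, not a PostgreSQL recommendation.

```python
import math

# Figures taken from the VACUUM VERBOSE output quoted above.
PAGE_SLOTS_NEEDED = 391152      # slots required to track all free space
CURRENT_MAX_FSM_PAGES = 200000  # current max_fsm_pages setting


def suggest_max_fsm_pages(needed, headroom=0.15):
    """Return `needed` page slots plus a safety margin.

    The 15% default headroom is an arbitrary assumption for this sketch.
    """
    return math.ceil(needed * (1 + headroom))


shortfall = PAGE_SLOTS_NEEDED - CURRENT_MAX_FSM_PAGES
suggested = suggest_max_fsm_pages(PAGE_SLOTS_NEEDED)
print(f"shortfall: {shortfall} page slots")
print(f"suggested max_fsm_pages: {suggested}")
```

With these inputs the result lands close to the 450000 mentioned in the question. Note that in this PostgreSQL version max_fsm_pages can only be set at server start, so a change needs a restart to take effect.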

Off topic:
There's a small typo in the log output (fms instead of fsm!). Has this already been fixed in the current
version?

Thanks a lot!

Regards,

Martin

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Tom Lane

Aug 7, 2007, 10:33:42 AM

pingu...@web.de writes:
> i've a problem with my postgresql database (v8.1.5-13 on opensuse 10.2). The database transaction
> limit is running over and postgres does a restart.

I don't read German well, but as far as I can see there is nothing in
what you posted that suggests a transaction wraparound issue. Something
crashed, clearly, but whatever evidence the log might have about why is
up above what you posted --- all of this looks like a standard crash
recovery cycle.

> Theres a little fault in the log output (fms instead of fsm!), is this already changed in the actual
> version?

Seems to be fixed in CVS HEAD, I didn't check the back branches.

regards, tom lane

pingu...@web.de

Aug 7, 2007, 11:34:53 AM

Hi Tom,

thank you for the very fast answer!

At the top of the log file is the following. Do you know why the process was killed with signal 11? I'm a little bit confused :(.

LOG: server process (PID 30399) was terminated by signal 11
LOG: terminating any other active server processes

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

The log also says that the maximum possible XID is 1187023047:

LOG: transaction ID wrap limit is 1187023047, limited by database "mydb"

And my DB is already at 1076856894:

Could this be the problem?
What happens when the maximum XID is reached? Does Postgres then restart?
How can I reset the counter? "vacuum analyze" should do that, shouldn't it?
Is it possible that vacuum (without full) doesn't free any space if max_fsm_pages is set too low? I've read something like this on the admin mailing list.

Thanks a lot!

Regards,

Martin

Decibel!

Aug 7, 2007, 2:30:03 PM

On Tue, Aug 07, 2007 at 05:34:53PM +0200, pingu...@web.de wrote:
> On the top in the log file is this, do you know why the pid is killed with 11? I'm a little bit confused :(.
>
> LOG: server process (PID 30399) was terminated by signal 11

That means that a backend was killed with a signal 11. IIRC, that
indicates faulty hardware.
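
To check which signal the number 11 corresponds to on a given machine, Python's standard `signal` module can do the lookup. This is a minimal sketch; the helper name `describe_signal` is made up for illustration, and signal numbers are platform-specific (11 is SIGSEGV on Linux).

```python
import signal


def describe_signal(number):
    """Map a signal number to its symbolic name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a known signal number on this platform.
        return f"unknown signal {number}"


print(describe_signal(11))
```

The name alone does not say why the process crashed; it only identifies the kind of fault the kernel reported.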
--
Decibel!, aka Jim Nasby dec...@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

Tom Lane

Aug 7, 2007, 2:38:44 PM

pingu...@web.de writes:
> On the top in the log file is this, do you know why the pid is killed with 11? I'm a little bit confused :(.

> LOG: server process (PID 30399) was terminated by signal 11

SIG 11 (ie SIGSEGV) is pretty much the typical "generic crash"
indication. It most likely means you ran into a software bug or
corrupted data. There is no reason at all to think that it's got
anything to do with transaction ID wraparound --- that message is
only coming out because it always comes out at a database restart.
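
For what it's worth, the numbers in the recovery log posted at the start of the thread point the same way: the next transaction ID at restart was far below the reported wrap limit. A rough sketch of that comparison follows, assuming plain subtraction is a fair approximation here (real XID comparison is modulo 2^32, which this ignores); the function and variable names are made up for illustration.

```python
# Values copied from the recovery log earlier in the thread.
NEXT_XID = 114926807      # "next transaction ID" reported at restart
WRAP_LIMIT = 1187023047   # reported transaction ID wrap limit


def xids_remaining(next_xid, wrap_limit):
    """Approximate number of XIDs left before the wrap limit.

    Assumption: no wraparound has happened yet, so simple subtraction
    is good enough; PostgreSQL itself compares XIDs modulo 2**32.
    """
    return wrap_limit - next_xid


remaining = xids_remaining(NEXT_XID, WRAP_LIMIT)
share = remaining / WRAP_LIMIT
print(f"{remaining} transaction IDs remain ({share:.0%} of the limit)")
```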

What you ought to look into is what *did* cause the crash. Did it
produce a core file, and if so can you get a gdb stack trace from
the core?

pingu...@web.de

Aug 8, 2007, 5:38:11 AM

Hi,

First, thanks for your answers.

I have now found some ECC errors in the kernel log:

EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow
EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE - no information available: e7xxx CE log register overflow

This happens on both servers, production and backup. Right now I'm updating the kernel
to 2.6.22.1. Hopefully this helps :/, but I don't have much hope.
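
To see whether the corrected errors (CE) cluster on one memory row or page, the EDAC lines can be tallied. This is a minimal sketch that assumes the line format shown in this message; the regex and the function name `count_edac_ce` are illustrative and not part of any EDAC tool.

```python
import re
from collections import Counter

# Matches lines such as:
# EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, ...
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-f]+),.*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def count_edac_ce(lines):
    """Count corrected-error reports per (controller, row, channel, page)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (int(match["mc"]), int(match["row"]),
                   int(match["channel"]), match["page"])
            counts[key] += 1
    return counts


# Sample lines copied from the kernel output in this message.
sample = [
    'EDAC MC0: CE page 0xfeb8e, offset 0x0, grain 4096, syndrome 0xb32, row 4, channel 0, label "": e7xxx CE',
    'EDAC MC0: CE - no information available: e7xxx CE log register overflow',
    '<4>EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE',
]
counts = count_edac_ce(sample)
for key, n in counts.items():
    print(key, n)
```

In the lines quoted in this message, every located error reports row 4, channel 0. That would be consistent with a single suspect memory module, though it is an inference from the log, not a diagnosis.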

There are also traces in dmesg:

Code: f3 a5 89 c1 f3 a4 eb 21 89 c8 83 f9 07 76 18 89 f9 f7 d9 83 e1 07 29 c8 f3 a4 89 c1 c1 e9 02 83 e0 03 90 f3 a5 89 c1 f3 a4 5e 89 <c8> 5f c3 57 85 c9 56 89 c7 89 d6 79 08 0f 0b 0a 03 71 ce 2c c0
EIP: [<c01c3a2c>] __copy_from_user_ll_nozero+0xd7/0xda SS:ESP 0068:dca2fd94
<4>EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
EDAC MC0: CE page 0xb4124, offset 0x0, grain 4096, syndrome 0xfc1, row 4, channel 0, label "": e7xxx CE
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000002
printing eip:
c01c3a2c
*pde = 2f39e001
Oops: 0000 [#3]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
Modules linked in: nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables lp parport_pc parport af_packet joydev st sr_mod ipv6 button battery ac apparmor aamatch_pcre loop dm_mod e1000 ide_cd cdrom i2c_i801 e7xxx_edac edac_mc i2c_core ext3 mbcache jbd edd fan sg gdth aic79xx scsi_transport_spi piix thermal processor sd_mod scsi_mod ide_disk ide_core
CPU: 0
EIP: 0060:[<c01c3a2c>] Tainted: G U VLI
EFLAGS: 00010206 (2.6.18.2-34-bigsmp #1)
EIP is at __copy_from_user_ll_nozero+0xd7/0xda
eax: e5f17dbc ebx: 00000001 ecx: 00000006 edx: bff0ef9a
esi: 00000000 edi: 01de802f ebp: 00000006 esp: e5f17d94
ds: 007b es: 007b ss: 0068
Process postmaster (pid: 13414, ti=e5f16000 task=e93710b0 task.ti=e5f16000)
Stack: c01a81f2 00000000 00003466 0000000e e5f17dbc 00000000 00000000 00000001
00000002 00000000 ffff0002 c0100000 00000000 b6b86840 b6b85000 c1d9df20
00001000 d4f6ae9c 21741707 46b611a0 c0125770 3b9aca00 00000163 80000000
Call Trace:
[<c01a81f2>] exit_sem+0x58/0x14c
[<c0125770>] current_fs_time+0x4f/0x5b
[<c014ca56>] get_page_from_freelist+0x2f1/0x371
[<c01487f7>] find_lock_page+0x1a/0x77
[<c015f3b5>] shmem_getpage+0x4f2/0x552
[<c0160375>] shmem_nopage+0xa4/0xb6
[<c0154076>] __handle_mm_fault+0x63e/0xb9c
[<c01325aa>] autoremove_wake_function+0x0/0x35
[<c0108567>] sys_ipc+0x5e/0x1bb
[<c0103ddd>] sysenter_past_esp+0x56/0x79
Code: f3 a5 89 c1 f3 a4 eb 21 89 c8 83 f9 07 76 18 89 f9 f7 d9 83 e1 07 29 c8 f3 a4 89 c1 c1 e9 02 83 e0 03 90 f3 a5 89 c1 f3 a4 5e 89 <c8> 5f c3 57 85 c9 56 89 c7 89 d6 79 08 0f 0b 0a 03 71 ce 2c c0
EIP: [<c01c3a2c>] __copy_from_user_ll_nozero+0xd7/0xda SS:ESP 0068:e5f17d94
<6>device eth0 left promiscuous mode

The hardware is 5 years old... It was not possible to get new hardware
for this project. :/

Regards,

Martin


Tom Lane

Aug 8, 2007, 10:34:05 AM

pingu...@web.de writes:
> The hardware is 5 years old... It was not possible to get new hardware
> for this project. :/

Disassembling and cleaning the machine has worked for me in similar
cases. You'd be amazed how much dust can build up on a circuit board
... and if the dust is even a little bit conductive, it can be the cause
of misbehavior.

regards, tom lane

Hajek, Nick

Aug 8, 2007, 11:01:37 AM

More likely than electrical conduction is the thermal insulating effect
of accumulated dust, which can cause components to overheat.