Recovering from corrupt xlog files.

da...@awakenetworks.com

Jul 21, 2017, 8:07:00 PM

to Greenplum Users

We have a gpdb cluster running on about 15TB of data. For reasons we still need to investigate, the transaction log on the master is in a bad state. The log from the master is attached below.

Searching online, it seems folks use pg_resetxlog to get around this issue. The tool is marked as "not used in gpdb" in https://gpdb.docs.pivotal.io/4320/utility_guide/admin_utilities/util_overview.html.

What would be the equivalent tool to get the master started on gpdb?

Thanks,

Dash.


2017-07-21 23:19:20.377511 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","invalid record length at 1D/79A75B88",,,,,,,0,,"xlog.c",4045,
2017-07-21 23:19:20.377520 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","invalid primary checkpoint record at location 1D/79A75B88 (==> seg 30, offset 0x1A75B88)",,,,,,,0,,"xlog.c",7952,
2017-07-21 23:19:20.377538 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","contrecord is requested by 1D/78000020",,,,,,,0,,"xlog.c",4013,
2017-07-21 23:19:20.377549 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","Couldn't read transaction log file (logid 29, seg 30)",,,,,,,0,,"xlog.c",5878,
2017-07-21 23:19:20.377567 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","invalid record length at 1D/79A74AC0",,,,,,,0,,"xlog.c",4045,
2017-07-21 23:19:20.377574 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","invalid secondary checkpoint record at location 1D/79A74AC0 (==> seg 30, offset 0x1A74AC0)",,,,,,,0,,"xlog.c",7957,
2017-07-21 23:19:20.377589 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","contrecord is requested by 1D/78000020",,,,,,,0,,"xlog.c",4013,
2017-07-21 23:19:20.377596 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","Couldn't read transaction log file (logid 29, seg 30)",,,,,,,0,,"xlog.c",5878,
2017-07-21 23:19:20.385679 UTC,,,p299,th-1232492480,,,,0,,,seg-1,,,,,"PANIC","XX000","could not locate a valid checkpoint record (xlog.c:6460)",,,,,,,0,,"xlog.c",6460,"Stack trace:
1    0x8c8038 postgres errstart + 0x278
2    0x4e7959 postgres StartupXLOG + 0x1899
3    0x4eb12e postgres StartupProcessMain + 0x27e
4    0x53d3f6 postgres AuxiliaryProcessMain + 0x486
5    0x771620 postgres <symbol not found> + 0x771620
6    0x77b6bd postgres StartMasterOrPrimaryPostmasterProcesses + 0x2d
7    0x77df9b postgres doRequestedPrimaryMirrorModeTransitions + 0x3bb
8    0x776e3d postgres <symbol not found> + 0x776e3d
9    0x77a819 postgres PostmasterMain + 0x789
10   0x486145 postgres main + 0x3c5
11   0x7f48b4ce4b15 libc.so.6 __libc_start_main + 0xf5
12   0x486265 postgres <symbol not found> + 0x486265
"
2017-07-21 23:19:20.385957 UTC,,,p297,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","startup process (PID 299) was terminated by signal 6: Aborted",,,,,,,0,,"postmaster.c",5599,
2017-07-21 23:19:20.385966 UTC,,,p297,th-1232492480,,,,0,,,seg-1,,,,,"LOG","00000","aborting startup due to startup process failure",,,,,,,0,,"postmaster.c",4436,
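
To identify which transaction log file the startup process failed to read, the location in the log above can be decoded by hand. The sketch below is a minimal illustration, not a Greenplum tool. It assumes 64 MB segments, which is inferred from the log itself ("1D/79A75B88" maps to "seg 30, offset 0x1A75B88"), and it assumes timeline 1 and the older PostgreSQL file naming scheme (timeline, logid, segment, each as 8 hex digits). The function name is hypothetical.

```python
# Assumption: 64 MB WAL segments, inferred from the log line
# "1D/79A75B88 (==> seg 30, offset 0x1A75B88)".
SEGMENT_SIZE = 64 * 1024 * 1024


def decode_xlog_location(location, timeline=1):
    """Split an xlog location such as '1D/79A75B88' into its parts.

    timeline=1 is an assumption; check the file names in pg_xlog
    for the real value on your system.
    """
    logid_hex, recoff_hex = location.split("/")
    logid = int(logid_hex, 16)
    recoff = int(recoff_hex, 16)
    # Segment number and byte offset inside that segment.
    seg, offset = divmod(recoff, SEGMENT_SIZE)
    # Older PostgreSQL naming scheme: timeline, logid, segment as 8 hex digits each.
    filename = "%08X%08X%08X" % (timeline, logid, seg)
    return {"logid": logid, "seg": seg, "offset": offset, "filename": filename}


info = decode_xlog_location("1D/79A75B88")
print("logid %(logid)d, seg %(seg)d, file %(filename)s" % info)
print("offset 0x%X" % info["offset"])
```

For the primary checkpoint location this gives logid 29 and seg 30, matching the "Couldn't read transaction log file (logid 29, seg 30)" message, so that is the file under pg_xlog worth inspecting first.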

Ignacio Elizaga

Jul 22, 2017, 6:52:24 AM

to da...@awakenetworks.com, Greenplum Users

Hi Dash,

AFAIK, pg_resetxlog can be run against any segment data directory or the master data directory in Greenplum. (I haven't tried it in the OSS version, but I have seen it referenced in the code.) For this specific error, I can't think of many options other than pg_resetxlog itself. Pivotal's documentation says nothing about this command because it is quite a dangerous tool that should only be used as the very last resort to fix a problematic segment or master. I'm of the opinion that commands like this should be documented in OSS documentation too, even if they're dangerous (PostgreSQL does so), but that's a different debate.

Running pg_resetxlog can cause data loss and catalog inconsistencies, and under certain circumstances the database might become completely unusable. For this reason, I suggest first taking a backup of $MASTER_DATA_DIRECTORY to a different location, in case things go south after running pg_resetxlog. If, despite all this, you still want to go ahead: once it has run and the database comes up normally, I suggest restoring from the latest available backup. If that is not possible, run gpcheckcat to check for catalog inconsistencies, which you'll have to fix manually, and keep in mind that you'll have to accept data loss from partially committed transactions.
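
As a rough illustration of the "back up first" step, here is a minimal Python sketch that archives the master data directory while the master is down. The function names and the BACKUP_ROOT environment variable are hypothetical, not part of Greenplum. The -n (no-op) flag is the one documented for PostgreSQL's pg_resetxlog; I am assuming it behaves the same in the Greenplum build, so check the usage output of your binary first. The script only prints the command and never runs it.

```python
import os
import tarfile
import time


def backup_data_directory(data_dir, backup_root):
    """Archive data_dir into a timestamped .tar.gz under backup_root.

    Only meaningful as a cold backup, i.e. while the master is stopped.
    backup_root should be outside data_dir, ideally on another disk.
    """
    data_dir = os.path.abspath(data_dir)
    os.makedirs(backup_root, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(backup_root, "master-backup-%s.tar.gz" % stamp)
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store paths relative to the directory name, not the absolute path.
        archive.add(data_dir, arcname=os.path.basename(data_dir))
    return archive_path


def dry_run_command(data_dir):
    """Build (but do not run) a pg_resetxlog dry-run command.

    Assumption: -n is the PostgreSQL no-op flag and is accepted by
    the Greenplum binary as well.
    """
    return ["pg_resetxlog", "-n", data_dir]


if __name__ == "__main__":
    data_dir = os.environ.get("MASTER_DATA_DIRECTORY")
    backup_root = os.environ.get("BACKUP_ROOT")  # hypothetical variable
    if data_dir and backup_root:
        print("backup written to", backup_data_directory(data_dir, backup_root))
        print("next, inspect:", " ".join(dry_run_command(data_dir)))
    else:
        print("set MASTER_DATA_DIRECTORY and BACKUP_ROOT to run the backup")
```

Keeping the archive outside the data directory means a failed pg_resetxlog attempt can be undone by restoring the directory as it was.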

Some other options that come to my mind:

- Try to fail over to the standby master, if it exists and is healthy, to bring up the database

- If everything else fails, reinitialise the cluster and restore from the latest available backup

Regards,

Ignacio Elizaga

Dave Cramer

Jul 24, 2017, 1:17:40 PM

to Ignacio Elizaga, da...@awakenetworks.com, Greenplum Users

It's very likely you will lose data doing this.

At the very least, make a binary copy of your cluster before you run it.

Dave Cramer

Debabrata Dash

Jul 24, 2017, 1:24:35 PM

to Dave Cramer, Ignacio Elizaga, Greenplum Users

Gave pg_resetxlog a shot, and it worked. The database is back to being functional.