Point-In-Time Recovery failing on OpenStack

jim.g...@enterprisedb.com

unread,

Apr 10, 2015, 12:27:43 PM4/10/15

to wa...@googlegroups.com, Dave Page

I've run into a problem trying to do a point-in-time recovery that appears to be specific to OpenStack using wal-e version 0.7.3. wal-e appears to be restoring non-existent .history files even though the wal-fetch is failing. as a result, it continues to increment and try and fetch the next .history files ad infinitum (e.g. 00000002.history, 00000003.history, 00000004.history and on and on and on) since the DB "thinks" that the .history files are being restored when the are really not.

everything works ok with AWS, stopping after the first failure with no message noting "restored log file "00000002.history" from archive" as is happening (and shown and bolded below) with OpenStack.

wal_e.operator.backup INFO MSG: begin wal restore

STRUCTURED: time=2015-04-09T16:05:32.741598-00 pid=8005 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000002.history state=begin

swiftclient INFO RESP STATUS: 404 Not Found

swiftclient INFO RESP HEADERS: [('date', 'Thu, 09 Apr 2015 16:05:35 GMT'), ('content-length', '70'), ('content-type', 'text/html; charset=UTF-8'), ('connection', 'keep-alive'), ('x-trans-id', 'tx73ff9ce52e0442bea81eb-005526a34f')]

swiftclient INFO RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html>

swiftclient ERROR Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

Traceback (most recent call last):

File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 1243, in _retry

rv = func(self.url, self.token, *args, **kwargs)

File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 862, in get_object

http_response_content=body)

ClientException: Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

lzop: <stdin>: not a lzop file

wal_e.blobstore.swift.utils WARNING MSG: could no longer locate object while performing wal restore

DETAIL: The absolute URI that could not be located is swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo.

HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.

STRUCTURED: time=2015-04-09T16:05:33.415014-00 pid=8005

wal_e.operator.backup INFO MSG: complete wal restore

STRUCTURED: time=2015-04-09T16:05:33.415424-00 pid=8005 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000002.history state=complete

2015-04-09 16:05:33 GMT LOG: restored log file "00000002.history" from archive

wal_e.operator.backup INFO MSG: begin wal restore

STRUCTURED: time=2015-04-09T16:05:33.736182-00 pid=8021 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000003.history state=begin

swiftclient INFO RESP STATUS: 404 Not Found

swiftclient INFO RESP HEADERS: [('date', 'Thu, 09 Apr 2015 16:05:36 GMT'), ('content-length', '70'), ('content-type', 'text/html; charset=UTF-8'), ('connection', 'keep-alive'), ('x-trans-id', 'tx34d437e9bfb54837b8732-005526a350')]

swiftclient INFO RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html>

swiftclient ERROR Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

Traceback (most recent call last):

File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 1243, in _retry

rv = func(self.url, self.token, *args, **kwargs)

File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 862, in get_object

http_response_content=body)

ClientException: Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

lzop: <stdin>: not a lzop file

wal_e.blobstore.swift.utils WARNING MSG: could no longer locate object while performing wal restore

DETAIL: The absolute URI that could not be located is swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo.

HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.

STRUCTURED: time=2015-04-09T16:05:34.405067-00 pid=8021

wal_e.operator.backup INFO MSG: complete wal restore

STRUCTURED: time=2015-04-09T16:05:34.405459-00 pid=8021 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000003.history state=complete

2015-04-09 16:05:34 GMT LOG: restored log file "00000003.history" from archive

Daniel Farina

unread,

Apr 10, 2015, 12:49:47 PM4/10/15

to jim.g...@enterprisedb.com, wal-e, Dave Page

On Fri, Apr 10, 2015 at 9:12 AM, <jim.g...@enterprisedb.com> wrote:
> I've run into a problem trying to do a point-in-time recovery that appears
> to be specific to OpenStack using wal-e version 0.7.3. wal-e appears to be
> restoring non-existent .history files even though the wal-fetch is failing.
> as a result, it continues to increment and try and fetch the next .history
> files ad infinitum (e.g. 00000002.history, 00000003.history,
> 00000004.history and on and on and on) since the DB "thinks" that the
> .history files are being restored when the are really not.
>
> everything works ok with AWS, stopping after the first failure with no
> message noting "restored log file "00000002.history" from archive" as is
> happening (and shown and bolded below) with OpenStack.

Hum. People have definitely noted some bugs where failures can result
in empty files being left behind by WAL-E. I thought all of those
were in v0.8 -- and fixed -- but maybe not this one.

Is there a way you can identify if an empty file by the name
"00000002.history" (or whatever bogus file is "thought" to exist) is
in pg_xlog while this bug is occuring?

Secondly, would you consider trying v0.8? (Alternatively: nail the bug
by inspection in v0.7, if you feel up for it)

Daniel Farina

unread,

Apr 10, 2015, 12:51:28 PM4/10/15

to jim.g...@enterprisedb.com, wal-e, Dave Page

Also, I think a reasonable suspicion is this is somehow similar to
https://github.com/wal-e/wal-e/pull/169

jim.g...@enterprisedb.com

unread,

Apr 10, 2015, 6:06:21 PM4/10/15

to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com

unfortunately i am seeing the same problem with v0.8.0

jim.g...@enterprisedb.com

unread,

Apr 10, 2015, 6:06:21 PM4/10/15

to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com

On Friday, April 10, 2015 at 12:49:47 PM UTC-4, Daniel Farina wrote:

yes, there is an empty file called RECOVERYHISTORY being created in pg_xlog.

Secondly, would you consider trying v0.8? (Alternatively: nail the bug
by inspection in v0.7, if you feel up for it)

i will look into giving v0.8 a try.

jim.g...@enterprisedb.com

unread,

Apr 10, 2015, 6:06:21 PM4/10/15

to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com

On Friday, April 10, 2015 at 12:49:47 PM UTC-4, Daniel Farina wrote:

actually i was mistaken in my previous reply (i accidentally looked at the wrong machine). i just retested and there is no empty file in pg_xlog while this bug is occurring.

Daniel Farina

unread,

Apr 13, 2015, 4:28:24 AM4/13/15

to Jim Glennon, wal-e, Dave Page

Thanks for confirming that. It does help triangulate a problem as
that code got whacked around quite a bit.

Can you try the branch/patch here?
https://github.com/fdr/wal-e/tree/delete-on-error

jim.g...@enterprisedb.com

unread,

Apr 14, 2015, 8:22:38 AM4/14/15

to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com

it turns out this was not a wal-e issue. the script that was invoking wal-fetch was not returning the return code correctly.

sorry for the false alarm..... : /

Daniel Farina

unread,

Apr 14, 2015, 12:56:54 PM4/14/15

to jim.g...@enterprisedb.com, wal-e, dave...@enterprisedb.com

On Tue, Apr 14, 2015 at 5:22 AM, <jim.g...@enterprisedb.com> wrote:
> it turns out this was not a wal-e issue. the script that was invoking
> wal-fetch was not returning the return code correctly.
>
> sorry for the false alarm..... : /

No problems. Someone else was reporting a bug involving empty files
also, so I may commit the change anyway.

Reply all

Reply to author

Forward