Point-In-Time Recovery failing on OpenStack

34 views
Skip to first unread message

jim.g...@enterprisedb.com

unread,
Apr 10, 2015, 12:27:43 PM4/10/15
to wa...@googlegroups.com, Dave Page
I've run into a problem trying to do a point-in-time recovery that appears to be specific to OpenStack using wal-e version 0.7.3. wal-e appears to be restoring non-existent .history files even though the wal-fetch is failing. as a result, it continues to increment and try and fetch the next .history files ad infinitum (e.g. 00000002.history, 00000003.history, 00000004.history and on and on and on) since the DB "thinks" that the .history files are being restored when the are really not. 

everything works ok with AWS, stopping after the first failure with no message noting "restored log file "00000002.history" from archive" as is happening (and shown and bolded below) with OpenStack. 

wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2015-04-09T16:05:32.741598-00 pid=8005 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000002.history state=begin
swiftclient  INFO     RESP STATUS: 404 Not Found
swiftclient  INFO     RESP HEADERS: [('date', 'Thu, 09 Apr 2015 16:05:35 GMT'), ('content-length', '70'), ('content-type', 'text/html; charset=UTF-8'), ('connection', 'keep-alive'), ('x-trans-id', 'tx73ff9ce52e0442bea81eb-005526a34f')]
swiftclient  INFO     RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html>
swiftclient  ERROR    Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
        Traceback (most recent call last):
          File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 1243, in _retry
            rv = func(self.url, self.token, *args, **kwargs)
          File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 862, in get_object
            http_response_content=body)
        ClientException: Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
lzop: <stdin>: not a lzop file
wal_e.blobstore.swift.utils WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2015-04-09T16:05:33.415014-00 pid=8005
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2015-04-09T16:05:33.415424-00 pid=8005 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000002.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000002.history state=complete
2015-04-09 16:05:33 GMT LOG:  restored log file "00000002.history" from archive
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2015-04-09T16:05:33.736182-00 pid=8021 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000003.history state=begin
swiftclient  INFO     RESP STATUS: 404 Not Found
swiftclient  INFO     RESP HEADERS: [('date', 'Thu, 09 Apr 2015 16:05:36 GMT'), ('content-length', '70'), ('content-type', 'text/html; charset=UTF-8'), ('connection', 'keep-alive'), ('x-trans-id', 'tx34d437e9bfb54837b8732-005526a350')]
swiftclient  INFO     RESP BODY: <html><h1>Not Found</h1><p>The resource could not be found.</p></html>
swiftclient  ERROR    Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
        Traceback (most recent call last):
          File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 1243, in _retry
            rv = func(self.url, self.token, *args, **kwargs)
          File "/opt/wal-e-venv/lib/python2.6/site-packages/swiftclient/client.py", line 862, in get_object
            http_response_content=body)
        ClientException: Object GET failed: http://swift.ox.uk.enterprisedb.com:8080/v1/AUTH_ea9b0e51f1614134b720391e65a25df6/PPCD//jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo 404 Not Found  [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<
lzop: <stdin>: not a lzop file
wal_e.blobstore.swift.utils WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2015-04-09T16:05:34.405067-00 pid=8021
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2015-04-09T16:05:34.405459-00 pid=8021 action=wal-fetch key=swift://PPCD/jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/wal_005/00000003.history.lzo prefix=jtest-917/491d81c1-0cd4-48a0-bde5-6cb4205d7812/ seg=00000003.history state=complete
2015-04-09 16:05:34 GMT LOG:  restored log file "00000003.history" from archive


Daniel Farina

unread,
Apr 10, 2015, 12:49:47 PM4/10/15
to jim.g...@enterprisedb.com, wal-e, Dave Page
On Fri, Apr 10, 2015 at 9:12 AM, <jim.g...@enterprisedb.com> wrote:
> I've run into a problem trying to do a point-in-time recovery that appears
> to be specific to OpenStack using wal-e version 0.7.3. wal-e appears to be
> restoring non-existent .history files even though the wal-fetch is failing.
> as a result, it continues to increment and try and fetch the next .history
> files ad infinitum (e.g. 00000002.history, 00000003.history,
> 00000004.history and on and on and on) since the DB "thinks" that the
> .history files are being restored when the are really not.
>
> everything works ok with AWS, stopping after the first failure with no
> message noting "restored log file "00000002.history" from archive" as is
> happening (and shown and bolded below) with OpenStack.

Hum. People have definitely noted some bugs where failures can result
in empty files being left behind by WAL-E. I thought all of those
were in v0.8 -- and fixed -- but maybe not this one.

Is there a way you can identify if an empty file by the name
"00000002.history" (or whatever bogus file is "thought" to exist) is
in pg_xlog while this bug is occuring?

Secondly, would you consider trying v0.8? (Alternatively: nail the bug
by inspection in v0.7, if you feel up for it)

Daniel Farina

unread,
Apr 10, 2015, 12:51:28 PM4/10/15
to jim.g...@enterprisedb.com, wal-e, Dave Page
Also, I think a reasonable suspicion is this is somehow similar to
https://github.com/wal-e/wal-e/pull/169

jim.g...@enterprisedb.com

unread,
Apr 10, 2015, 6:06:21 PM4/10/15
to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com
unfortunately i am seeing the same problem with v0.8.0 

jim.g...@enterprisedb.com

unread,
Apr 10, 2015, 6:06:21 PM4/10/15
to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com
On Friday, April 10, 2015 at 12:49:47 PM UTC-4, Daniel Farina wrote:
yes, there is an empty file called RECOVERYHISTORY being created in pg_xlog.
 
Secondly, would you consider trying v0.8? (Alternatively: nail the bug
by inspection in v0.7, if you feel up for it)

i will look into giving v0.8 a try. 

jim.g...@enterprisedb.com

unread,
Apr 10, 2015, 6:06:21 PM4/10/15
to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com


On Friday, April 10, 2015 at 12:49:47 PM UTC-4, Daniel Farina wrote:
actually i was mistaken in my previous reply (i accidentally looked at the wrong machine). i just retested and there is no empty file in pg_xlog while this bug is occurring.

Daniel Farina

unread,
Apr 13, 2015, 4:28:24 AM4/13/15
to Jim Glennon, wal-e, Dave Page
Thanks for confirming that. It does help triangulate a problem as
that code got whacked around quite a bit.

Can you try the branch/patch here?
https://github.com/fdr/wal-e/tree/delete-on-error

jim.g...@enterprisedb.com

unread,
Apr 14, 2015, 8:22:38 AM4/14/15
to wa...@googlegroups.com, jim.g...@enterprisedb.com, dave...@enterprisedb.com
it turns out this was not a wal-e issue. the script that was invoking wal-fetch was not returning the return code correctly.

sorry for the false alarm.....  : / 

Daniel Farina

unread,
Apr 14, 2015, 12:56:54 PM4/14/15
to jim.g...@enterprisedb.com, wal-e, dave...@enterprisedb.com
On Tue, Apr 14, 2015 at 5:22 AM, <jim.g...@enterprisedb.com> wrote:
> it turns out this was not a wal-e issue. the script that was invoking
> wal-fetch was not returning the return code correctly.
>
> sorry for the false alarm..... : /

No problems. Someone else was reporting a bug involving empty files
also, so I may commit the change anyway.
Reply all
Reply to author
Forward
0 new messages