Temporary EBS failure


Paul Carey

Aug 20, 2010, 4:38:44 AM
to ec2ubuntu
Hi

I posted the following to the AWS forums yesterday, but I'm reposting
it here as I'm very keen to understand what happened.

At 06:25 a scheduled invocation of ec2-consistent-snapshot failed to
complete. My EBS volumes continued to receive writes until about
06:41. At 06:48 logrotate failed to complete (output from kern.log
below).

When I logged into my server I could run ls on some, but not all, of
the directories on the EBS mount. I could also view some parts of some
log files - others were corrupted, and less/tail simply hung on
others. Running kill -9 failed to kill the process that was tailing
one of the logs. Running umount and mdadm --stop both failed. (My
instance had two striped volumes.)

I started a new instance and created new volumes from the previous
day's snapshots. I then force-detached the volumes that I thought had
failed and attached them to a different new instance. There was no
evidence of data corruption, and I replicated the data created since
my last snapshot on top of the new volumes (thanks to CouchDB for
trivial replication).
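
For reference, the recovery was roughly the following (a sketch using
the ec2-api-tools; the snapshot, volume and instance IDs, the
availability zone and the device names are all placeholders):

# create fresh volumes from the previous day's snapshots
ec2-create-volume --snapshot snap-11111111 -z us-east-1a
ec2-create-volume --snapshot snap-22222222 -z us-east-1a

# force-detach the volumes I thought had failed
ec2-detach-volume vol-aaaaaaaa --force
ec2-detach-volume vol-bbbbbbbb --force

# attach them to a different new instance for inspection
ec2-attach-volume vol-aaaaaaaa -i i-cccccccc -d /dev/sdf
ec2-attach-volume vol-bbbbbbbb -i i-cccccccc -d /dev/sdg

# reassemble the striped array on the new instance and mount it
mdadm --assemble /dev/md0 /dev/sdf /dev/sdg
mount /dev/md0 /mnt/check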

Given that the volumes seem to have been unaffected, I'm guessing that
the problems arose due to connectivity issues. But that may be a poor
guess - I'd be very grateful if anyone could shed any light on what
might have happened and what, if anything, I could do in future to
prevent this happening or to recover more easily.

Paul

----

Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812923] INFO: task logrotate:14524 blocked for more than 120 seconds.
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812928] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812934] logrotate D 00014001 0 14524 14523 0x00000000
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812940] ebf01d78 00000282 00000010 00014001 013ffa00 00000000 0000ce4a 00000000
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812946] c06ea1c0 ebf6d52c c06ea1c0 c06ea1c0 c06ea1c0 ebf6d52c c06ea1c0 c06ea1c0
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812950] ebc61e40 001be0d3 ebf6d280 c0681280 c0353d71 00006868 ebdacd58 ebf01d80
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812955] Call Trace:
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812959] [<c0353d71>] ? xfs_buf_rele+0xa1/0xe0
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812962] [<c014c7e9>] ? prepare_to_wait+0x39/0x60
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812965] [<c0348495>] xfs_trans_alloc+0x45/0x80
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812968] [<c014c5c0>] ? autoremove_wake_function+0x0/0x40
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812971] [<c0346019>] xfs_rename+0xb9/0x5b0
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812975] [<c0321c18>] ? xfs_dir_lookup+0xd8/0x120
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812978] [<c0357706>] xfs_vn_rename+0x66/0x80
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812981] [<c01d6e1f>] vfs_rename_other+0x9f/0xd0
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812984] [<c01d7fbb>] vfs_rename+0xeb/0x290
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812987] [<c01d8745>] ? __lookup_hash+0xc5/0x110
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812990] [<c01d9bed>] sys_renameat+0x20d/0x230
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812994] [<c01d2362>] ? sys_stat64+0x22/0x30
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812996] [<c01d9c38>] sys_rename+0x28/0x30
Aug 19 06:48:26 domU-12-31-39-0E-1A-52 kernel: [7847162.812999] [<c01047e0>] syscall_call+0x7/0xb

Eric Hammond

Aug 20, 2010, 5:33:03 PM
to ec2u...@googlegroups.com
Paul:

It's possible that ec2-consistent-snapshot terminated without having a
chance to unfreeze the XFS file system, even though it takes great pains
to avoid this. I add an explicit unfreeze after running it in my cron
jobs. E.g.,

sudo ec2-consistent-snapshot ...; sudo xfs_freeze -u /EBSVOL 2>/dev/null
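
In full, an /etc/cron.d entry using that pattern might look like this
(a sketch only: the schedule, mount point and volume ID are
placeholders, and it assumes the --xfs-filesystem option):

# snapshot daily at 06:25; always unfreeze afterwards so a crashed
# snapshot run can't leave the filesystem frozen (placeholders throughout)
25 6 * * * root ec2-consistent-snapshot --xfs-filesystem /EBSVOL vol-12345678; xfs_freeze -u /EBSVOL 2>/dev/null

The trailing xfs_freeze -u is harmless when the filesystem isn't
frozen - it just returns an error, which is why stderr is discarded.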

--
Eric Hammond

Mark Goris

Aug 21, 2010, 9:05:27 PM
to ec2ubuntu
Hi Paul,

I had something very similar happen with an EBS volume containing
MySQL database files. Eric made the same suggestion, and it seems to
have done the trick. I've had EBS weirdness happen while running
ec2-consistent-snapshot since then, but the instance hasn't yet been
left in a bad situation.

Mark

Paul Carey

Sep 13, 2010, 11:01:59 AM
to ec2u...@googlegroups.com
Thanks Eric, Mark.

An AWS member stated that the CloudWatch graphs showed performance
issues with the volume, but they didn't elaborate.

I've added an explicit unfreeze as Eric suggested... so far so good :)

Paul

