Problem getting 'tester' to run without "got wrong content block" errors.


Jeff Byers

Mar 20, 2017, 9:25:56 AM
to s3backer-devel
Hello,

I think I've found a problem with 's3backer' that can result in
data corruption when used in S3 Eventual Consistency mode and
while also using the Block Cache. This has been seen very
intermittently, a total of about five times, resulting in "got
wrong content block" errors like:

1527.487 [02] got wrong content block 00000e21

when running the unit test code 'tester' with the following settings:

# tester -d --directIO --blockCacheThreads=30 --blockCacheSize=30 --blockSize=1M --size=5G

The problem seems to be that the Block Cache is violating the
efforts of the ec_protect.c code to guarantee consistency:

    s3backer enforces a minimum delay between consecutive PUT or
    DELETE operations on the same block.  This ensures that
    Amazon S3 doesn't receive these operations out of order.

because the --minWriteDelay=500 delay is lost on the
block_cache.c code, which doesn't know or care anything about S3
Eventual Consistency rules; that is ec_protect.c's job.

What I see happening is that the block cache had been sitting on
a write (or delete), and then submits it, when a new write for
the same block comes in from ec_protect.c. This causes
block_cache.c to abort the write in progress, and to
*immediately* submit the new write (or delete). Most of the time,
this works, but in some rare cases it does not.

It is easy to see on a wire trace that there is no delay:
whenever a PUT is aborted by block_cache.c, the next PUT or
DELETE to the same block follows nearly instantaneously. Strictly
speaking, this is not an ordering issue, since the aborted I/O
was clearly first; the question is when S3 is ready for the next
I/O to the same block after an abort.

I was only able to get the 'tester' to run the full 24 hours by
changing the code in block_cache.c block_cache_worker_main() to
disable the aborting mechanism:

if (1) { /* JB: disable the write-cancellation mechanism */
    r = (*priv->inner->write_block)(priv->inner, entry->block_num, buf, md5, NULL, NULL);
} else {
    r = (*priv->inner->write_block)(priv->inner, entry->block_num, buf, md5, block_cache_check_cancel, priv);
}

I doubt this hack is the right kind of fix; it only shows that
the unit test now passes.

It might be better to abort the write I/O, and then wait for
ec_protect.c config->min_write_delay before submitting the next
I/O to the same block.

Either way, whether by not aborting the I/O or by aborting and
then waiting, block_cache.c would need to be given knowledge of
when and for how long to delay.

Let me know if I'm on the right track here. Thanks.

~ Jeff Byers ~

Jeff Byers

Mar 20, 2017, 1:25:01 PM
to s3backer-devel
Hello,

I'm also seeing a different, perhaps related case of "got
wrong content block", this time without any "aborts" on the
block in question.

The core 's3backer' code is Release 1.4.3, with my debugging
modifications to the tester code: checking both the beginning and
end of the written block contents (I was worried about a buffer
changing during the PUT), adding millisecond resolution to the
timestamps, and changing the times logged by 'tester' to
time-of-day with millisecond resolution instead of relative
times, for easier correlation with the debug logging.

This case:

2017-03-20 09:02:57.862 [00] wr 00000dfd content=0xf608ac52 START
2017-03-20 09:02:57.862 [00] wr 00000dfd content=0xf608ac52 COMPLETE
2017-03-20 09:02:58.109 DEBUG: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
> PUT /00000dfd HTTP/1.1
2017-03-20 09:02:58.148 [09] rd 00000dfd START
2017-03-20 09:02:58.149 [09] rd 00000dfd content=0xf608ac52 COMPLETE
2017-03-20 09:02:58.561 DEBUG: success: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
2017-03-20 09:04:02.741 [07] rd 00000dfd START
2017-03-20 09:04:02.741 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
> GET /00000dfd HTTP/1.1
2017-03-20 09:04:03.125 DEBUG: success: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
2017-03-20 09:04:03.125 [07] rd 00000dfd content=0xf608ac52 COMPLETE
2017-03-20 09:04:06.442 [02] rd 00000dfd START
2017-03-20 09:04:06.442 [02] rd 00000dfd content=0xf608ac52 COMPLETE
2017-03-20 09:04:27.972 [04] wr 00000dfd content=0x00000000 START
2017-03-20 09:04:27.973 [04] wr 00000dfd content=0x00000000 COMPLETE
2017-03-20 09:04:28.197 DEBUG: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
> DELETE /00000dfd HTTP/1.1
2017-03-20 09:04:28.231 DEBUG: success: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
2017-03-20 09:05:47.791 [00] wr 00000dfd content=0x6166ad04 START
2017-03-20 09:05:47.791 [00] wr 00000dfd content=0x6166ad04 COMPLETE
2017-03-20 09:05:48.023 DEBUG: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
> PUT /00000dfd HTTP/1.1
2017-03-20 09:05:48.567 DEBUG: success: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
2017-03-20 09:05:59.947 [05] rd 00000dfd START
2017-03-20 09:05:59.947 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
> GET /00000dfd HTTP/1.1
2017-03-20 09:05:59.978 DEBUG: rec'd 404 response: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000dfd
2017-03-20 09:05:59.979 [05] ****** READ ERROR: got wrong content block 00000dfd, expected content=0x6166ad04, got content=0x00000000, content=0x00000000

At 09:05:48.023, a PUT was done, apparently finishing
successfully at 09:05:48.567, but where did it go?

More than an hour later, I checked on S3, and the 00000dfd
block is there now, and it has the expected content=0x6166ad04.
S3 says that was last modified at 09:05:48, which is correct.

The GET at 09:05:59.947 was more than 10 seconds after the
prior PUT was done at 09:05:48.567, but the object wasn't
there yet?

It sounds similar to an issue reported on the AWS forums:

    AWS Developer Forums: S3 PUT after DELETE consistency ...
    https://forums.aws.amazon.com/thread.jspa?messageID=752877&tstart=0

but no explanation or resolution was provided there.

~ Jeff Byers ~

Archie Cobbs

Mar 20, 2017, 6:44:14 PM
to s3backe...@googlegroups.com
Hi Jeff,

I think you are correct, in that there is a race condition here.

Let me try to rephrase the issue, which I think is slightly more general.

If we get an error when writing a block, no matter what the error code or cause, we cannot safely assume that the block was not actually written. For example, some intermediate proxy could have had a problem with the reply, even though the block was safely received by the final destination.

Actually, if not all of the block's content has been transmitted, then we can assume that it was not actually written (the signature would not validate with only partial data), but that's an optimization not worth dealing with right now because this occurrence is rare.

So I believe your suggestion to wait for at least min_write_delay time before accessing the block - either read or write - is the right answer. Once this time has passed, we then can revert the block to the default (i.e., unknown) status, so it will have to be actually read in from S3 the next time.

By the way, this logic applies to every layer, not just the MD5 layer. For example, the block cache should also not assume a write error implies the block was not successfully written. Fortunately it already does this implicitly, because it retries indefinitely after a failed write attempt.

Please try your test with the fix committed as 4cb6d10.

Thanks,
-Archie


--
You received this message because you are subscribed to the Google Groups "s3backer-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s3backer-devel+unsubscribe@googlegroups.com.
To post to this group, send email to s3backer-devel@googlegroups.com.
Visit this group at https://groups.google.com/group/s3backer-devel.
For more options, visit https://groups.google.com/d/optout.



--
Archie L. Cobbs

Archie Cobbs

Mar 20, 2017, 6:46:27 PM
to s3backe...@googlegroups.com
This sounds like a problem for Amazon to fix.

I'm sure they'll respond with "'Eventually' means 'eventually': if the data is wrong then you just haven't waited long enough" :)

-AC




--
Archie L. Cobbs

Jeff Byers

Mar 21, 2017, 2:25:22 PM
to s3backer-devel
I integrated your 4cb6d10 commit into ec_protect.c, but I don't
think it is going to help, since I didn't notice any errors
from ec_protect occurring.

My recent tests with this change:

--blockCacheThreads=30 --blockCacheSize=30 --blockSize=1M

2017-03-21 10:02:14.466 [02] wr 00000575 content=0x00000000 START
2017-03-21 10:02:14.466 [02] wr 00000575 content=0x00000000 COMPLETE
2017-03-21 10:02:14.519 DEBUG: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:14.558 DEBUG: success: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:22.622 [02] wr 00000575 content=0x79560f19 START
2017-03-21 10:02:22.622 [02] wr 00000575 content=0x79560f19 COMPLETE
2017-03-21 10:02:22.786 DEBUG: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:23.422 DEBUG: success: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:29.322 [06] rd 00000575 START
2017-03-21 10:02:29.322 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:29.351 DEBUG: rec'd 404 response: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000575
2017-03-21 10:02:29.352 [06] ****** READ ERROR: got wrong content block 00000575, expected content=0x79560f19, got content=0x00000000, content=0x00000000

In this case, although the PUT completed at 10:02:23.422, the
GET six seconds later at 10:02:29.322 did not see it, due to
eventual consistency delays again. The PUT must have been
evicted from the block cache, or else it would have been read
from there. I'm not sure why the MD5 cache didn't protect us,
other than that the 404 error may have prevented any retries of
the slow previous PUT.

It seems that a 0.5 second minWriteDelay is not long enough, so
I tried running with 10 seconds instead:

--minWriteDelay=10000 --md5CacheTime=10000 --blockCacheThreads=30 --blockCacheSize=30 --blockSize=1M

2017-03-21 10:38:41.391 [04] wr 00000ad4 content=0x57520540 START
2017-03-21 10:38:41.391 [04] wr 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:38:41.495 DEBUG: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:38:41.676 DEBUG: success: PUT http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:38:42.106 [03] rd 00000ad4 START
2017-03-21 10:38:42.116 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:38:42.198 DEBUG: success: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:38:42.198 [03] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:39:35.577 [09] rd 00000ad4 START
2017-03-21 10:39:35.788 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:39:36.158 DEBUG: success: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:39:36.159 [09] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:41:16.913 [07] rd 00000ad4 START
2017-03-21 10:41:16.924 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:41:17.325 DEBUG: success: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:41:17.326 [07] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:41:51.911 [05] wr 00000ad4 content=0x00000000 START
2017-03-21 10:41:51.911 [05] wr 00000ad4 content=0x00000000 COMPLETE
2017-03-21 10:41:51.984 DEBUG: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:41:52.070 DEBUG: success: DELETE http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:42:03.458 [05] rd 00000ad4 START
2017-03-21 10:42:03.480 DEBUG: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:42:03.653 DEBUG: success: GET http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad4
2017-03-21 10:42:03.653 [05] ****** READ ERROR: got wrong content block 00000ad4, expected content=0x00000000, got content=0x57520540, content=0x57520540

Again, even with --minWriteDelay=10000, the GET at
10:42:03.480 did not see the DELETE that finished at
10:41:52.070, 11.41 seconds earlier. Is it because the prior
write was a delete, so perhaps it was not really in the MD5
cache and there was no protection there?

Earlier, though, I see the GET at 10:38:42.116 occurring only
0.44 seconds after the PUT finished at 10:38:41.676. It worked
in that case, and the --minWriteDelay=10000 setting only guards
PUTs and DELETEs against each other, not GETs. Perhaps the MD5
cache protected us here (though nothing was logged), or more
likely S3 was simply fast enough that time.


Thanks.

~ Jeff Byers ~


Archie Cobbs

Mar 21, 2017, 6:15:44 PM
to s3backe...@googlegroups.com
On Tue, Mar 21, 2017 at 1:25 PM, Jeff Byers <jeff....@gmail.com> wrote:
Again, even with the --minWriteDelay=10000, the GET at
10:42:03.480 did not see the DELETE that finished at
10:41:52.070, which was 11.42 seconds earlier. Is it because
the prior write was a delete, perhaps it was not really in the
MD5 cache so no protection there?

Correct, after 10 seconds the MD5 cache forgets the block completely.

10 seconds seems like a much longer delay than "usual" for S3 convergence... maybe they've gotten slower over time?

Earlier though, I see the GET at 10:38:42.116 occurring only
0.41 seconds after the PUT finished at 10:38:41.676. It worked
in that case, and the --minWriteDelay=10000 setting only
guards PUTs and DELETEs from each other, not GETs. Perhaps the
MD5 cache protected us here (but nothing logged), or more
likely the S3 was simply fast enough that time.

In this case the GET carries an If-Match header restricting the returned block to the expected MD5, so Amazon must have already converged by then.

-Archie

--
Archie L. Cobbs