I integrated your 4cb6d10 commit to ec_protect.c, but dont
think it is going to help, since I didnt notice any errors
from ec_protect occuring.
My recent tests with this change:
--blockCacheThreads=30 --blockCacheSize=30 --blockSize=1M
2017-03-21 10:02:14.466 [02] wr 00000575 content=0x00000000 START
2017-03-21 10:02:14.466 [02] wr 00000575 content=0x00000000 COMPLETE
2017-03-21 10:02:14.519 DEBUG: DELETE
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:14.558 DEBUG: success: DELETE
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:22.622 [02] wr 00000575 content=0x79560f19 START
2017-03-21 10:02:22.622 [02] wr 00000575 content=0x79560f19 COMPLETE
2017-03-21 10:02:22.786 DEBUG: PUT
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:23.422 DEBUG: success: PUT
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:29.322 [06] rd 00000575 START
2017-03-21 10:02:29.322 DEBUG: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:29.351 DEBUG: rec'd 404 response: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/000005752017-03-21 10:02:29.352 [06] ****** READ ERROR: got wrong content block 00000575, expected content=0x79560f19, got content=0x00000000, content=0x00000000
In this case, although the PUT completed at 10:02:23.422, the
GET 6 seconds later at 10:02:29.322 did not see it due to
eventual consistancy delays again. The PUT must have been
evicted from the block cache, else it would have been read
from there. Not sure why the MD5 cache didnt protect us, other
than perhaps the 404 error may have prevented any retries of
the slow previous PUT.
It seems that a 0.5 second minWriteDelay is long enough, so I
tried running with 10 seconds instead:
--minWriteDelay=10000 --md5CacheTime=10000 --blockCacheThreads=30 --blockCacheSize=30 --blockSize=1M
2017-03-21 10:38:41.391 [04] wr 00000ad4 content=0x57520540 START
2017-03-21 10:38:41.391 [04] wr 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:38:41.495 DEBUG: PUT
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:38:41.676 DEBUG: success: PUT
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:38:42.106 [03] rd 00000ad4 START
2017-03-21 10:38:42.116 DEBUG: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:38:42.198 DEBUG: success: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:38:42.198 [03] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:39:35.577 [09] rd 00000ad4 START
2017-03-21 10:39:35.788 DEBUG: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:39:36.158 DEBUG: success: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:39:36.159 [09] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:41:16.913 [07] rd 00000ad4 START
2017-03-21 10:41:16.924 DEBUG: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:41:17.325 DEBUG: success: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:41:17.326 [07] rd 00000ad4 content=0x57520540 COMPLETE
2017-03-21 10:41:51.911 [05] wr 00000ad4 content=0x00000000 START
2017-03-21 10:41:51.911 [05] wr 00000ad4 content=0x00000000 COMPLETE
2017-03-21 10:41:51.984 DEBUG: DELETE
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:41:52.070 DEBUG: success: DELETE
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:42:03.458 [05] rd 00000ad4 START
2017-03-21 10:42:03.480 DEBUG: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:42:03.653 DEBUG: success: GET
http://jlb-tst1.s3-us-west-1.amazonaws.com/00000ad42017-03-21 10:42:03.653 [05] ****** READ ERROR: got wrong content block 00000ad4, expected content=0x00000000, got content=0x57520540, content=0x57520540
Again, even with the --minWriteDelay=10000, the GET at
10:42:03.480 did not see the DELETE that finished at
10:41:52.070, which was 11.42 seconds earlier. Is it because
the prior write was a delete, perhaps it was not really in the
MD5 cache so no protection there?
Earlier though, I see the GET at 10:38:42.116 occurring only
0.41 seconds after the PUT finished at 10:38:41.676. It worked
in that case, and the --minWriteDelay=10000 setting only
guards PUTs and DELETEs from each other, not GETs. Perhaps the
MD5 cache protected us here (but nothing logged), or more
likely the S3 was simply fast enough that time.