Hi, thank you for your reply! Here are my thoughts on this:
1. I believe that a more dynamic verification strategy can help improve how corruption scenarios are handled. I also understand why you need some form of agreement to decide how to fix the corrupted replicas. When implemented, this feature will require a good amount of testing, which our simple set of tools can help with. I would be happy to discuss this further once dynamic verification is implemented.
2. I will file the issues we found with respect to storage-related errors on GitHub. Once these issues are fixed, we can rerun the tests and verify that all corner cases are covered.
I agree that on-disk ECC can detect and fix almost all corruptions caused by bit rot on the disk. However, on-disk ECC cannot help against buggy controller firmware or problems in higher-level software, such as bugs in the kernel or the file system. File-system checksums (e.g., in ZFS) can reliably detect corruption but cannot repair it, as they do not maintain redundant data blocks (at least under commonly used mount options). Moreover, many commonly used Linux file systems (for example, ext4, the default on many distributions) do not checksum data blocks at all and so cannot even detect silent corruption. While advising users to deploy ZFS or a similar file system can mitigate some of these issues, I believe it is important for critical applications to be resilient on the most commonly used file systems such as ext4.
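To make this concrete, here is a minimal sketch (in Go, with hypothetical function names, not CockroachDB's actual code) of the kind of application-level block checksumming I have in mind; it catches silent corruption that ext4 would pass through unnoticed:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrCorrupt is returned when a block's stored checksum does not
// match the checksum recomputed from its payload.
var ErrCorrupt = errors.New("block checksum mismatch: silent corruption detected")

// encodeBlock prepends a CRC32C checksum to the payload before it is
// written to disk. (CRC32C is a common choice in storage systems.)
func encodeBlock(payload []byte) []byte {
	buf := make([]byte, 4+len(payload))
	sum := crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli))
	binary.LittleEndian.PutUint32(buf[:4], sum)
	copy(buf[4:], payload)
	return buf
}

// decodeBlock verifies the checksum on read; a mismatch means the block
// was silently corrupted somewhere below the application.
func decodeBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, ErrCorrupt
	}
	want := binary.LittleEndian.Uint32(block[:4])
	payload := block[4:]
	if crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli)) != want {
		return nil, ErrCorrupt
	}
	return payload, nil
}

func main() {
	block := encodeBlock([]byte("hello, replica"))
	block[7] ^= 0x01 // simulate a single flipped bit in the payload
	if _, err := decodeBlock(block); err != nil {
		fmt.Println(err) // corruption is detected rather than silently served
	}
}
```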
I agree that memory corruption is also a problem, though it has not been the focus of our tests. In the case of disk corruption, where reliable detection is possible (using application-level checksums) and replicated copies of the data exist, I believe we can improve an application's resilience by repairing the corrupted replica from the other intact replicas (or at least detecting the corruption reliably and doing something to recover from the situation). I concur that absolute protection may sometimes be impossible. But I believe we can certainly improve the current state of resiliency in CockroachDB.
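As a rough illustration of that repair path (a sketch only; `replica`, `readWithRepair`, and the in-memory peer list are hypothetical, and a real implementation would need agreement on which copy is authoritative, per your earlier point):

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

var errCorrupt = errors.New("no intact replica found")

// replica models one copy of a block; checksum is an application-level
// CRC32C stored alongside the data.
type replica struct {
	data     []byte
	checksum uint32
}

func (r replica) verify() bool {
	return crc32.Checksum(r.data, crc32.MakeTable(crc32.Castagnoli)) == r.checksum
}

// readWithRepair serves the local replica if it is intact; otherwise it
// scans the peers for an intact copy, overwrites the local replica with
// it, and returns the repaired data. If no intact copy exists, the
// corruption is at least reported instead of being silently served.
func readWithRepair(local *replica, peers []replica) ([]byte, error) {
	if local.verify() {
		return local.data, nil
	}
	for _, p := range peers {
		if p.verify() {
			local.data = append([]byte(nil), p.data...) // repair in place
			local.checksum = p.checksum
			return local.data, nil
		}
	}
	return nil, errCorrupt
}

func main() {
	sum := func(b []byte) uint32 {
		return crc32.Checksum(b, crc32.MakeTable(crc32.Castagnoli))
	}
	good := []byte("intact block")
	local := replica{data: []byte("intacu block"), checksum: sum(good)} // corrupted copy
	peers := []replica{{data: good, checksum: sum(good)}}
	data, err := readWithRepair(&local, peers)
	fmt.Println(string(data), err) // repaired from the intact peer
}
```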
I agree that most (though not all) storage systems are susceptible to corruption. But many replicated storage systems have taken measures against it by carefully adding application-level checksums, verifying them on reads, and so on. For example, some systems do not attempt to recover from corruption at all, but they do detect it reliably: on detecting corruption, the node simply shuts down (i.e., recovery is a no-op). Although this may seem like a poor way of reacting to corruption, in practice it works well. Once the corrupted node goes down, the remaining nodes elect a new leader (if the corrupted node was the old leader) and continue to make progress. In CockroachDB, these corner cases still seem problematic; for example, the cluster can become unavailable even if only one node's data is corrupted. I believe such issues are important to fix, as they affect the reliability of the system as a whole.
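The fail-stop reaction is simple to express; here is a minimal sketch (the `onBlockRead` hook is hypothetical) of the "detect and crash" policy, which converts silent corruption into an ordinary node failure that the replication layer already knows how to handle:

```go
package main

import "log"

// onBlockRead is a hypothetical read-path hook: if the stored and
// recomputed checksums disagree, the node terminates immediately rather
// than serving bad data. Recovery on this node is a no-op; availability
// is preserved by the remaining replicas, which elect a new leader and
// keep making progress.
func onBlockRead(data []byte, want, got uint32) []byte {
	if want != got {
		log.Fatalf("checksum mismatch (want %08x, got %08x): shutting down node", want, got)
	}
	return data
}

func main() {
	_ = onBlockRead([]byte("payload"), 0xDEADBEEF, 0xDEADBEEF) // checksums match: served normally
}
```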
I would be happy to hear your thoughts.
Thanks
Ram