Hi, thank you for your reply! Here are my thoughts on this:
1. I believe that a more dynamic verification strategy can help improve how corruption scenarios are handled. I also understand why you need some form of agreement to decide how to fix the corrupted replicas. When implemented, this feature will require a good amount of testing, which our simple set of tools can help with. I would be happy to discuss this further once dynamic verification is implemented.
2. I will file the issues we found with respect to storage-related errors on GitHub. Once these issues are fixed, we can rerun the tests and verify that all corner cases are covered.
I agree that on-disk ECC can detect and fix almost all corruptions caused by bit rot on the disk. However, on-disk ECC cannot help against buggy controller firmware or problems in higher-level software, such as bugs in the kernel or the file system. File-system checksums (e.g., in ZFS) can reliably detect corruption but cannot repair it, as they do not maintain redundant data blocks (at least under commonly used mount options). Moreover, many commonly used Linux file systems (for example, ext4, the default on many distributions) do not checksum data blocks at all and so cannot even detect silent corruption. While advising users to deploy ZFS or a similar file system can mitigate some of these issues, I believe it is important for critical applications to be resilient on the most commonly used file systems such as ext4.
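To make this concrete, here is a minimal sketch (in Go, with hypothetical function names, not CockroachDB's actual code) of the kind of application-level block checksumming I have in mind; it catches silent corruption that ext4 would pass through unnoticed:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrCorrupt is returned when a block's stored checksum does not
// match the checksum recomputed from its payload.
var ErrCorrupt = errors.New("block checksum mismatch: silent corruption detected")

// encodeBlock prepends a CRC32C checksum to the payload before it is
// written to disk. (CRC32C is a common choice in storage systems.)
func encodeBlock(payload []byte) []byte {
	buf := make([]byte, 4+len(payload))
	sum := crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli))
	binary.LittleEndian.PutUint32(buf[:4], sum)
	copy(buf[4:], payload)
	return buf
}

// decodeBlock verifies the checksum on read; a mismatch means the block
// was silently corrupted somewhere below the application.
func decodeBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, ErrCorrupt
	}
	want := binary.LittleEndian.Uint32(block[:4])
	payload := block[4:]
	if crc32.Checksum(payload, crc32.MakeTable(crc32.Castagnoli)) != want {
		return nil, ErrCorrupt
	}
	return payload, nil
}

func main() {
	block := encodeBlock([]byte("hello, replica"))
	block[7] ^= 0x01 // simulate a single flipped bit in the payload
	if _, err := decodeBlock(block); err != nil {
		fmt.Println(err) // corruption is detected rather than silently served
	}
}
```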
I agree that memory corruption is also a problem, though it has not been the focus of our tests. In the case of disk corruption, where reliable detection is possible (using application-level checksums) and replicated copies of the data exist, I believe we can improve an application's resilience by repairing the corrupted replica from the other intact replicas (or at least detecting the corruption reliably and doing something to recover from the situation). I concur that absolute protection may sometimes be impossible. But I believe we can certainly improve the current state of resiliency in CockroachDB.
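As a rough illustration of that repair path (a sketch only; `replica`, `readWithRepair`, and the in-memory peer list are hypothetical, and a real implementation would need agreement on which copy is authoritative, per your earlier point):

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

var errCorrupt = errors.New("no intact replica found")

// replica models one copy of a block; checksum is an application-level
// CRC32C stored alongside the data.
type replica struct {
	data     []byte
	checksum uint32
}

func (r replica) verify() bool {
	return crc32.Checksum(r.data, crc32.MakeTable(crc32.Castagnoli)) == r.checksum
}

// readWithRepair serves the local replica if it is intact; otherwise it
// scans the peers for an intact copy, overwrites the local replica with
// it, and returns the repaired data. If no intact copy exists, the
// corruption is at least reported instead of being silently served.
func readWithRepair(local *replica, peers []replica) ([]byte, error) {
	if local.verify() {
		return local.data, nil
	}
	for _, p := range peers {
		if p.verify() {
			local.data = append([]byte(nil), p.data...) // repair in place
			local.checksum = p.checksum
			return local.data, nil
		}
	}
	return nil, errCorrupt
}

func main() {
	sum := func(b []byte) uint32 {
		return crc32.Checksum(b, crc32.MakeTable(crc32.Castagnoli))
	}
	good := []byte("intact block")
	local := replica{data: []byte("intacu block"), checksum: sum(good)} // corrupted copy
	peers := []replica{{data: good, checksum: sum(good)}}
	data, err := readWithRepair(&local, peers)
	fmt.Println(string(data), err) // repaired from the intact peer
}
```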
I agree that most (though not all) storage systems are susceptible to corruption. But many replicated storage systems have taken measures against it by carefully adding application-level checksums, verifying them on reads, and so on. For example, some systems do not attempt to recover from corruption at all, but they do detect it reliably: on detecting corruption, the node simply shuts down (i.e., recovery is a no-op). Although this may seem like a poor way of reacting to corruption, in practice it works well. Once the corrupted node goes down, the remaining nodes elect a new leader (if the corrupted node was the old leader) and continue to make progress. In CockroachDB, these corner cases still seem problematic; for example, the cluster can become unavailable even if only one node's data is corrupted. I believe such issues are important to fix, as they affect the reliability of the system as a whole.
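The fail-stop reaction is simple to express; here is a minimal sketch (the `onBlockRead` hook is hypothetical) of the "detect and crash" policy, which converts silent corruption into an ordinary node failure that the replication layer already knows how to handle:

```go
package main

import "log"

// onBlockRead is a hypothetical read-path hook: if the stored and
// recomputed checksums disagree, the node terminates immediately rather
// than serving bad data. Recovery on this node is a no-op; availability
// is preserved by the remaining replicas, which elect a new leader and
// keep making progress.
func onBlockRead(data []byte, want, got uint32) []byte {
	if want != got {
		log.Fatalf("checksum mismatch (want %08x, got %08x): shutting down node", want, got)
	}
	return data
}

func main() {
	_ = onBlockRead([]byte("payload"), 0xDEADBEEF, 0xDEADBEEF) // checksums match: served normally
}
```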
I would be happy to hear your thoughts.
Thanks
Ram