Petabytes, sure. Petabytes of integral data, nope. No DIF == no integrity.
From the article: “…When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.”
Can’t argue with that. But what data? The data that you think you wrote, at the address you think you wrote it to, at the LUN or ID you think you sent it to? How do you know?
But let’s look a little deeper here. They use two power supplies, rated at 760 W each. It appears the unit needs both supplies up to function, since they have subdivided duties: lose one supply and half your drives disappear, and if you lose half the drives in the scheme they use (see below), your box is dead. Not very good if you have a halfway decent SLA. But perhaps they are assuming cloud vendors will not have SLAs with teeth – which is certainly the case today.
Then there are disk rebuilds and sparing. No hot spares are configured, or at least none are described. It looks like they run three 13+2 double-parity sets, so every drive holds necessary information. There is no mention of how one spares out a failed drive, so you have to assume entire-unit downtime to replace a drive and let a rebuild begin. Speaking of which, how long does that take? Rebuilding a 1.5TB drive – and the SiI chipset they use cannot recognize white space, so they have to rebuild the entire drive – easily takes tens of hours. Have you ever rebuilt a drive from a 13+2 set? It’s ugly. I have to read 19.5 TB of data to rebuild a single 1.5TB drive. How long does that take? Even if the box could write to the replacement at max speed – 120MB/sec, or 432 GB/hour – the write alone would take three and a half to four hours. But that doesn’t count the read time: it takes 45 hours to read 19.5 TB of data @ 432 GB/hr, never mind the compute time needed to recalculate double parity across the entire set. That’s the problem.
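To put numbers on that, here is the back-of-the-envelope math as a short Python sketch, using only the figures above (1.5 TB drives, ~120 MB/sec ≈ 432 GB/hour); it ignores parity recomputation and any competing I/O, both of which only make things worse.

```python
# Rebuild arithmetic for one 13+2 double-parity set of 1.5 TB drives at the
# ~432 GB/hour streaming rate quoted above.
DRIVE_TB = 1.5            # capacity of one drive, TB
RATE_GB_PER_HR = 432      # ~120 MB/sec sustained

read_tb = 13 * DRIVE_TB                          # 19.5 TB read to reconstruct one drive
read_hours = read_tb * 1000 / RATE_GB_PER_HR     # ~45 hours
write_hours = DRIVE_TB * 1000 / RATE_GB_PER_HR   # ~3.5 hours

print(f"read  {read_tb} TB -> {read_hours:.0f} hours")
print(f"write {DRIVE_TB} TB -> {write_hours:.1f} hours")
```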
Then there are the mechanicals. Quite literally, they use rubber bands to ‘dampen vibration’. Now, that’s certainly novel, but I wouldn’t trust my data to it. What’s next, duct tape? I bet this box goes 30 rads easily on the vibration test. They also mount all the drives in the same orientation, a sure-fire way to induce torque and stress. Nor would I want to be a drive in the rearmost group of 15, given their depiction of the fans – and they use fans, not blowers.
On average, they will lose about 2 drives every year (a 4% annual failure rate across 45 drives), and by year 5 that will probably rise to 6-8 per year, given their use of consumer-level drives. So, over the course of 5 years, you may have to replace roughly half the drive population. Did they mention that in their cost analysis?
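Spelled out as a sketch (the 4% rate for the early years is from the text above; the year-3 figure is my interpolation, and years 4-5 follow the 6-8 per year guess):

```python
# Expected drive replacements over 5 years for one 45-drive pod.
DRIVES_PER_POD = 45
yearly_failures = [0.04 * DRIVES_PER_POD,   # year 1: ~1.8 drives
                   0.04 * DRIVES_PER_POD,   # year 2: ~1.8 drives
                   4, 6, 8]                 # years 3-5: rising wear-out

total = sum(yearly_failures)
print(f"~{total:.0f} of {DRIVES_PER_POD} drives replaced over 5 years")  # ~22, roughly half
```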
Seagate itself says this drive is not well suited to array-based storage. Look at their website for this drive and see the recommended applications – PCs, desktop (not array) RAID, and personal external storage. That is because of the duty cycle these drives are built for. Is this array powered on and active all the time? Verbatim from the drive manual:
Normal I/O duty cycle for desktop personal computers. Operation at excessive I/O duty cycle may degrade product reliability.
The post mentions expense. Sure, I am worried about expense too – above all, the biggest expense: humans. I need a server admin to manage and babysit this machine. How many of these can one admin manage? Far fewer than I would expect for a cloud storage array, because management is completely by hand. Compare that to intelligent storage arrays, which need no management beyond initial provisioning.
Just replicate data to protect it, right? To what degree of replication? Is 2X enough? I know several cloud providers that replicate 3X using arrays that are far more reliable than this, not made of nylon shims and rubber bands. So, now that petabyte becomes $351,000, and I also need triple the humans to manage triple the units. Plus, that is not counting any drive replacement cost over time, and for 3 units (135 drives) I will probably have to replace 60-70 drives over 5 years for another 9 grand or so. This also assumes there is no cost for a human to replace a drive, an ‘extraordinary event’.
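The arithmetic behind those figures, as a sketch: the $117,000 per-petabyte pod cost is the number used in this thread, while the ~$130 price per replacement drive is my assumption, not a quoted figure.

```python
# 3x replication and 5-year drive replacement costs.
COST_PER_PB = 117_000          # one pod-built petabyte, per the discussion
REPLICAS = 3
REPLACEMENT_DRIVES = 65        # midpoint of the 60-70 estimate across 3 units
DRIVE_PRICE = 130              # assumed street price of a 1.5 TB consumer drive

print(f"3x-replicated petabyte: ${COST_PER_PB * REPLICAS:,}")                 # $351,000
print(f"5-year drive replacements: ~${REPLACEMENT_DRIVES * DRIVE_PRICE:,}")   # ~$8,450
```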
But let’s be very generous and say they can build, operate, and maintain this box for 5 years for $500K. That’s certainly cheaper than commercial arrays. But again, for me the biggest issue is integrity. At petabyte scale, it’s a huge problem for any ATA technology. Just ask the guys at CERN: even with FC drives that didn’t do DIF, it was a huge problem.
Just the UCR errors alone at petabyte scale with this box are scary. This drive is rated at only 1 unrecoverable error per 10^14 bits read. 1*10^14 bits is 1.25*10^13 bytes, or 12.5 TB. This means if you spin up and read a petabyte of this stuff, you get roughly 80 errors which are uncorrectable. Murphy says those errors land in the data you really needed.
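The unit conversion, for the record:

```python
# Unrecoverable-read-error math for a drive rated at 1 error per 10^14 bits.
UCR_BITS = 1e14                    # spec-sheet rate: 1 error per 10^14 bits read
bytes_per_error = UCR_BITS / 8     # 1.25e13 bytes = 12.5 TB
errors_per_pb = 1e15 / bytes_per_error

print(f"one error per {bytes_per_error / 1e12:.1f} TB read")     # 12.5 TB
print(f"~{errors_per_pb:.0f} unrecoverable errors per PB read")  # ~80
```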
Good news – cheap storage. Bad news – cheap storage. This thing is a nice home-built whitebox. It would do well to stay there. You wouldn’t run a serious cloud compute infrastructure on dirt-cheap whitebox servers; likewise, you shouldn’t run a serious cloud storage infrastructure on dirt-cheap whitebox storage.
Rob
It is absolutely true. ZFS can’t and doesn’t know about addressing and LUN/ID errors at the disk level. It can check the integrity of the data (the payload), as some other filesystems do, but that’s only one-third of the problem. Now, if ZFS appended the T10 DIF field to each 512-byte block at the initiator, your answer would be correct, but ZFS does no such thing. It does scrubbing and “resilvering”, but that is not DIF. The former merely checks whether blocks _can_ be read; the latter, on a block mismatch, makes an educated guess about which copy is right and attempts to rewrite it.
But again, that is just the data, not the addressing. DIF prevents such errors.
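For anyone who hasn’t looked at what the T10 DIF field actually carries, here is a minimal sketch (mine, not from any real initiator stack) of the 8-byte protection information appended to each 512-byte block under Type 1 protection: a guard tag that is a CRC-16 of the payload, an application tag, and a reference tag tied to the target LBA. The guard covers the data and the reference tag covers the addressing – exactly the coverage a filesystem checksum alone does not give you.

```python
import struct

def t10_crc16(data: bytes) -> int:
    """Guard-tag CRC-16 as used by T10 DIF (polynomial 0x8BB7, init 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def make_pi(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte protection information for one 512-byte block:
    guard tag over the payload, application tag, and a reference tag
    carrying the low 32 bits of the target LBA (Type 1)."""
    assert len(sector) == 512
    return struct.pack(">HHI", t10_crc16(sector), app_tag, lba & 0xFFFFFFFF)

# The target recomputes the guard and compares the reference tag against the
# LBA it is actually writing; a mismatch in either means the payload or the
# addressing was mangled somewhere between initiator and platter.
pi = make_pi(b"\x00" * 512, lba=123456)
```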
>>> If you know for certain your data is corrupted
And therein lies the rub.
Lack of DIF at petabyte scale does indeed damn it for integrity. You are correct that there are even more checks beyond T10 DIF that could be done, but in terms of the actual SCSI CDB, DIF covers the three inputs the initiator supplies: payload, address/length (and its subsequent V->P translation), and LUN.
As for market pricing, what value do you place on your data? It’s really an exercise in risk management. Look at it this way: how much would it cost you to recreate or restore that petabyte? What is the time value of not having access to it? If a petabyte of data is only worth $117,000 to you, that’s one thing. For most companies, though, their petabyte is worth far, far more than that.
Then, there is the issue of storage performance – which is an entirely different (but related) topic. Cost of space is one thing, cost of time is very much another. There is absolutely no mention of performance in the post.
If they really wanted massive petabytes on the cheap, without regard to latency or performance, they should use the technology which draws no power when not doing I/O and can be purchased for under $30,000/raw PB as opposed to the $80,000/raw PB they spend now. The power savings alone would be significant.
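Taking the thread’s own $/raw-PB figures at face value (the low-power technology isn’t named above), the gap per petabyte is straightforward, and it compounds under the 3x replication discussed earlier:

```python
# Raw-capacity cost gap per the figures above; power savings are not modeled
# because no wattage numbers appear in the thread.
POD_PER_PB = 80_000     # what they spend now, per raw PB
ALT_PER_PB = 30_000     # the low-power alternative cited above
REPLICAS = 3            # the 3x replication discussed earlier

print(f"savings per 3x-replicated petabyte: ${(POD_PER_PB - ALT_PER_PB) * REPLICAS:,}")  # $150,000
```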
From: cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On Behalf Of Jayarama Shenoy
Sent: Thursday, September 03, 2009 10:36 AM
To: cloud-c...@googlegroups.com
Subject: [ Cloud Computing ] Build your own storage - blog from BackBlaze
Very interesting post on making Petabytes on a budget for cloud storage.
http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/#more-150
A question to those who'd think of doing something similar themselves - at what price point would there be an opportunity for a vendor to be a value add?
Backblaze's pod is not too different from Sun's Thumper (aka the X4500). Some plumbing details are different, CPU & memory configs are different, and so is Linux/JFS vs Solaris/ZFS. You can argue that in each case Sun's options are superior, but that and support/warranty were not enough to offset the $8K vs $50K.
So what price point would've made sense? Anyone from Backblaze on this list?
One last time as well. DIF is still essential regardless of whether the filesystem itself does payload checking. The problem is not that a given filesystem will have bad integrity for its own data, but that it can munge _someone else’s_ data.
If the FS, say, orders a write to LUN 5 but it ends up on LUN 6, then without DIF that write has potentially destroyed someone else’s data, and your filesystem has no idea what just happened or whom it happened to. The same thing can happen if the error is in addressing (V->P) instead of the LUN. Silent corruption of other people’s data.
The poor soul who comes along later expecting integral data on LUN 6 is in for, as they say, a world of hurt. Sure, your filesystem can detect and fix its own, but how about the other guy?
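As a hedged sketch of the mechanics (the function and field names here are mine, purely for illustration): the initiator stamps each block’s protection information with the LBA it intended, and can use the application tag to mark the owner, so the target can refuse a block that shows up somewhere it doesn’t belong. Carrying an owner/LUN identifier in the application tag is an implementation choice, not something the standard mandates.

```python
def verify_pi(expected_lba: int, expected_owner: int,
              ref_tag: int, app_tag: int) -> bool:
    """Accept a write only if its protection information matches the LBA the
    target is actually writing and the owner the target expects."""
    return ref_tag == (expected_lba & 0xFFFFFFFF) and app_tag == expected_owner

# A write stamped for LUN 5 gets misdirected to LUN 6: the payload CRC is
# perfectly valid, but the owner tag does not match, so the target rejects
# the I/O instead of silently overwriting LUN 6's data.
print(verify_pi(expected_lba=1000, expected_owner=6, ref_tag=1000, app_tag=5))  # False
```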
Thanks and this is the last post on this. Onward.