Build your own storage - blog from BackBlaze

Jayarama Shenoy

Sep 3, 2009, 11:35:49 AM

to cloud-c...@googlegroups.com

Very interesting post on making Petabytes on a budget for cloud storage.

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/#more-150

A question to those who'd think of doing something similar themselves - at what price point would there be an opportunity for a vendor to be a value add?

Backblaze's pod is not too different from Sun's Thumper (aka X4500). Some plumbing details are different, CPU & memory configs are different and so is Linux/JFS vs Solaris/ZFS. You can argue that in each case, Sun's options are superior, but that and support/warranty were not enough to offset the $8K vs $50K.

So what price point would've made sense? Anyone from Backblaze on this list?

Peglar, Robert

Sep 3, 2009, 2:17:27 PM

to cloud-c...@googlegroups.com

Petabytes, sure. Petabytes of integral data, nope. No DIF == no integrity.

From the article: “…When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.”

Can’t argue with that. But what data? The data that you think you wrote, at the address you think you wrote it to, at the LUN or ID you think you sent it to? How do you know?

But let’s look a little deeper here. They use two power supplies, 760W rated each. It looks like the unit must have both supplies up to function, as they have subdivided duties. Lose one supply, half your drives disappear, and if you lose half the drives in the scheme they use (see below) your box is dead. Not very good if you have a halfway decent SLA. But perhaps they are assuming cloud vendors will not have SLAs with teeth – which is certainly the case today.

Then there are disk rebuilds and sparing. No hot spares configured, or at least described. Looks like they do three sets of double parity 13+2. So all drives are holding necessary information. No mention of how one spares out a drive, so you have to assume entire unit downtime to replace a drive and allow a rebuild to begin. Speaking of which, how long does that take? Rebuilding a 1.5TB drive – and the SiL chipset they use cannot recognize white space, so they have to rebuild the entire drive – takes tens of hours, easily. Have you ever rebuilt a drive from a 13+2 set? It’s ugly. I have to read 19.5 TB of data to rebuild a single 1.5TB drive. How long does it take to read 19.5 TB of data? Even if the box could write data at max speed to the spare, which is 120MB/sec or 432 GB/hour, it would take about 3.5 hours. But that doesn’t count the read time – and it takes 45 hours to read 19.5 TB of data @ 432 GB/hr – never mind the compute time needed to recalculate double parity on that entire set. That’s the problem.
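
The rebuild arithmetic above can be checked with a short sketch. It assumes, as the post does, that 120 MB/sec is the bottleneck for both the write to the replacement drive and the reads from the 13 surviving data members, and that units are decimal (1 TB = 1000 GB). The function names are made up for illustration.

```python
# Back-of-the-envelope rebuild estimate for one drive in a 13+2 set.
DRIVE_TB = 1.5        # capacity of the failed drive
DATA_MEMBERS = 13     # members that must be read to reconstruct it
MB_PER_SEC = 120      # assumed sustained throughput (the bottleneck)


def gb_per_hour(mb_per_sec: float) -> float:
    """Convert MB/sec to GB/hour using decimal units."""
    return mb_per_sec * 3600 / 1000


def hours_to_move(terabytes: float, mb_per_sec: float) -> float:
    """Hours needed to move the given number of TB at the given rate."""
    return terabytes * 1000 / gb_per_hour(mb_per_sec)


read_tb = DRIVE_TB * DATA_MEMBERS
print(f"throughput: {gb_per_hour(MB_PER_SEC):.0f} GB/hour")
print(f"write {DRIVE_TB} TB: {hours_to_move(DRIVE_TB, MB_PER_SEC):.1f} hours")
print(f"read {read_tb} TB: {hours_to_move(read_tb, MB_PER_SEC):.1f} hours")
```

If the controller can read the surviving members in parallel, the read time shrinks accordingly; the 45-hour figure is the serialized worst case this estimate assumes.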

Then there are the mechanicals. Quite literally, they use rubber bands to ‘dampen vibration’. Now, that’s certainly novel, but I wouldn’t trust my data to that. What’s next, duct tape? I bet this box goes 30 rads easily on the vibration test. They also mount all the drives in the same orientation, a sure-fire way to induce torque and stress. I would also not want to be a drive in the rearmost group of 15, given their depiction of fans – and they use fans, not blowers.

On average, they will lose 2 drives every year (4 per 100 per annum) and by year 5, it will probably rise to 6-8 per year, given their use of consumer-level drives. So, over the course of 5 years, you may have to replace roughly half the drive population. Did they mention that in their cost analysis?
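
The "roughly half" estimate can be sanity-checked with a small sketch. It assumes a 45-drive pod, the 4% annual failure rate quoted above in year 1, and a rise to about 16% (roughly 7 drives) by year 5. The straight-line ramp between those endpoints and the function name are assumptions made for illustration, and the sketch ignores that replacement drives start out young.

```python
def expected_failures(drives, afr_start, afr_end, years):
    """Expected failures per year, with the failure rate rising linearly."""
    if years < 2:
        raise ValueError("need at least two years to interpolate a ramp")
    step = (afr_end - afr_start) / (years - 1)
    return [drives * (afr_start + step * year) for year in range(years)]


per_year = expected_failures(drives=45, afr_start=0.04, afr_end=0.16, years=5)
total = sum(per_year)
print([round(n, 1) for n in per_year])
print(f"{total:.1f} of 45 drives over 5 years ({total / 45:.0%})")
```

Under these assumptions the five-year total lands at about half the drive population, in line with the estimate in the post.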

Seagate themselves say this drive is not best adapted for array-based storage. Look at their website for this drive and see their recommended applications – PCs, desktop (not array) RAID and personal external storage. This is because of the duty cycle that these drives are built to. Is this array powered on and active all the time? Verbatim from the drive manual:

Normal I/O duty cycle for desktop personal computers. Operation at excessive I/O duty cycle may degrade product reliability.

The paper mentions expense. Sure, I am worried about expense too. I am worried about the biggest expense of all – humans. I need a server admin to manage and babysit this machine. How many of these can one admin manage? The answer is, far fewer than I would expect of a cloud storage array. Management is completely by hand. Compare to storage arrays which are intelligent and need no management beyond initial provisioning.

Just replicate data to protect it, right? To what degree of replication? Is 2X enough? I know several cloud providers that replicate 3X using arrays that are far more reliable than this, not made of nylon shims and rubber bands. So, now that petabyte becomes $351,000, and I also need triple the humans to manage triple the units. Plus, that is not counting any drive replacement cost over time, and for 3 units (135 drives) I will probably have to replace 60-70 drives over 5 years for another 9 grand or so. This also assumes there is no cost for a human to replace a drive, an ‘extraordinary event’.

But let’s be very generous and say they can build and operate/maintain this box for 5 years for $500 grand. That’s certainly cheaper than commercial arrays. But again, for me, the biggest issue is integrity. At petabyte scale, it’s a huge problem for any ATA technology. Just ask the guys at CERN; even with FC drives that didn’t do DIF, it’s a huge problem.

Just the UCR errors alone at petabyte scale with this box are scary. This drive is only 1 in 10^14. 1*10^14 bits is 1.25*10^13 bytes, or 125 TB. This means if you spin and read a petabyte of this stuff, you get 8 errors which are uncorrectable. Murphy says those 8 errors are in the data you really needed.

Good news – cheap storage. Bad news – cheap storage. This thing is a nice home-built whitebox. It would do well to stay there. You wouldn’t run a serious cloud compute infrastructure on dirt-cheap whitebox servers; likewise, you shouldn’t run a serious cloud storage infrastructure on dirt-cheap whitebox storage.

Rob

Jayarama Shenoy

Sep 3, 2009, 3:49:18 PM

to cloud-c...@googlegroups.com

BTW, (No DIF == no integrity) isn't true. (Why I'd recommend ZFS for such an undertaking).

Subject: [ Cloud Computing ] Re: Build your own storage - blog from BackBlaze
Date: Thu, 3 Sep 2009 13:17:27 -0500
From: Robert...@xiotech.com
To: cloud-c...@googlegroups.com

Peglar, Robert

Sep 3, 2009, 3:57:02 PM

to cloud-c...@googlegroups.com

Correction to the figures in my earlier message: 1.25 * 10^13 bytes is 12.5 TB, not 125 TB. This means there are 80 unrecoverable errors per petabyte, not 8. Sorry for any confusion. It actually makes the problem worse…
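
The corrected figures follow directly from the drive's quoted rate of one unrecoverable read error per 10^14 bits. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the function names are illustrative only:

```python
BITS_PER_ERROR = 10**14   # one unrecoverable read error per 10^14 bits read
TB = 10**12
PB = 10**15


def bytes_between_errors(bits_per_error: int) -> int:
    """Average number of bytes read per unrecoverable error."""
    return bits_per_error // 8


def expected_errors(bytes_read: int, bits_per_error: int) -> float:
    """Expected unrecoverable errors when reading the given number of bytes."""
    return bytes_read * 8 / bits_per_error


print(bytes_between_errors(BITS_PER_ERROR) / TB, "TB read per error")
print(expected_errors(PB, BITS_PER_ERROR), "expected errors per PB read")
```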

Rob

Peglar, Robert

Sep 3, 2009, 4:11:50 PM

to cloud-c...@googlegroups.com

It is absolutely true. ZFS can’t and doesn’t know about addressing and LUN/ID errors at the disk level. It can check the integrity of the data (payload), as do some other filesystems, but that’s only one third of the problem. Now, if ZFS appended the T10 DIF field to each 512-byte sector at the initiator, your answer would be correct, but ZFS does no such thing. It does scrubbing and “resilvering”, but that is not DIF. The former merely checks to see if blocks _can_ be read, and the latter makes an educated guess, on a block mismatch, as to which copy is right and attempts to rewrite it.

But again, that is just the data, not the addressing. DIF prevents such errors.
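
To make the distinction concrete, here is a minimal sketch of a DIF-style protection field: 8 bytes appended to each 512-byte sector, holding a CRC guard tag over the payload, an application tag, and a reference tag carrying the low 32 bits of the target block address. The `protect` and `verify` helpers are hypothetical names, and this illustrates the idea only, not any particular HBA or array implementation.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """CRC-16 using the T10-DIF polynomial 0x8BB7 (initial value 0, unreflected)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard tag, application tag and reference tag to a 512-byte sector."""
    if len(sector) != 512:
        raise ValueError("sector must be exactly 512 bytes")
    field = struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)
    return sector + field


def verify(block: bytes, expected_lba: int, expected_app_tag: int = 0) -> str:
    """Check a 520-byte block against the address the reader asked for."""
    sector, field = block[:512], block[512:]
    guard, app_tag, ref_tag = struct.unpack(">HHI", field)
    if guard != crc16_t10dif(sector):
        return "payload corrupted"
    if ref_tag != expected_lba & 0xFFFFFFFF:
        return "wrong address"
    if app_tag != expected_app_tag:
        return "wrong application tag"
    return "ok"


# A write intended for LBA 100 lands on LBA 200 (a misdirected write).
block = protect(b"\x42" * 512, lba=100)
print(verify(block, expected_lba=100))  # read from where it should have gone
print(verify(block, expected_lba=200))  # read from where it actually landed
```

In the misdirected case the guard tag still matches, because the payload itself is intact; only the reference tag reveals that the block is not the one that was requested.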

Jayarama Shenoy

Sep 3, 2009, 8:30:52 PM

to cloud-c...@googlegroups.com

If you know for certain your data is corrupted, you have the opportunity to recover that particular file from your replica/s. (And replica would be normal in cloud storage, as you point out). Knowing which block the error occurred in and if there was an addressing error in your block storage becomes a bit moot if you can recover your data.

So while a file system that also checked block level integrity on top of its own checksum would be better (more usually is), ZFS in this application would be good enough.
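
The detect-then-recover argument can be sketched in a few lines. This assumes a checksum for each object is stored separately from the data it describes, and that replicas are simply tried in order until one verifies; SHA-256 and the `read_verified` helper are illustrative choices, not a description of what ZFS or Backblaze actually do.

```python
import hashlib
from typing import Iterable, Optional


def checksum(data: bytes) -> str:
    """Checksum kept apart from the data, so a bad block cannot vouch for itself."""
    return hashlib.sha256(data).hexdigest()


def read_verified(replicas: Iterable[bytes], expected: str) -> Optional[bytes]:
    """Return the first replica whose checksum matches, or None if all are bad."""
    for copy in replicas:
        if checksum(copy) == expected:
            return copy
    return None


original = b"customer backup chunk"
stored_sum = checksum(original)
damaged = b"customer backup chunc"   # silent corruption on one pod

print(read_verified([damaged, original], stored_sum) == original)
print(read_verified([damaged], stored_sum))
```

The scheme only helps if the corruption is detected in the first place, which is exactly the point under dispute in this exchange.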

DIF itself might not be fail-proof either (phantom writes, for example; the ZFS blog has others). My point is that a lack of T10 DIF does not in itself damn BackBlaze's solution for data integrity. (And, in theory, ZFS could've cost them just as much as whatever they chose.)

My original question - since it is now 3 emails deep - is: what price will this market bear to improve on the home-brewed stuff? One company voted that 8X was not the answer; I'm still curious to know what they think it might be. (I guess there are no Backblaze people on here, sigh.)

Subject: [ Cloud Computing ] Re: Build your own storage - blog from BackBlaze

Date: Thu, 3 Sep 2009 15:11:50 -0500

Peglar, Robert

Sep 4, 2009, 6:42:53 AM

to cloud-c...@googlegroups.com

>>> If you know for certain your data is corrupted

And therein lies the rub.

Lack of DIF at petabyte-scale does indeed damn it for integrity. You are correct, there are even more checks beyond T10 DIF that could be done, but in terms of the actual SCSI CDB, DIF covers the 3 inputs that the initiator supplies: payload, address/length (and its subsequent V->P translation), and LUN.

As for market pricing, what value do you place on your data? It’s really an exercise in risk management. Look at it this way; how much would it cost you to recreate/restore that petabyte? What is the time value of not having access to that petabyte? If a petabyte of data is only worth $117,000 to you, that’s one thing. For most companies, though, their petabyte is worth far, far more than that.

Then, there is the issue of storage performance – which is an entirely different (but related) topic. Cost of space is one thing, cost of time is very much another. There is absolutely no mention of performance in the post.

If they really wanted massive petabytes on the cheap, without regard to latency or performance, they should use the technology which draws no power when not doing I/O and can be purchased for under $30,000/raw PB as opposed to the $80,000/raw PB they spend now. The power savings alone would be significant.

Jacob Farmer

Sep 4, 2009, 10:38:53 AM

to cloud-c...@googlegroups.com

This is a fascinating thread. The folks at Backblaze deserve a lot of credit for assembling incredibly cheap storage that probably works really well -- relative to the cost. At the same time, they are breaking a lot of rules of storage system design -- rules that have been learned the hard way over many, many years. I wish these guys the best of luck and I hope to see if their contraption truly holds water in the long run.

My question -- Does anyone out there have a reference design (or better yet a product) that can achieve real economies of scale and tell a good story for hosting Petabytes of general purpose data? Even better, does anyone have a good story for backing up such a beast?

I've seen some interesting "object" storage models that manage large collections of files, but does anyone have a general purpose file system that can scale to the moon and still be affordable and resilient? If not, does anyone want to speculate as to what such a file system might look like?

-> Jacob

Jeff Darcy

Sep 4, 2009, 12:05:20 PM

to cloud-c...@googlegroups.com

On 09/04/2009 10:38 AM, Jacob Farmer wrote:
> I've seen some interesting "object" storage models that manage large collections of files, but does anyone have a general purpose file system that can scale to the moon and still be affordable and resilient? If not, does anyone want to speculate as to what such a file system might look like?

It sort of depends on what kind of scaling you really want. There are a
couple of dozen parallel filesystems - e.g. Lustre, PVFS2, GlusterFS,
Ceph - that can scale very high in terms of capacity and performance.
There are fewer that can scale across distance, primarily due to the
difficulty of maintaining consistency across high-latency
low-reliability links. That brings up the question of whether most
people really need a true "general purpose" filesystem - which is
actually rather "special purpose" in terms of the very detailed
semantics that such a thing must preserve. Many people don't need
strong or fine-grained consistency. Many people don't need a
hierarchical namespace with full support for hard and soft links, atomic
rename, read/write to removed files, etc. Many people don't need full
VFS-layer integration, though it's easily had via mechanisms like FUSE
and allows applications to run without using special libraries so long
as they remain within the "semantic envelope" defined by whatever
non-filesystem thing is at the other end (e.g. Amazon S3 for s3fs). The
more of these features one is willing to give up, the more options one
gains wrt distributing stuff on a global scale. You can start with a
traditional filesystem and water down its semantics to make it perform
reasonably at global scale, or you can start with a distributed
key/value store and enrich its semantics until it's "enough like" a
traditional filesystem for your own purposes. It's all a matter of what
tradeoffs make sense for a particular application or user.
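
As a rough illustration of the second approach (starting from a key/value store and enriching its semantics), the sketch below layers directory listing and rename on top of a flat store. `FlatStore` is an in-memory stand-in for a distributed key/value service, and both class names are hypothetical.

```python
class FlatStore:
    """Stand-in for a distributed key/value store: flat keys, whole-object access."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        del self._objects[key]

    def keys(self):
        return list(self._objects)


class PathView:
    """Adds filesystem-like operations by treating '/' in keys as a separator."""

    def __init__(self, store):
        self.store = store

    def listdir(self, directory):
        # Emulate a directory listing with a prefix scan over all keys.
        prefix = directory.rstrip("/") + "/"
        names = set()
        for key in self.store.keys():
            if key.startswith(prefix):
                names.add(key[len(prefix):].split("/", 1)[0])
        return sorted(names)

    def rename(self, old, new):
        # Copy then delete: NOT atomic. A crash in between leaves both keys,
        # which is one of the semantics a true filesystem would have to preserve.
        self.store.put(new, self.store.get(old))
        self.store.delete(old)


store = FlatStore()
fs = PathView(store)
store.put("/photos/2009/a.jpg", b"a")
store.put("/photos/b.jpg", b"b")
fs.rename("/photos/b.jpg", "/photos/c.jpg")
print(fs.listdir("/photos"))
```

Each feature added this way (atomic rename, hard links, strong consistency) needs coordination that a flat store does not provide for free, which is where the tradeoffs described above come from.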

Jayarama Shenoy

Sep 4, 2009, 1:33:01 PM

to cloud-c...@googlegroups.com

Hi

Thanks for re-setting. From the Tier1 vendors, I thought Sun's Thumper was the closest box product (and it seemed to have had its share of teething troubles initially - around the mechanical design aspects of it).

For the file system, wouldn't Hadoop/GFS be the thing you described? (MogileFS and its derivatives also deserve a mention.) These are not strictly POSIX-compliant file systems, but they appear to get the job done (and interfacing to them in your applications does not appear to be that hard).

I suspect (from Backblaze's description) that they may be using a roll-their-own 'file system', and my guess is that it looks something like MogileFS.

Jay

ps I also make my living in the 'enterprise storage world', but am inclined to believe that the Backblaze box will hold together far better than people give it credit for. And the robustness that the box itself may lack might be made up above this layer.

I would've done this box just a bit differently (correcting some of the potential shortcomings without spending too much money). My original question was to determine if there's enough money in it to make it worth my time to think more about this problem.

Subject: [ Cloud Computing ] If not Backblaze, than what? --- changing the discussion a little
Date: Fri, 4 Sep 2009 10:38:53 -0400
From: jfa...@cambridgecomputer.com
To: cloud-c...@googlegroups.com
