Build your own storage - blog from BackBlaze


Jayarama Shenoy

Sep 3, 2009, 11:35:49 AM
to cloud-c...@googlegroups.com
Very interesting post on making Petabytes on a budget for cloud storage.

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/#more-150

A question to those who'd think of doing something similar themselves: at what price point would there be an opportunity for a vendor to add value?

Backblaze's pod is not too different from Sun's Thumper (aka the X4500). Some plumbing details are different, the CPU and memory configs are different, and so is Linux/JFS vs. Solaris/ZFS. You can argue that in each case Sun's options are superior, but that and support/warranty were not enough to offset the $8K vs. $50K.

So what price point would've made sense? Anyone from Backblaze on this list?





Peglar, Robert

Sep 3, 2009, 2:17:27 PM
to cloud-c...@googlegroups.com

Petabytes, sure.  Petabytes of integral data, nope.  No DIF == no integrity. 

 

From the article:  “…When you strip away the marketing terms and fancy logos from any storage solution, data ends up on a hard drive.”

 

Can’t argue with that.  But what data?  The data that you think you wrote, at the address you think you wrote it to, at the LUN or ID you think you sent it to?  How do you know?

 

But let’s look a little deeper here.  They use two power supplies, 760W rated each.  It looks like the unit must have both supplies up to function, as they have subdivided duties.  Lose one supply, half your drives disappear, and if you lose half the drives in the scheme they use (see below) your box is dead.  Not very good if you have a halfway decent SLA.  But perhaps they are assuming cloud vendors will not have SLAs with teeth – which is certainly the case today.

 

Then there are disk rebuilds and sparing.  No hot spares configured, or at least described.  Looks like they do three sets of double-parity 13+2, so all drives are holding necessary information.  No mention of how one spares out a drive, so you have to assume entire-unit downtime to replace a drive and allow a rebuild to begin.  Speaking of which, how long does that take?  Rebuilding a 1.5TB drive – and the SiL chipset they use cannot recognize white space, so they have to rebuild the entire drive – takes tens of hours, easily.  Have you ever rebuilt a drive from a 13+2 set?  It’s ugly.  I have to read 19.5 TB of data to rebuild a single 1.5TB drive.  How long does it take to read 19.5 TB of data?  Even if the box could write data at max speed to the spare, which is 120MB/sec or 432 GB/hour, the write alone would take about 4 hours.  But that doesn’t count the read time – and it takes 45 hours to read 19.5 TB of data @ 432 GB/hr – never mind the compute time needed to recalculate double parity on that entire set.  That’s the problem.
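For anyone who wants to check that arithmetic, here's a quick back-of-the-envelope sketch in Python using the same figures (13 data drives of 1.5 TB to read and ~432 GB/hour sustained throughput are the assumptions from the paragraph above, not measurements):

# Back-of-the-envelope rebuild math for a 13+2 double-parity set of 1.5 TB drives,
# using the ~120 MB/s (~432 GB/hour) sustained-throughput figure quoted above.
GB_PER_HOUR = 432.0                  # ~120 MB/s sustained
read_gb  = 13 * 1.5 * 1000           # 19.5 TB that must be read from the surviving set
write_gb = 1.5 * 1000                # the replacement drive itself

print(f"read : {read_gb / GB_PER_HOUR:.0f} hours")    # ~45 hours
print(f"write: {write_gb / GB_PER_HOUR:.1f} hours")   # ~3.5 hours
# ...and that still ignores parity recomputation and any foreground I/O load.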

 

Then there are the mechanicals.  Quite literally, they use rubber bands to ‘dampen vibration’.  Now, that’s certainly novel, but I wouldn’t trust my data to that.  What’s next, duct tape?  I bet this box goes 30 rads easily on the vibration test.  They also mount all the drives in the same orientation, a sure-fire way to induce torque and stress.  I would also not want to be a drive in the rearmost group of 15, given their depiction of fans – and they use fans, not blowers.

 

On average, they will lose 2 drives every year (4 per 100 per annum) and by year 5, it will probably rise to 6-8 per year, given their use of consumer-level drives.  So, over the course of 5 years, you may have to replace roughly half the drive population.  Did they mention that in their cost analysis?

 

Seagate itself says this drive is not well suited to array-based storage.  Look at their website for this drive and see the recommended applications – PCs, desktop (not array) RAID, and personal external storage.  This is because of the duty cycle these drives are built to.  Is this array powered on and active all the time?  Verbatim from the drive manual:

 

Normal I/O duty cycle for desktop personal computers. Operation at excessive I/O duty cycle may degrade product reliability.   

 

The paper mentions expense.  Sure, I am worried about expense too.  I am worried about the biggest expense of all – humans.  I need a server admin to manage and babysit this machine.  How many of these can one admin manage?  The answer is far fewer than I would expect for a cloud storage array.  Management is completely by hand.  Compare that to storage arrays, which are intelligent and need no management beyond initial provisioning.

 

Just replicate data to protect it, right?  To what degree of replication?  Is 2X enough?  I know several cloud providers that replicate 3X using arrays that are far more reliable than this, not made of nylon shims and rubber bands.  So, now that petabyte becomes $351,000, and I also need triple the humans to manage triple the units.  Plus, that is not counting any drive replacement cost over time, and for 3 units (135 drives) I will probably have to replace 60-70 drives over 5 years for another 9 grand or so.  This also assumes there is no cost for a human to replace a drive, an ‘extraordinary event’. 
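To make the replication math explicit, here is a small sketch using the figures from this thread ($117K/PB is the build cost quoted from the Backblaze post; the replacement-drive count and per-drive price are rough guesses, not quotes):

# Rough 3x-replication cost sketch using the figures in this thread.
build_cost_per_pb = 117_000          # Backblaze's quoted build cost per petabyte
replication = 3                      # replicate everything 3x
replaced_drives = 65                 # ~60-70 consumer drives replaced over 5 years
drive_price = 140                    # rough 2009 street price for a 1.5 TB drive (guess)

hardware = build_cost_per_pb * replication          # $351,000
spares   = replaced_drives * drive_price            # roughly $9,000
print(hardware, spares, hardware + spares)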

 

But let’s be very generous and say they can build and operate/maintain this box for 5 years for $500K.  That’s certainly cheaper than commercial arrays.  But again, for me, the biggest issue is integrity.  At petabyte scale, it’s a huge problem for any ATA technology.  Just ask the guys at CERN; even with FC drives that didn’t do DIF, it was a huge problem.

 

Just the UCR errors alone at petabyte scale with this box are scary.  This drive is only 1 in 10^14.  1*10^14 bits is 1.25*10^13 bytes, or 125 TB.  This means if you spin and read a petabyte of this stuff, you get 8 errors which are uncorrectable.  Murphy says those 8 errors are in the data you really needed. 

 

Good news – cheap storage.  Bad news – cheap storage.  This thing is a nice home-built whitebox.  It would do well to stay there.  You wouldn’t run a serious cloud compute infrastructure on dirt-cheap whitebox servers; likewise, you shouldn’t run a serious cloud storage infrastructure on dirt-cheap whitebox storage.

 

Rob


Jayarama Shenoy

Sep 3, 2009, 3:49:18 PM
to cloud-c...@googlegroups.com
BTW, (No DIF == no integrity) isn't true. (Which is why I'd recommend ZFS for such an undertaking.)



Peglar, Robert

Sep 3, 2009, 3:57:02 PM
to cloud-c...@googlegroups.com

The figures in my earlier post actually should be: 1.25 * 10^13 bytes is 12.5 TB, not 125 TB.  This means there are 80 unrecoverable errors per petabyte, not 8.  Sorry for any confusion.  It actually makes the problem worse…
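For anyone re-checking the corrected numbers, a tiny sketch of the arithmetic (assuming 10^15 bytes per petabyte and the 1-per-10^14-bits nonrecoverable read error spec):

# Expected unrecoverable read errors when reading a full petabyte
# from drives spec'd at 1 nonrecoverable error per 10^14 bits read.
BITS_PER_ERROR = 1e14
tb_per_error  = BITS_PER_ERROR / 8 / 1e12      # 12.5 TB read per expected error
errors_per_pb = (1e15 * 8) / BITS_PER_ERROR    # ~80 expected errors per PB read
print(tb_per_error, errors_per_pb)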

 

Rob

Peglar, Robert

Sep 3, 2009, 4:11:50 PM
to cloud-c...@googlegroups.com

It is absolutely true.  ZFS can’t and doesn’t know about addressing and LUN/ID errors at the disk level.  It can check the integrity of the data (payload), as do some other filesystems, but that’s only 1/3rd of the problem.  Now, if ZFS appended the T10 DIF field to each 512-byte sector at the initiator, your answer would be correct, but ZFS does no such thing.  It does scrubbing and “resilvering”, but that is not DIF.  The former merely checks whether blocks _can_ be read; the latter, on a block mismatch, makes an educated guess as to which copy is right and attempts to rewrite it.

 

But again, that is just the data, not the addressing.  DIF prevents such errors.
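For readers who haven't seen it, here is a rough sketch of what the 8-byte T10 DIF tuple carries for each 512-byte sector under Type 1 protection - a CRC-16 guard over the payload, an application tag, and a reference tag holding the low 32 bits of the intended LBA. This is an illustration of the format, not a driver implementation:

import struct

def crc16_t10dif(data: bytes) -> int:
    """CRC-16 guard tag (polynomial 0x8BB7, as used by T10 DIF)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def dif_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """8 bytes appended to each 512-byte sector: guard, app tag, reference tag."""
    assert len(sector) == 512
    guard = crc16_t10dif(sector)
    ref_tag = lba & 0xFFFFFFFF        # Type 1: low 32 bits of the intended LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)

sector = bytes(512)
print(dif_tuple(sector, lba=123456).hex())
# Because the guard, app tag, and reference tag travel with the data, the target
# (and anything in between) can recheck them and reject a command whose payload
# or address no longer matches what the initiator intended.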

Jayarama Shenoy

Sep 3, 2009, 8:30:52 PM
to cloud-c...@googlegroups.com
If you know for certain your data is corrupted, you have the opportunity to recover that particular file from your replica(s). (And replicas would be normal in cloud storage, as you point out.) Knowing which block the error occurred in, and whether there was an addressing error in your block storage, becomes a bit moot if you can recover your data.

So while a file system that also checked block-level integrity on top of its own checksum would be better (more checking usually is), ZFS in this application would be good enough.

DIF itself might not be foolproof either (phantom writes, for one; you can read the ZFS blog for other examples), and so on and so forth. My point is that a lack of T10 DIF does not in itself damn Backblaze's solution for data integrity. (And, in theory, ZFS would have cost them no more than whatever they chose.)

My original question - since it is now 3 emails deep - is: what price will this market bear to improve on the home-brewed stuff? One company voted that 8X was not the answer; still curious to know what they think it might be. (I guess there are no Backblaze people on here, sigh.)








Peglar, Robert

Sep 4, 2009, 6:42:53 AM
to cloud-c...@googlegroups.com

>>> If you know for certain your data is corrupted

 

And therein lies the rub.

 

Lack of DIF at petabyte-scale does indeed damn it for integrity.  You are correct, there are even more checks beyond T10 DIF that could be done, but in terms of the actual SCSI CDB, DIF covers the 3 inputs that the initiator supplies: payload, address/length (and its subsequent V->P translation), and LUN.

 

As for market pricing, what value do you place on your data?  It’s really an exercise in risk management.  Look at it this way; how much would it cost you to recreate/restore that petabyte?  What is the time value of not having access to that petabyte?  If a petabyte of data is only worth $117,000 to you, that’s one thing.  For most companies, though, their petabyte is worth far, far more than that.

 

Then, there is the issue of storage performance – which is an entirely different (but related) topic.  Cost of space is one thing, cost of time is very much another.  There is absolutely no mention of performance in the post.

 

If they really wanted massive petabytes on the cheap, without regard to latency or performance, they should use the technology which draws no power when not doing I/O and can be purchased for under $30,000/raw PB as opposed to the $80,000/raw PB they spend now.  The power savings alone would be significant.


Jacob Farmer

Sep 4, 2009, 10:38:53 AM
to cloud-c...@googlegroups.com
This is a fascinating thread. The folks at Backblaze deserve a lot of credit for assembling incredibly cheap storage that probably works really well -- relative to the cost. At the same time, they are breaking a lot of rules of storage system design -- rules that have been learned the hard way over many, many years. I wish these guys the best of luck and I hope to see if their contraption truly holds water in the long run.

My question -- Does anyone out there have a reference design (or better yet a product) that can achieve real economies of scale and tell a good story for hosting Petabytes of general purpose data? Even better, does anyone have a good story for backing up such a beast?

I've seen some interesting "object" storage models that manage large collections of files, but does anyone have a general purpose file system that can scale to the moon and still be affordable and resilient? If not, does anyone want to speculate as to what such a file system might look like?

-> Jacob




Jeff Darcy

Sep 4, 2009, 12:05:20 PM
to cloud-c...@googlegroups.com
On 09/04/2009 10:38 AM, Jacob Farmer wrote:
> I've seen some interesting "object" storage models that manage large collections of files, but does anyone have a general purpose file system that can scale to the moon and still be affordable and resilient? If not, does anyone want to speculate as to what such a file system might look like?

It sort of depends on what kind of scaling you really want. There are a
couple of dozen parallel filesystems - e.g. Lustre, PVFS2, GlusterFS,
Ceph - that can scale very high in terms of capacity and performance.
There are fewer that can scale across distance, primarily due to the
difficulty of maintaining consistency across high-latency
low-reliability links. That brings up the question of whether most
people really need a true "general purpose" filesystem - which is
actually rather "special purpose" in terms of the very detailed
semantics that such a thing must preserve. Many people don't need
strong or fine-grained consistency. Many people don't need a
hierarchical namespace with full support for hard and soft links, atomic
rename, read/write to removed files, etc. Many people don't need full
VFS-layer integration, though it's easily had via mechanisms like FUSE
and allows applications to run without using special libraries so long
as they remain within the "semantic envelope" defined by whatever
non-filesystem thing is at the other end (e.g. Amazon S3 for s3fs). The
more of these features one is willing to give up, the more options one
gains wrt distributing stuff on a global scale. You can start with a
traditional filesystem and water down its semantics to make it perform
reasonably at global scale, or you can start with a distributed
key/value store and enrich its semantics until it's "enough like" a
traditional filesystem for your own purposes. It's all a matter of what
tradeoffs make sense for a particular application or user.
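As a toy illustration of the "start with a key/value store and enrich it" end of that spectrum, here is a minimal sketch in which a plain dict stands in for a distributed key/value store and only a thin slice of filesystem semantics is layered on top (no links, no atomic rename, no hierarchy beyond key prefixes - all names here are made up for illustration):

class KVFiles:
    """Minimal file-ish API over a key/value store (here just a dict)."""
    def __init__(self):
        self.kv = {}                      # stand-in for a distributed KV store

    def write_file(self, path: str, data: bytes) -> None:
        self.kv[path] = data              # whole-object put; no partial writes

    def read_file(self, path: str) -> bytes:
        return self.kv[path]

    def listdir(self, prefix: str):
        # A "directory" is just a key prefix; there is no real hierarchy.
        prefix = prefix.rstrip("/") + "/"
        return sorted(k for k in self.kv if k.startswith(prefix))

fs = KVFiles()
fs.write_file("/photos/cat.jpg", b"...")
print(fs.listdir("/photos"))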

Jayarama Shenoy

Sep 4, 2009, 1:33:01 PM
to cloud-c...@googlegroups.com
Hi

Thanks for resetting the discussion. From the Tier 1 vendors, I thought Sun's Thumper was the closest boxed product (and it seemed to have had its share of teething troubles initially - around the mechanical design aspects of it).

For the file system, wouldn't Hadoop/GFS be the thing you described? (Also need to mention MogileFS and its derivatives.) These are not strictly POSIX-compliant file systems, but they appear to get the job done (and interfacing to them from your applications does not appear to be that hard).

I suspect (from Backblaze's description) that they may be using a roll-their-own 'file system', and would point to MogileFS as a guess at what it looks like.

Jay

ps I also make my living in the 'enterprise storage world', but am inclined to believe that the Backblaze box will hold together far better than people give it credit for. And the robustness that the box itself may lack might be made up for above this layer.

I would've done this box just a bit differently (and corrected some of the potential shortcomings without the expenditure of too much money). My original question was to determine if there's enough money in it to make it worth my time to think more about this problem.



Warren Davidson

Sep 4, 2009, 4:42:31 PM
to cloud-c...@googlegroups.com
Jacob, my company has a distributed database that can handle petabytes easily enough. It's not really a file system but a full database, so perhaps not what you're asking for (I'm not technical, so I'll try not to overly expose my lack thereof). Here is a link to some info, with instructions on how you can play with it in EC2 if you're interested:
http://www.objectivity.com/cloud-computing/default.asp

Warren




Jayarama Shenoy

Sep 4, 2009, 7:28:46 PM
to cloud-c...@googlegroups.com
Let me try this one more (and last) time. To know *if* your data went bad, it is sufficient to verify the payload. This can be done by a file system, especially if the checksum is stored separately from the file.

To know *where* your data went bad (and perhaps get a clue as to why), things like DIF are needed. If you are in a structured data environment, there isn't a file system, and DIF (or equivalent) is a must for integrity. Not the case in this instance.
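A minimal sketch of that first point - payload verification with the checksum kept in a separate index rather than next to the data (the names and hash choice are illustrative only):

import hashlib

def store(index: dict, name: str, data: bytes, blobs: dict) -> None:
    blobs[name] = data                                   # the payload itself
    index[name] = hashlib.sha256(data).hexdigest()       # checksum kept elsewhere

def verify(index: dict, name: str, blobs: dict) -> bool:
    # Detects *that* the payload went bad; says nothing about where or why.
    return hashlib.sha256(blobs[name]).hexdigest() == index[name]

blobs, index = {}, {}
store(index, "backup-0001", b"customer data", blobs)
blobs["backup-0001"] = b"customer dat\x00"               # simulate silent corruption
print(verify(index, "backup-0001", blobs))               # False -> restore from a replica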

Regards
Jay

p.s. We don't know Backblaze's overall architecture, but it's a fair bet that they've baked in service reliability above this quasi-experimental box. Think carefully about their application: this box, while not a perfect storage system by a long shot, has limitations that are not terribly exposed by it. Last post, as I said.




Peglar, Robert

Sep 7, 2009, 8:11:10 PM
to cloud-c...@googlegroups.com

One last time as well.  DIF is still essential regardless of whether the filesystem itself does payload checking.  The problem is not that the given filesystem will have bad integrity for its own data, but that it can munge _someone else’s_ data.

 

If the FS, say, orders a write to LUN 5 but it ends up on LUN 6, without DIF that write potentially destroyed someone else’s data, and your filesystem has no idea what just happened or who it happened to.  Same thing can happen if the error is addressing (V->P) instead of LUN.  Silent corruption of other people’s data.

 

The poor soul who comes along later expecting integral data on LUN 6 is in for, as they say, a world of hurt.  Sure, your filesystem can detect and fix its own, but how about the other guy?
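To make that concrete, the check that catches a misdirected write is trivial once protection information travels with the data. A sketch of the address portion only, assuming a Type 1 reference tag carrying the low 32 bits of the intended LBA (the LUN aspect is not modeled here):

def check_ref_tag(dif_ref_tag: int, lba_being_written: int) -> None:
    """Target-side check: the reference tag must match the LBA actually being written."""
    if dif_ref_tag != (lba_being_written & 0xFFFFFFFF):
        raise IOError("misdirected write detected; rejecting instead of overwriting")

try:
    check_ref_tag(0x1000, 0x2000)   # initiator meant LBA 0x1000; command landed at 0x2000
except IOError as e:
    print(e)                        # the write is refused before anyone's data is clobbered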

 

Thanks and this is the last post on this.  Onward.


Jayarama Shenoy

Sep 8, 2009, 12:57:17 PM
to cloud-c...@googlegroups.com
Actually, I will renege on the 'last post' to acknowledge one aspect. If all of your storage is not under ZFS-like checksum protection, there could be the corruption mode that Rob points out (mis-addressed writes clobbering a few LBAs on some other LUN).

Technically you could serve structured data over a file system (and many unified systems do, including from respected enterprise vendors), and that simplifies snapshots etc. - but many people (including me) are not particular fans of SAN data delivered over a file server (think latency).

So it would be commonly possible to have situations where structured and unstructured data co-exist, and this failure mode is important.

Jay






Zero0n3

Sep 26, 2009, 7:10:33 PM
to Cloud Computing
First, I am by no means someone who knows anything about data backup
or storage systems.

With that said, what if their app verifies the data during any
operation?

So poor soul B, whose data just got written over without any acknowledgment of it happening, now wants to grab his stored data.

The application requests the data for poor soul B. The application grabs it from node 1 and node 3 and checks to see if it is the same... it now isn't, as node 3 was the one that was silently written over.

So now it needs to decide _what_ to do. It goes out to node 8 and checks to see whether node 1 or node 3 matches. (Let's assume nodes 1 and 3 were initially chosen because of latency.)

So it quickly checks node 8 == 1? and 8 == 3? It sees that nodes 8 and 1 are a match, so it uses node 1 to retrieve said data.

The application also now knows that something is wrong with node 3, and logs any and all data it finds out about it.
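What's being described here is essentially a majority-vote read. A toy sketch, with dicts standing in for storage nodes and all names invented for illustration:

def read_with_vote(primary, secondary, tiebreaker, key, log):
    """Read from two replicas; on mismatch, ask a third and trust the majority."""
    a, b = primary.get(key), secondary.get(key)
    if a == b:
        return a
    c = tiebreaker.get(key)
    if a == c:
        log.append(f"{key}: secondary replica disagrees; flag it for repair")
        return a
    if b == c:
        log.append(f"{key}: primary replica disagrees; flag it for repair")
        return b
    raise RuntimeError(f"{key}: no two replicas agree")

node1 = {"poorsoul-b": b"good data"}
node3 = {"poorsoul-b": b"silently overwritten"}
node8 = {"poorsoul-b": b"good data"}
log = []
print(read_with_vote(node1, node3, node8, "poorsoul-b", log), log)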



What I am simply saying is that without seeing ANY of their code,
there is no way we can tell how secure the data is, how much error
checking they have, what type of hardware failure issues they have,
etc.

Hell, the ONLY thing they really told us was how to build the storage
node. They didn't mention if they had compute nodes, or checksum
nodes, etc etc.

You are making assumptions based on one piece of a complicated puzzle.



Google did the same thing with GoogleFS how many years ago? They designed an entire system on the premise that they were working with commodity hardware. This company is no different, just doing it with a home-brew, commodity-built storage node.

I don't mean to be rude, but looking at their team ( https://www.backblaze.com/team.html ) I have a feeling they have thought of and come up with solutions to everything that has been discussed in this thread.

Peglar, Robert

Sep 27, 2009, 8:23:56 AM
to cloud-c...@googlegroups.com
Sure, with sufficient space and time, one can avoid or work around most any inherent problem. What you describe requires 3X the storage and <insert large # here>X the time, relative to an I/O without this type of verification. It is also a workaround; the problem still exists (silent corruption). DIF prevents it from occurring. The workaround you describe merely detects it and works around it at a higher layer.

It's funny; on one hand, many cry out for efficiency, and on the other hand, design solutions which are far less than optimal in order to justify the use of certain piece parts. Several decades ago, this was called "Rube Goldberg" design.

T10 DIF is not only a good idea, but a standard. It exists for a reason. It takes about 50 lines of code on both initiator and target to implement. For me, it makes sense. If it doesn't to you, so be it.

Rob

Jayarama Shenoy

Sep 27, 2009, 3:45:59 PM
to cloud-c...@googlegroups.com
Indeed, it is amusing.

On one hand, GFS & its various knock-offs contain appalling practices when you think in terms of efficiency (power, space, what have you). And then they find that their data centers burn too much power. Duh, what did you expect?

I realize that I'm talking of a 5+ year old architecture and think (or hope) that saner minds are now prevailing.

Backblaze in particular, from what they've described (very likely there's stuff they didn't say), has no choice but to take the same brute-force way out - except it is 2009 and not 1999.
