Big Numbers  DrQ  8/27/12 7:59 AM  Don't be too easily impressed by big numbers. Recently, I saw a tweet regarding the amount of CPU time devoted by Fermilab to crunching LHC data: (Fermilab) Progress in HEP Computing: recent LHC resource appetite is a mind-boggling 1.5 CPU-millennia every 3 days. Is it mind-boggling? This caused me to do some calculations. Let's compare with SETI@home.
> cpu.yrs <- 2e6  # since 1999
> # Per day...
> cpu.yrs / ((2012 - 1999) * 365)
[1] 421.4963

But it's not clear exactly when the endpoint of "since" was measured (e.g., 2009? 2012?). So, for simplicity, I'll just round it up to 500 CPU-yrs per day, or half a CPU-millennium per day.

> 500 * 3  # over 3 days
[1] 1500

That's 1.5 thousand CPU-yrs in 3 days, which is very similar to the Fermilab claim. And since those Fermilab cycles are also highly distributed, like SETI, the number is impressive, but not quite so impressive as it first appeared.
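For readers without R handy, the same arithmetic can be sanity-checked in Python; a minimal sketch using only the figures quoted in the post (the 2e6 CPU-years total and the 1999 start date come from the SETI@home claim above):

```python
# SETI@home: ~2e6 CPU-years accumulated since 1999 (figure from the post).
cpu_years = 2e6
years_elapsed = 2012 - 1999

# Average CPU-years consumed per calendar day.
per_day = cpu_years / (years_elapsed * 365)
print(per_day)        # ~421.5 CPU-years/day

# Round up to 500/day as the post does, then compare over 3 days.
rounded = 500
print(rounded * 3)    # 1500 CPU-years in 3 days
```

Which lands right on the Fermilab figure of 1.5 CPU-millennia per 3 days.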
Re: Big Numbers  SteveJ  8/27/12 3:40 PM  DrQ wrote on 28/08/12 12:59 AM: [500 cpu-yrs/day] "1.5 CPU-millennia every 3 days"

What's their daily failure rate?

500 CPU-years/day = 182,625 CPU-days/day == that many CPUs running flat out. At 2 CPUs/system == 100,000 systems, rounded up [@450W ea, 50 MWatt]. Though, do they fudge the numbers by counting cores, not CPUs? Divide by 2 or 4.

99.99% availability? = 0.01% down == 1 failure per 10,000 hours. Need MTTR to convert to a failure rate. Guess 1 hr MTTR: 100,000 systems / 10,000 = 10 fails/hr operated = 240 fails/day. More likely they get 5-10 times better reliability, but MTTR would be 1 day [i.e. a daily fix sweep]; at 1 in 100,000 unavailability [5 nines, 99.999%], they have 1 system fail/hr, or ~20-100 replaced systems/day.

They would also have ~2,000 48-port switches. Can't imagine them having an MTBF worse than 5-10M hrs (guess). They might have 2 switches/day fail.

They wouldn't have less than 2 disks/system, more likely 3-4. 200,000 drives * 24 hours/day = ~5M drive-hrs/day (5e6 hrs/day). Manufacturers like to quote 1M hrs MTBF, or ~100 years (roughly a 1%/yr failure rate). The 2007 paper by Google talks of failures as %/yr: they found 1.7% in the 1st year, rising to ~8.5% in the 3rd year. Guess 4% = 8,000 drives/yr = ~20 drives/day.

Disk errors? Even the Google paper didn't go there :) Manufacturers typically quote hard errors of 1 in 10^15 bits read for SATA-class drives. If drives are run hard, say 0.5 Gbit/s (5e8 bps), there are ~32M (3.2e7) seconds/yr, or 1.6e16 bits/yr/drive. So each drive would experience at least 12 Unrecoverable Read Errors/year; each system, 3-4 times that [with 3-4 drives]. With a fleet of 3-400,000 drives, you're looking at ~12,000 URRs/day. You'd want to be running RAID of some sort...

They have a busy bunch of beavers looking after their gear. Google is reported to have close to 1M systems. I can't imagine that scale.  Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915) PO Box 48, Kippax ACT 2615, AUSTRALIA stev...@gmail.com http://members.tip.net.au/~sjenkin 
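Steve's back-of-envelope chain can be reproduced mechanically; a minimal sketch in Python using only figures quoted in his post (the 450 W/system, 1 hr MTTR, 0.5 Gbit/s read rate, and 1-in-1e15-bit error rate are his guesses, not measured values):

```python
# 500 CPU-years consumed per day == that many CPUs running flat out.
cpus = 500 * 365.25                    # ~182,625 CPUs
systems = 100_000                      # 2 CPUs/system, rounded up from ~91,000
power_mw = systems * 450 / 1e6         # ~45 MW, the post's "50 MWatt" ballpark

# 99.99% availability with a guessed 1 hr MTTR:
# each system fails about once per 10,000 hours of operation.
fails_per_hr = systems / 10_000        # 10 failures/hr across the fleet
fails_per_day = fails_per_hr * 24      # 240/day

# Unrecoverable Read Errors at the quoted 1-in-1e15-bits rate,
# reading hard at a sustained 0.5 Gbit/s.
bits_per_yr = 5e8 * 3.2e7              # ~1.6e16 bits read per drive per year
ure_per_drive_yr = bits_per_yr / 1e15  # ~16 UREs per drive per year

print(round(cpus), fails_per_day, ure_per_drive_yr)
```

With 300,000-400,000 drives at 12+ UREs each per year, that works out to the ~12,000 URRs/day in the post.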
Re: Big Numbers  James  8/28/12 7:25 AM  I was wondering about that one myself. And not just cores, but logical cores with SMT/Hyper-Threading. AMD has chips that pack 16 cores per socket (Opteron 6200). James
Re: Big Numbers  DrQ  8/28/12 12:31 PM  In a (sideways) related note, I just came across this claim: Anyone wanna check that one out? 
Re: Big Numbers  SteveJ  8/28/12 5:29 PM  After I wrote this, I realised I have a
real gap in my knowledge of stats:
 for single or small numbers of machines, how do you calculate useful numbers from MTBF's? If I buy a new PC every 3 years, only ever owning one at time, and the disk drives have are rated at "1M hours MTBF", what's the probability in a lifetime of ownership (50 yrs) of having a failure? Is it just 1% every year and 50*1% for 50 years? Or, what proportion of single PC owners will experience a disk drive failure in 50 years of ownership? For small numbers, many small businesses have 45 servers with a few disks each, and tend to keep each server 45 years. What's the likelihood of having to replace a disk in a server? Comes to a prosaic purchase decision: steve jenkin wrote on 28/08/12 8:40 AM:

Re: Big Numbers  SteveJ  8/28/12 5:31 PM  Fat-fingered :(
steve jenkin wrote on 29/08/12 10:29 AM:
- do we purchase a spare disk (or two) along with the server, to put on the shelf as a replacement, or
- do we buy maintenance at 10-20% of purchase price?
Re: Big Numbers  Darryl Gove  8/28/12 6:42 PM  I think at this point you are moving beyond probability into risk assessment.
You can work out the probability of a disk failure (etc.), but you need to assign a cost to the various situations. For example, if your disk contains critical work, then it would be better to have some RAID system. Assuming the disk failure is not about data loss, purely convenience, then the decision is more about whether you want to have the downtime: the cost of the spare disk vs the cost of buying a disk when you need it vs the cost of just buying a new machine out of cycle. A similar set of arguments applies to the maintenance. In this case it's the cost of your time fixing the problem, plus the cost of the lost hours of productivity/downtime. Of course, you can then front-load it by asking whether it's better to buy a machine with redundancy in order to avoid downtime if a disk (etc.) goes out. D.  http://www.darrylgove.com/
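Darryl's point about costs can be made concrete with a toy expected-cost comparison; a minimal sketch where every price and rate is an illustrative assumption (not from the thread), using a 6%/yr per-drive failure chance over a 5-year server life:

```python
# All prices and rates below are hypothetical, for illustration only.
p_fail_yr = 0.06       # chance a given disk fails in a year
years = 5              # how long we keep the server
disk_cost = 100        # replacement disk bought only when needed
spare_cost = 80        # spare bought up front and shelved
maint_frac = 0.15      # maintenance at 15% of purchase price per year
server_cost = 2000

# Probability of at least one failure over the server's life (one disk).
p_any_fail = 1 - (1 - p_fail_yr) ** years

# Expected outlay under each strategy, ignoring downtime costs.
cost_buy_on_fail = p_any_fail * disk_cost        # pay only if it breaks
cost_spare_upfront = spare_cost                  # paid regardless
cost_maintenance = maint_frac * server_cost * years

print(round(p_any_fail, 3), round(cost_buy_on_fail, 2),
      cost_spare_upfront, cost_maintenance)
```

With these made-up numbers the maintenance contract dwarfs either disk strategy, which is exactly why the downtime and data-loss costs Darryl mentions, not the hardware prices, dominate the real decision.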
Re: Big Numbers  Darryl Gove  8/28/12 6:42 PM  So you have a 1/100 chance of a disk failure in one year. The age of the disks doesn't matter, so we can ignore the fact that you replace your machine every three years. The crucial step is that the chance of experiencing (at least) one failure is 1 minus the probability of experiencing none. The probability of experiencing no failures over 50 years is (99/100)^50. So the probability of experiencing at least one is 1 - (99/100)^50 = 0.39. Regards, Darryl.
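Darryl's arithmetic is easy to verify; a minimal sketch in Python, assuming (as he does) a constant 1%/yr rate and independence from year to year:

```python
# 1% chance of a disk failure per year, 50 years of owning one disk at a time.
p_fail_per_year = 0.01
years = 50

# P(no failure in any of the 50 independent years)
p_none = (1 - p_fail_per_year) ** years

# P(at least one failure) = 1 - P(none)
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 2))   # ~0.39
```

Note how far this is from the naive 50 * 1% = 50%: the two only agree when the per-year probability is small and the horizon short.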
Re: Big Numbers  SteveJ  8/28/12 7:09 PM  Darryl Gove wrote on 29/08/12 11:26 AM: "The probability of experiencing no failures over 50 years is (99/100)^50. So the probability of experiencing at least one is 1 - (99/100)^50 = 0.39."  Thanks very much. That was what I was missing... Pretty dumb, I know :(

Re: Big Numbers  SteveJ  8/28/12 9:17 PM 
Darryl, "The probability of experiencing no failures over 50 years is (99/100)^50. So the probability of experiencing at least one is 1 - (99/100)^50 = 0.39."

Thinking a little more on this. If I have a server, run it 4-5 years, and there's an averaged chance of drive failure of 6%/year, then over the 5-year life of a single drive:
- Prob. of no failure/yr = 1 - prob(failure/yr) = 1 - 0.06 = 0.94
- Prob. of no failure in 5 yrs = 0.94^5 = 0.734

Now if I have 3 drives, is the probability of no drives failing in a single year the product of the per-drive survival probabilities, or 1 minus the sum of the failure probabilities?
i.e. 0.94 * 0.94 * 0.94 = 0.831, or 1 - (0.06 + 0.06 + 0.06) = 1 - 0.18 = 0.82?

I'm guessing that, having worked out the yearly rate of "no fails this year" for a group of drives, the probability of getting through the entire 5 years with no failures is the product: i.e. p1 * p2 * p3 * p4 * p5. cheers steve

Re: Big Numbers  Darryl Gove  8/28/12 9:59 PM  On 28 August 2012 21:05, steve jenkin <stev...@gmail.com> wrote: Not the sum - you can only add the probabilities of mutually exclusive events. Yes, 0.94^3 = 0.831 is the probability of no drive failing during one year. No, 1 - (0.06 + 0.06 + 0.06) is not right (imagine that the probability of a drive failing is 0.5 :). And yes, multiply the yearly figures: 0.831^5 = 0.396. So the total probability is (prob of working for one year)^(#drives * #years). Darryl.
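Darryl's closed form can be checked numerically; a minimal sketch assuming independent drives and the constant 6%/yr failure chance from Steve's example:

```python
p_fail = 0.06
drives = 3
years = 5

p_survive_year = 1 - p_fail                     # one drive, one year: 0.94

# One year, all three drives survive: multiply, don't subtract a sum.
p_all_survive_year = p_survive_year ** drives   # ~0.831

# Whole 5-year life with no failures at all.
p_no_failures = p_all_survive_year ** years     # ~0.395

# Same thing in one step: (per-drive yearly survival)^(drives * years)
assert abs(p_no_failures - p_survive_year ** (drives * years)) < 1e-12
print(round(p_all_survive_year, 3), round(p_no_failures, 3))
```

(Darryl's 0.396 comes from exponentiating the rounded intermediate 0.831; the unrounded value 0.94^15 is 0.3953, the same to two figures.)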

Re: Big Numbers  SteveJ  8/28/12 11:44 PM  Darryl,
Thanks very much for explaining it so patiently to me, A Bear of Little Brain [reference to Pooh Bear] cheers steve Darryl Gove wrote on 29/08/12 2:53 PM: 
Re: Big Numbers  rml...@gmail.com  8/30/12 4:34 PM  Nice analyses by Steve and Darryl. I question the uniform distribution of failures. Because systems tend to have burn-in failures early in life and burnout/wear-out failures late in life, with stability in between, the uniform distribution seems suspect. Because this is a bathtub-shaped distribution, wouldn't an exponentiated Weibull distribution be a better model for the failures that are being discussed? If we assume that product life is divided into three parts (infant mortality, random failures, and wear-out), then three functions may express the probability of failure depending on where the system is in its lifespan. Some vendors burn in their systems prior to customer delivery to minimize customers experiencing infant mortality. I know that this moves us from performance to reliability. Bob 
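Bob's bathtub curve is easy to illustrate even with the ordinary two-parameter Weibull hazard h(t) = (k/s)(t/s)^(k-1): shape k < 1 gives a falling hazard (infant mortality), k = 1 a constant hazard (the random-failure regime the thread's 1%/yr arithmetic implicitly assumes), and k > 1 a rising hazard (wear-out). A minimal sketch; the shape and scale values are illustrative, not fitted to any drive data:

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull(shape, scale)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Three regimes of the bathtub, with made-up parameters (scale in years).
infant = lambda t: weibull_hazard(t, shape=0.5, scale=5.0)   # falling hazard
random_ = lambda t: weibull_hazard(t, shape=1.0, scale=5.0)  # flat (exponential)
wearout = lambda t: weibull_hazard(t, shape=3.0, scale=5.0)  # rising hazard

# Hazard falls, stays flat, or rises as t goes from year 1 to year 4.
print(infant(1.0) > infant(4.0))     # True
print(random_(1.0) == random_(4.0))  # True
print(wearout(1.0) < wearout(4.0))   # True
```

A full bathtub model stitches the three regimes together (or uses the exponentiated Weibull Bob mentions, whose extra parameter lets one curve cover all three phases).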
Re: Big Numbers  SteveJ  8/30/12 5:14 PM  rml...@gmail.com wrote on 31/08/12 9:34 AM:
> I question the uniform distribution of failures.

Bob, you're dead right :) A simplifying assumption.

There are two well-known papers on large-scale HDD failures published within the last 5 years, one by Google. [Others will know the refs.] HDD failure rate changes with age, use, temperature and power-cycles. It's also not usefully near the vendor-published figures from accelerated ageing. Nor, very surprisingly, does S.M.A.R.T. reporting give you much predictive power: IIRC, Google says more than 50% of failures are not predictable from those logs.

The wild card is "new technologies": how will they perform? We've entered the last factor-of-10 increase in HDD recording density (~2020), and three new recording techniques are yet to enter mainstream service:
- HAMR [Heat-Assisted Magnetic Recording: higher-coercivity media, heated by laser]
- BPM [Bit-Patterned Media: shaped bit areas]
- Shingled writes [better described as multi-track overlapped writes with no in-place update]

The replacement technologies, broadly called "Storage Class Memories", will, like Flash memory, have completely different wear and failure characteristics. Even Flash now seems to be in a region of declining returns with feature-size reduction. We're currently at 100 electrons per cell, looking to get to 10. Yes, that's one hundred. A figure I find hard to comprehend in consumer devices.

But, from where we are now, there aren't any technologies that will surpass HDD in capacity/price for the next 20 years. Yet, as Neil has so ably pointed out recently with Fusion-io and PCIe SSDs, HDDs are no longer "useful" or cost-effective for serving random I/O loads. [Sell your shares in enterprise disk array manufacturers, but not Seagate or Western Digital.] HDDs work well for streaming I/O: think CD-ROM or DVD. While they can seek, they are horribly slow at it: it takes 100-1000 times as long to read an HDD with random I/O compared to seek-and-stream.
Point of that excursion: HDD reliability figures are going to become important to archivists, not to performance analysts. 
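The 100-1000x random-vs-streaming penalty quoted above falls straight out of first principles; a minimal sketch assuming round-number drive characteristics (~10 ms per random 4 KB read, ~100 MB/s sustained streaming: illustrative values, not figures from the post):

```python
# Illustrative, round-number HDD characteristics (assumptions, not specs).
seek_plus_rotate_s = 0.010    # ~10 ms average positioning time per random read
random_io_bytes = 4 * 1024    # 4 KB transferred per random read
stream_rate = 100e6           # ~100 MB/s sustained sequential transfer

# Effective throughput when every read pays a full seek.
random_rate = random_io_bytes / seek_plus_rotate_s   # ~0.4 MB/s

slowdown = stream_rate / random_rate
print(round(slowdown))   # ~244x, inside the post's 100-1000x range
```

Larger random reads or shorter seeks push the ratio toward 100x; smaller reads push it toward 1000x, which is why the post quotes a range rather than a single number.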