Big Numbers

Big Numbers DrQ 8/27/12 7:59 AM

Don't be too easily impressed by big numbers.

Recently, I saw a tweet regarding the amount of CPU time devoted by Fermilab to crunching LHC data:

(Fermilab) Progress in HEP Computing: Recent LHC resource appetite is a mind-boggling "1.5 CPU-millennia every 3 days"

Is it mind-boggling? This caused me to do some calculations.

Let's compare with SETI@home. "Since its launch on May 17, 1999, the project has logged over two million years of aggregate computing time."

    > cpu.yrs <- 2e6 #since 1999
    # Per day...
    > cpu.yrs/((2012-1999)*365)
    [1] 421.4963

But it's not clear exactly when the end-point of "since" was measured (e.g., 2009, 2012?). So, for simplicity, I'll just round it up to 500 cpu-yrs per day or 1/2 a CPU-millennium per day.

    > 500*3 #over 3 days
    [1] 1500

That's 1.5 thousand cpu-yrs in 3 days or very similar to the Fermilab claim. And since those Fermilab cycles are also highly distributed, like SETI, the number is impressive but not quite so impressive as it first appeared.

Re: Big Numbers SteveJ 8/27/12 3:40 PM

DrQ wrote on 28/08/12 12:59 AM:
> "1.5 CPU-millennia every 3 days"

What's their daily failure rate?

[500 cpu-yrs/day]
 - 500 cpu-years = 182,625 cpu-days delivered every day == # cpus.
 - 2 cpu/system == 100,000 sys, rounded up [@450W ea, 50MWatt]

Though, do they fudge the numbers by counting cores not CPU's? Divide by 2 or 4.

99.99% availability? = 0.01% down == 1/10,000 hours failure.
Need MTTR to convert to a failure rate.
Guess 1hr MTTR = 100,000 sys/10,000 = 10 fails/hr-operated = 240 fails/day.

More likely they get 5-10 times better reliability, but MTTR would be 1 day [i.e. a daily fix sweep]. At 1 in 100,000 unavailability [5 nines, 99.999%], they have 1 sys fail/hr, or ~20-100/day replaced systems.

They would also have 2,000 48-port switches. Can't imagine them having an MTBF worse than 5-10M hrs (guess). They might have 2 switches/day fail.

They wouldn't have less than 2 disks/system, more likely 3-4.
200,000 drives * 24-hours/day = 5M hrs/day = 5e6 hrs/day

Manufacturers like to quote 1M-hrs MTBF, or 100 years (a 1%/yr failure rate).
The 2007 paper by Google talks of failures as %/yr. They found 1.7% in 1st year rising to ~8.5% in 3rd year.
Guess 4% = 8,000 drives/yr = ~20 drives/day

Disk errors? Even the Google paper didn't go there :-)
Manufacturers typically quote hard errors of 1 in 10^15 bits read for SATA class drives.

If drives are run hard, say 0.5Gbps (5e8 bps), there are 32M (3.2e7) seconds/yr, or 1.6e16 bits/yr/drive. So each drive would experience around 16 Unrecoverable Read Errors (UREs) per year. Each system, 3-4 times that [with 3-4 drives].
With a fleet of 3-400,000 drives, you're looking at roughly 13-18,000 UREs/day.

You'd want to be running RAID of some sort...

They have a busy bunch of beavers looking after their gear.
Google is reported to have close to 1M systems. I can't imagine that scale.

```
--
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

stev...@gmail.com http://members.tip.net.au/~sjenkin
```

Re: Big Numbers James 8/28/12 7:25 AM

On 8/27/2012 5:40 PM, steve jenkin wrote:
> Though, do they fudge the numbers by counting cores not CPU's? Divide by 2 or 4.

I was wondering that one myself. And not just cores, but logical cores with SMT/Hyperthreading. AMD has chips that pack in 16 cores per socket (Opteron 6200).

James

Re: Big Numbers DrQ 8/28/12 12:31 PM

In a (sideways) related note, I just came across this claim: "Anyone wanna check that one out?

Re: Big Numbers SteveJ 8/28/12 5:29 PM

After I wrote this, I realised I have a real gap in my knowledge of stats:
 - for single or small numbers of machines, how do you calculate useful numbers from MTBFs?

If I buy a new PC every 3 years, only ever owning one at a time, and the disk drives are rated at "1M hours MTBF", what's the probability in a lifetime of ownership (50 yrs) of having a failure?
Is it just 1% every year and 50*1% for 50 years?
Or, what proportion of single PC owners will experience a disk drive failure in 50 years of ownership?

For small numbers, many small businesses have 4-5 servers with a few disks each, and tend to keep each server 4-5 years. What's the likelihood of having to replace a disk in a server?

Comes to a prosaic purchase decision:

steve jenkin wrote on 28/08/12 8:40 AM:
> DrQ wrote on 28/08/12 12:59 AM:
> > "1.5 CPU-millennia every 3 days"
>
> Manufacturers like to quote 1M-hrs MTBF, or 100 years (a 1%/yr failure rate).
> The 2007 paper by Google talks of failures as %/yr. They found 1.7% in 1st year rising to ~8.5% in 3rd year.
> Guess 4% = 8,000 drives/yr = ~20 drives/day

Re: Big Numbers SteveJ 8/28/12 5:31 PM

Fat fingered :-(

steve jenkin wrote on 29/08/12 10:29 AM:
> For small numbers, many small businesses have 4-5 servers with a few disks each, and tend to keep each server 4-5 years. What's the likelihood of having to replace a disk in a server?
>
> Comes to a prosaic purchase decision:

 - do we purchase a spare disk (or two) along with the server, to put on the shelf as a replacement, or
 - do we buy maintenance at 10-20% purchase price?

Re: Big Numbers Darryl Gove 8/28/12 6:42 PM

I think at this point you are moving beyond probability into risk assessment. You can work out the probability of a disk failure (etc.), but you need to assign a cost to the various situations. For example, if your disk contains critical work, then it would be better to have some RAID system.

Assuming the disk failure is not about data loss, purely convenience, then the decision is more about whether you want to have the downtime, the cost of the spare disk vs the cost of buying a disk when you need it - vs the cost of just buying a new machine out of cycle.
A similar set of arguments applies to the maintenance. In this case it's the cost of your time fixing the problem, the cost of the lost hours of productivity/downtime.

Of course, you can then front-load it by asking whether it's better to buy a machine with redundancy in order to avoid downtime if a disk (etc) goes out.

D.

--
http://www.darrylgove.com/

Re: Big Numbers Darryl Gove 8/28/12 6:42 PM

So you have 1/100 chance of a disk failure in one year. The age of the disks doesn't matter so we can ignore the fact that you replace your machine every three years.

The crucial step is that the chance of experiencing (at least) one failure is 1 - probability of experiencing none.

The probability of experiencing no failures over 50 years is (99/100) ^ 50. So the probability of experiencing at least one is 1-(99/100)^50 = 0.39.

Regards,
Darryl.

Re: Big Numbers SteveJ 8/28/12 7:09 PM

Darryl Gove wrote on 29/08/12 11:26 AM:
```The probability of experiencing no failures over 50 years is (99/100) ^ 50. So the probability of experiencing at least one is 1-(99/100)^50 = 0.39.```

Thanks very much. That was what I was missing...
Pretty dumb, I know :-(

Re: Big Numbers SteveJ 8/28/12 9:17 PM

Darryl Gove wrote on 29/08/12 11:26 AM:
```The probability of experiencing no failures over 50 years is (99/100) ^ 50.
So the probability of experiencing at least one is 1-(99/100)^50 = 0.39.```

Darryl,

Thinking a little more on this.

If I have a server and run it 4-5 years and there's an averaged chance of failure of 6%/year of drives, then over the 5 year life of a single drive:

 - Prob. No Failure/yr = 1 - prob(failure/yr) = 1 - .06 = .94
 - prob No failure in 5 yr = 0.94 ^ 5 = 0.734

Now if I have 3 drives, is the Probability of No drives failing in a single year the product, or 1 - sum(prob failure)??

ie. 0.94 * 0.94 * 0.94 = 0.831
or 1 - (0.06 + 0.06 + .06) = 1 - 0.18 = .82

I'm guessing, having worked out the yearly rate of "No Fails this year" for a group of drives, then the probability for getting through the entire 5 years with no failures is the product:
ie. p1 * p2 * p3 * p4 * p5

cheers
steve

Re: Big Numbers Darryl Gove 8/28/12 9:59 PM

On 28 August 2012 21:05, steve jenkin wrote:
> Darryl Gove wrote on 29/08/12 11:26 AM:
> > The probability of experiencing no failures over 50 years is (99/100) ^ 50. So the probability of experiencing at least one is 1-(99/100)^50 = 0.39.
>
> Darryl,
>
> Thinking a little more on this.
>
> If I have a server and run it 4-5 years and there's an averaged chance of failure of 6%/year of drives,
> then over the 5 year life of a single drive:
>
>  - Prob. No Failure/yr = 1 - prob(failure/yr) = 1 - .06 = .94
>  - prob No failure in 5 yr = 0.94 ^ 5 = 0.734
>
> Now if I have 3 drives, is the Probability of No drives failing in a single year the product, or 1 - sum(prob failure)??

Not the sum - you can only add the probabilities of mutually exclusive events.

> ie. 0.94 * 0.94 * 0.94 = 0.831

Yes, this is the probability of no drive failing during one year.
> or 1 - (0.06 + 0.06 + .06) = 1 - 0.18 = .82

No, this is not right (imagine that the probability of a drive failing is 0.5 :)

> I'm guessing, having worked out the yearly rate of "No Fails this year" for a group of drives, then the probability for getting through the entire 5 years with no failures is the product:
> ie. p1 * p2 * p3 * p4 * p5

Yes. 0.831^5 = 0.396

So the total probability of no failures is (prob. of one drive working for one year)^( #drives * #years )

Darryl.

Re: Big Numbers SteveJ 8/28/12 11:44 PM

Darryl,

Thanks very much for explaining it so patiently to me, A Bear of Little Brain [reference to Pooh Bear]

cheers
steve

Darryl Gove wrote on 29/08/12 2:53 PM:

Re: Big Numbers rml...@gmail.com 8/30/12 4:34 PM

Nice analyses by Steve and Darryl.

I question the uniform distribution of failures. Because systems tend to have burn-in failures early in life and burn-out/wear-out failures late in life with stability in between, the uniform distribution seems suspect. Because this is a bathtub shaped distribution, wouldn't an exponentiated Weibull distribution be a better model for the failures that are being discussed?

If we assume that product life is divided into three parts--infant mortality, random failures, and wear-out--then three functions may express the probability of failure depending on where the system is at in its lifespan. Some vendors burn-in their systems prior to customer delivery to minimize customers experiencing infant mortality.

I know that this moves us from performance to reliability.

Bob

Re: Big Numbers SteveJ 8/30/12 5:14 PM

rml...@gmail.com wrote on 31/08/12 9:34 AM:
> I question the uniform distribution of failures.

Bob, you're dead right :-) A Simplifying Assumption.
There are two well known papers on large-scale HDD failures, published within the last 5 years. One by Google. [others will know the refs]

HDD failure rate changes with age, use, temperature and power-cycles. It's also not usefully near the Vendor-published figures from accelerated ageing.

Nor, very surprisingly, does S.M.A.R.T. reporting give you much predictive power. IIRC, Google says more than 50% of failures are not predictable from those logs.

The wild card is "new technologies" - how will they perform?
We've entered the last factor-10 increase of HDD recording density (~2020) and three new recording techniques are yet to enter mainstream service:

 - HAMR [Heat Assisted Magnetic Recording - higher coercivity media, heated by laser]
 - BPR [Bit Patterned Media - shape the bit areas]
 - Shingled Writes [better described as Multi-track overlapped write with no in-place update]

The replacement technologies, broadly called "Storage Class Memories", will, like Flash memory, have completely different wear and failure characteristics.

Even 'Flash' seems to now be in a region of declining returns with feature size reduction. We're currently at 100 electrons per cell, looking to get to 10. Yes, that's One Hundred. A figure I find hard to comprehend in consumer devices.

But, from where we are now, there aren't any technologies that will surpass HDD in capacity/price for the next 20 years.

But as Neil has so ably pointed out recently with Fusion-IO and PCI-SSD's, HDD's are no longer "useful" or cost-effective for serving Random I/O loads. [Sell your shares in Enterprise Disk Array manufacturers, but not Seagate or Western Digital].

HDD's work well for streaming-IO: think CD-ROM or DVD.
 - while they can seek, they are horribly slow at it...
 - It takes 100-1000 times as long to read a HDD with 'random I/O' compared to seek-and-stream.

Point of that excursion:
  HDD reliability figures are going to become important to archivists, not to Performance Analysts.
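
For anyone wanting to replay the drive-failure arithmetic from earlier in the thread (the 1%/yr single PC over 50 years, and the 3 drives at 6%/yr over 5 years), here is a minimal Python sketch. It assumes a constant annual failure probability and independent drives, which is exactly the simplifying assumption questioned above, so treat it as back-of-envelope only. The function names are made up for illustration and do not come from any library.

```python
def prob_no_failure(annual_fail_prob: float, drives: int, years: int) -> float:
    """Probability that none of `drives` fails during `years`.

    Assumes (hypothetically) a constant annual failure probability and
    independent drives, so survival multiplies across drive-years.
    """
    survive_one_drive_year = 1.0 - annual_fail_prob
    return survive_one_drive_year ** (drives * years)


def prob_at_least_one_failure(annual_fail_prob: float, drives: int, years: int) -> float:
    """Complement of prob_no_failure: at least one drive fails."""
    return 1.0 - prob_no_failure(annual_fail_prob, drives, years)


# One PC at a time, 1%/yr, 50 years of ownership.
single_pc = prob_at_least_one_failure(0.01, drives=1, years=50)
# Naive "50 * 1%" estimate, shown only for comparison; summing overstates it.
naive_sum = 50 * 0.01
# One server with 3 drives at 6%/yr, kept for 5 years.
server_ok = prob_no_failure(0.06, drives=3, years=5)

print(f"P(at least one failure in 50 yrs) = {single_pc:.3f} (naive sum: {naive_sum:.2f})")
print(f"P(no drive fails in 5 yrs, 3 drives) = {server_ok:.3f}")
```

Swapping in an age-dependent failure rate (for example the rising yearly percentages reported for large fleets) would mean multiplying a different survival factor for each year instead of raising a single one to a power.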