Big Numbers


DrQ

Aug 27, 2012, 10:59:58 AM
to guerrilla-cap...@googlegroups.com
Don't be too easily impressed by big numbers.

Recently, I saw a tweet regarding the amount of CPU time devoted by Fermilab to crunching LHC data:
(Fermilab) Progress in HEP Computing: Recent LHC resource appetite is a mind-boggling 
"1.5 CPU-millennia every 3 days"

Is it mind-boggling? This caused me to do some calculations.

Let's compare with SETI@home
"Since its launch on May 17, 1999, the project has logged over two million years of aggregate computing time."

> cpu.yrs <- 2e6 #since 1999
# Per day...
> cpu.yrs/((2012-1999)*365)
[1] 421.4963

But it's not clear exactly when the end-point of "since" was measured (e.g., 2009, 2012?). So, for simplicity, I'll just round it up to 500 cpu-yrs per day or 1/2 a CPU-millennium per day.

> 500*3 #over 3 days
[1] 1500

That's 1.5 thousand cpu-yrs in 3 days or very similar to the Fermilab claim. And since those Fermilab cycles are also highly distributed, like SETI, the number is impressive but not quite so impressive as it first appeared.
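
Putting the two per-day rates side by side in the same R session:

> fermi.cpu.yrs.day <- 1500/3   # 1.5 CPU-millennia per 3 days, from the tweet
> seti.cpu.yrs.day <- 500       # SETI@home, rounded up as above
> fermi.cpu.yrs.day / seti.cpu.yrs.day
[1] 1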


steve jenkin

Aug 27, 2012, 6:40:17 PM
to guerrilla-cap...@googlegroups.com
DrQ wrote on 28/08/12 12:59 AM:

"1.5 CPU-millennia every 3 days"

What's their daily failure rate? [500 cpu-yrs/day]

- 500 CPU-years/day = 182,625 CPU-days/day, i.e. ~182,625 CPUs running around the clock.

2 CPUs/system == ~100,000 systems, rounded up [@450 W each, ~45-50 MW]
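
Putting that back-of-envelope into R (the 365.25 days/CPU-year, 2 CPUs/system and 450 W/system figures are the assumptions above):

> cpu.yrs.day <- 500                  # CPU-years consumed per day
> n.cpus <- cpu.yrs.day * 365.25      # ~182,625 CPUs busy around the clock
> n.cpus / 2                          # 2 CPUs/system: ~91,000, round up to 100,000
> 1e5 * 450 / 1e6                     # 450 W/system -> ~45 MW, call it 50 MW with overheads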

Though, do they fudge the numbers by counting cores not CPU's?
Divide by 2 or 4.

99.99% availability = 0.01% downtime = 1 hour down per 10,000 hours.
Need an MTTR to convert that into a failure rate. Guess 1 hr MTTR:
100,000 sys / 10,000 = 10 failures/hr of operation = 240 failures/day.
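
The same arithmetic as a sketch in R (treating unavailability as failure rate x MTTR, with the 1 hr MTTR guess):

> n.sys <- 1e5
> unavail <- 1 - 0.9999          # 99.99% availability
> mttr <- 1                      # guessed mean time to repair, hours
> n.sys * unavail / mttr         # ~10 system failures/hr across the fleet
> n.sys * unavail / mttr * 24    # ~240 failures/day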

More likely they get 5-10 times better reliability than that, i.e. 1-2 system failures/hr
[roughly 1-in-100,000 unavailability, 5 nines, 99.999%, at a 1 hr MTTR],
or ~20-100 replaced systems/day, probably swept up in a daily fix run.

They would also have ~2,000 48-port switches.
Can't imagine their MTBF being worse than 5-10M hrs (guess);
at that rate, 2,000 switches works out to only a few switch failures per year.

They wouldn't have less than 2 disks/system, more likely 3-4.
200,000+ drives * 24 hours/day = ~5M (5e6) drive-hours/day.

Manufacturers like to quote 1M-hrs MTBF, or 100 years (a 1%/yr failure rate)
The 2007 paper by Google talks of failures as %/yr.
They found 1.7% in 1st year rising to ~8.5% in 3rd year.
Guess 4% = 8,000 drives/yr = ~20 drives/day
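
And the drive-replacement estimate, using the guessed 4%/yr fleet-average rate:

> n.drives <- 2e5                # 2 drives/system across ~100,000 systems
> afr <- 0.04                    # guessed fleet-average annual failure rate
> n.drives * afr                 # ~8,000 drive failures/yr
> n.drives * afr / 365           # ~22 drives to swap per day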

Disk errors? Even the google paper didn't go there :-)
Manufacturers typically quote hard errors of 1 in 10^15 bits read for SATA class drives.

If drives are run hard, say 0.5 Gbps (5e8 bits/s), there are ~32M (3.2e7) seconds/yr,
or up to ~1.6e16 bits read/yr/drive at full duty cycle.

So even at a tenth of that duty cycle, each drive would experience at least 1-2 Unrecoverable Read Errors/year.
Each system, 3-4 times that [with 3-4 drives].

With a fleet of 300,000-400,000 drives, you're looking at 1,000-2,000 UREs/day.
You'd want to be running RAID of some sort...
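
A sketch of that arithmetic in R (the 0.5 Gbps rate, the 10% duty cycle and the ~350,000-drive fleet are the assumptions above):

> bits.per.sec <- 5e8                             # 0.5 Gbps sustained read rate
> secs.per.yr <- 3.2e7                            # ~32M seconds in a year
> ure.per.bit <- 1e-15                            # quoted: 1 hard error per 10^15 bits read
> bits.per.sec * secs.per.yr * ure.per.bit        # ~16 UREs/yr/drive if run flat out
> 0.1 * bits.per.sec * secs.per.yr * ure.per.bit  # ~1.6 UREs/yr/drive at a 10% duty cycle
> 3.5e5 * 1.6 / 365                               # ~1,500 UREs/day across a ~350,000-drive fleet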

They have a busy bunch of beavers looking after their gear.

Google is reported to have close to 1M systems.
I can't imagine that scale.




-- 
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

stev...@gmail.com http://members.tip.net.au/~sjenkin

James Newsom

Aug 28, 2012, 10:07:13 AM
to guerrilla-cap...@googlegroups.com
On 8/27/2012 5:40 PM, steve jenkin wrote:

Though, do they fudge the numbers by counting cores not CPU's?
Divide by 2 or 4.

I was wondering that one myself. And not just cores, but logical cores with SMT/Hyperthreading.

AMD has chips that pack in 16 cores per socket (Opteron 6200).

James

DrQ

Aug 28, 2012, 3:31:44 PM
to guerrilla-cap...@googlegroups.com
On a (sideways) related note, I just came across this claim:


Anyone wanna check that one out?

steve jenkin

Aug 28, 2012, 8:29:59 PM
to guerrilla-cap...@googlegroups.com
After I wrote this, I realised I have a real gap in my knowledge of stats:

 - for single or small numbers of machines, how do you calculate useful numbers from MTBF's?

If I buy a new PC every 3 years, only ever owning one at a time, and the disk drives are rated at "1M hours MTBF", what's the probability of having a failure over a lifetime of ownership (50 yrs)?

Is it just 1% every year and 50*1% for 50 years?

Or, what proportion of single PC owners will experience a disk drive failure in 50 years of ownership?

For small numbers, many small businesses have 4-5 servers with a few disks each, and tend to keep each server 4-5 years.
What's the likelihood of having to replace a disk in a server?
Comes to a prosaic purchase decision:

steve jenkin wrote on 28/08/12 8:40 AM:
DrQ wrote on 28/08/12 12:59 AM:

"1.5 CPU-millennia every 3 days"

Manufacturers like to quote 1M-hrs MTBF, or 100 years (a 1%/yr failure rate)
The 2007 paper by Google talks of failures as %/yr.
They found 1.7% in 1st year rising to ~8.5% in 3rd year.
Guess 4% = 8,000 drives/yr = ~20 drives/day

steve jenkin

Aug 28, 2012, 8:31:59 PM
to guerrilla-cap...@googlegroups.com
Fat fingered :-(

steve jenkin wrote on 29/08/12 10:29 AM:

For small numbers, many small businesses have 4-5 servers with a few
disks each, and tend to keep each server 4-5 years.
What's the likelihood of having to replace a disk in a server?

Comes to a prosaic purchase decision:
- do we purchase a spare disk (or two) along with the server, to put on
the shelf as a replacement, or
- do we buy maintenance at 10-20% of the purchase price?

Darryl Gove

Aug 28, 2012, 9:31:46 PM
to guerrilla-cap...@googlegroups.com
I think at this point you are moving beyond probability into risk assessment.

You can work out the probability of a disk failure (etc.), but you need
to assign a cost to the various situations. For example, if your disk
contains critical work, then it would be better to have some RAID
system. Assuming the disk failure is not about data loss, purely a
matter of convenience, then the decision is more about whether you want
to accept the downtime, and the cost of a spare disk on the shelf vs the
cost of buying a disk when you need it vs the cost of just buying a new
machine out of cycle.

A similar set of arguments applies to the maintenance option. In this
case it's the cost of your time fixing the problem vs the cost of the
lost hours of productivity/downtime.

Of course, you can then front-load it by asking whether it's better to
buy a machine with redundancy in order to avoid downtime if a disk
(etc) goes out.

D.



--
http://www.darrylgove.com/

Darryl Gove

Aug 28, 2012, 9:26:46 PM
to guerrilla-cap...@googlegroups.com
So you have a 1/100 chance of a disk failure in any one year (reading
1M-hrs MTBF as ~1%/yr, as above). Under that assumption the age of the
disks doesn't matter, so we can ignore the fact that you replace your
machine every three years.

The crucial step is that the chance of experiencing (at least) one
failure is 1 - the probability of experiencing none.

The probability of experiencing no failures over 50 years is (99/100)
^ 50. So the probability of experiencing at least one is 1-(99/100)^50
= 0.39.
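
A quick check in R, assuming a constant 1%/yr failure rate and one drive owned at a time:

> p.fail <- 1/100            # assumed constant annual failure probability
> 1 - (1 - p.fail)^50        # P(at least one failure in 50 years of ownership)
[1] 0.3949939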

Regards,

Darryl.

steve jenkin

Aug 28, 2012, 9:56:02 PM
to guerrilla-cap...@googlegroups.com
Darryl Gove wrote on 29/08/12 11:26 AM:
The probability of experiencing no failures over 50 years is (99/100)
^ 50. So the probability of experiencing at least one is 1-(99/100)^50
= 0.39.
Thanks very much. That was what I was missing... Pretty dumb, I know :-(

steve jenkin

Aug 29, 2012, 12:05:05 AM
to guerrilla-cap...@googlegroups.com
Darryl Gove wrote on 29/08/12 11:26 AM:
The probability of experiencing no failures over 50 years is (99/100)
^ 50. So the probability of experiencing at least one is 1-(99/100)^50
= 0.39.
Darryl,

Thinking a little more on this.

If I have a server and run it for 4-5 years, and there's an average chance of drive failure of 6%/year,
then over the 5-year life of a single drive:

 - Prob. No Failure/yr = 1 - prob(failure/yr) = 1 - .06 = .94
 - prob No failure in 5 yr = 0.94 ^ 5 = 0.734

Now if I have 3 drives, is the probability of no drives failing in a single year the product of the individual probabilities, or 1 - sum(prob failure)?

ie. 0.94 * 0.94 * 0.94 = 0.831
or 1 - (0.06 + 0.06 + 0.06) = 1 - 0.18 = 0.82

I'm guessing that, having worked out the yearly probability of "no fails this year" for a group of drives, the probability of getting through the entire 5 years with no failures is the product:
ie. p1 * p2 * p3 * p4 * p5

cheers
steve

Darryl Gove

Aug 29, 2012, 12:53:05 AM
to guerrilla-cap...@googlegroups.com
On 28 August 2012 21:05, steve jenkin <stev...@gmail.com> wrote:
> Darryl Gove wrote on 29/08/12 11:26 AM:
>
> The probability of experiencing no failures over 50 years is (99/100)
> ^ 50. So the probability of experiencing at least one is 1-(99/100)^50
> = 0.39.
>
> Darryl,
>
> Thinking a little more on this.
>
> If I have a server and run it for 4-5 years, and there's an average chance
> of drive failure of 6%/year,
> then over the 5 year life of a single drive:
>
> - Prob. No Failure/yr = 1 - prob(failure/yr) = 1 - .06 = .94
> - prob No failure in 5 yr = 0.94 ^ 5 = 0.734
>
> Now if I have 3 drives, is the probability of no drives failing in a single
> year the product of the individual probabilities, or 1 - sum(prob failure)?

Not the sum - you can only add the probabilities of mutually exclusive events.

>
> ie. 0.94 * 0.94 * 0.94 = 0.831
Yes, this is the probability of no drive failing during one year.

> or 1 - (0.06 + 0.06 + 0.06) = 1 - 0.18 = 0.82
No, this is not right (imagine the probability of a drive failing were 0.5: the sum would give 1 - 1.5, a negative probability :)

>
> I'm guessing that, having worked out the yearly probability of "no fails
> this year" for a group of drives, the probability of getting through the
> entire 5 years with no failures is the product:
> ie. p1 * p2 * p3 * p4 * p5

Yes.

0.831^5 = 0.396

So the total probability of no failures is (prob of a drive surviving one year)^( #drives * #years )
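
The whole calculation as a quick R sketch, using the assumed 6%/yr per-drive failure rate:

> p.surv.yr <- 1 - 0.06                # assumed chance a single drive survives one year
> n.drives <- 3; n.years <- 5
> p.surv.yr^(n.drives * n.years)       # P(no drive failure over the server's life), ~0.395
> 1 - p.surv.yr^(n.drives * n.years)   # P(at least one drive failure), ~0.605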

Darryl.

>
> cheers
> steve

steve jenkin

Aug 29, 2012, 2:44:53 AM
to guerrilla-cap...@googlegroups.com
Darryl,

Thanks very much for explaining it so patiently to me, A Bear of Little
Brain [reference to Pooh Bear]

cheers
steve

Darryl Gove wrote on 29/08/12 2:53 PM:

steve jenkin

Aug 30, 2012, 8:13:46 PM
to guerrilla-cap...@googlegroups.com
rml...@gmail.com wrote on 31/08/12 9:34 AM:
> I question the uniform distribution of failures.

Bob,

you're dead right :-) A Simplifying Assumption.

There are two well-known papers on large-scale HDD failures, published
within the last 5 years. One is by Google. [others will know the refs]

HDD failure rates change with age, use, temperature and power-cycles.
Nor are they usefully close to the vendor-published figures derived from
accelerated ageing.

Nor, very surprisingly, does S.M.A.R.T. reporting give you much
predictive power. IIRC, Google says more than 50% of failures are not
predictable from those logs.

The wild card is "new technologies" - how will they perform?

We've entered the last factor-of-10 increase in HDD recording density
(~2020) and three new recording techniques are yet to enter mainstream
service:
- HAMR [Heat-Assisted Magnetic Recording: higher-coercivity media,
heated by a laser]
- BPM [Bit-Patterned Media: the bit areas are lithographically shaped]
- Shingled writes [better described as multi-track overlapped writes
with no in-place update]

The replacement technologies, broadly called "Storage Class Memories",
will, like Flash memory, have completely different wear and failure
characteristics.

Even 'Flash' now seems to be in a region of diminishing returns with
feature-size reduction. We're currently at ~100 electrons per cell,
looking to get to 10. Yes, that's one hundred. A figure I find hard to
comprehend in consumer devices.

But, from where we are now, there aren't any technologies that will
surpass HDD in capacity/price for the next 20 years.

But as Neil has so ably pointed out recently with Fusion-IO and
PCI-SSD's, HDD's are no longer "useful" or cost-effective for serving
Random I/O loads. [Sell your shares in Enterprise Disk Array
manufacturers, but not Seagate or Western Digital].

HDDs work well for streaming I/O: think CD-ROM or DVD.
- while they can seek, they are horribly slow at it...
- it takes 100-1000 times as long to read a whole HDD with 'random I/O'
compared to seek-and-stream (rough sketch below).
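
A rough sketch of where that factor comes from, assuming a 2012-ish 1 TB drive, ~100 MB/s streaming and ~8 ms per random 4 KB read:

> drive.bytes <- 1e12                          # assume a 1 TB drive
> stream.bytes.sec <- 100e6                    # ~100 MB/s sequential throughput
> io.size <- 4096                              # 4 KB random reads
> io.time <- 0.008                             # ~8 ms per random read (seek + rotation)
> drive.bytes / stream.bytes.sec / 3600        # ~2.8 hrs to stream the whole drive
> drive.bytes / io.size * io.time / 3600       # ~540 hrs to read it as random 4 KB I/O
> (drive.bytes / io.size * io.time) / (drive.bytes / stream.bytes.sec)  # ~200x, the low end of the range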

Point of that excursion:
HDD reliability figures are going to matter to archivists,
not to performance analysts.