Intel abandons USEnet news

Andy Glew

Sep 16, 2006, 6:51:51 PM

Minorly off-topic, but I feel impelled to note that Intel has just
ZBB'ed its internal NNTP news servers. Actually, they were ZBB'ed many
years ago, but volunteers kept them going. Those volunteers may now be
ZBB'ed. New volunteers may arise; heck, I may; but the path of least
resistance is to give up on getting USEnet news inside the company, and
go to some external service. E.g. today I am posting from Google
Groups.

Personal relevance to comp.arch: Intel's internal news servers have
been my main connection to comp.arch since 1991. Brief exceptions
while I was in Madison and at AMD. Prior to joining Intel I
participated in comp.arch and its predecessor net.arch on news servers
from the University of Illinois and from Gould and Motorola. I still
maintain that I learned more computer architecture from comp.arch than
I did in any school; moreover, I am fairly confident that I would never
have gotten my job with Intel without my comp.arch mediated
acquaintance with Bob Colwell.

Generic relevance to comp.arch: this is a trend. Actually two trends.

Trend #1 is that less and less personal computing can be done at work,
and that more and more work related computing is "freeloading" on
personally paid for computing.

Most people used to have only one email address that they used for both
work and personal matters. You can still do this, but it is becoming
increasingly hard to do so because companies like Intel do not allow
you to access the corporate network from your own computers; you can
only do so from a company owned device.
So you have a personal mail service, as well as your work mail
service. Maybe your personal mail service is from an ISP, and changes
whenever your ISP changes, you move, or when Qwest gets bought out by
Verizon. Maybe you have your own domain.
But your company doesn't allow you to run POP across the firewall.
Similarly for newsgroup access: your company doesn't allow you to run
NNTP across the firewall.

This leads to Trend #2: Google. More generically, the rebirth of "Big
Iron", centralized, computer service companies.
Google *is* "Big Iron". Maybe not in the IBM mainframe sense, but
anyone who has seen a Google machine room knows that it is a completely
different scale than a desktop or laptop PC.
For many years I tried to keep my personal computing environment PC
based. I ran my mail reader on my laptop or desktop PC, sometimes via a
client technology such as POP, IMAP, sometimes peer-to-peer stuff like
SMTP. Similarly, up until now I have read news on my laptop or desktop
PC. When I saved a file, it was saved on my PC's hard disk. I could
not access my environment of saved files and email without being on my
PC. Maybe I could read my email from other computers, but I did not
have my mailreading environment on those other computers, so I tried to
avoid doing so.
But, not being able to access my personal email from work - no POP,
no ssh - was the last straw. I switched to Google mail. Now I can
access my personal email from any computer - at work, at home, from my
wife's computer. From my relatives' computers. I no longer need to
drag my personal laptop around with me.

Downside: I cannot access my Google email when I am not connected to
the net. For many years this was the biggest reason that I stayed PC
based. Broadband took a long time to get to many of the places where I
spend time, like Oceanside, Oregon, and the Ottawa river valley in
Canada. Broadband is still not available in many of my favorite places,
such as Eastern Oregon. Heck, cell phone service is not available. (I
am waiting for reports of the Microsoft/KVH mobile broadband with
interest.)
Perhaps most important for business folk, I cannot access Google
email on a plane, when I am not connected to the Internet.
Yes, I know: you can access Google mail via POP, downloading it to a
mobile PC where you can read it disconnected. But that just puts you
back in the "your mailreading environment lives on only one PC" mode.
So far as I know, there is no way to download Google mail to your PC,
and then upload back to Google any annotations, tags, classifications,
and spam markings you have made to your email while disconnected.
I hope that Google will soon remedy this, and provide disconnected
operation, not just for email, but also for other Google services such
as Google groups.

Interestingly, moving to Google mail has provided more freedom from the
point of view of form factor. In my "my mailreading environment lives
on a single laptop PC" days, I needed to have a laptop that met my
minimum needs for all common situations. E.g. it had to have a big
enough screen, enough disk, and a keyboard. But now that I am Google
based I can seriously consider reading email on a keyboardless tablet
in my living room, or a PDA, or... Since I can always go to another
device. I.e. I am more likely to buy a "widget" specialized computer
now that I am using Google mail than I was when I used a PC.
I hypothesize that this is true not just of me, but also of other
users. Perhaps the long awaited flowering of specialized devices for
ubiquitous computing is now about to begin.

Terminology change: I used to read my mail on my laptop PC. Now I read
it on Google, via a web browser that happens to run on a PC, but which
could run elsewhere. I used to be a PC user. Now I am a Google user.

USEnet news is just another information service, like email. Same
considerations apply. Since I have switched to Google mail, I might as
well switch to Google groups. Ditto RSS, and other information
services.
What I really want is to receive all of my information inputs in a
common environment, that can seamlessly prioritize and sort email,
USEnet news, RSS, regular news, IM, and telephony. Google is the most
likely company to achieve this.

Interestingly, I have been forced into schizophrenia. My work
information feeds are in one place, my personal feeds in another. At
the moment it appears that the personal feeds on Google are more
integrated, have better search abilities, etc., although far less
storage.
Will this keep up? Or will the quality of information management at
work play leapfrog with Google? I do not know... but I predict that
at least some fraction of companies will just outsource their employees
email, etc., to Google. I.e. I predict that Google will be able to
provide a single stop for both work and personal information
management. And that because of this, it will have a larger critical
mass than companies that are stuck just supporting an individual's work
computing and information needs.

Returning to trend #1: not only will less and less personal computing
be done at work, but more and more work computing will be done
personally... because the personal computing environment, whether
Google based or whatever, is pulling ahead of the work environment.
(Unless Google takes over the work environment, as predicted above.)
The item that sparked this post is just an example: reading USEnet
newsgroups such as comp.arch is recreation, but I also fairly regularly
post queries to newsgroups such as comp.lang.c, etc., for work related
questions. Closing down Intel's internal news service, of course,
means that I will be now doing this using my personal computing
resources. I.e. Google groups.
More evidence of this trend: back in the old days companies paid for
2nd phone lines for computer access. Nowadays you are expected to pay
for your own broadband access, and to use it for after hours work. I
keep meaning to take an income deduction for my broadband for tax
purposes, since the logs plainly show that it is mostly used for work,
not pleasure.

Summing up:
Prompted by: Intel abandoning internal news servers.
Hypothesis: there is a trend away from PC based computing and
information services, towards centralized computer services like
Google.

The computer industry battle is not Intel versus AMD, or Microsoft
versus Linux. It is the PC versus Google.

(Here, I use "Google" as representative of web based computing
services, ubiquitously accessible so long as you have Internet access.)

Del Cecchi

Sep 16, 2006, 7:42:52 PM

"Andy Glew" <Andy...@gmail.com> wrote in message
news:1158447110.9...@i3g2000cwc.googlegroups.com...

I switched to news.individual.net for news several years ago. It was
superior to the internal servers at a large IT company I am associated
with. And google will let you do pop3 if you want to.

Eric P.

Sep 16, 2006, 8:25:14 PM

Andy Glew wrote:
>
> Minorly off-topic, but I feel impelled to note that Intel has just
> ZBB'ed its internal NNTP news servers.

Apparently newsgroups aren't part of the whole Web 2.0 experience.

My ISP, Sympatico, dropped all its news servers this summer,
without public notice or warning I might add.
There was an article in the Globe&Mail newspaper the next day
comparing their behavior to that of a Vogon constructor fleet.

It seems their main competitors, Rogers Cable and AOL, had earlier
dropped news access and nothing bad happened to them so I guess they
thought they could do so too. And it seems that Sympatico was the news
feed for many other ISP servers because when I tried to switch ISPs,
all the alternatives were down and out too.

I find this all quite bizarre because there are more and more
businesses that are providing technical support through news.
Every day more newsgroups appear, targeted at a specific
business's product support.

Some free newsservers I found that allow posting:

reader.greatnowhere.com
Requires logon, 30 day text msg retention.

nntp.aioe.org
No logon, but can be snooty about the msg formatting it will accept.
Server was down for a few months for upgrades but is back.
Had limited retention in the past but may be better now.

Eric

kro...@princeton.edu

Sep 16, 2006, 8:53:35 PM

Eric P. wrote:
> Andy Glew wrote:
> >
> > Minorly off-topic, but I feel impelled to note that Intel has just
> > ZBB'ed its internal NNTP news servers.
>
> Apparently newsgroups aren't part of the whole Web 2.0 experience.
>

When I took my faculty job here at New Mexico State University, I was
stunned to find that they do not maintain a news server. They have (or
had) one, but it never worked, and I had trouble getting them to
subscribe to new groups.

I've been accessing news with Google for three years, after having used
various mac and pc programs to access campus news servers since the
early 80s.

I'm currently in the process of creating a new news group: sci.eeg.
The general feel of the whole process, compared to when I created one
six years ago, is very different. It feels like a bunch of youngsters
trying to carry on something that has long been abandoned by the old
timers as a declining dead end. I discussed with the new head of the
Big 8 the current relationship with Google. It seems to me that Google
is making mincemeat of the Big 8.

Andy Glew

Sep 17, 2006, 12:55:19 AM

Del Cecchi wrote:
>And google will let you do pop3 if you want to.

I am tired of people telling me this. I know it already.

You obviously did not read all of my post.

One of my points is that running POP off Google defeats one of the
great advantages of Gmail. You can tag email on the Gmail server, and
use those tags in searches.

If you download email via POP, then any tagging or mail sorting that
you do on your PC is lost from the point of view of Gmail. To the
best of my knowledge there is no way to upload any tagging or
annotation you do on your PC back to the Google server.

A secondary concern is that, if you download via POP, you end up back
in the PC usage model - you have a single mail reading environment.
You lose the ubiquitous access.

Others have said that you can rsync your PCs. Yes - but that is an a
priori arrangement. With server based email, you can walk to anyone's
computer, without prior arrangement, and access it.

One of my sayings is: "The purpose of computers is to reduce the need
to arrange things in advance."

Andy Glew

Sep 17, 2006, 12:59:34 AM

> Del Cecchi wrote:
> >And google will let you do pop3 if you want to.
>
> I am tired of people telling me this. I know it already.

Tell me something useful: tell me about a Google API tool that allows
various programs more sophisticated than Google's web interface (for
mail, RSS, alarms, etc.) to run disconnected, but then upload any
annotations back to Google.

Andy Glew

Sep 17, 2006, 1:08:20 AM

Eric P. wrote:
> Apparently newsgroups aren't part of the whole Web 2.0 experience.

I have pondered whether the USEnet, the peer to peer forwarding model,
still has any role in the future.

Could we not simply assume that everyone is connected to the Internet
at all times?

Maybe... on Earth. But clearly once we get to solar system or galactic
scale Internets, the round-trip latencies would be unacceptable. The
store and forward model is the only reasonable way to move forward at
such timescales.

But that does not mean that the old USEnet way will not die out, only
to be revived a century or a millennium from now.
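
A minimal sketch of the store and forward idea, for readers who have not met it: each server keeps a history of the Message-IDs it holds, passes a new article on to the peers it feeds, and a peer that already holds the article simply declines it. The server names, the `feeds` table, and the `flood` function are all hypothetical; this is an illustration of the propagation pattern, not of any real news server's implementation.

```python
from collections import deque

def flood(feeds, origin, message_id, history):
    """Propagate one article through a store-and-forward network.

    feeds:   server name -> list of peers that server forwards to
    history: server name -> set of Message-IDs that server already holds
    Returns the servers that newly stored the article, in arrival order.
    """
    if message_id in history.setdefault(origin, set()):
        return []  # the origin already has it, so there is nothing to do
    history[origin].add(message_id)
    stored = [origin]
    queue = deque([origin])
    while queue:
        server = queue.popleft()
        for neighbour in feeds.get(server, []):
            seen = history.setdefault(neighbour, set())
            if message_id not in seen:  # duplicate offers are declined
                seen.add(message_id)
                stored.append(neighbour)
                queue.append(neighbour)
    return stored

# Hypothetical four-server network with redundant links.
feeds = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
history = {}
print(flood(feeds, "a", "<1@example.invalid>", history))
```

No server needs a live path to every other server at posting time; each hop only needs to reach its own neighbours eventually, which is the property that matters when round-trip latencies are long.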

---

Ironically, the same sort of consideration applies to memory ordering
models for MP and CMP systems. For tightly bound systems strong
ordering such as TSO and SC is implementable, and clearly more
convenient for programmers. But for larger systems weaker consistency
models are more appropriate.

Stephen Fuld

Sep 17, 2006, 1:11:32 AM

"Andy Glew" <Andy...@gmail.com> wrote in message

news:1158468919.6...@m7g2000cwm.googlegroups.com...

snip

> A secondary concern is that, if you download via POP, you end up back
> in the PC usage model - you have a single mail reading environment.
> You lose the ubiquitous access.

Not necessarily. At least on my system, I have the option when downloading
e-mail via POP to not delete them on the server. Then I would have two
copies; one on my PC and one on the ISP's mail server. Would this address
at least your secondary concern?
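
For concreteness, the "download but leave on the server" option can be sketched with Python's standard `poplib` module: the client retrieves each message and simply never issues a delete. The host name, user, and password in the usage comment are placeholders, and `fetch_keeping_server_copy` is a hypothetical helper name, not part of any mail client discussed here.

```python
import poplib

def open_session(host, user, password):
    """Log in to a POP3-over-SSL server and return the open session."""
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    return conn

def fetch_keeping_server_copy(conn):
    """Download every message without deleting any of them.

    `conn` is anything with the poplib.POP3 interface (list/retr/quit).
    Because dele() is never called, the server keeps its copies.
    """
    _resp, listing, _octets = conn.list()
    messages = []
    for entry in listing:
        # Each listing entry looks like b"1 1205": message number, size.
        number = int(entry.split()[0])
        _resp, lines, _octets = conn.retr(number)
        messages.append(b"\r\n".join(lines))
    conn.quit()
    return messages

# Usage (placeholder host and credentials):
#   conn = open_session("pop.example.net", "someone", "secret")
#   local_copies = fetch_keeping_server_copy(conn)
```

This gives two copies of the message bodies, but anything done to the local copy afterwards (tags, notes, filing) stays on the PC.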

--
- Stephen Fuld
e-mail address disguised to prevent spam

Andy Glew

Sep 17, 2006, 1:11:30 AM

Andy Glew wrote:
> > Del Cecchi wrote:
> > >And google will let you do pop3 if you want to.
> >
> > I am tired of people telling me this. I know it already.
>
> Tell me something useful:

Tell me how to persuade Google groups to use my comp.arch AT NOSPAM
patten-glew.net email address, rather than Gmail. (All is forwarded,
but I prefer to use my own domain - I probably will move off Gmail at
some point, whereas patten-glew.net will last as long as my marriage.
And, yes, I have thought maybe I should get a non-patten-glew domain.)

Andy...@gmail.com

Sep 17, 2006, 2:43:20 AM

Stephen Fuld wrote:
> "Andy Glew" <Andy...@gmail.com> wrote in message
> news:1158468919.6...@m7g2000cwm.googlegroups.com...
>
> snip
>
> > A secondary concern is that, if you download via POP, you end up back
> > in the PC usage model - you have a single mail reading environment.
> > You lose the ubiquitous access.
>
> Not necessarily. At least on my system, I have the option when downloading
> e-mail via POP to not delete them on the server. Then I would have two
> copies; one on my PC and one on the ISP's mail server. Would this address
> at least your secondary concern?
>

> - Stephen Fuld

I must be grumpy today - I as much as yelled at Del Cecchi, one of my
favorite comp.archers, and now I will similarly chide Stephen.

Apparently you did not read what I posted.

I am aware that you can leave email on the server.

What I am not aware of is a way to tag and annotate the email in a
disconnected manner, and then upload the tags and annotations back to
the server. Perhaps I am unusual in that I do things such as tagging
and annotating email - not just with the automatically deducible
annotations, such as "replied", but also with annotations such as
"checked back by phone".

This is why I was trying to use the phrase "email environment". It is
not just mailreading - it is mail processing and organization.

Yes, I am aware of IMAP, where the client can organize folders that are
stored on the server. I am not, however, aware of an IMAP client that
can do that sort of folder operation while disconnected, and then
resynch when eventually connected.
THIS is something that I may be ignorant of, and I welcome pointers.

Although, I must admit that I have drunk the koolaid, and I now think of
organizing things in folders as sub-optimal.

Anyway, I see no technical difficulty in creating such an off-line tagging
and annotation system. I just am not aware that it exists. Perhaps I
should be learning Google's APIs so that I can write such a system.
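
A minimal sketch of what such an off-line tagging system might look like, under heavy assumptions: tag changes made while disconnected are appended to a local journal, and the journal is replayed against the server when a connection returns. The `OfflineTagJournal` class is hypothetical, and the `server_tags` dictionary merely stands in for whatever upload interface a real mail service would have to provide; no actual Google API is used or implied.

```python
from dataclasses import dataclass, field

@dataclass
class OfflineTagJournal:
    """Records tag changes made while disconnected, for later replay."""
    pending: list = field(default_factory=list)

    def tag(self, message_id, label):
        self.pending.append(("add", message_id, label))

    def untag(self, message_id, label):
        self.pending.append(("remove", message_id, label))

    def replay(self, server_tags):
        """Apply the queued changes, in order, to the server-side store.

        `server_tags` maps message id -> set of labels (an assumed,
        simplified model of the server). Returns the number applied.
        """
        for action, message_id, label in self.pending:
            labels = server_tags.setdefault(message_id, set())
            if action == "add":
                labels.add(label)
            else:
                labels.discard(label)
        applied = len(self.pending)
        self.pending.clear()
        return applied

# While disconnected:
journal = OfflineTagJournal()
journal.tag("msg-1", "checked back by phone")
journal.untag("msg-1", "replied")

# Once reconnected:
server = {"msg-1": {"replied"}}
journal.replay(server)
print(server)
```

Replaying an ordered journal keeps the client simple, at the cost of not handling the case where the same message was also retagged on the server in the meantime; a real system would need some conflict rule there.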

---

Another of my points that y'all seem to have missed: It is not just
email. It is email and news and RSS and messaging and ...

Andy...@gmail.com

Sep 17, 2006, 3:14:02 AM

One fellow responded privately by email, suggesting that I set up my
own Linux box with a web front-end for email access (and news access,
and RSS access, and ...)

I.e. that I install and maintain my own version of Google's
communication management facilities.

Since I already have my own server(s) on the net, this is not
unreasonable. I am not aware of such packages that are as featureful as
Gmail, but my predilection for Open Source software makes me
uncomfortable with the possible proprietariness of a Google
environment.

It does miss one aspect: the professional sysadministration at Google,
compared to my own administered systems. Who is more likely to have a
network crash and be inaccessible, my systems or Google's?

(Aside: there is an opportunity for peer-to-peer filesharing as the
basis of a wide area backup system.)

---

In case it wasn't obvious, the purpose of my original post was not to
solicit advice from comp.archers about how to get a newsfeed or handle
email. I probably know at least as much as most comp.archers about
these topics, and if I were soliciting such advice I would be posting
on different newsgroups where I could reasonably expect to get useful
advice.

No, the reason that I posted was not to solicit trivial advice about
email and news (and RSS, and ...); I posted because this ANECDOTE gets
me thinking about stuff that is comp.arch related: the ways in
which usage patterns for computers will evolve.

I think that such usage patterns are of great relevance to comp.arch.

My example:
I'm a long-time advocate of the autonomous PC usage model,
peer-to-peer, etc., not just because of convenience but for political
reasons. (Inspired by Ted Nelson.)
But I'm switching over to the server based usage model. Not just
because of the attractive pull of the Google model - ubiquity, search
technology, the benefits of having other users train spam. But also
because of something I had not expected: the push away from the PC
caused by corporate IT departments.

Note: it's not that the corporate IT departments are necessarily
deprecating the PC (although I know many IT folk who would love to).
It's just that the corporate IT departments' attempts to "lock down" are
making it increasingly necessary to have 2 PCs, a work PC and a
personal PC. This schizophrenia significantly increases the hassle
factor of the PC usage model, and strongly encourages switching to the
Google usage model for at least one of the schizoid environments - the
personal environment.

I think that the "personal" nature of the PC is being lost as a result
of IT control.

Ironically, I think that Intel's IT department is taking exactly the
sort of step that is leading to the biggest threat to Intel ever. I
suspect, the threat that will supplant Intel.

I can hear people say: "What's he talking about? Google supplanting
Intel? Doesn't Google buy Intel (or at least x86) systems?" Yes, but:
a computer services company such as Google is much more efficient in its
use of computing resources. If all computer uses were to be hosted by
Google tomorrow, magically, then the PC microprocessor marketplace
would be 1/3rd or less than the current size.

By the way, it is not just Google, and this is not just a new fear for
me. I wrote similar diatribes back in the MMX days when people were
talking about videoservers - but at that time, before Google really
existed, I used AT&T as my proxy for what such a centralized computer
services company might be.

---

Again, an interesting side effect: since I switched to Gmail I am
much more willing to use slates and tablets. It may lead to a
flowering of form factors.

---

I think that the above is sufficiently comp.arch relevant. But to make
it more so: what sort of microprocessor would be best for the Google
workload?

Massively multicore CMP? Multithreaded? Or perhaps higher integration
of a relatively small number of CPU cores, memory controllers, network,
and disk controllers?

Maintainability features? Reliability - error detection and
correction? Machine check?

Or, if not for the Google workload, how about the workload of a Google
dominated world? I think that form factor proliferation would further
increase the need for ultra low power.

Terje Mathisen

Sep 17, 2006, 4:30:43 AM

Andy Glew wrote:
> Minorly off-topic, but I feel impelled to note that Intel has just
> ZBB'ed its internal NNTP news servers. Actually, they were ZBB'ed many
> years ago, but volunteers kept them going. Those volunteers may now be
> ZBB'ed. New volunteers may arise; heck, I may; but the path of least
> resistance is to give up on getting USEnet news inside the company, and
> go to some external service. E.g. today I am posting from Google
> Groups.

A somewhat related story:

news.hda.hydro.com has been my posting host since at least 1994 (that's
how far back Google have archives of it), and it has mostly worked very
well.

In fact, when Hydro first got an internet connection at all, it was a
link between our oil research department on the west coast, in Bergen,
and the Bergen university.

This was a 64 kbit/s link, set up specifically to allow Usenet access
for our scientists/researchers!

Over the years, Hydro got much better/faster internet connections, on a
commercial basis, from the old government PTT monopoly (Telenor), but
Usenet was a definite requirement.

Over the years that newsfeed has broken down multiple times, but always
been fixed within a few days to a week, until last year:

At that time Telenor simply stopped all their Usenet servers, while
negotiating a replacement feed from a commercial Usenet provider.

> Generic relevance to comp.arch: this is a trend. Actually two trends.
>
> Trend #1 is that less and less personal computing can be done at work,
> and that more and more work related computing is "freeloading" on
> personally paid for computing.

After several weeks of feeding my local news server via a steadily
varying list of free providers (mostly universities), I gave up and
bought access from GigaNews for something like $8/month, using my
personal MasterCard.

I.e. Hydro's news.hda.hydro.com server is now a Linux box running
Leafnode, pretending to be a single user that logs in to GigaNews every 10
minutes.

> So you have a personal mail service, as well as your work mail
> service. Maybe your personal mail service is from an ISP, and changes
> whenever your ISP changes, you move, or when Qwest gets bought out by
> Verizon. Maybe you have your own domain.

This is the only real option, and has been so for years: Unless you own
your own domain, you will always be a hostage to your current ISP. :-(

> But your company doesn't allow you to run POP across the firewall.

Outgoing SSH (with POP tunneling) still works here; when this stops, I'm
almost certain that some form of SSL tunneling will work more or less
forever.

I.e. if a security crackdown also means that the beancounters can't
access their internet bank and/or their stock portfolio, it won't be
implemented.

> Similarly for newsgroup access: your company doesn't allow you to run
> NNTP across the firewall.

Unless you take over the company news server and get to configure the FW
properly. :-)

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Philip Homburg

Sep 17, 2006, 4:32:35 AM

In article <1158447110.9...@i3g2000cwc.googlegroups.com>,

Andy Glew <Andy...@gmail.com> wrote:
> But, not being able to access my personal email from work - no POP,
>no ssh - was the last straw. I switched to Google mail. Now I can
>access my personal email from any computer - at work, at home, from my
>wife's computer. From my relatives' computers. I no longer need to
>drag my personal laptop around with me.

Assuming you have unlimited access to port 80, you can just as well
set up a webmail thingie at home.

Of course, if you can install arbitrary software at your computer at work,
you can tunnel all sorts of protocols (including POP and NNTP) over
what looks like an ordinary HTTP(S) connection.

--
That was it. Done. The faulty Monk was turned out into the desert where it
could believe what it liked, including the idea that it had been hard done
by. It was allowed to keep its horse, since horses were so cheap to make.
-- Douglas Adams in Dirk Gently's Holistic Detective Agency

Terje Mathisen

Sep 17, 2006, 6:43:12 AM

comp...@patten-glew.net wrote:
> (Aside: there is an opportunity for peer-to-peer filesharing as the
> basis of a wide area backup system.)

In the hope that someone, somewhere will actually implement it, I have
advocated just such a system for a few years:

Peer-to-peer, all users volunteer some amount of free disk space (maybe
in the 1-100 GB range), in return they get about half that amount in
distributed, fault-tolerant backup space.

Each backup chunk would be striped N ways, using N+M = S systems for
redundancy.

Each user maintains a local master table which remembers which S systems
to go to to retrieve any given chunk. As long as at least N of these can
be reached the block is retrievable.
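
A minimal sketch of the N+M idea for the simplest case, M = 1, using plain XOR parity: the chunk is cut into N stripes, one parity stripe is added, and any N of the resulting N+1 pieces are enough to rebuild it. XOR parity is a stand-in chosen for brevity; tolerating more than one lost piece (M > 1) would need a real erasure code, which is not shown here. The function names are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def stripe(chunk, n):
    """Split `chunk` into n equal stripes plus one XOR parity stripe."""
    size = -(-len(chunk) // n)  # ceiling division
    padded = chunk.ljust(size * n, b"\0")
    stripes = [padded[i * size:(i + 1) * size] for i in range(n)]
    parity = reduce(xor_bytes, stripes)
    return stripes + [parity]

def recover(pieces, length):
    """Rebuild the chunk from n+1 pieces, at most one of which is None.

    `length` is the original chunk length, used to strip the padding.
    """
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity survives only one lost piece")
    pieces = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        # XOR of all surviving pieces reproduces the missing one.
        pieces[missing[0]] = reduce(xor_bytes, present)
    return b"".join(pieces[:-1])[:length]

data = b"hello, distributed backup"
pieces = stripe(data, 4)   # 4 data stripes + 1 parity = 5 peers
pieces[2] = None           # one peer has gone away
print(recover(pieces, len(data)) == data)
```

The master table would then record, per chunk, which peer holds which of the N+1 pieces along with the original length.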

A janitor process would walk across the chunk lists, verifying that each
chunk server is accessible. If a server goes away for longer than a
maximum timeout value, a new peer would be picked and the missing chunks
sent to this system instead.

To make it more robust, the janitor process could retrieve crypto hashes
of each block, instead of just verifying that the server is still there.

The janitor process could also mark for deletion each locally stored
remote block that hasn't been verified (by the remote owner) during the
last X days.
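
One sweep of such a janitor could be sketched as follows, assuming a reachability probe and a way to pick a replacement peer are supplied by the caller. Every name here (`janitor_pass`, `is_alive`, `pick_peer`, the table layout) is hypothetical, and the sketch covers only the timeout-and-replace step, not the hash verification or the deletion marking.

```python
def janitor_pass(master_table, is_alive, last_seen, now, timeout, pick_peer):
    """Walk the chunk lists once and replace peers that have been gone too long.

    master_table: chunk id -> list of peer names, one per stored piece
    is_alive:     callable(peer) -> bool, an assumed reachability probe
    last_seen:    peer -> time of last successful contact (updated in place)
    pick_peer:    callable(exclude=set_of_peers) -> name of a fresh peer
    Returns (chunk id, old peer, new peer) moves; the caller must then
    re-send the affected pieces to the new peers.
    """
    moves = []
    for chunk_id, peers in master_table.items():
        for slot, peer in enumerate(peers):
            if is_alive(peer):
                last_seen[peer] = now
            elif now - last_seen.setdefault(peer, now) > timeout:
                replacement = pick_peer(exclude=set(peers))
                peers[slot] = replacement
                moves.append((chunk_id, peer, replacement))
    return moves
```

A peer that is merely offline for a while is left alone until the timeout expires, so a laptop that sleeps overnight does not trigger a re-send of everything it holds.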

One of the really nice features of such a system is that even without
encryption, it would be _really_ hard for an enemy to recover such a
backup. (Using byte or even bit-level striping? )

OTOH, the local chunk server hash list is absolutely critical, and
should be saved regularly onto something like a USB memory stick.

Terje

PS. Afaik, there might be an open source initiative which works on
something similar to this?

Joe Seigh

Sep 17, 2006, 8:20:50 AM

comp...@patten-glew.net wrote:
[...]

>
> I think that the above is sufficiently comp.arch relevant. But to make
> it more so: what sort of microprocessor would be best for the Google
> workload?
>
> Massively multicore CMP? Multithreaded? Or perhaps higher integration
> of a relatively small number of CPU cores, memory controllers, network,
> and disk controllers?

My impression was that Google was pretty heavily into the distributed
model. Any emphasis on multi-threading would just detract from this
model.

>
> Maintainability features? Reliability - error detection and
> correction? Machine check?

They're into cheap replaceable computers. Individual failures
can be tolerated as the system has redundancy built in and Google
queries are guaranteed to be precise. As long as they can be detected
and replaced on a timely basis, everything is ok. Detection can be
done externally and does not have to be built in.

If you wanted to sell multi-core to Google, you'd have to make it
look like a bunch of networked single cpu computers. You can use
a VM operating system to do that. You'd need network hardware that
is efficiently virtualized. Failing cpus can be offlined until the system
degrades to a point where it has to be replaced.

The economies of such a system would be
- less space per system
- less energy usage (big at Google)
- less expensive network infrastructure required.

The last is important. When you do distributed computing it's not
just the cost of the cheap commodity computers that matters. There's
the network infrastructure like expensive high throughput switches and
nics. More if you want low latency (maybe not as important at Google,
I don't know). So a 32 way multi-core uses 31 fewer ports and nics.
That's a 32X savings there. Any intra virtual machine network traffic
would be more efficient than the network hardware given shared memory
latencies and bandwidth.
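
The port arithmetic can be spelled out in a few lines, assuming (as the paragraph above does) that each physical box needs exactly one switch port and one NIC regardless of how many cores it holds. The function name is hypothetical.

```python
def network_ports(logical_nodes, cores_per_box):
    """Switch ports (and NICs) needed when each physical box takes one port."""
    return -(-logical_nodes // cores_per_box)  # ceiling division

single = network_ports(32, 1)   # 32 separate single-cpu machines
multi = network_ports(32, 32)   # one 32-way multi-core box
print(single - multi, single // multi)  # prints: 31 32
```

So 31 fewer ports is the same statement as a 32X reduction: 32 ports become 1.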

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Terje Mathisen

Sep 17, 2006, 9:46:48 AM

comp...@patten-glew.net wrote:
> I think that the above is sufficiently comp.arch relevant. But to make
> it more so: what sort of microprocessor would be best for the Google
> workload?

Whatever cpu+memory system that provides the most MIPS/Watt?

>
> Massively multicore CMP? Multithreaded? Or perhaps higher integration
> of a relatively small number of CPU cores, memory controllers, network,
> and disk controllers?
>
> Maintainability features? Reliability - error detection and
> correction? Machine check?

Google runs all of these things at a level way above individual cpus and
motherboards, they are really heavily into RAIN:

Redundant Arrays of Inexpensive Nodes.

All the expensive redundancy stuff, like hotswap/redundant power
supplies, battery-backed raid disk controllers etc becomes almost
trivial when a Gbit Ethernet plug is your unit access point.

They do stuff like making their own motherboards with a single 12 V
power supply, just to get lower conversion heat loss, and they install
naked MBs into racks, with the AC units ducting air directly into these
racks, with no need to cool a lot of (pizza) box enclosures.

>
> Or, if not for the Google workload, how about the workload of a Google
> dominated world? I think that form factor proliferation would further
> increase the need for ultra low power.

As I said above, mips/watt is already a very (or possibly the most)
important factor when Google sets up new server farms. They expect to
pay as much or even more for energy to run the servers than it costs to
build/buy/install them!

Saving power by mostly idling doesn't work when you actually have
something for all those cpus to do.

Terje

John Dallman

Sep 17, 2006, 10:29:00 AM

In article <1158447110.9...@i3g2000cwc.googlegroups.com>,
Andy...@gmail.com (Andy Glew) wrote:

> Summing up:
> Prompted by: Intel abandoning internal news servers.

> Hypothesis: there is a trend away from PC based computing and
> information services, towards centralized computer services like
> Google.

In general, this is undoubtedly true. However, I think the causes are a
little different:

I encounter an awful lot of people these days who point-blank refuse to
use Usenet. They have never heard of it, and when it's explained, regard
it as ancient technology. They feel that anything web-based is,
obviously and axiomatically, so superior that they should not consider
anything that isn't browser-based for discussion-style communication.
Hence the outbreak of the blogosphere, and many other social phenomena.
Hell, they prefer webmail to using an e-mail program, and web forums to
mailing lists. And that's a preference about as strong as your or my
preference for using a computer over a typewriter and carbon paper; they
get positively insulted by anything other than a web-based system.

Now, I theorise that if you're only using browser-based methods, you're
less conscious of where the data is stored and how much control you
have, or don't have, over it. You care a great deal about having
constant access to it, and this seems to me to lie behind the various
movements that feel that wireless connectivity all the time, everywhere,
cheaply, is something approaching a basic human right. Because if you
don't have that, you're cut off from your "external mind", and losing
access to part of your mind does feel like a personal violation.

As a counter-example, your keeping everything on a laptop means that
your "external mind" is in that device. Provided you aren't deprived of
it, you're happy, and your access to it is protected by simple property
rights.

So if people with the browser-based mindset are responsible for IT
services at Intel, it isn't surprising that they're wanting to suppress
Usenet servers - "why would anyone sensible want them?" - and expect you
to have home broadband - "of course everyone has it, it's like having
running water".

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Del Cecchi

Sep 17, 2006, 11:56:39 AM

"Andy Glew" <Andy...@gmail.com> wrote in message

news:1158468919.6...@m7g2000cwm.googlegroups.com...

Sorry, I thought you were also complaining about having to be connected
to the net to do your email. Seems I told you what you already knew, but
not being a mind reader I didn't realize that from what you wrote.

del

Bill Todd

Sep 17, 2006, 11:58:20 AM

Joe Seigh wrote:
> comp...@patten-glew.net wrote:
> [...]
>>
>> I think that the above is sufficiently comp.arch relevant. But to make
>> it more so: what sort of microprocessor would be best for the Google
>> workload?
>>
>> Massively multicore CMP? Multithreaded? Or perhaps higher integration
>> of a relatively small number of CPU cores, memory controllers, network,
>> and disk controllers?
>
> My impression was that Google was pretty heavily into the distributed
> model. Any emphasis on multi-threading would just detract from this
> model.

Google is into cost-effective, and how best to attain it. That almost
by definition means multi-threaded workloads within that multi-node
environment: the rest of each node is far too relatively expensive to
leave it idle waiting for disk accesses to complete, when it could be
serving multiple disks in parallel.

(Yes, you could use the old Unix kludge of using multiple processes
rather than multiple threads, but that's essentially the same thing with
somewhat inferior performance.)
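
A minimal sketch of the point about not leaving a node idle during disk waits, written in Python purely for illustration. `read_block` is a hypothetical stand-in that simulates a blocking disk access with a sleep, and the 0.05 s delay is an assumed figure; none of this is Google's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_block(disk_id, delay=0.05):
    """Hypothetical stand-in for a blocking disk read (simulated with sleep)."""
    time.sleep(delay)
    return f"disk{disk_id}:block"


def read_serial(disk_ids):
    """One thread: the node waits out every disk access in turn."""
    return [read_block(d) for d in disk_ids]


def read_threaded(disk_ids):
    """One thread per disk: the waits overlap instead of adding up."""
    with ThreadPoolExecutor(max_workers=len(disk_ids)) as pool:
        return list(pool.map(read_block, disk_ids))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

With four simulated disks, `timed(read_serial, [0, 1, 2, 3])` should take roughly four delays and `timed(read_threaded, [0, 1, 2, 3])` roughly one. The same overlap can be had with multiple processes, as the parenthetical above says.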

...

> If you wanted to sell multi-core to Google, you'd have to make it
> look like a bunch of networked single cpu computers.

Not at all - see above. Given the economics of the situation, it seems
likely that they'll add cores and disks up to the point where some other
system bottleneck (such as bus or memory or cheap network bandwidth)
becomes the limiting factor: that *might* occur with only a single core
in some of their nodes (such as the page-serving ones that mostly
shuffle data from disk to the network), but quite possibly not in the
indexing nodes.

- bill

Niels Jørgen Kruse

Sep 17, 2006, 12:38:14 PM

John Dallman <j...@cix.co.uk> wrote:

> I encounter an awful lot of people these days who point-blank refuse to
> use Usenet. They have never heard of it, and when it's explained, regard
> it as ancient technology. They feel that anything web-based is,
> obviously and axiomatically, so superior that they should not consider
> anything that isn't browser-based for discussion-style communication.
> Hence the outbreak of the blogosphere, and many other social phenomena.
> Hell, they prefer webmail to using an e-mail program, and web forums to
> mailing lists. And that's a preference about as strong as your or my
> preference for using a computer over a typewriter and carbon paper; they
> get positively insulted by anything other than a web-based system.

Partly that is probably due to USEnet being very conservative. Over the
last 10 years, the only development I see is character sets above 7 bits
getting accepted. No rich content (except for binary newsgroups which
have terrible retention) and that is connected to the lack of a
newsreader war.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Seongbae Park

Sep 17, 2006, 2:48:00 PM

comp...@patten-glew.net wrote:
> Anyway, I see no technical detail in creating such an off-line tagging
> and annotation system. I just am not aware that it exists. Perhaps I
> should be learning Google's APIs so that I can write such a system.

I don't know the API, but I believe you can access Gmail offline
if you install the latest Google Desktop. I haven't tried it myself
though.

Seongbae

Seongbae Park

Sep 17, 2006, 3:30:15 PM

comp...@patten-glew.net wrote:
...

> I think that the above is sufficiently comp.arch relevant. But to make
> it more so: what sort of microprocessor would be best for the Google
> workload?

A somewhat dated article about Google's infrastructure:

http://barroso.org/publications/ieeemicro_google.pdf

Seongbae

Seongbae Park

Sep 17, 2006, 3:56:49 PM

Probably not as sophisticated as you want but related APIs and code
that may serve as a basis for what you want to do:

http://code.google.com/p/hosted-gmail-client/
http://code.google.com/p/gmail-backup/

Seongbae

Eric P.

Sep 17, 2006, 5:59:37 PM

Andy Glew wrote:
>
> Eric P. wrote:
> > Apparently newsgroups aren't part of the whole Web 2.0 experience.
>
> I have pondered whether the USEnet, the peer to peer forwarding model,
> still has any role in the future.

The rationale is likely more mundane than the deep thoughts you're having.

The reason they all decommissioned newsgroup support is probably that
they all viewed it as a cost center. In the case of these ISPs, they
guessed they could cut this cost and lose few customers.
They were probably right.

Google sees it, for now, as a revenue center. They do that because
they can sell and show ads, and they can only do that because they
use a browser interface. If that revenue changes, or people stop
visiting their news search engine (because no one is posting messages
because their ISPs cut off access), they'll drop Usenet too.

I don't see how Google could offer the interfaces or APIs you
suggest in your other msg, as it breaks their revenue stream,
except as a loss leader maybe.

> Could we not simply assume that everyone is connected to the Internet
> at all times?
>
> Maybe... on Earth. But clearly once we get to solar system or galactic
> scale Internets, the round-trip latencies would be unacceptable. The
> store and forward model is the only reasonable way to move forward at
> such timescales.

It is not just latency. Cell phones allow people to be connected to
the phone system at all times, but they still get shut off for meetings
and at night. So it needs both interactive and batch modes.

But also some protocols become essentially useless if you inject even
just a few seconds of delay (like bouncing off a satellite to Japan).
You need very large buffer windows, which works out well if you
already have a stored copy so you don't need to keep it all in memory.
If you need to retransmit, retrieve the stored copy.
At the receiver you are going to save into a file anyway so you are
less sensitive to retransmission and partial message reassembly issues.

So a protocol designed for store and forward should be much less
sensitive to transmission window full problems than other protocols.
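
The "very large buffer windows" point can be put in numbers with the bandwidth-delay product. This is only a sketch: the 10 Mbit/s link, the 0.5 s round trip and the 64 KiB window are assumed figures chosen for illustration, not measurements of any real link.

```python
def window_needed_bytes(bandwidth_bits_per_s, round_trip_s):
    """Bytes that must be in flight (and buffered) to keep the link full."""
    return bandwidth_bits_per_s * round_trip_s / 8


def throughput_limit_bits_per_s(window_bytes, round_trip_s):
    """Best-case throughput when at most window_bytes may be unacknowledged."""
    return window_bytes * 8 / round_trip_s


# Assumed figures: a 10 Mbit/s link with a 0.5 s round trip.
needed = window_needed_bytes(10_000_000, 0.5)          # 625000.0 bytes
# A 64 KiB window over the same round trip.
capped = throughput_limit_bits_per_s(64 * 1024, 0.5)   # 1048576.0 bit/s
```

Under these assumptions a sender limited to a 64 KiB window gets about a tenth of the link. A store-and-forward sender that already holds the whole article on disk can treat the file itself as the retransmission buffer.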

> But that does not mean that the old USEnet way will not die out, only
> to be revived a century or a millennium from now.

I would prefer if they got the replacement up and running first,
_then_ decommissioned the "legacy" system.

> Ironically, the same sort of consideration applies to memory ordering
> models for MP and CMP systems. For tightly bound systems strong
> ordering such as TSO and SC is implementable, and clearly more
> convenient for programmers. But for larger systems weaker consistency
> models are more appropriate.

It's the weakest model possible. It says only that messages propagate
to all nodes, along some path, in some arbitrary order, at some time,
possibly infinite, since the 'cloud' (I saw that term applied, which
should tell you something) can change configuration at any instant.
Like a power grid what matters is local voltage (content) differences
and current (information) flow to your nearest neighbors.
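
A toy sketch of that propagation model, assuming flooding between neighbours with duplicate suppression keyed on a message ID. The node names, the topology and the `flood` helper are made up for illustration; real news servers are considerably more involved.

```python
from collections import deque


def flood(peers, origin, message_id):
    """Flood one article from origin; return (per-node history, offers made).

    peers maps each node name to the neighbours it feeds.
    """
    history = {node: set() for node in peers}  # message IDs each node has seen
    history[origin].add(message_id)
    queue = deque([origin])
    offers = 0
    while queue:
        node = queue.popleft()
        for neighbour in peers[node]:
            offers += 1
            if message_id in history[neighbour]:
                continue  # duplicate: neighbour already has it, by whatever path
            history[neighbour].add(message_id)
            queue.append(neighbour)
    return history, offers
```

Each node only ever talks to its neighbours and only checks what it has locally, yet the article ends up everywhere that is reachable. Nothing in the sketch fixes the path or the order of arrival.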

Eric

Andrew Reilly

Sep 18, 2006, 12:42:49 AM

On Sat, 16 Sep 2006 15:51:51 -0700, Andy Glew wrote:
> Summing up:
> Prompted by: Intel abandoning internal news servers.
> Hypothesis: there is a trend away from PC based computing and
> information services, towards centralized computer services like
> Google.
>
> The computer industry battle is not Intel versus AMD, or Microsoft
> versus Linux. It is the PC versus Google.
>
> (Here, I use "Google" as representative of web based computing
> services, ubiquitously accessible so long as you have Internet access.)

No scope for the "home information server/appliance closet" alternative?

I've used my home server as an e-mail aggregator for several years, now.
I'm happy with IMAP-SSL as an access mechanism, although it's not quite as
immediate as a web based interface. I know that I could add one of
several web front-ends (probably not as flashy as google or yahoo) if I
wanted to, but for now most PCs that I come across already have an IMAP
capable mail reader on them, or Thunderbird can be downloaded.

I've noticed several "file server" or "internet appliance" -in-a-box
appliances being advertised recently, that could probably do the job.

My "main" access point is a Mac laptop, and that does searching pretty
nicely. Other interfaces don't search quite as well, but there's glimpse
and other full text back-ends that are getting fairly shiny.

Sure, the google approach has a lot going for it, but I'm just not
comfortable with the notion that my personal e-mail content is probably
being mined to sell things to me.

(I switched NNTP servers to news.individual.net some years ago when my
home ISP's server kept being black-listed. That had the added advantage
(?) of being able to read comp.arch from the office.)

Cheers,

--
Andrew

Jan Vorbrüggen

Sep 18, 2006, 4:23:06 AM

> I do not know... but I predict that at least some fraction of companies
> will just outsource their employees email, etc., to Google.

I wouldn't want to write the SLA that would be required for this to happen,
and once somebody had written one to my liking, I doubt Google et al.
would sign it. Employees' correspondence is too close to a company's
core business to easily outsource. Heck, I send out offers and get back
confirmations by e-mail... should the e-mail service provider lose such
documents, my company has a problem - a BIG problem.

> I keep meaning to take an income deduction for my broadband for tax
> purposes, since the logs plainly show that it is mostly used for work,
> not pleasure.

Hereabouts, the equivalent of the IRS will accept a 20% charge of all telecom
costs as being work-related without any evidence. Only if you want to charge
a higher fraction would you need to provide such.

Jan

Jean-Marc Bourguet

Sep 18, 2006, 5:26:46 AM

Jan Vorbrüggen <jvorbr...@not-mediasec.de> writes:

>> I do not know... but I predict that at least some fraction of companies
>> will just outsource their employees email, etc., to Google.
>
> I wouldn't want to write the SLA that would be required for this to happen,
> and once somebody had written one to my liking, I doubt Google et al.
> would sign it. Employees' correspondence is too close to a company's
> core business to easily outsource. Heck, I send out offers and get back
> confirmations by e-mail... should the e-mail service provider lose such
> documents, my company has a problem - a BIG problem.

I wonder what kind of agreement services like Postini sign.

Yours,

--
Jean-Marc

Rob Warnock

Sep 18, 2006, 5:51:24 AM

Andy Glew <Andy...@gmail.com> wrote:
+---------------

| I have pondered whether the USEnet, the peer to peer forwarding model,
| still has any role in the future.

...

| But clearly once we get to solar system or galactic scale Internets,
| the round-trip latencies would be unacceptable. The store and forward
| model is the only reasonable way to move forward at such timescales.

+---------------

1. Walk with me for a moment back to the days of 1200-Baud dialup,
when most [though certainly not all] of netnews was Usenet,
typically delivered via "C News"[1] over UUCP, rather than NNTP.
It was not unusual to see delays of *days* between the posting
of an article and seeing the response(s). Responses often appeared
out of order, and proper quoting was needed to avoid confusing
those who might not have seen the intervening responses yet.
That model [albeit probably with three decimal orders of magnitude
better bandwidth] would probably survive interplanetary delays
within the Sol system quite nicely, thank you.

2. Some of the funniest bits in Vernor Vinge's novel "A Fire Upon
The Deep" are the interstellar netnews exchanges. In brilliant
anticipation of today's bloggers, the interstellar netnews is
often referred to as "The Net of a Million Lies". And just to
show some things never change, there's one persistent sender who
keeps popping up from time to time with typical clueless newbie
comments, like: "I may have missed some of the discussion [my net
connection is very low bandwidth], but assuming my understanding
is correct that these 'humans' have six legs, then..."

And of course, all of this may be moot if we are ever able to
make use of entangled quantum pairs for long-distance instantaneous
communications... ;-}

-Rob

[1] http://www.faqs.org/faqs/news/software/b/cnews/

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Ketil Malde

Sep 18, 2006, 6:49:59 AM

"Andy Glew" <Andy...@gmail.com> writes:

> This leads to Trend #2: Google. More generically, the rebirth of "Big
> Iron", centralized, computer service companies.

Yes. When working out how to adapt to the latest internal
reconfigurations, I notice more and more colleagues "just use Gmail".
They forward everything there - business or private - and have a single
interface to it.

Same for calendars, photo management, and what have you. I remember
early on when banks wanted you to run a specific application. Now
it's all web-based, of course.

The PC has failed abysmally at one thing: ease of maintenance. After
experiencing too many breakdowns due to malware, misconfiguration and
incompatibilities, I suspect people find web based applications
refreshingly simple and reliable, as well as universally available.
(This is not a Windows flame, much of the same goes for Unix as
well - my Gmail using colleagues are on Linux).

It remains to be seen whether this will last - what happens when
people want to extract information that they feel they own, but the
provider won't relinquish? If Google goes bankrupt, what information
can be sold to whom, and used for what purposes?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

ken...@cix.compulink.co.uk

Sep 18, 2006, 7:12:13 AM

In article
<1158447110.9...@i3g2000cwc.googlegroups.com>,
Andy...@gmail.com (Andy Glew) wrote:

> Hypothesis: there is a trend away from PC based computing and
> information services, towards centralized computer services
> like Google.

It's an interesting theory. However in my case computing is a
hobby not a job and currently I do not have broadband at home.
However they recently installed free broadband and a computer at
work for staff to use during work breaks. I work for a train
operating company. I do not need continuous access to mail or
news. I use an offline reader for both, though of course that
sometimes will delay any response. I certainly have found no urge
to use google much unless I need to check the group archives.

Ken Young

ken...@cix.compulink.co.uk

Sep 18, 2006, 7:12:14 AM

In article <1hltn41.f67szqcu9bh5N%nos...@ab-katrinedal.dk>,
nos...@ab-katrinedal.dk (Niels Jørgen Kruse)
wrote:

> the only development I see is character sets above 7 bits
> getting accepted. No rich content (except for binary
> newsgroups which have terrible retention) and that is connected
> to the lack of a newsreader war.

That is probably a good thing. I use an off line reader and a
dial up connection. With the groups (none binary) I am in I can
end up downloading several megabytes a day of news. Most of that
is due to massive cross posting. It is still possible to
*emphasise* words of course. However if rich content was allowed
the amount of data involved would probably increase to the point
where nobody including Google could run a news server.

Ken Young

Joe Seigh

Sep 18, 2006, 7:44:45 AM

Bill Todd wrote:

> Joe Seigh wrote:
>
>>
>>
>> My impression was that Google was pretty heavily into the distributed
>> model. Any emphasis on multi-threading would just detract from this
>> model.
>
>
> Google is into cost-effective, and how best to attain it. That almost
> by definition means multi-threaded workloads within that multi-node
> environment: the rest of each node is far too relatively expensive to
> leave it idle waiting for disk accesses to complete, when it could be
> serving multiple disks in parallel.
>
> (Yes, you could use the old Unix kludge of using multiple processes
> rather than multiple threads, but that's essentially the same thing with
> somewhat inferior performance.)
>
> ...
>
>> If you wanted to sell multi-core to Google, you'd have to make it
>> look like a bunch of networked single cpu computers.
>
>
> Not at all - see above. Given the economics of the situation, it seems
> likely that they'll add cores and disks up to the point where some other
> system bottleneck (such as bus or memory or cheap network bandwidth)
> becomes the limiting factor: that *might* occur with only a single core
> in some of their nodes (such as the page-serving ones that mostly
> shuffle data from disk to the network), but quite possibly not in the
> indexing nodes.
>

They're likely using some simple multi-threading, but I doubt the
approach they're using will scale up to 32 cores very well. And
I doubt they're into multi-threading, as opposed to distributed
computing, enough to change that approach. So they'll take some
approach to partition the multi-core into a bunch of independently
executing units. Much like Sun uses their Niagara multi-cores
to run a bunch of independent Java virtual machines. It could
be VM or it could be a bunch of independent processes using
multi-threading or multiplexing or whatever.

Or maybe not. Who knows what they'll actually do. That's
just my feeling of which of the multiprocessing models
Google is biased towards.

Niels Jørgen Kruse

Sep 18, 2006, 8:12:56 AM

<ken...@cix.compulink.co.uk> wrote:

Binary newsgroups completely dwarf the nonbinary ones.

Google is in a position to do something about rich(er) content. They
could construct an interface for putting images into posts and host the
images themselves. When showing a post on their web interface they could
check links embedded in < > and embed the target on the page if they
point to simple images.
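
A rough sketch of the "check links embedded in < >" step, in Python. The extension list, the regular expression and the `image_links` helper are illustrative assumptions, not anything Google Groups is known to do, and a real implementation would want to check the content type rather than trust the file name.

```python
import re
from urllib.parse import urlparse

# Assumed notion of a "simple image": decided by file extension only.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")

# A URL written as <http://...> with no spaces or nested brackets inside.
LINK_IN_BRACKETS = re.compile(r"<(https?://[^<>\s]+)>")


def image_links(post_text):
    """Return the bracketed links in a post whose path looks like an image."""
    links = LINK_IN_BRACKETS.findall(post_text)
    return [
        url for url in links
        if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)
    ]
```

A web front end could then render each returned URL inline and leave every other link as plain text, so the article itself stays ordinary 7- or 8-bit text for everyone else.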

Bill Todd

Sep 18, 2006, 8:35:20 AM

Why, exactly? Google's main activity is presumably read-only queries,
and since IIRC they perform most of their updates in a different context
than that in which the queries execute (and then switch the query
activity over to the new context when the updates have completed)
there's virtually no contention among query threads.

That scales almost *perfectly* to as many cores as they can keep active
- which (as I already observed) may not be that many in the nodes where
they're shuffling data between disk and network but could easily be
quite a few when they're performing index look-ups.
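
To make the "update in a different context, then switch" idea concrete, here is a small Python sketch. `SnapshotIndex` and its methods are hypothetical names, and the sketch leans on the assumption that rebinding a single attribute is atomic (true in CPython); it is not a description of how Google actually does it.

```python
class SnapshotIndex:
    """Readers see one complete index version; updates are built off to the side."""

    def __init__(self, index):
        self._current = dict(index)  # never mutated after publication

    def lookup(self, term):
        snapshot = self._current  # one reference read; no lock taken
        return snapshot.get(term, [])

    def publish(self, new_index):
        """Switch queries to a fully built replacement index."""
        self._current = dict(new_index)  # single reference rebind
```

Query threads never write to shared state, so adding more of them adds no contention. The only shared write is the one rebind in `publish`.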

I'll freely admit that I haven't studied 'GFS' in any detail - but have
you looked at it at *all*, or are you just completely winging it here on
zero data?

- bill

Terje Mathisen

Sep 18, 2006, 8:58:03 AM

Google had a 'The Drinks are on Google' evening at a local hotel last
week, where they started with a few presentations.

The Google File System is definitely based on massive redundancy, at
least 3 copies of everything critical, but more commonly 10-20.

Processing happens almost always on a node which has a local copy of the
relevant data, i.e. disk traffic very rarely has to cross the network.

Data is stored on IDE/SATA disks exclusively, and those disks are
treated as tape, i.e. no random access.

All disk IO happens in chunks, typically 64 MB, and their database
tables work with segments of 100-200 MB. If a chunk server with 1000
chunks (say 150 GB) goes down, a controlling node will redistribute the
data across maybe 100 other nodes, meaning that full redundancy can be
re-established in a number of seconds.
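
A back-of-the-envelope check of that "number of seconds" figure, in Python. The 150 GB and 100 nodes come from the paragraph above; the 1 Gbit/s per-node link and the idea that every receiving node copies its share in parallel at full line rate are assumptions made for the sake of the arithmetic.

```python
def recovery_seconds(total_bytes, receiving_nodes, link_bits_per_s):
    """Time to re-replicate if each receiving node pulls an equal share in parallel."""
    per_node_bytes = total_bytes / receiving_nodes
    return per_node_bytes * 8 / link_bits_per_s


GBIT = 1_000_000_000  # assumed per-node link speed, bits per second

spread_out = recovery_seconds(150e9, 100, GBIT)  # 12.0 seconds
single_node = recovery_seconds(150e9, 1, GBIT)   # 1200.0 seconds, i.e. 20 minutes
```

Under those assumptions, spreading the work over 100 nodes turns a 20 minute copy into roughly 12 seconds, which fits the claim above.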

Alex Colvin

Sep 18, 2006, 1:32:08 PM

>| I have pondered whether the USEnet, the peer to peer forwarding model,
>| still has any role in the future.

See ongoing work on Delay/Disruption-Tolerant Networks. Interstellar or
intermittent links.

>| But clearly once we get to solar system or galactic scale Internets,
>| the round-trip latencies would be unacceptable. The store and forward
>| model is the only reasonable way to move forward at such timescales.

>And of course, all of this may be moot if we are ever able to

>make use of entangled quantum pairs for long-distance instantaneous
>communications... ;-}

So far, it looks like entanglement cannot be used to transfer information
faster-than-light. But it should work fine for Usenet ;-)

--
mac the naïf

Rick Jones

Sep 18, 2006, 1:32:31 PM

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> They do stuff like making their own motherboards with a single 12 V
> power supply, just to get lower conversion heat loss, and they
> install naked MBs into racks, with the AC units ducting air directly
> into these racks, with no need to cool a lot of (pizza) box
> enclosures.

I wonder what that does to the EFI (term?) in their machine rooms.

rick jones
--
web2.0 n, the dot.com reunion tour...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Del Cecchi

Sep 18, 2006, 1:37:05 PM

Rick Jones wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>>They do stuff like making their own motherboards with a single 12 V
>>power supply, just to get lower conversion heat loss, and they
>>install naked MBs into racks, with the AC units ducting air directly
>>into these racks, with no need to cool a lot of (pizza) box
>>enclosures.
>
>
> I wonder what that does to the EFI (term?) in their machine rooms.
>
> rick jones

Electronic Fuel Injection? EMI, Electro Magnetic Interference?

Interesting question as to FCC regulations. Since they are not selling
them, or installing them around the public, do these systems have to comply?

--
Del Cecchi
"This post is my own and doesn't necessarily represent IBM's positions,
strategies or opinions."

Rick Jones

Sep 18, 2006, 1:54:52 PM

Del Cecchi <cecchi...@us.ibm.com> wrote:

> Rick Jones wrote:
>> I wonder what that does to the EFI (term?) in their machine rooms.

> Electronic Fuel Injection?

Nah, replacement for BIOS :) I've too many TLAs in my head this
morning :)

> EMI, Electro Magnetic Interference?

Yeah, that one.

> Interesting question as to FCC regulations. Since they are not
> selling them, or installing them around the public, do these systems
> have to comply?

Indeed a good question. I was also wondering what it might do to the
systems just above and below in the rack, and to the people in the
room etc.

rick jones
--
a wide gulf separates "what if" from "if only"

Del Cecchi

Sep 18, 2006, 2:31:34 PM

Rick Jones wrote:
> Del Cecchi <cecchi...@us.ibm.com> wrote:
>
>>Rick Jones wrote:
>>
>>>I wonder what that does to the EFI (term?) in their machine rooms.
>>
>>Electronic Fuel Injection?
>
>
> Nah, replacement for BIOS :) I've too many TLAs in my head this
> morning :)
>
>
>>EMI, Electro Magnetic Interference?
>
>
> Yeah, that one.
>
>
>>Interesting question as to FCC regulations. Since they are not
>>selling them, or installing them around the public, do these systems
>>have to comply?
>
>
> Indeed a good question. I was also wondering what it might do to the
> systems just above and below in the rack, and to the people in the
> room etc.
>
> rick jones

I think the primary limit for the FCC regulations is not interfering
with communication services. I think the regs are quite a ways from
health and safety limits, although there used to be some disagreement
about what appropriate limits are. It is a long ways from functional
problems.

Terje Mathisen

Sep 18, 2006, 2:31:16 PM

Rick Jones wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>> They do stuff like making their own motherboards with a single 12 V
>> power supply, just to get lower conversion heat loss, and they
>> install naked MBs into racks, with the AC units ducting air directly
>> into these racks, with no need to cool a lot of (pizza) box
>> enclosures.
>
> I wonder what that does to the EFI (term?) in their machine rooms.

They didn't show any actual photos, but I did ask another question where
I presumed each rack was enclosed in a (metal) frame, i.e. the AC duct.

The presenter mentioned that getting 20 C differential between incoming
and outgoing air is a lot more efficient than a normal computer room
where he claimed 2 C was normal.
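
Why the larger differential helps can be seen from the heat carried by moving air, P = rho * cp * Q * dT. The Python sketch below uses approximate room-temperature values for air and an assumed 10 kW heat load; the load is a made-up figure for illustration, not something the presenter gave.

```python
AIR_DENSITY_KG_M3 = 1.2            # approximate, near room temperature
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # approximate


def airflow_m3_per_s(heat_watts, delta_t_kelvin):
    """Volume of air per second needed to carry heat_watts away at a given rise."""
    return heat_watts / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_kelvin)


# Assumed 10 kW load: about 0.41 m^3/s at a 20 C rise, about 4.1 m^3/s at 2 C.
ducted = airflow_m3_per_s(10_000, 20)
open_room = airflow_m3_per_s(10_000, 2)
```

For the same heat, a tenfold larger temperature rise means a tenfold smaller volume of air to move, whatever the actual load is.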

Rick Jones

Sep 18, 2006, 2:44:57 PM

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> The presenter mentioned that getting 20 C differential between
> incoming and outgoing air is a lot more efficient than a normal
> computer room where he claimed 2 C was normal.

Air? I thought that for most non-trivial machine rooms, it was
chilled water that came to heat exchangers in the room? Perhaps that
was air going into the heat exchanger versus the air coming-out? If
so, that would seem to suggest to my not-sufficiently-well-versed
brain that parts, if not the entire room were running pretty hot, or
that the heat exchangers were emitting some rather cold air.

rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window

Terje Mathisen

Sep 18, 2006, 2:35:26 PM

Rick Jones wrote:
> Del Cecchi <cecchi...@us.ibm.com> wrote:

>> EMI, Electro Magnetic Interference?
>
> Yeah, that one.
>
>> Interesting question as to FCC regulations. Since they are not
>> selling them, or installing them around the public, do these systems
>> have to comply?
>
> Indeed a good question. I was also wondering what it might do to the
> systems just above and below in the rack, and to the people in the
> room etc.

It shouldn't matter:

Each rack carries nothing but a bunch of identical systems/boards, i.e.
there is nothing else that can be bothered.

Humans are presumably protected by the outer frame (air duct) that keeps
the AC air inside the rack.

When you're installing identical systems by the 100s to 10Ks, the need
to keep a single rack (with room for maybe 40-60 units?) homogeneous is
no trouble at all.

John Dallman

Sep 18, 2006, 3:15:00 PM

In article <1hlv5ur.dy0czs15w5fcwN%nos...@ab-katrinedal.dk>,
nos...@ab-katrinedal.dk (Niels Jørgen Kruse) wrote:

> Google is in a position to do something about rich(er) content. They
> could construct an interface for putting images into posts and host
> the images themselves. When showing a post on their web interface they
> could check links embedded in < > and embed the target on the page if
> they point to simple images.

Firstly, what's their motive? Secondly, that creates the chance for
someone to do a fill-all-the-disk-space attack on them. You'd need
ludicrous amounts of bandwidth to do it, but it's possible.

Thirdly, would good use be made of it? Looking at this newsgroup for an
example, I strongly suspect that we'd see very few diagrams drawn to
illustrate points, and a load of "Pictures of my cool casemod" from
people who wander in and don't understand the group.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Del Cecchi

Sep 18, 2006, 3:25:18 PM

Terje Mathisen wrote:
> Rick Jones wrote:
>
>> Del Cecchi <cecchi...@us.ibm.com> wrote:
>>
>>> EMI, Electro Magnetic Interference?
>>
>>
>> Yeah, that one.
>>
>>> Interesting question as to FCC regulations. Since they are not
>>> selling them, or installing them around the public, do these systems
>>> have to comply?
>>
>>
>> Indeed a good question. I was also wondering what it might do to the
>> systems just above and below in the rack, and to the people in the
>> room etc.
>
>
> It shouldn't matter:
>
> Each rack carries nothing but a bunch of identical systems/boards, i.e.
> there is nothing else that can be bothered.
>
> Humans are presumably protected by the outer frame (air duct) that keeps
> the AC air inside the rack.
>
> When you're installing identical systems by the 100s to 10Ks, the need
> to keep a single rack (with room for maybe 40-60 units?) homogeneous is
> no trouble at all.
>
> Terje

what does homogeneous have to do with it?

Jan-Frode Myklebust

Sep 18, 2006, 3:28:26 PM

On 2006-09-17, Andy Glew <Andy...@gmail.com> wrote:
>
> Tell me how to persuade Google groups to use my comp.arch AT NOSPAM
> patten-glew.net email address

Get patten-glew.net hosted at google, and I think you will
be able to create the @patten-glew.net accounts you want
and post from them to groups.

https://www.google.com/a/

-jf

Tim Bradshaw

Sep 18, 2006, 4:12:59 PM

On 2006-09-18 19:44:57 +0100, Rick Jones <rick....@hp.com> said:

> Air? I thought that for most non-trivial machine rooms, it was
> chilled water that came to heat exchangers in the room? Perhaps that
> was air going into the heat exchanger versus the air coming-out? If
> so, that would seem to suggest to my not-sufficiently-well-versed
> brain that parts, if not the entire room were running pretty hot, or
> that the heat exchangers were emitting some rather cold air.

I think it must mean air entering a cabinet versus air leaving it
(which is, perhaps, the same thing as air leaving the heat exchanger vs
air entering it).

--tim

Rick Jones

Sep 18, 2006, 4:40:33 PM

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> Each rack carries nothing but a bunch of identical systems/boards,
> i.e. there is nothing else that can be bothered.

ISTR hearing about all sorts of fun with EMI "inside" the chassis on
systems over the years. That being the case, I would think that
bareboards in a rack could do the same thing to one another.

> Humans are presumably protected by the outer frame (air duct) that keeps
> the AC air inside the rack.

This does assume that the rack is contained within AC ducting, which
frankly I rather doubt. Despite the "how would you update the OS on
10000 systems colocated on the moon" question I understand Google has
asked interviewees I doubt that they would want the difficulty of
replacement that having to have someone walking into a large AC duct would
cause.

rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates

Terje Mathisen

Sep 18, 2006, 5:04:29 PM

Del Cecchi wrote:
> Terje Mathisen wrote:
>> Rick Jones wrote:
>>> Indeed a good question. I was also wondering what it might do to the
>>> systems just above and below in the rack, and to the people in the
>>> room etc.
>>
>>
>> It shouldn't matter:
>>
>> Each rack carries nothing but a bunch of identical systems/boards,
>> i.e. there is nothing else that can be bothered.
>>
>> Humans are presumably protected by the outer frame (air duct) that
>> keeps the AC air inside the rack.
>>
>> When you're installing identical systems by the 100s to 10Ks, the need
>> to keep a single rack (with room for maybe 40-60 units?) homogeneous
>> is no trouble at all.
>>
>> Terje
>
> what does homogeneous have to do with it?
>

Rick asked:

>>> I was also wondering what it might do to the
>>> systems just above and below in the rack,

and my reply was that all systems above and below were identical, i.e. a
homogeneous rack. (Or is that a misuse of the word?)

Niels Jørgen Kruse

Sep 18, 2006, 5:10:16 PM

John Dallman <j...@cix.co.uk> wrote:

> In article <1hlv5ur.dy0czs15w5fcwN%nos...@ab-katrinedal.dk>,
> nos...@ab-katrinedal.dk (=?ISO-8859-1?Q?Niels_J=F8rgen_Kruse?=) wrote:
>
> > Google is in a position to do something about rich(er) content. They
> > could construct an interface for putting images into posts and host
> > the images themselves. When showing a post on their webinterface they
> > could check links embedded in < > and embed the target on the page if
> > they are to simple images.
>
> Firstly, what's their motive? Secondly, that creates the chance for
> someone to do a fill-all-the-disk-space attack on them. You'd need
> ludicrous amounts of bandwidth to do it, but it's possible.

I was thinking of a webinterface, of course. You could probably make a
simple diagram drawer with AJAX.

> Thirdly, would good use be made of it? Looking at this newsgroup for an
> example, I strongly suspect that we'd see very few diagrams drawn to
> illustrate points, and a load of "Pictures of my cool casemod" from
> people who wander in and don't understand the group.

You wouldn't be looking at them unless you were reading news with
groups.google.com.

As I said, USEnet is very conservative.

Tim Bradshaw

Sep 18, 2006, 6:31:37 PM

On 2006-09-18 21:40:33 +0100, Rick Jones <rick....@hp.com> said:

>
>> Humans are presumably protected by the outer frame (air duct) that
>> keeps the AC air inside the rack.
>
> This does assume that the rack is contained within AC ducting, which
> frankly I rather doubt.

Well, you can have doors which can be shut. When they're open then
you'd get interference but only one cab's worth. Much more critically
if the doors are part of arranging the airflow you'd have to make sure
they didn't stay open too long. The same applies for lots of systems
though - for instance big machines may need blanking plates when boards
are removed to keep the airflow intact (and often do, people tend to
lose them in my experience...).

I rather doubt the electromagnetic grut from a cab full of x86 boxes is
going to be damaging to people, at least not the sort of people who
live in datacentres. I'd be considerably more worried about the
long-term effects of typical DC sound levels on hearing.

--tim

Bill Todd

Sep 18, 2006, 6:54:35 PM

Terje Mathisen wrote:

...

> The Google File System is definitely based on massive redundancy, at
> least 3 copies of everything critical, but more commonly 10-20.

I'm not sure how that relates to the issue under discussion here, which
is whether Google finds multi-threading at individual nodes useful,
whether supported by the hardware or (lacking such support) only in
software and in at least some cases involving many threads.

>
> Processing happens almost always on a node which has a local copy of the
> relevant data, i.e. disk traffic very rarely has to cross the network.

I suspect that you're referring to indexing data. Clearly, the cached
page images have to cross the network both when they're captured and
when they're served up to users.

>
> Data is stored on IDE/SATA disks exclusively, and those disks are
> treated as tape, i.e. no random access.

Or at least, from what you report below, no *fine-grained* random access.

>
> All disk IO happens in chunks, typically 64 MB,

That must mean that user requests for cached data are very (relatively)
rare, unless in such cases they drastically reduce that 'chunk' size.

> and their database
> tables work with segments of 100-200 MB.

Sounds like the unit in which data is updated, with indexes (even
rarely-used portions) otherwise maintained wholly in memory - since
random look-ups would be prohibitively slow (unless the local node
striped the chunk across dozens of local disks, but I don't have the
impression that Google nodes sport anything like that many local disks).

> If a chunk server with 1000
> chunks (say 150 GB) goes down, a controlling node will redistribute the
> data across maybe 100 other nodes, meaning that full redundancy can be
> re-established in a number of seconds.

An eminently sensible approach given the nature of their workload and
the level of replication that they're using (robust in the face of
multiple concurrent node failures), but 150 GB per node seems awfully
light unless they're still playing out the useful lives of 4+-year-old
platforms (and running ATA disks that long often starts to become a
money-losing proposition, though if their replacement procedure is
sufficiently streamlined it may make sense).
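
The "number of seconds" claim quoted above can be sanity-checked with a rough estimate. This is only a sketch: the 50 MB/s sustained write rate per receiving node is an assumed figure, not something stated in the thread, and the function name is made up for illustration.

```python
# Back-of-envelope estimate of re-replication time after a node failure.
# The per-node rate below is an ASSUMPTION, not a figure from the thread.

GB = 10**9
MB = 10**6

def rereplication_seconds(lost_bytes, target_nodes, per_node_bytes_per_s):
    """Seconds to restore redundancy when the lost data is split evenly
    across target_nodes that all receive their share in parallel."""
    share = lost_bytes / target_nodes
    return share / per_node_bytes_per_s

lost = 150 * GB   # "say 150 GB" held by the failed chunk server
rate = 50 * MB    # assumed sustained write rate per receiving node

spread = rereplication_seconds(lost, 100, rate)  # 100 receiving nodes
single = rereplication_seconds(lost, 1, rate)    # one replacement node

print(f"spread over 100 nodes: {spread:.0f} s")
print(f"onto a single node:    {single / 60:.0f} min")
```

Under that assumed rate, spreading the copies over 100 nodes takes about half a minute, while rebuilding onto a single replacement node takes the better part of an hour, which is the point of the wide fan-out.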

I think I've got a paper on the 'GFS' squirreled away somewhere - if I
can find it easily I'll try to find time to read it.

- bill

Del Cecchi

Sep 18, 2006, 8:03:01 PM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:q7e3u3-...@osl016lin.hda.hydro.com...

I was just wondering why it mattered if the systems were homogeneous. I
couldn't think of a physical reason that it did. I don't do software.

del

Terje Mathisen

Sep 19, 2006, 1:00:32 AM

Tim Bradshaw wrote:
> On 2006-09-18 21:40:33 +0100, Rick Jones <rick....@hp.com> said:
>
>>
>>> Humans are presumably protected by the outer frame (air duct) that
>>> keeps the AC air inside the rack.
>>
>> This does assume that the rack is contained within AC ducting, which
>> frankly I rather doubt.
>
> Well, you can have doors which can be shut. When they're open then
> you'd get interference but only one cab's worth. Much more critically
> if the doors are part of arranging the airflow you'd have to make sure
> they didn't stay open too long. The same applies for lots of systems

Why not think even bigger:

When a single board in a rack fails, you simply unplug it, either
remotely with a small management system, or by pulling out the power
wire, possibly without even opening the door.

When more than N out of M boards have failed, then you redistribute all
the remaining data across other racks, pull down the entire failing
rack, and replace it with a new/larger/faster rack of the latest design.

IBM showed some (prototype?) ideas similar to this a few years ago,
(maybe called Cubes?), where the idea was to never replace any failing
part, just disable it and keep running with slightly lower capacity,
until the end of the economic lifetime of the system.

The key idea is to have much more than N+1 redundancy, this allows you
to skip large parts of the normal monitor/detect/locate/swap cycle on
failing parts in any large (HPC) system.

Paul Aaron Clayton

Sep 19, 2006, 1:06:14 AM

comp...@patten-glew.net wrote:
[snip]
>I can hear people say: "What's he talking about? Google
>supplanting Intel? Doesn't Google buy Intel (or at least x86)
>systems?" Yes, but: a computer services company such as Google
>is much more efficient in its use of computing resources. If
>all computer uses were to be hosted by Google tomorrow,
>magically, then the PC microprocessor marketplace would be 1/3rd
>or less than the current size.

But would software eventually increase in processing demand to
equal capacity? (BTW, there would also be an increase in
processing efficiency from aggressive multithreading processors
relative to aggressive ILP processors.) Centralization of data
would seem to provide new opportunities for computation, even if
privacy-protecting encryption was broadly adopted.

What would users with other workloads do when there is no longer
a huge price/performance-oriented market to subsidize their
equipment costs? Might more specialized hardware become more
economical (e.g., physics processors for simulation processing
[for entertainment, training, and R&D])?

Paul Aaron Clayton
not one of the other Paul A. Clayton's :-)

Paul Aaron Clayton

Sep 19, 2006, 1:15:31 AM

Bill Todd wrote:
> Joe Seigh wrote:

> > comp...@patten-glew.net wrote:
> > [...]
> >>
> >> I think that the above is sufficiently comp.arch relevant. But to make
> >> it more so: what sort of microprocessor would be best for the Google
> >> workload?
> >>
> >> Massively multicore CMP? Multithreaded? Or perhaps higher integration
> >> of a relatively small number of CPU cores, memory controllers, network,
> >> and disk controllers?

> >
> > My impression was that Google was pretty heavily into the distributed
> > model. Any emphasis on multi-threading would just detract from this
> > model.
>
> Google is into cost-effective, and how best to attain it. That almost
> by definition means multi-threaded workloads within that multi-node
> environment: the rest of each node is far too relatively expensive to
> leave it idle waiting for disk accesses to complete, when it could be
> serving multiple disks in parallel.

Agreed. But doesn't Google control its entire software stack (i.e.,
using either open-source software or in-house software)? This could
present an opportunity for a new Architecture not merely interesting
microarchitecture.

ISTM that given the distributable nature of the Google workloads the
higher integration option would be preferred (reduced interchip
communication implying reduced power consumption as well as modestly
better performance). Even with this distributable nature, there might
be sufficient density and sharing benefits to support board-level
integration of multiple nodes (without cache coherence). This brings
the question: how much of a high-processor-count board must fail before
replacement is to be preferred over retention?

If power-efficiency was extremely preferred over single-thread
performance, tag-sequential L1 caches could be implemented with the
additional delay being hidden by TLP. It might even be conceivable
that virtual addressing could be considered an unnecessary overhead
(though I would assume the potential for undetected wrong data
accesses strongly discourages such a radical step). Presumably the
processors would have a hierarchy of threads to allow for exploiting
fine-grained/simultaneous TLP and very-coarse-grained TLP (making one
kind of context swapping fast relative to off-chip accesses and
especially I/O). Interestingly, for a new, Google-oriented
architecture, it might be desirable to provide large
software-controlled thread contexts even with the emphasis on
multithreading (which would seem to imply that smaller contexts would
be desirable, allowing more contexts to be active).

Two advantages of providing a large L2 register set would be that
software could manage more frequently accessed data (avoiding tag
overhead and conflict misses and concentrating frequent accesses into a
known fraction of the data space [so power/performance optimizations
can be made in the design]; it might be desirable to treat this almost
like a stack-cache, providing support for packed data structures yet
without requiring support for variable indexing into the data region)
and that hardware could more easily manage the core context (though it
still might be desirable to tag L1 cache blocks with a thread ID to
allow for preferential eviction from L1 of data from stalled contexts
and preferential retention in L2 of data from a somewhat soon to be
reactivated thread--the overhead of such might not be worth the
probably modest benefit from more intelligent cache management).

(BTW, would higher leakage of tiny devices lead to two-wide issue as
the most performance/watt efficient design point? Of course,
multithreading would probably provide high-utilization without needing
to resort to ILP. OTOH, some ILP optimization might be attractive for
more general purpose use.)

Of course, cost-effectiveness probably means using commodity (x86 PC)
parts. Even with the lower design costs and greater chip-area- and
power-efficiency of a Niagara-like implementation, PC parts are
probably still more cost-effective (at least given the costs of
changing architecture; after all, Google has apparently not broadly
adopted Niagara).

comp...@patten-glew.net wrote:
>Or, if not for the Google workload, how about the workload of a Google
>dominated world? I think that form factor proliferation would further
>increase the need for ultra low power.

At least until glasses that project images (or some other high pixel
count display system) become cheap, the size, cost and power
consumption of the display system constrain how small, cheap, and
energy-efficient the processing element needs to be (at least for
certain uses).

What would a cache/client-oriented system architecture be like? It
seems that memory/storage would make up a larger fraction of the total
system costs and of the power budget, though even a
cache/client-oriented system would benefit from some local intelligence
perhaps even beyond that necessary for managing the network, providing
a user-interface, and handling (de)compression and (en|de)cryption.
Variety of formfactors might be more about style than utility,
especially if cheap (power and bandwidth) interdevice communication
allows one to hide the main resource (battery, processors, memory,
etc.) device while more openly carrying the really dumb client
component.

Paul Aaron Clayton
just a technophile

Terje Mathisen

Sep 19, 2006, 1:32:29 AM

Bill Todd wrote:

> Terje Mathisen wrote:
>> If a chunk server with 1000
>> chunks (say 150 GB) goes down, a controlling node will redistribute
>> the data across maybe 100 other nodes, meaning that full redundancy
>> can be re-established in a number of seconds.
>
> An eminently sensible approach given the nature of their workload and
> the level of replication that they're using (robust in the face of
> multiple concurrent node failures), but 150 GB per node seems awfully
> light unless they're still playing out the useful lives of 4+-year-old

You're right, when I assumed that they used either the largest or the
best in GB/$ (currently about 300 GB disks), with 2 or 4 per system, the
presenter more or less confirmed it.

I.e. I'm guessing 500 to 1200 GB/system, soon to increase to the 2-3 TB
range.

> platforms (and running ATA disks that long often starts to become a
> money-losing proposition, though if their replacement procedure is
> sufficiently streamlined it may make sense).

Right, they can run until their energy cost becomes the limiter, since
they don't really have to replace any of them.

Andy...@gmail.com

Sep 19, 2006, 2:11:01 AM

Niels Jørgen Kruse wrote:
> Partly that is probably due to USEnet being very conservative. Over the
> last 10 years, the only development I see is character sets above 7 bits
> getting accepted. No rich content (except for binary newsgroups which
> have terrible retention) and that is connected to the lack of a
> newsreader war.

Save USEnet! Start posting in XHTML with SVG.

Bill Todd

Sep 19, 2006, 2:19:30 AM

Terje Mathisen wrote:

...

> When a single board in a rack fails, you simply unplug it, either
> remotely with a small management system, or by pulling out the power
> wire, possibly without even opening the door.

The problem with that (at least in a Google-type environment) is that
the still-perfectly usable disk resources which the failed board
controls directly (because anything other than DAS blows your costs out
of the water) are worth considerably more than the board itself is and
should not be left idle.

I suspect that the Google techs can replace a failed board in a very few
minutes at most. If so, this is probably cost-effective (i.e.,
developing that level of expertise is worth it given the frequency with
which boards fail; same for disks and perhaps even for switches, though
these last you have to replace anyway given the amount of hardware you
lose access to when they fail).

...

> IBM showed some (prototype?) ideas similar to this a few years ago,
> (maybe called Cubes?), where the idea was to never replace any failing
> part, just disable it and keep running with slightly lower capacity,
> until the end of the economic lifetime of the system.

They called it 'ice cube' for a reason: it was a 3-dimensional layout
with modular bricks electrically coupled to each other (and to the
outside world at the cube's faces) capacitively, such that no wiring had
to be done and a full mesh was present to preserve reasonable bandwidth
when it had to route around failed parts.

In such a configuration you *can't* replace inner failed components
without taking the entire system apart, but in return you get prodigious
storage density (yes, it was water-cooled).

If Google can't get some benefit similar to that prodigious density, the
'just leave failed parts in place' model may not be useful for them.

>
> The key idea is to have much more than N+1 redundancy, this allows you
> to skip large parts of the normal monitor/detect/locate/swap cycle on
> failing parts in any large (HPC) system.

Actually, you only need N + M redundancy as far as data is concerned,
where M can still be small (say, 3 at most, and 2 unless you're *really*
paranoid). The real secret is the ability to reconfigure after a
failure to *restore* that level of redundancy quickly such that you're
ready for the next failure (or even for two at once). Of course, one
could argue that if you've got the spare space available to restore
redundancy after failures, you might as well use it to establish extra
redundancy up front, but OTOH you might elect to reserve the space for
other uses until such time as your critical data might require it.

The main challenge in ice cube was to maintain *connectivity and
bandwidth* despite an increasing number of failed bricks that left the
mesh full of random holes. And, of course, to cool it.

- bill

Terje Mathisen

Sep 19, 2006, 2:57:41 AM

Bill Todd wrote:
> Terje Mathisen wrote:
>
> ...
>
>> When a single board in a rack fails, you simply unplug it, either
>> remotely with a small management system, or by pulling out the power
>> wire, possibly without even opening the door.
>
> The problem with that (at least in a Google-type environment) is that
> the still-perfectly usable disk resources which the failed board
> controls directly (because anything other than DAS blows your costs out
> of the water) are worth considerably more than the board itself is and
> should not be left idle.

By 'failing board' I included any of the directly attached resources,
i.e. I expect the disks to be the component most likely to fail.

>
> I suspect that the Google techs can replace a failed board in a very few
> minutes at most. If so, this is probably cost-effective (i.e.,
> developing that level of expertise is worth it given the frequency with
> which boards fail; same for disks and perhaps even for switches, though
> these last you have to replace anyway given the amount of hardware you
> lose access to when they fail).

Possibly.

>> IBM showed some (prototype?) ideas similar to this a few years ago,
>> (maybe called Cubes?), where the idea was to never replace any failing
>> part, just disable it and keep running with slightly lower capacity,
>> until the end of the economic lifetime of the system.
>
> They called it 'ice cube' for a reason: it was a 3-dimensional layout
> with modular bricks electrically coupled to each other (and to the
> outside world at the cube's faces) capacitively, such that no wiring had
> to be done and a full mesh was present to preserve reasonable bandwidth
> when it had to route around failed parts.
>
> In such a configuration you *can't* replace inner failed components
> without taking the entire system apart, but in return you get prodigious
> storage density (yes, it was water-cooled).
>
> If Google can't get some benefit similar to that prodigious density, the
> 'just leave failed parts in place' model may not be useful for them.

OK.

>
>>
>> The key idea is to have much more than N+1 redundancy, this allows you
>> to skip large parts of the normal monitor/detect/locate/swap cycle on
>> failing parts in any large (HPC) system.
>
> Actually, you only need N + M redundancy as far as data is concerned,
> where M can still be small (say, 3 at most, and 2 unless you're *really*

Yes, except they've told us that they have stored everything in at least
3 locations, and usually more than 10.

OTOH, for a pure storage system I agree with you.

> paranoid). The real secret is the ability to reconfigure after a
> failure to *restore* that level of redundancy quickly such that you're
> ready for the next failure (or even for two at once). Of course, one

I've done the math for a fractional PB storage system, simply having
chunks of 12+1 or 13+1 disks in classic RAID, plus a couple of shared
hot standby disks to step in if/when any of the chunks lose their
redundancy, would be sufficient for four nines availability over a year
or two, even when using plain SATA disks. This assumed something like 5
hours to rebuild the RAID info. Having a N+2 setup basically removes the
worry over disk failure, unless you get a really bad batch where all of
the disks are going to fail more or less simultaneously.

Anyway, with remote mirroring of the entire setup, and replacing both
halves every two years, the cost goes up about 3 times, but expected MTBF
becomes so high that some kind of operator error is much more likely.

Jan Vorbrüggen

Sep 19, 2006, 3:27:35 AM

> The presenter mentioned that getting 20 C differential between incoming
> and outgoing air is a lot more efficient than a normal computer room
> where he claimed 2 C was normal.

If the numbers are correct, he's right: It's a heat engine whose
thermodynamic efficiency is delta-T / the higher of the two operating
temperatures (both in Kelvin, of course). To a first approximation, the
ten-times higher temperature difference means a ten-times higher
efficiency as a heat engine.
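
The "first approximation" can be checked numerically. A minimal sketch, assuming an inlet (cold side) temperature of 20 C, which is not given in the thread; the function name is illustrative only.

```python
# Ideal heat-engine efficiency as described above: delta-T divided by the
# higher of the two operating temperatures, both in Kelvin.
# The 20 C inlet temperature is an ASSUMPTION for illustration.

def ideal_efficiency(inlet_celsius, delta_t):
    """delta-T over the hot-side temperature in Kelvin."""
    t_cold = inlet_celsius + 273.15
    t_hot = t_cold + delta_t
    return delta_t / t_hot

eff_20 = ideal_efficiency(20.0, 20.0)  # 20 C differential
eff_2 = ideal_efficiency(20.0, 2.0)    # 2 C differential
ratio = eff_20 / eff_2

print(f"20 C differential: {eff_20:.4f}")
print(f" 2 C differential: {eff_2:.4f}")
print(f"ratio:             {ratio:.2f}")
```

The ratio comes out a little under ten because the hot-side temperature in the denominator also rises with the larger differential.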

Jan

Bill Todd

Sep 19, 2006, 3:32:49 AM

Terje Mathisen wrote:
> Bill Todd wrote:
>> Terje Mathisen wrote:

...

>>> The key idea is to have much more than N+1 redundancy, this allows
>>> you to skip large parts of the normal monitor/detect/locate/swap
>>> cycle on failing parts in any large (HPC) system.
>>
>> Actually, you only need N + M redundancy as far as data is concerned,
>> where M can still be small (say, 3 at most, and 2 unless you're *really*
>
> Yes, except they've told us that they have stored everything in at least
> 3 locations, and usually more than 10.

In the loose terminology that I was using above, 'M' was the number of
simultaneous disk failures which could be tolerated. Thus storing three
copies means that M=2.

I just found the GFS paper I thought I had (presented at SOSP '03), and
while I haven't yet finished reading it it does state that their default
as of that time was 'three replicas', which I suspect means three copies
though taken literally it could mean four. I can certainly understand
why they might create more copies of their metadata, though: given its
relative size, that additional protection costs virtually nothing.

>
> OTOH, for a pure storage system I agree with you.
>
>> paranoid). The real secret is the ability to reconfigure after a
>> failure to *restore* that level of redundancy quickly such that you're
>> ready for the next failure (or even for two at once). Of course, one
>
> I've done the math for a fractional PB storage system, simply having
> chunks of 12+1 or 13+1 disks in classic RAID, plus a couple of shared
> hot standby disks to step in if/when any of the chunks lose their
> redundancy, would be sufficient for four nines availability over a year
> or two, even when using plain SATA disks. This assumed something like 5
> hours to rebuild the RAID info.

Those figures sound as if they're based solely on whole-disk failure
rates. If you include the likelihood that when a disk fails you will
then find that one of the sectors on the remaining disks in the array is
no longer readable, you may find the picture considerably less rosy.

> Having a N+2 setup basically removes the
> worry over disk failure, unless you get a really bad batch where all of
> the disks are going to fail more or less simultaneously.

The main need for N+2 these days is based upon the deteriorated-sector
problem that I just described above.

>
> Anyway, with remote mirroring of the entire setup, and replacing both
> halves every two years, the cost goes up about 3 times, but expected MTBF
> becomes so high that some kind of operator error is much more likely.

Remote mirroring is required anyway in installations that must be
disaster-tolerant, and if supplemented by suitably-constrained local
redundancy at both ends the likelihood of data loss becomes negligible
even if no discipline is exercised to keep localized data on the primary
site from being spread out over the entire remote site.

- bill

Terje Mathisen

Sep 19, 2006, 4:44:26 AM

Bill Todd wrote:
> Terje Mathisen wrote:
>> I've done the math for a fractional PB storage system, simply having
>> chunks of 12+1 or 13+1 disks in classic RAID, plus a couple of shared
>> hot standby disks to step in if/when any of the chunks lose their
>> redundancy, would be sufficient for four nines availability over a
>> year or two, even when using plain SATA disks. This assumed something
>> like 5 hours to rebuild the RAID info.
>
> Those figures sound as if they're based solely on whole-disk failure
> rates. If you include the likelihood that when a disk fails you will
> then find that one of the sectors on the remaining disks in the array is
> no longer readable, you may find the picture considerably less rosy.

This was based on the maximum density available in a SATABeast, which
fits 42 disks inside a 4U enclosure. The controller in this box performs
regular background read-to-verify of all disk surfaces, i.e. silent
errors cannot last more than a day or two.

Since such a SATA system has a dedicated controller path/channel to
every disk, it can read all the disks at the same time, so when
otherwise idle it could scan a full load of 500 GB disks (giving a
usable file system of around 18 TB) in less than 3 hours.
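
As a rough check of the scrub figures above: since every disk has its own channel, the scan time depends only on a single disk's capacity and its sustained read rate. The sketch below just inverts that relation. The split of 42 bays into three 12+1 groups plus three hot spares is an assumption that happens to reproduce the 18 TB figure; it is not stated in the post.

```python
# Sustained per-disk read rate needed to scrub a whole disk in a given
# time, with all disks read in parallel on dedicated channels.

GB = 10**9
TB = 10**12
DISK_BYTES = 500 * GB

def required_rate(disk_bytes, seconds):
    """Bytes per second each disk must sustain to be fully read in time."""
    return disk_bytes / seconds

rate_3h = required_rate(DISK_BYTES, 3 * 3600)          # full-speed scan
rate_week = required_rate(DISK_BYTES, 7 * 24 * 3600)   # spread over a week

# ASSUMED layout: 3 RAID groups of 12 data + 1 parity, plus 3 hot spares.
data_disks = 3 * 12
usable_bytes = data_disks * DISK_BYTES

print(f"3-hour scan needs  {rate_3h / 1e6:.1f} MB/s per disk")
print(f"1-week scan needs  {rate_week / 1e3:.1f} kB/s per disk")
print(f"usable capacity    {usable_bytes / TB:.0f} TB")
```

A week-long scrub therefore costs well under 1 MB/s per disk, small next to the rate implied by the 3-hour scan.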

It was my belief that even when spreading it out over a week this would
still be often enough that most deteriorating disk sectors should be
caught while still recoverable via the ECC coding, so that the internal
disk controller can map out the bad sector?

>
>> Having a N+2 setup basically removes the
>> worry over disk failure, unless you get a really bad batch where all
>> of the disks are going to fail more or less simultaneously.
>
> The main need for N+2 these days is based upon the deteriorated-sector
> problem that I just described above.

OK, even with regular readability verification?

Bill Todd

Sep 19, 2006, 5:53:39 AM

Terje Mathisen wrote:
> Bill Todd wrote:
>> Terje Mathisen wrote:
>>> I've done the math for a fractional PB storage system, simply having
>>> chunks of 12+1 or 13+1 disks in classic RAID, plus a couple of shared
>>> hot standby disks to step in if/when any of the chunks lose their
>>> redundancy, would be sufficient for four nines availability over a
>>> year or two, even when using plain SATA disks. This assumed something
>>> like 5 hours to rebuild the RAID info.

Just to check, Seagate specs their current SATA drives with an AFR of
0.73% 24/7 (interestingly, with an AFR of 0.34% when they omit the 24/7
qualification), or 1,200,000 hour MTBF. This suggests that the (naive)
probability of encountering a second whole-disk failure during a 5-hour
rebuild of a failed disk would be 1/240,000. For a 1 PB storage system
you'd need about 1440 750 GB drives in the configuration you specify, so
you'd expect about 21 of them to fail over a 2-year period, with a
probability of about 1/11,400 that one of those failures would encounter
a second disk failure during the 5-hour rebuild: just over 4 nines, and
you did say a fractional PB system rather than a full PB.
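
The arithmetic in the paragraph above can be reproduced with a short sketch. The drive-count line is a reconstruction (1 PB of data on 750 GB drives in 12+1 groups), and the variable names are illustrative; the "naive" estimate, as in the post, considers only the chance that one given drive fails within the rebuild window.

```python
# Reproduces the naive whole-disk failure estimate from the post.

HOURS_PER_YEAR = 8760

afr = 0.0073          # annualized failure rate, 24/7 duty
rebuild_hours = 5
years = 2

mtbf_hours = HOURS_PER_YEAR / afr                  # ~1,200,000 hours

# Reconstructed drive count: 1 PB of data, 750 GB drives, 12+1 groups.
data_drives = 1e15 / 750e9
total_drives = data_drives * 13 / 12               # ~1444, "about 1440"

n_drives = 1440
expected_failures = n_drives * afr * years         # ~21

# Naive: chance that one given drive fails during a single rebuild.
p_second = rebuild_hours / mtbf_hours              # ~1/240,000

# Chance that any of the expected rebuilds hits a second failure.
p_any = expected_failures * p_second               # ~1/11,400

print(f"MTBF:               {mtbf_hours:,.0f} hours")
print(f"drives needed:      {total_drives:.0f}")
print(f"expected failures:  {expected_failures:.1f}")
print(f"p(second failure):  1/{1 / p_second:,.0f}")
print(f"p(any double fail): 1/{1 / p_any:,.0f}")
```

A less naive variant would scale `p_second` by the number of surviving drives in the affected RAID group, since a failure of any one of them during the rebuild loses data.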

>>
>> Those figures sound as if they're based solely on whole-disk failure
>> rates. If you include the likelihood that when a disk fails you will
>> then find that one of the sectors on the remaining disks in the array
>> is no longer readable, you may find the picture considerably less rosy.
>
> This was based on the maximum density available in a SATABeast, which
> fits 42 disks inside a 4U enclosure. The controller in this box performs
> regular background read-to-verify of all disk surfaces, i.e. silent
> errors cannot last more than a day or two.

That is definitely prudent but not, I think, part of the definition of
'simply' 'classic RAID'.

>
> Since such a SATA system has a dedicated controller path/channel to
> every disk, it can read all the disks at the same time, so when
> otherwise idle it could scan a full load of 500 GB disks (giving a
> usable file system of around 18 TB) in less than 3 hours.
>
> It was my belief that even when spreading it out over a week this would
> still be often enough that most deteriorating disk sectors should be
> caught while still recoverable via the ECC coding, so that the internal
> disk controller can map out the bad sector?
>>
>>> Having a N+2 setup basically removes the
>>> worry over disk failure, unless you get a really bad batch where all
>>> of the disks are going to fail more or less simultaneously.
>>
>> The main need for N+2 these days is based upon the deteriorated-sector
>> problem that I just described above.
>
> OK, even with regular readability verification?

That is certainly a legitimate question, and I do not know the answer.
If one takes at face value the number offered up by manufacturers for
the incidence of unreadable sectors (one for every 10^14 bits read for a
contemporary Seagate SATA drive, for example), then the incidence of
unreadable sectors may be independent of such scrubbing activity. At
first glance this seems at least potentially counter-intuitive,
especially if that scrubbing activity is accompanied by frequent checks
of the disks' SMART data to catch increasing frequencies of correctable
errors that occur during the scrubbing - but after you consider that
most *correctable* errors may arise not from bit deterioration on the
drive but from temporary aberrations due to head positioning and/or
flying height the possibility becomes more believable.

With the 21 whole-disk failures in your system that we calculated above,
you'd have to read in 21 * 12 * 750 GB = 189 TB of data to rebuild them.
With an expected unreadable sector every 12.5 TB or so (if that
expectation is indeed independent of the scrubbing activity), there'd be
a close to even chance on *every* whole-disk failure of not being able
to reconstruct its contents completely.

Just how much would scrubbing have to improve the situation before you'd
get close to your 4 nines availability goal (even just for this failure
mode, ignoring the probability of a second whole-disk failure during a
rebuild)? The interval would have to increase from one unreadable sector
every 12.5 TB to one unreadable sector every 189 TB * 10,000 = 1.89 EB,
or by over 5 orders of magnitude.
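
For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. It assumes unreadable sectors turn up independently at the specced rate (a Poisson model, which is an assumption of this sketch and not something the spec sheet states). The constant and function names are made up for illustration.

```python
import math

BITS_PER_UNREADABLE_SECTOR = 10**14   # spec figure quoted above
DISK_BYTES = 750 * 10**9              # 750 GB drive
DISKS_READ_PER_REBUILD = 12           # surviving members of a 12+1 set
EXPECTED_FAILURES = 21                # whole-disk failures over two years
TARGET_LOSS_PROBABILITY = 1e-4        # "4 nines"


def p_unreadable_during_read(bytes_read, bits_per_error=BITS_PER_UNREADABLE_SECTOR):
    """Chance of at least one unreadable sector, assuming independent errors."""
    expected_errors = bytes_read * 8 / bits_per_error
    return 1 - math.exp(-expected_errors)


rebuild_bytes = DISKS_READ_PER_REBUILD * DISK_BYTES        # 9 TB per rebuild
total_bytes = EXPECTED_FAILURES * rebuild_bytes            # 189 TB over all rebuilds
p_per_rebuild = p_unreadable_during_read(rebuild_bytes)    # roughly an even chance

# Bytes per unreadable sector needed to hit the target, versus the spec.
required_bytes_per_error = total_bytes / TARGET_LOSS_PROBABILITY
spec_bytes_per_error = BITS_PER_UNREADABLE_SECTOR / 8      # 12.5 TB
improvement = required_bytes_per_error / spec_bytes_per_error
orders_of_magnitude = math.log10(improvement)
```

Under those assumptions each rebuild has a little over a 50% chance of hitting an unreadable sector, and the required improvement comes out at a factor of about 150,000, i.e. just over 5 orders of magnitude.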

I'm afraid that while I do, intuitively, expect scrubbing to reduce the
incidence with which one encounters unreadable sectors, I don't expect
it to reduce it by anything like that amount. Do you?

- bill

Bill Todd

Sep 19, 2006, 6:03:12 AM

Bill Todd wrote:
> Terje Mathisen wrote:
>> Bill Todd wrote:
>>> Terje Mathisen wrote:
>>>> I've done the math for a fractional PB storage system, simply having
>>>> chunks of 12+1 or 13+1 disks in classic RAID, plus a couple of
>>>> shared hot standby disks to step in if/when any of the chunks lose
>>>> their redundancy, would be sufficient for four nines availability
>>>> over a year or two, even when using plain SATA disks. This assumed
>>>> something like 5 hours to rebuild the RAID info.
>
> Just to check, Seagate specs their current SATA drives with an AFR of
> 0.73% 24/7 (interestingly, with an AFR of 0.34% when they omit the 24/7
> qualification), or 1,200,000 hour MTBF. This suggests that the (naive)
> probability of encountering a second whole-disk failure during a 5-hour
> rebuild of a failed disk would be 1/240,000. For a 1 PB storage system
> you'd need about 1440 750 GB drives in the configuration you specify, so
> you'd expect about 21 of them to fail over a 2-year period, with a
> probability of about 1/11,400 that one of those failures would encounter
> a second disk failure during the 5-hour rebuild: just over 4 nines, and
> you did say a fractional PB system rather than a full PB.

Actually, both the discussion above and its continuation into the
influence of unreadable sectors are not, strictly speaking, about '4 nines
availability', since what they're talking about is the probability of
data loss rather than down-time percentages. To understand exactly what
consequences data loss has for availability, one must first posit an
additional means of retrieving the data (such as by rebuilding it from
up-to-date transaction logs) and then evaluate how long this will take
as a percentage of the time period in question.
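
The distinction can be made concrete with a small Python sketch. The 24-hour restore time is purely hypothetical (nothing in this thread gives one), and the function names are invented for illustration; the point is only that a data-loss probability and a downtime fraction are different quantities.

```python
import math

HOURS_PER_YEAR = 8760


def downtime_fraction(p_loss, restore_hours, period_hours):
    """Expected fraction of the period spent restoring after a data loss."""
    return p_loss * restore_hours / period_hours


def nines(fraction_bad):
    """Convert a failure fraction such as 1e-4 into a count of nines."""
    return -math.log10(fraction_bad)


# Hypothetical inputs: a 1-in-10,000 chance of data loss over two years,
# and a 24-hour restore from logs or backup if it does happen.
p_loss = 1e-4
example_downtime = downtime_fraction(p_loss, restore_hours=24,
                                     period_hours=2 * HOURS_PER_YEAR)
loss_nines = nines(p_loss)                      # nines of "no data loss"
availability_nines = nines(example_downtime)    # nines of availability
```

With these made-up inputs, 4 nines against data loss corresponds to well over 6 nines of availability, which is why the two figures should not be used interchangeably.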

- bill

ken...@cix.compulink.co.uk

Sep 19, 2006, 6:48:31 AM

In article <1hlv5ur.dy0czs15w5fcwN%nos...@ab-katrinedal.dk>,
nos...@ab-katrinedal.dk (Niels Jørgen Kruse)
wrote:

> Binary newsgroups completely dwarf the nonbinary ones.

And a lot of news servers, especially the free ones, either do not
carry binary groups or expire binaries at a much faster rate than
text groups. There are currently getting on for 105 thousand
newsgroups in my group list (and I have not updated it recently);
about 6,555 of those are binary groups. That is counting by groups
that include binary in the group name. If you allowed rich content
in the rest it could very well cause severe problems. It is always
possible to stick a link into a news message anyway.

Ken Young

Casper H.S. Dik

Sep 19, 2006, 6:52:53 AM

Rick Jones <rick....@hp.com> writes:

>Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>> They do stuff like making their own motherboards with a single 12 V
>> power supply, just to get lower conversion heat loss, and they
>> install naked MBs into racks, with the AC units ducting air directly
>> into these racks, with no need to cool a lot of (pizza) box
>> enclosures.

>I wonder what that does to the EFI (term?) in their machine rooms.

EMI (electromagnetic interference).

That is an interesting question as it may be that doing so
is illegal unless you take additional steps.

(Though no white box PC is ever tested for emissions after assembly
and the modern trend towards translucent cases cannot be helpful)

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Terje Mathisen

Sep 19, 2006, 7:18:11 AM

Bill Todd wrote:
> Terje Mathisen wrote:
>> OK, even with regular readability verification?
>
> That is certainly a legitimate question, and I do not know the answer.
> If one takes at face value the number offered up by manufacturers for
> the incidence of unreadable sectors (one for every 10^14 bits read for a
> contemporary Seagate SATA drive, for example), then the incidence of

If that number is for unrecoverable read errors, then it seems far too high!

10^14 bits is as you note about 12.5 TB, which means that a 750 GB disk
will fail on average every 15 times you read it. Since the vendors rate
these disks for 24/7 operation, and we've previously calculated that you
can read or write the entire surface at least every 5 hours, you could
install them in a video surveillance setup and get an unrecoverable
error every 3 days.
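
A quick Python check of that estimate, taking the 10^14-bit spec figure and the 5-hour full-surface pass at face value. The names are illustrative only, and the sketch simply divides one number by another; it says nothing about whether the spec reflects real drives.

```python
BITS_PER_UNREADABLE_SECTOR = 10**14   # spec figure under discussion
DISK_BYTES = 750 * 10**9              # 750 GB drive
HOURS_PER_FULL_PASS = 5               # time to read or write the whole surface


def full_passes_per_error(disk_bytes=DISK_BYTES,
                          bits_per_error=BITS_PER_UNREADABLE_SECTOR):
    """Average number of whole-disk reads between unreadable sectors."""
    return (bits_per_error / 8) / disk_bytes


def days_between_errors(hours_per_pass=HOURS_PER_FULL_PASS):
    """Average days between errors when streaming the disk continuously."""
    return full_passes_per_error() * hours_per_pass / 24


passes = full_passes_per_error()
days = days_between_errors()
```

This gives between 16 and 17 full passes per error and about three and a half days of continuous streaming, in line with the rough figures above.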

This _really_ doesn't match the MTBF numbers!

On the other hand I have the almost unbelievably good numbers from the
vendor of the SATABeast, claiming a 0.7% total error rate on all the
installed disks, with an average run time of at least a year.

BTW, they claim to use not quite standard Hitachi disks, with modified
firmware to improve reporting of possible error situations. If this is
true, or just a ruse to defend 'special' pricing, I don't know.

Anyway, since their error numbers are based on actual read verification,
with failing drives being flagged, we know that it must be possible to
read everything at least every week for a year with close to zero
failures (0.7% is one disk in 140, or 3.5 full 42-disk units).

It is still possible that what happens during the verification stage is
relatively regular unrecoverable read errors that are fixed immediately
by re-writing the missing data (regenerated from the other disks) to the
same drive, which would have remapped the sector upon the initial read
failure. If this is true, then your scenario with failure-to-rebuild
becomes much more believable.

OTOH I have the tests we did where we pulled working disks, and then
watched how a spare drive was mapped in and RAID restored with no
problem. :-)

> unreadable sectors may be independent of such scrubbing activity. At
> first glance this seems at least potentially counter-intuitive,
> especially if that scrubbing activity is accompanied by frequent checks
> of the disks' SMART data to catch increasing frequencies of correctable
> errors that occur during the scrubbing - but after you consider that
> most *correctable* errors may arise not from bit deterioration on the
> drive but from temporary aberrations due to head positioning and/or
> flying height the possibility becomes more believable.

I expect the ECC codes to work, i.e. with sufficient retries to get
perfect head/disk/track alignment, small numbers of flaky bits should
get caught by the ECC codes.

If all (or at least most of) the recoverable errors were caused by head
alignment and not magnetic media failure, then there wouldn't be any
need for ECC, right? Simply having a strong CRC check would be sufficient.

>
> With the 21 whole-disk failures in your system that we calculated above,
> you'd have to read in 21 * 12 * 750 GB = 189 TB of data to rebuild them.
> With an expected unreadable sector every 12.5 TB or so (if that
> expectation is indeed independent of the scrubbing activity), there'd be
> a close to even chance on *every* whole-disk failure of not being able
> to reconstruct its contents completely.

There would also be a similar chance of never being able to fill the
array with data without getting an error. (This _would_ be handled by
disk sector remapping, but it still sounds suspicious.)

In the end it all comes down to how you measure the MTBF numbers: Is
this the time to the first unrecoverable error, or the time until you
detect it? If you sell a 750 GB disk and only 1% of it is actually read
back regularly (file system metad

Bill Todd

Sep 19, 2006, 9:04:45 AM

Terje Mathisen wrote:
> Bill Todd wrote:
>> Terje Mathisen wrote:
>>> OK, even with regular readability verification?
>>
>> That is certainly a legitimate question, and I do not know the answer.
>> If one takes at face value the number offered up by manufacturers for
>> the incidence of unreadable sectors (one for every 10^14 bits read for
>> a contemporary Seagate SATA drive, for example), then the incidence of
>
> If that number is for unrecoverable read errors, then it seems far too
> high!

Well, it's been pretty consistent over the years: IIRC quite a few
years ago an enterprise Seagate SCSI/FC drive was rated for one error
every 10^15 bits (just as they still are today).

Not too long ago I came across a (IIRC refereed) paper that gave typical
values of 10^15 bits for enterprise drives and 10^14 bits for consumer
drives - plus the interesting figure of 10^15 bits for *undetected*
error rates for consumer drives (IIRC there used to be a couple of
orders of magnitude more separation between uncorrectable error rates
and undetected error rates, at least on enterprise drives back when such
figures were given out). I can't find that paper right now, but I did
come up with another ("Reliability Mechanisms for Very Large Storage
Systems" from April, 2003, written by some fairly well-known people)
that gave uncorrectable error rates as every 10^14 - 10^15 bits.

>
> 10^14 bits is as you note about 12.5 TB, which means that a 750 GB disk
> will fail on average every 15 times you read it. Since the vendors rate
> these disks for 24/7 operation, and we've previously calculated that you
> can read or write the entire surface at least every 5 hours, you could
> install them in a video surveillance setup and get an unrecoverable
> error every 3 days.

Possibly not: while Seagate does rate their SATA disks for 24/7
'operation', I suspect (though didn't find in the first-level literature
on their site) that they *don't* rate them for anything like a 100% duty
cycle. I.e., that 'continuous operation' = 'continuous power-on - and
perhaps spun-up - hours', not 'continuously transferring hours'.

This is certainly the way they used to differentiate between
'enterprise' and 'consumer' disk products, though their definition of
duty cycle may have referred more to seek activity than mere transfer
activity (hmmm - do the head-positioning magnets work to hold precise
head positions during transfers even when not actively seeking, though?).

For that matter, if you were using an intelligent RAID mechanism it
would mask any such errors anyway - and likely correct any bad data that
it noticed while doing so.

And constant streaming at full disk bandwidth is a fairly rare
application. Typical average disk transfer rates are orders of
magnitude slower, and thus would not encounter more than a single such
error per year (and often never during their entire lifetime).

>
> This _really_ doesn't match the MTBF numbers!

It's not inconsistent if the MTBF numbers refer only to whole-disk
failures (which sort of makes sense, given that they provide both
figures and separate them).

>
> On the other hand I have the almost unbelievably good numbers from the
> vendor of the SATABeast, claiming a 0.7% total error rate on all the
> installed disks, with an average run time of at least a year.
>
> BTW, they claim to use not quite standard Hitachi disks, with modified
> firmware to improve reporting of possible error situations. If this is
> true, or just a ruse to defend 'special' pricing, I don't know.
>
> Anyway, since their error numbers are based on actual read verification,
> with failing drives being flagged, we know that it must be possible to
> read everything at least every week for a year with close to zero
> failures (0.7% is one disk in 140, or 3.5 full 42-disk units).

Interesting. A friend of mine looked into undetected error rates at DEC
a little over a decade ago in what sounds like the same manner and found
them around once every few PB - but that potentially included undetected
bus-related errors as well as disk errors.

- bill

Jan Vorbrüggen

Sep 19, 2006, 9:14:48 AM

> (hmmm - do the head-positioning magnets work to hold precise
> head positions during transfers even when not actively seeking, though?).

I thought that the signal from the read head is fed to a feedback loop that
does "micro-seeks" during a revolution to keep the head precisely aligned
with the track?

Jan

Niels Jørgen Kruse

Sep 19, 2006, 9:16:01 AM

comp...@patten-glew.net <Andy...@gmail.com> wrote:

Joking? If you enhance the USEnet reading experience to give what
webboards do, I think it might be possible to turn the tide.

To gain acceptance of anything new, it has to be readable by people
using old software, however. This means that inline formatting
directives should be avoided and anything bulky should be hosted rather
than inlined. Links to things like animated avatars could go in headers
that you don't see in normal use. Formatting directives might be compact
enough to fit in a header line if not overused.

Paradox

Sep 19, 2006, 9:22:30 AM

Niels Jørgen Kruse wrote:

> John Dallman <j...@cix.co.uk> wrote:
>
> > In article <1hlv5ur.dy0czs15w5fcwN%nos...@ab-katrinedal.dk>,
> > nos...@ab-katrinedal.dk (Niels Jørgen Kruse) wrote:
> >
> > > Google is in a position to do something about rich(er) content. They
> > > could construct an interface for putting images into posts and host
> > > the images themselves. When showing a post on their webinterface they
> > > could check links embedded in < > and embed the target on the page if
> > > they are to simple images.

[snip]

> > Thirdly, would good use be made of it? Looking at this newsgroup for an
> > example, I strongly suspect that we'd see very few diagrams drawn to
> > illustrate points, and a load of "Pictures of my cool casemod" from
> > people who wander in and don't understand the group.
>
> You wouldn't be looking at them unless you were reading news with
> groups.google.com.
>
> As I said, USEnet is very conservative.

I wonder if there is added value in having a picture (maybe as a
thumbnail one could click to get the big version) embedded in the text,
instead of simply embedding a link to the picture?
(When using Google to read news links are automatically made
'click-able'.)

----
Create your favorite image channel at http://www.imageoak.com

Bill Todd

Sep 19, 2006, 9:29:14 AM

That sounds somewhat familiar. I guess the question then becomes
whether significantly more heat is generated by macro seeks to other
tracks than by such micro seeks, and how that may relate to allowed duty
cycles.

- bill

Bill Todd

Sep 19, 2006, 2:48:03 PM

Terje Mathisen wrote:
> Bill Todd wrote:
>> Terje Mathisen wrote:
>>> OK, even with regular readability verification?
>>
>> That is certainly a legitimate question, and I do not know the answer.
>> If one takes at face value the number offered up by manufacturers for
>> the incidence of unreadable sectors (one for every 10^14 bits read for
>> a contemporary Seagate SATA drive, for example), then the incidence of
>
> If that number is for unrecoverable read errors, then it seems far too
> high!

Since I don't seem to be able to get any sleep right now, I kept nosing
around:

Jim Gray apparently agrees with you. In
ftp://ftp.research.microsoft.com/pub/tr/TR-2005-166.pdf , he found well
over an order of magnitude fewer uncorrectable read errors than the spec
would suggest, one obvious possible explanation being that the spec is
written to handle worst-case parts that occur only rarely (Jim's sample
size was small - 10 disks total).

On the other hand, the whole-disk failure rates he saw for a much larger
group of SATA disks were around an order of magnitude *higher* than the
specs predicted - perhaps because real-world operating environments are
significantly harsher than those used for the spec runs.

An Intel Developer Forum "High Performance RAID-6: Why and How" paper
states explicitly that disk MTBF figures (it also gives a 'typical' ATA
figure an order of magnitude worse than Seagate's - but then again Intel
*is* in the business of selling RAID hardware...) relate *only* to
whole-disk failures and do not take unreadable sectors into account.

Seagate seems to have bumped the unrecoverable error rate up to 1 sector
every 10^16 bits in its new 15K rpm Cheetahs and cared enough to point
that out in its brochure - so it must think *someone* cares.

A couple of years ago Mary Baker from HP Labs gave a talk at Stanford
about (among other things) the effect of scrubbing on unrecoverable
sectors, and indeed came to the conclusion that the more frequent the
scrubbing, the more reliable the storage was (up to over an order of
magnitude more reliable than without scrubbing, though that's still
several orders short of what you'd need it to be to reduce the
probability of loss to 0.01% in the configuration you described).
Another paper ("Disk Scrubbing in Large Archival Storage Systems", which
found that the incidence of unreadable sectors was about 5x that of
whole-disk failures) reached somewhat different conclusions - that
scrubbing much more than once a year yielded minimal increasing returns,
but that it was extremely important to scrub (though the 100-year period
of data retention may have significantly skewed the latter).

Dave Hitz (co-founder of Network Appliance and father of its WAFL file
system) recently described the worth of NetApp's dual-parity RAID in his
corporate blog, citing the conventional 1 unreadable sector in 10^14
bits figure and noting that it might be as much as an order of magnitude
better (still pretty far from the 5 orders better that your
configuration would need, though).

I was obviously half-asleep doing the failure calculations a few posts
ago (though not as nearly asleep as I am now), and left out a factor of
12 for the 12 disks participating in the reconstruction effort. Thus it
should have read:

Just to check, Seagate specs their current SATA drives with an AFR of
0.73% 24/7 (interestingly, with an AFR of 0.34% when they omit the 24/7
qualification), or 1,200,000 hour MTBF. This suggests that the (naive)
probability of encountering a second whole-disk failure during a 5-hour
rebuild of a failed disk in a 13-disk RAID-5 array would be 1/20,000
(the reciprocal of (single-disk
MTBF/(number-of-rebuild-hours*number-of-participating-disks))). For a 1
PB storage system you'd need about 1440 750 GB drives in the
configuration you specify, so you'd expect about 21 of them to fail over
a 2-year period, with a probability of about 1/952 that one of those
failures would encounter a second disk failure during the 5-hour
rebuild: just about 3 nines, though you did say a fractional PB system
rather than a full PB and for 1 or 2 years rather than a full 2 years.

That mistake did not affect the analysis of the impact of unreadable
sectors on rebuild operations, though: using the specced bit error rate
you'd still be off by 5 orders of magnitude (4 orders if you assume that
the specced value is about 1 order too pessimistic), which is still an
awful lot for scrubbing to take care of.
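
The corrected whole-disk figures can be reproduced with a short Python sketch. It uses the same naive model as the text (a constant failure rate of 1/MTBF per disk-hour, failures independent); the function names are invented for illustration.

```python
MTBF_HOURS = 1_200_000     # Seagate's quoted MTBF
REBUILD_HOURS = 5
SURVIVING_DISKS = 12       # other members of a 13-disk RAID-5 set
DRIVES = 1440              # about 1 PB of 750 GB drives
AFR = 0.0073               # 0.73% annual failure rate, 24/7
YEARS = 2


def p_second_failure(mtbf_hours, rebuild_hours, surviving_disks):
    """Naive chance that another member fails while one is being rebuilt."""
    return rebuild_hours * surviving_disks / mtbf_hours


def p_any_double_failure(n_failures, p_each):
    """Chance that at least one of n rebuilds sees a second failure."""
    return 1 - (1 - p_each) ** n_failures


expected_failures = DRIVES * AFR * YEARS                       # about 21
p_each = p_second_failure(MTBF_HOURS, REBUILD_HOURS, SURVIVING_DISKS)
p_any = p_any_double_failure(round(expected_failures), p_each)
```

Here `p_each` works out to 1/20,000 and `p_any` to roughly 1 in 950, which matches the "just about 3 nines" above.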

- bill

dg...@barnowl.research.intel-research.net

Sep 19, 2006, 5:54:15 PM

nos...@ab-katrinedal.dk (Niels Jørgen Kruse) writes:

> comp...@patten-glew.net <Andy...@gmail.com> wrote:
>
> > Niels Jørgen Kruse wrote:
> > > Partly that is probably due to USEnet being very conservative. Over the
> > > last 10 years, the only development I see is character sets above 7 bits
> > > getting accepted. No rich content (except for binary newsgroups which
> > > have terrible retention) and that is connected to the lack of a
> > > newsreader war.
> >
> > Save USEnet! Start posting in XHTML with SVG.
>
> Joking? If you enhance the USEnet reading experience to give what
> webboards do, I think it might be possible to turn the tide.

I'm curious what you mean. My reading experience with webboards is a *lot*
worse than Usenet+Emacs/Gnus (lack of history, lack of kill files, annoying
interface, all webboards are different, etc).

--
David Gay
dg...@acm.org

David Kanter

Sep 19, 2006, 6:10:20 PM

So what are the features (aside from those you mentioned) that you
think are great for usenet?

Killfiles can be nice, history is essential.

DK

Niels Jørgen Kruse

Sep 19, 2006, 6:12:53 PM

<dg...@barnowl.research.intel-research.net> wrote:

I agree, but why are so many who disappeared from USEnet now posting on
webboards?

dg...@barnowl.research.intel-research.net

Sep 19, 2006, 6:29:21 PM

Well I don't know of course, but I could hazard some guesses:
- Usenet requires using special software
- "User experience" doesn't mesh with standard web browsing habits
- "Everybody else has left" (the critical mass effect (*))
- "What's Usenet?" (doesn't apply to people who left, but contributes to
the critical mass bit)
- Webboards have better search (you have to know to go to groups.google.com
for usenet...)

If you want better than my random guesses, hire somebody experienced in
doing surveys ;-)

*: This is the main reason I read most of my photo-related stuff at
www.dpreview.com rather than rec.photo.* these days, so may be the
only valid reason in the list above ;-)

--
David Gay
dg...@acm.org

dg...@barnowl.research.intel-research.net

Sep 19, 2006, 6:31:06 PM

"David Kanter" <dka...@gmail.com> writes:

More flexible/better UI.

"Everything in one place".

--
David Gay
dg...@acm.org

Del Cecchi

Sep 19, 2006, 9:55:19 PM

"David Kanter" <dka...@gmail.com> wrote in message
news:1158703820.4...@b28g2000cwb.googlegroups.com...

DK
What do you mean by "history"? Old threads still available?

ken...@cix.compulink.co.uk

Sep 19, 2006, 10:19:51 PM

In article
<HcudnamZTde-I5LY...@metrocastcablevision.com>,
bill...@metrocast.net (Bill Todd) wrote:

> That is certainly a legitimate question, and I do not know the
> answer. If one takes at face value the number offered up by
> manufacturers for the incidence of unreadable sectors

Since the IDE drive was introduced there have been spare sectors
and automatic remapping by the drive firmware. You will not get
detectable unreadable sectors before the drive runs out of
spares. I can remember the first home hard drives where you got a
list of bad sectors and had to map them out yourself. Visible bad
sector incidence will depend both on failure rate and the number
of spares.

Ken Young

David Kanter

Sep 19, 2006, 10:55:30 PM

Del Cecchi wrote:
> "David Kanter" <dka...@gmail.com> wrote in message
> news:1158703820.4...@b28g2000cwb.googlegroups.com...
>
> dg...@barnowl.research.intel-research.net wrote:
> > nos...@ab-katrinedal.dk (Niels Jørgen Kruse) writes:
> >
> > > comp...@patten-glew.net <Andy...@gmail.com> wrote:
> > >
> > > > Niels Jørgen Kruse wrote:

[snip]

> What do you mean by "history"? Old threads still available?

No, I mean that you can tell which threads/messages you haven't read
yet and which you have. When I use Google Groups (my preferred news
reader) it tells me which messages I have read and which ones I have
not. Very convenient.

DK

Del Cecchi

Sep 19, 2006, 11:08:45 PM

"David Kanter" <dka...@gmail.com> wrote in message

news:1158720930....@m7g2000cwm.googlegroups.com...

Del Cecchi wrote:
> "David Kanter" <dka...@gmail.com> wrote in message
> news:1158703820.4...@b28g2000cwb.googlegroups.com...
>
> dg...@barnowl.research.intel-research.net wrote:
> > nos...@ab-katrinedal.dk (Niels Jørgen Kruse) writes:
> >
> > > comp...@patten-glew.net <Andy...@gmail.com> wrote:
> > >
> > > > Niels Jørgen Kruse wrote:

[snip]
There are webforums that do that already. See the forums at
http://www.fishingminnesota.com as an example. You might have to
register to have that feature work.

Bill Todd

Sep 20, 2006, 1:15:20 AM

ken...@cix.compulink.co.uk wrote:
> In article
> <HcudnamZTde-I5LY...@metrocastcablevision.com>,
> bill...@metrocast.net (Bill Todd) wrote:
>
>> That is certainly a legitimate question, and I do not know the
>> answer. If one takes at face value the number offered up by
>> manufacturers for the incidence of unreadable sectors
>
> Since the IDE drive was introduced there have been spare sectors
> and automatic remapping by the drive firmware. You will not get
> detectable unreadable sectors before the drive runs out of
> spares.

Are you saying that when the drive finds an unreadable sector and has
spares available it will just remap it to a new sector, fill that sector
with, say, nulls, and *not tell higher-level software that anything went
wrong*? I know it may remap sectors that are *becoming* hard to read
*before* they become unreadable and not return any error, and I know
that it may remap sectors that it has difficulty writing to without
returning any error, but these situations don't lose your data without
telling you about it.

If Seagate & friends count quietly remapped sectors as unreadable, then
the difference between their specs and observed behavior makes more
sense - and indeed the observed behavior is what should be used to
calculate probability of data loss (because no data is lost if a sector
gets remapped *before* useful data in it has become unreadable).

- bill

Andy...@gmail.com

Sep 20, 2006, 1:27:28 AM

> > > > > Niels Jørgen Kruse wrote:
> > > > Joking? If you enhance the USEnet reading experience to give what
> > > > webboards do, I think it might be possible to turn the tide.
> > >

> > > David Gay?

> > > I'm curious what you mean. My reading experience with webboards is a *lot*
> > > worse than Usenet+Emacs/Gnus (lack of history, lack of kill files, annoying
> > > interface, all webboards are different, etc).
> >

> > Niels Jørgen Kruse wrote:
> > So what are the features (aside from those you mentioned) that you
> > think are great for usenet?
> >

> > David Kanter:

> > Killfiles can be nice, history is essential.
>

> David Gay:

> More flexible/better UI.
>
> "Everything in one place".

I'm not sure if I tracked the attributions right. Since it appeared
that I said "Joking" in the post I replied to, I know that something
was screwed up.

Anyway... I'm not sure who asked "Joking?" about what. I was serious
when I said XHTML and SVG.

But I agree with David Gay(?) I hate webboards, for the reasons he
mentions. Interestingly, this is one reason why I rarely participate
in website based communities such as realworldtech (for David Kanter).

I find Usenet+Emacs/Gnus much more flexible. One interface, that I can
use for email, news, website browsing...

I haven't found Google Groups anywhere near so good. Google reader
helps, but, again, it is not very sophisticated. E.g., so far as I can
tell, Google reader has no thread mode; neither has the equivalent of
Gnus topic mode.

However, I may fall back to Google groups for the reason mentioned
earlier: ubiquitous access, with shared state.

I have hope that something like a good user interface could be designed
using the Google APIs.

Andy...@gmail.com

Sep 20, 2006, 1:36:35 AM

> I wonder if there is added value in having a picture (maybe as a
> thumbnail one could click to get the big version) embedded in the text,
> instead of simply embedding a link to the picture?
> (When using Google to read news links are automatically made
> 'click-able'.)

Duh...

If I am reading the newsgroup on an airplane, I cannot click through
the link to see the big picture.

And although I am now using Google Groups, it won't be forever.

Disconnected operation matters.

---

Heck, more and more I am seeing the following pattern:

send the link AND the object

use the object in case disconnected

use the link for updates.

---

BTW, SVG drawings are small. Bandwidth is not an issue, for what we
would see in comp.arch.

---

Heck, legal concerns: you can't have a picture sent in the message
removed from a website.

Andy...@gmail.com

Sep 20, 2006, 1:38:38 AM

Niels Jørgen Kruse wrote:

> > Save USEnet! Start posting in XHTML with SVG.
>
> Joking? If you enhance the USEnet reading experience to give what
> webboards do, I think it might be possible to turn the tide.
>
> To gain acceptance of anything new, it has to be readable by people
> using old software, however. This means that inline formatting
> directives should be avoided and anything bulky should be hosted rather
> than inlined. Links to things like animated avatars could go in headers
> that you don't see in normal use. Formatting directives might be compact
> enough to fit in a header line if not overused.

Not joking. Reading backwards timewise to catch up.

SVG is not bulky.

Hosted sucks for people not connected at reading time.

Ah, heck: hosted sucks. Period.

David Kanter

Sep 20, 2006, 1:39:23 AM

Andy:

> I haven't found Google Groups anywhere near so good. Google reader
> helps, but, again, it is not very sophisticated. E.g., so far as I can
> tell, Google reader has no thread mode; neither has the equivalent of
> Gnus topic mode.

It does have a mode where you can see a threaded view of every message
in a given 'topic'. What you want to do first is in the main c.a page
select 'Viewing titles only' in the upper right. Then open a thread,
once inside the thread click on 'view as tree'.

That's about the closest thing to a civilized UI you can get...

DK

Andy...@gmail.com

Sep 20, 2006, 1:50:05 AM

Paul Aaron Clayton wrote:
> >If all computer uses were to be hosted by Google tomorrow,
> >magically, then the PC microprocessor marketplace would be 1/3rd
> >or less than the current size.
>
> But would software eventually increase in processing demand to
> equal capacity?

Maybe eventually. But it would be a major blow to microprocessor
companies in the short and medium term.

Also, the playing field would be levelled. Google would be much more
willing to switch vendors, than 300 million individual PC users.

Terje Mathisen

Sep 20, 2006, 2:24:59 AM

The key problem seems to be this:

For a once a week data scrubbing job to really work, it must detect a
single unreadable sector on one disk, regenerate that data and write it
back to the original failing disk, which in the meantime has remapped the
sector, right?

If the RAID controller instead treats one such unreadable sector as a
'total disk failure' and disables the drive before starting to build up
a replacement, then we'll decrease the total MTBF significantly, while
at the same time wasting a lot of otherwise perfectly usable drives. :-(
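
The first behaviour (repair in place instead of failing the whole drive) can be sketched as a toy model in Python. This is not how any particular RAID controller or the SATABeast works; it assumes single XOR parity per stripe, models an unreadable sector as `None`, and all names are hypothetical.

```python
from functools import reduce
from typing import List, Optional

Block = Optional[bytes]   # None models a sector the drive reports as unreadable


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together (single-parity, RAID-5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(stripe: List[Block]) -> str:
    """Verify one stripe; repair in place if exactly one block is unreadable."""
    bad = [i for i, blk in enumerate(stripe) if blk is None]
    if not bad:
        return "ok"
    if len(bad) > 1:
        return "lost"                     # single parity cannot rebuild two blocks
    good = [blk for blk in stripe if blk is not None]
    # Regenerate the missing block from the survivors and write it back to the
    # same disk; a real drive would be expected to remap the sector on this write.
    stripe[bad[0]] = xor_blocks(good)
    return "repaired"


# Toy stripe: three data blocks plus their parity, with one block unreadable.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = data + [xor_blocks(data)]
stripe[1] = None
result = scrub_stripe(stripe)
```

The drive stays in the array after the repair, which is the difference from the second behaviour, where one bad sector would cost a whole-disk rebuild.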

Terje Mathisen

Sep 20, 2006, 2:30:06 AM

Bill Todd wrote:
> ken...@cix.compulink.co.uk wrote:
>> In article
>> <HcudnamZTde-I5LY...@metrocastcablevision.com>,
>> bill...@metrocast.net (Bill Todd) wrote:
>>
>>> That is certainly a legitimate question, and I do not know the
>>> answer. If one takes at face value the number offered up by
>>> manufacturers for the incidence of unreadable sectors
>>
>> Since the IDE drive was introduced there have been spare sectors and
>> automatic remapping by the drive firmware. You will not get detectable
>> unreadable sectors before the drive runs out of spares.
>
> Are you saying that when the drive finds an unreadable sector and has
> spares available it will just remap it to a new sector, fill that sector
> with, say, nulls, and *not tell higher-level software that anything went
> wrong*? I know it may remap sectors that are *becoming* hard to read
> *before* they become unreadable and not return any error, and I know
> that it may remap sectors that it has difficulty writing to without
> returning any error, but these situations don't lose your data without
> telling you about it.

My naive belief was that of course they would remap any sector which
required multiple retries and/or ECC help to read it, but that an actual
unrecoverable read failure would _never_ be silently hidden.

>
> If Seagate & friends count quietly remapped sectors as unreadable, then
> the difference between their specs and observed behavior makes more
> sense - and indeed the observed behavior is what should be used to
> calculate probability of data loss (because no data is lost if a sector
> gets remapped *before* useful data in it has become unreadable).

And this is why scrubbing can help: It sort of takes the place of
regular tape-to-tape copying in a large tape vault, or signal
regeneration in a long transmission link: It recovers all readable
information, then regenerates a pristine signal if/when you were getting
close to the S/N floor.

Seongbae Park

Sep 20, 2006, 2:39:55 AM

I think it's a matter of time before
somebody makes use of "intelligent" proxies
running locally in your computer when disconnected from the internet
to service all those online services offline.
For static content that is cached, mozilla/firefox offline mode does
that already, so hosted SVG won't be a problem if we add some way to
make sure certain data is cached and not purged.
With aggressive prefetching and sufficiently large local scratch disk
space, it won't be a big problem once people have sufficient bandwidth.
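
A rough sketch of the "cached and not purged" idea, assuming a plain least-recently-used cache with an extra pin flag. `PinnedLRUCache` and its methods are hypothetical names and do not correspond to anything in mozilla/firefox:

```python
from collections import OrderedDict


class PinnedLRUCache:
    """LRU cache in which pinned entries are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # url -> body, least recently used first
        self._pinned = set()

    def put(self, url, body, pin=False):
        self._items[url] = body
        self._items.move_to_end(url)  # mark as most recently used
        if pin:
            self._pinned.add(url)
        self._evict()

    def get(self, url):
        if url not in self._items:
            return None  # a real proxy would fetch here, if online
        self._items.move_to_end(url)
        return self._items[url]

    def _evict(self):
        # Drop least recently used unpinned entries until within capacity.
        # If everything is pinned, the cache is allowed to exceed capacity.
        for url in list(self._items):
            if len(self._items) <= self.capacity:
                break
            if url not in self._pinned:
                del self._items[url]


cache = PinnedLRUCache(capacity=2)
cache.put("diagram.svg", "<svg/>", pin=True)  # must survive offline
cache.put("page1.html", "one")
cache.put("page2.html", "two")                # evicts page1.html, not the SVG
```

Prefetching would then amount to calling `put` ahead of time for anything likely to be read while disconnected.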

Seongbae

Bill Todd

Sep 20, 2006, 4:06:59 AM

Terje Mathisen wrote:

...

> For a once-a-week data scrubbing job to really work, it must detect a
> single unreadable sector on one disk, regenerate that data and write it
> back to the original failing disk, which in the meantime has remapped the
> sector, right?
>
> If the RAID controller instead treats one such unreadable sector as a
> 'total disk failure' and disables the drive before starting to build up
> a replacement, then we'll decrease the total MTBF significantly, while
> at the same time wasting a lot of otherwise perfectly usable drives. :-(

Then the solution is to get oneself a decent RAID controller: I'm
reasonably sure that in my whirlwind tour of such matters over the past
24 hours I've encountered a reference that states that at least *some*
RAID controllers are sufficiently suave to treat unreadable sectors in
an intelligent manner.

- bill

Terje Mathisen

Sep 20, 2006, 4:25:35 AM

Bill Todd wrote:
> Just to check, Seagate specs their current SATA drives with an AFR of
> 0.73% 24/7 (interestingly, with an AFR of 0.34% when they omit the 24/7
> qualification), or 1,200,000 hour MTBF. This suggests that the (naive)
> probability of encountering a second whole-disk failure during a 5-hour
> rebuild of a failed disk in a 13-disk RAID-5 array would be 1/20,000
> (the reciprocal of (single-disk
> MTBF/(number-of-rebuild-hours*number-of-participating-disks))). For a 1
> PB storage system you'd need about 1440 750 GB drives in the
> configuration you specify, so you'd expect about 21 of them to fail over
> a 2-year period, with a probability of about 1/952 that one of those
> failures would encounter a second disk failure during the 5-hour
> rebuild: just about 3 nines, though you did say a fractional PB system
> rather than a full PB and for 1 or 2 years rather than a full 2 years.

Right, these numbers match what I got.
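
For anyone who wants to redo the arithmetic in the quoted paragraph, here is a small sketch. It assumes a constant failure rate (the same naive model as the quote) and takes the "participating disks" to be the 12 survivors of the 13-disk array, which is what the 1/20,000 figure implies. The function names are made up for illustration:

```python
HOURS_PER_YEAR = 8760


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF, constant-rate model."""
    return HOURS_PER_YEAR / mtbf_hours


def p_second_failure(mtbf_hours, rebuild_hours, surviving_disks):
    """Naive chance that a surviving disk fails while the rebuild runs."""
    return rebuild_hours * surviving_disks / mtbf_hours


mtbf = 1_200_000                      # hours, manufacturer spec from the quote
afr = afr_from_mtbf(mtbf)             # about 0.73% per year
p_rebuild = p_second_failure(mtbf, rebuild_hours=5, surviving_disks=12)

drives, years = 1440, 2
expected_failures = drives * afr * years          # about 21 drives
p_double = expected_failures * p_rebuild          # roughly 1 in 950

print(f"AFR               = {afr:.2%}")
print(f"P(2nd failure)    = 1/{1 / p_rebuild:,.0f} per rebuild")
print(f"expected failures = {expected_failures:.1f} over {years} years")
print(f"P(double failure) = 1/{1 / p_double:,.0f}")
```

The quoted 1/952 comes from rounding the expected failures to 21 before multiplying; the unrounded value differs only in the last digit.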

>
> That mistake did not affect the analysis of the impact of unreadable
> sectors on rebuild operations, though: using the specced bit error rate
> you'd still be off by 5 orders of magnitude (4 orders if you assume that
> the specced value is about 1 order too pessimistic), which is still an
> awful lot for scrubbing to take care of.

I agree.

Since the SATA enclosures I wrote about have been in production for more
than a year, with zero actual data loss (operator error deleted a file
system at one customer. That's what offsite async mirroring is for,
right?), it would seem that at least part of the rebuild analysis must
be wrong.

If the situation was as bleak as your numbers indicate, then N+2
redundancy should have taken over completely by now.

Bill Todd

Sep 20, 2006, 5:23:42 AM

Terje Mathisen wrote:

...

> If the situation was as bleak as your numbers indicate, then N+2
> redundancy should have taken over completely by now.

The question is where on the continuum between my numbers and your
suspicions reality lies. I find it easy to accept that disk-level
proactive transparent revectoring reduces the incidence of *observable*
unreadable sectors by somewhat more than an order of magnitude (as Jim
Gray's observations and others suggest), and that scrubbing further
reduces the practical impact of that level of errors by another 1+ order
of magnitude (as the Baker talk suggested - I'm going through a later
paper she co-authored that may discuss the issue in more detail) - but
that still leaves 2+ orders of magnitude more to go before reaching your
'4 nines' reliability estimate (though since Jim Gray's figures suggest
that real-world MTBF numbers are an order of magnitude lower than
manufacturers' specs indicate, that reduces reliability of your proposed
configuration by two orders of magnitude right there, making it look
more like a '1-to-2 nine' system based on whole-disk failure rates alone).

Large RAID-5 arrays (above 5 - 6 disks) are usually avoided for
precisely such reasons.

- bill

Bill Todd

Sep 20, 2006, 6:24:06 AM

Terje Mathisen wrote:
> Bill Todd wrote:

...

>> If Seagate & friends count quietly remapped sectors as unreadable,
>> then the difference between their specs and observed behavior makes
>> more sense - and indeed the observed behavior is what should be used
>> to calculate probability of data loss (because no data is lost if a
>> sector gets remapped *before* useful data in it has become unreadable).
>
> And this is why scrubbing can help: It sort of takes the place of
> regular tape-to-tape copying in a large tape vault, or signal
> regeneration in a long transmission link: It recovers all readable
> information, then regenerates a pristine signal if/when you were getting
> close to the S/N floor.

Just to attempt to bring some level of closure to this lengthy and
interesting discussion, I finally found a more recent paper by Mary
Baker et al. ("A Fresh Look at the Reliability of Long-term Digital
Storage", EuroSys '06, April, 2006) that was reviewed by some *very*
experienced people and seems to cover the issue pretty well (though its
'internet archive' case examination assumes that a
single error makes an entire file copy unusable, which seems to be
an artifact of a sub-optimal implementation - hence I won't report
those numbers). The entire paper seems worthwhile reading for anyone
sufficiently interested to have pursued this discussion to this point,
but the particularly salient findings for a simple mirrored disk pair
with disk MTBF = 120,000 hours (consistent with Jim Gray's and the
authors' real-world experience rather than the manufacturer's spec) and
a disk bit error rate of 1 sector every 10^15 bits read (i.e.,
apparently allowing for the beneficial effect of the disk-level
automatic revectoring and, again, fairly consistent with their
real-world experience) are:

1. A naive analysis completely ignoring problems caused by unreadable
sectors yielded a mean time to data loss of 1,200,000 years.

2. Factoring in the issue of unreadable sectors reduced the MTTDL to
9.5 years (this is the manifestation of the problem with 'simply'
'classic RAID' which I originally highlighted).

3. Scrubbing every 4 months increased the MTTDL to 795 years.

4. The form of the equation they derived suggests that MTTDL is roughly
inversely proportional to the scrubbing frequency - save for any adverse
effect that more frequent scrubbing may have on error and/or failure rates.

Thus to a first approximation scrubbing that disk pair every week would
increase the MTTDL to about 14,000 years. Since a mirrored disk pair
has reliability characteristics similar to a 2-disk parity RAID array,
and assuming (without going through the relevant numbers, which could be
dangerous in this case but I don't have time to do so now) that, as in
the case of whole-disk failure consequences, parity RAID reliability is
roughly inversely proportional to the square of the array's disk count,
the MTTDL for a single 12 + 1 RAID-5 array would then be about 333 years
and the MTTDL for a system comprising multiple such arrays would be
inversely proportional to their number (i.e., not very good for a system
supporting any significant fraction of 1 PB; I've ignored any effects
from the paper's use of 200 GB drives rather than the 750 GB drives now
available).
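
The scaling steps in the paragraph above can be sketched as follows. This only reproduces the proportionality assumptions stated in the post (MTTDL inversely proportional to scrub interval, to the square of the disk count, and to the number of arrays). It is not the equation from the paper, the function names are invented, "4 months" is taken as a third of a year, and the count of 110 arrays is an illustrative guess based on the roughly 1440 drives mentioned earlier in the thread:

```python
def scale_for_scrub_interval(mttdl_years, old_interval_days, new_interval_days):
    """MTTDL taken as inversely proportional to scrubbing frequency."""
    return mttdl_years * old_interval_days / new_interval_days


def scale_for_disk_count(mttdl_pair_years, disks_in_array, disks_in_pair=2):
    """Assumption from the post: MTTDL falls with the square of disk count."""
    return mttdl_pair_years * (disks_in_pair / disks_in_array) ** 2


def scale_for_array_count(mttdl_array_years, n_arrays):
    """MTTDL of a system of independent arrays, inversely proportional to n."""
    return mttdl_array_years / n_arrays


four_months = 365 / 3                 # days, approximation
weekly_pair = scale_for_scrub_interval(795, four_months, 7)
raid5_12_plus_1 = scale_for_disk_count(weekly_pair, disks_in_array=13)
system = scale_for_array_count(raid5_12_plus_1, n_arrays=110)  # hypothetical

print(f"mirrored pair, weekly scrub: {weekly_pair:,.0f} years")
print(f"one 12 + 1 RAID-5 array:     {raid5_12_plus_1:,.0f} years")
print(f"110 such arrays:             {system:.1f} years")
```

Carrying the unrounded weekly figure through gives a slightly lower per-array number than the rounded 14,000-year starting point does, but it lands in the same neighbourhood as the estimate above.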

- bill

Terje Mathisen

Sep 20, 2006, 7:52:44 AM

Bill Todd wrote:
> Just to attempt to bring some level of closure to this lengthy and
> interesting discussion, I finally found a more recent paper by Mary
> Baker et al. ("A Fresh Look at the Reliability of Long-term Digital
> Storage", EuroSys '06, April, 2006) that was reviewed by some *very*

Thanks!!!

I found a PDF copy and I'm currently re-reading it, after forwarding
copies to several people in-house.

Niels Jørgen Kruse

Sep 20, 2006, 9:51:42 AM

<dg...@barnowl.research.intel-research.net> wrote:

The critical mass effect is what caused me to take up reading webboards.
You have to go where the good stuff is posted.

USEnet postings will show up in regular Google searches, so people would
naturally get to Google Groups without anyone telling them about it.
They can start using it if they like it.

Niels Jørgen Kruse

Sep 20, 2006, 9:51:42 AM

comp...@patten-glew.net <Andy...@gmail.com> wrote:

> SVG is not bulky.

Just don't stick it inline in the text I am reading, unless it is
properly rendered.

> Hosted sucks for people not connected at reading time.
>
> Ah, heck: hosted sucks. Period.

It is no worse than a juicy link to a .pdf that you have to postpone
getting.

Anne & Lynn Wheeler

Sep 20, 2006, 10:22:33 AM

"Seongbae Park" <Seongb...@gmail.com> writes:
> I think it's a matter of time before
> somebody makes use of "intelligent" proxies
> running locally in your computer when disconnected from the internet
> to service all those online services offline.
> For static content that is cached, mozilla/firefox offline mode does
> that already, so hosted SVG won't be a problem if we add some way to
> make sure certain data is cached and not purged.
> With aggressive prefetching and sufficiently large local scratch disk
> space, it won't be a big problem once people have sufficient bandwidth.

for a couple years now, i've been using mozilla/firefox tab folders to
read news. i have a bookmark tab folder with a 100 or so news sites.
i click on the folder and it starts fetching all 100 URLs concurrently
into different tabs. i can go get another cup of coffee while it is
doing its thing. i no longer run into vagaries of synchronous click/delay.
as i'm scanning news, i can click on news story url ... which is
brought asynchronously into background tab. when i'm finished scanning
the news sites (deleting the tab as i finish) ... i then have all the
specific news stories ready (having been brought asynchronously into
the background tabs). early on, there were some performance issues
with having several hundred open tabs ... but tab support has gotten
quite a bit better.

even with significantly higher bandwidth (dialed vis-a-vis broadband)
there can still be issues with remote site load and/or infrastructure
congestion.

i've mostly used emacs gnus for quite a while for newsreading ... and
it tries to do some heuristic asynchronous read-ahead (and does support
offline processing).