
Google and scale: weblogs to the rescue


Prentiss Riddle

Apr 10, 2002, 12:09:42 PM
Information architecture guru Lou Rosenfeld wonders whether Google
will scale, given the increasing cost of finding useful new pages
from among the old:

http://louisrosenfeld.com/home/bloug_archive/000086.html

His correspondents answer that Google's ability to frequently spider
weblogs and other "news" pages offsets this problem -- just the
opposite of the doomsday "Google Time Bomb" theory.

Links and discussion at:

http://www.io.com/~riddle/toys/?item=20020410

-- Prentiss Riddle ("aprendiz de todo, maestro de nada")
-- rid...@io.com / http://www.io.com/~riddle/

Bart

Apr 11, 2002, 8:19:01 AM
The news crawling ability does not give them the edge required to
keep their main index up to date. If they do in fact have a scaling
problem it will still exist with the news crawler.

Bart

Richard Wheeldon

Apr 11, 2002, 3:24:09 PM
Bart wrote:
> The news crawling ability does not give them the edge required to
> keep their main index up to date. If they do in fact have a scaling
> problem it will still exist with the news crawler.

For what it's worth, here's my take on the scalability issue:

1. The web is growing at an exponential rate.
2. So is computing power.

Therefore, if the rate is roughly the same in both cases, then
the cost of indexing one Nth of the web (and hence the burn rate
for this part of the company) should remain pretty consistent.

This is assuming that the algorithms are roughly linear. Given
that the tasks are basically:

1. Index the content (linear)
2. Build an index - O(N log N) - index building implies sorting,
so it can't be better than this.
3. Compute PageRank - roughly linear.

the situation is slightly worse. However, the number of people
using the web is also increasing exponentially (although this
will probably start slowing in a few years), and the difficulty
in finding content is also increasing, so increasing demand
should easily offset costs, assuming they can find a suitable
business model, which they seem to have.
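
A rough back-of-envelope sketch of that argument in Python (the page
counts and the doubling schedule are illustrative assumptions, not
measured figures; the point is just that the linear steps stay flat per
unit of compute, while the N log N index build drifts upward slowly):

import math

def relative_cost(pages, compute):
    """Machine-time (arbitrary units) to index `pages` documents
    with `compute` units of processing power."""
    crawl_and_parse = pages                   # step 1: linear
    index_build = pages * math.log2(pages)    # step 2: sort-based, N log N
    pagerank = pages                          # step 3: roughly linear
    return (crawl_and_parse + index_build + pagerank) / compute

pages, compute = 1e9, 1.0                     # assume both double each period
for period in range(5):
    print(f"period {period}: {relative_cost(pages, compute):.3e}")
    pages *= 2
    compute *= 2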

Richard

Bart

Apr 12, 2002, 11:01:33 AM
Here's the problem with your assertion: graph the following trends for
the last 20 years:

CPU SPEED
DISK SIZE (for an avg pc)
RAM SIZE
DISK TRANSFER RATE
BUS TRANSFER RATE

What you'll see is that while CPUs are in fact getting exponentially faster,
the speed increase is not nearly in proportion to the growth rate of the
average disk size (which has grown _very_ exponentially). In addition to
this, the bus speeds and disk transfer rates are increasing at a terribly
slow rate with respect to CPU and disk size. It's reasonably safe to assume that
internet content will grow at a rate somewhat proportional to avg. disk size
multiplied by the increase in the number of web sites.

If you add one more factor to the graph: Avg Internet transfer rate, you'll
see the true nature of their problem. The internet has actually gotten slower
over time, not faster. Avg HTTP xfer rates used to be in excess of 16KB/sec,
now they're much less than 8KB/sec.

Your math assumes perfect, RAM-only conditions. Google uses 10-20 thousand
machines to index and search their data. Their theory is that they'll keep
adding machines in proportion to network content size. Their model has a
limitation though; at a certain point the inter-process communication required
to conduct the searches will overwhelm the benefit of the distributed processing
power.

Assume that the average web page contains 2000 unique indexable strings.
If you index 1 million new dynamic pages every day then you'd have to perform
in excess of 23 thousand update transactions per second against the index.
Since you can't index the entire web at once, you must perform an update
on the existing index data. This implies the math behind a disk-based
merge and not an in-RAM sort. The index size is almost as large as the
original data and all of this must be sucked through the relatively slow
disk and bus data straw.
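
Spelling out the arithmetic behind that figure (2,000 strings per page
and 1 million new pages per day are the assumptions stated above):

strings_per_page = 2000
new_pages_per_day = 1000000
seconds_per_day = 24 * 60 * 60          # 86,400

updates_per_second = strings_per_page * new_pages_per_day / seconds_per_day
print(round(updates_per_second))        # ~23,148 update transactions/sec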

The reason they use a separate set of machines to index dynamically changing
content is to avoid the cost of constant merges with the astronomically
large main-index. The Rube-Goldburg solution to providing a reasonably accurate
search is to conduct a meta-search across both indexes and then merge the results.
It's a workable short-term solution, but in the long term it will become
troublesome for them as well.
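
A hypothetical sketch of that two-index meta-search in Python (the
TinyIndex class, its scoring, and the example URLs are illustrative
stand-ins, not anything Google has published):

from heapq import merge

class TinyIndex:
    """Toy inverted index: maps a term to a list of (score, url) hits."""
    def __init__(self, postings):
        self.postings = postings
    def search(self, term):
        # return hits best-score-first
        return sorted(self.postings.get(term, []), reverse=True)

def meta_search(term, main_index, fresh_index, k=10):
    """Query the big main index and the small fresh-content index,
    then merge the two score-sorted result lists into one top-k list."""
    hits = merge(main_index.search(term), fresh_index.search(term),
                 key=lambda hit: -hit[0])
    results, seen = [], set()
    for score, url in hits:
        if url not in seen:              # de-duplicate across the two indexes
            seen.add(url)
            results.append((score, url))
        if len(results) == k:
            break
    return results

main = TinyIndex({"weblog": [(0.9, "http://example.org/a"),
                             (0.4, "http://example.org/b")]})
fresh = TinyIndex({"weblog": [(0.7, "http://example.org/new")]})
print(meta_search("weblog", main, fresh, k=3))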

The truth will come out as soon as Google "comes out of the closet" and
goes public. I pretty strongly suspect that they will have to cut their
costs dramatically in order to survive the market. I believe their business
agenda is to stomp all other engines out of existence by providing
exceedingly high quality results for as long as they aren't subject to
financial accountability. They believe they'll win in the long term
by being the only player left, I think. This is akin to Japan's steel dumping
practice that squished the American companies out of biz.

BTW: Computing PageRank is a log algorithm, not linear (1 for your side!)

Thanks,

Bart

Richard Wheeldon

Apr 16, 2002, 6:28:03 AM
Bart wrote:
> Here's the problem with your assertion: graph the following trends for
> the last 20 years:

Do you have the stats for any of these available? The ones I've found
are limited, i.e. they don't go back that far and don't cover bus
transfer rates.

> CPU SPEED
> DISK SIZE (for an avg pc)

We're not really considering 'average' here but the difference should
be negligible.

> RAM SIZE
> DISK TRANSFER RATE
> BUS TRANSFER RATE

I hadn't really considered bus transfers. Good point.

> What you'll see is that while CPUs are in fact getting exponentially faster,
> the speed increase is not nearly in proportion to the growth rate of the
> average disk size (which has grown _very_ exponentially).

True.

> If you add one more factor to the graph: Avg Internet transfer rate, you'll
> see the true nature of their problem. The internet has actually gotten slower
> over time, not faster. Avg HTTP xfer rates used to be in excess of 16KB/sec,
> now they're much less than 8KB/sec.

Where did you get these stats from? Bear in mind that the content most
in need of refreshing is likely to be easier to download (e.g. news
sites, etc.)

> Your math assumes perfect, RAM-only conditions. Google uses 10-20 thousand
> machines to index and search their data. Their theory is that they'll keep
> adding machines in proportion to network content size. Their model has a
> limitation though; at a certain point the inter-process communication required
> to conduct the searches will overwhelm the benefit of the distributed processing
> power.

Not sure I agree with this argument. If we switch companies for a moment
(I believe that if this argument applies to one search engine, it applies
to all. If you disagree, please tell me what makes Google a special case),
Altavista also has many servers, yet the entire index will fit on only 12.
Google, I imagine, runs a similar setup - the vast bulk of the servers are
only for handling load, with the index comfortably sitting on a small
number, so the inter-process communication scales with this small number.

> Assume that the average web page contains 2000 unique indexable strings.
> If you index 1 million new dynamic pages every day then you'd have to perform
> in excess of 23 thousand update transactions per second against the index.
> Since you can't index the entire web at once, you must perform an update
> on the existing index data. This implies the math behind a disk-based
> merge and not an in-RAM sort. The index size is almost as large as the
> original data and all of this must be sucked through the relatively slow
> disk and bus data straw.

Good point.

> The reason they use a separate set of machines to index dynamically changing
> content is to avoid the cost of constant merges with the astronomically
> large main-index. The Rube-Goldburg solution to providing a reasonably accurate
> search is to conduct a meta-search across both indexes and then merge the results.
> It's a workable short-term solution, but in the long term it will become
> troublesome for them as well.

Rube-Goldburg?

> The truth will come out as soon as Google "comes out of the closet" and
> goes public. I pretty strongly suspect that they will have to cut their
> costs dramatically in order to survive the market. I believe their business
> agenda is to stomp all other engines out of existence by providing
> exceedingly high quality results for as long as they aren't subject to
> financial accountability. They believe they'll win in the long term
> by being the only player left, I think. This is akin to Japan's steel dumping
> practice that squished the American companies out of biz.
>
> BTW: Computing PageRank is a log algorithm, not linear (1 for your side!)

Sadly not. The number of iterations is actually much less than log(N),
but each iteration involves a second iteration through every link in
the graph, hence slightly worse than linear.
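
For reference, a sketch of the published power-iteration form of
PageRank (the public algorithm only, not whatever Google actually runs):
each pass touches every link in the graph once, and only a modest number
of passes is needed, which is where "slightly worse than linear" comes
from.

def pagerank(links, damping=0.85, passes=30):
    """links: {page: [pages it links to]} -> {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(passes):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, outs in links.items():       # every link visited once
            targets = outs if outs else pages # dangling page: spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))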

Richard

Bart

Apr 17, 2002, 7:00:02 PM
Richard,

I am retired now, but was in the search engine biz from the early 80's
until last Nov. I was drawing on factoids from my career.

The internet data xfer rates I quoted are from a search engine's crawler
that has been in operation for a very long time (pre-Netscape).

Computing PageRank is only linear on a one-term query. In multi-term
queries, the calculations are not. The answer you had below is only true
for the published methods, but few of the actually useful algorithms
are published anymore. Too much IP ripoff. Look at it as nested ordered lists
instead of a graph.

The methods used by Google are _very_ distinct from those of Altavista.
I don't think I would generalize across them.

I didn't spell Rube-Goldberg correctly. For more info on him:
http://www.rgmc.com/

Thanks,
Bart

Richard Wheeldon

Apr 23, 2002, 7:06:12 PM
Bart wrote:
> Computing PageRank is only linear on a one-term query. In multi-term
> queries, the calculations are not. The answer you had below is only true
> for the published methods, but few of the actually useful algorithms
> are published anymore. Too much IP ripoff. Look at it as nested ordered lists
> instead of a graph.

Can you expand on this? As I understood it, PageRank is a single
precomputed number per page. What do multi-term queries have to do with
this? Or am I making some big mistake here?

Richard

Greg Skinner

Apr 28, 2002, 3:47:13 PM
Bart <nob...@nospam.org> wrote:
> The internet data xfer rates I quoted are from a search engine's crawler
> that has been in operation for a very long time (pre-Netscape).

Hmmm ... What type of connectivity does this site have? Who are
their providers? Do they have redundant connectivity to top-tier
providers?

[from a previous post of yours]

> The truth will come out as soon as Google "comes out of the closet" and
> goes public. I pretty strongly suspect that they will have to cut their
> costs dramatically in order to survive the market. I believe their business
> agenda is to stomp all other engines out of existence by providing
> exceedingly high quality results for as long as they aren't subject to
> financial accountability. They believe they'll win in the long term
> by being the only player left, I think.

I certainly think that Google and other search engines will have to
(continue to) cut costs. Despite technological advances, there is
little to suggest that the money that is paid to keep search engines
in operation, whether it be from paid listing fees, ads, etc., is
going to grow in proportion to technology. Search engines will come
under increasing pressure to meet bottom-line revenue and earnings
targets. Also, at some point, the capital expenditure of indexing
will have to be justified by whether there are actually sufficient
(monetizable) users out there looking for all of that information.

If search engines start to charge for search queries, they'll have to
be very, very relevant and current, or else people won't want to pay.
If they continue to rely on ads or ad-like revenue, they'll wind up
having to favor results that the advertisers are willing to pay for.
Search engines may wind up like other media companies and be forced to
specialize in niches such as news or music, because it will be easier
and cheaper to create highly relevant and up-to-date specialized
indexes. They also may be forced to consolidate if they can't achieve
their business objectives as separate companies.

--gregbo
gds at best dot com
