Google Is Dying: Death by a Billion Cuts
Newsgroups: misc.activism.progressive
Followup-To: alt.activism.d
From: n...@olm.blythe-systems.com
Date: 3 Nov 2004 15:24:31 -0600
Subject: [NYTr] Google Is Dying: Death by a Billion Cuts
Via NY Transfer News Collective  *  All the News that Doesn't Fit

Google-Watch - Oct 15, 2004
http://www.google-watch.org/dying2.html

Google is Dying:

Death by a Billion Cuts

by Daniel Brandt

On sites with more than a few thousand pages, Google is not indexing
anywhere from ten percent to seventy percent of the pages it knows about.
These pages show up in Google's main index as a listing of the URL, which
means that the Googlebot is aware of the page. But they do not show up as an
indexed page. When the page is listed but not indexed, the only way to find
it in a search is if your search terms hit on words in the URL itself. Even
if they do hit, these listed pages rank so poorly compared to indexed pages
that they are almost invisible. This is true even though the listed pages
still retain their usual PageRank.

I have been complaining about this since April 2003, and it has become more
visible in 2004. There is no method to Google's madness, which is another
way of saying that this phenomenon is not characteristic of any particular
type of site. It is happening across the entire landscape of large sites. I
find it on www.johnkerry.com, on searchenginewatch.com, and dozens of other
large sites I checked. Our own site, www.namebase.org, is a clean example of
this, and I will use it to show how to do searches that expose this
phenomenon.

You have to know what to look for and how to look for it. First of all, a
listing consists of the URL in place of the title on Google's search results
pages, in blue, and below this in a smaller font there appears a "Similar
pages" link in blue. That's all. An indexed page has a real title, almost
always has a snippet in black, shows the URL and the size of the page in
green, and then has "Cached" and "Similar pages" links in blue. (On NameBase
we disallow Google's cache copy, so the "Cached" link is legitimately
missing on all of our pages.) These two types of links are very different
and immediately obvious. However, you should set your Google preferences to
100 links per page, because the listed links are buried much deeper in the
results.
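
The visual distinction above can be turned into a rough rule. The sketch
below is only an illustration: it assumes each search result has already
been parsed into a dict with hypothetical "title", "url", and "snippet"
keys, which is not something Google provides in this form.

```python
def strip_scheme(url):
    """Drop a leading 'http://' (or similar) and any trailing slash."""
    return url.split("://", 1)[-1].rstrip("/")


def classify_result(result):
    """Return 'listed' or 'indexed' for one parsed search result.

    Assumption: a listing shows the bare URL where the title should be
    and has no snippet; anything else is treated as indexed.
    """
    title_is_url = strip_scheme(result["title"]) == strip_scheme(result["url"])
    has_snippet = bool(result.get("snippet"))
    return "listed" if title_is_url and not has_snippet else "indexed"


listing = {"title": "www.example.com/page.html",
           "url": "http://www.example.com/page.html",
           "snippet": ""}
indexed = {"title": "Example page",
           "url": "http://www.example.com/other.html",
           "snippet": "Some text from the page."}
print(classify_result(listing), classify_result(indexed))
```

Because an indexed page "almost always" has a snippet rather than always,
the rule requires both signals before calling a result listed.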

Before I explain how to isolate the listed links from the indexed links,
there are two cases I know of where a listing is normal for Google. These
are exceptions to the phenomenon that interests me in this essay. Neither is
relevant to NameBase, but I have to mention them in case you want to examine
other sites. The first exception is when a site has certain directories
disallowed in its robots.txt file. Google will habitually list the URLs in
the disallowed directory but not index them.  (This itself is an invasion of
privacy, because filenames can be very revealing -- but that's a rant for
another day.)
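
To rule out this first exception before counting a listing as a symptom,
one can check the URL against the site's robots.txt. This is a minimal
sketch using Python's standard urllib.robotparser; the robots.txt content
and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_disallowed(url, agent="Googlebot"):
    """True if robots.txt tells this crawler to stay away from the URL."""
    return not parser.can_fetch(agent, url)


print(is_disallowed("http://www.example.com/private/notes.html"))
print(is_disallowed("http://www.example.com/index.html"))
```

A listed URL that is disallowed here is the normal case described above,
not the indexing failure this essay is about.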

The second exception is when there are ID numbers at the end of the URL,
particularly if these numbers follow a question mark in the URL.  Google
avoids any URL that looks like it might be a problem. Sometimes this number
is a session ID number from a shopping cart site. If Google followed these
links, the crawler might end up grabbing thousands of duplicate pages,
distinguished only by the session ID.
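
The second exception can be screened for in the same spirit. The heuristic
below is an assumption of mine, not Google's actual rule: it flags a URL
when any query-string value is all digits or looks like a long hexadecimal
token, which is roughly what a session ID tends to look like.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# All digits, or six or more hex characters (a guess at "ID-like").
ID_VALUE = re.compile(r"^\d+$|^[0-9a-fA-F]{6,}$")


def has_id_like_query(url):
    """True if a value after the question mark looks like an ID number."""
    query = urlsplit(url).query
    return any(ID_VALUE.match(value) for _, value in parse_qsl(query))


print(has_id_like_query("http://shop.example.com/cart?sid=8f3a9c21d4"))
print(has_id_like_query("http://www.example.com/about.html"))
```

URLs that match should be set aside before deciding whether a site shows
the listing problem.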

Now that you know what I'm not talking about, here is how you can
investigate a site. First you have to find a word on the site that is
present on nearly every page of the site. On some of the sites we looked at,
the word "reserved" from the copyright notice (as in "All rights reserved")
worked fairly well. On NameBase, we have "home page" at the bottom. The
"site:" command is used in conjunction with "home page." By putting "home
page" in quotes, the search is more accurate:

        site:www.namebase.org "home page"

That search asks for all pages from www.namebase.org that include the phrase
"home page." These will be indexed pages. If the page was merely listed,
Google wouldn't be aware that this phrase is at the bottom of the page. Next
you can request all pages that do not contain this phrase, by inserting a
minus sign in front of the phrase:

        site:www.namebase.org -"home page"
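
The two searches above follow a fixed pattern, so they can be generated
for any site and marker phrase. The sketch below also shows one way to
turn the two result counts into a percentage; the function names are
hypothetical and the counts in the example are invented, not measurements.

```python
def site_queries(host, phrase):
    """Build the 'contains phrase' and 'lacks phrase' searches for a site."""
    with_phrase = f'site:{host} "{phrase}"'
    without_phrase = f'site:{host} -"{phrase}"'
    return with_phrase, without_phrase


def listed_share(indexed_count, listed_count):
    """Percentage of known pages that are listed but not indexed."""
    total = indexed_count + listed_count
    return 100.0 * listed_count / total if total else 0.0


for query in site_queries("www.namebase.org", "home page"):
    print(query)

# Invented counts, for illustration only.
print(listed_share(indexed_count=300, listed_count=700))
```

The second query will also return indexed pages that happen to lack the
phrase, so the share is an upper bound unless those are filtered out by
looking at each result.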

In the case of NameBase, this became a problem that I first noticed in April
2003. That was the month when Google underwent a massive upheaval, which I
describe in my "Google is broken" essay. When that essay was written two
months after the upheaval, it would have been speculative to claim that the
listed URL phenomenon was a symptom of the 4-byte docID problem described in
the essay. It was too soon. But sixteen months later, the URL listings are
beginning to look very widespread and very suspicious. It's a major fault in
Google's index, it is getting worse, and it is much more than a mere
temporary glitch.
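
For readers who have not seen the earlier essay, the arithmetic behind a
4-byte docID ceiling is short. The sketch assumes the docID is stored as
an unsigned 32-bit integer; that is my reading of "4-byte," not a
confirmed detail of Google's software.

```python
import struct

DOCID_BYTES = 4
MAX_DOCIDS = 2 ** (8 * DOCID_BYTES)   # number of distinct 4-byte values


def fits_in_docid(n):
    """True if n can be packed into 4 unsigned bytes."""
    try:
        struct.pack(">I", n)
        return True
    except struct.error:
        return False


print(MAX_DOCIDS)
print(fits_in_docid(MAX_DOCIDS - 1), fits_in_docid(MAX_DOCIDS))
```

Under that assumption the ceiling is a little under 4.3 billion pages.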

Another curiosity emerged in August 2003, two months after my "Google is
broken" essay. Google started showing supplemental results from an entirely
separate index. If you run out of regular results you will often see the
label "Supplemental Result" in green on the last page of available links. At
that time Google briefly stated on their site that they "augment results for
difficult queries by searching a supplemental collection of web pages." A
representative from Google had little to add to this, but did concede that
it is an entirely separate index, and then threw out a few words of spin. It
sounded like a cover story. I believe that this new index was started due to
a capacity problem in the main index and the need to develop new software.

Google is dying. It broke sixteen months ago and hasn't been fixed. It looks
to me as if pages that have been noted by the crawler cannot be indexed
until some other indexed page gives up its docID number. Now that Google is
a public company, stockholders and analysts should require that Google give
a full accounting of their indexing problems, and what they are doing to fix
the situation. The SEC should get involved too, because this continuing
decline in the quality of Google's main index is a significant risk factor
that should have been mentioned in the prospectus.
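
The hypothesis in the paragraph above can be pictured with a toy model.
This is purely an illustration of the guess that a page waits as a listing
until a docID is freed; the class and method names are hypothetical and
nothing here describes how Google's index actually works.

```python
from collections import deque


class DocIdPool:
    """Toy model: a fixed supply of docIDs shared by all indexed pages."""

    def __init__(self, capacity):
        self.free_ids = deque(range(capacity))
        self.indexed = {}       # url -> docID
        self.listed = deque()   # URLs seen by the crawler, waiting for a docID

    def crawl(self, url):
        """Index the page if a docID is free, otherwise only list it."""
        if self.free_ids:
            self.indexed[url] = self.free_ids.popleft()
        else:
            self.listed.append(url)

    def drop(self, url):
        """Remove an indexed page and hand its docID to the oldest listing."""
        doc_id = self.indexed.pop(url)
        if self.listed:
            self.indexed[self.listed.popleft()] = doc_id
        else:
            self.free_ids.append(doc_id)


pool = DocIdPool(capacity=2)
for page in ("a.html", "b.html", "c.html"):
    pool.crawl(page)
print(list(pool.listed))    # c.html waits because both docIDs are taken
pool.drop("a.html")
print(pool.indexed)         # c.html now holds the docID that a.html gave up
```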

The graphs below are based on page views at NameBase, our main site. Images
and automated crawlers are excluded from these numbers. We started
collecting traffic data for NameBase in May 2003. The first graph shows
daily totals, the second shows weekly totals, and the third shows monthly
totals. The last day shown is always yesterday, and the last week shown is
always the week that ended yesterday. The average that defines the 100
percent line consists of all of the data shown on each graph, and is
specified in the upper right corner.
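
The scaling described above is simple to reproduce. This sketch assumes
the input is a plain list of daily page-view totals ending with yesterday;
the function names and the sample numbers are made up.

```python
def percent_of_average(values):
    """Express each value as a percentage of the mean of all values shown."""
    average = sum(values) / len(values)
    return [100.0 * value / average for value in values]


def weekly_totals(daily):
    """Sum daily totals into 7-day weeks, the last one ending on the last day.

    Any leftover days at the start that do not fill a week are dropped.
    """
    start = len(daily) % 7
    return [sum(daily[i:i + 7]) for i in range(start, len(daily), 7)]


print(percent_of_average([50, 100, 150]))
print(weekly_totals([1] * 15))
```

Because the average is taken over the data shown, the 100 percent line
moves whenever the window of days changes.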

[Graphs: NameBase page views, shown as daily, weekly, and monthly totals]

Because NameBase is a large site with broad appeal, the traffic tends to be
steady and predictable. When the pattern changes, apart from the usual
weekend dips, we start looking at what's happening with our referrals from
search engines. Note the skinny blue line on the bottom.  This is the number
of referrals from Google (excluding Yahoo, AOL, Earthlink, and Netscape).
Our site is "sticky," which means that anyone who lands on one of our pages
from a search engine is likely to click around some more. Nevertheless, it
is clear who is in the driver's seat when it comes to overall traffic trends
-- the little blue line on the bottom is directly driving the big line on
top. The more Google loves you, the more the world loves you. Google rules.
For the past few years they have not so much reflected popularity as
created and perpetuated it through their near-monopoly.

But things seem to be changing. Our biggest problem is that with 129,000
pages on our site, Google doesn't take the time to get a complete crawl of
our data. Or if they do, they don't put it all into their index. The red
line at the bottom is the number of referrals from Yahoo plus Microsoft (if
the blue line is missing in certain places, it's exactly behind the red
line). We began tracking Yahoo and Microsoft in April 2004, shortly after
Yahoo dropped Google and began their own engine. It took six months before
Yahoo, which also feeds MSN until Microsoft switches to their new engine,
had most of our pages indexed.  Beginning in October 2004, the combination
of Yahoo and Microsoft is doing much better than Google for NameBase
referrals, primarily because Google is merely listing most of our pages
instead of indexing them.
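
The blue and red lines come down to counting referrals by search engine.
A rough sketch of that counting follows, assuming the referrer URLs have
already been pulled from the server log. The hostname table is my own
guess at a reasonable grouping, not the one used for the NameBase graphs.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed grouping: Yahoo and Microsoft are counted together, as in the text.
ENGINE_SUFFIXES = {
    "google.com": "google",
    "yahoo.com": "yahoo+microsoft",
    "msn.com": "yahoo+microsoft",
}


def referral_source(referrer):
    """Map a referrer URL to an engine label, or 'other'."""
    host = (urlsplit(referrer).hostname or "").lower()
    for suffix, label in ENGINE_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    return "other"


def count_referrals(referrers):
    """Tally referrals per engine label."""
    return Counter(referral_source(r) for r in referrers)


sample = [
    "http://www.google.com/search?q=namebase",
    "http://search.yahoo.com/search?p=namebase",
    "http://search.msn.com/results.aspx?q=namebase",
    "http://www.example.org/links.html",
]
print(count_referrals(sample))
```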

Some speculate that Google has a redesigned index that could show up any day
now. If so, the blue line on the top graph might jump up a few days later.

I'm not holding my breath. It's just as likely that Google has lost interest
in their main index altogether, and has decided that they don't need to
trouble themselves with large sites.

[Daniel Brandt is a member of NY Transfer News Collective and is the
founder of Namebase (http://www.namebase.org), Google Watch
(http://www.google-watch.org) and Gmail-Is-Too-Creepy
(http://www.gmail-is-too-creepy.org) ]

                                *
Search the NYTr Archives at:
http://olm.blythe-systems.com/pipermail/nytr/

To subscribe or unsubscribe or change your settings via the web, visit:
http://olm.blythe-systems.com/mailman/listinfo/nytr

=================================================================
  NY Transfer News Collective   *   A Service of Blythe Systems
           Since 1985 - Information for the Rest of Us
              339 Lafayette St., New York, NY 10012
  http://www.blythe.org                  e-mail: n...@blythe.org
=================================================================

