Update on indexing blogrolls

Jeremy Hylton

Dec 19, 2008, 1:25:22 PM
to Google Blog Search
I wanted to give everyone a brief end-of-the-year update on the
blogroll problem. When we switched blogsearch to indexing the full
text of posts, we started seeing a lot more results where the only
matches for a query were from the blogroll or other parts of the page
that frame the actual post. (There's been a lot of discussion of the
problem. You can search for [google blogsearch] using Google
Blogsearch.)

We're in the midst of deploying a solution for this problem. The
basic approach is to analyze each blog to look for text and markup
that is common to all of the posts. Usually, these common elements
include the blogroll, any navigational elements, and other parts of
the page that aren't part of the post. This approach works well for a
lot of blogs, but we're continuing to improve the algorithm. The
search results should ignore matches that only come from these common
elements. The indexing change to implement it is deployed almost
everywhere now.
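
A minimal sketch of the "common to all of the posts" idea, assuming each
post page has already been split into text blocks. The function names,
the min_fraction parameter, and the sample data are hypothetical
illustrations, not the detector described in this thread.

```python
from collections import Counter


def find_common_blocks(pages, min_fraction=1.0):
    """Return text blocks that appear on at least min_fraction of the pages.

    `pages` is a list of pages; each page is a list of text blocks.
    Both the data model and the threshold are assumptions for this sketch.
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_fraction * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def matches_post(query, blocks, common):
    """True if the query matches a block that is not a common element."""
    q = query.lower()
    return any(q in b.lower() for b in blocks if b not in common)


# Hypothetical blog with three post pages sharing a blogroll and a nav bar.
pages = [
    ["Blogroll: Jane Doe, John Smith", "Home | Archive", "Today I baked bread."],
    ["Blogroll: Jane Doe, John Smith", "Home | Archive", "Jane Doe spoke at a conference."],
    ["Blogroll: Jane Doe, John Smith", "Home | Archive", "Notes on gardening."],
]

common = find_common_blocks(pages)
# Only the page whose post text mentions the query counts as a hit.
hits = [i for i, blocks in enumerate(pages) if matches_post("jane doe", blocks, common)]
print(hits)
```

Lowering min_fraction below 1.0 would tolerate pages where a shared
element is missing or slightly different, at the risk of discarding
text that several posts legitimately have in common.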

We expect users will continue to see some spurious results, but many
fewer than before. I tried a search for my own name, which does
appear in a few blogrolls, and all the results looked good. If you
are still seeing blogroll hits, the problem is most likely caused by
our failure to analyze a particular blog correctly. Feel free to
follow up with examples in private email or in this forum.

Jeremy Hylton
Google Blogsearch

tamar

Dec 26, 2008, 9:34:41 AM
to Google Blog Search
Curious - around the same time as the initial report, I started
getting Google Alerts with blogroll links. If anything, it's become
*more* common and not less common lately. Does the change you write
about, Jeremy, impact Google Alerts?

If not, perhaps someone should take a look.

Thanks.

Kyle_Texas

Dec 28, 2008, 11:10:54 PM
to Google Blog Search
Tamar,

It has become even more common. If Google Blog Search isn't finding
these blogroll hits, it is finding spam. In the last 3 days, I have
seen exactly ONE result which was not a result from the blogroll or a
SPLOG.

Jeremy Hylton

Dec 29, 2008, 11:35:49 AM
to Google Blog Search
On Dec 28, 11:10 pm, Kyle_Texas <Reiko.Admi...@gmail.com> wrote:
> Tamar,
>
> It has become even more common.  If Google Blog Search isn't finding
> these blogroll hits, it is finding spam.  In the last 3 days, I have
> seen exactly ONE result which was not a result from the blogroll or a
> SPLOG.

Can you tell me the specific queries that are showing bad results?
Also, is the problem specific to alerts or do you see them in regular
blogsearch results, too?

Jeremy

Kyle_Texas

Dec 29, 2008, 8:08:16 PM
to Google Blog Search
I could write out a lengthy explanation of the different searches I do
in Google Blog Search, but I decided since this is all visual, it
would be more efficient just to use screenshots.

I have tagged almost all of the results with what they are, either
Blogroll results or my personal favorite, Fake DVD Review SPLOGS. A
few that are either legit or that I am unsure about are left mostly
blank.

Search Term: “Reiko Aylesworth”

http://i131.photobucket.com/albums/p312/CO757300/Temp/gbs-1.jpg

http://i131.photobucket.com/albums/p312/CO757300/Temp/gbs-2.jpg

http://i131.photobucket.com/albums/p312/CO757300/Temp/gbs-3.jpg

Search Term: “Carlos Bernard”

http://i131.photobucket.com/albums/p312/CO757300/Temp/gbs-4.jpg

Search Term: “Kiefer Sutherland”

http://i131.photobucket.com/albums/p312/CO757300/Temp/gbs-5.jpg

I hope this helps. If needed, I can write out a more detailed
explanation.

tamar

Jan 1, 2009, 9:54:40 PM
to Google Blog Search
Jeremy, I'm doing searches for "tamar weinberg," my blog title name,
or link:www.domain.com (where domain.com is my blog).

I don't check blogsearch results regularly, but I just performed a
search for the purposes of giving you as much information as possible
and saw a result that showed my blog on the sidebar navigation from 4
hours ago.

That said, I'm pretty certain that this isn't fully addressed. :(

Jeremy Hylton

Jan 7, 2009, 12:58:46 PM
to Google Blog Search
On Jan 1, 9:54 pm, tamar <puntr...@gmail.com> wrote:
> Jeremy, I'm doing searches for "tamar weinberg," my blog title name,
> or link:www.domain.com (where domain.com is my blog).
>
> I don't check blogsearch results regularly, but I just performed a
> search for the purposes of giving you as much information as possible
> and saw a result that showed my blog on the sidebar navigation from 4
> hours ago.
>
> That said, I'm pretty certain that this isn't fully addressed. :(

I agree that the problem isn't fully addressed :-(. I just did a
link: search for your blog. It returned 10 results ranging from 37
minutes old to several days old (Jan 1). There were two results that
obviously came from the blogroll, one from http://janefouts.com/ and
one from http://simplystated.realsimple.com/. We'll have to see why
we failed to detect those links as coming from the blogroll. There
are also a few results that came from Techcrunch posts that you
commented on. The comment has a link to your blog. I think those are
legitimate results, but I'd be interested to hear what users think.

So we're at 80% accuracy at this very moment. It's better than it
was, but there is obviously a lot of room for improvement.
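
The 80% figure follows from the counts above: 10 results, 2 of them
obvious blogroll hits. A small sketch of that arithmetic, with a
hypothetical helper name; it counts the comment-link results as
legitimate, as suggested above.

```python
def blogroll_free_fraction(total_results, blogroll_results):
    """Fraction of results that did not come from a blogroll."""
    if total_results <= 0:
        raise ValueError("total_results must be positive")
    return (total_results - blogroll_results) / total_results


# Counts taken from the link: search described in this post.
accuracy = blogroll_free_fraction(total_results=10, blogroll_results=2)
print(f"{accuracy:.0%}")
```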

Jeremy

tamarw

Jan 8, 2009, 9:53:42 AM
to Google Blog Search
Thanks Jeremy. As far as comments showing up in these searches,
you're right - that may be a little out of place, but I'm actually not
averse to seeing those in my queries/alerts emails. It's more of a
concern when I see links coming from random sidebars (repeatedly, like
simplystated.realsimple.com).

I appreciate that you're still looking into it!

tamarw

Jan 11, 2009, 1:21:55 AM
to Google Blog Search
One more thing - there's a LOT of MyBlogLog stuff coming up for my
name. I'm not sure that should be included in search results either.

Holly

Jan 12, 2009, 1:06:39 PM
to Google Blog Search
In my particular case, it's a little weird. Before Blogsearch started
to index blogroll links, when everything was fine, a search using the
command link: mysite.com used to bring up 50+ backlinks. Now, it only
shows 2.
Why is that? Maybe some reset or something?

tamar

Jan 22, 2009, 9:39:42 AM
to Google Blog Search
Any update? It's been 3 weeks.

tamar

Jan 27, 2009, 11:22:36 AM
to Google Blog Search
It looks like no progress has been made on this front AT ALL. The
Google Alert emails I receive are spam and nothing but at this point.
Plus, I keep receiving the same emails again and again -- it's not
necessarily a "blogroll" issue but the same OLD content is being
treated by Google Blogsearch as new content. On one search query,
I've received the same result at least 10 times.

Jeremy and team, please don't forget about us.

McCain

Jan 27, 2009, 11:19:57 PM
to Google Blog Search
We are having similar experiences, not just with blogroll references
but also recent post widgets and such on the blogs. Anytime another
post is mentioned with a link, we were frequently seeing a mostly
irrelevant page substituted for a relevant page in the index. It has
led to a lesser user experience, but we've ended up removing our
blogrolls from the sidebars, removing "recent post" references from
the sidebars, altering the recent comment widget so it does not cite
posts by title, and changing "recent/next" post references at the top
of posts so that the links are generic references rather than post
titles. That seems to make the SERPs more appropriate but it's
really not an ideal presentation. Hope this issue can be worked out.

Kyle_Texas

Jan 28, 2009, 6:25:57 PM
to Google Blog Search
Yep, the problem remains. Either SPAM or Blogroll for 90% of
results. The SPAM is actually getting worse. It's funny to see
SPLOGS at the top of the relevancy rankings, or better yet, almost the
entire first page of relevancy rankings being SPLOGS.

tamar

Jan 31, 2009, 8:00:20 PM
to Google Blog Search
Today, I got links from 2006 and 2007 in my link: query emails.

:(

Kyle_Texas

Feb 2, 2009, 12:11:44 AM
to Google Blog Search
Yeah, same thing for me. It keeps reverting to these old results
which are completely worthless.

tamar

Feb 6, 2009, 8:07:51 AM
to Google Blog Search
Is anything at ALL being done about this? I'm starting to consider
either:

1. flagging all Google Alerts sent to my Gmail inbox as spam (cuz uh,
they contain spammy results)
2. unsubscribing from Google Alerts -- since the results returned
aren't relevant and they certainly aren't fresh. (Come on, isn't
Google's mission to organize the world's information? This is clearly
disorganized and in a very bad way.)

Google: we've been pretty darn patient. This thread started in
December and referenced an even older incident. It's February now.
Is ANYONE paying attention to this? Please?

Thanks.

(p.s. a Google Alert email just prompted this post update. I don't
really post about this out of the blue.)

Jeremy Hylton

Feb 6, 2009, 6:03:50 PM
to Google Blog Search
Tamar,

Apologies for my tardy response. I'll be sure to give everyone an
update every week, even if we don't have much news to report.

As I mentioned, we made an initial attempt to fix the blogroll problem
in December. It fixed some fraction of the results that were coming
from blogrolls, but was inadequate in a number of ways. For some
blogs, the blog roll detection didn't pick anything up. For other
blogs, it detected some items in the blog roll, but not all of them. My
colleague Rick Klau was particularly unlucky. His blog appears in the
blog rolls of many legal blogs. I noticed that we often detect every
blog but his as a blogroll entry. We've been looking at a collection
of backlink queries (with the link: operator) and still see about 50%
of the results coming from blog rolls. So there is obviously a lot of
room for improvement.

We have been working on an improved blog roll detector. Our internal
tests look fairly promising, but there is a lot of variability in blog
markup that we need to handle. It's going to be a few more weeks
until we can start to deploy it. I'll see if I can provide a better
ETA next week.

I haven't been paying attention to the Google Alerts specifically.
The accuracy I mentioned earlier was for the regular search results.
I'll make sure we add some metrics that look at Alerts quality so that
we don't forget about it again. The basic solution is the same for
search results and for alerts, but maybe there's something more we can
do for alerts in the short term.

Jeremy

Jeremy Hylton

Feb 6, 2009, 10:12:16 PM
to Google Blog Search
On Feb 6, 6:03 pm, Jeremy Hylton <jhyl...@gmail.com> wrote:
> Tamar,
>
> Apologies for my tardy response.  I'll be sure to give everyone an
> update every week, even if we don't have much news to report.
>
> As I mentioned, we made an initial attempt to fix the blogroll problem
> in December.  It fixed some fraction of the results that were coming
> from blogrolls, but was inadequate in a number of ways.  For some
> blogs, the blog roll detection didn't pick anything up.  For other
> blogs, it detected some items in the blog roll, but not all of them.  My
> colleague Rick Klau was particularly unlucky.  His blog appears in the
> blog rolls of many legal blogs.  I noticed that we often detect every
> blog but his as a blogroll entry.  We've been looking at a collection
> of backlink queries (with the link: operator) and still see about 50%
> of the results coming from blog rolls.  So there is obviously a lot of
> room for improvement.

I wanted to clarify this point a little bit. The problem really is
worst for people with popular blogs. The average user is getting more
and better results as a consequence of the indexing changes that
introduced the blogroll problems. We return results from blogs
with partial content feeds. We index comments. We discover more
links. So a lot of our internal analysis shows that most queries do
better as a result of the changes. If there weren't some real
benefits to the indexing changes, we would have reverted to the old
version.

Jeremy

tamar

Feb 7, 2009, 11:00:05 PM
to Google Blog Search
Thanks for the update.

A few things I noticed lately:

1. Lots of redundancy. For example, 25 separate Google Alerts have
arrived in my inbox since 12/18/08 from a single blog source citing
the SAME exact blog post (nothing new!)
2. Old posts from 2006/2007.
3. The blogroll issue

That said, the issue seems to not necessarily be limited to the
blogroll itself. The entire system is a mess. And while I say Google
Alerts, I'm able to reproduce the problems every time simply by going
to blogsearch.google.com, so I don't really think you need to focus
too much on Google Alerts. After all, it seems to be gathering data
from a system that isn't exactly returning relevant results.

Also, some of the data I actually receive is not tied to popular blogs
of mine at all. I understand the indexing problems; I'm not
requesting that you revert to the old system, but I still contend that
the new system gives me 95% noise and 5% reasonable results, which is
pretty poor.

Hopefully Google's deployment of the fixes will address the issue.

p.s. I'll be happy to send you the *really* awkward results I've
received that illustrate all above issues if you want them...unless,
of course, you already received them. ;)

Kyle_Texas

Feb 19, 2009, 1:05:45 PM
to Google Blog Search
It seems to have been better as of late until yesterday. All of a
sudden it reverted back to some old version and results from 2007 are
now coming up as the most relevant. As always, most of the recent
results have vanished if you search by date, with the majority from 2
weeks to 2 months ago.

Jeremy Hylton

Feb 25, 2009, 11:22:22 AM
to Google Blog Search
This is just a brief status report. We've been continuing to
experiment with blogroll detectors. We're going to do some
user-visible experiments early next month, probably starting with link:
queries. I'll follow up here when the experiments are running.

Jeremy

Jeremy Hylton

Mar 6, 2009, 2:22:53 PM
to Google Blog Search
Unfortunately, we ran into some delays with these experiments and had
to push back the schedule a couple of weeks.

Jeremy

Barry Schwartz

Mar 9, 2009, 7:43:19 AM
to Google Blog Search
thanks for the update.

Rodrigo

Mar 21, 2009, 1:15:53 PM
to Google Blog Search
Anything new?

Jeremy Hylton

Mar 26, 2009, 11:19:46 PM
to Google Blog Search
Yes, we do have some news to report.

We have launched a ranking change that reduces the number of results
that are returned because of blogroll matches. There are still
problems to work out, but this change appears to be a big improvement
over our earlier fix. We had originally planned to launch an
experiment for link: queries, but decided more recently to release this
change first. We are still working on the link: change and expect to
have that ready in a few more weeks.

We'd appreciate your feedback on the latest change.

Jeremy

namkraps

Mar 31, 2009, 12:08:07 PM
to Google Blog Search
I'm the producer for a watershed education site www.protectingourwater.org
launched a few months ago by the Florida Dept of Env. Protection.
Although not a blog, with this release the site disappeared from the
face of the earth in Google's search for key terms related to
watersheds in Florida. Any ideas or explanation?

Kevin

Kyle_Texas

Mar 31, 2009, 11:41:51 PM
to Google Blog Search
I've noticed the number of blogroll results has declined (only 4-6
now instead of the usual 15-20), but the number of SPLOGS has exploded
again, especially the fake DVD review pages. Sometimes an entire page
of date ranked results is nothing but DVD review spam. You can always
tell because the summary will start with "The breathtaking cast in
this movie is astounding" (even TV shows are called a 'movie' most of
the time by these idiots), "The intricately woven subplots..." "The
overwhelming cast in this movie is confounding" "The confounding
movie" "The eye-opening movie" or "Download (Insert Title Here) Right
Now!" Many of these pages (at least they used to) contain viruses or
redirects to infected websites. Rarely are they on Blogger now, most
are on sites I doubt anyone has ever heard of.

Kyle_Texas

May 5, 2009, 11:34:38 PM
to Google Blog Search
Well, the blog roll results are much more common again. In fact, this
one blog is always #1 now in some of my searches, both in relevancy
and date rankings, because it is simply a personal blog a woman posts
to numerous times a day and she used this one phrase over a year ago.
The rest are mostly blogs which are 1-6 months old and completely
irrelevant now. I don't expect all of this to disappear, but having a
blog constantly appear as #1 simply because someone makes a post every
1/2 hour, which has nothing to do with the results being searched for,
is frustrating.