From: Partap
Date: Fri, 8 Aug 2008 13:05:44 -0700 (PDT)
Local: Fri, Aug 8 2008 4:05 pm
Subject: Feed crawlers and robots.txt
I've noticed a few feeds redirecting to feedproxy.google.com lately,
for example:
http://feeds.feedburner.com/TechCrunch
http://feeds.feedburner.com/talking-points-memo

feedproxy.google.com has a robots.txt rule banning all bots, which is
breaking my feed reader.  Are readers not obligated to obey robots.txt?


 
From: Franklin Tse [Community Expert]
Date: Sat, 9 Aug 2008 08:10:37 -0700 (PDT)
Local: Sat, Aug 9 2008 11:10 am
Subject: Re: Feed crawlers and robots.txt
Hello,

Feed readers are not expected to follow robots.txt because they are
user agents, not web robots.


 
From: Partap
Date: Tue, 12 Aug 2008 11:43:27 -0700 (PDT)
Local: Tues, Aug 12 2008 2:43 pm
Subject: Re: Feed crawlers and robots.txt

> Feed readers are not expected to follow robots.txt because they are
> user agents, not web robots.

Hrm.  That is not an entirely accurate assumption to make ;)

I assume the disallow rules were put in to keep feeds from showing up
in search results, but as a side effect they also prevent feed-aware
bots from efficiently indexing new posts.


 
From: Franklin Tse [Community Expert]
Date: Tue, 12 Aug 2008 21:02:20 -0700 (PDT)
Local: Wed, Aug 13 2008 12:02 am
Subject: Re: Feed crawlers and robots.txt
Which feed reader are you using?

 
From: Partap
Date: Mon, 18 Aug 2008 14:16:02 -0700 (PDT)
Local: Mon, Aug 18 2008 5:16 pm
Subject: Re: Feed crawlers and robots.txt
It's mine (well, it belongs to the company I'm working for, anyway).
It's a web crawler, but for sites with feeds it tries to use the feed
links rather than following every link on the site.

Feedbot seems to be doing something similar... they say on their site
that they ignore robots.txt rules for feed xml files, but the problem
I'm having with feedproxy.google.com is that *everything* is
disallowed, not just the feeds...

e.g., if I ignore the disallow rule for a feed XML file:
http://feeds.feedburner.com/talking-points-memo
and fetch it anyway, I still get blocked when trying to access a feed
item:
http://feedproxy.google.com/~r/Talking-Points-Memo/~3/xO4rl2ga-6I/208...
which just redirects to:
http://talkingpointsmemo.com/archives/208941.php
which is not blocked by robots.txt, but I have no way of knowing this
without following the link.

I don't want to ignore robots.txt rules for all feed items, because
some may be legitimate blocks.
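
To make the dilemma above concrete, here is a rough Python sketch (not from the original posts) that checks each hop of a redirect chain against its own host's robots.txt, using the standard-library urllib.robotparser. Everything specific in it is an assumption for illustration: the bot name ExampleFeedBot, the helper blocked_hops, the placeholder item path, and both robots.txt bodies. The thread only says that feedproxy.google.com disallowed everything and that the talkingpointsmemo.com page was not blocked.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumed robots.txt bodies; the thread does not quote either file.
ROBOTS_BY_HOST = {
    "feedproxy.google.com": "User-agent: *\nDisallow: /\n",
    "talkingpointsmemo.com": "User-agent: *\nDisallow:\n",
}

# Redirect chain for one feed item. The proxy path is a placeholder because
# the real item URL is truncated in the post.
HOPS = [
    "http://feedproxy.google.com/~r/Talking-Points-Memo/~3/PLACEHOLDER/",
    "http://talkingpointsmemo.com/archives/208941.php",
]


def blocked_hops(hops, robots_by_host, user_agent):
    """Return the URLs in a redirect chain that their host's robots.txt disallows."""
    blocked = []
    for url in hops:
        host = urlsplit(url).netloc
        parser = RobotFileParser()
        # Hosts missing from the mapping get an empty rule set (nothing disallowed).
        parser.parse(robots_by_host.get(host, "").splitlines())
        if not parser.can_fetch(user_agent, url):
            blocked.append(url)
    return blocked


if __name__ == "__main__":
    for url in blocked_hops(HOPS, ROBOTS_BY_HOST, "ExampleFeedBot"):
        print("blocked:", url)
```

Under these assumptions only the proxy hop is reported as blocked. That is the piece of information a compliant crawler would need, but it cannot learn the final URL without first requesting the proxy link.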

From: Franklin Tse [Community Expert]
Date: Tue, 19 Aug 2008 05:27:18 -0700 (PDT)
Local: Tues, Aug 19 2008 8:27 am
Subject: Re: Feed crawlers and robots.txt
The block is probably there because links beginning with
http://feedproxy.google.com/~r/ should not be indexed.

 
From: jkassemi
Date: Mon, 15 Sep 2008 08:40:49 -0700 (PDT)
Local: Mon, Sep 15 2008 11:40 am
Subject: Re: Feed crawlers and robots.txt
Who would we contact to have this changed to the correct "/~r" pattern
if this is the case?

I'm hesitant to begin ignoring robots.txt with our application, which
will often fetch an HTML page simply to determine where associated
feeds are located.

Best,
James
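
For readers unfamiliar with the step mentioned above (fetching an HTML page only to find its feeds), here is a minimal Python sketch of feed autodiscovery with the standard-library html.parser. It is not the poster's application: the class FeedLinkFinder, the helper find_feeds, the sample page, and the example.com URLs are all made up for illustration, and it assumes feeds are advertised through link elements with rel="alternate" and an RSS or Atom type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds in this sketch (an assumption, not exhaustive).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None when written without "=value".
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return the absolute feed URLs found in an HTML document."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


# Hypothetical page used only to exercise the parser.
SAMPLE_PAGE = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
<link rel="stylesheet" href="/site.css">
</head><body></body></html>"""

if __name__ == "__main__":
    print(find_feeds(SAMPLE_PAGE, "http://example.com/blog/"))
```

A crawler that respects robots.txt would still have to be allowed to fetch the HTML page itself before this step can run, which is why a blanket disallow matters here.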

From: Franklin Tse [Community Expert]
Date: Tue, 16 Sep 2008 08:42:59 -0700 (PDT)
Local: Tues, Sep 16 2008 11:42 am
Subject: Re: Feed crawlers and robots.txt
I will forward this thread to the FeedBurner Team for further
investigation.

 
From: Matt S. (Google employee)
Date: Tue, 16 Sep 2008 11:00:33 -0700 (PDT)
Local: Tues, Sep 16 2008 2:00 pm
Subject: Re: Feed crawlers and robots.txt
Hello,

We will soon be applying the exact same robots.txt pattern to
feedproxy.google.com as has been in place on feeds.feedburner.com for
quite some time:

User-agent: *
Disallow: /~a/

This should permit all readers/crawlers that previously retrieved feed
content, but now get a blocked response, to start working properly
again. Apologies for the inconvenience!
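
As a sanity check on the pattern quoted above, the following Python sketch (not part of the original message) feeds those two lines to the standard-library urllib.robotparser and asks whether a few URLs may be fetched. The bot name ExampleFeedBot, the helper allowed, and the example URL paths are hypothetical; only the two rule lines come from the post.

```python
from urllib.robotparser import RobotFileParser

# The rule set quoted in the message above.
ANNOUNCED_RULES = """\
User-agent: *
Disallow: /~a/
"""


def allowed(robots_txt, url, user_agent="ExampleFeedBot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical paths chosen to exercise the /~r/ and /~a/ prefixes.
    for url in (
        "http://feedproxy.google.com/~r/Example-Feed/~3/example-item/",
        "http://feedproxy.google.com/~a/example",
    ):
        print(url, "allowed" if allowed(ANNOUNCED_RULES, url) else "blocked")
```

With only /~a/ disallowed, item links under /~r/ are permitted for a robots.txt-compliant client, while anything under /~a/ stays blocked.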
