robots.txt and web site remap
From: Motorhead Extraordinaire <J...@MotorheadExtraordinaire.com>
Date: Sun, 18 Oct 2009 05:24:03 -0700 (PDT)
Local: Sun, Oct 18 2009 8:24 am
Subject: robots.txt and web site remap
I am about to move around a lot of product categories and product
sets on my e-commerce store. In effect, I will be remapping almost
everything. This process will take several hours (maybe all day), and
when it is done I will also have generated a proper .htaccess file to
re-point everything, as well as a new sitemap.

I use a local off-line database application to build my world and then
reload the E-commerce database via ODBC.  This is basically done in a
batch process so I do not have the luxury of dynamically
making .htaccess Redirect entries on the fly.  This means that all
external links and crawls will be messed up while the update process
is going on.

During this process, I "think" the best thing to do is to set
"Disallow: /" in the robots.txt file.  My feeling on this is that
during this big multi-pass shuffle, I don't want crawlers to be
collecting crappy information.

Are there any negative ramifications to turning away crawlers?  Are
there any other suggestions as to how to handle this situation?

Many thanks,
Joe Germann
www.MotorheadExtraordinaire.com


From: "Christina S" <web...@gmail.com>
Date: Sun, 18 Oct 2009 10:38:09 -0400
Local: Sun, Oct 18 2009 10:38 am
Subject: Re: [GSiteCrawler] robots.txt and web site remap
Actually you should set it up so your server responds with a 503 (under
maintenance) while you do this sort of thing.
This way robots will reschedule their crawls.


From: prodigy2006 <adever...@gmail.com>
Date: Sun, 18 Oct 2009 15:37:01 -0700 (PDT)
Local: Sun, Oct 18 2009 6:37 pm
Subject: Re: robots.txt and web site remap
Hello, I'm having some trouble crawling my website. When crawlers hit
my web gallery they crawl it forever, and the database grows to over 900 MB.
I've tried to set filters in robots.txt with no results, and I've also
tried to set filters within the GSiteCrawler settings, also with no result.
I hope you can give me some advice. Can you try to crawl my site?
www.mediaportal.hr

From: "Christina S" <web...@gmail.com>
Date: Sun, 18 Oct 2009 19:15:43 -0400
Local: Sun, Oct 18 2009 7:15 pm
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap
Lots of things need to be disallowed in the robots.txt file, for starters:
all the /forum/search URLs, anything to do with memberlist, login, etc.

Also, your forum generates session IDs. This is bad for robots; try to find
a way to suppress them, otherwise an unlimited number of distinct URLs is
generated for the same page.
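
To illustrate why session IDs multiply URLs, here is a minimal Python sketch that strips a session parameter so that several crawled URLs collapse into one page. The parameter names (sid, PHPSESSID), the function names and the example.com URLs are assumptions for illustration only; check which parameter your forum software actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names (compared in lowercase); adjust for your forum.
SESSION_PARAMS = {"sid", "phpsessid"}


def strip_session_id(url):
    """Return the URL with any session parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def count_distinct_pages(urls):
    """Count how many different pages remain once session IDs are ignored."""
    return len({strip_session_id(url) for url in urls})
```

Running count_distinct_pages over a list of crawled URLs and comparing the result with the length of the list gives a rough idea of how much of the crawl is duplicates caused by session IDs.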

For the gallery there are also many URI prefixes that should be disallowed
in robots.txt, such as /webgalerija/index.php?lang  and
/webgalerija/displayimage.php?album=random  and
/webgalerija/thumbnails.php.

Think carefully about everything that results in useless pages, with no content
or with the same content as others, and disallow those.
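
Before uploading a new robots.txt, the draft rules can be checked against sample URLs. The following is a rough sketch using Python's standard urllib.robotparser. The /forum/memberlist and /forum/login paths, the helper names and the example.com host are assumptions; only the /forum/search and /webgalerija prefixes come from the advice above.

```python
from urllib.robotparser import RobotFileParser

# Draft rules built from the prefixes named above. The memberlist and login
# paths are assumptions; use the real paths from your forum.
DRAFT_ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/search
Disallow: /forum/memberlist
Disallow: /forum/login
Disallow: /webgalerija/index.php?lang
Disallow: /webgalerija/displayimage.php?album=random
Disallow: /webgalerija/thumbnails.php
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """True if the parsed rules let the given user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

This parser does plain prefix matching, so treat the result as a rough check rather than a statement of how any particular crawler will interpret the file.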

As for GSiteCrawler, you will need to request it to reimport the new
robots.txt file and filter everything, before restarting the crawl.

Merely setting filters in GSC itself does not help - Googlebot will
still encounter all those URLs you are filtering in GSC.

Also make sure you are not requesting GSC to include image files in the
sitemap - it's useless, as images (and video/audio files) don't belong in a
general web sitemap. Only web pages and text-containing files should be
included.


From: Joe Germann <j...@motorheadextraordinaire.com>
Date: Sun, 18 Oct 2009 20:44:29 -0400
Local: Sun, Oct 18 2009 8:44 pm
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap

Thanks for the guidance.  I am investigating how to properly set up a
.htaccess to do this.  It looks like it is straightforward. I just
have to read up a bit more and set up a test scenario.
http://www.askapache.com/htaccess/503-service-temporarily-unavailable...

Thanks,
Joe

MOTORHEAD extraordinaire
Professional Storage and Workspace Solutions
79 Park Road - Chelmsford, MA - 01824
Toll Free 800.618.8028 - Direct 978.618.2800 - Fax 978.418.0404
Visit our web site at www.MotorheadExtraordinaire.com and
for our latest specials, sign up for our Newsletter:
https://www.motorheadextraordinaire.com/create_account.php

From: "Christina S" <web...@gmail.com>
Date: Sun, 18 Oct 2009 20:55:38 -0400
Local: Sun, Oct 18 2009 8:55 pm
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap

Yes, one of those examples should work for you.


From: Joe Germann <j...@motorheadextraordinaire.com>
Date: Sun, 18 Oct 2009 21:00:41 -0400
Local: Sun, Oct 18 2009 9:00 pm
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap

It looks like it; I just have to play around a bit to figure it all out.

Joe

From: Joe Germann <motorheadextraordina...@gmail.com>
Date: Mon, 19 Oct 2009 14:17:45 -0400
Local: Mon, Oct 19 2009 2:17 pm
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap

I implemented the following .htaccess and it appears to work just
fine for users.  I can plug in my IP and still get access to my web site.
Options +FollowSymLinks
RewriteEngine On
RewriteBase /

# Requests from the listed crawlers (matched by User-Agent) are rewritten to /503.html.
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|Mediapartners|Adsbot|Feedfetcher)-?(Google|Image)? [NC]
RewriteCond %{REQUEST_URI} !^/503\.html [NC]
RewriteRule .* /503.html

# Visitors whose address does not match my IP (masked here) are redirected to /503.html.
RewriteCond %{REMOTE_HOST} !^xx\.xxx\.xxx\.xxx
RewriteCond %{REQUEST_URI} !^/503\.html [NC]
RewriteRule .* /503.html [R=302,L]

How can I confirm that the 503 response to the BOTS is actually working?

Thanks,
Joe

From: webado <web...@gmail.com>
Date: Mon, 19 Oct 2009 12:58:12 -0700 (PDT)
Local: Mon, Oct 19 2009 3:58 pm
Subject: Re: robots.txt and web site remap
Use http://web-sniffer.net/ and put in your url.
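
If you would rather check from your own machine, a small script can send the same request with a crawler User-Agent and report the raw status code. This is only a sketch using Python's standard library: the User-Agent string is an example, and the names fetch_status, describe and NoRedirect are made up for illustration. Redirects are deliberately not followed, so a 302 to an error page shows up as 302 rather than as the status of the page it points to.

```python
import urllib.error
import urllib.request

# Example crawler User-Agent string, used here only to trigger the bot rule.
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so a 302 is reported instead of its target."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, user_agent=BOT_UA, timeout=10):
    """Return (status code, Retry-After header or None) for one GET request."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Retry-After")
    except urllib.error.HTTPError as error:
        # urllib raises for 3xx (not followed here), 4xx and 5xx responses.
        return error.code, error.headers.get("Retry-After")


def describe(status):
    """Summarize what a crawler would conclude from the status code."""
    if status == 503:
        return "503: temporarily unavailable, crawler should retry later"
    if status in (301, 302, 303, 307, 308):
        return f"{status}: redirect, not a maintenance response"
    if status == 200:
        return "200: page served normally, content may be indexed"
    return f"{status}: unexpected, check the rewrite rules"
```

Calling fetch_status with your own URL and the default bot User-Agent, and then again with a browser-like user_agent value, shows side by side what crawlers and ordinary visitors receive while maintenance mode is on.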

From: Joe Germann <j...@motorheadextraordinaire.com>
Date: Tue, 20 Oct 2009 01:38:31 -0400
Local: Tues, Oct 20 2009 1:38 am
Subject: Re: [GSiteCrawler] Re: robots.txt and web site remap

Works great.  Thanks a bunch.

Joe
