robots.txt and web site remap

Motorhead Extraordinaire

Oct 18, 2009, 8:24:03 AM

to SOFTplus GSiteCrawler

I am about to move around a lot of product categories and product
sets on my e-commerce store. In effect, I will be remapping almost
everything. This process will take several hours (maybe all day), and
when all is done I will also have generated a proper .htaccess file to
re-point everything, as well as a new sitemap.

I use a local off-line database application to build my world and then
reload the E-commerce database via ODBC. This is basically done in a
batch process so I do not have the luxury of dynamically
making .htaccess Redirect entries on the fly. This means that all
external links and crawls will be messed up while the update process
is going on.

During this process, I "think" the best thing to do is to set a
"Disallow: /" in the robots.txt file. My feeling on this is that
during this big multi-pass shuffle, I don't want crawlers to be
collecting crappy information.

Are there any negative ramifications to turning away crawlers? Are
there any other suggestions as to how to handle this situation?

Many thanks,
Joe Germann
www.MotorheadExtraordinaire.com

Christina S

Oct 18, 2009, 10:38:09 AM

to gsitec...@googlegroups.com

Actually, you should set up your server to respond with a 503 (Service
Unavailable, i.e. under maintenance) while you do this sort of thing.
This way robots will reschedule their crawls.

prodigy2006

Oct 18, 2009, 6:37:01 PM

to SOFTplus GSiteCrawler

Hello, I'm having some trouble crawling my website. When crawlers hit
my web gallery they crawl it forever, and the database grows to over 900 MB.
I've tried to set filters in robots.txt with no results, and I've also
tried to set filters within the GSiteCrawler settings, again with no result.
I hope you can give me some advice. Can you try to crawl my site?
www.mediaportal.hr

Christina S

Oct 18, 2009, 7:15:43 PM

to gsitec...@googlegroups.com

Lots of things need to be disallowed in the robots.txt file, for starters:
all the /forum/search urls, anything to do with memberlist, login, etc.

Also, your forum generates session ids - this is bad for robots. Try to find
a way to suppress those, otherwise an unlimited number of distinct urls is
generated, all for the same page.
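To illustrate why session ids hurt, here is a minimal Python sketch. The parameter names (sid, phpsessid) and the example.com urls are assumptions made up for the illustration; the thread does not say which parameter this forum actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; check what the forum really emits.
SESSION_PARAMS = {"sid", "phpsessid"}


def strip_session_id(url):
    """Return url with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Made-up urls: three crawler visits to one topic, each with a fresh session id.
crawled = [
    "http://www.example.com/forum/viewtopic.php?t=42&sid=aaa111",
    "http://www.example.com/forum/viewtopic.php?t=42&sid=bbb222",
    "http://www.example.com/forum/viewtopic.php?t=42&sid=ccc333",
]

distinct_raw = set(crawled)
distinct_clean = {strip_session_id(url) for url in crawled}
print(len(distinct_raw), "urls as crawled,", len(distinct_clean), "after stripping")
```

A crawler sees every session-id variant as a separate url, which is how a crawl database can grow without bound even though the set of real pages is small.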

For the gallery there are also many uri prefixes that should be disallowed
in robots.txt, such as /webgalerija/index.php?lang and
/webgalerija/displayimage.php?album=random and
/webgalerija/thumbnails.php.

Think carefully about everything that results in useless pages, with no
content or with the same content as others, and disallow those.
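As a rough way to test such rules before uploading them, the sketch below feeds a candidate robots.txt to Python's standard urllib.robotparser and asks whether a few urls would be fetched. The Disallow prefixes are the ones named above; the sample request paths (for example search.php?keywords=test) are made up, and memberlist/login rules are left out because their exact paths are not given here.

```python
from urllib.robotparser import RobotFileParser

# Candidate rules built from the prefixes mentioned in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/search
Disallow: /webgalerija/index.php?lang
Disallow: /webgalerija/displayimage.php?album=random
Disallow: /webgalerija/thumbnails.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.mediaportal.hr"

# Made-up sample paths: two that should be blocked, one that should stay open.
samples = [
    "/forum/search.php?keywords=test",
    "/webgalerija/thumbnails.php?album=3",
    "/webgalerija/",
]

results = {path: parser.can_fetch("Googlebot", SITE + path) for path in samples}
for path, allowed in results.items():
    print("allowed" if allowed else "blocked", path)
```

Note that urllib.robotparser does plain prefix matching and does not understand the wildcard extensions some crawlers accept, so this only approximates how Googlebot itself would read the file.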

As for GSiteCrawler, you will need to request it to reimport the new
robots.txt file and filter everything, before restarting the crawl.

Merely setting filters in GSC itself does not help - Googlebot will
still encounter all those urls you are filtering in GSC.

Also make sure you are not requesting GSC to include image files in the
sitemap - it's useless, as images (and video/audio files) don't belong in a
general web sitemap. Only web pages and files containing text should be
included.


Joe Germann

Oct 18, 2009, 8:44:29 PM

to gsitec...@googlegroups.com

Thanks for the guidance. I am investigating how to properly set up an .htaccess to do this. It looks straightforward. I just have to read up a bit more and set up a test scenario. http://www.askapache.com/htaccess/503-service-temporarily-unavailable.html

Thanks,
Joe

MOTORHEAD extraordinaire
Professional Storage and Workspace Solutions
79 Park Road - Chelmsford, MA - 01824
Toll Free 800.618.8028 - Direct 978.618.2800 - Fax 978.418.0404
Visit our web site at www.MotorheadExtraordinaire.com and
for our latest specials, sign up for our Newsletter

Christina S

Oct 18, 2009, 8:55:38 PM

to gsitec...@googlegroups.com

Yes, one of those examples should work for you.

Christina
www.webado.net

Joe Germann

Oct 18, 2009, 9:00:41 PM

to gsitec...@googlegroups.com

It looks like it; just have to play around a bit to figure it all out.

Joe

Joe Germann

Oct 19, 2009, 2:17:45 PM

to gsitec...@googlegroups.com

I implemented the following .htaccess and it appears to work just fine for users. I can plug in my IP and still get access to my web site.

Options +FollowSymLinks
RewriteEngine On
RewriteBase /

# Page used as the body of every 503 response
ErrorDocument 503 /503.html

# Google crawlers: answer 503 so they reschedule the crawl
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Mediapartners|Adsbot|Feedfetcher) [NC]
RewriteCond %{REQUEST_URI} !^/503\.html [NC]
RewriteRule .* - [R=503,L]

# Everyone except my own IP (xx.xxx.xxx.xxx) gets the maintenance page too
RewriteCond %{REMOTE_ADDR} !^xx\.xxx\.xxx\.xxx
RewriteCond %{REQUEST_URI} !^/503\.html [NC]
RewriteRule .* - [R=503,L]

How can I confirm that the 503 response to the BOTS is actually working?

Thanks,
Joe

webado

Oct 19, 2009, 3:58:12 PM

to SOFTplus GSiteCrawler

Use http://web-sniffer.net/ and put in your url.
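The same check can also be scripted. Below is a minimal Python sketch, not a definitive tool: status_for() requests a url with a chosen User-Agent and returns the status code. So that it runs anywhere, it is demonstrated against a small local stand-in server (StandInSite, a made-up class that answers 503 to Googlebot and 200 to everyone else); the two User-Agent strings are illustrative examples.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Bypass any system proxy so the local demo server is reachable.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def status_for(url, user_agent):
    """Return the HTTP status code the server sends to this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with _opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx answers are raised as exceptions


class StandInSite(BaseHTTPRequestHandler):
    """Local stand-in for the real site: 503 for Googlebot, 200 otherwise."""

    def do_GET(self):
        is_bot = "googlebot" in self.headers.get("User-Agent", "").lower()
        self.send_response(503 if is_bot else 200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), StandInSite)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"
bot_status = status_for(base, GOOGLEBOT_UA)
browser_status = status_for(base, BROWSER_UA)
server.shutdown()
server.server_close()
print("Googlebot:", bot_status, "browser:", browser_status)
```

To test a live site, call status_for() with the real url instead of the local one. Redirects are followed automatically, so a rule that answers with a 302 shows the status of the page it lands on rather than the 302 itself.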


Joe Germann

Oct 20, 2009, 1:38:31 AM

to gsitec...@googlegroups.com

Works great. Thanks a bunch.

Joe

