I am about to move around a lot of product categories and product
sets on my e-commerce store. In effect, I will be remapping almost
everything. This process will take several hours (maybe all day), and
when all is done I will also have generated a proper .htaccess file to
re-point everything, as well as a new sitemap.
I use a local off-line database application to build my world and then
reload the E-commerce database via ODBC. This is basically done in a
batch process so I do not have the luxury of dynamically
making .htaccess Redirect entries on the fly. This means that all
external links and crawls will be messed up while the update process
is going on.
During this process, I "think" the best thing to do is to set
"Disallow: /" in the robots.txt file. My feeling on this is that
during this big multi-pass shuffle, I don't want crawlers to be
collecting crappy information.
Are there any negative ramifications to turning away crawlers? Are
there any other suggestions as to how to handle this situation?
Actually you should set it up so your server responds with a 503 (under maintenance) while you do this sort of thing.
This way robots will reschedule their crawls.
----- Original Message ----- From: "Motorhead Extraordinaire" <J...@MotorheadExtraordinaire.com>
To: "SOFTplus GSiteCrawler" <gsitecrawler@googlegroups.com>
Sent: Sunday, October 18, 2009 8:24 AM
Subject: [GSiteCrawler] robots.txt and web site remap
Hello, I'm having some trouble crawling my website. When crawlers hit
my web gallery they crawl it forever, and the database grows to over 900MB.
I've tried to set filters in robots.txt with no results, and I've also
tried to set filters within the GSiteCrawler settings, also with no result.
Hope you can give me some advice. Can you try to crawl my site?
www.mediaportal.hr
Lots of things need to be disallowed in the robots.txt file for starters: all the /forum/search URLs, anything to do with memberlist, login, etc.
Also, your forum generates session IDs. This is bad for robots; try to find a way to suppress those, otherwise an unlimited number of distinct URLs is generated for the same page.
For the gallery there are also many URI prefixes that should be disallowed in robots.txt, such as /webgalerija/index.php?lang and /webgalerija/displayimage.php?album=random and /webgalerija/thumbnails.php.
Think carefully about everything that results in useless pages, with no content or with the same content as other pages, and disallow those.
As for GSiteCrawler, you will need to request it to reimport the new robots.txt file and filter everything before restarting the crawl.
Merely setting filters in GSC itself does not help: Googlebot will still encounter all those URLs you are filtering in GSC.
Also make sure you are not requesting GSC to include image files in the sitemap. It's useless, as images (and video/audio files) don't belong in a general web sitemap. Only web pages and files containing text should be included.
----- Original Message ----- From: "prodigy2006" <adever...@gmail.com>
To: "SOFTplus GSiteCrawler" <gsitecrawler@googlegroups.com>
Sent: Sunday, October 18, 2009 6:37 PM
Subject: [GSiteCrawler] Re: robots.txt and web site remap
----- Original Message ----- From: Joe Germann
To: gsitecrawler@googlegroups.com
Sent: Sunday, October 18, 2009 8:44 PM
Subject: [GSiteCrawler] Re: robots.txt and web site remap
MOTORHEAD extraordinaire
Professional Storage and Workspace Solutions
79 Park Road - Chelmsford, MA - 01824
Toll Free 800.618.8028 - Direct 978.618.2800 - Fax 978.418.0404
Visit our web site at www.MotorheadExtraordinaire.com and
for our latest specials, sign up for our Newsletter
I implemented the following .htaccess and it appears to work just fine for users. I can plug in my IP and still get access to my web site.

    Options +FollowSymLinks
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|Googlebot|Mediapartners|Adsbot|Feedfetcher)-?(Google|Image)? [NC]
    RewriteCond %{REQUEST_URI} !^/503\.html [NC]
    RewriteRule .* /503.html
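
One thing worth checking with rules like these: a plain RewriteRule to /503.html is an internal rewrite, so the page is normally served with a 200 status, and a crawler would then see the maintenance page as ordinary content instead of being told to come back later. Below is a minimal sketch of one way to send a real 503 status. It assumes Apache with mod_rewrite enabled. The /503.html path comes from the rules above; the shortened user-agent list is only illustrative and not a complete list of crawlers.

```apache
Options +FollowSymLinks
RewriteEngine On
RewriteBase /

# Body to send along with the 503 status (path taken from the rules above).
ErrorDocument 503 /503.html

# Match crawler user agents; this short list is illustrative only.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Mediapartners|Adsbot|Feedfetcher) [NC]
# Do not touch the error page itself, to avoid a rewrite loop.
RewriteCond %{REQUEST_URI} !^/503\.html [NC]
# "-" means no substitution; R=503 stops rewriting and returns that status.
RewriteRule .* - [R=503,L]
```

To verify, a request sent with a crawler user agent, for example `curl -I -A "Googlebot" http://www.example.com/` (hypothetical host), should show a 503 status line, while a normal browser request should still get the regular page.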