Won't Crawl?


Sharon

Jan 1, 2010, 9:47:32 AM
to SOFTplus GSiteCrawler
I am not sure why this is happening. I have no items on the ban list
for this section of my website, yet the crawler will not crawl it. I
tried to set up a crawl for just this section alone, and it still would
not crawl. Can you help me?

Section: http://www.rare-cancer.org/dictionary/

webado

Jan 1, 2010, 11:17:30 AM
to SOFTplus GSiteCrawler
Your site contains pages with the .xhtml extension. This extension is
not included in the default list of file extensions that GSC crawls.
Add it under the Settings > General tab and re-crawl.

A page like this has no business being crawled and indexed:
http://www.rare-cancer.org/dictionary/index.php/viewpage/Feedback+%252F+Suggest+A+Word.xhtml
So either add a noindex robots meta tag to it or disallow it in
the robots.txt file.
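
If you go the robots.txt route, here is a rough way to check that a rule really blocks that page before re-crawling. This is only a sketch: the Disallow path below is my guess at a suitable prefix, not something taken from your actual robots.txt, and the helper name is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the Disallow prefix is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dictionary/index.php/viewpage/Feedback
"""


def is_allowed(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    feedback = ("http://www.rare-cancer.org/dictionary/index.php/viewpage/"
                "Feedback+%252F+Suggest+A+Word.xhtml")
    print(is_allowed(feedback))
    print(is_allowed("http://www.rare-cancer.org/dictionary/"))
```

The feedback page should come back as blocked while the dictionary index stays crawlable.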


No idea why the css.php file is picked up as a url to crawl and
include, but it needs filtering out.


There's at least one broken link:

URL: http://www.rare-cancer.org/dictionary/mai
Error: HTTP-Error 404 Not Found
Linked from: http://www.rare-cancer.org/dictionary/

You should avoid using links to urls ending in /index.php (i.e. with
no query string after it). Such a url is a duplicate of the same url
ending in / without the index.php.
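
To illustrate which links I mean, here is a small sketch of the rule: only a path that ends in /index.php with no query string gets shortened to the directory form. The function name is hypothetical, and it only shows how to find and rewrite such links, not how your site generates them.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_bare_index(url: str, index_name: str = "index.php") -> str:
    """Rewrite .../index.php (no query string) to .../ and leave others alone."""
    parts = urlsplit(url)
    # Only the bare form is a duplicate; index.php?x=1 or index.php/more/path
    # may address different content, so those are returned unchanged.
    if parts.path.endswith("/" + index_name) and not parts.query:
        new_path = parts.path[: -len(index_name)]
        return urlunsplit((parts.scheme, parts.netloc, new_path, "", parts.fragment))
    return url


if __name__ == "__main__":
    print(strip_bare_index("http://www.rare-cancer.org/dictionary/index.php"))
```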


The crawling is quite slow, which might be due to a slow server (even if
it's called Lightspeed), a slow database and slow url rewriting.

Sharon

Jan 1, 2010, 11:52:45 AM
to SOFTplus GSiteCrawler
I thank you very much for your very fast help. I had forgotten about
the extension (slaps forehead)! I will look at the issues you brought up.
I can't do a thing about the slow server, but the other items I can
fix.

Thanks and Take Care, Sharon

webado

Jan 1, 2010, 12:03:33 PM
to SOFTplus GSiteCrawler
You're welcome.

Another thing you should do is manage the canonical preference: www vs
non-www urls. Currently they both respond with 200.
One set should respond that way, the other should be redirected to the
preferred form.

See here how to do it on an Apache server (includes Lightspeed):

http://groups.google.com/group/only-validation/web/fix-canonical-issues-www-vs-non-www-and-more-on-apache-server
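
The server rules themselves are on that page. Just to show the decision being made, here is a minimal sketch of it in Python. It assumes the www form is the preferred one (that choice is yours), the function name is made up, and it ignores any port number in the url.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url: str, preferred_host: str = "www.rare-cancer.org") -> Optional[str]:
    """Return the url to redirect to, or None if `url` should just answer 200."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    def bare(h: str) -> str:
        # Drop a leading "www." so both host variants compare equal.
        return h[4:] if h.startswith("www.") else h

    if host == preferred_host:
        return None  # already the preferred form
    if bare(host) != bare(preferred_host):
        return None  # a different site altogether, not our concern
    # Same site, wrong variant: keep path and query, swap in the preferred host.
    return urlunsplit((parts.scheme, preferred_host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(redirect_target("http://rare-cancer.org/dictionary/"))
```

Whatever this returns for the non-preferred form is where the server should redirect to, instead of answering 200 on both.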

