crawler idle


prodigy2006

May 21, 2009, 4:34:01 PM5/21/09
to SOFTplus GSiteCrawler
I'd like to make a sitemap, but I can't seem to get the site crawled
with the program. Other than robots.txt, all I get in the URL list is
one URL (the main URL): www.mediaportal.hr

I read a post about a similar problem on this board, but it doesn't
help solve the problem.

I set up the program through its wizard, but it doesn't seem to crawl
anything (the crawlers remain idle). Any ideas? Help please!

webado2

May 25, 2009, 7:43:23 AM5/25/09
to SOFTplus GSiteCrawler
First of all, if your computer runs Vista, check the Vista
instructions:
http://groups.google.com/group/gsitecrawler/web/gsc-v1-23-and-vista?hl=en

Then, when you request a crawl, use the down arrow next to the
Re-crawl button and pick This site.

Give it plenty of time to crawl; your site is big and/or slow for
crawlers (I don't know which; I only tested a bit and it seems to take long).

If your site has any redirections on it as it's being crawled, that's
going to cause problems.

You can crawl your site using Xenu from
http://home.snafu.de/tilman/xenulink.html
and check that there are no errors (404s, redirections, etc.) during
navigation. These will need to be fixed before you can hope to build a
sitemap using GSiteCrawler or any other tool.
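For anyone who wants to script a similar check, here is a rough sketch (standard-library Python; the helper names are mine, not from Xenu or GSiteCrawler) of classifying each URL the way a link checker does, without silently following redirects:

```python
# Sketch of the kind of per-URL check a link checker performs.
# Helper names (classify_status, check_url) are illustrative.
import urllib.request
import urllib.error

def classify_status(code):
    """Map an HTTP status code to a rough link-checker category."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if code == 404:
        return "broken"
    return "error"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects so we can report them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def check_url(url):
    """Return (status_code, category) for one URL without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=10)
        return resp.status, classify_status(resp.status)
    except urllib.error.HTTPError as e:  # 3xx/4xx/5xx all land here
        return e.code, classify_status(e.code)
```

Anything that comes back as "redirect" or "broken" is a candidate for fixing before you crawl.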

Jazbo

Jun 28, 2009, 8:34:24 PM6/28/09
to SOFTplus GSiteCrawler
I am having the same problem as prodigy2006. I am using Windows XP
Pro.

I ran the URL in Xenu and had no errors and only 28 links...

webado2

Jun 28, 2009, 8:55:50 PM6/28/09
to SOFTplus GSiteCrawler
Did you click Re-crawl > This Site ?

Once it finishes (maybe 1 or 2 minutes for only 28 URLs) it will say
the crawlers are empty, idle.
You can check what has been found by clicking URL List and refreshing.
It should show the list of URLs it has found.

Then click Generate > Google Sitemap

And so on.

Jazbo

Jun 28, 2009, 9:14:11 PM6/28/09
to SOFTplus GSiteCrawler
Re-crawl This site from the top toolbar, or from the URL list? The
only option on the top is Re-crawl this project.

I understand the final part; I made and uploaded a sitemap from the
one URL. But no matter what I do, I can't seem to get it to crawl...

Thanks for getting back so soon!

Christina S

Jun 28, 2009, 9:37:48 PM6/28/09
to gsitec...@googlegroups.com
Yes, Re-Crawl from the top button and select This project.

It should start it again.


Christina
www.webado.net

Jazbo

Jun 28, 2009, 10:01:20 PM6/28/09
to SOFTplus GSiteCrawler
I can't get it to work that way either... I haven't been able to get
it to crawl anything. All it has is the homepage.

Christina S

Jun 28, 2009, 10:26:14 PM6/28/09
to gsitec...@googlegroups.com
Please post the url.

Christina
www.webado.net


Jazbo

Jun 28, 2009, 10:58:22 PM6/28/09
to SOFTplus GSiteCrawler
It's not fully functional yet, but the pages work. Here it is:

http://www.scentimentsfromtheheart.com/

Jazbo

Jun 29, 2009, 5:08:37 PM6/29/09
to SOFTplus GSiteCrawler
I don't understand it; I've tried a few other sites, and they worked
with no problem. It must be an issue with the site? Why would all of
the links test with no problems and still not crawl? Thanks for the
help!

webado2

Jun 29, 2009, 5:44:08 PM6/29/09
to SOFTplus GSiteCrawler
Hah!

In your internal navigation your URLs are on http://scentimentsfromtheheart.com/
rather than on http://www.scentimentsfromtheheart.com/

So either fix all your internal links to use www URLs, or run
the crawler for http://scentimentsfromtheheart.com/ and submit the
site that way, without www.
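You can see the mismatch yourself by resolving a page's links and counting which hosts they land on. A quick sketch (standard-library Python; the function name and the example hrefs are mine):

```python
# Sketch: group a page's links by the host they resolve to, to spot
# www vs. non-www drift in internal navigation. Function name is illustrative.
from urllib.parse import urljoin, urlparse
from collections import Counter

def hosts_used(page_url, hrefs):
    """Count which hosts the given hrefs resolve to, relative to page_url."""
    counts = Counter()
    for href in hrefs:
        host = urlparse(urljoin(page_url, href)).netloc
        if host:
            counts[host] += 1
    return counts

# A page on the www host whose menu links point at the bare domain:
counts = hosts_used(
    "http://www.scentimentsfromtheheart.com/",
    ["http://scentimentsfromtheheart.com/about", "/contact"],
)
```

If more than one host shows up in the counts, a crawler configured to stay on one host will drop all the links pointing at the other.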

webado2

Jun 29, 2009, 6:18:25 PM6/29/09
to SOFTplus GSiteCrawler
Actually it's worse. In addition to not staying on one particular
domain, you are also generating URLs with some kind of session ID, so
the crawl is sort of never-ending.

You have some broken links, and some redirected ones.

Use Xenu Link Sleuth to crawl the site - add both the www and non-www
starting URLs.
http://home.snafu.de/tilman/xenulink.html

You should have a robots.txt file where you disallow various URIs or
URI prefixes.

For instance:

User-agent: *
Disallow: /address_book
Disallow: /login
Disallow: /password_forgotten
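As a sanity check, rules like these can be tested offline with Python's standard urllib.robotparser before uploading the file (the example.com URLs below are just placeholders):

```python
# Verify a robots.txt draft offline with the standard library.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /address_book
Disallow: /login
Disallow: /password_forgotten
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

blocked = rp.can_fetch("*", "http://www.example.com/login")  # disallowed prefix
allowed = rp.can_fetch("*", "http://www.example.com/index")  # no rule matches
```

Disallow lines match by prefix, so "/login" also covers "/login?SFHid=..." and similar variants.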


First fix your navigation to stay either all on www or all without www.

Fix the broken links.

Add the robots.txt file.

Start GSiteCrawler again and delete the project.
Add it again for the particular domain form (with or without www).

Ask it to read robots.txt.
Do not ask to import known urls from Google.

Uncheck the option to crawl file types for images, Word documents,
PDFs, etc.

Go to Filter > Remove Parameters and add a new line for SFHid.
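Removing the parameter amounts to normalizing each URL by dropping SFHid from the query string, so otherwise-identical URLs collapse to one. A sketch of that normalization (standard-library Python; the function name is mine):

```python
# Sketch: drop a session-id query parameter (SFHid, per the site above)
# so duplicate URLs collapse to one. Function name is illustrative.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def drop_param(url, name):
    """Return url with every query parameter called `name` removed."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunparse(parts._replace(query=urlencode(kept)))

clean = drop_param("http://www.example.com/card.php?id=7&SFHid=abc123", "SFHid")
```

Without this, every visit mints a fresh SFHid value and the crawler sees an endless stream of "new" URLs.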



Also, all your pages have the same title: SFH.
Titles should be unique to each page and reflect what the page is
about.


I have tried to run GSC for your site - including both the www and
non-www URLs in it, banning some URLs in the absence of a robots.txt
file.

I am not able to make headway with GSC.


I wonder if markup errors are enough to break the crawler:
http://scentimentsfromtheheart.com/

Sometimes something stupid like an opening comment <!-- which doesn't
get closed will mean that everything from that point on is ignored.
Thus no further links are found.
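You can demonstrate the effect with a toy link extractor that skips comment regions the way a real parser roughly does (this is my own sketch, not how GSiteCrawler is actually implemented):

```python
# Toy demonstration: a comment-aware link scanner loses everything after
# an unclosed <!-- because the "comment" runs to end of file.
import re

def extract_links(html):
    """Collect href values, skipping anything inside <!-- ... --> comments."""
    links, pos = [], 0
    while pos < len(html):
        start = html.find("<!--", pos)
        chunk = html[pos:start] if start != -1 else html[pos:]
        links += re.findall(r'href="([^"]+)"', chunk)
        if start == -1:
            break
        end = html.find("-->", start)
        if end == -1:  # unclosed comment: the rest of the page is swallowed
            break
        pos = end + 3
    return links

good = extract_links('<a href="/a"></a> <!-- note --> <a href="/b"></a>')
bad = extract_links('<a href="/a"></a> <!-- oops <a href="/b"></a>')
```

In the second case the scanner never sees /b, which is exactly the "no further links found" symptom.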


I don't know yet.

Fix what you can from the site and try again.





Jazbo

Jun 29, 2009, 6:28:23 PM6/29/09
to SOFTplus GSiteCrawler
Thanks! I'll try that.