Re: [GSiteCrawler] Abridged summary of gsitecrawler@googlegroups.com

Justin Branch

Dec 29, 2009, 6:02:17 AM

to gsitec...@googlegroups.com

Thanks for your response. I am running Windows 2000 Service Pack 4.

(1) How long does the system take to produce a sitemap? I have close
to 10,000 URLs in all, but not all of them are live: products that are
out of stock are of course not live.

(2) Only 5,000 are live, so my question is: does the system search the
entire database of 10,000 or only the 5,000 live products?

(3) Can I uninstall the program and start fresh? I don't know whether
my previous robots.txt blocked anything. I have since uploaded the
standard robots.txt with most of the restrictions gone. While the
program was running I stopped it, restarted it, and stopped it again;
I was just so confused.

(4) Does it come with an auto-detect setting so that it runs out of
the box?
I can only set up the FTP and robots.txt; beyond what is shown in the
tutorial I am lost, as I cannot tweak what I don't understand.

(5) When I shut off the program, does it still continue to generate
sitemaps? It says "crawlers idle" and I don't know what that means.

Happy New Year

On 12/29/09, gsitecrawl...@googlegroups.com
<gsitecrawl...@googlegroups.com> wrote:
> =============================================================================
> Today's Topic Summary
> =============================================================================
>
> Group: gsitec...@googlegroups.com
> Url: http://groups.google.com/group/gsitecrawler/topics
>
> - Confused [2 Updates]
> http://groups.google.com/group/gsitecrawler/t/99848be478fa0f73
>
>
> =============================================================================
> Topic: Confused
> Url: http://groups.google.com/group/gsitecrawler/t/99848be478fa0f73
> =============================================================================
>
> ---------- 1 of 2 ----------
> From: dunn <dunn...@gmail.com>
> Date: Dec 28 09:01AM -0800
> Url: http://groups.google.com/group/gsitecrawler/msg/aa4df9f3ebad9bf7
>
> Hello All,
>
> My website is not being crawled. I tried for almost 10 hrs total and I
> just don't understand why it's not working.
>
> I am using Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
>
> ---------- 2 of 2 ----------
> From: webado <web...@gmail.com>
> Date: Dec 28 02:37PM -0800
> Url: http://groups.google.com/group/gsitecrawler/msg/5c3d99eb8d747d8
>
> What OS are you running under? Windows XP? Windows Vista? Windows 7?
>
> If using Vista, please refer to the installation notes for Vista. I
> think they also apply to Windows 7.
>
>
>
> --
>
> You received this message because you are subscribed to the Google Groups
> "SOFTplus GSiteCrawler" group.
> To post to this group, send email to gsitec...@googlegroups.com.
> To unsubscribe from this group, send email to
> gsitecrawler...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/gsitecrawler?hl=en.
>
>
>

webado

Dec 29, 2009, 9:54:18 AM

to SOFTplus GSiteCrawler

Hi,

Funny, you replied to the summary email :)

1) The time depends on how quickly your server responds and how many
actual URLs it finds to crawl. For 10,000 actual URLs, a full crawl
may take 2-3 hours on average, but it can be slower if it runs into
redirections or if the server responds slowly.

2) The program does not perform a search of YOUR database. It simply
crawls the website by following links, like any robot or any visitor
(except that it does so exhaustively). If you have links that end up
performing built-in searches, it will find those. It won't fill in
forms that submit scripts that do searches.
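
To illustrate the difference between crawling and searching a
database, here is a minimal Python sketch of link discovery. It is not
GSiteCrawler's actual code; the names LinkCollector and
same_site_links and the example.com URLs are made up for illustration.
It only shows that a crawler sees what <a href> links expose, and
ignores forms.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags; forms are deliberately ignored."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(base_url, html):
    """Return the unique links on a page that stay on the same host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique
```

A product that exists in the database but is not linked from any page
would never show up in such a list, which is why out-of-stock items
that are not linked anywhere are simply not crawled.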

If you set it to import known URLs from Google, it will also import
every URL that no longer exists. That's not a good idea if many of the
already indexed URLs are no longer valid.

How does your server respond to a URL which doesn't exist? Does it
respond with a 404 or 410? Does it 301 redirect somewhere? Or does it
simply show a message saying it's not found, but with a 200 server
response?
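
If you want to check this yourself, a rough Python sketch along these
lines could help. The function names and the labels it returns are my
own invention, and fetch_status assumes a plain HTTP (not HTTPS) site.
It sends a HEAD request without following redirects, so you see the
first status code the server gives.

```python
import http.client


def fetch_status(host, path, timeout=10):
    """Return the raw status code for one HEAD request (redirects not followed)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def classify_missing_url_response(status):
    """Label the answer a server gives for a URL that is known NOT to exist."""
    if status in (404, 410):
        return "gone"        # what robots need in order to drop the URL
    if status in (301, 302, 303, 307, 308):
        return "redirect"    # the crawler has to follow it, which slows the crawl
    if status == 200:
        return "soft 404"    # a "not found" page served as if it were a real page
    return "other"
```

For example, classify_missing_url_response(fetch_status("www.example.com",
"/no-such-page")) would tell you which of the three cases applies,
with www.example.com standing in for your own host.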

3) No need to uninstall the program in order to start afresh. You can
delete the project and start again. Or you can delete all URLs from it
(see URL list and the button Delete all manual urls) and reimport a
fresh robots.txt file (in Filter). Or import a fresh robots.txt file
and request to refilter the URL list and crawler queue.

If you have URLs that no longer exist and they respond with 404 or
410, it's important NOT to block them in robots.txt. You want Googlebot
and other robots to find out they don't exist so that they can drop
them from the index.
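
To see whether a given URL is blocked by a robots.txt file, Python's
standard urllib.robotparser can be used. This is only a sketch: the
is_blocked helper and the sample rules in the usage note are made up
for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

With a robots.txt containing "User-agent: *" and "Disallow:
/discontinued/", any URL under /discontinued/ is reported as blocked.
A robot that obeys the file never requests such a URL, so it never
sees the 404 or 410.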

4) Not sure what you mean. Have you already installed the program?
Have you started it, added your website URL there, and then gone step
by step through the wizard? It's pretty straightforward.
The first phase involves setting up what to crawl and how, and then
letting it crawl. Next comes generating the sitemap. Lastly comes
uploading the sitemap, which you can do yourself manually or let the
program do for you. For a big site like that you should only upload
the xml.gz sitemap, not the .xml one, as the latter will be too big a
file.
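
The program writes the xml.gz file for you, but in case it helps to
see what that file is: it is just the .xml sitemap compressed with
gzip. A small Python sketch (the gzip_sitemap name and the file paths
are hypothetical) that does the same by hand:

```python
import gzip
import shutil


def gzip_sitemap(xml_path, gz_path=None):
    """Compress a sitemap file with gzip and return the path of the .gz copy."""
    gz_path = gz_path or xml_path + ".gz"
    with open(xml_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream the file so a large sitemap is not read into memory at once.
        shutil.copyfileobj(src, dst)
    return gz_path
```

Calling gzip_sitemap("sitemap.xml") would leave the original in place
and write sitemap.xml.gz next to it.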

5) If crawlers are idle, that means they are not crawling; nothing's
happening. You can shut down the program while crawlers are still
crawling and they might continue to crawl for some time, but don't
count on them finishing the job properly after you have closed the
program. They crawl the site, but won't save the info they crawled and
won't generate a sitemap by themselves. That's done explicitly when
you ask to generate the sitemap AFTER the crawl is over. You can
automate the process, but you should only do so after having performed
at least one full crawl and sitemap generation cycle manually.
Uploading the sitemap by FTP is exactly that: it needs your FTP
details (FTP host address, login user ID, password, destination folder
for the connection), just as you would use to access your site with
any FTP program. The requirement is that you must have FTP access to
your site.
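
For comparison, this is roughly what a manual FTP upload of the
sitemap looks like with Python's standard ftplib. It is a sketch, not
what GSiteCrawler does internally; the function names, the default
file name and the folder in the usage note are assumptions.

```python
from ftplib import FTP


def upload_sitemap(ftp, local_path, remote_dir, remote_name="sitemap.xml.gz"):
    """Upload one file over an already logged-in FTP connection."""
    ftp.cwd(remote_dir)  # destination folder on the server
    with open(local_path, "rb") as fh:
        ftp.storbinary("STOR " + remote_name, fh)


def connect_and_upload(host, user, password, local_path, remote_dir):
    """Log in with the same details any FTP program needs, then upload."""
    with FTP(host, timeout=30) as ftp:
        ftp.login(user=user, passwd=password)
        upload_sitemap(ftp, local_path, remote_dir)
```

The four arguments to connect_and_upload besides the local file are
exactly the FTP details listed above: host address, user ID, password
and destination folder (for example "/public_html", if that is where
your site root lives).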

Happy New Year to you and yours.

webado

Dec 29, 2009, 10:09:38 AM

to SOFTplus GSiteCrawler

Just edited the subject line to something more meaningful :)

dunn

Jan 10, 2010, 3:10:57 PM

to SOFTplus GSiteCrawler

Thanks for your reply....

That's the thing: no URL is being displayed, just the homepage. But I
currently do not have any redirects on my site; it's a brand new
site.

Also, I'm not sure if I have it set up correctly, which is why I would
like to uninstall and reinstall it so that I can undo any prior setup
actions that I may have overlooked during installation.

I have a Linux server on my hosting account, but I am running Windows
2000 on my computer.

Yes, I do have it saved as xml.gz to my FTP, but I would like the crawl
to run when I shut off the program. Should it save the file to my
computer instead?

So the program has to be on and online 24/7 to generate the sitemaps?
Is that what you are saying?

Thank You

Christina S

Jan 10, 2010, 4:01:51 PM

to gsitec...@googlegroups.com

If you only want the sitemap file to be saved to your PC and not uploaded,
then do not request FTP upload in the automation settings.
Start the program and let it run as long as it takes.

If you close the program window while some crawlers are still active, you
cannot depend on them finishing the job properly. And no sitemap will get
generated if the program is closed.

I don't suppose you need sitemaps to be generated continually. Set yourself
a schedule or a habit of running GSC periodically - after you have made
significant changes to your website.

As I said, if it's not working and appears not to be crawling, there's
something wrong with your site (I'd need the URL to test), or your
server is blocking the GSC robot, or your computer's firewall is.

Christina
www.webado.net
