Crawls Not Completing - Taking forever...


Norm Cotrona

Mar 30, 2013, 7:37:08 PM
to gsitec...@googlegroups.com
I am testing GSiteCrawler on several sites I own. The sites are all set up identically, the only difference being the domain. The robots.txt file has been set up properly and tested using SEOmoz and Inspyder.

On GSiteCrawler, my smaller sites, 200 pages or so, crawl fine. On larger sites, 500 pages or more, the crawl starts but takes forever. It never seems to end. After several hours, I end up cancelling the crawl without letting it complete.

I've double-checked these larger sites on other crawl platforms, as well as my robots.txt file, and everything checks out fine. The robots.txt file works perfectly on my smaller sites without incident.

I have all of the default crawler settings in the global settings. I am at a loss as to why my crawl will not complete in GSiteCrawler. Any assistance is much appreciated.


webado

Mar 30, 2013, 8:32:53 PM
to gsitec...@googlegroups.com
Try to crawl the sites with Xenu. You might discover redirections (like
from one canonical form to another) along the way. Googlebot can crawl
these (though it will also take long to complete), whereas a sitemap crawler
will not shift canonical forms.

Make sure you aren't asking it to also crawl pdfs or images or other
non-page files. Very large pdfs take long. Also, you don't want pdfs
crawled, but maybe just included in the sitemap.

Christina

webado

Mar 30, 2013, 8:34:18 PM
to gsitec...@googlegroups.com
500 pages should not take forever, maybe 15-20 minutes or so, depending
on your connection speed and number of spiders used.

Christina


Norm Cotrona

Mar 30, 2013, 9:06:33 PM
to gsitec...@googlegroups.com
Thanks Christina -

I will check out Xenu on Monday. I am also blocking all images and PDFs in my robots.txt file. I am able to complete my crawls on these sites just fine using other sitemap-generating tools with crawlers. GSite is not successful.

Even though I am able to complete my crawls with the other sitemap generators, the fact that GSite is failing leaves me wondering whether I have a site issue rather than a crawler issue. I am truly in a grey area and cannot find anyone who knows how to assist me. The scary thing is that I am not a programmer, yet I seem to know more about robots.txt and crawling than most of the programmers and developers I've been speaking to. This concerns me.

So currently, I am able to get two sitemap generators to crawl 10 individual sites and create sitemaps just fine. Both tools generate identical results for all sites. So I have uploaded my robots.txt file to my production site and have been testing it for several days. I am not seeing any errors reported in Webmaster Tools or Bing Webmaster Tools. Yet I still have no idea if things are working as they should.

Do you know if there is a way to verify the following for Google and Bing/Yahoo:

1. Whether each engine is completing a crawl of my sites
2. How long each crawl is taking
3. How many pages were crawled
4. How many pages were indexed

I know the webmaster tools have crawler stats and crawler errors interfaces, but they are vague.

Thanks Norm

webado

Mar 30, 2013, 9:29:10 PM
to gsitec...@googlegroups.com
If you provide the url for one such website that is hard or impossible to crawl using GSiteCrawler, I will take a look. Maybe I'll find something that's sticky.
Christina

Norm Cotrona

Mar 30, 2013, 10:01:00 PM
to gsitec...@googlegroups.com
Thanks Christina

One site that is crawling in gsite fine is www.battery-backup-sump-pump.com
Two sites that are having difficulty are www.altapower.com and www.sumppumpsusa.com

Thanks

Norm

webado

Mar 30, 2013, 10:04:48 PM
to gsitec...@googlegroups.com
Oh I have seen www.altapower.com before. Was it here or in the Google Webmaster Central forum?
Christina

webado

Mar 30, 2013, 10:10:40 PM
to gsitec...@googlegroups.com
www.altapower.com has a godawful robots.txt file. I wouldn't be surprised if it caused issues with pages being left orphaned.
Maybe some stuff which should be blocked isn't.

I already have it in my Xenu. Running it again.


Christina

Norm Cotrona

Mar 30, 2013, 10:12:02 PM
to gsitec...@googlegroups.com
Yes, on here. You and I communicated previously. I had some issues with my robots.txt file, but those have since been resolved.

webado

Mar 30, 2013, 10:15:33 PM
to gsitec...@googlegroups.com
Don't block anything which already responds with an error code like 404 or 403. Just make sure you have no links to it.
One such file is .htaccessold. There is no point in blocking it in robots.txt: it responds with a 403, and that's the best type of block.


Christina

webado

Mar 30, 2013, 10:23:43 PM
to gsitec...@googlegroups.com
Ah OK.
Xenu is chugging along. I only told it to skip 
/catalog/product_compare/ 
because otherwise I have to enter each and every one of them manually. Xenu does not read and obey the robots.txt file. But GSiteCrawler does.
Christina

Norm Cotrona

Mar 30, 2013, 10:32:54 PM
to gsitec...@googlegroups.com
I have not used Xenu at all, and I am on my Mac, so I can't load it until I am in my office on Monday. Therefore, I am not sure how Xenu works. Based on your last response, I am assuming that you are entering only the rule /catalog/product_compare/ and that the rule is working properly.

So do you mean that I should enter each rule one at a time to determine which rules are causing an issue? Or which should be omitted?

My ultimate goal is to clean up and optimize my robots.txt file. However, I am using the Magento e-commerce platform, and everyone I am in communication with, including programmers and developers, is telling me that the items included in my robots.txt file are recommended. As you have pointed out, I think their recommendations are garbage, and I am not confident in what I have running. The problem is that I am not a programmer or a developer, and I am out of my comfort zone with this.

webado

Mar 30, 2013, 10:54:11 PM
to gsitec...@googlegroups.com
Well, you have to enter the fully qualified url (or url prefix), starting with http://, in the options of what not to crawl.
You also have no need to crawl external urls, as this is not your primary concern now, though that's useful when checking for broken outgoing links.

It's slow going for sure. There's an explosion of urls it finds which it skips anyway (because of the rule), but it still seems under control.
It's at 89%. But the actual number of urls keeps going up too. What Xenu reports on an ongoing basis includes every type of url it finds, blocked or not, external or not.



Christina

webado

Mar 30, 2013, 11:02:21 PM
to gsitec...@googlegroups.com
Some hints (they don't improve anything except the readability of the robots.txt file):

Because the robots.txt protocol is based on the prefix, of these 2 the 2nd one is redundant:
Disallow: /productquestions/
Disallow: /productquestions/index/

Same for these 2:
Disallow: /catalogsearch/
Disallow: /catalogsearch/result/

Same for these 2:
Disallow: /rss
Disallow: /rss/

I don't know what these are supposed to be:
Disallow: /.m-.swp
Disallow: /.swp

Make sure whatever you block makes sense. Do not block anything which already responds with 404 or 403 or even 500.
Do not block anything that's 301 redirected elsewhere either.
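
The prefix point above can be checked mechanically. Here is a minimal Python sketch, assuming plain path rules with no wildcards; the function name redundant_disallow_rules is made up for illustration and is not part of GSiteCrawler or any library.

```python
def redundant_disallow_rules(rules):
    """Return (rule, covering_rule) pairs where a rule is already
    covered by another, shorter rule that is a prefix of it."""
    redundant = []
    for rule in rules:
        for other in rules:
            # A rule is redundant if a different rule is a prefix of it.
            if other != rule and rule.startswith(other):
                redundant.append((rule, other))
                break
    return redundant


# The three pairs quoted above.
rules = [
    "/productquestions/",
    "/productquestions/index/",
    "/catalogsearch/",
    "/catalogsearch/result/",
    "/rss",
    "/rss/",
]

for rule, covered_by in redundant_disallow_rules(rules):
    print(f"Disallow: {rule} is already covered by Disallow: {covered_by}")
```

In each pair it reports the longer rule as the redundant one, so only the shorter rule needs to stay in the file.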

Christina

Norm Cotrona

Mar 30, 2013, 11:07:44 PM
to gsitec...@googlegroups.com
Thanks Christina -

Yes, www.altapower.com has approximately 7300 product and category URLs combined.

I am at a loss as to how to proceed. My robots.txt file appears to be a mess. Yet with my other programs, I am producing the correct number of product and category URLs with this existing robots.txt file in place, while GSite doesn't complete a crawl with it in place. I am not comfortable resolving this issue on my own, or 100% sure what to do next.

Not sure if you would be interested in a consulting arrangement in which you would perform a complete review of my .htaccess file and robots.txt file to clean them up and make recommendations. I can break down the robots.txt file to let you know which entries are directories, paths, clean URLs, files...

Your Thoughts?

Thanks Norm

webado

Mar 30, 2013, 11:15:40 PM
to gsitec...@googlegroups.com
Let's see what happens with Xenu and after that with GSiteCrawler.

Only you can figure out your robots.txt file. I can only pick on a few things in particular, like I did, and then make general statements if I feel maybe you're blocking stuff in desperation ;)

I'm not comfortable taking on  e-commerce sites.

Christina

Norm Cotrona

Mar 30, 2013, 11:37:12 PM
to gsitec...@googlegroups.com
Thank you Christina -

I will give Xenu a try on Monday and see where it gets me.

Thanks again and enjoy your weekend.

Norm

webado

Mar 31, 2013, 12:09:44 AM
to gsitec...@googlegroups.com
Running GSiteCrawler. Yep, it's slow, but moving.

You need to use the Remove parameter options for various query string parameters, like:
dir
order

That's because GSiteCrawler is not good at interpreting robots.txt directives that include wildcards, and those are query string parameters anyway. Luckily you are using a canonical link tag which seems to correctly indicate the clean url with no query string.


Christina

webado

Mar 31, 2013, 1:09:27 AM
to gsitec...@googlegroups.com
You have some nasty issues.

On a page like http://www.altapower.com/12v-dc-power-inverters.html?p=10

you have 2 canonical tags.

<link rel="canonical" href="http://www.altapower.com/12v-dc-power-inverters.html" />

<link rel="canonical" href="http://www.altapower.com/12v-dc-power-inverters.html?p=10" />


One of them is wrong and should be removed. I suppose that would be the first one. The software has to be configured correctly. A load of problems.

In the meantime I will ask GSiteCrawler to also remove the p parameter, as it makes no sense any more when combined with the canonical bug.
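
For anyone who wants to check their own pages for this, here is a rough Python sketch that lists every canonical link tag found in a page's HTML, using only the standard library. The class and function names are invented for illustration, and the sample markup is just the two tags quoted above.

```python
from html.parser import HTMLParser


class CanonicalCollector(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; values may be None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonicals(html):
    parser = CanonicalCollector()
    parser.feed(html)
    return parser.canonicals


page = """<head>
<link rel="canonical" href="http://www.altapower.com/12v-dc-power-inverters.html" />
<link rel="canonical" href="http://www.altapower.com/12v-dc-power-inverters.html?p=10" />
</head>"""

found = find_canonicals(page)
if len(set(found)) > 1:
    print("Conflicting canonical tags:")
    for href in found:
        print("  ", href)
```

A page should end up with exactly one entry in the list; more than one distinct href is the situation described above.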



Christina

Norm Cotrona

Mar 31, 2013, 9:40:47 AM
to gsitec...@googlegroups.com
Hi Christina -

I am having our programmer look into the multiple canonical tags. This is due to pagination on our site. Apparently there is a way to configure pagination on our site to resolve this issue. This should be resolved soon.

In regards to my robots.txt file, I have been able to clean it up and run crawls with my other tools using only the following rules. Below these rules I have posted a question for you.

User-agent: *
Crawl-delay: 60
Disallow: /powershare/
Disallow: /favicon.ico
Disallow: /*?dir*
Disallow: /*?limit*
Disallow: /*?mode*
Disallow: /*?manufacturer*
Disallow: /*?price*
Disallow: /*?switch_type_left_nav*
Disallow: /*?options*
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?SID=
Disallow: /trackorder/
Disallow: /manuals/
Disallow: /images/
Disallow: /catalogsearch/
Disallow: /catalog/
Disallow: /rss*
Disallow: /productquestions/
Disallow: /skin/
Disallow: /customer/
Disallow: /js/
Disallow: /checkout/


Given the list above, how should I set up my filters? Should I simply import the robots.txt with all of the rules above into my BAN URL list?

Or should I break the rules up and add them to the BAN URL list and REMOVE PARAMETERS list as follows:

BAN URLS LIST
Disallow: /powershare/
Disallow: /favicon.ico
Disallow: /trackorder/
Disallow: /manuals/
Disallow: /images/
Disallow: /catalogsearch/
Disallow: /catalog/
Disallow: /rss*
Disallow: /productquestions/
Disallow: /skin/
Disallow: /customer/
Disallow: /js/
Disallow: /checkout/

REMOVE PARAMETERS (if you recommend this option, what is the proper syntax of the parameters below?)
/*?dir*
/*?limit*
/*?mode*
/*?manufacturer*
/*?price*
/*?switch_type_left_nav*
/*?options*
/*.js$
/*.css$
/*.php$
/*?SID=

Thanks Much..
Norm

webado

Mar 31, 2013, 10:16:13 AM
to gsitec...@googlegroups.com
Hi,

GSiteCrawler is OK with all robots.txt directives except these:


Disallow: /*?dir*
Disallow: /*?limit*
Disallow: /*?mode*
Disallow: /*?manufacturer*
Disallow: /*?price*
Disallow: /*?switch_type_left_nav*
Disallow: /*?options*
Disallow: /*?SID=

In general the trailing * is not needed, since robots.txt is based on prefix.

The SID query string parameter is already part of the list of parameters to drop.

I would simply add to parameters to drop:

dir
limit
mode
manufacturer
price
switch_type_left_nav
options
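
To show what dropping these parameters does to a url, here is a small Python sketch using the standard urllib.parse module. It only illustrates the idea and is not how GSiteCrawler is implemented; the function name drop_params, the sample query strings, and the color parameter are made up, and matching parameter names case-sensitively is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters named in this thread, plus SID from the default list.
DROP_PARAMS = {
    "dir", "limit", "mode", "manufacturer", "price",
    "switch_type_left_nav", "options", "SID",
}


def drop_params(url, drop=DROP_PARAMS):
    """Return url with every query parameter named in `drop` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in drop  # assumption: exact, case-sensitive match
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


base = "http://www.altapower.com/12v-dc-power-inverters.html"

# Hypothetical query strings, for illustration only.
print(drop_params(base + "?dir=asc&limit=30"))
print(drop_params(base + "?mode=list&color=red"))
```

The first call prints the clean url with no query string at all; the second keeps only the parameter that is not on the list. Many listing variants collapse to one url this way, which is why the crawl has far fewer urls to visit.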

For this:
Disallow: /rss*

You actually need this:
Disallow: /*/rss

because rss can be in several positions as I noticed. Not sure how to state it in GSiteCrawler, it depends how many urls have that, they may need to be blocked separately.
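
Here is a small Python sketch of how wildcard rules are commonly interpreted, following the convention Google documents for robots.txt: * matches any run of characters, a trailing $ anchors the end, and everything else is a prefix match. GSiteCrawler's own handling may differ. The function names and sample paths are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow pattern into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_blocked(path, rule):
    """True if path (including any query string) matches the rule."""
    return rule_to_regex(rule).match(path) is not None


# The trailing * adds nothing: both forms block the same path.
print(is_blocked("/rss/catalog/new/", "/rss*"), is_blocked("/rss/catalog/new/", "/rss"))

# /rss* only covers paths that start with /rss; /*/rss covers deeper ones.
print(is_blocked("/some-category/rss/", "/rss*"), is_blocked("/some-category/rss/", "/*/rss"))

# Query string rule and end-anchored rule from the list above.
print(is_blocked("/12v-dc-power-inverters.html?dir=asc", "/*?dir*"))
print(is_blocked("/js/app.js", "/*.js$"), is_blocked("/js/app.js?v=2", "/*.js$"))
```

Under this reading, /*/rss does not match a top-level /rss path, so both rules may be needed if rss appears at the root as well as deeper in the site.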

I ran GSiteCrawler last night and maybe because I may have blocked more than I should have, I found just 612 urls for the sitemap.

Or maybe it's correct based on the explicit blocks from the robots.txt file and what I added (to drop a few extra query string parameters, which are based on wildcard robots.txt directives anyway).

Mind you, the sitemap does not restrict crawling or indexing of the site itself. Whatever is not included in the sitemap will still be discovered by Googlebot and, if not blocked in robots.txt or by robots meta tags, will be included and indexed.

You must be careful not to block in robots.txt anything which leads to discovery of other urls. What you need in such cases are robots noindex meta tags on urls you don't want indexed but which are needed to pass through to others.


Christina

Norm Cotrona

Mar 31, 2013, 10:50:30 AM
to gsitec...@googlegroups.com
Thanks Christina - I will give this a try.


I deleted the "Default" list of parameters to drop. Should I reinstate these parameters as well? Or are they just examples?

webado

Mar 31, 2013, 10:54:26 AM
to gsitec...@googlegroups.com
Oh, you should not delete them! They are the usual parameters to be dropped. Only delete some if they are truly useful to your site (perhaps used for a different purpose than the typical one).
You add to them any others you need.
Christina

Norm Cotrona

Mar 31, 2013, 11:02:07 AM
to gsitec...@googlegroups.com
Thanks - I will re-add them...Do you have a list of these?  I deleted them completely.

Also - I add these to remove parameters, correct?  Not to drop parts, correct?

Norm Cotrona

Mar 31, 2013, 11:13:14 AM
to gsitec...@googlegroups.com
Never mind. I created another demo project and was able to get these parameters.

Thanks Much...

Norm

webado

Mar 31, 2013, 11:19:21 AM
to gsitec...@googlegroups.com
When you click Statistics and Generate Statistics, scroll down and you see (if you didn't remove them):

Remove these parameters when specified while being crawled:
    osCsid
    PhpSessId
    PhpSessionId
    s
    Session
    SessionId
    SID
    XTCsid

Those are what needs adding back plus any others specific to your site.

Christina