Only 300 of 7000 products are included in index


Norm Cotrona

Mar 22, 2013, 8:55:19 PM
to gsitec...@googlegroups.com
I am running into an issue in which only 300 of 7000 products are showing up in my index. For testing purposes, I have created a test category and linked the category to my index page. The test category and all of the product pages are set to index, follow. I have also run a crawl test through Inspyder, with the same robots.txt, etc.

Any thoughts as to what might be causing this conflict? Any assistance is much appreciated.


Norm Cotrona

Mar 22, 2013, 9:01:21 PM
to gsitec...@googlegroups.com
Sorry, I meant to state that I ran the same test using the Inspyder sitemap tool with the same robots.txt file, and the Inspyder crawl completed successfully with all 7000+ pages.

Any thoughts as to what might be causing this conflict in GSiteCrawler? Any assistance is much appreciated.

webado

Mar 22, 2013, 9:59:43 PM
to gsitec...@googlegroups.com
I couldn't answer that without knowing your website url.

Generally I'd say you should use a crawler like Xenu's Link Sleuth to see whether there are any issues on the site. For instance, if you start off on www but at some point switch to non-www URLs in the navigation, those in fact don't belong to the same site. The same goes for http vs. https URLs.
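
The www/non-www point can be checked mechanically on a list of crawled links. A minimal Python sketch, assuming the crawler treats scheme plus host as the identity of a site; the example.com URLs and both function names are illustrative and not taken from the actual site:

```python
from urllib.parse import urlsplit

def site_key(url):
    """Return (scheme, host) so URLs can be grouped by the site they belong to."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower())

def off_site_links(start_url, links):
    """List the links whose scheme or host differs from the crawl's start URL."""
    home = site_key(start_url)
    return [link for link in links if site_key(link) != home]

# Illustrative links only
links = [
    "http://www.example.com/category/widgets",
    "http://example.com/product/123",      # non-www: different host
    "https://www.example.com/checkout",    # https: different scheme
]
print(off_site_links("http://www.example.com/", links))
```

Any link this reports would be skipped by a crawler that stays strictly on the start URL's site, which is one way a large part of a catalog can go missing.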

Norm Cotrona

Mar 22, 2013, 11:44:25 PM
to gsitec...@googlegroups.com
thank you...the url is www.altapower.com

webado

Mar 23, 2013, 2:06:38 AM
to gsitec...@googlegroups.com
I can see your robots.txt file is incorrect. You must not have blank lines among the directives for a particular user agent.
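
To see why a blank line matters, here is a small sketch using Python's standard-library parser, which ends a record at a blank line and drops any directives that follow until the next User-agent line. Other crawlers may be more lenient. The paths are hypothetical and are not taken from the site's real robots.txt:

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Hypothetical paths; the blank line splits the record in two.
with_blank = """\
User-agent: *
Disallow: /checkout/

Disallow: /customer/
"""

without_blank = """\
User-agent: *
Disallow: /checkout/
Disallow: /customer/
"""

for label, text in (("with blank line", with_blank), ("without blank line", without_blank)):
    parser = parse_rules(text)
    print(label, "-> may fetch /customer/account:", parser.can_fetch("*", "/customer/account"))
```

With this parser the second Disallow is silently lost in the first file, so /customer/ stays crawlable even though the file appears to block it.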

Pay attention to what you allow and what you disallow. Only use the allow directive to override a broader disallow directive.


For instance, for:

Allow: /sitemap/
Allow: /sitemaps/
Allow: /sitemaps/sitemap_alta.xml
Disallow: /sitemaps/sitemap_bbps.xml
Disallow: /sitemaps/sitemap_ead.xml
Disallow: /sitemaps/sitemap_esd.xml
Disallow: /sitemaps/sitemap_hsd.xml
Disallow: /sitemaps/sitemap_skeeter.xml
Disallow: /sitemaps/sitemap_spusa.xml

You should keep only the Disallow directives; the Allow directives are not needed and are confusing. Also remember that Allow is not part of the original robots.txt protocol, but Googlebot understands it at face value.
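
One way to see the confusion: parsers do not all resolve overlapping Allow and Disallow lines the same way. Googlebot documents a most-specific-match rule, while Python's standard-library parser applies the first matching line. The sketch below runs a shortened version of the list through the Python parser. The "User-agent: *" line is an assumption, since only the directives were quoted, and the helper name is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, path, agent="*"):
    """Ask the standard-library parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# "User-agent: *" is assumed; the list is shortened to two Disallow lines.
original = """\
User-agent: *
Allow: /sitemap/
Allow: /sitemaps/
Allow: /sitemaps/sitemap_alta.xml
Disallow: /sitemaps/sitemap_bbps.xml
Disallow: /sitemaps/sitemap_ead.xml
"""

pruned = """\
User-agent: *
Disallow: /sitemaps/sitemap_bbps.xml
Disallow: /sitemaps/sitemap_ead.xml
"""

for name, text in (("original", original), ("pruned", pruned)):
    print(name,
          "alta:", allowed(text, "/sitemaps/sitemap_alta.xml"),
          "bbps:", allowed(text, "/sitemaps/sitemap_bbps.xml"))
```

Under first-match parsing, the broad "Allow: /sitemaps/" line shadows the Disallow lines that follow it, so the original list blocks nothing. The pruned list behaves the same under either rule.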

Do not block urls which respond with 404. But also make sure links to them do not exist on your site.

Christina

Norm Cotrona

Mar 23, 2013, 11:15:24 AM
to gsitec...@googlegroups.com
Thank you

I've removed the items below completely from my robots.txt. I am now getting approximately 800 links in my sitemap. There still appears to be an issue. Any thoughts?



Allow: /sitemap/
Allow: /sitemaps/
Allow: /sitemaps/sitemap_alta.xml
Disallow: /sitemaps/sitemap_bbps.xml
Disallow: /sitemaps/sitemap_ead.xml
Disallow: /sitemaps/sitemap_esd.xml
Disallow: /sitemaps/sitemap_hsd.xml
Disallow: /sitemaps/sitemap_skeeter.xml
Disallow: /sitemaps/sitemap_spusa.xml



webado

Mar 23, 2013, 11:28:42 AM
to gsitec...@googlegroups.com
Well, that was only a small example. The entire robots.txt file needs serious pruning to remove redundancy, overlapping directives, and ambiguity.

Those particular items should not even have links to them in your navigation. You should not link to XML sitemaps.

What you may want to do, besides not linking to them in navigation and not disallowing them in robots.txt, is add an X-Robots-Tag header specifying noindex.

I use this in the .htaccess file to add a noindex X-Robots-Tag to any file with the .xml suffix (it requires mod_headers):

<FilesMatch "\.xml$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
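
Once a rule like this is in place, the response headers for an .xml URL can be inspected to confirm the tag is being sent. A rough Python sketch, assuming the headers have already been fetched into a dict-like object (for example from urllib.request). The helper name is made up, and it ignores the optional per-robot prefix form such as "googlebot: noindex":

```python
def is_noindex(headers):
    """True if any X-Robots-Tag header value lists the noindex directive."""
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = [d.strip().lower() for d in value.split(",")]
        if "noindex" in directives:
            return True
    return False

# Sample header sets for illustration
print(is_noindex({"Content-Type": "application/xml", "X-Robots-Tag": "noindex"}))
print(is_noindex({"Content-Type": "text/html"}))
```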




Christina

Norm Cotrona

Mar 23, 2013, 11:36:31 AM
to gsitec...@googlegroups.com
Christina -

Do you do any consulting? I'd be interested in having you take a look at this and resolve it once and for all. Please advise.





webado

Mar 23, 2013, 11:51:08 AM
to gsitec...@googlegroups.com
Well yes, I sometimes do, time permitting.
But I can't tweak sites on platforms I don't know, because that means going into the admin interface, often even the core code of the application.
E-commerce sites are usually quite a headache. Most I don't know anything about.

I notice your site generates a lot of URLs with something like session IDs. This should be disabled, at least when robots crawl. You seem to have blocked those types of URLs in robots.txt. That may be OK if they aren't needed at all.

But for other things that you block in robots.txt, often the better solution is not to block them but to add a robots noindex meta tag, so they can be used to pass through and discover other URLs while not being indexed themselves. Typically categories are like that.
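
As an illustration of the meta tag approach, a page carrying "noindex, follow" tells a compliant robot to keep the page out of the index but still follow its links. A minimal Python sketch that reads the tag the way a crawler might; the class and function names and the sample HTML are made up for the example:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())

def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Sample category page for illustration
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
found = robots_directives(page)
print("index page:", "noindex" not in found, "| follow links:", "nofollow" not in found)
```

Note that a robot can only see this tag if it is allowed to fetch the page, which is why the page must not also be disallowed in robots.txt.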

Whether session IDs can be handled at all and how depends on the software.
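
To gauge how many crawled URLs are session-ID duplicates, the IDs can be stripped and the results compared. A Python sketch under the assumption that the session ID travels as a query parameter. The parameter names in SESSION_PARAMS are guesses, since the thread does not say which one the site uses:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed parameter names; adjust to whatever the platform actually emits.
SESSION_PARAMS = {"sid", "phpsessid"}

def strip_session_id(url):
    """Return the URL with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Illustrative URLs only
crawled = [
    "http://www.example.com/widgets.html?SID=abc123&p=2",
    "http://www.example.com/widgets.html?SID=def456&p=2",
    "http://www.example.com/widgets.html?p=2",
]
unique = {strip_session_id(url) for url in crawled}
print(len(crawled), "crawled URLs ->", len(unique), "unique page(s)")
```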

Christina

Norm Cotrona

Mar 23, 2013, 11:59:16 AM
to gsitec...@googlegroups.com
Thanks, Christina -

That is more of the issue I am running into right now. My issues are platform specific. No one seems to have any experience with "properly" optimizing .htaccess and robots.txt for the Magento platform. There are many who claim to be experts; however, when they look at my files, they tell me they are "perfect" and no changes are needed. The robots.txt file I am using is the one recommended by Magento itself, yet there is no documentation of the exclusions or the theory behind them.

It sounds like I am just going to have to experiment some by pulling some of the exclusions out one by one and researching more about what each controls...

thanks much...

Norm

webado

Mar 23, 2013, 12:23:37 PM
to gsitec...@googlegroups.com
To understand the robots.txt protocol, you should study it at http://www.robotstxt.org/ .
In a nutshell, it works by disallowing rather than by allowing things. It's prefix-based. The Allow directive is an extension that's honored by some robots but is not really part of the protocol. You should sit down and think about all the sections of the site and types of URLs you don't want robots to crawl, identify a common prefix, and disallow that. If there's an exception to that rule, then use the Allow directive for the specific exception. Otherwise, all that is not disallowed is allowed.
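
The prefix logic can be sketched in a few lines of Python. This is an illustration and not a full robots.txt parser: it assumes the longest-matching-prefix rule that Googlebot documents, with Allow winning a tie, and it ignores wildcards. The function name and the paths are hypothetical:

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". The longest matching prefix wins, a tie goes
    to allow, and a path that matches no rule is allowed.
    """
    best = None
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Hypothetical rules: one broad disallow plus one specific exception.
rules = [
    ("disallow", "/search/"),
    ("allow", "/search/help.html"),
]

for path in ("/search/results?q=pump", "/search/help.html", "/products/pump.html"):
    print(path, "->", "allowed" if is_allowed(path, rules) else "disallowed")
```

One Disallow covers the whole /search/ section, the single Allow carves out the exception, and everything else needs no rule at all.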

Keep in mind that the robots.txt file does not manage access to the site. It's just a polite request to crawl or not crawl certain parts of the site, like a "keep off the grass" sign in a park.

In contrast the .htaccess file is used to control responses for various urls (including but not limited to rewriting them as needed), as well as physically controlling access to parts of the website (by blocking unauthorized access for instance).

The X-Robots-Tag is a robots directive (thus not binding, like anything else to do with robots directives) and can be used when you cannot add robots meta tags to certain files or file types. Google and most reliable robots honor it, as they would a robots meta tag. A rogue robot honors nothing, just saying.

If Magento developers advise those robots directives the way you used them, then they aren't really competent enough in the matter. Also, IMO, a well-built product shouldn't require quite so much manipulation of the robots.txt file. I don't know of any e-commerce platform that is good, or even just acceptable, right out of the box, but I've long avoided dealing with them anyway.

Christina