I have a website that, until a few months ago, showed products on static pages. Around June this year I added a number of dynamic asp pages with fairly ordinary, readable urls, but because each page could be sorted by price (high or low) and product, this generated about 150 'difficult' urls with the usual '?' and '%20' in them. In addition, because a pre-filled quote request form could be generated for each product, the number of these urls multiplied.
Using GSite Crawler, all these pages were in the sitemap.xml (it excluded any duplicates that could occur when the same quote form url was generated from different pages on the site).
A fairly large number of these urls seem to have been indexed, including a number of quote form urls. That surprised me, because the tail end of such a url - for example 'asp?lease-contract=120' - will quite likely change every time the product database changes.
None of this seemed to be presenting any problems. However, as the database grows, some individual pages have become too long. So it occurred to me to use pagination to restrict pages to about 10 products, sorted in one of 4 ways. I worked on the code for that, including navigation and sorting, and it works well, but I have not implemented it yet.
The reason is that it will generate many more complex urls. So it occurred to me to pre-make asp pages with straightforward urls, where the number of pages for each make would be roughly proportional to the size of the database for that make, and a small routine would work out the pagination. It is difficult but possible to do.
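The pagination arithmetic described above could be sketched roughly as follows. This is only an illustration in Python, not the asp code on the site; the `/<make>/page-<n>.asp` url pattern and the function names are made up for the example, and only the figure of about 10 products per page comes from the post.

```python
import math

PER_PAGE = 10  # roughly 10 products per page, as described in the post


def page_count(total_products, per_page=PER_PAGE):
    """Number of pages needed; at least 1 so an empty make still gets a page."""
    return max(1, math.ceil(total_products / per_page))


def page_slice(products, page, per_page=PER_PAGE):
    """Return the products shown on a 1-based page number."""
    start = (page - 1) * per_page
    return products[start:start + per_page]


def page_urls(make, total_products, per_page=PER_PAGE):
    """Build query-free urls using a hypothetical /<make>/page-<n>.asp pattern."""
    last_page = page_count(total_products, per_page)
    return [f"/{make}/page-{n}.asp" for n in range(1, last_page + 1)]
```

With this approach the number of pre-made pages per make only has to change when a make crosses a multiple of 10 products.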
If I succeed in this I have the problem of re-directing the old 'difficult' urls. There are so many of them I am not sure that it is even possible.
There is always ISAPI but I don't have server access or the expertise to use regular expressions.
Does anyone here REALLY know if just switching will do damage?
It seems to me you basically don't want to let urls containing query strings like:
?sort-by=vehicle_leasing_offers.businessTerm
get indexed at all.
So one thing to do is to offer links to those urls only in javascript, so robots don't pick them up in the first place.
Then I think you can disallow such url structures in robots.txt (Google says it can understand wild cards in a Disallow directive). Then you don't have to change your url structure.
User-agent: *
Disallow: /*?sort-by
But (in another lifetime) I'd have gone further and created a folder-based url structure for each kind of sorting you offer, so I could disallow those scripts based on prefix.
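To check which urls a wildcard rule like the Disallow line above would catch, here is a rough Python sketch of Google-style pattern matching ('*' matches any run of characters, a trailing '$' anchors the end, everything else including '?' is literal). It is a simplified model for testing your own rules, not Google's actual implementation, and the sample paths in the usage note are invented.

```python
import re


def rule_to_regex(rule):
    """Translate a Google-style robots.txt path rule into a compiled regex.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the url, and every other character (including '?') is taken literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path_and_query, rules):
    """True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in rules)
```

For example, `is_disallowed("/ford.asp?sort-by=vehicle_leasing_offers.businessTerm", ["/*?sort-by"])` is true and `is_disallowed("/ford.asp", ["/*?sort-by"])` is false. Note that the rule only catches urls where sort-by is the first parameter; something like `/ford.asp?page=2&sort-by=price` would not match and would need its own rule.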
> A fairly large number of these urls seem to have been indexed - most surprisingly a number of quote form urls which surprised me because the tail end of it - for example - 'asp?lease-contract=120' will quite likely change every time the product database changes.
Howdy,
Open the http://bizynet.biz shopping cart demonstration and use View>Source to see how the Robots meta tag is used to avoid the crawlers following anything other than the main content pages. You should never let the crawlers index search pages or pages with Order buttons.
Google still has serious problems recognising pages that can have different URLs because customers are tracked through a web site. Because Googlebot can't do cookies, ecommerce sites have to treat it as a new customer every time it arrives.
They do eventually sort it out and don't duplicate too many pages.
Considering the massive number of dynamic sites that put cookies to effective use, it's overdue for the robots to get up-to-date. Arpanet faded away a long time ago.
> It's not Google's fault, it's the site not functioning logically.
There are a number of things that can be done to help on that count. However, if the site is doing a good job of getting Order buttons clicked, it should not be penalized by inadequate programming by Google. Not to mention, Google could reduce their storage requirements significantly.
Thanks both for your thoughts. I'm not sure I want to stop Google or any other search engine from indexing dynamically generated pages. For example, if I use pagination to make the pages more user friendly, the introductory text above the table will mention a higher percentage of the products and none of it will be duplication. Even the sorted pages aren't duplicates, because the intro text changes (it grabs the first 6 products).
The folder-based structure you mention, Webado, is like what I was planning, but I was concerned about the impact of Google not being able to find the dynamic urls it has already indexed.
This bit -
'Then what's left is to remove the query string from any of the old url's so as to redirect to the url without any query string.'
I have absolutely NO idea how to do that and get the feeling I have missed a point. Nevertheless, what you seem to be saying is that if I do the robots.txt and then the folders, it won't cause me problems so far as SEs are concerned. It's the redirect bit that's worrying me. No access to server.
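As a rough illustration of what "remove the query string and redirect" means, here is a Python sketch of the url logic only. It assumes that `sort-by` (taken from the example url earlier in the thread) is the only parameter that is safe to drop, so urls such as the quote forms with `lease-contract` are left alone; the function name is made up. On the real site the same decision would have to be made inside the asp page itself, which would then send a 301 status and a Location header, and that needs no server access beyond editing the page.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

# Parameters assumed safe to drop; 'sort-by' comes from the example url
# in this thread, anything else (e.g. 'lease-contract') is left untouched.
SORT_KEYS = {"sort-by"}


def redirect_target(url):
    """Return the query-free url to 301-redirect to, or None if no redirect applies."""
    parts = urlsplit(url)
    keys = set(parse_qs(parts.query, keep_blank_values=True))
    if keys and keys <= SORT_KEYS:
        # Rebuild the url with an empty query string and no fragment.
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return None
```

A request for `/ford.asp?sort-by=vehicle_leasing_offers.businessTerm` would be sent to `/ford.asp`, while `/quote.asp?lease-contract=120` and plain `/ford.asp` return None and are served as usual.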
Looking at the indexed pages, I think the only ones I saw with query strings were for those situations - for sorting. So it makes sense to block them, as they seem to be irrelevant and there's a perfectly good url without any query string giving the same information.
> On Sep 10, 5:43 pm, Chris Gunn wrote:
> > On Sep 10, 9:58 am, webado wrote:
> > > No robots do cookies, you should know that.