Criteria for identifying identical pages (Duplicate content URLs)


kkarl

Nov 19, 2009, 7:33:33 AM
to SOFTplus GSiteCrawler
Hi

What criteria does GSC use to decide that URLs point to the same
content?

Thx
Kkarl

webado

Nov 19, 2009, 8:25:45 AM
to SOFTplus GSiteCrawler
Same text content perhaps?

I don't know exactly because I don't have such a situation anywhere to
test it.

kkarl

Nov 19, 2009, 8:59:43 AM
to SOFTplus GSiteCrawler
I believe an answer is important because, with thousands of
duplicate pages (e.g. for shops), disabling the duplicate content
greatly reduces the size of the sitemap, and the URLs submitted to
the search engines are then more or less correct.

Comparing the content (body...) would be very time
consuming!
Comparing e.g. the <title> elements would very often be meaningless.
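
For what it's worth, comparing bodies need not be slow if each page is reduced to a hash once, while it is being crawled. Below is a minimal sketch of that idea in Python. It is not GSC's actual method, which nobody in this thread has confirmed; the function names and the crude regex-based text extraction are assumptions made for illustration only.

```python
import hashlib
import re
from collections import defaultdict


def body_fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of the page's roughly normalized text."""
    # Drop script/style blocks, then every remaining tag (crude, regex-based).
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Collapse whitespace and ignore case so trivial layout changes do not matter.
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose fingerprints match; return only groups of 2 or more."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[body_fingerprint(html)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Calling group_duplicates with a dict of URL to HTML returns lists of URLs that share identical text. Exact hashing only catches pages whose text is the same after normalization; pages that differ by a sort order or a breadcrumb would need a fuzzier comparison.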

webado

Nov 19, 2009, 9:14:05 AM
to SOFTplus GSiteCrawler
That's why you have to fix your site to avoid generating pages with
largely the same content (e.g. sorted in different ways or with a
larger view of a product image).
Proper use of a robots noindex meta tag on pages you don't want
indexed, and/or rel="nofollow" on links to alternate displays of the
same essential page, and/or disallowing certain URI patterns in the
robots.txt file should all be used.

Just tailoring the sitemap is not enough, because Google and other
search engines index what they find while crawling and will not be
restricted to what they find in the sitemap.

Joe Germann

Nov 19, 2009, 9:27:16 AM
to gsitec...@googlegroups.com
I let GSC crawl my entire PHP eCommerce site and then built a robots file to filter out the duplicate castings that were generated automagically by the eCommerce site.  I now use the robots.txt file to tell both the crawlers and GSC what to crawl. Every once in a while I will run GSC without the robots directives to make sure that my filtering strategy is exactly what it needs to be.

It's real easy to do and maintain.

Regards,
Joe
--

You received this message because you are subscribed to the Google Groups "SOFTplus GSiteCrawler" group.
To post to this group, send email to gsitec...@googlegroups.com.
To unsubscribe from this group, send email to gsitecrawler...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/gsitecrawler?hl=.

MOTORHEAD extraordinaire
Professional Storage and Workspace Solutions
79 Park Road - Chelmsford, MA - 01824
Toll Free 800.618.8028 - Direct 978.618.2800 - Fax 978.418.0404
Visit our web site at www.MotorheadExtraordinaire.com and
for our latest specials, sign up for our Newsletter

kkarl

Nov 19, 2009, 10:59:53 AM
to SOFTplus GSiteCrawler
Webado: I totally agree with your comments and hints.

Webado & Joe: How do you know that the "identical pages listed" by
GSC are all of them, i.e. that GSC has recognized every duplicate?
That's why I am interested in knowing how GSC discovers DC and what
the criteria are!

Thx
Kkarl

Joe Germann

Nov 19, 2009, 11:44:23 AM
to gsitec...@googlegroups.com
Kkarl,

I am by no means the expert on GSC but I was able to recognize URL patterns and multiple entries that were pointing to the same effective URL.  My eCommerce PHP site code spits out quite a few different URL link patterns.  Based upon what pops up in a browser's URL window, I chose the pattern that was consistent with the user visible URL.  All of the others I filtered out.

I have over 3100 products on my web site and the GSC generated site map, filtered by my robots.txt file, lists exactly 3100 links to products. Each of those links has the URL form I want (at least for now).

Run a full scan with no filters, then examine your results. You'll find redundant entries with all sorts of weird URL patterns to filter out.

The following is some of the stuff in my robots.txt file.
Disallow: /products_new.php/
Disallow: /index.php/cPath/*/sort/
Disallow: /index.php/cPath/*/page/
Disallow: /includes/languages/english/images/
Disallow: /product_reviews.php/
Disallow: /product_info.php/manufacturers_id/
Disallow: /index.php/manufacturers_id/
Disallow: /index.php?manufacturers_id=*
Disallow: /images/main/
Disallow: /images/infobox/
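
Before re-running a full crawl, patterns like these can be tried against a handful of URLs offline. Here is a rough Python sketch, assuming Google-style matching in which `*` is a wildcard, a trailing `$` anchors the end, and everything else is a prefix match on the path plus query string. The helper names and sample paths are made up for illustration, and Allow lines and rule precedence are not modelled.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow value into a regex ('*' wildcard, trailing '$' anchor)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where a '*' stood.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, rules: list[str]) -> bool:
    """True if any rule matches the start of the path (query string included)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# A few of the rules quoted above.
RULES = [
    "/products_new.php/",
    "/index.php/cPath/*/sort/",
    "/index.php/cPath/*/page/",
    "/index.php?manufacturers_id=*",
]

# Hypothetical paths, only to show the call.
print(is_disallowed("/index.php/cPath/21_35/sort/2a", RULES))
print(is_disallowed("/product_info.php/products_id/12", RULES))
```

As far as I know, Python's standard urllib.robotparser does plain prefix matching and does not expand `*`, which is why the sketch does its own translation.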

Joe

webado

Nov 20, 2009, 1:14:39 AM
to SOFTplus GSiteCrawler
kkarl, I can't answer that. As I said I don't have any sites with this
issue that need special treatment. At most I dealt with a Zen Cart
site and since I had already noticed some useless urls (different sort
orders for instance), I had already handled them by modifying the
code to add rel="nofollow" to those links. I didn't have to rely on
GSC to point them out. It may or may not have caught them anyway.

You need to know your site at the end of the day.

kkarl

Nov 20, 2009, 4:06:41 AM
to SOFTplus GSiteCrawler
Webado

The developers of GSC can certainly answer my question


Joe

One example can show the complex issues:

We discovered by chance for six websites (only differing in language)
that there are a lot of duplicate URLs in the index of Google:

1. We know how this DC is generated but not why!

2. Google now has a tool in webmaster tools to ignore these URLs

3. GSC does not recognize even one of those duplicate URLs

4. Google has now changed its position regarding robots.txt ... for DC
issues:
< We now recommend not blocking access to duplicate content on your
website, whether with a robots.txt file or other methods. Instead, use
the rel="canonical" link element, the URL parameter handling tool, or
301 redirects. If access to duplicate content is entirely blocked,
search engines effectively have to treat those URLs as separate,
unique pages since they cannot know that they're actually just
different URLs for the same content. A better solution is to allow
them to be crawled, but clearly mark them as duplicate using one of
our recommended methods.>

5. To detect DC it is necessary, in my opinion, to use several tools to
get a feeling for the size of the DC problem. It is also important to
know a little about the criteria these tools use to produce their
results.

6. Another DC example, using various tools on a large website: they
may report 1000 or 3000 or 6000 pages! Which of these results are wrong?


Kind regards
Kkarl


Joe Germann

Nov 20, 2009, 6:36:02 AM
to gsitec...@googlegroups.com
Hi KKarl,

Very interesting.  I'll have to dig into Google deeper regarding blocking access with robots.txt files.

The issue on my web site is that the osCommerce software kicks out a lot of different links, some of which are to URLs that I don't want people landing on, like a sorted list of some random products. Without pruning down the potential list of URLs we run the risk of having potential customers land in a bad place and then leave.

Hmmm........  now you have me thinking

Regards,
Joe

Christina S

Nov 20, 2009, 8:30:35 AM
to gsitec...@googlegroups.com
Hi again,
 
The developers may eventually answer this.
 
It doesn't take away from the good practices, though.
 
As for the method of avoiding having duplicate content urls in the first place, what I said stands.
 
If those urls have already been indexed, then you need to:
1) not have them in navigation where Google et al. get at them - thus modify software to not create them at all
2) 301 redirect any urls that have duplicate content and are no longer part of navigation at all (followed or not followed) to the correct urls, and do not disallow them in robots.txt at this point, so that the redirection can be found.
 
If the site has not yet been released for indexing, disallowing duplicate urls in the robots.txt file will work well, and adding rel="nofollow" to links to them will be extra help.
 
If duplicate urls cannot be controlled that way, rel="canonical" link tags added to those pages will at least help by letting Google know that the page is a duplicate of another url. It won't have the strength of a 301 redirection with its effect of consolidating value (pagerank), but a url with rel="canonical" indicating a different url will not get indexed. Such urls won't drop out of the index if they have already been indexed, though.
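
To check which pages actually declare a canonical url, something like the following Python sketch could be run over saved page sources. It uses only the standard library; the class and function names are invented for illustration, and it simply takes the first link element whose rel contains "canonical".

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_of(page_url: str, html: str) -> Optional[str]:
    """Return the absolute canonical url declared in html, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    # Resolve relative hrefs against the url the page was fetched from.
    return urljoin(page_url, finder.canonical)
```

A page whose canonical_of result differs from its own url is declaring itself a duplicate; a result of None means the page makes no declaration at all.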
 
Finally if you cannot modify software to avoid creating duplicate urls and add rel="nofollow" to links or rel="canonical" link tags in the head of the page, then disallowing dups or patterns of dups in robots.txt will also work. Combine with url removal requests if any such urls happen to have been indexed.
 
A silly but common way of ending up with tons of duplicate urls is changing the permalink structure with some kind of url rewriting without applying 301 redirections to the original urls. It's worse if you have not updated the navigation everywhere and the old urls can still be found here and there.