On 24.02.2013 at 22:10, Steve Souders <steveso...@gmail.com> wrote:
> Duplicates are restricted in the schema. So, for example, there are no
> pages with the same URL in a single crawl.
> You use the word "alias" - I kinda know what you mean but this is a vague
> term. You probably mean "two URLs that end up at the same site" - but
> "same site" is hard to define. Many search companies have dedicated many
> years of research and coding to determine if two sites are "the same".
Agreed. It depends largely on how DNS and the web server are configured.
On similar projects I've always compared the resultant URLs to see if
they resolve to the same host, e.g.
http://www.linde.de and
http://www.the-linde-group.com/
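
A rough sketch of that comparison in Python, assuming "resultant URL" means the URL a request ends up at after following redirects. The helper names (final_url, host_of, same_final_host) are made up for illustration and are not part of any crawler discussed here:

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def final_url(url, timeout=10):
    """Follow HTTP redirects and return the URL the request ends up at."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def host_of(url):
    """Lower-cased host name of a URL, without port or credentials."""
    return (urlsplit(url).hostname or "").lower()


def same_final_host(url_a, url_b, resolve=final_url):
    """True if both URLs end up on the same host after redirects.

    `resolve` is injectable so the comparison can be tried without
    network access (assumption: any callable mapping URL -> final URL).
    """
    return host_of(resolve(url_a)) == host_of(resolve(url_b))
```

It would be called as same_final_host("http://www.linde.de", "http://www.the-linde-group.com/"). This only catches aliases that actually redirect; two hosts serving the same content without a redirect would still look different.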
> What are the issues you're trying to solve? How would you propose to
> solve them?
Well, I noticed that both
http://stevesouders.com and
http://www.stevesouders.com are covered in each crawl! ;-)
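
To spot such pairs in a list of crawled URLs, something like the following might do as a first pass. It is only a heuristic sketch: it assumes that a leading "www." is the sole difference between aliases, which is not true in general, and host_key and candidate_duplicates are hypothetical names:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Host name lower-cased, with a leading 'www.' stripped (heuristic)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def candidate_duplicates(urls):
    """Group URLs that share the same stripped host and path."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path or "/"  # treat an empty path as "/"
        groups[(host_key(url), path)].append(url)
    # Only groups with more than one member are possible aliases.
    return [group for group in groups.values() if len(group) > 1]
```

The groups it returns are candidates only; whether they really are the same site would still need to be checked, for example by comparing where each URL ends up.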