Duplicates & Aliases

Charlie Clark

Feb 22, 2013, 1:08:36 PM

to httpa...@googlegroups.com

Hi,

there are a couple of sites which appear twice in the data, either because
the same host is listed once with "www" and once without, or because two
sites are aliases of each other. The latter is quite common for
international websites, which often have a .com domain in addition to their
country's TLD. Where this is the case, how should they be treated?

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

Steve Souders

Feb 24, 2013, 4:10:04 PM

to httpa...@googlegroups.com

Duplicates are restricted in the schema. So, for example, there are no pages with the same URL in a single crawl.

You use the word "alias" - I kinda know what you mean but this is a vague term. You probably mean "two URLs that end up at the same site" - but "same site" is hard to define. Many search companies have dedicated many years of research and coding to determine if two sites are "the same".

What are the issues you're trying to solve? How would you propose to solve them?

-Steve

Charlie Clark

Feb 24, 2013, 4:18:39 PM

to httpa...@googlegroups.com

On 24.02.2013 at 22:10, Steve Souders
<steveso...@gmail.com> wrote:

> Duplicates are restricted in the schema. So, for example, there are no
> pages with the same URL in a single crawl.

> You use the word "alias" - I kinda know what you mean but this is a vague
> term. You probably mean "two URLs that end up at the same site" - but
> "same site" is hard to define. Many search companies have dedicated many
> years of research and coding to determine if two sites are "the same".

Agreed. It depends largely on the configuration of DNS and the web server.
On similar projects I've always compared the resulting URLs to see if they
resolve to the same host.

e.g. http://www.linde.de and http://www.the-linde-group.com/
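
A minimal Python sketch of that comparison, assuming the URL reached after redirects has already been recorded for each crawled URL. The function names and the redirect data below are hypothetical and only illustrate the idea; nothing here comes from the HTTP Archive code or schema.

```python
from urllib.parse import urlsplit


def final_host(url):
    """Return the host part of a URL (urlsplit lower-cases it), minus any trailing dot."""
    host = urlsplit(url).hostname or ""
    return host.rstrip(".")


def group_by_final_host(final_urls):
    """Group crawled URLs by the host they end up on.

    final_urls maps each crawled URL to the URL reached after redirects
    (assumed to have been recorded elsewhere, e.g. during the crawl).
    Only hosts reached from more than one crawled URL are returned.
    """
    groups = {}
    for crawled, final in final_urls.items():
        groups.setdefault(final_host(final), []).append(crawled)
    return {host: urls for host, urls in groups.items() if len(urls) > 1}


# Hypothetical redirect data, for illustration only.
observed = {
    "http://www.linde.de": "http://www.the-linde-group.com/",
    "http://www.the-linde-group.com/": "http://www.the-linde-group.com/",
    "http://example.org/": "http://example.org/",
}
aliases = group_by_final_host(observed)
```

This only catches aliases that redirect to a common host. Two domains that serve the same content without redirecting would not be grouped, which is the harder "same site" problem Steve describes.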

> What are the issues you're trying to solve? How would you propose to
> solve them?

Well, I noticed that both http://stevesouders.com and
http://www.stevesouders.com are covered in each crawl! ;-)
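
The "www" case is easier to spot mechanically. As a rough sketch, assuming the crawl's URL list is available as plain strings (the helper names are made up for this example), one could strip a leading "www." and look for hosts that occur in both forms:

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a single leading 'www.' label from a host name."""
    return host[4:] if host.startswith("www.") else host


def www_duplicates(urls):
    """Return bare host names that appear both with and without 'www.'."""
    variants = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        variants.setdefault(strip_www(host), set()).add(host)
    return sorted(bare for bare, hosts in variants.items() if len(hosts) > 1)


# Illustrative input; the last URL is a placeholder.
crawl_urls = [
    "http://stevesouders.com",
    "http://www.stevesouders.com",
    "http://example.org/",
]
duplicates = www_duplicates(crawl_urls)
```

Whether both variants should then be merged is a separate decision, since "www" and the bare domain can in principle serve different content.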
