Hi Henry,
> 2a) There are duplicate captures, i.e. some response records will show the same target URI, so
> URIs are not unique;
> 2b) A duplicate capture /may/ result in duplicate pages, since the fetches are done at different
> times.
Yes, both are correct:
2a) ideally a single URL is fetched only once in a monthly crawl. There are no duplicate URLs in the
fetch lists, but the crawler follows redirects unchecked, which may cause a duplicate capture if the
URL the redirect points to has already been fetched.
2b) of course, two captures of the same URL may or may not result in duplicate content.
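To make 2a) concrete: the duplicate-capture rate of a crawl can be derived from the URL index by
grouping captures by URL. A quick Python sketch (not part of our tooling; it assumes the CDXJ line
layout of the Common Crawl URL index, "<SURT> <timestamp> <JSON>" with a "url" field in the JSON):

```python
import json
import sys
from collections import Counter

def count_duplicate_captures(cdx_lines):
    """Count captures that share an already-seen URL.

    Assumes CDXJ lines "<surt-url> <timestamp> <json>", where the
    JSON payload carries the original "url" (as in the Common Crawl
    URL index files).
    """
    captures = Counter()
    for line in cdx_lines:
        try:
            _surt, _ts, payload = line.split(' ', 2)
            url = json.loads(payload)['url']
        except (ValueError, KeyError):
            continue  # skip malformed lines
        captures[url] += 1
    pages = sum(captures.values())
    unique_urls = len(captures)
    return pages, unique_urls, pages - unique_urls

if __name__ == '__main__':
    pages, urls, dups = count_duplicate_captures(sys.stdin)
    print(f'{pages} captures, {urls} unique URLs, {dups} duplicate captures')
```

The returned difference (captures minus unique URLs) is exactly the number of duplicate captures in
the calculation you give below.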
> Not an actual upper bound, because distinct URIs might yield duplicate responses,
> with some (we hope) low probability.
Yes, that's true unfortunately.
> Are you aware of any attempt to do duplicate (page) detection, perhaps even to publish the IDs of
> duplicate responses, either checking only responses with the same URI, or doing the full N^2 check?
The URL index contains a digest of the binary content (raw HTML). The full N^2 check could easily be
realized as a MapReduce job (a quick sketch follows below the list). I haven't done this, for two reasons:
- near-duplicate detection would be more important to look into
- Nutch (the crawler we use) already provides a tool to do this on the CrawlDb. Of two or more
URLs with the same content checksum, all except one (the shortest or simplest URL) are flagged as
duplicates. Duplicates are then revisited at longer intervals. That's why the number of
exact duplicates is now within an acceptable range.
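To illustrate the digest-based check: a minimal MapReduce-style sketch (mine, for illustration; not
an existing Common Crawl or Nutch job) that maps each capture to its content digest and reduces per
digest, so every group with more than one member is a set of exact duplicates. It again assumes the
CDXJ layout with a "digest" field:

```python
import json
import sys
from collections import defaultdict

def map_digest(cdx_line):
    """Map phase: emit (content digest, capture URL) for one CDX line.
    Assumes CDXJ lines "<surt-url> <timestamp> <json>"."""
    _surt, _ts, payload = cdx_line.split(' ', 2)
    record = json.loads(payload)
    return record['digest'], record['url']

def parse(lines):
    """Parse CDX lines, skipping any that are malformed."""
    for line in lines:
        try:
            yield map_digest(line)
        except (ValueError, KeyError):
            continue

def reduce_digests(pairs):
    """Reduce phase: group URLs by digest; groups of size > 1 hold
    exact duplicates (byte-identical raw content)."""
    groups = defaultdict(list)
    for digest, url in pairs:
        groups[digest].append(url)
    return {d: urls for d, urls in groups.items() if len(urls) > 1}

if __name__ == '__main__':
    for digest, urls in reduce_digests(parse(sys.stdin)).items():
        print(digest, *urls)
```

In a real Hadoop job the grouping would of course happen in the shuffle phase rather than in memory.
For the Nutch route, the deduplication job runs directly on the CrawlDb, with something like
`bin/nutch dedup <crawldb>` (if I remember the invocation right).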
Btw., the statistics files also contain a HyperLogLog estimate of the number of unique
contents. Attached is a condensed view over all crawls; the data frame is dumped while generating the
plots on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
see step 3 "plotting" on
https://github.com/commoncrawl/cc-crawl-statistics
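For reference, a quick sketch of the idea behind that estimate (not the exact code in
cc-crawl-statistics; I'm using the `datasketch` library here just for illustration): feed each
content digest into a HyperLogLog and read off the cardinality estimate.

```python
from datasketch import HyperLogLog  # pip install datasketch

def estimate_unique_contents(digests, p=14):
    """Estimate the number of distinct content digests with a
    HyperLogLog of 2^p registers (p=14 gives roughly 0.8% error)."""
    hll = HyperLogLog(p=p)
    for digest in digests:
        hll.update(digest.encode('utf-8'))
    return int(hll.count())

# Example: three captures, two distinct contents -> estimate ~2
print(estimate_unique_contents(['SHA1:AAAA', 'SHA1:BBBB', 'SHA1:AAAA']))
```

The point of the sketch over exact counting is memory: the registers take a few kilobytes no matter
how many billions of captures are fed in.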
Best,
Sebastian
On 10/23/2017 10:10 PM, Henry S Thompson wrote:
> On Monday, October 23, 2017 at 3:51:16 PM UTC-4, Sebastian Nagel wrote:
>
> if one URL is fetched twice, there will be two "pages" ("response records", "captures")
> in the crawl archives. In the past there have been many duplicate captures. At present,
> the rate of duplicates is around 1-2%.
> ...
> > 2) In what way, if any, are the entries in a given WAT collection, e.g. from CC-MAIN-2014-15,
> unique?
>
> No, that's not the case, also not in the WARC files.
>
>
> Thanks for the quick and helpful reply.
>
> Just to check my understanding, your first answer (there are still a few duplicate captures)
> explains your second answer, as follows:
>
> 2a) There are duplicate captures, i.e. some response records will show the same target URI, so URIs
> are not unique;
> 2b) A duplicate capture /may/ result in duplicate pages, since the fetches are done at different times.
>
> Right?
>
> It follows, I think, that the difference between the page and URI counts gives the number of
> duplicate captures (not the same as the number of duplicated URIs, as some may have been captured
> more often than others), which is also an approximation of an upper bound on the number of
> duplicated pages. Not an actual upper bound, because distinct URIs might yield duplicate responses,
> with some (we hope) low probability.
>
> Are you aware of any attempt to do duplicate (page) detection, perhaps even to publish the IDs of
> duplicate responses, either checking only responses with the same URI, or doing the full N^2 check?
>
> Thanks again,
>
> ht
>