Pages versus URLs, and uniqueness of WAT file entries


Henry S Thompson

Oct 23, 2017, 2:56:56 PM
to Common Crawl
Two related newbie questions, which I can't find answers to after much searching...

1) What's the difference between 'pages' and 'urls' in size statistics, for instance in the following values from https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2014-15/stats/part-00000.gz:

 
2014 ["size", "page", "CC-MAIN-2014-15"] 2641371316
2014 ["size", "url", "CC-MAIN-2014-15"]  1718646762


Neither of these corresponds to my count of entries in the corresponding WAT files (2,534,229,771), although that's obviously closer to the page number above.

2) In what way, if any, are the entries in a given WAT collection, e.g. from CC-MAIN-2014-15, unique?

Pointers to where I should have been able to find the answers welcome, as well as the answers themselves.

Thanks,

ht

Sebastian Nagel

Oct 23, 2017, 3:51:16 PM
to common...@googlegroups.com
Hi Henry,

if one URL is fetched twice, there will be two "pages" ("response records", "captures")
in the crawl archives. In the past there have been many duplicate captures. At present,
the rate of duplicates is around 1-2%.
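For illustration, the two counts can be reproduced on a small scale from the URL index (a sketch only, assuming one cc-index CDX shard such as cdx-00000.gz has been downloaded locally; each line there is "SURT key, timestamp, JSON"):

import gzip
import json

pages = 0      # every index entry is one capture ("page")
urls = set()   # distinct URLs among those captures

with gzip.open("cdx-00000.gz", "rt", encoding="utf-8") as f:
    for line in f:
        _key, _timestamp, payload = line.split(" ", 2)
        record = json.loads(payload)
        pages += 1
        urls.add(record["url"])

print(f"captures (pages): {pages}")
print(f"unique URLs:      {len(urls)}")
print(f"duplicate rate:   {1 - len(urls) / pages:.1%}")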

For CC-MAIN-2014-15 there are more WARC files than WAT/WET files, see
http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2014-15/index.html
I do not know the reason. If only 95% of the WARCs have WATs/WETs, that would be
a plausible explanation; it is about the same ratio as 2.53/2.64.


> 2) In what way, if any, are the entries in a given WAT collection, e.g. from CC-MAIN-2014-15, unique?

No, the entries are not unique, and the same holds for the WARC files.

Best,
Sebastian

Henry S Thompson

Oct 23, 2017, 4:10:24 PM
to Common Crawl
On Monday, October 23, 2017 at 3:51:16 PM UTC-4, Sebastian Nagel wrote:
> if one URL is fetched twice, there will be two "pages" ("response records", "captures")
> in the crawl archives.  In the past there have been many duplicate captures. At present,
> the rate of duplicates is around 1-2%.
> ...
>
> > 2) In what way, if any, are the entries in a given WAT collection, e.g. from CC-MAIN-2014-15, unique?
>
> No, the entries are not unique, and the same holds for the WARC files.

Thanks for the quick and helpful reply.

Just to check my understanding, your first answer (there are still a few duplicate captures) explains your second answer, as follows:

2a) There are duplicate captures, i.e. some response records will show the same target URI, so URIs are not unique;
2b) A duplicate capture may result in duplicate pages, since the fetches are done at different times.

Right?

It follows, I think, that the difference between the page and URI counts gives the number of duplicate captures (not the same as the number of duplicated URIs, as some may have been captured more often than others), which is also an approximation of an upper bound on the number of duplicated pages.  Not an actual upper bound, because distinct URIs might yield duplicate responses, with some (we hope) low probability.
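For the CC-MAIN-2014-15 numbers quoted at the top of the thread, that difference works out as follows (a quick arithmetic check):

pages = 2_641_371_316  # ["size", "page", "CC-MAIN-2014-15"]
urls  = 1_718_646_762  # ["size", "url",  "CC-MAIN-2014-15"]

duplicate_captures = pages - urls
print(duplicate_captures)                   # 922724554, i.e. roughly 900 million
print(f"{duplicate_captures / pages:.1%}")  # 34.9% of all captures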

Are you aware of any attempt to do duplicate (page) detection, perhaps even to publish the IDs of duplicate responses, either checking only responses with the same URI, or doing the full N^2 check?

Thanks again,

ht

Sebastian Nagel

Oct 23, 2017, 4:45:23 PM
to common...@googlegroups.com
Hi Henry,

> 2a) There are duplicate captures, i.e. some response records will show the same target URI, so
> URIs are not unique;
> 2b) A duplicate capture may result in duplicate pages, since the fetches are done at different
> times.

Yes, both are correct:

2a) Ideally a single URL is fetched only once in a monthly crawl. There are no duplicate URLs in the
fetch lists, but the crawler follows redirects unchecked, which may cause a duplicate if the URL the
redirect points to has already been fetched.

2b) Of course, two captures of the same URL may or may not result in duplicate content.

> Not an actual upper bound, because distinct URIs might yield duplicate responses,
> with some (we hope) low probability.

Yes, that's true unfortunately.

> Are you aware of any attempt to do duplicate (page) detection, perhaps even to publish the IDs of
> duplicate responses, either checking only responses with the same URI, or doing the full N^2 check?

The URL index contains a digest of the binary content (raw HTML). The full N^2 check could easily be
realized as a MapReduce job. I haven't done this, for two reasons:
- near-duplicate detection would be more important to look into
- Nutch (the crawler we use) already provides a tool to do this on the CrawlDb. Of two or more
URLs with the same content checksum, all except one (the shortest or simplest URL) are flagged as
duplicates. Duplicates are then revisited at longer intervals. That's why the number of
exact duplicates is now within an acceptable range.
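For illustration, the exact-duplicate part of such a job boils down to grouping index records by their content digest, roughly like this (a sketch over a single cc-index CDX shard; a real job would shuffle by digest across all shards, and this says nothing about near-duplicates):

import gzip
import json
from collections import defaultdict

by_digest = defaultdict(list)
with gzip.open("cdx-00000.gz", "rt", encoding="utf-8") as f:
    for line in f:
        _key, _timestamp, payload = line.split(" ", 2)
        record = json.loads(payload)
        by_digest[record["digest"]].append(record["url"])

duplicates = []
for digest, url_list in by_digest.items():
    if len(url_list) > 1:
        url_list.sort(key=len)            # keep the shortest URL ...
        duplicates.extend(url_list[1:])   # ... and flag the rest as exact duplicates

print(f"{len(duplicates)} captures flagged as exact duplicates")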

Btw., the statistics files also contain a HyperLogLog estimate of the number of unique
contents. Attached is a condensed view over all crawls; the data frame is dumped while generating the
plots at
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
see step 3 "plotting" in
https://github.com/commoncrawl/cc-crawl-statistics
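The idea behind that estimate can be reproduced in miniature over the content digests of a single index shard (a sketch, assuming the third-party datasketch package; the actual statistics jobs may use a different HyperLogLog implementation):

import gzip
import json
from datasketch import HyperLogLog

hll = HyperLogLog(p=14)   # roughly 0.8% relative error
exact = set()             # exact count, feasible only on a small sample

with gzip.open("cdx-00000.gz", "rt", encoding="utf-8") as f:
    for line in f:
        _key, _timestamp, payload = line.split(" ", 2)
        digest = json.loads(payload)["digest"]
        hll.update(digest.encode("utf-8"))
        exact.add(digest)

print(f"HyperLogLog estimate of unique contents: {hll.count():.0f}")
print(f"exact number of unique digests:          {len(exact)}")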

Best,
Sebastian



crawlduplicates.txt

Sebastian Nagel

Oct 23, 2017, 4:49:40 PM
to common...@googlegroups.com
> Are you aware of any attempt to do duplicate (page) detection

Forgot to say: if anyone did some work on this (esp. near duplicates), please let us know about it!

Thanks,
Sebastian


Henry S Thompson

Oct 23, 2017, 5:14:25 PM
to Common Crawl
Thanks for all that, very helpful, particularly the attachment.

Evidently I picked a starting point (2014-15) in the middle of a run of high-duplication months :-(

ht

Henry S Thompson

Oct 25, 2017, 2:55:20 PM
to Common Crawl
On Monday, October 23, 2017 at 4:45:23 PM UTC-4, Sebastian Nagel wrote:
> Hi Henry,
>
> > 2a) There are duplicate captures, i.e. some response records will show the same target URI, so
> > URIs are not unique;
> > 2b) A duplicate capture may result in duplicate pages, since the fetches are done at different
> > times.
>
> Yes, both are correct:
>
> 2a) Ideally a single URL is fetched only once in a monthly crawl. There are no duplicate URLs in the
> fetch lists, but the crawler follows redirects unchecked, which may cause a duplicate if the URL the
> redirect points to has already been fetched.

Returning to this, specifically wrt 2014-15 (April 2014 crawl).  In your very helpful tabulation of key numbers, for this crawl we find:

         crawl       page         url         digest estim.  1-(urls/pages)  1-(digests/pages)
3   CC-MAIN-2014-15  2641371316  1718646762   2250363653         34.9%             14.8%

 It's hard to believe that redirects can account for 900,000,000 duplicates!  Are you confident there were no duplicates in the fetch list?  According to [1], this was the first crawl using just seed URIs (i.e. with no spidering) from blekko.

Thanks again for bearing with me as I dig in to this...

ht

Sebastian Nagel

Oct 25, 2017, 5:12:18 PM
to common...@googlegroups.com
Hi Henry,

> It's hard to believe that redirects can account for 900,000,000 duplicates! Are you confident
> there were no duplicates in the fetch list?

I know from Stephen Merity that he has been fighting duplicates caused by redirects.
That has even been discussed in this group; see [2].

Apache Nutch, which is used as the crawler, does not allow duplicates in fetch lists. URLs are held
as keys in the CrawlDb, a Hadoop map file which does not allow duplicate keys. Fetch lists are
generated from the CrawlDb, so every URL should end up in the fetch list only once.

Of course, I cannot be 100% confident about the number because there may be a bug (more likely in
the code to count pages and unique URLs than in Nutch). But I have no evidence that the numbers
are wrong.

The problem about redirects is that many sites do not send a 404 if an outdated URL is requested,
instead a redirect is sent back pointing to the home page or a login page, etc.

At present, every month about 100 million URLs are flagged as duplicates because they are redirected
to a known URL (or a second URL is redirected to the same target). The 100 million are in addition
to previously flagged duplicate redirects. After the September crawl there were 2.9 billion
redirects in the CrawlDb, 1.15 billion of them flagged as duplicates. That's why 900 million is in general
a plausible number.

To get rid of stale URLs, duplicates are deleted from the CrawlDb if not seen in "seeds" (a sitemap, a
breadth-first seed crawl, links randomly selected from WAT files, a seed donation, etc.) for more
than 4 months.


> According to [1], this was the first crawl using just
> seed URIs (i.e. with no spidering) from blekko.

The blekko seed donation was announced end of 2012 [3], and the 2013 crawls are definitely based on
blekko's seeds [4,5].


Best,
Sebastian


[2] https://groups.google.com/d/topic/common-crawl/iTV17kbU94E/discussion
[3] http://commoncrawl.org/2012/12/blekko-donates-search-data-to-common-crawl/
[4] http://commoncrawl.org/2014/01/winter-2013-crawl-data-now-available/
[5] https://groups.google.com/d/topic/common-crawl/H7jE-585uj8/discussion



Henry S Thompson

Oct 30, 2017, 11:46:03 AM
to Common Crawl
On Monday, October 23, 2017 at 4:45:23 PM UTC-4, Sebastian Nagel wrote:
> Hi Henry,
>
> > 2a) There are duplicate captures, i.e. some response records will show the same target URI, so
> > URIs are not unique;
> > 2b) A duplicate capture may result in duplicate pages, since the fetches are done at different
> > times.
>
> Yes, both are correct:
>
> 2a) Ideally a single URL is fetched only once in a monthly crawl. There are no duplicate URLs in the
> fetch lists, but the crawler follows redirects unchecked, which may cause a duplicate if the URL the
> redirect points to has already been fetched.

OK, I'm still confused.  For example what you describe is like this:

 1) We start with two distinct URIs, U1 and U2, in the to-be-crawled list. 
 2) We GET U1, and get P1
 3) We GET U2, it redirects to U1, so we get P1 again

Two URIs, two pages, two response records, same digest

So far as that goes, it's clear.  But how can it result in more response records than URIs in the initial crawl list?  Well, maybe it wasn't (3) as above, but

 3') We GET U2, it redirects to U1, but time has passed, so we get P1' this time

 Two URIs, two pages, two response records, distinct digests 

But we still only have the same number of pages as the number of URIs we started with.  How can we ever get more (as we evidently do, given the numbers)?

Thanks for your patience and continued help!

Henry S Thompson

Oct 30, 2017, 1:08:42 PM
to Common Crawl
PS, I did a quick check on 4 WAT files I had to hand from 2014-15, and 240372 out of 240376 responses had status code 200, so redirection landing pages can't account for the 900,000,000 surplus of responses over crawl-list URIs.
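For reference, such a check can be scripted along these lines (a sketch using the warcio package and a placeholder file name, and assuming the usual WAT JSON layout Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> Response-Message -> Status):

import json
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

statuses = Counter()
with open("example.warc.wat.gz", "rb") as stream:   # placeholder path to a local WAT file
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        envelope = json.loads(record.content_stream().read())["Envelope"]
        response = envelope["Payload-Metadata"].get("HTTP-Response-Metadata")
        if response:   # only WAT entries that describe an HTTP response
            statuses[response["Response-Message"]["Status"]] += 1

print(statuses.most_common(5))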

Henry S Thompson

Oct 30, 2017, 2:06:52 PM
to Common Crawl
PPS There are no duplicate URIs among those 240376 responses.

Tom Morris

Oct 30, 2017, 7:36:16 PM
to common...@googlegroups.com
On Mon, Oct 23, 2017 at 4:49 PM, 'Sebastian Nagel' via Common Crawl <common...@googlegroups.com> wrote:
> > Are you aware of any attempt to do duplicate (page) detection
>
> Forgot to say: if anyone did some work on this (esp. near duplicates), please let us know about it!

I did some work on this based on the C4CorpusTools package from TU Darmstadt. 
You can see some example numbers in my comment on this thread:

My fork of the project, which reworks the dataflow to be more efficient, is on this branch:
https://github.com/tfmorris/dkpro-c4corpus/tree/new-dataflow
Tom

Tom Morris

Oct 30, 2017, 8:18:26 PM
to common...@googlegroups.com
On Wed, Oct 25, 2017 at 5:12 PM, Sebastian Nagel <seba...@commoncrawl.org> wrote:
> Hi Henry,
>
> > It's hard to believe that redirects can account for 900,000,000 duplicates!  Are you confident
> > there were no duplicates in the fetch list?
>
> I know from Stephen Merity that he has been fighting duplicates caused by redirects.
> That has even been discussed in this group; see [2].

Although Stephen says earlier [6] in that thread [2]:

"The duplication percentage for two exact URLs being in a single crawl archive should be quite low. The URL list is made unique before the crawling process is begun in our preparation stage. The only situation in which the exact same URL should be crawled twice is if the crawler follows a redirect from a previous URL."

Anecdotal empirical results have demonstrated that not to be the case. One specific case involves query parameters which make for a "different" URL that returns the same results [6]. You can also have multiple paths on a web server to effectively the same content, without the use of redirects.

My first guess in looking at the 2014 crawl stats [8] would be that: a) individual URLs are getting crawled multiple times and b) they're returning pages which are similar, but contain time stamps or a recent articles roll or some other boilerplate that varies. That's just a guess though and you'd need to do the work to dig into the crawl to see what's actually going on.

Given that the current crawls are much cleaner and figuring out what's going on with the older crawls is potentially a lot of work, I have to ask whether you really care enough about those old crawls to invest the time and money in doing that investigation of a 3 1/2 year old crawl.

Tom

 


Tom Morris

Oct 30, 2017, 9:57:25 PM
to common...@googlegroups.com
Curiosity got the better of me, so I did a quick and dirty analysis of a 1% sample of the CC-MAIN-2014-15 URL index.

$ time aws --no-sign-request s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2014-15/indexes/cdx-00000.gz - | gunzip | cut -d ' ' -f 3-999 | jq -r .url | gzip > cc-urls-00000.gz
$ time aws --no-sign-request s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2014-15/indexes/cdx-00100.gz - | gunzip | cut -d ' ' -f 3-999 | jq -r .url | gzip > cc-urls-00100.gz
$ time aws --no-sign-request s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2014-15/indexes/cdx-00200.gz - | gunzip | cut -d ' ' -f 3-999 | jq -r .url | gzip > cc-urls-00200.gz

That takes under 15 minutes to download the necessary indices, generating a manageable 171 MB of data to analyze.

$ gzcat cc* | wc -l
 26238874
$ gzcat cc* | uniq | wc -l
 16997529
$ gzcat cc* | cut -d '?' -f 1 | uniq | wc -l
 11905037

So 35% of the page captures are for duplicate URLs. Looking at just the unique URLs, 30% of them differ only in their query parameters.
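The most heavily duplicated URLs can be pulled out of the same cc-urls-*.gz files, for example (a sketch in Python rather than the shell pipeline above):

import glob
import gzip
from collections import Counter

counts = Counter()
for path in glob.glob("cc-urls-*.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        counts.update(line.rstrip("\n") for line in f)

for url, n in counts.most_common(10):   # the ten most frequently captured URLs
    print(n, url)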

Some of the URLs are duplicated as many as 70 times, e.g.


From visual inspection it appears that there is a large percentage of error pages, so perhaps that's a clue as to the source of the duplicates.

Tom

Sebastian Nagel

Oct 31, 2017, 7:47:41 AM
to common...@googlegroups.com
Hi Henry,

> 1) We start with two distinct URIs, U1 and U2, in the to-be-crawled list.
> 2) We GET U1, and get P1
> 3) We GET U2, it redirects to U1, so we get P1 again
>
> Two URIs, two pages, two response records, same digest
>

No, _one_ URL, two pages, two response records, ...
Regarding the digest: yes, it may be the same or different, often only because
the content contains a time stamp.

* response records always carry the URL of the GET request;
a redirection, if followed, causes two or more GET requests

* redirects as well as 404s and other non-200 request/response pairs are
not recorded in CC-MAIN-2014-15. For about a year now they have been captured
in WARC files (but not WAT and WET, see [1]) and more recently are also contained
in the URL index [2]

* it's actually:
2) We GET U1, and get P1
3a) We GET U2, it redirects to U1, so we follow the redirect immediately ...
3b) We GET U1, and get P1 or P1'
which gives:
2 - size of fetch list
1 - number of unique URLs in WARC files
2 - WARC response records


I hope that now explains how the URL-level duplicates appear in the archives.

Best,
Sebastian


[1] http://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
[2] http://commoncrawl.org/2016/12/december-2016-crawl-archive-now-available/



Sebastian Nagel

Oct 31, 2017, 8:23:29 AM
to common...@googlegroups.com
Hi Tom,

thanks for the pointer!

Of course, inside a single segment there can be no URL-level duplicates,
because a segment is written by one MapReduce job and URLs are used as
keys, which makes them automatically unique.

That makes 100 the upper bound for the number of "duplicate"
captures of a single URL within one monthly crawl, because a crawl is
done in 100 segments.
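One way to check this against the index: since it is sorted by URL key, all captures of a URL land in the same CDX shard (apart from the rare URL that straddles a shard boundary), so the number of distinct segments per URL can be counted shard by shard (a sketch, assuming the usual layout of the "filename" field):

import gzip
import json
from collections import defaultdict

segments_per_url = defaultdict(set)
with gzip.open("cdx-00000.gz", "rt", encoding="utf-8") as f:
    for line in f:
        _key, _timestamp, payload = line.split(" ", 2)
        record = json.loads(payload)
        # "filename" looks like "crawl-data/<crawl>/segments/<segment-id>/warc/...warc.gz"
        segment = record["filename"].split("/segments/")[1].split("/")[0]
        segments_per_url[record["url"]].add(segment)

url, segments = max(segments_per_url.items(), key=lambda kv: len(kv[1]))
print(f"most widely duplicated URL in this shard: {url} ({len(segments)} segments, <= 100)")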

Best,
Sebastian

Sebastian Nagel

Oct 31, 2017, 8:52:41 AM
to common...@googlegroups.com
Hi Tom,

> Given that the current crawls are much cleaner and figuring out what's going on with the older
> crawls is potentially a lot of work, I have to ask whether you really care enough about those old
> crawls to invest the time and money in doing that investigation of a 3 1/2 year old crawl.

Yes, the data remains as it is. Maybe in the future we remove exact duplicates (or replace them
by "revisit" records). But this will only happen in combination with further improvements (better
packaging, tighter compression, etc.)

> From visual inspection it appears that there is a large percentage of error pages, so perhaps
> that's a clue as to the source of the duplicates.

That's true. I've made a similar analysis of the 10% URL-level duplicates contained in the 2016-07
crawl: it's mostly the typical landing pages you're redirected to when the server does not send a 404
response, sometimes error pages, but often also the home page itself or overview pages with
alternative products or topics.

Thanks,
Sebastian

Henry S Thompson

Oct 31, 2017, 11:13:27 AM
to Common Crawl
On Monday, October 30, 2017 at 9:57:25 PM UTC-4, Tom Morris wrote:
> Curiosity got the better of me, so I did a quick and dirty analysis of a 1% sample of the CC-MAIN-2014-15 URL index.

Thanks, that's very useful indeed!

Earlier on Monday, Tom wrote:

> Given that the current crawls are much cleaner and figuring out what's going on with the older crawls is potentially a lot of work, 
> I have to ask whether you really care enough about those old crawls to invest the time and money in doing that investigation of 
> a 3 1/2 year old crawl. 

Since what I'm studying is the uptake over time of DOIs, yes, I do care enough, since I want to do my best to compare like with like...

Your 1% analysis is already very useful in this regard, as it suggests that for my purposes the duplicates are not likely to be much of an issue, as neither error pages nor the kind of more-or-less commercial sites which use lots of params to manage state are likely to be the kind of site to use DOIs anyway...

Not entirely sure about the latter claim, but since I'm already discarding params when I count unique occurrences, it doesn't really matter.

Thanks again, you'll get a citation in the resulting paper!

Henry S Thompson

Oct 31, 2017, 11:21:31 AM
to Common Crawl
On Tuesday, October 31, 2017 at 7:47:41 AM UTC-4, Sebastian Nagel wrote:
> Hi Henry,
>
> >  1) We start with two distinct URIs, U1 and U2, in the to-be-crawled list.
> >  2) We GET U1, and get P1
> >  3) We GET U2, it redirects to U1, so we get P1 again
> >
> > Two URIs, two pages, two response records, same digest
>
> No, _one_ URL, two pages, two response records, ...
>
> * response records always carry the URL of the GET request;
>   a redirection, if followed, causes two or more GET requests
> ...
>
> * it's actually:
>     2)  We GET U1, and get P1
>     3a) We GET U2, it redirects to U1, so we follow the redirect immediately ...
>     3b) We GET U1, and get P1 or P1'
>  which gives:
>  2 - size of fetch list
>  1 - number of unique URLs in WARC files
>  2 - WARC response records
>
> I hope that now explains how the URL-level duplicates appear in the archives.

Yes!  But it raises a new (easy, I hope) question -- how big was the fetch list?  I had been assuming all along that the URI numbers reported in various places, including your helpful table from last week, were the fetch list sizes, but from your little story above they must be the "number of unique URLs in WARC files" -- right?  Because that's the only number that will be less (a lot less, evidently, in 2014-15) than the number of response records.

ht

Sebastian Nagel

Oct 31, 2017, 11:56:19 AM
to common...@googlegroups.com
Hi Henry,

> -- how big was the fetch list?

See
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlermetrics

For the exact numbers:
git clone https://github.com/commoncrawl/cc-crawl-statistics.git
grep fetch_list cc-crawl-statistics/stats/crawler/CC-MAIN-*.json \
  | sed 's/[^A-Z0-9-]\{2,\}/ /g'
CC-MAIN-2016-18 4884080239
CC-MAIN-2016-22 5261630738
CC-MAIN-2016-26 3578181474
CC-MAIN-2016-30 2973093696
CC-MAIN-2016-36 2837438550
CC-MAIN-2016-40 2438332392
CC-MAIN-2016-44 4298206731
CC-MAIN-2016-50 3689984208
CC-MAIN-2017-04 3877984074
CC-MAIN-2017-09 4435442265
CC-MAIN-2017-13 3953684648
CC-MAIN-2017-17 3741728936
CC-MAIN-2017-22 3949651087
CC-MAIN-2017-26 4019722453
CC-MAIN-2017-30 4674326265
CC-MAIN-2017-34 4228576525
CC-MAIN-2017-39 3951178480
CC-MAIN-2017-43 4775053544

A rough estimate: right now the fetch list is about 1.5 times the number of crawl archive records.
But it took some trial and error to get to this ratio; in the past it happened that 3/4 of
the fetch list was not fetched successfully.

Unfortunately, for older crawls these numbers do not exist.
But the fetch list size is only one indicator. More important is what's in there:
- are the URLs valid and not "stale"?
- are they allowed to be crawled (not excluded by robots.txt)?
- do they form a representative sample? etc.

> the "number of unique URLs in WARC files" -- right?

Yes. Sorry, without that misunderstanding we would probably have gotten to the point faster.

Best,
Sebastian

Tom Morris

Dec 3, 2017, 9:57:50 PM
to common...@googlegroups.com
[Resurrecting a month-old thread]

On Tue, Oct 31, 2017 at 11:13 AM, Henry S Thompson <h...@inf.ed.ac.uk> wrote:

Since what I'm studying is the uptake over time of DOIs, yes, I do care enough, since I want to do my best to compare like with like...

I'd be interested in the results of the study when it's complete, but, depending on the design of the study, using CommonCrawl data in a longitudinal fashion could be, umm, "challenging."

If you look at the graphs that Sebastian linked to earlier:

https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize

you'll see that CommonCrawl has gone through three pretty distinct phases, most visible in the "New Items per Crawl" chart. During the 2 year period of the middle phase, roughly from mid-2014 to late-2016, very few new URLs were crawled with each month's crawl basically retracing largely the same ground. In the third phase, since late 2016 there has been very little month-to-month overlap between crawls.

I suspect it'll take some care to design a study which will deal appropriately with data from all three phases of the CommonCrawl history.

Tom

Henry S. Thompson

Dec 4, 2017, 5:03:28 AM
to common...@googlegroups.com
Tom Morris writes:

> [Resurrecting a month-old thread]
>
> On Tue, Oct 31, 2017 at 11:13 AM, Henry S Thompson <h...@inf.ed.ac.uk> wrote:
>
>> Since what I'm studying is the uptake over time of DOIs, yes, I do care
>> enough, since I want to do my best to compare like with like...
>
> I'd be interested in the results of the study when it's complete, but,
> depending on the design of the study, using CommonCrawl data in a
> longitudinal fashion could be, umm, "challenging."

I'll share it once I know whether the paper I've submitted on it is
accepted.

> If you look at the graphs that Sebastian linked to earlier:
>
> https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
>
> you'll see that CommonCrawl has gone through three pretty distinct phases,
> most visible in the "New Items per Crawl" chart. During the 2 year period
> of the middle phase, roughly from mid-2014 to late-2016, very few new URLs
> were crawled with each month's crawl basically retracing largely the same
> ground. In the third phase, since late 2016 there has been very little
> month-to-month overlap between crawls.

Thanks for the useful summary. Fortunately the core of the study
compared the April crawls in 2014 and 2017.

> I suspect it'll take some care to design a study which will deal
> appropriately with data from all three phases of the CommonCrawl history.

Indeed. And I may make the attempt, in which case I'll certainly share
the plan with this list and ask for feedback.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]