Delay between creating and indexing URL


Maciej Gawinecki

Mar 8, 2021, 9:24:53 AM
to Common Crawl
Hi there,

I have just found that some URLs in Common Crawl are crawled even a year after the resource under a given URL was created. For instance, this happens for some forum posts. I estimated this by comparing the post creation date (available in the HTML content) with the first time the URL was indexed by Common Crawl.
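For reference, here is roughly how I made the estimate: a minimal sketch that queries the public CDX index API at index.commoncrawl.org for the earliest capture of a URL. The endpoint list and field names follow the documented API; error handling is kept to a minimum.

    import json
    from datetime import datetime

    import requests

    def earliest_capture(url):
        # Fetch the list of all monthly crawls and their CDX API endpoints.
        crawls = requests.get("https://index.commoncrawl.org/collinfo.json").json()
        timestamps = []
        for crawl in crawls:
            resp = requests.get(crawl["cdx-api"],
                                params={"url": url, "output": "json", "limit": "1"})
            if resp.status_code != 200 or not resp.text.strip():
                continue  # URL not captured in this crawl
            record = json.loads(resp.text.splitlines()[0])
            timestamps.append(datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S"))
        return min(timestamps) if timestamps else None

    # delay = earliest_capture(post_url) - post_creation_date_parsed_from_html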

I know that Common Crawl crawls monthly.

But what is the average delay between creating a URL and indexing it by Common Crawl?

Any estimates?

With kind regards,
Maciej Gawinecki

Sebastian Nagel

Mar 8, 2021, 10:31:15 AM
to common...@googlegroups.com
Hi Maciej,

> But what is the average delay between creating a URL and indexing it by Common Crawl?

Difficult to say, mostly because I do not know when the post(s) were created or crawled.
Over the years there have been multiple changes in how URLs are detected and sampled.

A short explanation of the current state:
- every URL is subject to sampling; this applies both to known URLs (page re-fetches) and to new URLs
  found via links or sitemaps
- in order to provide a global and balanced sample and to ensure crawler politeness,
  the number of sampled pages per domain is limited. The limit depends on the harmonic
  centrality score of the domain (a rough illustration below).
- for large forums or blogging sites the number of sampled pages is naturally small
  compared to the sheer number of available URLs/pages.
- in turn, it may take a relatively long time (if it happens at all) until a URL is randomly sampled,
  especially if it is deep in the site, with many hops/links necessary to reach the page from
  the home page.
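To make the second point concrete, here is a purely illustrative sketch in Python. It is not the actual fetch-list generation code, and the budget numbers are made up; it only shows how a per-domain limit derived from a centrality score keeps the chance of any single deep page being picked low.

    import random

    def domain_page_budget(harmonic_centrality, min_pages=100, max_pages=100_000):
        # Hypothetical budget: higher-centrality domains get a larger share of
        # the monthly crawl, within a fixed range. The real limits differ.
        return int(min(max_pages, max(min_pages, harmonic_centrality * 1_000)))

    def sample_domain_urls(candidate_urls, harmonic_centrality):
        # Randomly pick at most `budget` URLs for this domain, treating
        # re-fetch candidates and newly discovered links/sitemap URLs alike.
        budget = domain_page_budget(harmonic_centrality)
        if len(candidate_urls) <= budget:
            return list(candidate_urls)
        return random.sample(list(candidate_urls), budget)

    # A forum with 5 million known URLs but a budget of, say, 5,000 pages gives
    # each individual post roughly a 0.1% chance of being fetched in a month.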

Best,
Sebastian

Ed Summers

Mar 8, 2021, 11:24:19 AM
to common...@googlegroups.com
On Mon, 2021-03-08 at 16:31 +0100, Sebastian Nagel wrote:
>
> Over the years there have been multiple changes in how URLs are detected and sampled.
>
> A short explanation of the current state:
> - every URL is subject to sampling; this applies both to known URLs (page re-fetches) and to new URLs
>   found via links or sitemaps
> - in order to provide a global and balanced sample and to ensure crawler politeness,
>   the number of sampled pages per domain is limited. The limit depends on the harmonic
>   centrality score of the domain.
> - for large forums or blogging sites the number of sampled pages is naturally small
>   compared to the sheer number of available URLs/pages.
> - in turn, it may take a relatively long time (if it happens at all) until a URL is randomly sampled,
>   especially if it is deep in the site, with many hops/links necessary to reach the page from
>   the home page.

Thanks for these details, Sebastian. As someone who is relatively new to
CC, I've been interested in reading about how the crawl runs and how it
has changed over time. Is there any code on GitHub that is used for
managing this process? Or is it handled some other way?

//Ed

Henry S. Thompson

Mar 8, 2021, 12:57:53 PM
to common...@googlegroups.com
Sebastian Nagel writes:

> [Maciej writes]
>
>> But what is the average delay between creating a URL and indexing
>> it by Common Crawl?

Somewhat worryingly, one empirically determined answer is very close
to 0 seconds. Based on a random sample of 3 million pages from the
April 2016 crawl, my student Lukasz Domanski compared the
Last-Modified times with the crawl times for the 676,000 pages that had
valid Last-Modified headers. Here's his summary, taken from his
4th-year dissertation [1]:

"Over 56% of the pages in the sample are less than 1 day old. I
began to suspect that the overrepresentation of 1-day old pages
might be caused by webservers returning the current time as
Last-Modified header, instead of the correct value. I noticed that
nearly 40% of pages claim to be no older than 5 seconds and 35%
claim to be no older than 1 second. Additionally, 24% of pages have
Last-Modified time equal to the time they were crawled (they are "0
seconds old")."

It's worth noting that there's no obvious way to distinguish between
bogus ages of 0 [server always uses now() for Last-Modified, as Lukasz
suggests above], and true ages of 0 [server has built the page on
request, so it really is brand new]. Lukasz did look for a
correlation between Server type and page age, but didn't find one.
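For anyone who wants to redo the comparison on a current crawl, here is a minimal sketch, assuming the WARC response records are read with the warcio library; the age buckets simply mirror the figures quoted above.

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    from warcio.archiveiterator import ArchiveIterator

    def page_age_seconds(record):
        # Seconds between the Last-Modified header and the crawl time
        # (WARC-Date); None if the header is missing or unparsable.
        last_modified = record.http_headers.get_header("Last-Modified")
        if not last_modified:
            return None
        try:
            modified = parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            return None
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)
        crawled = datetime.strptime(record.rec_headers.get_header("WARC-Date"),
                                    "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        return (crawled - modified).total_seconds()

    def age_histogram(warc_path):
        # Count pages by claimed age at crawl time.
        buckets = {"0s": 0, "<=1s": 0, "<=5s": 0, "<=1day": 0, "older": 0}
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                age = page_age_seconds(record)
                if age is None:
                    continue
                if age <= 0:
                    buckets["0s"] += 1
                elif age <= 1:
                    buckets["<=1s"] += 1
                elif age <= 5:
                    buckets["<=5s"] += 1
                elif age <= 86400:
                    buckets["<=1day"] += 1
                else:
                    buckets["older"] += 1
        return buckets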

ht

[1] http://www.ltg.ed.ac.uk/~ht/Lukasz_Domanski_ug4_proj.pdf
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
