Sebastian Nagel writes:
> [Maciej writes]
>
>> But what is the average delay between creating a URL and indexing
>> it by Common Crawl?
Somewhat worryingly, one empirically determined answer is very close
to 0 seconds. Based on a random sample of 3 million pages from the
April 2016 crawl, my student Lukasz Domanski compared the
Last-Modified times with the crawl times for the 676,000 pages which
had valid Last-Modified headers. Here's his summary, taken from his
4th-year dissertation [1]:
"Over 56% of the pages in the sample are less than 1 day old. I
began to suspect that the overrepresentation of 1-day old pages
might be caused by webservers returning the current time as
Last-Modified header, instead of the correct value. I noticed that
nearly 40% of pages claim to be no older than 5 seconds and 35%
claim to be no older than 1 second. Additionally, 24% of pages have
Last-Modified time equal to the time they were crawled (they are '0
seconds old')."
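
For anyone who wants to repeat this kind of measurement, here is a rough
sketch of the age calculation. It is not Lukasz's code. It assumes the
crawl time is available in the WARC-Date style (e.g. 2016-04-30T10:00:00Z)
and it treats a Last-Modified value later than the crawl time as invalid,
which is my own assumption about what "valid" should mean. The function
names and the sample rows are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def page_age_seconds(last_modified, crawl_time):
    """Return crawl time minus Last-Modified in seconds, or None if unusable."""
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable header value (the exception type depends on the Python version).
        return None
    if modified.tzinfo is None:
        # Assumption: a date without a usable zone is taken to be UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    crawled = datetime.strptime(crawl_time, "%Y-%m-%dT%H:%M:%SZ")
    crawled = crawled.replace(tzinfo=timezone.utc)
    age = (crawled - modified).total_seconds()
    # Assumption: a page "modified" after it was crawled has a bogus header.
    return age if age >= 0 else None


def share_at_most(ages, limit_seconds):
    """Fraction of the given ages that are no older than limit_seconds."""
    if not ages:
        return 0.0
    return sum(1 for age in ages if age <= limit_seconds) / len(ages)


# Illustrative rows only: (Last-Modified header, crawl time).
sample = [
    ("Sat, 30 Apr 2016 10:00:00 GMT", "2016-04-30T10:00:00Z"),
    ("Sat, 30 Apr 2016 09:59:57 GMT", "2016-04-30T10:00:00Z"),
    ("Fri, 29 Apr 2016 10:00:00 GMT", "2016-04-30T10:00:00Z"),
    ("not a date", "2016-04-30T10:00:00Z"),
]
ages = [a for a in (page_age_seconds(lm, ct) for lm, ct in sample) if a is not None]

for label, limit in [("0 s", 0), ("5 s", 5), ("1 day", 86400)]:
    print(f"no older than {label}: {share_at_most(ages, limit):.0%}")
```

The thresholds in the loop mirror the ones quoted above (0 seconds, 5
seconds, 1 day); the 1-second bucket can be added the same way.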
It's worth noting that there's no obvious way to distinguish between
bogus ages of 0 [server always uses now() for Last-Modified, as Lukasz
suggests above], and true ages of 0 [server has built the page on
request, so it really is brand new]. Lukasz did look for a
correlation between Server type and page age, but didn't find one.
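
One simple way to look for such a correlation, sketched here under my own
assumptions and not taken from the dissertation, is to group pages by the
product name in the Server header and compare the share of 0-second ages
per group. The helper names and the sample records are hypothetical.

```python
from collections import defaultdict


def product_name(server_header):
    """Reduce e.g. 'Apache/2.4.7 (Ubuntu)' to 'apache'; empty values become 'unknown'."""
    tokens = server_header.split("/")[0].split()
    return tokens[0].lower() if tokens else "unknown"


def zero_age_share_by_server(records):
    """Map each server product to the fraction of its pages whose age is 0 seconds.

    records is an iterable of (Server header value, age in seconds) pairs.
    """
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for server_header, age in records:
        product = product_name(server_header)
        totals[product] += 1
        if age == 0:
            zeros[product] += 1
    return {product: zeros[product] / totals[product] for product in totals}


# Illustrative records only.
records = [
    ("nginx/1.9.3", 0.0),
    ("nginx", 120.0),
    ("Apache/2.4.7 (Ubuntu)", 0.0),
    ("", 0.0),
]
for product, share in sorted(zero_age_share_by_server(records).items()):
    print(f"{product}: {share:.0%} of pages are 0 seconds old")
```

If one server product showed a markedly higher share than the rest, that
would hint at a now()-style Last-Modified default; similar shares across
products would be consistent with finding no correlation.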
ht

[1] http://www.ltg.ed.ac.uk/~ht/Lukasz_Domanski_ug4_proj.pdf
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]