Hi Stan,
> 1) What is the web traverse algorithms used by CC, i.e.
> how new urls are chosen to be crawled ?
The starting point is the harmonic centrality (HC) scores/ranks calculated on
the web graphs. First, the HC rank defines how many pages are allowed per
domain. Second, the domain-level scores are "projected" to the level
of URLs/pages by OPIC and inlink count. Which URLs are chosen is determined
by the resulting score plus a random value, which gives lower-scoring URLs/pages
a chance to be sampled. The fetch history is also taken into account:
if a URL was already fetched, some time (more if the score is low)
must elapse before it is re-fetched. If a re-fetch yields a duplicate
(not modified), the re-fetch interval is increased further.
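To make the steps above concrete, here is a minimal sketch in Python. It is
not the actual Nutch/CC code: the function names, the interval formula, the
jitter range and the per-domain quota table are all assumptions chosen for
illustration only.

```python
import random


def refetch_interval_days(score, unchanged_refetches=0, base_days=30.0):
    """Hypothetical formula: low scores wait longer, and every re-fetch
    that found the page unmodified stretches the interval further."""
    interval = base_days / max(score, 0.05)
    return interval * (1.5 ** unchanged_refetches)


def is_due(page, today):
    """A page is due if it was never fetched or its interval has elapsed."""
    if page["last_fetch_day"] is None:
        return True
    waited = today - page["last_fetch_day"]
    return waited >= refetch_interval_days(page["score"], page["unchanged"])


def select_urls(pages, domain_quota, today, rng, jitter=0.05):
    """Rank due pages by score plus a random value, then apply the
    per-domain page limit (assumed here to be derived from the HC rank)."""
    due = [p for p in pages if is_due(p, today)]
    ranked = sorted(due,
                    key=lambda p: p["score"] + rng.uniform(0.0, jitter),
                    reverse=True)
    taken = {}
    selected = []
    for page in ranked:
        domain = page["domain"]
        if taken.get(domain, 0) < domain_quota.get(domain, 0):
            taken[domain] = taken.get(domain, 0) + 1
            selected.append(page["url"])
    return selected


if __name__ == "__main__":
    pages = [
        {"url": "https://a.example/1", "domain": "a.example",
         "score": 0.9, "last_fetch_day": None, "unchanged": 0},
        {"url": "https://b.example/1", "domain": "b.example",
         "score": 0.2, "last_fetch_day": 95, "unchanged": 0},
    ]
    quota = {"a.example": 2, "b.example": 5}
    print(select_urls(pages, quota, today=100, rng=random.Random(0)))
```

A larger jitter lets low-scoring pages overtake higher-scoring ones more
often; a jitter of zero turns the selection into a plain top-N by score.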
> it is possible that crawler visits the same page several times
> during one crawl. Is this true and why is it not prevented?
Yes, there are still URL-level duplicates because the fetcher follows
redirects without verifying whether the redirect target was already fetched
or is expected to be fetched in this crawl.
The share of URL-level duplicates has been steadily below 1% in recent crawls.
Usually, there are 1.5 - 4% content-level duplicates (with different URLs).
So the priority should be on those, and even more on near-duplicates.
> it does not seem like a difficult or resource consuming tweak to any
> distributed architecture to prevent double crawl
Well, eliminating URL-level duplicates entirely would require
a shared data structure or service which is initially filled with 3-4 billion
URLs and then checked and continuously updated for 300 million redirect targets.
> and this can make a difference in terms of the proportion of the web crawled ,
> e.g. 10% v.s. 30 %).
Agreed. And URL-level duplicates were definitely a problem in the past, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
> 2) CC seems to be a distributed multi threaded crawler. Are the threads independent or do
> they access each other's data? How many threads (or CPUs) are there and how much time it takes
> to do the crawl?
The crawler is Apache Nutch (https://nutch.apache.org/), a multi-process and
multi-threaded crawler.
The crawling is done in 100 segments, one after another. Each segment is fetched
in one Hadoop job by 40 parallel tasks (running as separate processes), each with
160 parallel fetcher threads plus threads handling the queues. The threads do not
communicate with each other, but the queues are shared between the threads of one task.
There is a chapter about Nutch in Tom White's book "Hadoop: The Definitive Guide"
(1st, 2nd or 3rd edition) which explains the principles of how work is distributed
while ensuring politeness and the like.
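The idea of threads that do not talk to each other but share per-host queues
can be sketched as follows. This is a simplified illustration in Python, not
Nutch's actual implementation: the class and function names, the delay value
and the rule "wait a fixed delay after a URL of the same host was handed out"
are assumptions.

```python
import threading
import time
from collections import deque

WAIT = object()  # sentinel: URLs remain, but every such host is cooling down


class HostQueues:
    """Per-host FIFO queues shared by all fetcher threads of one task."""

    def __init__(self, urls_by_host, delay_s):
        self._queues = {h: deque(urls) for h, urls in urls_by_host.items()}
        self._next_ok = {h: 0.0 for h in urls_by_host}
        self._delay_s = delay_s
        self._lock = threading.Lock()

    def take(self):
        """Return (host, url), WAIT, or None when all queues are empty."""
        with self._lock:
            now = time.monotonic()
            pending = False
            for host, queue in self._queues.items():
                if not queue:
                    continue
                pending = True
                if self._next_ok[host] <= now:
                    # Politeness: block this host for delay_s seconds.
                    self._next_ok[host] = now + self._delay_s
                    return host, queue.popleft()
            return WAIT if pending else None


def fetcher_thread(queues, fetch, results):
    """Pull URLs until the queues are drained; no thread-to-thread messages."""
    while True:
        item = queues.take()
        if item is None:
            return
        if item is WAIT:
            time.sleep(0.001)
            continue
        host, url = item
        results.append((host, url, fetch(url)))  # append is atomic in CPython


def run_task(urls_by_host, fetch, num_threads=4, delay_s=0.01):
    """Run one task: num_threads fetchers sharing one set of host queues."""
    queues = HostQueues(urls_by_host, delay_s)
    results = []
    threads = [threading.Thread(target=fetcher_thread,
                                args=(queues, fetch, results))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    urls = {"a.example": ["a1", "a2", "a3"], "b.example": ["b1"]}
    print(run_task(urls, fetch=lambda url: len(url)))
```

With the numbers above, one segment runs 40 such tasks with 160 fetcher
threads each, i.e. 6,400 fetcher threads in total.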
> What is the bandwidth consumed?
Actually, the uncompressed payload (HTML pages, mostly) of a monthly crawl
with 2.5 - 3 billion pages is around 250 - 300 TiB. Requesting that many pages,
plus 20-30% redirects and 404s on top of the successfully fetched pages, takes
  100 - 120 TiB ingress traffic
  6 - 8 TiB egress traffic
including the overhead of the protocol layers. Ingress traffic, in turn, benefits
from protocol-level compression. Note that these numbers do not include
cluster-internal traffic.
> But it is hard to estimate your CPU power.
Currently, the fetching is done in about 13 days (100 segments at about 3 hours each)
using a cluster of 16-20 EC2 r*.xlarge instances (32 GB RAM, 4 vCPUs).
This includes the packaging of the content into WARC files, which is CPU intensive
because of calculating checksums, WARC compression and the detection of MIME type
and content language.
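For a rough feeling of the average rates behind these numbers, here is a small
back-of-the-envelope calculation in Python. It assumes 100 segments of about
3 hours each with the load spread evenly over that time, which is a
simplification; the function names are only illustrative.

```python
SEGMENTS = 100            # segments per crawl, fetched one after another
HOURS_PER_SEGMENT = 3     # approximate fetch time per segment
TASKS = 40                # parallel fetcher tasks per segment
THREADS_PER_TASK = 160    # fetcher threads per task
TIB = 2 ** 40             # bytes per TiB


def fetch_seconds():
    """Total fetch time of one crawl in seconds."""
    return SEGMENTS * HOURS_PER_SEGMENT * 3600


def pages_per_second(pages):
    """Average number of successfully fetched pages per second."""
    return pages / fetch_seconds()


def ingress_gbit_per_second(ingress_tib):
    """Average ingress rate in Gbit/s for the given ingress volume in TiB."""
    return ingress_tib * TIB * 8 / fetch_seconds() / 1e9


def fetcher_threads():
    """Fetcher threads running at the same time within one segment."""
    return TASKS * THREADS_PER_TASK


if __name__ == "__main__":
    print("fetch time (days):", fetch_seconds() / 86400)
    for pages in (2.5e9, 3.0e9):
        rate = pages_per_second(pages)
        print("pages/s:", round(rate),
              "per thread:", round(rate / fetcher_threads(), 2))
    for tib in (100, 120):
        print("ingress Gbit/s:", round(ingress_gbit_per_second(tib), 2))
```

These are averages over the whole cluster; redirects and 404s come on top of
the page rate, and peak rates within a segment can differ.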
Pre- and post-processing together require a similar amount of CPU time
as the fetching alone. Usually, compute-optimized instances are added
to speed up the processing.
Best,
Sebastian
On 7/17/21 6:14 PM, Stan Srednyak wrote:
>
> hi CC,
>
> thanks for the amazing work and public service that you have been doing.
>
> I have some overall questions:
>
> 1) What is the web traverse algorithms used by CC, i.e. how new urls are chosen to be crawled ? I think they are chosen randomly, and the
> question is, from what distribution? In a post on this email list I saw that it is possible that crawler visits the same page several times
> during one crawl. Is this true and why is it not prevented? ( it does not seem like a difficult or resource consuming tweak to any
> distributed architecture to prevent double crawl and this can make a difference in terms of the proportion of the web crawled , e.g. 10%
> v.s. 30 %).
>
>
> 2) CC seems to be a distributed multi threaded crawler. Are the threads independent or do they access each other's data? How many threads (
> or CPUs) are there and how much time it takes to do the crawl? What is the bandwidth consumed?
>
> I can estimate the bandwidth. Assuming that you crawl ~1PB in 10 days (~10^6 s)( this is your monthly crawl) this gives ~10^15/10^6