Some details on CC architecture.


Stan Srednyak

Jul 17, 2021, 12:14:26 PM
to common...@googlegroups.com

hi CC,

thanks for the amazing work and public service that you have been doing.

I have some overall questions:

1) What is the web traversal algorithm used by CC, i.e. how are new URLs chosen to be crawled? I think they are chosen randomly, and the question is: from what distribution? In a post on this mailing list I saw that it is possible that the crawler visits the same page several times during one crawl. Is this true, and why is it not prevented? (It does not seem like a difficult or resource-consuming tweak to any distributed architecture to prevent double crawling, and it can make a difference in terms of the proportion of the web crawled, e.g. 10% vs. 30%.)


2) CC seems to be a distributed, multi-threaded crawler. Are the threads independent or do they access each other's data? How many threads (or CPUs) are there, and how much time does it take to do the crawl? What is the bandwidth consumed?

I can estimate the bandwidth. Assuming that you crawl ~1 PB in 10 days (~10^6 s) (this is your monthly crawl), this gives ~10^15 B / 10^6 s ≈ 1 GB/s. But it is hard to estimate your CPU power. It really depends on what you are doing with the pages (I believe you construct the index and the web graph, compute various ranks, do functional analysis on JS and the like). Could you please give me some hint? If it is not a secret, could you please describe your hardware?



best regards,
Stan

Sebastian Nagel

Jul 19, 2021, 11:33:33 AM
to common...@googlegroups.com
Hi Stan,

> 1) What is the web traverse algorithms used by CC, i.e.
> how new urls are chosen to be crawled ?

The starting point are the harmonic centrality scores/ranks calculated on
the web graphs. First, the HC defines how many pages are allowed per
domain. Second, the domain-level scores are "projected" to the level
of URLs/pages by OPIC and inlink count. Which URLs are chosen is defined
by the resulting score plus a random value, to give lower-scoring URLs/pages
a chance to be sampled. The fetch history is also taken into account:
if a URL was already fetched, some time (more if the score is low)
must elapse before it is re-fetched. If a re-fetch yields a duplicate
(not modified), the re-fetch interval is further increased.
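
For illustration, here is a minimal sketch (in Python, with made-up names, weights and thresholds; not Common Crawl's actual code) of the "score plus random value" selection described above, including a simple re-fetch interval check:

    import random

    SELECTION_THRESHOLD = 0.5   # hypothetical cut-off, tuned to the crawl budget

    def select_for_crawl(score, days_since_last_fetch=None,
                         min_refetch_days=30, jitter_weight=0.3):
        """Decide whether a URL is sampled into the current crawl.

        `score` stands for the per-URL value projected from the domain-level
        harmonic centrality via OPIC / inlink counts. All names, weights and
        thresholds here are illustrative assumptions.
        """
        # Respect the fetch history: recently fetched URLs are skipped.
        if days_since_last_fetch is not None and days_since_last_fetch < min_refetch_days:
            return False
        # Add a random component so that low-scoring URLs still get a chance.
        effective_score = score + jitter_weight * random.random()
        return effective_score >= SELECTION_THRESHOLD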


> it is possible that crawler visits the same page several times
> during one crawl. Is this true and why is it not prevented?

Yes, there are still URL-level duplicates because the fetcher follows
redirects without verifying whether the redirect target was already fetched
or is expected to be fetched in this crawl.

The share of URL-level duplicates has been steadily below 1% in recent crawls.
Usually, there are 1.5 - 4% content-level duplicates (with different URLs).
So the priority should be on those, and even more so on near-duplicates.

> it does not seem like a difficult or resource consuming tweak to any
> distributed architecture to prevent double crawl

Well, eliminating URL-level duplicates entirely would require a shared
data structure or service which is initially filled with 3-4 billion
URLs and then checked and continuously updated for 300 million redirect targets.
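
For a sense of why this is non-trivial: one common way to approximate such a check (not necessarily what Common Crawl would use) is a Bloom filter, which trades a small false-positive rate for a compact, shareable bit set. A minimal sketch:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter sketch (illustrative only).

        Sized for ~4 billion URLs at ~10 bits per URL it would still need
        roughly 5 GB of bits, plus a shared service so that all fetcher
        processes can query and update it for redirect targets.
        """

        def __init__(self, num_bits, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, url):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, url):
            return all((self.bits[pos // 8] >> (pos % 8)) & 1
                       for pos in self._positions(url))

    seen = BloomFilter(num_bits=10_000_000)   # toy size for illustration
    seen.add("https://example.com/")
    print("https://example.com/" in seen)     # True (with a small false-positive risk)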

> and this can make a difference in terms of the proportion of the web crawled ,
> e.g. 10% v.s. 30 %).

Agreed. And URL-level duplicates were definitely a problem in the past, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize


> 2) CC seems to be a distributed multi threaded crawler. Are the threads independent or do
> they access each other's data? How many threads (or CPUs) are there and how much time it takes
> to do the crawl?

Apache Nutch (https://nutch.apache.org/) is a multi-process and multi-threaded crawler.
The crawling is done in 100 segments, one after the other. Each segment is fetched
in one Hadoop job by 40 parallel tasks (running as separate processes), each with
160 parallel fetcher threads plus threads handling the queues. The threads do not
communicate with each other, but the queues are shared between the threads of one task.

There is a chapter about Nutch in Tom White's book "Hadoop: The Definitive Guide"
(1st, 2nd or 3rd edition) which explains the principles of how the work is distributed
while ensuring politeness and the like.
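
As a rough illustration of the work distribution (an assumed host-based partitioning; the actual Nutch partitioner is configurable and not quoted here): URLs can be assigned to tasks by hashing their host, so that per-host politeness only has to be enforced inside a single task.

    import hashlib
    from urllib.parse import urlparse

    def task_for_url(url, num_tasks=40):
        """Assign a URL to one of `num_tasks` fetch tasks by hashing its host.

        Keeping all URLs of a host in one task means per-host politeness
        (crawl delays, open connections) can be enforced locally, without
        coordination between tasks. Illustrative sketch, not Nutch code.
        """
        host = urlparse(url).hostname or ""
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_tasks

    print(task_for_url("https://example.com/some/page"))   # task id in [0, 40)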


> What is the bandwidth consumed?

Actually, the uncompressed payload (HTML pages, mostly) of a monthly crawl
with 2.5 - 3 billion pages is around 250 - 300 TiB. Requesting that many pages,
plus 20-30% redirects and 404s on top of the successfully fetched pages, amounts to
  100 - 120 TiB ingress traffic
  6 - 8 TiB egress traffic
including the overhead of the protocol layers. In turn, ingress traffic profits
from protocol-level compression. Note that these numbers do not include cluster-
internal traffic.
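
A quick back-of-envelope check of these figures (my own arithmetic, using the midpoints of the ranges above):

    pages = 3.0e9                    # upper end of 2.5 - 3 billion fetched pages
    requests = pages * 1.25          # plus ~20-30% redirects and 404s
    ingress_bytes = 110 * 2**40      # midpoint of 100 - 120 TiB
    avg_per_request = ingress_bytes / requests / 2**10
    print(f"~{avg_per_request:.0f} KiB on the wire per request")   # roughly 31 KiB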

> But it is hard to estimate your CPU power.

Currently, the fetching is done in about 13 days (100 segments, one per ~3 hours)
using a cluster of 16-20 EC2 r*.xlarge instances (32 GB RAM, 4 vCPUs).
This includes the packaging of the content into WARC files, which is CPU-intensive
because of checksum calculation, WARC compression, and the detection of MIME type
and content language.
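
Putting those numbers together (again my own arithmetic, assuming ~3 billion pages spread evenly over the 100 segments):

    pages = 3.0e9
    segments = 100
    segment_seconds = 3 * 3600
    tasks, threads_per_task = 40, 160

    pages_per_second = pages / segments / segment_seconds
    per_thread = pages_per_second / (tasks * threads_per_task)
    print(f"~{pages_per_second:.0f} pages/s per segment, "
          f"~{per_thread:.2f} pages/s per fetcher thread")
    # ~2800 pages/s overall, ~0.43 pages/s per thread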

Pre- and post-processing together require a similar amount of CPU time
as the fetching alone. Usually, compute-optimized instances are added
to speed up the processing.


Best,
Sebastian



Xavier Raj

Jul 19, 2021, 11:45:51 AM
to common...@googlegroups.com
Thanks, Sebastian, for the useful information, and thanks to @stan....@gmail.com for the questions to learn from.

  _______________________

  With best regards

  Xavier Raj. A

  eGrove Systems Corporation


Stan Srednyak

Jul 21, 2021, 10:56:41 AM
to common...@googlegroups.com
hi Sebastian,

thanks for the detailed answers. Based on them and also on Tom White's book and Nutch's documentation, I have some further questions:


Summarizing your architecture, for each "segment" you have:

40 independent processes, each with 160 fetcher threads.

Threads in a process share fetchlists. Processes have non-intersecting fetchlists.

You also have a separate MapReduce job that forms the non-intersecting fetchlists from newly harvested outlinks.


1) Which URLs end up in the same segment?

Is a "segment" actually a URL-name segment (e.g. aa-bz, ca-de, df-ee, ...) or something like a hashed-URL-name segment?

Are data structures such as fetchlists and Harmonic Centrality (HC) computed independently for each "segment" or for the whole crawl?

2) Do you start each new crawl from scratch and initialize all data structures as the crawl proceeds?

I assume so.

Then you need to construct Harmonic Centrality dynamically, i.e. you need to keep updating it. But this is a costly operation; you cannot do it too often. So how do you get around this? And how do you define host-level Harmonic Centrality?


2.1) Could you please write down the formula that you use for the probability distribution?

I think many people would benefit greatly if you explained this formula. I have already seen academic papers devoted to questions like "Is Common Crawl sampling the web fairly?". I believe that if you explained the formula, such studies would not have to be published or even conducted.

This is a tricky issue because the probability distribution is time-dependent. It depends on the state of the crawl (how much has already been crawled), on the visible part of the web that has already been visited, and perhaps on the details of how you split your crawl into your "segments" (if the latter have separate data structures).


3) From your answer, it seems that the only source of URL-level duplicates is redirects. Is this correct?

This would mean that it is part of the fetchlist-forming jobs to ensure that the fetchlists for each process do not intersect and only contain URLs that have not been crawled. Do you maintain a host-level list of visited URLs? If so, what is the data structure (in Hadoop terminology) that contains this information? What about the inlinks data structure?

4) I just consulted the price list at AWS. Each r*.xlarge instance is ~$0.2/hour, and it has ~1 GB/s bandwidth.

Then the cost of your cluster of 20 r*.xlarge instances, for a 10-day crawl, is

~20 * 10 days * 24 h * $0.2/h ≈ $1000

Does this mean that your monthly crawl costs you ~$1000?

I did not add storage costs here, because I do not understand how to estimate them. AWS seems to price storage and IOPS separately. Could you please explain this part?
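
A quick sketch of this estimate (the $0.2/hour price and the exclusion of storage, transfer and the extra processing instances are my assumptions), also redone with 24 h/day and the 13-day figure you mentioned:

    instances = 20
    price_per_hour = 0.20            # assumed r*.xlarge on-demand price

    for days in (10, 13):
        cost = instances * days * 24 * price_per_hour
        print(f"{days}-day fetch on {instances} instances: ~${cost:.0f}")
    # 10 days -> ~$960, 13 days -> ~$1248 (compute only, no storage/IOPS/egress)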

4.1) Does this mean that the crawl time is CPU-dominated? I estimated in my previous email that for 1 PB you need ~1 GB/s, while you have 20 instances each with ~1 GB/s, and you crawl "only" ~120 TB. This suggests that most of these 3 hours per segment are spent on processing the data rather than waiting for it to arrive.
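
A rough check of this point (my own arithmetic from the 100-120 TiB ingress and 13-day figures you gave):

    ingress_bytes = 120 * 2**40      # upper end of the quoted ingress
    crawl_seconds = 13 * 86400
    instances = 20

    aggregate = ingress_bytes / crawl_seconds          # bytes/s for the cluster
    per_instance = aggregate / instances
    print(f"~{aggregate/1e6:.0f} MB/s aggregate, ~{per_instance/1e6:.1f} MB/s per instance")
    # ~117 MB/s aggregate, ~5.9 MB/s per instance -- far below the ~1 GB/s link speed,
    # which supports the idea that the crawl is CPU- rather than bandwidth-bound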

4.2) Is it possible to further split the CPU time into an "in-memory" part and a part responsible for "writing to disk"? I.e., to separate intrinsic page-processing time from "Hadoop operations" time? Is it true that the bulk of the CPU time goes into maintaining the Hadoop data structures?

You also mentioned the in-cluster traffic. How large is that?



best regards,
Stan
