Questions about using Common Crawl for another Hugging Face project

Tristan Thrush

Sep 13, 2022, 3:51:22 PM
to Common Crawl
Hey Sebastian,

I'm working on a spinoff of Hugging Face's BigScience/BLOOM project! The idea behind this new project is to train a language model continuously, updating it as new Common Crawl snapshots come out. If you have a moment, could you answer some questions that we have? Thanks a ton for any help that you can give!

First, here is my understanding of Common Crawl:

There is a huge list of URLs somewhere. URLs are constantly being added/removed. When Common Crawl wants to do another monthly crawl, it samples URLs like this:

1. some of the URLs for crawling are randomly sampled from this big list
2. some of the URLs for crawling are always sampled for every crawl
3. some of the URLs for crawling are chosen heuristically.

This means that some webpages always show up in a monthly crawl (e.g. if “https://en.wikipedia.org/wiki/Barack_Obama” is in the list of URLs that are always used, then it will definitely be in the next month’s crawl just as it was in last month’s crawl). But some URLs in the latest month’s crawl will never have been sampled before, and some will have last been sampled a long time ago, e.g. two years ago.

Now, here are some questions:

1. I’m wondering if my understanding is correct?
2. I’m wondering how you decide which URLs to always use / use more often, and which URLs to randomly sample from the big URL list?
3. I’m wondering how URLs are added and removed from the big URL list?
4. I’m wondering whether the content is totally randomized across Common Crawl segments? So, for example, would this file contain a uniformly random sampling of all of the Common Crawl content from all of the different segments? crawl-data/CC-MAIN-2022-33/segments/1659882573197.34/wet/CC-MAIN-20220818124424-20220818154424-00658.warc.wet.gz

Sebastian Nagel

Sep 16, 2022, 7:58:58 AM
to common...@googlegroups.com
Hi Tristan,

> There is a huge list of URLs somewhere. URLs are constantly being
> added/removed.

- additions are from
1. regular crawls (the preceding monthly/main crawl)
2. a small side crawl run immediately before a main crawl
3. sitemaps
- URLs not observed during the last 13 months are removed

For the additions, a random factor is used in combination with ranking
signals. There is also a limit on the number of URLs per host and/or
domain sampled into the "huge list of URLs" (aka the CrawlDb):
- it does not make sense to keep many URLs from a larger site because
we will never be able to fetch all of them
- the limit is assigned individually to every host/domain based
on its harmonic centrality rank

The CrawlDb includes the ranking signal as a score, the timestamps of
when a URL was last fetched and when it was last seen as a link, the
fetch status (HTTP 200, 404, redirect, etc.), and the MIME type of the
page/document when it was fetched.
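
As a rough sketch of what such a record and the 13-month cleanup
could look like: the field names, the 390-day approximation of
"13 months", and the reading of "observed" as "fetched or seen as a
link" are all assumptions made for illustration, not the actual
CrawlDb schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


# Hypothetical record layout; the real CrawlDb schema may differ.
@dataclass
class CrawlDbEntry:
    url: str
    score: float                           # ranking signal
    last_fetched: Optional[datetime]       # None if never fetched
    last_seen_as_link: Optional[datetime]  # None if never seen as a link
    fetch_status: Optional[int]            # e.g. 200, 404, 301
    mime_type: Optional[str]               # MIME type when fetched


def last_observed(entry: CrawlDbEntry) -> Optional[datetime]:
    """Most recent time the URL was fetched or seen as a link (assumption)."""
    times = [t for t in (entry.last_fetched, entry.last_seen_as_link) if t is not None]
    return max(times) if times else None


def prune(entries: List[CrawlDbEntry], now: datetime,
          max_age_days: int = 13 * 30) -> List[CrawlDbEntry]:
    """Keep only entries observed within roughly the last 13 months."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for entry in entries:
        seen = last_observed(entry)
        if seen is not None and seen >= cutoff:
            kept.append(entry)
    return kept
```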

> 1. some of the URLs for crawling are randomly sampled from this big
> list
> 2. some of the URLs for crawling are always sampled for every crawl
> 3. some of the URLs for crawling are chosen heuristically.

Which URLs are taken from the CrawlDb into the next fetch list is,
strictly speaking, deterministic. But there is randomness involved
when "filling" the CrawlDb with URLs, and also heuristics.
Basically, 1, 2 and 3 are true. The fetch lists are created this
way:

1. calculate a generator score
- take the page score
(this is why a small number of URLs with high scores are selected
every or almost every crawl)
- add a small value for every day elapsed since the last fetch
(add more for URLs not seen before)
- subtract a small value for every day elapsed since the URL was
last seen
- subtract a configurable value if a page was found to be a 404,
a 304 (not modified), etc.

2. skip URLs when the generator score is below a certain threshold

3. for both domains and hosts there are limits on the number of URLs
selected per domain/host. The URLs are sorted by decreasing score.
If a per-domain/host limit is reached, remaining URLs of this
domain/host are skipped.

4. the URLs are distributed over segments and (per segment) over
partitions.

5. segments are fetched sequentially and all partitions of
one segment are fetched concurrently

6. after the fetch the WARC records of one segment are shuffled
and distributed by URL hash (pseudorandom distribution) over
WARC files.
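
Steps 1 - 3 can be sketched roughly as follows. All constants and
names here are hypothetical placeholders (the real values are
configurable and not given above), the "add more for URLs not seen
before" rule is modelled as a flat bonus, and a single per-host limit
stands in for the individually assigned host/domain limits.

```python
from collections import defaultdict

# Hypothetical weights, chosen only to make the sketch runnable.
PER_DAY_SINCE_FETCH = 0.01   # added per day since the last fetch
NEVER_FETCHED_BONUS = 1.0    # added instead if the URL was never fetched
PER_DAY_SINCE_SEEN = 0.005   # subtracted per day since the URL was last seen
FAILURE_PENALTY = 0.5        # subtracted for 404, 304, etc.
PENALIZED_STATUSES = {304, 404}


def generator_score(page_score, days_since_fetch, days_since_seen, last_status):
    """Step 1: combine the page score with freshness and status adjustments."""
    score = page_score
    if days_since_fetch is None:          # URL not fetched before
        score += NEVER_FETCHED_BONUS
    else:
        score += PER_DAY_SINCE_FETCH * days_since_fetch
    score -= PER_DAY_SINCE_SEEN * days_since_seen
    if last_status in PENALIZED_STATUSES:
        score -= FAILURE_PENALTY
    return score


def select_fetch_list(candidates, threshold, per_host_limit):
    """Steps 2 and 3: apply the threshold, then per-host limits by decreasing score.

    candidates: iterable of (url, host, generator_score) tuples.
    """
    selected = []
    taken_per_host = defaultdict(int)
    for url, host, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score < threshold:
            break                          # everything after this is lower still
        if taken_per_host[host] >= per_host_limit:
            continue                       # limit reached, skip remaining URLs
        taken_per_host[host] += 1
        selected.append(url)
    return selected
```

Because the selection is a pure function of the CrawlDb contents, a
URL with a high page score clears the threshold in every or almost
every crawl, which matches the behaviour described in step 1.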

I hope this answers questions 1 - 3.

> 4. I’m wondering whether the content is totally randomized across
> common crawl segments? So, for example, would this file contain a
> uniformly random sampling of all of the common crawl content from all
> of the different segments?

No, it isn't. When distributing URLs over the 100 segments:
- up to 100-200 URLs from the same host are always kept together
because fetching even a single page from one host requires a
DNS lookup and requesting and parsing robots.txt. As a result,
hosts with few pages end up in only a few segments, but these
segments are selected more or less randomly (there are some
other constraints, e.g. the max size of a segment).
- because of the shuffle (step 6 above), URLs from one host
are distributed pseudorandomly over the WARC/WAT/WET files
of a segment.
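
To illustrate the pseudorandom distribution by URL hash: the sketch
below uses SHA-1 and a simple modulo, which are assumptions made for
illustration and not necessarily the hash or partitioning scheme the
crawler actually uses.

```python
import hashlib


def file_index(url, num_files):
    """Map a URL to one of num_files output files via a stable hash."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_files


def distribute(urls, num_files):
    """Spread the URLs of one segment over num_files buckets."""
    files = [[] for _ in range(num_files)]
    for url in urls:
        files[file_index(url, num_files)].append(url)
    return files
```

Pages of the same host then land in different files of the segment,
while the assignment of any single URL stays reproducible.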


The short answer: if you want a random sample, I'd also randomly
select WARC/WAT/WET files.
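
A minimal sketch of that sampling, assuming you have downloaded the
per-crawl file listing (e.g. wet.paths.gz) to a local file; the
function names and the fixed seed are illustrative choices.

```python
import gzip
import random


def read_paths(listing_file):
    """Read one file path per line from a gzip-compressed listing."""
    with gzip.open(listing_file, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def sample_files(paths, k, seed=0):
    """Pick k distinct files uniformly at random across all segments."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(paths, min(k, len(paths)))
```

Sampling over the full listing, rather than taking all files of one
segment, avoids the per-host clustering within segments described
above.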


Best,
Sebastian

Tristan Thrush

Sep 16, 2022, 7:53:54 PM
to common...@googlegroups.com
Thanks, Sebastian, for such a detailed response. Super helpful - my understanding of Common Crawl is a lot better now.

Tristan
