Hi Tristan,
> There is a huge list of URLs somewhere. URLs are constantly being
> added/removed.
- additions are from
1. regular crawls (the preceding monthly/main crawl)
2. a small side crawl run immediately before a main crawl
3. sitemaps
- URLs not observed during the last 13 months are removed
For the additions, a random factor is used in combination with ranking
signals. There is also a limit on the number of URLs per host and/or
domain sampled into the "huge list of URLs" (aka. CrawlDb)
- it does not make sense to keep many URLs from a larger site because
we will never be able to fetch all of them
- the limit is assigned individually to every host/domain based
on its harmonic centrality rank (a rough sketch follows below)
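To make that concrete, here is a minimal Python sketch of how a
per-host URL budget could be derived from the harmonic centrality
rank. The function name, the constants and the logarithmic decay are
my own assumptions for illustration; the actual mapping is not
spelled out above.

    import math

    def host_url_limit(rank, max_limit=100_000, min_limit=100):
        # Hypothetical mapping: the best-ranked host (rank 1) gets the
        # largest budget of URLs in the CrawlDb; the budget decays
        # logarithmically with rank and never drops below min_limit.
        limit = int(max_limit / math.log2(rank + 2))
        return max(min_limit, limit)

    # e.g. host_url_limit(1) -> 63092, host_url_limit(10**6) -> 5017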
The CrawlDb stores the ranking signal as a score, along with the
timestamps when a URL was last fetched and when it was last seen as
a link, whether it was fetched successfully (HTTP 200, 404, redirect,
etc.), and the MIME type of the page/document when it was fetched.
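As a rough illustration (the field names are mine, not the actual
schema), you can think of a CrawlDb entry as something like:

    from dataclasses import dataclass

    @dataclass
    class CrawlDbEntry:
        url: str
        score: float          # ranking signal
        last_fetch_time: int  # epoch seconds of latest fetch, 0 if never fetched
        last_seen_time: int   # epoch seconds the URL was last seen as a link
        fetch_status: str     # e.g. "200", "404", "redirect", "not_fetched"
        mime_type: str        # MIME type of the page/document when fetched, if any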
> 1. some of the URLs for crawling are randomly sampled from this big
> list
> 2. some of the URLs for crawling are always sampled for every crawl
> 3. some of the URLs for crawling are chosen heuristically.
Which URLs are taken from the CrawlDb into the next fetch list is,
strictly speaking, deterministic. But randomness is involved when
"filling" the CrawlDb with URLs, and heuristics as well...
Basically, 1, 2 and 3 are true. The fetch lists are created this
way:
1. calculate a generator score
- take the page score
(this is why a small number of URLs with high scores are selected
in every or almost every crawl)
- add a small value for every day elapsed since the last fetch
(add more for URLs not seen before)
- subtract a small value for every day elapsed since the URL was
last seen as a link
- subtract a configurable value if a page was found to be a 404,
a 304 (not modified), etc.
2. skip URLs when the generator score is below a certain threshold
3. for both domains and hosts there are limits on the number of URLs
selected per domain/host. The URLs are sorted by decreasing score;
once a per-domain/host limit is reached, the remaining URLs of that
domain/host are skipped. (A rough sketch of steps 1-3 follows after
this list.)
4. the URLs are distributed over segments and (per segment) over
partitions.
5. segments are fetched sequentially and all partitions of
one segment are fetched concurrently
6. after the fetch, the WARC records of one segment are shuffled
and distributed by URL hash (a pseudorandom distribution) over
WARC files.
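If it helps, here is a condensed Python sketch of steps 1-3. All
constants (per-day increments, penalties, threshold, per-host limit)
are made-up placeholders for what are configurable values, and it
reuses the hypothetical CrawlDbEntry fields sketched above.

    from collections import defaultdict
    from urllib.parse import urlsplit

    DAY = 86400  # seconds

    def generator_score(e, now):
        # step 1: page score plus/minus time- and status-based adjustments
        score = e.score
        if e.last_fetch_time == 0:
            score += 2.0  # never fetched before: extra boost
        else:
            score += 0.1 * ((now - e.last_fetch_time) // DAY)
        score -= 0.05 * ((now - e.last_seen_time) // DAY)
        if e.fetch_status in ("404", "304"):
            score -= 1.0  # penalty for 404, 304 (not modified), etc.
        return score

    def generate_fetch_list(entries, now, threshold=0.0, per_host_limit=1000):
        # step 2: skip URLs whose generator score is below the threshold
        scored = [(generator_score(e, now), e) for e in entries]
        scored = [(s, e) for (s, e) in scored if s >= threshold]
        # step 3: sort by decreasing score, cap the number of URLs per host
        scored.sort(key=lambda se: se[0], reverse=True)
        per_host = defaultdict(int)
        fetch_list = []
        for s, e in scored:
            host = urlsplit(e.url).hostname
            if per_host[host] >= per_host_limit:
                continue  # limit reached: remaining URLs of this host are skipped
            per_host[host] += 1
            fetch_list.append(e.url)
        return fetch_list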
I hope this answers questions 1 - 3.
> 4. I’m wondering whether the content is totally randomized across
> common crawl segments? So, for example, would this file contain a
> uniformly random sampling of all of the common crawl content from all
> of the different segments?
No, it isn't. When distributing URLs over the 100 segments:
- up to 100-200 URLs from the same host are always kept together,
because fetching even a single page from a host requires a DNS
lookup and fetching and parsing its robots.txt. As a result, hosts
with few pages are concentrated in only a few, but more or less
randomly selected, segments (there are some other constraints,
e.g. the maximum size of a segment).
- because of the shuffle (step 6 above) URLs from one host
are distributed pseudorandomly over WARC/WAT/WET files.
The short answer: if you want a random sample, I'd also randomly
select WARC/WAT/WET files.
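For example (a minimal sketch, assuming you have downloaded the
warc.paths listing of one crawl to a local file; adjust the file
name and sample size as needed):

    import random

    # warc.paths: one WARC file path per line, as published per crawl
    with open("warc.paths") as f:
        paths = [line.strip() for line in f if line.strip()]

    for path in random.sample(paths, k=100):  # 100 randomly chosen WARC files
        print(path)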
Best,
Sebastian