URL seeds?

Tim Allison

Oct 4, 2024, 12:13:22 PM
to Common Crawl
Hi All,
  Is there any documentation on how the crawl is seeded nowadays? Thank you.

      Best,

            Tim

Sebastian Nagel

Oct 5, 2024, 3:20:15 AM
to common...@googlegroups.com
Hi Tim,

This question has been discussed a couple of times in this group;
see the links [1-5] below. There are also two sets of slides [6,7]
where the seed sampling is explained. Apart from minor changes
in weight factors and thresholds, the approach is still the
same nowadays.

The very short answer: it's a stratified sample with registered
domains as the strata, and each domain's harmonic centrality
rank defines the sample size for that domain.
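
To make the idea concrete, here is a minimal sketch of stratified
sampling by registered domain. It is not Common Crawl's actual code:
the function names, the budget formula (max_urls // rank) and its
constants are assumptions made up for illustration, and the input is
assumed to be already grouped by registered domain (in practice that
step needs the Public Suffix List). Unranked domains are simply
skipped here, which is also an assumption.

```python
import random
from collections import defaultdict


def per_domain_budget(rank, max_urls=1000, min_urls=1):
    """Hypothetical budget: a better (lower) rank gets more URLs."""
    return max(min_urls, max_urls // rank)


def sample_seeds(candidate_urls, domain_rank, rng=None):
    """Draw a stratified sample of seed URLs.

    candidate_urls: iterable of (registered_domain, url) pairs
    domain_rank:    dict mapping registered domain -> harmonic
                    centrality rank (1 = most central)
    """
    rng = rng or random.Random(0)

    # Each registered domain is one stratum.
    by_domain = defaultdict(list)
    for domain, url in candidate_urls:
        by_domain[domain].append(url)

    seeds = []
    for domain, urls in by_domain.items():
        rank = domain_rank.get(domain)
        if rank is None:
            continue  # unranked domain: skipped in this sketch
        # The rank decides how many URLs this stratum contributes.
        k = min(len(urls), per_domain_budget(rank))
        seeds.extend(rng.sample(urls, k))
    return seeds
```

With these made-up constants, a domain at rank 1 could contribute up
to 1000 URLs, one at rank 500 only 2, and anything ranked below 1000
just a single URL.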

Best,
Sebastian

[1] https://groups.google.com/g/common-crawl/c/IxNvNZnV9fg/m/5P_BoM8LBAAJ
[2] https://groups.google.com/g/common-crawl/c/sNe1nsUFawg/m/vnnK2Sh3BwAJ
[3] https://groups.google.com/g/common-crawl/c/AmsXrCNVBzo/m/dXN8rtcVDwAJ
[4] https://groups.google.com/g/common-crawl/c/Kxsdz094UCI/m/9lmfQlE4BQAJ
[5] https://groups.google.com/g/common-crawl/c/XjLb_K0r5gI/m/WoP08slnEAAJ
[6] https://indico.cern.ch/event/1006978/contributions/4539477/attachments/2325769/3962907/ossym2021-sn-web-graphs-crawling.pdf
[7] http://nlpl.eu/skeikampen23/nagel.230206.pdf

Tim Allison

Nov 8, 2024, 2:24:53 PM
to Common Crawl
So very helpful. Thank you! (apologies for my delay!)