I have a lot of past experience with Common Crawl and am currently doing this by parsing WET files downloaded from the Common Crawl data sets, but that is not very efficient, as most of the URLs are subpages rather than root domains, and it requires downloading a lot of data (the subpages) that is not needed.
to common...@googlegroups.com
Hi Sam,
there's an example solution using the columnar index [1]:
- perform a table join with the domain list
- filter by a URL path pattern matching the root page
You get the WARC file name and record offsets, which
allow you to fetch the WARC records. See [2,3] for examples
of how to do this at scale.
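A minimal sketch of such a single-record fetch (one possible approach
using the requests and warcio packages; the file name, offset and
length below are placeholders for values returned by the index query):

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholders: take these from one row of the columnar index result
    # (warc_filename, warc_record_offset, warc_record_length).
    warc_filename = "crawl-data/CC-MAIN-2022-05/segments/<segment>/warc/<file>.warc.gz"
    offset, length = 12345678, 54321

    # Each WARC record is an independent gzip member, so a byte-range
    # request returns a self-contained, decompressible record.
    url = "https://data.commoncrawl.org/" + warc_filename
    byte_range = "bytes=%d-%d" % (offset, offset + length - 1)
    resp = requests.get(url, headers={"Range": byte_range}, stream=True)

    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            html = record.content_stream().read()
            print(record.rec_headers.get_header("WARC-Target-URI"), len(html))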
Notes:
- the webgraph includes domains which are not crawled
- you could just use the index table and pick only
one record per domain (or host name)
- optionally, and in order to handle the case where the
root page is not contained in a crawl, pick only the
page with the shortest URL path. This could be done using
SQL window functions ("OVER"); a sketch follows below.
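A rough sketch of what such a query could look like (it assumes the
ccindex table set up as in [1]; the my_domains table, database names,
crawl label and S3 output location are hypothetical examples):

    import boto3

    # Pick, per registered domain, the successfully fetched page with the
    # shortest URL path, using a window function as described above.
    query = """
    WITH ranked AS (
      SELECT cc.url,
             cc.url_host_registered_domain,
             cc.warc_filename,
             cc.warc_record_offset,
             cc.warc_record_length,
             ROW_NUMBER() OVER (
               PARTITION BY cc.url_host_registered_domain
               ORDER BY length(cc.url_path), length(cc.url)
             ) AS rn
      FROM "ccindex"."ccindex" AS cc
      JOIN "mydb"."my_domains" AS d
        ON cc.url_host_registered_domain = d.domain
      WHERE cc.crawl = 'CC-MAIN-2022-05'
        AND cc.subset = 'warc'
    )
    SELECT * FROM ranked WHERE rn = 1
    """

    athena = boto3.client("athena", region_name="us-east-1")
    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "ccindex"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/root-pages/"},
    )
    print("Athena query execution id:", resp["QueryExecutionId"])

The result rows carry warc_filename, warc_record_offset and
warc_record_length, i.e. exactly what the range-request fetch above needs.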
to Common Crawl
Hi Sebastian,
thanks a lot for your feedback. Between my post and yours, I had already thought of the columnar index as the best path (I have used it in the past for another task, downloading the columnar index files and then finding all URLs containing "/pricing", and it was excellent in that regard).
"- perform a table join with the domain list
- filter by a URL path pattern matching the root page"
This looks to be faster.
"- the webgraph includes domains which are not crawled"
Thanks for the clarification. I suspected this a bit, in the sense that if the CC crawler encounters a page like "example.com/pricing", then "example.com" is added to the webgraph even though the root domain page (example.com/) is not actually present in the collection of crawled WET pages (only example.com/pricing is).
Do you perhaps have an estimate of the overlap, i.e. if we process the (I think) roughly 90 million domains from the webgraph, for how many of those is there no respective WET file (for the root domain) in the Common Crawl data sets?
Thanks again for the help on the original question.
Best regards
Sebastian Nagel
May 9, 2022, 10:37:18 AM
to common...@googlegroups.com
Hi Sam,
> Do you perhaps have an estimate of the overlap, i.e. if we process the
> (I think) roughly 90 million domains from the webgraph, for how many of
> those is there no respective WET file (for the root domain) in the
> Common Crawl data sets?
In a single main crawl there are currently 35 million domains having
at least one successfully fetched page - this is not necessarily the
root page. There is a chance to get a higher coverage (both for domains
and root pages) if multiple crawls are processed. Also, the "robotstxt"
and "crawldiagnostics" (404s, redirects, etc.) subsets include domains
that otherwise have no successfully fetched pages. But there are still
domains that are not crawled at all and are only known via a link.
I'd expect that 50% coverage should be reachable if the criteria used
to filter the root page aren't too restrictive.
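A rough sketch of how the coverage for a given domain list could be
estimated (reusing the hypothetical my_domains table and ccindex setup
from the sketch above; the crawl labels are examples only):

    # Submit with athena.start_query_execution() as in the earlier sketch.
    # Domains where any_records stays 0 are only known from the webgraph.
    coverage_query = """
    SELECT d.domain,
           count(cc.url)                AS any_records,
           count_if(cc.subset = 'warc') AS fetched_pages,
           count_if(cc.url_path = '/')  AS root_pages
    FROM "mydb"."my_domains" AS d
    LEFT JOIN "ccindex"."ccindex" AS cc
      ON  cc.url_host_registered_domain = d.domain
      AND cc.crawl  IN ('CC-MAIN-2022-05', 'CC-MAIN-2022-21')
      AND cc.subset IN ('warc', 'robotstxt', 'crawldiagnostics')
    GROUP BY d.domain
    """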
to Common Crawl
Hi Sebastian,
thanks for the additional information.
Best regards
Sam
Oct 24, 2024, 1:02:01 PM
to Common Crawl
Hello,
it has been a while since I asked the question above, but I wanted to add an update in case someone faces a similar task in the future; perhaps it saves them some time. We were collecting domains in order to classify them for two of our services (general websites at https://www.websitecategorizationapi.com, based on the IAB taxonomy, and ecommerce ones at https://www.productcategorization.com). Besides Common Crawl, which helped a lot, the best sources of domains were:
- Google CrUX report (https://developer.chrome.com/docs/crux): this one is especially valuable; we got around 15 million domains from it. Use BigQuery to extract them.
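A minimal sketch of such a BigQuery extraction (assuming the
google-cloud-bigquery Python client and the monthly
chrome-ux-report.all tables; the 202404 table below is only an
example month):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    # Each monthly CrUX table carries an "origin" column such as
    # "https://example.com"; DISTINCT gives one row per origin.
    query = """
        SELECT DISTINCT origin
        FROM `chrome-ux-report.all.202404`
    """

    for row in client.query(query).result():
        # Reduce the origin to a bare host name, e.g. "example.com".
        host = row.origin.split("://", 1)[-1]
        print(host)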
An important task was actually checking whether the domains were still active; for a 1-million-domain list like the above and the Google CrUX domains this was mostly true, but not for domains from the much wider sources that we obtained.
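A minimal sketch of such a liveness check (one possible approach, using
DNS resolution followed by an HTTP request; not necessarily the exact
method used here):

    import socket
    import requests

    def domain_seems_active(domain: str, timeout: float = 5.0) -> bool:
        """Cheap two-step check: does the name resolve, and does it answer HTTP?"""
        try:
            socket.gethostbyname(domain)   # DNS: does the name still resolve?
        except OSError:
            return False
        for scheme in ("https", "http"):
            try:
                # Some servers reject HEAD; a GET fallback could be added.
                resp = requests.head(f"{scheme}://{domain}/",
                                     timeout=timeout, allow_redirects=True)
                if resp.status_code < 500:
                    return True
            except requests.RequestException:
                pass
        return False

    print(domain_seems_active("example.com"))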
In the end we managed to collect 31 million active domains and classified them for our offline database. In total we checked over 400 million domains, but it turns out most of them are not active; most are expired.
We were actually surprised that only a minority of the domains ever registered are still active.
Anyhow, I wanted to share this in case someone runs into a similar task in the future. Feel free to send me questions if you run into any difficulties with the sources above.