
Efficient way to retrieve text of root domains from common crawl


Sam

May 6, 2022, 10:11:55 AM
to Common Crawl
Hi,

I am working on a data science project and am wondering what is the quickest way to obtain webpage text for a given domain name.

I am only interested in the text of the root domain page, e.g. www.example.com, and not of subpages like www.example.com/page1.html and similar.

The overall goal is to fetch the text of each domain listed here:
https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/domain/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz

I have a lot of past experience with Common Crawl and am currently doing this by parsing WET files downloaded from the Common Crawl data sets, but that is not very efficient, as most of the URLs are subpages rather than root pages, and it requires downloading a lot of data (the subpages) that is not needed.

I tried other approaches like using https://github.com/lxucs/cdx-index-client, but it is very slow.

Hope someone can help me on this.

Thanks.

Sebastian Nagel

May 7, 2022, 10:59:26 AM
to common...@googlegroups.com
Hi Sam,

There's an example solution using the columnar index [1] (a sketch also follows below):
- perform a table join with the domain list
- filter by a URL path pattern matching the root page

This gives you the WARC file name and record offsets, which allow
you to fetch the WARC records. See [2,3] for examples of how to do
this at scale.

Notes:
- the webgraph includes domains which are not crawled
- you could just use the index table and pick only
one record per domain (or host name)
- optionally, and in order to handle cases where the root page is
  not contained in a crawl, pick only the page with the shortest
  URL path per domain. This could be done using SQL window
  functions ("OVER"); a sketch follows below.
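
Continuing the sketch above (and reusing the assumed "ccindex" and "domain_list" tables), the window-function variant could look roughly like this:

shortest = spark.sql("""
    SELECT url, url_host_registered_domain,
           warc_filename, warc_record_offset, warc_record_length
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (
                   PARTITION BY c.url_host_registered_domain
                   ORDER BY length(c.url_path), length(c.url)
               ) AS rn
        FROM ccindex c
        JOIN domain_list d
          ON c.url_host_registered_domain = d.domain
        WHERE c.crawl = 'CC-MAIN-2022-05'
          AND c.subset = 'warc'
          AND c.fetch_status = 200
    ) t
    WHERE t.rn = 1   -- keep the record with the shortest path per domain
""")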

Best,
Sebastian

[1]
https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/get-records-home-pages.sql
[2]
https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives
[3]
https://github.com/commoncrawl/cc-pyspark/blob/main/cc_index_word_count.py

Sam

May 9, 2022, 8:13:00 AM
to Common Crawl
Hi Sebastian,

thanks a lot for your feedback. Between my post and your reply, I had already come to see the columnar index as the best path (I have used it in the past for another task, downloading the index files and then finding all URLs containing "/pricing", and it was excellent for that).

My plan was to download all of the index files from https://data.commoncrawl.org/cc-index/collections/index.html and then parse through them to get the WARC file names and offsets, roughly as sketched below.
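
Something like this, as a sketch (it assumes the cdx-*.gz files are already downloaded locally; the field names are the ones used in the JSON part of the index lines):

import gzip
import json

def root_page_entries(cdx_path):
    # each line is "<SURT key> <timestamp> <JSON blob>"
    with gzip.open(cdx_path, "rt", encoding="utf-8") as f:
        for line in f:
            surt_key, _ts, payload = line.split(" ", 2)
            entry = json.loads(payload)
            # keep successfully fetched root pages only
            if surt_key.endswith(")/") and entry.get("status") == "200":
                yield (entry["url"], entry["filename"],
                       int(entry["offset"]), int(entry["length"]))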

But your suggestion:
###
- perform a table join with the domain list
- filter by a URL path pattern matching the root page
###
looks to be faster.

" - the webgraph includes domains which are not crawled"
Thanks for the clarification. I suspected something like this: if the CC crawler encounters a page like "example.com/pricing", then "example.com" is added to the webgraph even though the root page (example.com/) is not actually present in the collection of crawled WET pages (only example.com/pricing is).

Do you perhaps have an estimate of the overlap, i.e. of the roughly 90 million domains in the webgraph, for how many of them is there no corresponding WET record (for the root page) in the Common Crawl data sets?

Thanks again for the help on the original question.

Best regards

Sebastian Nagel

May 9, 2022, 10:37:18 AM
to common...@googlegroups.com
Hi Sam,

> Do you perhaps have an estimate of the overlap, i.e. if we parse for I
> think overall 90 million domains from the webgraph, for how many of
> those there is no respective wet file (for root domain) in common
> crawl data sets?

In a single main crawl there are currently 35 million domains having
at least one successfully fetched page - this is not necessarily the
root page. There is a chance to get a higher coverage (both for domains
and root pages) if multiple crawls are processed. Also the "robotstxt"
and "crawldiagnostics" (404, redirects, etc.) subsets include domains
without successfully fetched pages otherwise. But there are still
domains not crawled at all, only known by a link.

I'd expect that 50% coverage should be reachable if the criteria
used to filter for the root page aren't too restrictive.

Best,
Sebastian

Sam

May 11, 2022, 4:05:02 PM
to Common Crawl
Hi Sebastian,

thanks for additional information.

Best regards

Sam

Oct 24, 2024, 1:02:01 PM
to Common Crawl
Hello, 

It has been a while since I asked the question above, but I wanted to add an update in case someone faces a similar task in the future; perhaps it saves them some time. We were collecting domains in order to classify them for two of our services (general websites at https://www.websitecategorizationapi.com, based on the IAB taxonomy, and e-commerce sites at https://www.productcategorization.com). Besides Common Crawl, which helped a lot, the best sources of domains were:
- the Tranco list (https://tranco-list.eu/)
- the Google CrUX report (Chrome UX Report, https://developer.chrome.com/docs/crux); this one is especially valuable, we got around 15 million domains from it. Use BigQuery to extract the origins (a rough query sketch follows below).
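
A rough sketch of the BigQuery part (the month table "202210" and the output file name are just examples; check the public chrome-ux-report dataset for the snapshots you need):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DISTINCT origin
    FROM `chrome-ux-report.all.202210`
"""
with open("crux_origins.txt", "w", encoding="utf-8") as out:
    for row in client.query(query).result():
        # origins include the scheme (e.g. "https://example.com") and
        # still need to be normalized to registered domains
        out.write(row.origin + "\n")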

An important task was actually checking whether the domains were still active. For a 1-million-domain list like the one above and for the Google CrUX domains this was mostly the case, but not for domains from the much wider sources we obtained. A simple sketch of such a liveness check follows below.
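
Only as an illustration of the kind of check involved (the function name and timeout are arbitrary; DNS resolution first, then a lightweight HTTP request):

import socket
import requests

def is_active(domain):
    # cheap first filter: does the name still resolve in DNS?
    try:
        socket.getaddrinfo(domain, 443)
    except socket.gaierror:
        return False
    # then a lightweight HTTP check
    try:
        resp = requests.head("https://" + domain + "/",
                             timeout=10, allow_redirects=True)
        return resp.status_code < 500
    except requests.RequestException:
        return False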

In the end we managed to collect 31 million active domains and classified them for our offline database. In total we checked over 400 million domains, but it turns out most of them are not active any more; most are expired.
We were actually surprised that only a minority of the domains ever registered are still active.

Anyhow, I wanted to share this in case someone runs into a similar task in the future. Feel free to send me questions if you run into any difficulties with the sources above.