Efficient way to retrieve text of root domains from common crawl

Sam

May 6, 2022, 10:11:55 AM
to Common Crawl
Hi,

I am working on a data science project and am wondering what the quickest way is to obtain the webpage text for a given domain name.

I am only interested in the text of the root page of each domain, e.g. www.example.com, and not of subpages such as www.example.com/page1.html.

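The overall goal is to fetch the text of each domain listed here:
https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2021-22-oct-nov-jan/domain/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz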

I have a lot of experience with Common Crawl and currently do this by parsing WET files downloaded from the Common Crawl data sets. That is not very efficient: most URLs are subpages rather than root pages, so it requires downloading a lot of data that is not needed.

I tried other approaches like using https://github.com/lxucs/cdx-index-client, but it is very slow.

Hope someone can help me on this.

Thanks.

Sebastian Nagel

May 7, 2022, 10:59:26 AM
to common...@googlegroups.com
Hi Sam,

there's an example solution using the columnar index [1]
- perform a table join with the domain list
- filter by a URL path pattern matching the root page
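
The two steps above can be sketched in plain Python on a toy record list. This is only an illustration and not the query in [1]: the helper names, the sample records, and the assumption that a root page has an empty or "/" path and no query string are all hypothetical.

```python
from urllib.parse import urlsplit


def is_root_page(url: str) -> bool:
    """Return True if the URL has an empty or "/" path and no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def root_page_records(records, domains):
    """Keep records whose domain is in the domain list and whose URL is a root page."""
    wanted = set(domains)  # plays the role of the table join
    return [r for r in records if r["domain"] in wanted and is_root_page(r["url"])]


# Hypothetical index rows, for illustration only.
records = [
    {"domain": "example.com", "url": "https://www.example.com/"},
    {"domain": "example.com", "url": "https://www.example.com/page1.html"},
    {"domain": "example.org", "url": "http://example.org"},
    {"domain": "other.net", "url": "https://other.net/"},
]
selected = root_page_records(records, ["example.com", "example.org"])
```

At the scale of the full domain list, the join and filter would more likely run as SQL over the columnar index, as in [1].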

You get the WARC file names and record offsets, which
allow you to fetch the WARC records. See [2,3] for examples
of how to do this at scale.
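
For a single record, the fetch could look roughly like the sketch below. It assumes the index row also gives the record length, that files are served under https://data.commoncrawl.org/, and that each record is stored as its own gzip member; the function names and the file name in the example are placeholders, not real paths.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def range_request(warc_filename: str, offset: int, length: int):
    """Build the URL and HTTP Range header covering exactly one WARC record."""
    url = BASE_URL + warc_filename
    # HTTP byte ranges are inclusive on both ends.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Download and decompress one record (performs a network request)."""
    url, headers = range_request(warc_filename, offset, length)
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


# Placeholder file name and offsets, for illustration only.
url, headers = range_request("crawl-data/path/to/file.warc.gz", 1000, 500)
```

Only `range_request` is exercised here; `fetch_record` needs a real file name, offset, and length taken from the index.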

Notes:
- the webgraph includes domains which are not crawled
- you could just use the index table and pick only
  one record per domain (or host name)
- to handle the case where the root page is not
  contained in a crawl, you could pick only the page
  with the shortest URL path.
  This could be done using SQL window functions ("OVER").

Best,
Sebastian

[1] https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/get-records-home-pages.sql
[2] https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives
[3] https://github.com/commoncrawl/cc-pyspark/blob/main/cc_index_word_count.py
> --
> You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/cb2fc0eb-19dc-4332-b8b7-97dce220289an%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/cb2fc0eb-19dc-4332-b8b7-97dce220289an%40googlegroups.com?utm_medium=email&utm_source=footer>.

Sam

May 9, 2022, 8:13:00 AM
to Common Crawl
Hi Sebastian,

thanks a lot for your feedback. Between my post and yours, I had already come to think of the columnar index as the best path. I used it in the past for another task, downloading the CI files and finding all URLs containing "/pricing", and it was excellent for that.

I thought of downloading all CI files from here: https://data.commoncrawl.org/cc-index/collections/index.html and then just parsing through them to get the files and offsets.

But your suggestion:
###
- perform a table join with the domain list
- filter by a URL path pattern matching the root page
###
looks to be faster.

" - the webgraph includes domains which are not crawled"
Thanks for the clarification. I suspected something like this: if the CC crawler encounters a page like "example.com/pricing", then "example.com" is added to the webgraph, even though the root page (example.com/) is not actually present in the collection of crawled WET files (only example.com/pricing is).

Do you perhaps have an estimate of the overlap, i.e. if we parse for I think overall 90 million domains from the webgraph, for how many of those there is no respective wet file (for root domain) in common crawl data sets?

Thanks again for the help with the original question.

Best regards

Sebastian Nagel

May 9, 2022, 10:37:18 AM
to common...@googlegroups.com
Hi Sam,

> Do you perhaps have an estimate of the overlap, i.e. if we parse for I
> think overall 90 million domains from the webgraph, for how many of
> those there is no respective wet file (for root domain) in common
> crawl data sets?

In a single main crawl there are currently 35 million domains having
at least one successfully fetched page, which is not necessarily the
root page. There is a chance to get a higher coverage (both for domains
and root pages) if multiple crawls are processed. Also, the "robotstxt"
and "crawldiagnostics" (404, redirects, etc.) subsets include domains
that otherwise have no successfully fetched pages. But there are still
domains not crawled at all, only known by a link.

I'd expect that 50% coverage should be possible to reach if the
criteria to filter the root page isn't too restrictive.

Best,
Sebastian

Sam

May 11, 2022, 4:05:02 PM
to Common Crawl
Hi Sebastian,

thanks for the additional information.

Best regards