Help writing SQL that combines different crawls

hcf

Mar 13, 2023, 7:49:40 AM3/13/23
to Common Crawl
I have successfully opened an Athena account and executed the first four queries from this tutorial:

The following query runs as expected:
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

I am having trouble modifying the query by changing the crawl to 'CC-MAIN-2018-06' or 'CC-MAIN-2019-05', as I'm only getting empty results. Can someone assist me in troubleshooting this issue?

My goal is to run the aforementioned query while combining multiple crawls. Specifically, I would like to merge the 41 shards of monthly CommonCrawl from 2016 to 2019 mentioned in the GPT3 article by Brown et al. (2020). Can you provide guidance on how to accomplish this?

Best regards

hcf

Sebastian Nagel

Mar 13, 2023, 10:49:06 AM3/13/23
to common...@googlegroups.com
Hi,

> I am having trouble modifying a query by changing the crawl to
> 'CC-MAIN-2018-06' or 'CC-MAIN-2019-05

A list of crawls is provided here:
https://commoncrawl.org/the-data/get-started/
https://data.commoncrawl.org/crawl-data/index.html
Crawls are named using the scheme CC-MAIN-<year>-<week_of_year>.
Because a crawl is run only every few weeks and not at regular
intervals, simply incrementing the week number won't work.

In the SQL query one could use pattern matching, e.g.
crawl LIKE 'CC-MAIN-2016-%'
However, if the aggregation is complex, it's more reliable
to query each crawl separately. An aggregation over many crawls
may time out. There are tools to automate the iteration
over crawls, see for example PyAthena [1].
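If you go the per-crawl route, a rough sketch with PyAthena could look like the following. Note that the staging bucket and region are placeholders to fill in with your own values, and the crawl ID list is abbreviated; take the full 2016-2019 list from the pages linked above:

```python
# Run the same aggregation once per crawl and collect the results.
# CRAWLS is abbreviated -- fill in the full list of crawl IDs from
# the crawl listing linked above.
CRAWLS = [
    "CC-MAIN-2016-07",
    "CC-MAIN-2016-18",
    # ... remaining 2016-2019 crawl IDs ...
]

QUERY_TEMPLATE = """\
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = '{crawl}'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING COUNT(*) >= 100
ORDER BY count DESC
"""

def run_per_crawl(crawls, s3_staging_dir, region_name="us-east-1"):
    """Run the aggregation for each crawl; returns {crawl_id: result rows}."""
    from pyathena import connect  # pip install pyathena
    conn = connect(s3_staging_dir=s3_staging_dir, region_name=region_name)
    results = {}
    for crawl in crawls:
        cursor = conn.cursor()
        cursor.execute(QUERY_TEMPLATE.format(crawl=crawl))
        results[crawl] = cursor.fetchall()
        cursor.close()
    return results
```

The per-crawl results can then be merged locally (e.g. summed per domain), which avoids a single long-running Athena aggregation.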

> Specifically, I would like to merge the 41 shards of monthly
> CommonCrawl from 2016 to 2019 mentioned in the GPT3 article by Brown
> et al. (2020).

There were 45 crawls in the years 2016 - 2019. Most likely the last 4
crawls in 2019 are not included in the training data of GPT-3 - but
that's just a guess based on the publication date of [2] and given
that a certain amount of time is required to filter and sample data,
and to train and evaluate the model.

Best,
Sebastian

[1] https://pypi.org/project/PyAthena/
[2] https://arxiv.org/abs/2005.14165

hcf

Mar 14, 2023, 10:31:56 AM3/14/23
to Common Crawl
Thanks, this solved the "problem". I've downloaded 41 files and am ready to merge them. I will try PyAthena.

Best,
Hans Chr

Tom Morris

Mar 14, 2023, 4:30:54 PM3/14/23
to common...@googlegroups.com
If your queries are simple enough and you want to try running them all at once, you can use an expression like

WHERE REGEXP_LIKE(crawl, '^CC-MAIN-201([6-8]|9-([0-2]|3[05]))')

This bumps the resource usage for the toy example from the tutorial from:

    Run time: 7.675 sec Data scanned: 2.12 MB

for a single crawl to

    Run time: 22.429 sec Data scanned: 88.73 MB

for all 41 crawls.

Modifying the query to run on all TLDs instead of just Norway (.no) bumps resource usage for all 41 crawls to:

    Run time: 2 min 33.702 sec Data scanned: 12.34 GB

You'll definitely want to test your queries on smaller subsets to get them debugged and tuned, but it's pretty nice to be able to get answers in a couple of minutes for a cost of $0.06. Below are the top 10 domains for the crawls used in GPT-3:



Tom


--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/540f4bcf-8633-eb7d-2ff7-f36629378fa8%40commoncrawl.org.

hcf

Mar 17, 2023, 6:46:26 AM3/17/23
to Common Crawl
Thanks, running the queries all at once worked fine – they also confirmed that my python script for aggregating crawls did the right thing :)

Could you please post the SQL you used for the top 10 domains? Would this be the one costing $0.06? I would like to have a top 100 list.

The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) list the top 25 domains (by token count) in the April 2019 crawl. A statistic based on counts, which I understand is a count of HTML documents, and a statistic based on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL so that instead of counts it retrieves the number of words or tokens?

Tom Morris

Mar 17, 2023, 2:45:27 PM3/17/23
to common...@googlegroups.com
On Fri, Mar 17, 2023 at 6:46 AM hcf <fars...@gmail.com> wrote:

Could you please post the SQL you used for the top 10 domains? Would this be the one costing $ 0.06? I would like to have a top 100 list.

Here's the SQL. It'll generate an ordered list of all 28.4M domains with >100 URLs fetched. Yes, the query cost $0.06. I've also attached the top 1000 domains with their counts. 

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE REGEXP_LIKE(crawl, '^CC-MAIN-201([6-8]|9-([0-2]|3[05]))')
  AND subset = 'warc'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) list the top 25 domains (by token count) in the April 2019 crawl. A statistic based on counts, which I understand is a count of HTML documents, and a statistic based on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL so that instead of counts it retrieves the number of words or tokens?

I don't think token counts are easily available anywhere. You'd need to download the raw WET files for all 41 crawls (~322 TiB), filter, deduplicate, and tokenize them, then count the tokens. The GPT-3 filtering and deduplication algorithms are pretty lightly documented, so I'm not sure you could accurately reproduce them, even if you were willing to spend the many CPU days required.
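For what it's worth, the per-domain counting itself is the easy part once you have WET files; the download volume and reproducing the filtering are the expensive parts. A rough sketch for a single WET file, using naive whitespace tokenization (which will not match GPT-3's BPE counts, only give a comparable ordering):

```python
# Count (crudely tokenized) words per hostname in one Common Crawl WET file.
# WET records with rec_type "conversion" hold the extracted plain text.
from collections import Counter
from urllib.parse import urlsplit

def count_tokens(text):
    """Crude whitespace token count; swap in a real (e.g. BPE) tokenizer for parity."""
    return len(text.split())

def tokens_per_domain(wet_path):
    """Return a Counter mapping hostname -> token count for one WET file."""
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio
    counts = Counter()
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            host = urlsplit(url).hostname or ""
            text = record.content_stream().read().decode("utf-8", "replace")
            counts[host] += count_tokens(text)
    return counts
```

Summing such Counters over all WET files of the 41 crawls would give a page-text-based approximation, before any GPT-3-style filtering or deduplication.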

Tom
commoncrawl-top-1000-hosts-GPT-3-2016-2019.csv

hcf

Mar 17, 2023, 3:19:58 PM3/17/23
to Common Crawl
Thanks again. This is very useful. 

Downloading the raw WET files is far outside my scope. My main interest is to understand the very large difference in the types of domains in the article and in the query here. I understand that C4 is based on one crawl and GPT-3 on 41 crawls, but this doesn't seem to explain the difference. In the article, Dodge et al. list patents.google.com, nytimes, and latimes as the top three (excluding Wikipedia). Is it plausible that the main difference here is statistics based on tokens vs. pages? Are there other strategies to shed some light on this?



Sebastian Nagel

Mar 17, 2023, 6:21:50 PM3/17/23
to common...@googlegroups.com
The length in terms of tokens is one factor.

The other is the filtering of the corpus:

... for C4 this means mostly:
- "removing documents which contain any word on the “List
  of Dirty, Naughty, Obscene, or Otherwise Bad Words”"
- "remove documents which weren’t classified as English
  with probability at least 0.99", so C4 is primarily
  comprised of English text

... for GPT-3, the filtering (see [1]):
- uses a classifier trained on the WebText corpus
  (Reddit links with at least 3 karma), Wikipedia
  and the book corpus as "positive examples"
- but with a random factor allowing some documents
  "out of distribution"
- fuzzy deduplication


[1] https://arxiv.org/abs/2005.14165

Tom Morris

Mar 21, 2023, 2:28:33 AM3/21/23
to common...@googlegroups.com
On Fri, Mar 17, 2023 at 3:20 PM hcf <fars...@gmail.com> wrote:
[...] My main interest is to understand the very large difference in the types of domains in the article and in the query here. I understand that C4 is based on one crawl and GPT-3 on 41 crawls, but this doesn't seem to explain the difference. In the article, Dodge et al. list patents.google.com, nytimes, and latimes as the top three (excluding Wikipedia). Is it plausible that the main difference here is statistics based on tokens vs. pages? Are there other strategies to shed some light on this?

I don't think one can really say anything about GPT-3 because the corpus isn't available and they haven't published enough information to recreate it as AllenAI did with C4 for their analysis.

If you want to focus on the C4 results in the AllenAI (Dodge et al. 2021) paper, one significant difference is that the example query you were using was enumerating registered domains, NOT hostnames, which accounts for the blogging sites ranking so highly -- they offer individualized hostnames for blogs, so the per-host statistics are going to be much more fragmented. It also wasn't restricting the query to pages where English was detected, as was done for C4.

Other differences include:
  • multimedia (ie primarily non-text) sites like Tumblr, YouTube, Flickr are likely to get filtered by the C4 filtering
  • C4 text is deduplicated as part of the processing
  • CommonCrawl potentially has multiple crawls of the same content (a degenerate case of duplication)
The AllenAI paper discusses the filtering that was used for C4 and its impact. You can get an idea of the scale just by looking at the raw numbers. The final filtered C4.EN corpus has 1/3 the pages and 1/9 the tokens (see Table 1) when compared to the unfiltered English corpus.

Here's a tweaked query that counts by hostnames instead of registered domain and restricts results to pages where English was the primary language detected (which also eliminates all non-text results):

SELECT COUNT(*) AS count,
       url_host_name
FROM "ccindex"."ccindex"
WHERE REGEXP_LIKE(crawl, '^CC-MAIN-2019-18')
  AND subset = 'warc'
  AND content_languages LIKE 'eng%'
GROUP BY url_host_name
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

There are still a number of anomalies in the top 25. Here are the top 25 sites by token count from the paper with their page count ranks from the revised query above:


C4.EN rank  Hostname                 Eng CC 2019-18 rank  Raw CC 2019-18 rank
1           patents.google.com       32                   50
2           en.wikipedia.org         11                   17
3           en.m.wikipedia.org       62                   98
4           www.nytimes.com          9                    10
5           www.latimes.com          19                   32
6           www.theguardian.com      18                   31
7           journals.plos.org        1633                 2085
8           www.forbes.com           55                   91
9           www.huffpost.com         24                   40
10          patents.com              3500                 6307
11          www.scribd.com           142                  113
12          www.washingtonpost.com   64                   97
13          www.fool.com             276                  444
14          ipfs.io                  532                  781
15          www.frontiersin.org      2181                 3398
16          www.businessinsider.com  44                   65
17          www.chicagotribune.com   53                   77
18          www.booking.com          151                  104
19          www.theatlantic.com      82                   133
20          link.springer.com        23                   33
21          www.aljazeera.com        109                  176
22          www.kickstarter.com      83                   131
23          caselaw.findlaw.com      1723                 2646
24          www.ncbi.nlm.nih.gov     77                   124
25          www.npr.org              59                   94

It seems likely that the PLOS and "Journals In ..." sites were boosted by high token counts per URL.

Tom

 

Sebastian Nagel

Mar 24, 2023, 3:01:24 AM3/24/23
to common...@googlegroups.com
Hi Hans,

> (I wrote an answer yesterday, but somehow it is now listed as deleted. If this
> is due to an error on my side please advise.)

Your answer was on hold and required to be approved by a list moderator.
The spam filter seems to be quite sensitive to listings of URLs including domain
names.

> which pages that are archived from a specific subdomain. [...]
> Is it possible to use SQL and Athena (as above) to get the link to specific
> pages and the archived text from these pages for a small subdomain?

You can use SQL (Athena, Spark, etc.) querying the columnar index to get a list
of URLs and the "coordinates" of the page captures in the archives. Based on the
columns warc_filename, warc_record_offset and warc_record_length, it's possible
to fetch the WARC records, extract the payload and capture metadata and extract
the text from the HTML. A few links on how to do this are shared in this discussion:
https://groups.google.com/g/common-crawl/c/tAO6VaAw3WA/m/Untz_qt6EwAJ
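A minimal sketch of the fetch step, assuming you already have one index row in hand (warc_filename, warc_record_offset, warc_record_length come straight from the Athena result):

```python
# Fetch a single page capture from data.commoncrawl.org using an HTTP
# range request over the "coordinates" from the columnar index.
import gzip
import urllib.request

def byte_range(offset, length):
    """Build an HTTP Range header value; both ends are inclusive."""
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_warc_record(warc_filename, offset, length):
    """Return the decompressed bytes of one gzipped WARC record."""
    url = "https://data.commoncrawl.org/" + warc_filename
    req = urllib.request.Request(url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

The returned bytes are a complete WARC record (headers plus HTML payload); a library such as warcio can then parse it and text extraction can be applied to the payload.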
Let me know if you need more information!

Best,
Sebastian


On 3/23/23 11:44, hcf wrote:
> (I wrote an answer yesterday, but somehow it is now listed as deleted. If this
> is due to an error on my side please advise.)
>
> Sebastian and Tom, thank you for making this clearer. I agree that it is
> possible to get much further on the GPT3.
>
> One thing I would like to do is to peek into Common Crawl and study exactly
> which pages are archived from a specific subdomain. I have seen discussions
> and webpages explaining this, but I am a bit overwhelmed by the information. Is it
> possible to use SQL and Athena (as above) to get the link to specific pages and
> the archived text from these pages for a small subdomain?
>
> Hans
>
>

Greg Lindahl

Mar 24, 2023, 1:22:28 PM3/24/23
to common...@googlegroups.com
On Thu, Mar 23, 2023 at 03:44:58AM -0700, hcf wrote:

> Is it possible to use SQL and Athena (as above) to get the
> link to specific pages and the archived text from these pages for a small
> subdomain?

My cdx_toolkit does that -- although the SQL Athena support isn't
finished and is still in a branch. For this particular purpose (a
specific subdomain) the ordinary "cdx index" gets the job done.

https://github.com/cocrawler/cdx_toolkit
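Based on my reading of the cdx_toolkit README, listing captures for a small subdomain might look roughly like this (the host pattern is a hypothetical example; check the project docs for the exact interface):

```python
# List (url, status) pairs for captures of a host pattern via
# cdx_toolkit's Common Crawl CDX index support.
def list_captures(host_pattern, limit=10):
    """Return (url, status) pairs for up to `limit` captures."""
    import cdx_toolkit  # pip install cdx-toolkit
    cdx = cdx_toolkit.CDXFetcher(source="cc")
    return [(obj["url"], obj["status"]) for obj in cdx.iter(host_pattern, limit=limit)]

# e.g. list_captures("blog.example.com/*")  # hypothetical subdomain
```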

-- greg

Tom Morris

Mar 27, 2023, 6:56:58 PM3/27/23
to common...@googlegroups.com
On Fri, Mar 24, 2023 at 2:40 AM hcf <fars...@gmail.com> wrote:

> One follow up to the last post. What do you mean by " boosted by high token counts per URL"?

The graph in the paper ordered domains by the total number of tokens
whereas most CommonCrawl stats are by number of visits or URLs, so
domains which have a high number of tokens per page (e.g. journal
sites) will rank higher when the token count ordering is used.

Tom