Help writing SQL that combines different crawls

hcf

Mar 13, 2023, 7:49:40 AM3/13/23
to Common Crawl
I have successfully opened an Athena account and executed the first four queries from this tutorial:

The following query runs as expected:
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

I am having trouble modifying the query by changing the crawl to 'CC-MAIN-2018-06' or 'CC-MAIN-2019-05', as I'm only getting empty results. Can someone assist me in troubleshooting this issue?

My goal is to run the aforementioned query while combining multiple crawls. Specifically, I would like to merge the 41 shards of monthly CommonCrawl from 2016 to 2019 mentioned in the GPT3 article by Brown et al. (2020). Can you provide guidance on how to accomplish this?

Best regards

hcf

Sebastian Nagel

Mar 13, 2023, 10:49:06 AM3/13/23
to common...@googlegroups.com
Hi,

> I am having trouble modifying a query by changing the crawl to
> 'CC-MAIN-2018-06' or 'CC-MAIN-2019-05

A list of crawls is provided here:
https://commoncrawl.org/the-data/get-started/
https://data.commoncrawl.org/crawl-data/index.html
Crawls are named using the scheme CC-MAIN-<year>-<week_of_year>.
Because a crawl is run only every few weeks and not at regular
intervals, simply incrementing the week number won't work.

In the SQL query one could use pattern matching, e.g.
crawl LIKE 'CC-MAIN-2016-%'
However, if the aggregation is complex, it's more reliable
to query each crawl separately. An aggregation over many crawls
may time out. There are tools to automate the iteration
over crawls, see for example PyAthena [1].
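If you go the per-crawl route, a rough sketch with PyAthena could look like the following. Note that the staging bucket and region are placeholders to fill in with your own values, and the crawl ID list is abbreviated; take the full 2016-2019 list from the pages linked above:

```python
# Run the same aggregation once per crawl and collect the results.
# CRAWLS is abbreviated -- fill in the full list of crawl IDs from
# the crawl listing linked above.
CRAWLS = [
    "CC-MAIN-2016-07",
    "CC-MAIN-2016-18",
    # ... remaining 2016-2019 crawl IDs ...
]

QUERY_TEMPLATE = """\
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = '{crawl}'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING COUNT(*) >= 100
ORDER BY count DESC
"""

def run_per_crawl(crawls, s3_staging_dir, region_name="us-east-1"):
    """Run the aggregation for each crawl; returns {crawl_id: result rows}."""
    from pyathena import connect  # pip install pyathena
    conn = connect(s3_staging_dir=s3_staging_dir, region_name=region_name)
    results = {}
    for crawl in crawls:
        cursor = conn.cursor()
        cursor.execute(QUERY_TEMPLATE.format(crawl=crawl))
        results[crawl] = cursor.fetchall()
        cursor.close()
    return results
```

The per-crawl results can then be merged locally (e.g. summed per domain), which avoids a single long-running Athena aggregation.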

> Specifically, I would like to merge the 41 shards of monthly
> CommonCrawl from 2016 to 2019 mentioned in the GPT3 article by Brown
> et al. (2020).

There were 45 crawls in the years 2016 - 2019. Most likely the last 4
crawls in 2019 are not included in the training data of GPT-3 - but
that's just a guess based on the publication date of [2] and given
that a certain amount of time is required to filter and sample data,
and to train and evaluate the model.

Best,
Sebastian

[1] https://pypi.org/project/PyAthena/
[2] https://arxiv.org/abs/2005.14165

hcf

Mar 14, 2023, 10:31:56 AM3/14/23
to Common Crawl
Thanks, this solved the "problem". I've downloaded 41 files and am ready to merge them. I will try PyAthena.

Best,
Hans Chr

Tom Morris

Mar 14, 2023, 4:30:54 PM3/14/23
to common...@googlegroups.com
If your queries are simple enough and you want to try running them all at once, you can use an expression like

WHERE REGEXP_LIKE(crawl, '^CC-MAIN-201([6-8]|9-([0-2]|3[05]))')

This bumps the resource usage for the toy example from the tutorial from:

    Run time: 7.675 sec Data scanned: 2.12 MB

for a single crawl to

    Run time: 22.429 sec Data scanned: 88.73 MB

for all 41 crawls.

Modifying the query to run on all TLDs instead of just Norway (.no) bumps resource usage for all 41 crawls to:

    Run time: 2 min 33.702 sec Data scanned: 12.34 GB

You'll definitely want to test your queries on smaller subsets to get them debugged and tuned, but it's pretty nice to be able to get answers in a couple of minutes for a cost of $0.06. Below are the top 10 domains for the crawls used in GPT-3:



Tom


--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/540f4bcf-8633-eb7d-2ff7-f36629378fa8%40commoncrawl.org.

hcf

Mar 17, 2023, 6:46:26 AM3/17/23
to Common Crawl
Thanks, running the queries all at once worked fine – they also confirmed that my python script for aggregating crawls did the right thing :)

Could you please post the SQL you used for the top 10 domains? Would this be the one costing $0.06? I would like to have a top 100 list.

The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) list the top 25 domains (by token count) in the April 2019 crawl. A statistic based on counts, which I understand is a count of HTML documents, and a statistic based on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL so that instead of counts it retrieves the number of words or tokens?

Tom Morris

Mar 17, 2023, 2:45:27 PM3/17/23
to common...@googlegroups.com
On Fri, Mar 17, 2023 at 6:46 AM hcf <fars...@gmail.com> wrote:

Could you please post the SQL you used for the top 10 domains? Would this be the one costing $ 0.06? I would like to have a top 100 list.

Here's the SQL. It'll generate an ordered list of all 28.4M domains with >100 URLs fetched. Yes, the query cost $0.06. I've also attached the top 1000 domains with their counts. 

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE REGEXP_LIKE(crawl, '^CC-MAIN-201([6-8]|9-([0-2]|3[05]))')
  AND subset = 'warc'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) list the top 25 domains (by token count) in the April 2019 crawl. A statistic based on counts, which I understand is a count of HTML documents, and a statistic based on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL so that instead of counts it retrieves the number of words or tokens?

I don't think token counts are easily available anywhere. You'd need to download the raw WET files for all 41 crawls (~322 TiB), filter, deduplicate, and tokenize them, then count the tokens. The GPT-3 filtering and deduplication algorithms are pretty lightly documented, so I'm not sure you could accurately reproduce them, even if you were willing to spend the many CPU days required.
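For what it's worth, the per-domain counting itself is the easy part once you have WET files; the download volume and reproducing the filtering are the expensive parts. A rough sketch for a single WET file, using naive whitespace tokenization (which will not match GPT-3's BPE counts, only give a comparable ordering):

```python
# Count (crudely tokenized) words per hostname in one Common Crawl WET file.
# WET records with rec_type "conversion" hold the extracted plain text.
from collections import Counter
from urllib.parse import urlsplit

def count_tokens(text):
    """Crude whitespace token count; swap in a real (e.g. BPE) tokenizer for parity."""
    return len(text.split())

def tokens_per_domain(wet_path):
    """Return a Counter mapping hostname -> token count for one WET file."""
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio
    counts = Counter()
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            host = urlsplit(url).hostname or ""
            text = record.content_stream().read().decode("utf-8", "replace")
            counts[host] += count_tokens(text)
    return counts
```

Summing such Counters over all WET files of the 41 crawls would give a page-text-based approximation, before any GPT-3-style filtering or deduplication.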

Tom
commoncrawl-top-1000-hosts-GPT-3-2016-2019.csv

hcf

Mar 17, 2023, 3:19:58 PM3/17/23
to Common Crawl
Thanks again. This is very useful. 

Downloading the raw WET files is far outside my scope. My main interest is to understand the very large difference in the types of domains in the article and in the query here. I understand that C4 is based on one crawl and GPT-3 on 41 crawls, but this doesn't seem to explain the difference. In the article, Dodge et al. list patents.google.com, nytimes, and latimes as the top three (excluding Wikipedia). Is it plausible that the main difference here is statistics based on tokens vs. pages? Are there other strategies to shed some light on this?



Sebastian Nagel

Mar 17, 2023, 6:21:50 PM3/17/23
to common...@googlegroups.com
The length in terms of tokens is one factor.

The other is the filtering of the corpus:

... for C4 this means mostly:
- "removing documents which contain any word on the “List
  of Dirty, Naughty, Obscene, or Otherwise Bad Words”"
- "remove documents which weren’t classified as English
  with probability at least 0.99", so C4 is primarily
  comprised of English text

... for GPT-3, the filtering (see [1]):
- uses a classifier trained on the WebText corpus
  (Reddit links with at least 3 karma), Wikipedia
  and the book corpus as "positive examples"
- but with a random factor allowing some documents
  "out of distribution"
- fuzzy deduplication


[1] https://arxiv.org/abs/2005.14165

Tom Morris

Mar 21, 2023, 2:28:33 AM3/21/23
to common...@googlegroups.com
On Fri, Mar 17, 2023 at 3:20 PM hcf <fars...@gmail.com> wrote:
[...] My main interest is to understand the very large difference in the types of domains in the article and in the query here. I understand that C4 is based on one crawl and GPT-3 on 41 crawls, but this doesn't seem to explain the difference. In the article, Dodge et al. list patents.google.com, nytimes, and latimes as the top three (excluding Wikipedia). Is it plausible that the main difference here is statistics based on tokens vs. pages? Are there other strategies to shed some light on this?

I don't think one can really say anything about GPT-3 because the corpus isn't available and they haven't published enough information to recreate it as AllenAI did with C4 for their analysis.

If you want to focus on the C4 results in the AllenAI (Dodge et al. 2021) paper, one significant difference is that the example query you were using was enumerating registered domains, NOT hostnames, which accounts for the blogging sites ranking so highly -- they offer individualized hostnames for blogs, so the per-host statistics are going to be much more fragmented. It also wasn't restricting the query to pages where English was detected, as was done for C4.

Other differences include:
  • multimedia (ie primarily non-text) sites like Tumblr, YouTube, Flickr are likely to get filtered by the C4 filtering
  • C4 text is deduplicated as part of the processing
  • CommonCrawl potentially has multiple crawls of the same content (a degenerate case of duplication)
The AllenAI paper discusses the filtering that was used for C4 and its impact. You can get an idea of the scale just by looking at the raw numbers. The final filtered C4.EN corpus has 1/3 the pages and 1/9 the tokens (see Table 1) when compared to the unfiltered English corpus.

Here's a tweaked query that counts by hostnames instead of registered domain and restricts results to pages where English was the primary language detected (which also eliminates all non-text results):

SELECT COUNT(*) AS count,
       url_host_name
FROM "ccindex"."ccindex"
WHERE REGEXP_LIKE(crawl, '^CC-MAIN-2019-18')
  AND subset = 'warc'
  AND content_languages LIKE 'eng%'
GROUP BY url_host_name
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

There are still a number of anomalies in the top 25. Here are the top 25 sites by token count from the paper with their page count ranks from the revised query above:


C4.EN rank  Hostname                 Eng CC 2019-18 rank  Raw CC 2019-18 rank
1           patents.google.com       32                   50
2           en.wikipedia.org         11                   17
3           en.m.wikipedia.org       62                   98
4           www.nytimes.com          9                    10
5           www.latimes.com          19                   32
6           www.theguardian.com      18                   31
7           journals.plos.org        1633                 2085
8           www.forbes.com           55                   91
9           www.huffpost.com         24                   40
10          patents.com              3500                 6307
11          www.scribd.com           142                  113
12          www.washingtonpost.com   64                   97
13          www.fool.com             276                  444
14          ipfs.io                  532                  781
15          www.frontiersin.org      2181                 3398
16          www.businessinsider.com  44                   65
17          www.chicagotribune.com   53                   77
18          www.booking.com          151                  104
19          www.theatlantic.com      82                   133
20          link.springer.com        23                   33
21          www.aljazeera.com        109                  176
22          www.kickstarter.com      83                   131
23          caselaw.findlaw.com      1723                 2646
24          www.ncbi.nlm.nih.gov     77                   124
25          www.npr.org              59                   94

It seems likely that the PLOS and "Journals In ..." sites were boosted by high token counts per URL.

Tom

 

Sebastian Nagel

Mar 24, 2023, 3:01:24 AM3/24/23
to common...@googlegroups.com
Hi Hans,

> (I wrote an answer yesterday, but somehow it is now listed as deleted. If this
> is due to an error on my side please advise.)

Your answer was on hold and required to be approved by a list moderator.
The spam filter seems to be quite sensitive to listings of URLs including domain
names.

> which pages that are archived from a specific subdomain. [...]
> Is it possible to use SQL and Athena (as above) to get the link to specific
> pages and the archived text from these pages for a small subdomain?

You can use SQL (Athena, Spark, etc.) querying the columnar index to get a list
of URLs and the "coordinates" of the page captures in the archives. Based on the
columns warc_filename, warc_record_offset and warc_record_length, it's possible
to fetch the WARC records, extract the payload and capture metadata and extract
the text from the HTML. A few links on how to do this are shared in this discussion:
https://groups.google.com/g/common-crawl/c/tAO6VaAw3WA/m/Untz_qt6EwAJ
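A minimal sketch of the fetch step, assuming you already have one index row in hand (warc_filename, warc_record_offset, warc_record_length come straight from the Athena result):

```python
# Fetch a single page capture from data.commoncrawl.org using an HTTP
# range request over the "coordinates" from the columnar index.
import gzip
import urllib.request

def byte_range(offset, length):
    """Build an HTTP Range header value; both ends are inclusive."""
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_warc_record(warc_filename, offset, length):
    """Return the decompressed bytes of one gzipped WARC record."""
    url = "https://data.commoncrawl.org/" + warc_filename
    req = urllib.request.Request(url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

The returned bytes are a complete WARC record (headers plus HTML payload); a library such as warcio can then parse it and text extraction can be applied to the payload.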
Let me know if you need more information!

Best,
Sebastian


On 3/23/23 11:44, hcf wrote:
> (I wrote an answer yesterday, but somehow it is now listed as deleted. If this
> is due to an error on my side please advise.)
>
> Sebastian and Tom, thank you for making this clearer. I agree that it is
> possible to get much further on the GPT3.
>
> One thing I would like to do is to peek into Common Crawl and study exactly
> which pages are archived from a specific subdomain. I have seen discussions
> and webpages explaining this, but I am a bit overwhelmed by the information. Is it
> possible to use SQL and Athena (as above) to get the link to specific pages and
> the archived text from these pages for a small subdomain?
>
> Hans
>
>

Greg Lindahl

Mar 24, 2023, 1:22:28 PM3/24/23
to common...@googlegroups.com
On Thu, Mar 23, 2023 at 03:44:58AM -0700, hcf wrote:

> Is it possible to use SQL and Athena (as above) to get the
> link to specific pages and the archived text from these pages for a small
> subdomain?

My cdx_toolkit does that -- although the SQL Athena support isn't
finished and is still in a branch. For this particular purpose (a
specific subdomain) the ordinary "cdx index" gets the job done.

https://github.com/cocrawler/cdx_toolkit
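Based on my reading of the cdx_toolkit README, listing captures for a small subdomain might look roughly like this (the host pattern is a hypothetical example; check the project docs for the exact interface):

```python
# List (url, status) pairs for captures of a host pattern via
# cdx_toolkit's Common Crawl CDX index support.
def list_captures(host_pattern, limit=10):
    """Return (url, status) pairs for up to `limit` captures."""
    import cdx_toolkit  # pip install cdx-toolkit
    cdx = cdx_toolkit.CDXFetcher(source="cc")
    return [(obj["url"], obj["status"]) for obj in cdx.iter(host_pattern, limit=limit)]

# e.g. list_captures("blog.example.com/*")  # hypothetical subdomain
```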

-- greg

Tom Morris

Mar 27, 2023, 6:56:58 PM3/27/23
to common...@googlegroups.com
On Fri, Mar 24, 2023 at 2:40 AM hcf <fars...@gmail.com> wrote:

> One follow up to the last post. What do you mean by " boosted by high token counts per URL"?

The graph in the paper ordered domains by the total number of tokens
whereas most CommonCrawl stats are by number of visits or URLs, so
domains which have a high number of tokens per page (e.g. journal
sites) will rank higher when the token count ordering is used.

Tom