I am having trouble modifying a query by changing the crawl to 'CC-MAIN-2018-06' or 'CC-MAIN-2019-05', as I'm only getting empty results. Can someone assist me in troubleshooting this issue?
My goal is to run the aforementioned query while combining multiple crawls. Specifically, I would like to combine the 41 monthly Common Crawl crawls from 2016 to 2019 mentioned in the GPT-3 paper by Brown et al. (2020). Can you provide guidance on how to accomplish this?
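A note on the crawl labels, as something worth checking first: Common Crawl names crawls CC-MAIN-YYYY-WW using ISO week numbers, so a label like 'CC-MAIN-2018-06' may simply not correspond to an existing crawl (the list of valid labels is published at https://index.commoncrawl.org/collinfo.json). To cover the 41 crawls from 2016 through CC-MAIN-2019-35 in one query, a regular expression over the `crawl` column works; here is a minimal Python sketch of the same pattern that appears in the Athena query later in this thread, useful for checking which labels fall in the window:

```python
import re

# Matches the 41 monthly crawls used for GPT-3: all of 2016-2018,
# plus 2019 crawls up to and including CC-MAIN-2019-35.
# Pattern taken verbatim from the Athena query quoted later in this thread.
CRAWL_RE = re.compile(r"^CC-MAIN-201([6-8]|9-([0-2]|3[05]))")

def in_gpt3_range(crawl_label: str) -> bool:
    """Return True if the crawl label falls in the 2016 to 2019-35 window."""
    return CRAWL_RE.match(crawl_label) is not None
```

In Athena the same pattern goes into `REGEXP_LIKE(crawl, ...)` in the WHERE clause.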
Best regards
hcf
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/540f4bcf-8633-eb7d-2ff7-f36629378fa8%40commoncrawl.org.
Could you please post the SQL you used for the top 10 domains? Would this be the one costing $0.06? I would like to have a top 100 list.
The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) give a list of the top 25 domains (by token count) in the April 2019 crawl. A statistic on counts, which I understand is a count of HTML documents, and a statistic on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL selecting counts to retrieve the number of words or tokens?
[...] My main interest is to understand the very large difference in types of domains in the article and in the query here. I understand that C4 is based on one crawl and GPT-3 on 41 crawls, but this doesn't seem to explain the difference. In the article, Dodge et al. list patents.google.com, nytimes, and latimes as the top three (excluding Wikipedia). Is it plausible that the main difference here is statistics based on tokens vs. pages? Are there other strategies to shed some light on this?
C4.EN rank | Hostname | Eng CC 2019-18 rank | Raw CC 2019-18 rank
1  |  | 32   | 50
2  |  | 11   | 17
3  |  | 62   | 98
4  |  | 9    | 10
5  |  | 19   | 32
6  |  | 18   | 31
7  |  | 1633 | 2085
8  |  | 55   | 91
9  |  | 24   | 40
10 |  | 3500 | 6307
11 |  | 142  | 113
12 |  | 64   | 97
13 |  | 276  | 444
14 |  | 532  | 781
15 |  | 2181 | 3398
16 |  | 44   | 65
17 |  | 53   | 77
18 |  | 151  | 104
19 |  | 82   | 133
20 |  | 23   | 33
21 |  | 109  | 176
22 |  | 83   | 131
23 |  | 1723 | 2646
24 |  | 77   | 124
25 |  | 59   | 94
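The pages-vs-tokens question above can be illustrated with a toy example (invented numbers, purely for intuition): a host serving many short boilerplate pages can outrank a host serving a few long articles when ranked by page count, and the order flips when ranked by token count.

```python
# Invented numbers, purely for illustration:
# (host, page_count, avg_tokens_per_page)
hosts = [
    ("shortpages.example", 100_000, 50),     # many short pages
    ("longarticles.example", 5_000, 8_000),  # few long articles
]

by_pages = sorted(hosts, key=lambda h: h[1], reverse=True)
by_tokens = sorted(hosts, key=lambda h: h[1] * h[2], reverse=True)

print([h[0] for h in by_pages])   # shortpages.example ranks first by pages
print([h[0] for h in by_tokens])  # longarticles.example ranks first by tokens
```

So even without any difference in filtering, a page-count ranking (like the Athena query in this thread) and a token-count ranking (like Dodge et al.'s table) can disagree badly for hosts with unusual average page lengths.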
On Friday, March 17, 2023 at 19:45:27 UTC+1, tfmo...@gmail.com wrote:

On Fri, Mar 17, 2023 at 6:46 AM hcf <fars...@gmail.com> wrote:
> Could you please post the SQL you used for the top 10 domains? Would this be the one costing $0.06? I would like to have a top 100 list.

Here's the SQL. It'll generate an ordered list of all 28.4M domains with at least 100 URLs fetched. Yes, the query cost $0.06. I've also attached the top 1000 domains with their counts.

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE REGEXP_LIKE(crawl, '^CC-MAIN-201([6-8]|9-([0-2]|3[05]))')
  AND subset = 'warc'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC

> The list of the top 10 domains raises another question. In "Documenting the English Colossal Clean Crawled Corpus", Dodge et al. (2021) give a list of the top 25 domains (by token count) in the April 2019 crawl. A statistic on counts, which I understand is a count of HTML documents, and a statistic on tokens give very different pictures of what these large language models are fed. I would like to reproduce the same token-based (or word-based) statistic for the GPT-3 crawl. How can I do that? Is it possible to modify the SQL selecting counts to retrieve the number of words or tokens?

I don't think token counts are easily available anywhere. You'd need to download the raw WET files for all 41 crawls (~322 TiB), filter, deduplicate, and tokenize them, then count the tokens. The GPT-3 filtering and deduplication algorithms are pretty lightly documented, so I'm not sure you could accurately reproduce them, even if you were willing to spend the many CPU days required.

Tom
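To make the counting step of Tom's recipe concrete, here is a minimal sketch of just that step, leaving out the GPT-3 filtering and deduplication, which, as noted, are underdocumented. It assumes you already have (URL, extracted text) pairs, e.g. parsed out of WET 'conversion' records with a library such as warcio, and uses whitespace splitting as a crude stand-in for GPT-3's BPE tokenizer, so absolute numbers will differ from the paper's.

```python
from collections import Counter
from urllib.parse import urlsplit

def token_counts_by_host(pages):
    """Tally whitespace-separated tokens per host.

    pages: iterable of (url, text) pairs, e.g. parsed from WET
    'conversion' records. Whitespace splitting is a crude stand-in
    for a real tokenizer such as GPT-3's BPE.
    """
    counts = Counter()
    for url, text in pages:
        counts[urlsplit(url).netloc] += len(text.split())
    return counts

# Tiny invented sample, for illustration only:
sample = [
    ("https://example.org/a", "alpha beta gamma"),
    ("https://example.org/b", "delta"),
    ("https://example.com/x", "one two"),
]
print(token_counts_by_host(sample).most_common())
# [('example.org', 4), ('example.com', 2)]
```

Scaling this over 41 crawls is the ~322 TiB problem Tom describes; the per-record logic itself stays this simple.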