Linguistic research from a beginner with Common Crawl: GPT-2 and traces of Isaac Asimov in the Common Crawl dataset


Claude Grunspan

unread,
Feb 20, 2021, 4:34:33 AM
to Common Crawl

Hello everyone,

I am a linguistics student at Sorbonne University in Paris, and we are currently working on the GPT-2 text generator. For this we are running tests on some of Isaac Asimov's short stories using the Write With Transformer demo.
Hence I have two questions for you, fellow members of this group:
1) Could anyone on this list help me figure out how to trace Isaac Asimov's texts in Common Crawl? With which tools? (Whether URLs or datasets.)
2) Where can I find any specifications on how Write With Transformer is built (hyperparameters, parameters, tokenization, other existing tests...)?

Thank you very much in advance

Tom Morris

unread,
Feb 20, 2021, 1:13:09 PM
to common...@googlegroups.com
On Sat, Feb 20, 2021 at 4:34 AM Claude Grunspan <grunspa...@gmail.com> wrote:

1) Could anyone from this list help me know how I can trace Isaac Asimov's texts in Common Crawl? With which tools? (Whether URLs or datasets)

For a targeted literature selection such as this, you'd be better off with HathiTrust, OpenLibrary, or a similar resource. But Isaac Asimov only began publishing in 1938, so all of his works are still in copyright.

Tom

Claude Grunspan

unread,
Feb 20, 2021, 1:26:04 PM
to common...@googlegroups.com
Thank you very much, Tom.
In fact, I would like to know how to go about finding how many of Asimov's writings (or those of another science-fiction writer, ideally one not under copyright) are part of the Common Crawl corpus.
Where should I search?
Thank you very much in advance.


--
You received this message because you are subscribed to a topic in the Google Groups "Common Crawl" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/common-crawl/KTjN1VEaPKQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/CAE9vqEGz%2B9sPM8toknY4o%3DY%3DrhoU%3DNg7pQToFjp%2BP7bkmAdXQA%40mail.gmail.com.

Tom Morris

unread,
Feb 20, 2021, 2:22:59 PM
to common...@googlegroups.com
On Sat, Feb 20, 2021 at 1:26 PM Claude Grunspan <grunspa...@gmail.com> wrote:
In fact i would like to know the process of searching to find how many of Asimov's writings, or another science fiction writer's not under copyright if possible, are part of the Common Crawl corpus. 
Where should i search?

What you are looking for is a search engine, which isn't one of the things that Common Crawl offers. There have been some efforts to build search engines on top of the Common Crawl data, but I don't think any of them are currently active. One example is/was Elastic ChatNoir (https://www.chatnoir.eu/?q=isaac+asimov). Of course, any of the main search engines (Google, Bing, DuckDuckGo, Yandex, etc.) could run the same searches for you on the live web.

With the Common Crawl data, you have two options:
1. Use the Athena-based Common Crawl index to search for likely keywords in URLs; this is cheap and fast, but requires a second pass of validation to weed out book reviews, author biographies, etc.
2. Use Spark/Hadoop to run a brute-force search across all the page captures, which will be computationally expensive.
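For what it's worth, option 1 can be sketched as an Athena query over Common Crawl's columnar URL index. The snippet below only assembles the SQL string; the `ccindex.ccindex` table and column names follow Common Crawl's published Athena setup, while the crawl label and keywords are placeholder examples:

```python
# Sketch: build an Athena SQL query over Common Crawl's columnar URL index.
# Table/column names follow the public cc-index schema; the crawl label and
# keywords are just example inputs, not a definitive recipe.

def build_url_keyword_query(crawl, keywords):
    """Return SQL selecting page captures whose URL mentions any keyword."""
    like_clauses = " OR ".join(
        f"lower(url) LIKE '%{kw.lower()}%'" for kw in keywords
    )
    return (
        f"SELECT url, url_host_name, warc_filename, warc_record_offset\n"
        f"FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}' AND subset = 'warc'\n"
        f"  AND ({like_clauses})"
    )

print(build_url_keyword_query("CC-MAIN-2021-04", ["asimov", "foundation"]))
```

Running it requires an Athena setup pointed at the public index table, and every URL-keyword hit still needs the second validation pass to weed out reviews and biographies.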

Tom

Claude Grunspan

unread,
Feb 20, 2021, 2:27:54 PM
to common...@googlegroups.com
Thank you, Tom.
I will try the Athena-based Common Crawl index first, and then the second option you propose.

Claude


Alex Henry

unread,
Feb 21, 2021, 4:10:35 PM
to common...@googlegroups.com
Hey Claude,

Be aware that even if you find the URLs you're thinking of in the Common Crawl index, Common Crawl itself may have only a small subset of the webpages associated with those domains. In other words, even if a website contains the text you're looking for, Common Crawl might not have that part of the website. My understanding is that the lower a domain's harmonic centrality, the fewer pages from that domain Common Crawl will actually crawl.

If you're just trying to see whether each book is online, your best bet might be to use a search-engine API on a suitably long random string from it (though there will be false negatives).

For example, Googling "They have never moved in all that time and take no notice of day or night" returns a Google Books link to Childhood’s End as well as a Russian site that apparently has the full text of the book along with some sketchy-seeming pop-ups. 
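The probe-selection step can be sketched in a few lines of Python; the function name and default word count below are arbitrary choices, and the search-API call itself is left out:

```python
import random

def random_probe(text, n_words=12, seed=None):
    """Pick a random contiguous run of n_words words from text, suitable as
    a quoted search-engine query. Longer probes give fewer false positives,
    but any single probe can miss a copy whose text differs slightly
    (OCR errors, a different edition), hence the false negatives."""
    words = text.split()
    if len(words) <= n_words:
        return " ".join(words)
    rng = random.Random(seed)  # seed only for reproducible demos/tests
    start = rng.randrange(len(words) - n_words + 1)
    return " ".join(words[start:start + n_words])

# Wrap the result in quotation marks before sending it to a search-engine API.
```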

Best of luck,
Alex
   

Claude Grunspan

unread,
Feb 23, 2021, 5:32:03 AM
to common...@googlegroups.com
Thank you for your answer Alex, I will follow your advice.

Do you have any idea where I can find a complete list (the most recent one, for example) of the websites that are supposed to be part of the Common Crawl database?


Thank you again in advance

Claude

Sebastian Nagel

unread,
Feb 23, 2021, 6:04:09 AM
to common...@googlegroups.com
Hi Claude,

> where I can find a complete list of the websites

Either use the columnar index
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
(column url_host_name or url_host_registered_domain)

Alternatively, the data for the project https://github.com/commoncrawl/cc-crawl-statistics
includes also counts for host and domain names:
- download the count files:
CRAWL=CC-MAIN-2021-04
aws --no-sign-request s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count $CRAWL/count
- then grep for host (id = 2) or domain (id = 3) counts:
bzgrep -h '^\[[23],' $CRAWL/count/part-*.bz2
- e.g.
[3,"commoncrawl.org",65] [56,55,3]
- the second (tab-separated) column holds the counts
- number of page captures
- unique URLs
- and unique host names (only for domain counts)
If the trailing numbers are identical, the list is compressed:
"1" means 1 page, 1 URL, 1 host

One remark: while GPT-3 was indeed trained on data from Common Crawl,
GPT-2 was not. The Open WebText Corpus tries to reproduce the GPT-2 training data;
see https://skylion007.github.io/OpenWebTextCorpus/

Best,
Sebastian


Claude Grunspan

unread,
Feb 23, 2021, 6:21:52 AM
to common...@googlegroups.com
Thank you very much Sebastian, this is very helpful!
I thought GPT-2 was trained on Common Crawl too.
Does GPT-2's training corpus contain other types of data, apart from OpenWebText?
If yes, I am also looking for that list :-)

Thank you very much in advance

Claude



Sebastian Nagel

unread,
Feb 23, 2021, 6:51:41 AM
to common...@googlegroups.com
Hi Claude,

> Does GPT-2's training corpus contain other types of data, apart from OpenWebText?

According to [1] GPT-2 was trained only on the "WebText" corpus.

The Open WebText initiative is "an open source effort to reproduce OpenAI’s WebText dataset" [2],
so I assume it's similar but not identical to "WebText".

GPT-3 was trained on multiple corpora; you'll find a list in [3] on page 9.

Best,
Sebastian

[1] https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[2] https://skylion007.github.io/OpenWebTextCorpus/
[3] https://arxiv.org/abs/2005.14165

Tom Morris

unread,
Feb 23, 2021, 11:55:24 AM
to common...@googlegroups.com
On Tue, Feb 23, 2021 at 6:51 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:

> According to [1] GPT-2 was trained only on the "WebText" corpus.
>
> The Open WebText initiative is "an open source effort to reproduce OpenAI’s WebText dataset" [2],
> so I assume it's similar but not identical to "WebText".

Neither one is described in enough detail to be able to reproduce it identically (not to mention the effects of the passage of time), but the Open WebText version seems to have skipped 1) karma filtering and 2) Wikipedia exclusion, as well as (probably) using a different deduplication strategy (the GPT-2 paper doesn't describe theirs).

Tom

Claude Grunspan

unread,
Mar 2, 2021, 2:54:42 AM
to common...@googlegroups.com
Thank you very much again, Sebastian and Tom.

Best
Claude
