Hi Claude,
> where I can find a complete list of the websites
Either use the columnar index
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
(column url_host_name or url_host_registered_domain)
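As a rough sketch of the first option, the helper below builds a SQL query that lists every host or registered domain in one crawl together with its number of captures. It assumes the columnar index has been registered in Athena as "ccindex"."ccindex" with the partition columns crawl and subset, as the linked post describes. The function name and the "captures" alias are made up for illustration.

```python
import re

# The two columns mentioned above; anything else is rejected.
ALLOWED_COLUMNS = {"url_host_name", "url_host_registered_domain"}


def build_host_list_query(crawl, column="url_host_registered_domain",
                          table='"ccindex"."ccindex"'):
    """Return SQL listing all distinct values of `column` for one crawl.

    Assumption: `table` is the name under which the columnar index was
    registered; adjust it to your own setup.
    """
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unsupported column: %r" % column)
    # Crawl identifiers look like CC-MAIN-2021-04; refuse anything else
    # so that no arbitrary text ends up inside the query string.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError("unexpected crawl id: %r" % crawl)
    return (
        f"SELECT {column}, COUNT(*) AS captures\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}' AND subset = 'warc'\n"
        f"GROUP BY {column}\n"
        f"ORDER BY captures DESC"
    )


if __name__ == "__main__":
    print(build_host_list_query("CC-MAIN-2021-04"))
```

The printed query can be pasted into the Athena console. Switching the column to url_host_name lists individual hosts instead of registered domains.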
Alternatively, the data for the project
https://github.com/commoncrawl/cc-crawl-statistics
also includes counts for host and domain names:
- download the count files:
CRAWL=CC-MAIN-2021-04
aws --no-sign-request s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count $CRAWL/count
- then grep for host (id = 2) or domain (id = 3) counts:
bzgrep -h '^\[[23],' $CRAWL/count/part-*.bz2
- e.g.
[3,"commoncrawl.org",65]	[56,55,3]
- the second (tab-separated) column holds the counts:
  - number of page captures
  - unique URLs
  - unique host names (only for domain counts)
If the trailing numbers are identical, then the list is compressed:
"1" means 1 page, 1 URL, 1 host
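To make the count format concrete, here is a minimal Python sketch that parses one line of the grep output and undoes the compression. It assumes that both columns are valid JSON, that host records carry two counters and domain records three, and that a shortened list is restored by repeating its last number. The names parse_count_line and FIELDS are hypothetical and not part of cc-crawl-statistics.

```python
import json

# Counters per record type, following the description above:
# host records (id 2) hold pages and URLs, domain records (id 3) add hosts.
FIELDS = {
    2: ("pages", "urls"),
    3: ("pages", "urls", "hosts"),
}


def parse_count_line(line):
    """Parse one '<key>TAB<value>' line into (type_id, name, counts)."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    key = json.loads(key_json)      # e.g. [3, "commoncrawl.org", 65]
    value = json.loads(value_json)  # e.g. [56, 55, 3] or just 1
    type_id, name = key[0], key[1]
    names = FIELDS[type_id]
    # A bare number is a fully compressed list with a single element.
    counts = value if isinstance(value, list) else [value]
    # Assumption: identical trailing numbers were dropped, so pad the
    # list back to full length with its last element.
    counts = counts + [counts[-1]] * (len(names) - len(counts))
    return type_id, name, dict(zip(names, counts))


if __name__ == "__main__":
    print(parse_count_line('[3,"commoncrawl.org",65]\t[56,55,3]'))
```

Lines can be read straight from the downloaded files with bz2.open(path, "rt") from the standard library, keeping only those that start with "[2," or "[3,".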
One remark: while GPT-3 indeed was trained on data from Common Crawl,
GPT-2 was not. The Open WebText Corpus tries to reproduce the GPT-2 training data,
see
https://skylion007.github.io/OpenWebTextCorpus/
Best,
Sebastian
On 2/23/21 11:31 AM, Claude Grunspan wrote:
> Thank you for your answer Alex, I will follow your advice.
>
> Do you have any idea of where I can find a complete list (the latest one, for example) of the websites that are supposed to be part of the
> Common Crawl database?
>
>
> Thank you again in advance
>
> Claude
>
> On Sun, Feb 21, 2021 at 10:10 PM, Alex Henry <alexanderp...@gmail.com> wrote:
>
> Hey Claude,
>
> Be aware that even if you find the URLs you’re thinking of in the Common Crawl index, Common Crawl itself may have only a small subset
> of the webpages associated with those domains. In other words even if a website contains the text you’re looking for Common Crawl might
> not have that part of the website. My understanding is that the lower a domain’s harmonic centrality the fewer pages from that domain
> Common Crawl will actually scrape.
>
> If you’re just trying to see whether each book is online your best bet might be to use a search engine API on a suitably long random
> string from it (though there will be false negatives).
>
> For example, Googling "They have never moved in all that time and take no notice of day or night" returns a Google Books link to
> Childhood’s End as well as a Russian site that apparently has the full text of the book along with some sketchy-seeming pop-ups.
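The search-engine suggestion above can be sketched in Python as follows: pick a random run of consecutive words from a book's text and wrap it in double quotes so that a search engine treats it as an exact phrase. No particular search API is assumed here, and both function names are invented for illustration.

```python
import random


def sample_phrase(text, n_words=12, rng=None):
    """Return n_words consecutive words taken from a random position."""
    rng = rng or random.Random()
    words = text.split()
    if len(words) <= n_words:
        # Text is shorter than the requested phrase: use all of it.
        return " ".join(words)
    start = rng.randrange(len(words) - n_words + 1)
    return " ".join(words[start:start + n_words])


def exact_phrase_query(phrase):
    """Quote the phrase for exact matching, dropping inner double quotes."""
    return '"' + phrase.replace('"', "") + '"'


if __name__ == "__main__":
    text = ("They have never moved in all that time "
            "and take no notice of day or night")
    print(exact_phrase_query(sample_phrase(text, n_words=8)))
```

Longer phrases reduce accidental matches but raise the chance of false negatives when a site's copy differs in punctuation or spelling.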
>
> Best of luck,
> Alex
> On Sat, Feb 20, 2021 at 2:27 PM Claude Grunspan <grunspa...@gmail.com> wrote:
>
> Thank you, Tom.
> I will try Athena-based Common Crawl index first for sure and then the second option you propose.
>
> Claude
>
> On Sat, Feb 20, 2021 at 8:23 PM, Tom Morris <tfmo...@gmail.com> wrote:
>
> On Sat, Feb 20, 2021 at 1:26 PM Claude Grunspan <grunspa...@gmail.com> wrote:
>
> In fact, I would like to know how to search for how many of Asimov's writings (or those of another science fiction
> writer, preferably one not under copyright) are part of the Common Crawl corpus.
> Where should I search?
>
>
> What you are looking for is a search engine, which isn't one of the things that Common Crawl offers. There have been some
> efforts to build search engines on top of the Common Crawl data, but I don't think any of them are currently active. One example
> is/was Elastic ChatNoir https://www.chatnoir.eu/?q=isaac+asimov. Of course, any of the
> main search engines (Google, Bing, DuckDuckGo, Yandex, etc.) could do the same searches for you on the live web.
>
> With the Common Crawl data, you have two options:
> 1. Use the Athena-based Common Crawl index to search for likely keywords in URLs, which will be cheap and fast, but require a
> second level of validation to weed out book reviews, author biographies, etc.
> 2. Use Spark/Hadoop to do a brute force search across all the page captures, which will be computationally expensive.
>
> Tom
>
> On Sat, Feb 20, 2021 at 7:13 PM, Tom Morris <tfmo...@gmail.com> wrote:
>
>
> On Sat, Feb 20, 2021 at 4:34 AM Claude Grunspan <grunspa...@gmail.com> wrote:
>
>
> 1) Could anyone from this list help me know how I can trace Isaac Asimov's texts in Common Crawl? With which tools?
> (Whether URLs or datasets)
>
>
> For a targeted literature selection such as this, you'd be better off with HathiTrust, OpenLibrary, or a similar
> resource, BUT Isaac Asimov has only been writing since 1938, so all of his works are still in copyright.
>
> Tom
>
>
> --
> You received this message because you are subscribed to a topic in the Google Groups "Common Crawl" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/common-crawl/KTjN1VEaPKQ/unsubscribe.
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/CABPGxYBWY53EFGVeAKpsdzN0rS8OnNU-ftJgyqh6BDZJ%2Bzd%2B%2BQ%40mail.gmail.com.