Download common crawl (42B token) corpus


matt.si...@gmail.com

Mar 12, 2018, 5:18:31 PM
to GloVe: Global Vectors for Word Representation
Hello, where can I download the full 42B-token Common Crawl corpus?

Thanks,
Matt

Bob van Luijt

May 30, 2018, 1:53:57 PM
to GloVe: Global Vectors for Word Representation

prstev...@gmail.com

Jun 4, 2019, 4:55:03 PM
to GloVe: Global Vectors for Word Representation
Which of the versions on this page was used to generate the GloVe vectors available for download at https://nlp.stanford.edu/projects/glove/?
Thanks.

n8rro...@gmail.com

Nov 12, 2019, 7:27:35 PM
to GloVe: Global Vectors for Word Representation
Did you all ever find an answer to this question? I'm looking for the 840B-token Common Crawl (a 2.03 GB download according to the GloVe website). The Common Crawl website that Bob van Luijt posted has many datasets available, but they all seem to be much bigger (on the order of 50 TB compressed). I just want to find the same corpus that GloVe used. Any hints?