First timer with Common Crawl


Jinseung Eu

Mar 28, 2016, 7:47:58 PM
to Common Crawl
Hello! I am a linguist building a language reference tool based on large-scale data. I was very happy to find CC. I would like to access the corpus and do some searches. I especially want to compare its size with Google's search base. This will tell me whether I can use CC for building my tool as an alternative to Google search. However, as a non-programmer I do not understand much of the explanations on the CC site. So I still haven't figured out how to do simple searches on CC (entering a search phrase, getting its frequency and examples). Can I just launch my Amazon EMR cluster and follow the tutorial? Or are there other things I need to do before getting there? Can I access it without much knowledge of programming? Thank you very much for your attention.

Tom Morris

Mar 29, 2016, 12:49:40 AM
to common...@googlegroups.com
On Mon, Mar 28, 2016 at 7:47 PM, Jinseung Eu <jin...@hotmail.com> wrote:
Hello! I am a linguist building a language reference tool based on large-scale data. I was very happy to find CC. I would like to access the corpus and do some searches.

Currently, the only search tool based on CommonCrawl searches by domain, subdomain, and URL prefix. It does not search page content. There's no publicly available search tool over CommonCrawl that is similar to Google/Bing/Yandex, as far as I'm aware.
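
For what it's worth, that URL index is queryable over plain HTTP at index.commoncrawl.org, so even a short script can use it. A minimal sketch (the crawl ID is an assumption; pick any snapshot listed on the index page):

    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    CRAWL = "CC-MAIN-2016-07"           # assumed snapshot ID; any listed crawl works
    query = quote("commoncrawl.org/*")  # URL-prefix query -- the only kind supported
    api = "http://index.commoncrawl.org/%s-index?url=%s&output=json" % (CRAWL, query)

    with urlopen(api) as resp:
        for line in resp:
            rec = json.loads(line)      # one JSON object per captured page
            print(rec["url"], rec.get("status"))

Again, this only tells you which URLs were captured; it says nothing about page content.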

I especially want to compare its size with Google's search base.  This will tell me if I can use CC for building my tool as an alternative to Google search. 

That's an easy question to answer even without a search engine. CommonCrawl is orders of magnitude smaller than the Google corpus (but then, the Google corpus isn't available to most researchers, so it doesn't really matter how big it is).
 
However, as a non-programmer I do not understand much of the explanations on the CC site. So I still haven't figured out how to do simple searches on CC (entering a search phrase, getting its frequency and examples). Can I just launch my Amazon EMR cluster and follow the tutorial? Or are there other things I need to do before getting there? Can I access it without much knowledge of programming?

I'm afraid you'll need to know some programming to make much progress. What CommonCrawl provides is the content of one or two billion web pages, as they existed at the moment they were crawled (crawls run roughly monthly). Pretty much everything else you need to provide yourself.
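
To give a concrete sense of what "the content of web pages" means here: the crawl is distributed as WARC files, and even just reading one takes a little code. A rough sketch using the third-party warcio library (one common WARC reader; not part of CommonCrawl itself, and the file name is a placeholder):

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    # "sample.warc.gz" stands in for any downloaded crawl segment
    with open("sample.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":  # actual page fetches
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()  # raw bytes of the page
                print(url, len(html))

Everything beyond this step -- text extraction, language detection, indexing, search -- is the part you'd have to build or borrow.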

If you're primarily interested in textual content, you might be interested in the C4Corpus, which was recently mentioned on the list. They have a paper accepted for LREC, and their GitHub repo includes the extraction software and pointers to the corpus. The English portion of their corpus is 7.7M pages and 7.7B tokens.

The nice thing about using that as a starting point is that it's already sorted by language, has the boilerplate (imperfectly) removed, has the character encoding figured out, etc. If you had a list of phrases that you wanted to enumerate, you could write a quick MapReduce job to extract them along with their associated context (or you could feed the text into ElasticSearch or the search engine of your choice and create a web search tool based on the data).
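
As a sketch of what such a "quick MapReduce job" might look like as a Hadoop Streaming mapper (the phrase list and context width below are made up for illustration, not taken from any existing tutorial):

    #!/usr/bin/env python3
    # mapper.py: emit (phrase, surrounding context) for each occurrence.
    # Pair with an identity reducer to collect examples, or count per
    # key to get frequencies.
    import sys

    PHRASES = ["in spite of", "by and large"]  # hypothetical target phrases

    for line in sys.stdin:
        lower = line.lower()
        for phrase in PHRASES:
            start = lower.find(phrase)
            while start != -1:
                # keep 40 characters of context on each side of the match
                ctx = line[max(0, start - 40): start + len(phrase) + 40]
                print("%s\t%s" % (phrase, " ".join(ctx.split())))
                start = lower.find(phrase, start + 1)

You'd launch that on EMR with the hadoop-streaming jar and the corpus text as input; the exact invocation depends on the cluster setup.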

Tom 

Jin

Mar 29, 2016, 3:41:46 AM
to Common Crawl
Dear Tom 

Thank you very much for your reply. I thought CC was bigger than the Google corpus because it is supposed to be "petabytes of data". The Google corpus is said to contain over a trillion words; I am not sure if that is just English or all languages. I am only interested in the English portion. Do you know how many words CC contains? I just want to know how you know that CC is much smaller than the Google corpus. Thanks.

Ivan Habernal

Mar 29, 2016, 4:04:02 AM
to Common Crawl
Hi Jin,

as Tim mentioned, the C4Corpus has "clean" text from CommonCrawl, which might be a good starting point for your use case. I've added an example of a simple regex search over C4Corpus, see https://github.com/dkpro/dkpro-c4corpus/issues/26 (you'll find a short description in the documentation: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-doc/src/main/asciidoc/documentation/C4CorpusUsersGuide.adoc under "Use-case: search..."). You still need some knowledge of regular expressions and of running MapReduce jobs on a Hadoop cluster, though.
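
The linked example is a Java MapReduce job; purely as a toy illustration of the same idea (regex search plus frequency over extracted plain text), something like this would work locally on a small text dump (the file name and pattern are made up):

    import re
    from collections import Counter

    pattern = re.compile(r"\bdepend(s|ed|ing)? (on|upon)\b")  # hypothetical query
    counts = Counter()

    with open("c4corpus_sample.txt", encoding="utf-8") as f:  # assumed local dump
        for line in f:
            for m in pattern.finditer(line.lower()):
                counts[m.group(0)] += 1

    for form, n in counts.most_common():
        print(n, form)

The MapReduce version just distributes this inner loop across the cluster.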

For word counting, have a look at the other example "Word count example":

https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-doc/src/main/asciidoc/documentation/C4CorpusDevelopersGuide.adoc
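
That guide shows the Java version; for orientation only, the same computation expressed as a pair of Hadoop Streaming scripts looks roughly like this (file names are placeholders):

    #!/usr/bin/env python3
    # mapper.py: emit (word, 1) for every token
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())

and:

    #!/usr/bin/env python3
    # reducer.py: sum counts per word; Hadoop delivers input sorted by key
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))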

Hope it helps!

Best,

Ivan

Ivan Habernal

Mar 29, 2016, 4:04:40 AM
to Common Crawl

as Tim mentioned

I meant Tom, of course :)

Ivan

Jin

Mar 29, 2016, 4:33:59 AM
to Common Crawl
Thank you very much for the information. However, if its size is only 7.7B tokens (words), it is way too small for my purpose. It would be better to use the Google Books Corpus, which is 155B words, and even this is too small for me. I really need a corpus the size of Google. Am I screwed? 

Ivan Habernal

Mar 29, 2016, 4:48:39 AM
to Common Crawl
Hi Jin,

I'm wondering, then, what your use case is, if CommonCrawl is too small for it. It's been used successfully in many NLP tasks: training word embeddings with GloVe (https://github.com/stanfordnlp/GloVe), machine translation (Smith et al., 2013), and language models (Buck et al., 2014). Since you mentioned a "language reference tool", you should also look at the Web as Corpus (WaC) initiative and similar projects; see e.g. Schäfer & Bildhauer (2013).

Best,

Ivan

References:

Smith, J. R., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., & Lopez, A. (2013). Dirt Cheap Web-Scale Parallel Text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1374–1383). Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P13-1135

Buck, C., Heafield, K., & van Ooyen, B. (2014). N-gram Counts and Language Models from the Common Crawl. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, … S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 3579–3584). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf

Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. doi:10.2200/S00508ED1V01Y201305HLT022

Jin

Mar 29, 2016, 9:54:46 AM
to Common Crawl

It has to be large enough for even rare or erroneous phrases to score a few hits.  

Tom Morris

Mar 29, 2016, 9:57:15 AM
to common...@googlegroups.com
I'm not sure where the petabytes reference to CommonCrawl came from, but it's exaggerated. The "modern" (2013 & later) crawls total about 0.76 petabytes and that includes lots of redundancy in both time and space.

If all you care about is size, there's no question that the Google Books N-gram corpus is bigger. It has 458 billion tokens[1] for English, BUT:

- it's OCR output, so filled with OCR errors
- you can't get the raw data, only N-grams where N<6
- it's historical with the most recent volumes from 2008

The CommonCrawl corpus, on the other hand, represents a single point in time with no history, has no OCR errors, represents English as it's used on the web rather than in printed literature, gives entire texts in context, etc.  They're just entirely different beasts with different strengths and weaknesses.

Tom

[1] I'm not sure where your 155B number comes from, but my number is from this paper: http://aclweb.org/anthology/P/P12/P12-3029.pdf

On Tue, Mar 29, 2016 at 4:33 AM, Jin <jin...@hotmail.com> wrote:
Thank you very much for the information. However, if its size is only 7.7B tokens (words), it is way too small for my purpose. It would be better to use the Google Books Corpus, which is 155B words, and even this is too small for me. I really need a corpus the size of Google. Am I screwed? 


Greg Lindahl

Mar 29, 2016, 11:47:32 AM
to common...@googlegroups.com
On Tue, Mar 29, 2016 at 06:54:45AM -0700, Jin wrote:
>
> It has to be large enough for even rare or erroneous phrases to score a few
> hits.

Were you imagining that this corpus would be human-written? Because a lot of the web is generated by programs trying to attract humans and get them to click on ads.

-- greg
