Obtaining Russian Data


Lewis John Mcgibbney

Oct 6, 2015, 12:16:11 AM
to Common Crawl
Hi Folks,
I'm interested in obtaining as much Russian data as I can, as I would like to build a Russian --> English translation model for Joshua Decoder.
Can anyone please point me at Russian data crawled and available through Common Crawl?
This might also be a good opportunity for me to ask about crawl data from other languages which is available through Common Crawl. Is there a comprehensive list somewhere?
Thank you very much in advance for any feedback.
Lewis

Tom Morris

Oct 6, 2015, 2:07:58 PM
to common...@googlegroups.com
TL;DR You can get 1.8B Russian sentences derived from the 2012 & 2013 Common Crawl corpora here: http://data.statmt.org/ngrams/deduped/ru.xz (37GB compressed)
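
In case it's useful, here's a minimal Python sketch for peeking at that file by streaming it, so you don't need disk space for the full decompressed text (this just assumes the URL above is still live):

    # Stream ru.xz and decompress on the fly, counting sentences (one per line).
    import lzma
    import urllib.request

    URL = "http://data.statmt.org/ngrams/deduped/ru.xz"

    count = 0
    with urllib.request.urlopen(URL) as resp:
        # lzma.open accepts a file-like object and decompresses incrementally
        with lzma.open(resp, mode="rt", encoding="utf-8", errors="replace") as lines:
            for line in lines:
                count += 1
                if count <= 3:
                    print(line.rstrip())  # peek at the first few sentences
    print(f"{count:,} sentences")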

The 2014 Statistical Machine Translation workshop includes a parallel corpus derived from Common Crawl, with an EN-RU component of 878K sentences extracted from 123K pages on 21K domains: http://www.statmt.org/wmt14/translation-task.html
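
If you haven't used those files before, the corpus is distributed as line-aligned plain-text files (one sentence per line, line N of each file forming a pair), so reading sentence pairs is a few lines of Python. The filenames below are my assumption of what the archive contains; adjust to the actual names:

    # Iterate over a line-aligned EN-RU parallel corpus.
    # EN_FILE and RU_FILE are placeholder names.
    from itertools import islice

    EN_FILE = "commoncrawl.ru-en.en"  # English side, one sentence per line
    RU_FILE = "commoncrawl.ru-en.ru"  # aligned Russian side

    with open(EN_FILE, encoding="utf-8") as en, open(RU_FILE, encoding="utf-8") as ru:
        for en_sent, ru_sent in islice(zip(en, ru), 5):
            print(ru_sent.strip(), "->", en_sent.strip())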

There are also some papers which describe previous efforts to extract parallel texts and language models from the CommonCrawl:

Smith, Jason R., Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. "Dirt Cheap Web-Scale Parallel Text from the Common Crawl." In ACL (1), pp. 1374-1383. 2013.

Buck, Christian, Kenneth Heafield, and Bas van Ooyen. "N-gram counts and language models from the common crawl." In Proceedings of the Language Resources and Evaluation Conference. 2014.

As far as language stats go, the second paper reports the following breakdown (from running CLD2 on the WET files; a rough sketch of that kind of detection follows the table):

Language      Relative occurrence %         Size
               2012    2013    both         both
English       54.79   79.53   67.05    23.62 TiB
German         4.53    1.23    2.89     1.02 TiB
Spanish        3.91    1.68    2.80   986.86 GiB
French         4.01    1.14    2.59   912.16 GiB
Japanese       3.11    0.14    1.64   577.14 GiB
Russian        2.93    0.09    1.53   537.36 GiB
Polish         1.81    0.08    0.95   334.31 GiB
Italian        1.40    0.44    0.92   325.58 GiB
Portuguese     1.32    0.48    0.90   316.87 GiB
Chinese        1.45    0.04    0.75   264.91 GiB
Dutch          0.95    0.22    0.59   207.90 GiB
other         12.23   12.57   12.40     4.37 TiB
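
For anyone who wants to reproduce that kind of breakdown, a rough sketch over a single WET file (assumes pip install warcio pycld2; the filename is a placeholder for any entry in a Common Crawl wet.paths listing, and warcio is just a convenient reader, not necessarily what the paper used):

    # Tally per-language text volume for one WET file, in the spirit of
    # the CLD2 pass in Buck et al. The WET filename is a placeholder.
    from collections import Counter

    import pycld2
    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text extracts
                continue
            text = record.content_stream().read().decode("utf-8", "replace")
            try:
                is_reliable, _, details = pycld2.detect(text)
            except pycld2.error:
                continue
            if is_reliable:
                counts[details[0][0]] += len(text)  # chars per detected language

    for lang, chars in counts.most_common(10):
        print(f"{lang:15} {chars / 1e6:8.1f} Mchars")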

As you can see from the stats (see the original paper for the full version), there's a significant shift towards English from 2012 to 2013 (from 55% to 80%), with a corresponding reduction in non-English languages (e.g. Russian drops from ~3% to <0.1%).

The 2014-2015 crawls have yet another set of characteristics, but there have been anecdotal comments that they are English-biased as well: https://groups.google.com/d/msg/common-crawl/IGa7E680NUs/Ga8rr6GjDAAJ

The data produced by the pipeline described in the Buck paper is available at http://statmt.org/ngrams/, with raw sentences, deduped sentences, and language models all provided. The deduped directory includes 37GB of deduped Russian sentences: http://data.statmt.org/ngrams/deduped/
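
And if you grab one of those language models, the kenlm Python bindings can query it directly (a sketch; "ru.lm" is a placeholder for whichever model file you download, and input needs to match the model's tokenization):

    # Score sentences with a KenLM language model (log10 probabilities).
    # "ru.lm" is a placeholder filename; assumes pip install kenlm.
    import kenlm

    model = kenlm.Model("ru.lm")
    for sent in ["привет мир", "мир привет"]:
        # bos/eos add begin- and end-of-sentence symbols to the query
        print(f"{model.score(sent, bos=True, eos=True):8.2f}  {sent}")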

Tom





Lewis John Mcgibbney

Oct 6, 2015, 2:09:21 PM
to common...@googlegroups.com
Wow thank you Tom.
More than helpful... This is great.
Lewis 



Greg Lindahl

Oct 6, 2015, 3:44:50 PM
to common...@googlegroups.com
On Tue, Oct 06, 2015 at 02:07:56PM -0400, Tom Morris wrote:

> TL;DR You can get 1.8B Russian sentences derived from the 2012&2013
> CommonCrawl corpora here: http://data.statmt.org/ngrams/deduped/ru.xz (37GB
> compressed)

Wow, it's a pleasant surprise that there are so many!

-- greg

Tom Morris

Oct 6, 2015, 5:04:25 PM
to common...@googlegroups.com
It's a good-sized corpus, but looking at the raw, pre-dedupe files shows that it's almost all from the 2012 crawl:


ru.2012.raw.xz      18-Feb-2015 06:52    83G
ru.2013_1.raw.xz    18-Feb-2015 07:17   1.4G
ru.2013_2.raw.xz    18-Feb-2015 07:35   1.1G
ru.2014_1.raw.xz    18-Feb-2015 08:00   1.5G
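
A quick back-of-the-envelope check of how skewed that is:

    # Share of the raw Russian data coming from the 2012 crawl, using the
    # compressed sizes listed above (in GB).
    sizes = {"2012": 83.0, "2013_1": 1.4, "2013_2": 1.1, "2014_1": 1.5}
    total = sum(sizes.values())
    print(f"2012 share: {sizes['2012'] / total:.1%} of {total:.0f} GB")
    # -> 2012 share: 95.4% of 87 GB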

That's why Christian Buck was lobbying for more non-English data in future crawls.

For this use case of building language models, I'm sure the 2012 corpus will be fine, but other applications will want fresher data.

Tom


 