Obtaining Russian Data


Lewis John Mcgibbney

Oct 6, 2015, 12:16:11 AM
to Common Crawl
Hi Folks,
I'm interested in obtaining as much Russian data as I can, as I would like to build a Russian --> English translation model for Joshua Decoder.
Can anyone please point me at Russian data crawled and available through Common Crawl?
This might also be a good opportunity for me to ask about crawl data from other languages which is available through Common Crawl. Is there a comprehensive list somewhere?
Thank you very much in advance for any feedback.
Lewis

Tom Morris

Oct 6, 2015, 2:07:58 PM
to common...@googlegroups.com
TL;DR You can get 1.8B Russian sentences derived from the 2012 & 2013 Common Crawl corpora here: http://data.statmt.org/ngrams/deduped/ru.xz (37GB compressed)
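
In case it's useful, here's a minimal Python sketch for peeking at that file by streaming it, so you don't need disk space for the full decompressed text (this just assumes the URL above is still live):

    # Stream ru.xz and decompress on the fly, counting sentences (one per line).
    import lzma
    import urllib.request

    URL = "http://data.statmt.org/ngrams/deduped/ru.xz"

    count = 0
    with urllib.request.urlopen(URL) as resp:
        # lzma.open accepts a file-like object and decompresses incrementally
        with lzma.open(resp, mode="rt", encoding="utf-8", errors="replace") as lines:
            for line in lines:
                count += 1
                if count <= 3:
                    print(line.rstrip())  # peek at the first few sentences
    print(f"{count:,} sentences")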

The 2014 Statistical Machine Translation workshop includes a parallel corpus derived from Common Crawl, with an EN-RU component of 878K sentences extracted from 123K pages on 21K domains: http://www.statmt.org/wmt14/translation-task.html
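
If you haven't used those files before, the corpus is distributed as line-aligned plain-text files (one sentence per line, line N of each file forming a pair), so reading sentence pairs is a few lines of Python. The filenames below are my assumption of what the archive contains; adjust to the actual names:

    # Iterate over a line-aligned EN-RU parallel corpus.
    # EN_FILE and RU_FILE are placeholder names.
    from itertools import islice

    EN_FILE = "commoncrawl.ru-en.en"  # English side, one sentence per line
    RU_FILE = "commoncrawl.ru-en.ru"  # aligned Russian side

    with open(EN_FILE, encoding="utf-8") as en, open(RU_FILE, encoding="utf-8") as ru:
        for en_sent, ru_sent in islice(zip(en, ru), 5):
            print(ru_sent.strip(), "->", en_sent.strip())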

There are also some papers which describe previous efforts to extract parallel texts and language models from the CommonCrawl:

Smith, Jason R., Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. "Dirt Cheap Web-Scale Parallel Text from the Common Crawl." In ACL (1), pp. 1374-1383. 2013.

Buck, Christian, Kenneth Heafield, and Bas van Ooyen. "N-gram counts and language models from the common crawl." In Proceedings of the Language Resources and Evaluation Conference. 2014.

As far as language stats go, the second paper reports the following breakdown (from running CLD2 on the WET files; a rough sketch of that kind of detection follows the table):

Language      Relative occurrence %         Size
               2012    2013    both         both
English       54.79   79.53   67.05    23.62 TiB
German         4.53    1.23    2.89     1.02 TiB
Spanish        3.91    1.68    2.80   986.86 GiB
French         4.01    1.14    2.59   912.16 GiB
Japanese       3.11    0.14    1.64   577.14 GiB
Russian        2.93    0.09    1.53   537.36 GiB
Polish         1.81    0.08    0.95   334.31 GiB
Italian        1.40    0.44    0.92   325.58 GiB
Portuguese     1.32    0.48    0.90   316.87 GiB
Chinese        1.45    0.04    0.75   264.91 GiB
Dutch          0.95    0.22    0.59   207.90 GiB
other         12.23   12.57   12.40     4.37 TiB
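
For anyone who wants to reproduce that kind of breakdown, a rough sketch over a single WET file (assumes pip install warcio pycld2; the filename is a placeholder for any entry in a Common Crawl wet.paths listing, and warcio is just a convenient reader, not necessarily what the paper used):

    # Tally per-language text volume for one WET file, in the spirit of
    # the CLD2 pass in Buck et al. The WET filename is a placeholder.
    from collections import Counter

    import pycld2
    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text extracts
                continue
            text = record.content_stream().read().decode("utf-8", "replace")
            try:
                is_reliable, _, details = pycld2.detect(text)
            except pycld2.error:
                continue
            if is_reliable:
                counts[details[0][0]] += len(text)  # chars per detected language

    for lang, chars in counts.most_common(10):
        print(f"{lang:15} {chars / 1e6:8.1f} Mchars")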

As you can see from the stats (see the original paper for the full version), there's a significant shift towards English from 2012 to 2013 (from 55% to 80%), with a corresponding reduction in non-English languages (e.g. Russian drops from ~3% to <0.1%).

The 2014-2015 crawls have yet another set of characteristics, but there have been anecdotal comments that they are English-biased as well: https://groups.google.com/d/msg/common-crawl/IGa7E680NUs/Ga8rr6GjDAAJ

The data produced by the pipeline described in the Buck paper is available at http://statmt.org/ngrams/, with raw sentences, deduped sentences, and language models all provided. The deduped directory includes 37GB of deduped Russian sentences: http://data.statmt.org/ngrams/deduped/
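
And if you grab one of those language models, the kenlm Python bindings can query it directly (a sketch; "ru.lm" is a placeholder for whichever model file you download, and input needs to match the model's tokenization):

    # Score sentences with a KenLM language model (log10 probabilities).
    # "ru.lm" is a placeholder filename; assumes pip install kenlm.
    import kenlm

    model = kenlm.Model("ru.lm")
    for sent in ["привет мир", "мир привет"]:
        # bos/eos add begin- and end-of-sentence symbols to the query
        print(f"{model.score(sent, bos=True, eos=True):8.2f}  {sent}")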

Tom





Lewis John Mcgibbney

Oct 6, 2015, 2:09:21 PM
to common...@googlegroups.com
Wow thank you Tom.
More than helpful... This is great.
Lewis 



Greg Lindahl

Oct 6, 2015, 3:44:50 PM
to common...@googlegroups.com
On Tue, Oct 06, 2015 at 02:07:56PM -0400, Tom Morris wrote:

> TL;DR You can get 1.8B Russian sentences derived from the 2012&2013
> CommonCrawl corpora here: http://data.statmt.org/ngrams/deduped/ru.xz (37GB
> compressed)

Wow, it's a pleasant surprise that there are so many!

-- greg

Tom Morris

Oct 6, 2015, 5:04:25 PM
to common...@googlegroups.com
It's a good-sized corpus, but looking at the raw, pre-dedupe files shows that it's almost all from the 2012 crawl:


ru.2012.raw.xz      18-Feb-2015 06:52    83G
ru.2013_1.raw.xz    18-Feb-2015 07:17   1.4G
ru.2013_2.raw.xz    18-Feb-2015 07:35   1.1G
ru.2014_1.raw.xz    18-Feb-2015 08:00   1.5G
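
A quick back-of-the-envelope check of how skewed that is:

    # Share of the raw Russian data coming from the 2012 crawl, using the
    # compressed sizes listed above (in GB).
    sizes = {"2012": 83.0, "2013_1": 1.4, "2013_2": 1.1, "2014_1": 1.5}
    total = sum(sizes.values())
    print(f"2012 share: {sizes['2012'] / total:.1%} of {total:.0f} GB")
    # -> 2012 share: 95.4% of 87 GB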

That's why Christian Buck was lobbying for more non-English data in future crawls.

For this use case of building language models, I'm sure the 2012 corpus will be fine, but other applications will want fresher data.

Tom


 