Generation of a Russian --> English Language Pack


Lewis John Mcgibbney

Oct 1, 2015, 7:28:51 PM
to Joshua Developers
Hi Folks,
Does anyone have an interest/requirement to generate a Russian --> English language pack?
We are currently crawling large volumes of data from the Web which is in Russian, and we would like to do data analysis over this data but cannot because of the Russian source language.
Thanks
Lewis

Matt Post

Oct 1, 2015, 7:30:52 PM
to joshua_d...@googlegroups.com
I'm interested.

There is also lots of training and development data available at http://statmt.org/wmt15/translation-task.html.

matt



Lewis John Mcgibbney

Oct 1, 2015, 9:59:14 PM
to joshua_d...@googlegroups.com
OK Matt, grand.
What format does the input data need to be in? What volume do we need?
I am keen to get a TIKA service up and running relying on Joshua as soon as this is available.
Thanks

--
Lewis

Matt Post

Oct 2, 2015, 3:50:59 PM
to joshua_d...@googlegroups.com
I just need a parallel corpus: two files, one with Russian and the other with English, aligned line by line.
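For a quick sanity check that the two files really are line-by-line parallel, something like the following works (a minimal sketch; the file names corpus.ru and corpus.en are just placeholders for your own data):

```python
# Minimal sanity check for a line-aligned parallel corpus.
# File names are hypothetical placeholders; adjust to your own data.
import itertools

RU_FILE = "corpus.ru"   # one Russian sentence per line (assumed)
EN_FILE = "corpus.en"   # the matching English sentence on the same line (assumed)

with open(RU_FILE, encoding="utf-8") as ru, open(EN_FILE, encoding="utf-8") as en:
    pairs = 0
    for ru_line, en_line in itertools.zip_longest(ru, en):
        # zip_longest yields None once one file runs out, which means
        # the two files are not line-by-line parallel.
        if ru_line is None or en_line is None:
            raise SystemExit("Files have different line counts; corpus is not parallel")
        pairs += 1

print(f"{pairs} sentence pairs")
```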

matt

Lewis John Mcgibbney

Oct 2, 2015, 4:49:27 PM
to joshua_d...@googlegroups.com
How large?

Matt Post

Oct 2, 2015, 4:51:03 PM
to joshua_d...@googlegroups.com
As big as possible: the more, the better. I'd say at least 100k sentence pairs, but ideally in the millions (for reference, the Europarl corpora usually have about 2 million sentence pairs; see the http://statmt.org/wmt15/translation-task.html summary for more information).

Lewis John Mcgibbney

Oct 2, 2015, 5:22:12 PM
to joshua_d...@googlegroups.com
ACK.
I'll see what I can do. Thanks, and have a great weekend.

Matt Post

Oct 5, 2015, 10:04:46 AM
to joshua_d...@googlegroups.com
Keep in mind, there are almost 1M parallel sentences in the RU-EN Common Crawl. More data is better, but this should be enough for a reasonable first-pass system.

matt

Lewis John Mcgibbney

Oct 6, 2015, 2:01:44 PM
to joshua_d...@googlegroups.com
I am getting personal responses as well, so I'll keep as much info as I can looped back here.
Ta
Lewis

Matt Post

Oct 6, 2015, 2:03:40 PM
to joshua_d...@googlegroups.com
Sounds good.

Matt Post

Oct 6, 2015, 2:04:34 PM
to joshua_d...@googlegroups.com
Did you look at the data released for WMT15? I expect that this is the largest piece you're going to find.

Lewis John Mcgibbney

Jan 11, 2016, 1:26:25 PM
to Joshua Developers
Hi Matt,
OK, I am working on generating the language models this week, with the goal of having them completed by the weekend.
I managed to locate a bunch of data which I think we can use for building models. The data can be found below:

English
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz

Czech
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.cs.shuffled.gz

French
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.fr.shuffled.gz

German
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz

Hindi
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.hi.shuffled.gz

Russian
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.ru.shuffled.gz

I am going to go ahead and experiment with generating the language pack for Russian based on the documentation provided at http://joshua-decoder.org/6.0/bundle.html.
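For anyone who wants to follow along, here is a minimal sketch of pulling down one of the files listed above and counting its sentences (the Russian file is used as the example; the local file name is an arbitrary choice, and the download is large):

```python
# Minimal sketch: fetch one of the monolingual news-crawl files listed above
# and count its sentences (one per line). Local file name is a placeholder.
import gzip
import urllib.request

URL = "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.ru.shuffled.gz"
LOCAL = "news.2013.ru.shuffled.gz"

urllib.request.urlretrieve(URL, LOCAL)  # large download; may take a while

sentences = 0
with gzip.open(LOCAL, "rt", encoding="utf-8", errors="replace") as f:
    for _ in f:
        sentences += 1

print(f"{sentences} sentences in {LOCAL}")
```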

In addition to this, I am working on augmenting Apache Tika with a NetworkTranslator as described in https://issues.apache.org/jira/browse/TIKA-1343.
I would really appreciate your input in helping to build the above models. I am also very interested in building a model for Mandarin Chinese --> English, so I am actively looking for data which would facilitate this.
Thanks


Matt Post

Jan 11, 2016, 2:32:22 PM
to joshua_d...@googlegroups.com
Hi Lewis,

That's the monolingual data. You want the parallel data (see the entries under the "Parallel data" box). The monolingual data you pointed to can be used to build a larger language model, but that's a bit more complicated.
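For a Russian --> English system the language model is built on the target (English) side, e.g. from the English news crawl. Below is a minimal sketch of the generic KenLM recipe; it assumes KenLM is installed with lmplz on your PATH, uses placeholder file names, and is not necessarily how the Joshua pipeline wires this in:

```python
# Minimal sketch: estimate a 5-gram language model from a monolingual corpus
# with KenLM's lmplz. Assumes KenLM is installed and lmplz is on PATH; file
# names are placeholders. This illustrates the general recipe only, not the
# exact steps the Joshua pipeline performs.
import subprocess

CORPUS = "news.2013.en.shuffled"  # one English sentence per line, decompressed
ARPA = "lm.en.arpa"

with open(CORPUS, "rb") as src, open(ARPA, "wb") as dst:
    subprocess.run(["lmplz", "-o", "5"], stdin=src, stdout=dst, check=True)

print(f"Wrote ARPA-format language model to {ARPA}")
```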

I am working on building a Chinese–English model. The parallel data for that is harder to acquire because it's all tied up with DARPA stuff. Is that all right, or do you want to build it yourself?

matt




Matt Post

Jan 11, 2016, 2:33:06 PM
to joshua_d...@googlegroups.com
FYI, I will also soon release language packs that do not require an outside installation of Joshua, but will include the runtime compiled against Java 7.

matt


Lewis John Mcgibbney

Jan 11, 2016, 2:37:00 PM
to joshua_d...@googlegroups.com
Thanks for the explanation... acknowledged.


By all means, please generate the Chinese --> English model; this would be extremely helpful indeed!
I'll keep you updated here.
Thanks

Lewis John Mcgibbney

Jan 11, 2016, 2:37:12 PM
to joshua_d...@googlegroups.com
Cool



Lewis John Mcgibbney

Feb 18, 2016, 11:46:22 PM
to Joshua Developers
Hi Matt,
Did you ever manage to get cracking on the Chinese --> English model? BTW, what variety of Chinese?
I am nearly finished with the Russian --> English pack, which I will make available for public use ASAP.
Thanks
Lewis

Matt Post

Feb 21, 2016, 2:40:50 PM
to joshua_d...@googlegroups.com
There is a Chinese–English language pack posted (http://joshua-decoder.org/language-packs/). It's a Mandarin model, built mostly on newswire text. It includes the Joshua runtime and does not use KenLM, so there are zero external dependencies.





matt



Lewis John Mcgibbney

Feb 21, 2016, 8:44:02 PM
to joshua_d...@googlegroups.com
I actually saw this when I was looking at the website.
This is dynamite.
We are probably going to try an en --> Mandarin model pretty soon.
Out of curiosity, how are you evaluating the 'quality' of the language pack?
Thanks



Matt Post

Feb 22, 2016, 9:26:59 AM
to joshua_d...@googlegroups.com
We compute a BLEU score on some held-out test data, usually in the news domain. It would be better to be more systematic about this, e.g., collecting multiple test sets, recording scores on all of them, and managing this information (it's currently recorded only in the private directory where I built the models).
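As a rough illustration (not the exact scoring setup used for the released packs), corpus-level BLEU can be computed with NLTK along the following lines; the file names are placeholders, the files are line-aligned with one sentence per line, and plain whitespace tokenization stands in for a proper tokenizer:

```python
# Minimal sketch of scoring decoder output against a held-out reference with
# corpus-level BLEU using NLTK. File names are placeholders; references and
# hypotheses are line-aligned, one sentence per line, whitespace-tokenized.
from nltk.translate.bleu_score import corpus_bleu

with open("test.reference.en", encoding="utf-8") as f:
    references = [[line.split()] for line in f]   # one reference per sentence

with open("test.output.en", encoding="utf-8") as f:
    hypotheses = [line.split() for line in f]

assert len(references) == len(hypotheses), "reference/hypothesis line counts differ"

print(f"BLEU = {corpus_bleu(references, hypotheses):.4f}")
```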

matt

Lewis John Mcgibbney

unread,
Feb 22, 2016, 11:39:19 AM2/22/16
to joshua_d...@googlegroups.com
Yeah, good idea Matt.
We could log an issue and address this over on the website, I suppose.
I think it would provide helpful insight into the quality of the language packs.