Generation of a Russian --> English Language Pack


Lewis John Mcgibbney

Oct 1, 2015, 7:28:51 PM
to Joshua Developers
Hi Folks,
Does anyone have an interest/requirement to generate a Russian --> English language pack?
We are currently crawling large volumes of data from the Web which is in Russian, and we would like to do data analysis over this data but cannot because of the Russian source language.
Thanks
Lewis

Matt Post

Oct 1, 2015, 7:30:52 PM
to joshua_d...@googlegroups.com
I'm interested.

There is also lots of training and development data available at http://statmt.org/wmt15/translation-task.html.

matt



Lewis John Mcgibbney

Oct 1, 2015, 9:59:14 PM
to joshua_d...@googlegroups.com
OK Matt, grand.
What format does the input data need to be in? What volume do we need?
I am keen to get a TIKA service up and running relying on Joshua as soon as this is available.
Thanks

--
Lewis

Matt Post

Oct 2, 2015, 3:50:59 PM
to joshua_d...@googlegroups.com
I just need a parallel corpus: two files, one with Russian and the other with English, aligned line by line.
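For a quick sanity check that the two files really are line-by-line parallel, something like the following works (a minimal sketch; the file names corpus.ru and corpus.en are just placeholders for your own data):

```python
# Minimal sanity check for a line-aligned parallel corpus.
# File names are hypothetical placeholders; adjust to your own data.
import itertools

RU_FILE = "corpus.ru"   # one Russian sentence per line (assumed)
EN_FILE = "corpus.en"   # the matching English sentence on the same line (assumed)

with open(RU_FILE, encoding="utf-8") as ru, open(EN_FILE, encoding="utf-8") as en:
    pairs = 0
    for ru_line, en_line in itertools.zip_longest(ru, en):
        # zip_longest yields None once one file runs out, which means
        # the two files are not line-by-line parallel.
        if ru_line is None or en_line is None:
            raise SystemExit("Files have different line counts; corpus is not parallel")
        pairs += 1

print(f"{pairs} sentence pairs")
```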

matt

Lewis John Mcgibbney

Oct 2, 2015, 4:49:27 PM
to joshua_d...@googlegroups.com
How large?

Matt Post

Oct 2, 2015, 4:51:03 PM
to joshua_d...@googlegroups.com
As big as possible: the more, the better. I'd say at least 100k sentence pairs, but ideally in the millions (for reference, the Europarl corpora usually have about 2 million sentence pairs; see the http://statmt.org/wmt15/translation-task.html summary for more information).

Lewis John Mcgibbney

Oct 2, 2015, 5:22:12 PM
to joshua_d...@googlegroups.com
ACK.
I'll see what I can do. Thanks, and have a great weekend.

Matt Post

Oct 5, 2015, 10:04:46 AM
to joshua_d...@googlegroups.com
Keep in mind, there are almost 1M parallel sentences in the RU-EN Common Crawl. More data is better, but this should be enough for a reasonable first-pass system.

matt

Lewis John Mcgibbney

Oct 6, 2015, 2:01:44 PM
to joshua_d...@googlegroups.com
I am getting personal responses as well, so I'll keep as much info as I can looped back here.
Ta
Lewis

Matt Post

Oct 6, 2015, 2:03:40 PM
to joshua_d...@googlegroups.com
Sounds good.

Matt Post

Oct 6, 2015, 2:04:34 PM
to joshua_d...@googlegroups.com
Did you look at the data released for WMT15? I expect that this is the largest piece you're going to find.

Lewis John Mcgibbney

Jan 11, 2016, 1:26:25 PM
to Joshua Developers
Hi Matt,
OK, I am working on generating the language models this week, with the goal of having them completed by the weekend.
I managed to locate a bunch of data which I think we can use for building models. The data can be found below:

English
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz

Czech
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.cs.shuffled.gz

French
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.fr.shuffled.gz

German
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz

Hindi
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.hi.shuffled.gz

Russian
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.ru.shuffled.gz

I am going to go ahead and experiment with generating the language pack for Russian based on the documentation provided at http://joshua-decoder.org/6.0/bundle.html.
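For anyone who wants to follow along, here is a minimal sketch of pulling down one of the files listed above and counting its sentences (the Russian file is used as the example; the local file name is an arbitrary choice, and the download is large):

```python
# Minimal sketch: fetch one of the monolingual news-crawl files listed above
# and count its sentences (one per line). Local file name is a placeholder.
import gzip
import urllib.request

URL = "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.ru.shuffled.gz"
LOCAL = "news.2013.ru.shuffled.gz"

urllib.request.urlretrieve(URL, LOCAL)  # large download; may take a while

sentences = 0
with gzip.open(LOCAL, "rt", encoding="utf-8", errors="replace") as f:
    for _ in f:
        sentences += 1

print(f"{sentences} sentences in {LOCAL}")
```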

In addition to this, I am working on augmenting Apache Tika with a NetworkTranslator as described in https://issues.apache.org/jira/browse/TIKA-1343.
I would really appreciate your input in helping to build the above models. I am also very interested in building a model for Mandarin Chinese --> English, so I am actively looking for data which would facilitate this.
Thanks


Matt Post

Jan 11, 2016, 2:32:22 PM
to joshua_d...@googlegroups.com
Hi Lewis,

That's the monolingual data. You want the parallel data (see the entries under the "Parallel data" box). The monolingual data you pointed to can be used to build a larger language model, but that's a bit more complicated.
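For a Russian --> English system the language model is built on the target (English) side, e.g. from the English news crawl. Below is a minimal sketch of the generic KenLM recipe; it assumes KenLM is installed with lmplz on your PATH, uses placeholder file names, and is not necessarily how the Joshua pipeline wires this in:

```python
# Minimal sketch: estimate a 5-gram language model from a monolingual corpus
# with KenLM's lmplz. Assumes KenLM is installed and lmplz is on PATH; file
# names are placeholders. This illustrates the general recipe only, not the
# exact steps the Joshua pipeline performs.
import subprocess

CORPUS = "news.2013.en.shuffled"  # one English sentence per line, decompressed
ARPA = "lm.en.arpa"

with open(CORPUS, "rb") as src, open(ARPA, "wb") as dst:
    subprocess.run(["lmplz", "-o", "5"], stdin=src, stdout=dst, check=True)

print(f"Wrote ARPA-format language model to {ARPA}")
```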

I am working on building a Chinese–English model. The parallel data for that is harder to acquire because it's all tied up with DARPA stuff. Is that all right, or do you want to build it yourself?

matt




Matt Post

Jan 11, 2016, 2:33:06 PM
to joshua_d...@googlegroups.com
FYI, I will also soon release language packs that do not require an outside installation of Joshua, but will include the runtime compiled against Java 7.

matt


Lewis John Mcgibbney

Jan 11, 2016, 2:37:00 PM
to joshua_d...@googlegroups.com
Thanks for the explanation... acknowledged.


By all means, please generate the Chinese --> English model; this would be extremely helpful indeed!
I'll keep you updated here.
Thanks

Lewis John Mcgibbney

Jan 11, 2016, 2:37:12 PM
to joshua_d...@googlegroups.com
Cool



Lewis John Mcgibbney

Feb 18, 2016, 11:46:22 PM
to Joshua Developers
Hi Matt,
Did you ever manage to get cracking on the Chinese --> English model? BTW, what variety of Chinese?
I am nearly finished with the Russian --> English pack, which I will make available for public use ASAP.
Thanks
Lewis

Matt Post

Feb 21, 2016, 2:40:50 PM
to joshua_d...@googlegroups.com
There is a Chinese–English language pack posted (http://joshua-decoder.org/language-packs/). It's a Mandarin model, built mostly on newswire text. It includes the Joshua runtime and does not use KenLM, so there are zero external dependencies.





matt



Lewis John Mcgibbney

Feb 21, 2016, 8:44:02 PM
to joshua_d...@googlegroups.com
I actually saw this when I was looking at the website.
This is dynamite.
We are probably going to try an en --> Mandarin model pretty soon.
Out of curiosity, how are you evaluating the 'quality' of the language pack?
Thanks



Matt Post

Feb 22, 2016, 9:26:59 AM
to joshua_d...@googlegroups.com
We compute a BLEU score on some held-out test data, usually in the news domain. It would be better to be more systematic about this, e.g., collecting multiple test sets, recording scores on all of them, and managing this information (it's currently recorded only in the private directory where I built the models).
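As a rough illustration (not the exact scoring setup used for the released packs), corpus-level BLEU can be computed with NLTK along the following lines; the file names are placeholders, the files are line-aligned with one sentence per line, and plain whitespace tokenization stands in for a proper tokenizer:

```python
# Minimal sketch of scoring decoder output against a held-out reference with
# corpus-level BLEU using NLTK. File names are placeholders; references and
# hypotheses are line-aligned, one sentence per line, whitespace-tokenized.
from nltk.translate.bleu_score import corpus_bleu

with open("test.reference.en", encoding="utf-8") as f:
    references = [[line.split()] for line in f]   # one reference per sentence

with open("test.output.en", encoding="utf-8") as f:
    hypotheses = [line.split() for line in f]

assert len(references) == len(hypotheses), "reference/hypothesis line counts differ"

print(f"BLEU = {corpus_bleu(references, hypotheses):.4f}")
```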

matt

Lewis John Mcgibbney

unread,
Feb 22, 2016, 11:39:19 AM2/22/16
to joshua_d...@googlegroups.com
Yeah, good idea Matt.
We could log an issue and address this over on the website, I suppose.
I think it would provide helpful insight into the quality of the language packs.