Thrax 2.0


Juri Ganitkevitch

Mar 31, 2013, 11:55:26 AM
to Joshua Developers
Hey everyone,

Having finally probably-approximately-completed an epic effort dating back as far as 1300 B.C., we've pushed a major overhaul of Thrax. It is now in both Thrax's master branch and Joshua's submodule version.

Core changes:

- New extraction core. Jonny rewrote the rule extraction from scratch to be substantially cleaner and easier to understand.

- Integer-based representation. Thrax now collects a global vocabulary in a first pass over the data, and uses it to replace the previously string-based rule and lexprob table representations with integer-based ones. Alignments are no longer stored redundantly and have a more compact byte-based representation.

- Bugfixes. Bugs in the computation of lexical probabilities and some syntax-specific conditional probabilities were fixed.

- Feature polish. While the old feature keys are still understood, it's now one-key-one-feature, and the feature keys are a little easier to understand. Thrax now has an "alignment" feature that dumps the max-count alignment for each rule (currently as a feature value, Alignment=0-0:0-2:1-1, but subject to change). We'll update the documentation soon.
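
To make the integer-based representation and the alignment feature above a bit more concrete, here is a minimal Python sketch. It is not Thrax's actual code: all function names are hypothetical, and the one-byte-per-index packing is only an assumption about what a compact byte-based alignment could look like (it works only for indices below 256).

```python
from typing import Dict, Iterable, List, Tuple


def build_vocabulary(sentences: Iterable[str]) -> Dict[str, int]:
    """First pass over the data: give every distinct token an integer id."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace a string-based token sequence with its integer ids."""
    return [vocab[token] for token in tokens]


def parse_alignment(feature: str) -> List[Tuple[int, int]]:
    """Parse a feature such as 'Alignment=0-0:0-2:1-1' into index pairs."""
    _, value = feature.split("=", 1)
    points = []
    for pair in value.split(":"):
        source, target = pair.split("-")
        points.append((int(source), int(target)))
    return points


def pack_alignment(points: List[Tuple[int, int]]) -> bytes:
    """Assumed layout: one byte per index, source then target.

    bytes() raises ValueError if any index is outside 0..255.
    """
    return bytes(index for point in points for index in point)


def unpack_alignment(data: bytes) -> List[Tuple[int, int]]:
    """Inverse of pack_alignment."""
    return [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
```

In this sketch the three alignment points from the example feature value take six bytes, compared with eleven characters for the string `0-0:0-2:1-1`.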

Effects (BLEU, intermediate data size on HDFS during extraction, time for grammar extraction) on Ur-En (200k sentence pairs) and De-En (1.8M sentence pairs):

| Ur-En Brkly | dev   | test  | space | time  |
|-------------|-------|-------|-------|-------|
| Old - Hiero | 20.43 | 20.25 | 3.6G  | 8m15s |
| New - Hiero | 20.62 | 20.54 | 1.1G  | 5m24s |
| Net         | +0.2  | +0.3  | -69%  | -34%  |


| Ur-En GIZA  | dev   | test  | space | time  |
|-------------|-------|-------|-------|-------|
| Old - SAMT  | 21.45 | 20.67 | 2.9G  | 7m00s |
| New - SAMT  | 21.87 | 21.24 | 0.8G  | 4m58s |
| Net         | +0.4  | +0.6  | -72%  | -29%  |


| De-En EurPrl | dev   | test  | space | time  |
|--------------|-------|-------|-------|-------|
| Old - Hiero  | 23.53 | 22.39 | 245G  | 3h39m |
| New - Hiero  | 23.62 | 22.47 | 74G   | 55m   |
| Net          | +0.1  | +0.1  | -70%  | -75%  |
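
The Net rows are the change from the Old row to the New row: an absolute difference for BLEU, and a relative change for space and time. A small Python sketch of the relative-change arithmetic follows; the helper names are made up for illustration, and the duration parser only handles the `3h39m` / `8m15s` style used in the tables.

```python
import re


def duration_seconds(text: str) -> int:
    """Convert a duration like '3h39m' or '8m15s' to seconds."""
    seconds_per_unit = {"h": 3600, "m": 60, "s": 1}
    return sum(
        int(amount) * seconds_per_unit[unit]
        for amount, unit in re.findall(r"(\d+)([hms])", text)
    )


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative = reduction)."""
    return 100.0 * (new - old) / old


# De-En Hiero: 245G -> 74G on HDFS, 3h39m -> 55m for extraction.
space_change = percent_change(245, 74)
time_change = percent_change(duration_seconds("3h39m"), duration_seconds("55m"))
print(f"space: {space_change:.1f}%  time: {time_change:.1f}%")
```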

-- Juri

Lane Schwartz

Mar 31, 2013, 2:18:08 PM
to joshua_d...@googlegroups.com
For those of us who do not have a Hadoop cluster, is it possible/practical to run Thrax?

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"

Juri Ganitkevitch

Mar 31, 2013, 2:27:28 PM
to Joshua Developers
It's better than before, at least.

Run sans cluster, Thrax uses a local Hadoop setup that'll run single jobs one at a time. I'm not familiar enough with Hadoop's configuration details, but I think it should be feasible to jack up chunk sizes and memory allotments for the sorting and the impromptu HDFS to reduce the crippling disk overhead Hadoop has in local mode.

Matt Post

Mar 31, 2013, 4:25:45 PM
to joshua_d...@googlegroups.com
Running in standalone mode should be much more feasible than before. If you're using the Joshua pipeline and do not have $HADOOP defined, this mode is automatically triggered.
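
As a rough illustration of that rule (not the pipeline's actual code), the decision could be sketched in Python as below. The function name and the returned labels are hypothetical, and treating an empty $HADOOP the same as an undefined one is an assumption.

```python
import os
from typing import Mapping, Optional


def choose_thrax_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick 'cluster' if HADOOP is set to a non-empty value, else 'standalone'."""
    env = os.environ if env is None else env
    # Assumption: an empty HADOOP value counts as "not defined".
    return "cluster" if env.get("HADOOP") else "standalone"


print(choose_thrax_mode())
```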

matt