Hey everyone,
Having finally probably-approximately-completed an epic effort dating back as far as 1300 B.C., we've pushed a major overhaul to Thrax, now in both its master branch and Joshua's submodule version.
Core changes:
- New extraction core. Jonny rewrote the rule extraction from scratch to be substantially cleaner and easier to understand.
- Integer-based representation. Thrax now collects a global vocabulary in a first pass over the data, and uses it to replace the previously string-based rule and lexprob table representations with integer-based ones. Alignments are no longer stored redundantly and have a more compact byte-based representation.
- Bugfixes. Bugs in the computation of lexical probabilities and some syntax-specific conditional probabilities were fixed.
- Feature polish. The old feature keys are still understood, but it's now one-key-one-feature, and the keys themselves are a little easier to understand. Thrax also has a new "alignment" feature that dumps the max-count alignment for each rule (currently as a feature value, Alignment=0-0:0-2:1-1, but subject to change). We'll update the documentation soon.
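
If you want to read the new alignment feature from a script, here is a minimal Python sketch. It is not part of Thrax. It assumes the value is a colon-separated list of index pairs joined by a hyphen, as in the example above, and that the first index of each pair is the source side (an assumption based on the usual convention). The function name `parse_alignment_feature` is made up for illustration, and since the format is subject to change, treat this as a throwaway helper.

```python
from typing import List, Tuple


def parse_alignment_feature(feature: str) -> List[Tuple[int, int]]:
    """Parse a string like 'Alignment=0-0:0-2:1-1' into index pairs.

    Assumption: each pair is '<first>-<second>' and pairs are joined by ':'.
    """
    key, _, value = feature.partition("=")
    if key != "Alignment":
        raise ValueError(f"not an alignment feature: {feature!r}")
    if not value:
        # An empty value means the rule carries no alignment points.
        return []
    points = []
    for point in value.split(":"):
        first, _, second = point.partition("-")
        points.append((int(first), int(second)))
    return points


print(parse_alignment_feature("Alignment=0-0:0-2:1-1"))
# prints [(0, 0), (0, 2), (1, 1)]
```
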
Effects (BLEU, intermediate data size on HDFS during extraction, time for grammar extraction) on Ur-En (200k sentence pairs) and De-En (1.8M sentence pairs):

| Ur-En Brkly  | dev   | test  | space | time  |
|--------------|-------|-------|-------|-------|
| Old - Hiero  | 20.43 | 20.25 | 3.6G  | 8m15s |
| New - Hiero  | 20.62 | 20.54 | 1.1G  | 5m24s |
| Net          | +0.2  | +0.3  | -69%  | -34%  |

| Ur-En GIZA   | dev   | test  | space | time  |
|--------------|-------|-------|-------|-------|
| Old - SAMT   | 21.45 | 20.67 | 2.9G  | 7m00s |
| New - SAMT   | 21.87 | 21.24 | 0.8G  | 4m58s |
| Net          | +0.4  | +0.5  | -72%  | -15%  |

| De-En EurPrl | dev   | test  | space | time  |
|--------------|-------|-------|-------|-------|
| Old - Hiero  | 23.53 | 22.39 | 245G  | 3h39m |
| New - Hiero  | 23.62 | 22.47 | 74G   | 55m   |
| Net          | +0.1  | +0.1  | -70%  | -75%  |

-- Juri