Revision: 189a860f6e0c
Branch: default
Author: Michael Gasser <gas...@cs.indiana.edu>
Date: Fri May 16 19:51:18 2014 UTC
Log: Last-minute edits to LG-LP paper.
http://code.google.com/p/hltdi-l3/source/detail?r=189a860f6e0c
Modified:
/paperdrafts/lglp/lglp14.pdf
/paperdrafts/lglp/lglp14.tex
=======================================
--- /paperdrafts/lglp/lglp14.pdf Fri May 16 07:05:16 2014 UTC
+++ /paperdrafts/lglp/lglp14.pdf Fri May 16 19:51:18 2014 UTC
Binary file, no diff available.
=======================================
--- /paperdrafts/lglp/lglp14.tex Fri May 16 07:05:16 2014 UTC
+++ /paperdrafts/lglp/lglp14.tex Fri May 16 19:51:18 2014 UTC
@@ -128,7 +128,7 @@
millions of speakers, such as Telugu, Burmese, Oromo, and Hausa.
For machine translation (MT) and computer-assisted translation (CAT),
the lack is even more serious because what is
-required for machine learning is bitext, sentence-aligned translations.
+required for machine learning is sentence-aligned translations.
For these reasons, work on many such languages will continue to
consist in large part in the writing of computational grammars and
@@ -169,7 +169,7 @@
Arguments based on idiomaticity and ambiguity are semantic, but they
extend naturally to translation.
If the meaning of a source-language phrase fails to be the strict combination of the meanings
of the words in the phrase, then it is unlikely that the translation of the phrase will be the
-combination of the translations of the words.
+combination of the translations of the source-language words.
Adding lexical context to an ambiguous noun or verb can sometimes permit an MT
system to select the appropriate translation.
@@ -187,7 +187,8 @@
A group's entry also specifies translations to groups in one or more other languages.
For each translation, the group's entry gives an \textbf{alignment}, representing inter-group correspondences between
elements, as in the phrase tables of PBSMT.
-Entry~\ref{entry:end} shows a simple group entry of this sort.
+Entry~\ref{entry:end} shows a simple group entry of
+this sort.\footnote{We serialize Hiiktuu lexical with YAML (\url{http://www.yaml.org/})}
The English group \textit{the end of the world} with head \textit{end} has as its Spanish translation
the group \textit{el fin del mundo} (which has its own entry in the Spanish lexicon).
In the alignment, each word other than the fourth word (\textit{the}) in the English group is associated with the position
@@ -211,17 +212,17 @@
\subsection{The lexicon-grammar tradeoff}
\label{subsect:lexgram}
-A rudimentary lexicon with entries like the one in Entry~\ref{entry:end} is simple
+A rudimentary lexicon with entries of this sort is simple
in two senses: a user with no formal knowledge of linguistics can add entries in a
straightforward manner, and the resulting entries are
easily understood.
-Such a lexicon permits the translation of sentences consisting of verbatim combinations
+Such a lexicon permits the translation of sentences that are combinations
of the wordforms in the group entries, as long as group order is preserved across
the languages and there are no constraints between groups that would affect the form
of the target-language words.
-However, since it contains no grammatical information, such a lexicon permits no
+However, such a lexicon permits no
{\em generalization} to combinations of wordforms that are not explicit in the lexicon.
-Such a system would require a group entry for every reasonably possible combination of
+It would require a group entry for every reasonably possible combination of
wordforms.
%Even for language pairs with enormous available bitext corpora,
%SMT researchers have discovered the need to incorporate some syntax in their systems.
@@ -237,7 +238,7 @@
%As we have seen, abstract word-based grammars also miss the information that is inherent
%in words in context.
-In the Hiiktuu project, the goal is to permit a range of possibilities along the continuum from
+In the Hiiktuu project, the goal is a range of possibilities along the continuum from
purely lexical (and phrasal) to syntactic/grammatical, with the emphasis on ease of entry
creation and interpretation.
@@ -246,7 +247,7 @@
We can achieve significant generalization over simple groups consisting of wordforms by
permitting lexemes in groups.
-As an example, consider the English group \textit{pass\_v the buck}, where \textit{pass\_v} is
+As an example, consider the English group \textit{passV the buck}, where \textit{passV} is
the verb lexeme \textit{pass}.
In order to make such a group usable, the lexicon also requires \textbf{form} entries,
giving the lexeme roots as well as grammatical features for specific wordforms.
@@ -257,20 +258,20 @@
\small
\begin{verbatim}
groups:
- pass_v:
- - words: [pass_v, the, buck]
+ passV:
+ - words: [passV, the, buck]
spa:
- [escurrir_el_bulto,
{align: [1,2,3], agr: [{tns: tmp, prs: prs, num: num}, 0, 0]}]
forms:
pass:
- - root: pass_v, features: {prs: 1, tns: prs}
- - root: pass_v, features: {prs: 3, num: plr, tns: prs}
- - root: pass_n, features: {num: sng}
+ - root: passV, features: {prs: 1, tns: prs}
+ - root: passV, features: {prs: 3, num: plr, tns: prs}
+ - root: passN, features: {num: sng}
passes:
- root: pass_v, features: {prs: 3, num: sng, tns: prs}
+ root: passV, features: {prs: 3, num: sng, tns: prs}
passed:
- root: pass_v, features: {tns: pst}
+ root: passV, features: {tns: pst}
\end{verbatim}
\normalsize
%\end{spacing}
@@ -278,16 +279,17 @@
\label{entry:pass}
\end{entry}
-Because this entry accommodates multiple sequences of English word forms,
+Because this entry accommodates multiple sequences of English wordforms,
we need to map these onto appropriate target-language sequences.
This is accomplished through pairs of agreement features
for the lexeme, constraining the corresponding target language form to agree with the source
form on those features.
In the example, the
-head \textit{pass\_v} and its translation in the Spanish group agree on
-tense and \textit{tiempo}, person and \textit{persona}, and number and \textit{nœmero} features.
+head \textit{passV} and its translation in the Spanish group agree on
+tense and \textit{tiempo}, person and \textit{persona}, and number and \textit{n\'{u}mero} features.
For example, if this group is selected in the translation of the sentence \textit{Carl passes the buck},
-the head of the corresponding Spanish group will be constrained to be third person singular present tense (tiempo):
+the head of the corresponding Spanish group will be constrained to be
+third person singular present tense (\textit{tiempo}):
\textit{Carl \textbf{escurre} el bulto}.
\subsection{Lexical/grammatical categories}
@@ -299,15 +301,15 @@
and \textit{gave them a piece of my mind} by replacing the specific wordforms in positions
2 and 6 in the group with categories that include the wordforms that can fill those positions.
This requires the forms dictionary to record the categories that wordforms belong to.
-Entry~\ref{entry:mind} shows how this information would be recorded.
+Entry~\ref{entry:mind} shows how this appears in the lexicon.
Category names are preceded by \$.
\begin{entry}
\small
\begin{verbatim}
groups:
- give_v:
- - words: [give_v, $sbd, a, piece, of, $sbds, mind]
+ giveV:
+ - words: [giveV, $sbd, a, piece, of, $sbds, mind]
agr: [[2, 6, {prs: prs, num: num}]]
my:
- words: [my]
@@ -318,15 +320,15 @@
mayor: [{cats: [$sbd]}]
\end{verbatim}
\normalsize
-\caption{Three group entries and a few associated form entries}
+\caption{Three group entries and two associated form entries}
\label{entry:mind}
\end{entry}
Because group positions that are filled by categories do not specify a surface form,
-for parsing and generation of sentences they must be merged with other groups that match
+during parsing and generation of sentences they must be merged with other groups that match
the category and do specify a form.
For example, to parse or translate the sentence \textit{I gave the mayor a piece of my mind} requires
-that positions 2 and 6 in the group \textit{give\_v\_$sbd\_a\_piece\_of\_$sbds\_mind} be
+that positions 2 and 6 in the group \textit{giveV\_\$sbd\_a\_piece\_of\_\$sbds\_mind} be
filled by the heads of the groups \textit{the\_mayor} and \textit{my}.
This \textbf{node merging} process is illustrated in Figure~\ref{fig:mind}.
@@ -377,7 +379,7 @@
groups.
In this process some target-language items are assigned grammatical features on the basis of agreement constraints.
For example, in the translation of the English sentence \textit{the mayor passes the buck},
-the Spanish verb that is the head of the group \textit{escurrir el bulto} would be
+the Spanish verb that is the head of the group \textit{escurrir\_el\_bulto} would be
assigned the tense (\textit{tiempo}), person and number features \texttt{tmp=prs, prs=3, num=1}: \textit{escurre}.
A source-language group may have more than one translation.
The transfer phase creates a separate target-language group assignment for each combination of translations of the
@@ -440,10 +442,11 @@
The code for Hiiktuu and a set of lexical-grammatical examples
are available at [\textit{URL omitted to preserve anonymity}]
under the GPL license.
-To date, we have only tested the framework on a limited number of Amharic-to-Oromo
-translations.
-In order to develop a more complete lexicon-grammar for this language pair and others,
-we are currently working on methods for automatically extracting groups from
+To date, we have only tested the framework on a limited number of
+translations using various language pairs.
+In order to develop more complete lexicon-grammars for Amharic-Oromo and
+Spanish-Guarani,
+we are working on methods for automatically extracting groups from
dictionaries in various formats and from the limited bilingual data that are available.
As a part of this work, it will be crucial to determine whether
@@ -464,7 +467,7 @@
or major constituent order differences between source and target language.\footnote{
The only way to implement such constraints in the current version of Hiiktuu is through
groups that incorporate, for example, subjects in verb-headed groups, as in
-\textit{\$sbd kick\_v \$sth}.}
+\textit{\$sbd kickV \$sth}.}
To alleviate this problem, we will be implementing
dependencies between group heads, much as in the
``interchunk module'' of Apertium.
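
The agreement mechanism that the passV entry in this diff relies on can be illustrated with a minimal Python sketch. It is not Hiiktuu's code: the names FORMS, HEAD_AGR, analyses and transfer_features are hypothetical, the YAML entries are transcribed by hand into plain dicts, and feature values are copied unchanged (any value renaming, such as sng to 1 in the Spanish lexicon, is assumed to happen elsewhere).

```python
# Hypothetical sketch; none of these names come from the Hiiktuu code base.

# Form entries for "pass", "passes" and "passed", mirroring the YAML in the diff.
FORMS = {
    "pass": [
        {"root": "passV", "features": {"prs": 1, "tns": "prs"}},
        {"root": "passV", "features": {"prs": 3, "num": "plr", "tns": "prs"}},
        {"root": "passN", "features": {"num": "sng"}},
    ],
    "passes": [
        {"root": "passV", "features": {"prs": 3, "num": "sng", "tns": "prs"}},
    ],
    "passed": [
        {"root": "passV", "features": {"tns": "pst"}},
    ],
}

# Agreement pairs for the group head: source feature name -> target feature name,
# as in `agr: [{tns: tmp, prs: prs, num: num}, 0, 0]`.
HEAD_AGR = {"tns": "tmp", "prs": "prs", "num": "num"}


def analyses(wordform, root):
    """Return the feature sets under which `wordform` is a form of lexeme `root`."""
    return [e["features"] for e in FORMS.get(wordform, []) if e["root"] == root]


def transfer_features(source_features, agr):
    """Rename source features to their target counterparts; values are kept as-is."""
    return {agr[name]: value
            for name, value in source_features.items()
            if name in agr}


if __name__ == "__main__":
    # Constraints placed on the Spanish head when the source head is "passes".
    for feats in analyses("passes", "passV"):
        print(transfer_features(feats, HEAD_AGR))
```

Keeping the agreement pairs as a name-to-name mapping means the same function would serve any language pair; only the dictionary changes.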