Discussion with the localization team at the All-Hands yesterday


Jono

Apr 28, 2009, 8:24:46 PM4/28/09
to ubiqui...@googlegroups.com
Hey everybody,
I had a good talk with some of the localization team yesterday about
localizing Ubiquity.  Here's a summary of things that I learned.

1. Languages that decline nouns will be our biggest challenge.  See:
Latin and most Romance languages.  This has architectural
implications, in that the nountypes will have to return not only a
score for how well a string matches the nountype and a list of
suggestions, but must also be able to say "Judging by the ending of
this string, I estimate with 70% confidence that it is the direct
object of the sentence."  Nountype objects will need to do something
they've never done before, which is to influence assignment to
semantic roles.
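To make this concrete, here's a very rough sketch of what a
declension-aware nountype could look like.  The roleEstimates field
and the Latin-ish endings are purely illustrative -- nothing here is a
designed API:

var noun_type_latin_thing = {
  name: "latin thing",
  suggest: function(text, html) {
    // Guess the semantic role from the ending: -am/-as/-um/-os look
    // accusative, so the string is probably the direct object.
    var looksAccusative = /(am|as|um|os)$/.test(text);
    return [{
      text: text,
      score: looksAccusative ? 0.8 : 0.5,             // match quality
      roleEstimates: looksAccusative
        ? { object: 0.7, goal: 0.3 }                  // confidence per role
        : {}
    }];
  }
};

The parser could then fold those role estimates into argument
assignment instead of relying on word order or prepositions alone.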

2. Localizing nountypes will also be a big challenge that we haven't
addressed yet.   For some pairs (nountype, language) we can just
replace a regexp with a different regexp, or replace a list of strings
with a different list of strings, but others will need whole new
algorithms.  We must create an architecture where a contributor (who
may not be the author or maintainer of the command feed that contains
the English version of a nountype) is able to provide an entire new
implementation of  var noun_foo = { suggest: function(),
defaults:function()} based on a new algorithm.
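For example, a contributed Polish replacement for a date nountype
might ship its own logic rather than just a translated word list.
This is only a sketch with made-up names:

var noun_type_date_pl = {
  name: "data",
  suggest: function(text, html) {
    // "jutro" ("tomorrow") is handled by code, not string substitution
    if (/^jutro$/i.test(text)) {
      var d = new Date();
      d.setDate(d.getDate() + 1);
      return [{ text: text, data: d, score: 1.0 }];
    }
    return [];
  },
  defaults: function() {
    // "dzisiaj" means "today"
    return [{ text: "dzisiaj", data: new Date(), score: 1.0 }];
  }
};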

3. Someone on the Ubiquity team needs to learn about l20n (the
successor to l10n), how it works, and how to use it.  We shouldn't
bother trying to use the DTD file approach of the current l10n
architecture, as it is a bit limited for our purposes; instead we
should leapfrog to using the cutting-edge l20n stuff, which is much
better suited to our purposes.  There is a set of links to l20n
resources at https://wiki.mozilla.org/l20n

4. We want to have the spiritual equivalent of a Wiki, even if it's
not implemented with a wiki platform -- a place where people can go to
and enter localized strings for commands.  A nice side benefit is that
the documentation and help strings can then be worked on by any
community member -- not just localized, but also improved in quality.
This is highly structured data that we want people to enter, so a wiki
supporting semantic markup (or even just a series of forms) would be
better suited than a freeform edit-the-whole-page wiki.
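The kind of record such a form-based wiki would collect per command
might look roughly like this -- the field names are just a strawman,
not a settled schema:

var localizationEntry = {
  feedUrl: "http://example.com/commands.html",   // hypothetical feed URL
  commandName: "map",
  locale: "pl",
  names: ["mapa"],                               // localized command name(s)
  description: "(localized description goes here)",
  help: "(localized help text goes here)",
  contributor: "community member B"
};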

5. It is fairly straightforward to allow localization contributions to
our built-in first-party commands (and nountypes) through such a wiki,
but what about the case where one third-party wants to provide
localization for a different third-party command feed?  So for
example:

     * Contributor A publishes a command feed C
     * Contributor B publishes a feed that contains the strings and
metadata needed to localize feed C into Polish.  Call it C.po.
     * User subscribes to C and C.po. The ubiquity core matches up
the urls or some other metadata to discover that C.po is meant to
apply to C.  It slots the strings from C.po into the command objects
from C, registers the Polish nountype implementations from C.po, etc.

In order to do all this properly, it may be necessary to start
versioning command feeds.  So the above feeds become C-1.0.1 and
C.po-1.0.1.  This would be part of the metadata used by the core to
match up a localization feed to its parent command feed.  If
contributor A then publishes changes, he or she updates the version
number of C to C-1.0.2.  C.po-1.0.1 then stops working on it until
contributor B makes any necessary updates and bumps the version
number of C.po.  This would prevent localizations from breaking
things by referring to stuff that has changed in the underlying feed
and is no longer compatible.  However, the "stop working until
contributor B updates the version number" rule may actually be a cure
that is worse than the disease.  We need to think about this.

Side point: version numbers on command feeds would have other
benefits, like letting Ubiquity know which versions of the parser a
command feed is compatible with.
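Here's a rough sketch of how the core could do the matching, assuming
each feed carries a bit of metadata.  The field names are invented for
the sake of discussion:

function findLocalizationsFor(commandFeed, subscribedFeeds) {
  return subscribedFeeds.filter(function(feed) {
    var meta = feed.metadata || {};
    return meta.localizes === commandFeed.url &&           // C.po points at C
           meta.localizesVersion === commandFeed.version;  // e.g. "1.0.1"
  });
}

// If contributor A bumps C to 1.0.2, C.po-1.0.1 simply stops matching
// and its strings are not slotted in until contributor B updates it.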


6.  The Polish language was designed specifically to foil all attempts
at computer parsing.  When the machines rise up and try to take over
the world, Poland will be the last bastion of humanity.

 On a somewhat more serious note, I learned that there are some
languages with unique grammars that will resist our attempts to parse
them.  No-one has yet come up with a satisfactory way for a computer
to parse Polish (this is also true for Finnish, I think?) even though
some very smart people have tried.  So, instead of struggling in vain
to get 100% natural-language parsing, we will have to settle for
accepting input in "pidgin Polish" or "pidgin Finnish" -- focus on the
20% of the work that will provide 80% of the benefit to Polish or
Finnish users and hope that they will forgive our mangling of their
tongues.

Remember that parsing a full sentence is only required for a minority
of inputs, anyway.  The majority of Ubiquity input is likely to be
much simpler -- single-word verb-only commands, single nouns followed
by selecting a suggested verb, or verb + single noun sentences.

The point was also raised that it may be useful to define different
levels of approximation to natural language input, with some verbs and
nountypes supporting a higher level than others, and graceful
degradation when higher levels are not available.  I am not exactly
sure where this idea fits into the overall vision, though.

--Jono

"mitcho (Michael 芳貴 Erlewine)"

Apr 29, 2009, 2:44:21 AM4/29/09
to Jono, ubiqui...@googlegroups.com, Aza, Atul Varma, gan...@mozilla.com, Blair McBride, l10n...@googlemail.com
Hi Jono,

Thanks for the report from your meeting with the l10n folks... sounds
like you had a great, rewarding meeting! I'll comment on these items
inline and respond to Pike's email after.

> 1. Languages that decline nouns will be our biggest challenge. See:
> Latin and most Romance languages. This has architectural
> implications, in that the nountypes will have to return not only a
> score for how well a string matches the nountype and a list of
> suggestions, but must also be able to say "Judging by the ending of
> this string, I estimate with 70% confidence that it is the direct
> object of the sentence." Nountype objects will need to do something
> they've never done before, which is to influence assignment to
> semantic roles.

Yes, this will be the most challenging, or at least the most non-
automatable component of Ubiq i18n. We will definitely have to play
around with some case marking patterns and see how difficult it will
be to recognize such patterns. I do think that in many cases, though,
we wouldn't have to do more than a regexp here or there. In some cases
we may get away without dealing with case markers — for example,
there are a number of languages that have both short suffixes and
prepositions which correspond to the same argument role, but the
prepositions tend to be used for longer expressions and borrowed
words. Perhaps Ubiquity in these cases would only recognize the
adpositions and not the morphological cases...? (This jibes with point
6 below.)

In Parser 2, as it stands now, there is a conceptual divide between
"roles" and "nountypes," which correspond to the conceptual divide
between morphosyntactic and semantic features of arguments. When we
first drafted the Parser 2 (then Parser TNG) design, we split up the
handling of case markers into two steps: we would split off possible
case markers in step 1 (splitting up words), then create parses in
step 4 (argument structure parsing) based on scenarios where the
possible case markers are indeed case markers or just words. I am not
sure that the case-handling needs to be combined with the nountype
detection—I would argue that it should not be.

That does not mean that case markers need to be deterministic: The
benefit of Parser 2 is that the whole system is non-deterministic. For
example, if a single noun stem were appropriate for multiple roles, it
could produce multiple parses early on but the nountype detection and
other factors would then rule out the inappropriate parse. This
strategy has shown great promise with Japanese, but we do need to work
on strongly case marked languages with more complex morphophonology.
(We currently only incorporate confidence scores from nountype matches
in the parse score but no confidence scores for the argument
parsing... this is something we should also consider.)
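As a toy illustration of that step-1 idea (not the actual Parser 2
code): split off a possible case marker or postposition and emit both
readings, letting the later steps score them.

function splitPossibleCaseMarkers(word, markers) {
  // Reading 1: the word is just a word, with no marker attached.
  var candidates = [{ stem: word, marker: null }];
  markers.forEach(function(m) {
    if (word.length > m.length && word.slice(-m.length) === m) {
      // Reading 2: the ending is a case marker / postposition.
      candidates.push({ stem: word.slice(0, -m.length), marker: m });
    }
  });
  return candidates;
}

// splitPossibleCaseMarkers("東京に", ["に", "を", "で"])
//   -> [{ stem: "東京に", marker: null }, { stem: "東京", marker: "に" }]
// Nountype detection and role matching then decide which parse wins.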

I've been planning to blog about this and related issues for a while,
but I will ++ the priority on that.

> 2. Localizing nountypes will also be a big challenge that we haven't
> addressed yet. For some pairs (nountype, language) we can just
> replace a regexp with a different regexp, or replace a list of strings
> with a different list of strings, but others will need whole new
> algorithms. We must create an architecture where a contributor (who
> may not be the author or maintainer of the command feed that contains
> the English version of a nountype) is able to provide an entire new
> implementation of var noun_foo = { suggest: function(),
> defaults:function()} based on a new algorithm.

Agreed — let's discuss.

> 3. Someone on the Ubiquity team needs to learn about l20n (the
> successor to l10n), how it works, and how to use it. We shouldn't
> bother trying to use the DTD file approach of the current l10n
> architecture, as it is a bit limited for our purposes; instead we
> should leapfrog to using the cutting-edge l20n stuff, which is much
> better suited to our purposes. There is a set of links to l20n
> resources at https://wiki.mozilla.org/l20n

Seth and I talked briefly about l20n and I looked over some of those
notes a little while back, but it was not clear to me how production-
ready these systems are. Perhaps I missed something?

Jono, when you say "someone on the Ubiquity team" it probably should
be the both of us. :)

> 4. We want to have the spiritual equivalent of a Wiki, even if it's
> not implemented with a wiki platform -- a place where people can go to
> and enter localized strings for commands. A nice side benefit is that
> the documentation and help strings can then be worked on by any
> community member -- not just localized, but also improved in quality.
> This is highly structured data that we want people to enter, so a wiki
> supporting semantic markup (or even just a series of forms) would be
> better suited than a freeform edit-the-whole-page wiki.

Yes yes yes! The "just a series of forms" is what I was
conceptualizing... are you talking here about command names, nountype
code, etc., or also data for basic parser training (à la
http://mitcho.com/blog/projects/automating-the-linguists-job/)?
Did you guys discuss that concept at all?

> 5. It is fairly straightforward to allow localization contributions to
> our built-in first-party commands (and nountypes) through such a wiki,
> but what about the case where one third-party wants to provide
> localization for a different third-party command feed? So for
> example:
>
> * Contributor A publishes a command feed C
> * Contributor B publishes a feed that contains the strings and
> metadata needed to localize feed C into Polish. Call it C.po.
> * User subscribes to C and C.po. The ubiquity core matches up
> the urls or some other metadata to discover that C.po is meant to
> apply to C. It slots the strings from C.po into the command objects
> from C, registers the Polish nountype implementations from C.po, etc.

Hmm... I had a great conversation recently with Matt Mullenweg (who I
know y'all just saw as well!) while he was in Tokyo, about plugin
localization in WordPress: how it all has to go through the plugin
author and how, ideally, it could be centralized (since the plugin
code already is anyway). I feel like, even for third-party commands,
we could use the herd (or something like it) as a central place for
this type of collaboration, rather than forcing contributors to
localize and host whole commands at a time. Perhaps (as we talked
about in the past) if you subscribe to the herd-hosted copy of a
command you'll get the localizations with it?

As for how to pull out the localizable components of command code, I
have seen a couple of implementations of gettext in JS, though I don't
know if we want to go down that road...
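For reference, the gettext-style approach boils down to something like
this minimal sketch (not any particular library's API):

var translations = {
  pl: { "Maps your query.": "(Polish translation here)" }
};

function _(str, locale) {
  var table = translations[locale];
  return (table && table[str]) || str;
}

// Inside a command definition: description: _("Maps your query.", locale)
// Every string a localizer needs to touch then becomes greppable.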

> 6. The Polish language was designed specifically to foil all attempts
> at computer parsing. When the machines rise up and try to take over
> the world, Poland will be the last bastion of humanity.
>
> On a somewhat more serious note, I learned that there are some
> languages with unique grammars that will resist our attempts to parse
> them. No-one has yet come up with a satisfactory way for a computer
> to parse Polish (this is also true for Finnish, I think?) even though
> some very smart people have tried. So, instead of struggling in vain
> to get 100% natural-language parsing, we will have to settle for
> accepting input in "pidgin Polish" or "pidgin Finnish" -- focus on the
> 20% of the work that will provide 80% of the benefit to Polish or
> Finnish users and hope that they will forgive our mangling of their
> tongues.
>
> Remember that parsing a full sentence is only required for a minority
> of inputs, anyway. The majority of Ubiquity input is likely to be
> much simpler -- single-word verb-only commands, single nouns followed
> by selecting a suggested verb, or verb + single noun sentences.

Great point. This, I believe, is the huge benefit we have over other
similar "natural language" projects of the past... by greatly limiting
the scope of what we're doing I think there is great potential for us
to make this interface easier to use for more people with (relatively)
minimal work. Many of the greatest complexities in natural language
understanding (recursion, quantifiers, negation, etc.) are expressly
ruled out as non-problems for this project based on our goal.

> The point was also raised that it may be useful to define different
> levels of approximation to natural language input, with some verbs and
> nountypes supporting a higher level than others, and graceful
> degradation when higher levels are not available. I am not exactly
> sure where this idea fits into the overall vision, though.

Hmm... interesting idea... sort of goes with Pike's thought on users
who may want to use multiple languages in different contexts... and
with that, I will go respond to that email. :D

Thanks again for all the great ideas! Let's keep this conversation
going (perhaps splitting it into different threads).

mitcho

--
mitcho (Michael 芳貴 Erlewine)
mit...@mitcho.com
http://mitcho.com/
linguist, coder, teacher

Blair McBride

Apr 29, 2009, 12:30:35 PM4/29/09
to ubiqui...@googlegroups.com
I'd like to avoid explicit versioning if at all possible - it's a bad
enough problem with slowly changing products, let alone things like
command feeds, which can change often and are more socially driven.
Either way, such versioning applies to the code, yet the actual strings
may not change (or the other way around).

Anyway, what I'd like to see is automatic detection of whether a feed
localization applies to a given copy of a feed or not. Essentially,
versioning of strings, not the whole feed.
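Roughly what I have in mind, sketched with invented names: hash each
localizable source string, store the hash next to its translation, and
only apply translations whose source string hasn't changed.

function applyLocalization(command, translations, hash) {
  for (var key in translations) {             // e.g. "description", "help"
    var entry = translations[key];            // { sourceHash: "...", text: "..." }
    if (hash(command[key]) === entry.sourceHash) {
      command[key] = entry.text;              // source unchanged: safe to apply
    }
    // Otherwise keep the (newer) original string rather than applying a
    // stale translation -- no feed-level version number needed.
  }
}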

More thoughts coming later...

- Blair

marsf

Apr 30, 2009, 5:07:13 AM4/30/09
to Ubiquity i18n
Hi, all.
Here is my list of the Ubiquity parts that need localizing under the
current parser (version 1) system.


1. HTML pages
* Just translate the page. :)

2. Nountypes
* Each nountype's name.
* Some nountypes have parts in their code that need localizing
(e.g. LanguageCodes and noun_type_date).

3. Verbs
* Each verb's name.
* Embedded modifier names ("to", "in" and so on) in each verb command.
I think this is the biggest obstacle to localizing on parser 1; see
the sketch after this list.
(The new parser 2 solves this issue.)

4. Command previews
* Need support for left-branching languages (like Japanese).

5. Language-specific parser
* The most difficult part to localize, as you know.
* Even mitcho's new parser 2 needs more flexible logic rather than
fixed algorithms.
But the new parser 2 is much better than parser 1 for localization. :)
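To illustrate item 3: in parser 1, a command declares its modifiers
with English prepositions baked into the code, roughly like this (from
memory, so the details may differ):

CmdUtils.CreateCommand({
  name: "translate",
  takes: { "text": noun_arb_text },
  modifiers: { "to": noun_type_language },  // the literal word "to" is part of the grammar
  execute: function(directObject, modifiers) {
    // ... call the translation service here ...
  }
});

A localizer can't fix that with a string table alone: the "to" key
itself has to change, which is why parser 2's role-based arguments
help so much.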
--
mar

Blair McBride

Apr 30, 2009, 1:03:16 PM4/30/09
to ubiqui...@googlegroups.com
You mean, copy all markup and change the text content? The usual DTD
system seems easier and less problematic (for both Ubiq devs and
localizers).

- Blair

marsf

Apr 30, 2009, 9:25:49 PM4/30/09
to Ubiquity i18n
Blair:

No, the list just indicates the places to localize. Using the DTD
system is the better way.
FYI, the old in-product Firefox Help contained complete HTML pages in
each locale.
-- mar

"mitcho (Michael 芳貴 Erlewine)"

May 7, 2009, 1:30:29 AM5/7/09
to mitcho (Michael 芳貴 Erlewine), Jono, ubiqui...@googlegroups.com, Aza, Atul Varma, gan...@mozilla.com, Blair McBride, l10n...@googlemail.com
Hi all,

I just posted on my blog regarding some of the difficulties with
strongly case-marked languages. Highlights include the idea that, for
languages like German, which mark case on the article, we can just
treat the articles as prepositions; otherwise we'll most likely want
to require users to use prepositions/postpositions. I believe this is a
good compromise between the user and the parser.

http://mitcho.com/blog/projects/in-case-of-case/

I'll look forward to your comments!

mitcho
