Hey everybody, I had a good talk with some of the localization team yesterday about localizing Ubiquity. Here's a summary of things that I learned.
1. Languages that decline nouns will be our biggest challenge. See: Latin and most Romance languages. This has architectural implications, in that the nountypes will have to be able to return not only a score for how well a string matches the nountype, not only a list of suggestions, but they must also be able to say "Judging by the ending of this string, I estimate with 70% confidence that it is the direct object of the sentence." Nountype objects will need to be able to do something they've never done before, which is to influence assignment to semantic roles.
2. Localizing nountypes will also be a big challenge that we haven't addressed yet. For some pairs (nountype, language) we can just replace a regexp with a different regexp, or replace a list of strings with a different list of strings, but others will need whole new algorithms. We must create an architecture where a contributor (who may not be the author or maintainer of the command feed that contains the English version of a nountype) is able to provide an entire new implementation of var noun_foo = { suggest: function(), defaults:function()} based on a new algorithm.
3. Someone on the Ubiquity team needs to learn about l20n (the successor to l10n), how it works, and how to use it. We shouldn't bother trying to use the DTD file approach of the current l10n architecture, as it is a bit limited for our purposes; instead we should leapfrog to using the cutting-edge l20n stuff, which is much better suited to our purposes. There is a set of links to l20n resources at https://wiki.mozilla.org/l20n
4. We want to have the spiritual equivalent of a Wiki, even if it's not implemented with a wiki platform -- a place where people can go to and enter localized strings for commands. A nice side benefit is that the documentation and help strings can then be improved by any community member -- this can be to localize them but can also be to improve the quality of the documentation. This is highly structured data that we want people to enter, so a wiki supporting semantic markup (or even just a series of forms) would be more suited than a freeform edit-the-whole-page wiki.
5. It is fairly straightforward to allow localization contributions to our built-in first-party commands (and nountypes) through such a wiki, but what about the case where one third-party wants to provide localization for a different third-party command feed? So for example:
* Contributor A publishes a command feed C
* Contributor B publishes a feed that contains the strings and metadata needed to localize feed C into Polish. Call it C.po.
* User subscribes to C and C.po. The ubiquity core matches up the urls or some other metadata to discover that C.po is meant to apply to C. It slots the strings from C.po into the command objects from C, registers the Polish nountype implementations from C.po, etc.
In order to do all this properly, it may be necessary to start versioning command feeds. So the above feeds become C-1.0.1 and C.po-1.0.1. This would be part of the metadata used by the core to match up a localization feed to its parent command feed. If contributor A then publishes changes, he or she updates the version number of C to C-1.0.2. C.po-1.0.1 then stops working on it until contributor B makes any necessary updates and updates the version number of C.po. This would prevent the case where localizations break things by attempting to refer to stuff that has changed in the underlying feed and is no longer compatible. However, the "stop working until contributor B updates the version number" may actually be a cure that is worse than the disease. We need to think about this.
Side point: version numbers on command feeds would have other benefits, like letting Ubiquity know which versions of the parser a command feed is compatible with.
6. The Polish language was designed specifically to foil all attempts at computer parsing. When the machines rise up and try to take over the world, Poland will be the last bastion of humanity.
On a somewhat more serious note, I learned that there are some languages with unique grammars that will resist our attempts to parse them. No-one has yet come up with a satisfactory way for a computer to parse Polish (this is also true for Finnish, I think?) even though some very smart people have tried. So, instead of struggling in vain to get 100% natural-language parsing, we will have to settle for accepting input in "pidgin Polish" or "pidgin Finnish" -- focus on the 20% of the work that will provide 80% of the benefit to Polish or Finnish users and hope that they will forgive our mangling of their tongues.
Remember that parsing a full sentence is only required for a minority of inputs, anyway. The majority of Ubiquity input is likely to be much simpler -- single-word verb-only commands, single nouns followed by selecting a suggested verb, or verb + single noun sentences.
The point was also raised that it may be useful to define different levels of approximation to natural language input, with some verbs and nountypes supporting a higher level than others, and graceful degradation when higher levels are not available. I am not exactly sure where this idea fits into the overall vision, though.
Thanks for the report from your meeting with the l10n folks... sounds like you had a great, rewarding meeting! I'll comment on these items inline and respond to Pike's email after.
> 1. Languages that decline nouns will be our biggest challenge. See: Latin and most Romance languages. This has architectural implications, in that the nountypes will have to be able to return not only a score for how well a string matches the nountype, not only a list of suggestions, but they must also be able to say "Judging by the ending of this string, I estimate with 70% confidence that it is the direct object of the sentence." Nountype objects will need to be able to do something they've never done before, which is to influence assignment to semantic roles.
Yes, this will be the most challenging, or at least the most non-automatable component of Ubiq i18n. We will definitely have to play around with some case marking patterns and see how difficult it will be to recognize such patterns. I do think that in many cases, though, we wouldn't have to do more than a regexp here or there. In some cases we may get away without dealing with case markers; for example, there are a number of languages that have both short suffixes and prepositions which correspond to the same argument role, but the prepositions tend to be used for longer expressions and borrowed words. Perhaps Ubiquity in these cases would only recognize the adpositions and not the morphological cases...? (This jibes with point 6 below.)
In Parser 2, as it stands now, there is a conceptual divide between "roles" and "nountypes," which correspond to the conceptual divide between morphosyntactic and semantic features of arguments. When we first drafted the Parser 2 (then Parser TNG) design, we split up the handling of case markers into two steps: we would split off possible case markers in step 1 (splitting up words), then create parses in step 4 (argument structure parsing) based on scenarios where the possible case markers are indeed case markers or just words. I am not sure that the case-handling needs to be combined with the nountype detection—I would argue that it should not be.
That does not mean that case markers need to be deterministic: The benefit of Parser 2 is that the whole system is non-deterministic. For example, if a single noun stem were appropriate for multiple roles, it could produce multiple parses early on but the nountype detection and other factors would then rule out the inappropriate parse. This strategy has shown great promise with Japanese, but we do need to work on strongly case marked languages with more complex morphophonology. (We currently only incorporate confidence scores from nountype matches in the parse score but no confidence scores for the argument parsing... this is something we should also consider.)
I've been planning to blog about this and related issues for a while, but I will ++ the priority on that.
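To make the idea of per-role confidence scores a bit more concrete, here is a rough sketch of how a role guess from a word ending could be multiplied into the nountype score, so that several parses are produced and the nountype then rules some out. Everything in it is hypothetical: `SUFFIX_ROLES`, `guessRoles`, `scoreParses`, the toy `noun_city`, the made-up suffixes and the numbers are invented for illustration and are not Parser 2's actual API.

```javascript
// Hypothetical suffix table for an imaginary case-marking language.
// Each suffix maps to candidate roles with a confidence in [0, 1].
const SUFFIX_ROLES = {
  am: [{ role: "object", confidence: 0.7 }],
  o: [
    { role: "goal", confidence: 0.5 },
    { role: "instrument", confidence: 0.4 },
  ],
};

// Return every reading of the word, including the reading where the
// ending is not a case marker at all (assumed default: a weak "object").
function guessRoles(word) {
  const readings = [{ stem: word, role: "object", confidence: 0.3 }];
  for (const [suffix, candidates] of Object.entries(SUFFIX_ROLES)) {
    if (word.length > suffix.length && word.endsWith(suffix)) {
      const stem = word.slice(0, -suffix.length);
      for (const c of candidates) {
        readings.push({ stem, role: c.role, confidence: c.confidence });
      }
    }
  }
  return readings;
}

// A toy nountype, loosely in the { suggest: ... } shape discussed here.
// It only recognizes two made-up stems.
const noun_city = {
  suggest(text) {
    const known = ["rom", "pariz"];
    return known.includes(text) ? [{ text, score: 1.0 }] : [];
  },
};

// Combine role confidence with nountype score and rank the parses.
function scoreParses(word, nountype) {
  const parses = [];
  for (const reading of guessRoles(word)) {
    for (const s of nountype.suggest(reading.stem)) {
      parses.push({
        role: reading.role,
        text: s.text,
        score: reading.confidence * s.score,
      });
    }
  }
  return parses.sort((a, b) => b.score - a.score);
}

console.log(scoreParses("parizo", noun_city));
```

In this sketch the role guessing stays outside the nountype, which matches the split between roles and nountypes argued for above; the nountype only filters and scores the stems it is handed.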
> 2. Localizing nountypes will also be a big challenge that we haven't addressed yet. For some pairs (nountype, language) we can just replace a regexp with a different regexp, or replace a list of strings with a different list of strings, but others will need whole new algorithms. We must create an architecture where a contributor (who may not be the author or maintainer of the command feed that contains the English version of a nountype) is able to provide an entire new implementation of var noun_foo = { suggest: function(), defaults:function()} based on a new algorithm.
Agreed — let's discuss.
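As a starting point for that discussion, here is one way the "entire new implementation" case could look: a registry where a localizer can register a replacement nountype for a given language, with a fallback to the original. This is only a sketch under my own assumptions; `registerNountype`, `registerLocalizedNountype`, `getNountype` and `noun_yesno` are hypothetical names, not existing Ubiquity functions.

```javascript
// Hypothetical registries: base nountypes keyed by name, plus
// per-language replacements that a third party could contribute.
const baseNountypes = new Map();
const localizedNountypes = new Map(); // lang -> Map(name -> implementation)

function registerNountype(name, impl) {
  baseNountypes.set(name, impl);
}

function registerLocalizedNountype(lang, name, impl) {
  if (typeof impl.suggest !== "function") {
    throw new TypeError("a nountype needs a suggest() function");
  }
  if (!localizedNountypes.has(lang)) {
    localizedNountypes.set(lang, new Map());
  }
  localizedNountypes.get(lang).set(name, impl);
}

// Prefer the implementation for the user's language; otherwise
// fall back to the original (English) nountype.
function getNountype(name, lang) {
  const forLang = localizedNountypes.get(lang);
  return (forLang && forLang.get(name)) || baseNountypes.get(name);
}

// Original nountype, as shipped in the command feed.
registerNountype("noun_yesno", {
  suggest: (text) => ["yes", "no"].filter((w) => w.startsWith(text)),
});

// Replacement contributed by someone other than the feed author.
// It could just as well use a completely different algorithm.
registerLocalizedNountype("pl", "noun_yesno", {
  suggest: (text) => ["tak", "nie"].filter((w) => w.startsWith(text)),
});

console.log(getNountype("noun_yesno", "pl").suggest("ta"));
console.log(getNountype("noun_yesno", "fr").suggest("y"));
```

The fallback means a command keeps working in languages nobody has localized yet, at the cost of silently mixing languages.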
> 3. Someone on the Ubiquity team needs to learn about l20n (the successor to l10n), how it works, and how to use it. We shouldn't bother trying to use the DTD file approach of the current l10n architecture, as it is a bit limited for our purposes; instead we should leapfrog to using the cutting-edge l20n stuff, which is much better suited to our purposes. There is a set of links to l20n resources at https://wiki.mozilla.org/l20n
Seth and I talked briefly about l20n and I looked over some of those notes a little while back, but it was not clear to me how production-ready these systems are. Perhaps I missed something?
Jono, when you say "someone on the Ubiquity team" it probably should be the both of us. :)
> 4. We want to have the spiritual equivalent of a Wiki, even if it's not implemented with a wiki platform -- a place where people can go to and enter localized strings for commands. A nice side benefit is that the documentation and help strings can then be improved by any community member -- this can be to localize them but can also be to improve the quality of the documentation. This is highly structured data that we want people to enter, so a wiki supporting semantic markup (or even just a series of forms) would be more suited than a freeform edit-the-whole-page wiki.
Yes yes yes! The "just a series of forms" is what I was conceptualizing... are you talking here about command names, nountype code, etc., or also data for basic parser training (à la http://mitcho.com/blog/projects/automating-the-linguists-job/) ? Did you guys discuss that concept at all?
> 5. It is fairly straightforward to allow localization contributions to our built-in first-party commands (and nountypes) through such a wiki, but what about the case where one third-party wants to provide localization for a different third-party command feed? So for example:
> * Contributor A publishes a command feed C
> * Contributor B publishes a feed that contains the strings and metadata needed to localize feed C into Polish. Call it C.po.
> * User subscribes to C and C.po. The ubiquity core matches up the urls or some other metadata to discover that C.po is meant to apply to C. It slots the strings from C.po into the command objects from C, registers the Polish nountype implementations from C.po, etc.
Hmm... I had a great conversation with Matt Mullenweg (who I know y'all just saw as well!) recently while he was in Tokyo on plugin localization in WordPress and how it all has to go through the plugin author and how, ideally, it could be centralized (since the plugin code already is anyway). I feel like, even for third party commands, we could use the herd (or something like it) to be a central place for this type of collaboration, rather than forcing contributors to localize whole commands at a time and hosting them. Perhaps (as we talked about in the past) if you subscribe to the herd-hosted copy of a command you'll get the localizations with it?
As for the way to pull out the localizable components of command code, I have seen a couple implementations of gettext in js, though I don't know if we want to go down that road...
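For reference, the gettext idea in JS boils down to something like the following. This is a minimal sketch and not any particular library's API: `catalogs`, `setLocale` and `_` are names I made up, and the Polish string is only an example.

```javascript
// Hypothetical message catalogs, keyed by locale and then by the
// original English string (the gettext "msgid").
const catalogs = {
  pl: { "Translates text": "Tłumaczy tekst" },
};

let currentLocale = "en";

function setLocale(locale) {
  currentLocale = locale;
}

// Look up a string in the current locale's catalog. If there is no
// catalog or no entry, return the original string unchanged.
function _(msgid) {
  const catalog = catalogs[currentLocale];
  if (catalog && Object.prototype.hasOwnProperty.call(catalog, msgid)) {
    return catalog[msgid];
  }
  return msgid;
}

// A command author would wrap localizable strings in _():
setLocale("pl");
const command = {
  name: "translate",
  description: _("Translates text"),
  help: _("Select some text first"), // no Polish entry, stays English
};
console.log(command.description, "|", command.help);
```

The attraction is that command authors only have to mark strings; the cost is that the marked strings still have to be extracted from the feed somehow.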
> 6. The Polish language was designed specifically to foil all attempts at computer parsing. When the machines rise up and try to take over the world, Poland will be the last bastion of humanity.
> On a somewhat more serious note, I learned that there are some languages with unique grammars that will resist our attempts to parse them. No-one has yet come up with a satisfactory way for a computer to parse Polish (this is also true for Finnish, I think?) even though some very smart people have tried. So, instead of struggling in vain to get 100% natural-language parsing, we will have to settle for accepting input in "pidgin Polish" or "pidgin Finnish" -- focus on the 20% of the work that will provide 80% of the benefit to Polish or Finnish users and hope that they will forgive our mangling of their tongues.
> Remember that parsing a full sentence is only required for a minority of inputs, anyway. The majority of Ubiquity input is likely to be much simpler -- single-word verb-only commands, single nouns followed by selecting a suggested verb, or verb + single noun sentences.
Great point. This, I believe, is the huge benefit we have over other similar "natural language" projects of the past... by greatly limiting the scope of what we're doing I think there is great potential for us to make this interface easier to use for more people with (relatively) minimal work. Many of the greatest complexities in natural language understanding (recursion, quantifiers, negation, etc.) are expressly ruled out as non-problems for this project based on our goal.
> The point was also raised that it may be useful to define different levels of approximation to natural language input, with some verbs and nountypes supporting a higher level than others, and graceful degradation when higher levels are not available. I am not exactly sure where this idea fits into the overall vision, though.
Hmm... interesting idea... sort of goes with Pike's thought on users who may want to use multiple languages in different contexts... and with that, I will go respond to that email. :D
Thanks again for all the great ideas! Let's keep this conversation going (perhaps splitting it into different threads).
I'd like to avoid explicit versioning if at all possible - it's a bad enough problem with slowly changing products, let alone things like command feeds that can change often and are more social-based. Either way, such versioning applies to the code, yet the actual strings may not change (or the other way around).
Anyway, what I'd like to see is automatic detection of whether a feed localization applies to a given copy of a feed or not. Essentially, versioning of strings, not the whole feed.
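Roughly what I have in mind, as a sketch only: each localized string remembers the source string it was translated from, and it is applied only if the feed still contains exactly that source string. The `applyLocalization` function and the `{ source, target }` entry shape are hypothetical, invented for this example.

```javascript
// Apply a localization string by string. An entry is used only when the
// command's current string still equals the source it was translated
// from; otherwise the original string is kept and the key is reported.
function applyLocalization(command, localization) {
  const localized = Object.assign({}, command);
  const stale = [];
  for (const [key, entry] of Object.entries(localization)) {
    if (command[key] === entry.source) {
      localized[key] = entry.target;
    } else {
      stale.push(key);
    }
  }
  return { localized, stale };
}

// The feed author changed the description after it was translated.
const command = {
  name: "translate",
  description: "Translates the selected text.",
};

const polish = {
  name: { source: "translate", target: "przetłumacz" },
  description: {
    source: "Translates the selection.",
    target: "Tłumaczy zaznaczenie.",
  },
};

const result = applyLocalization(command, polish);
console.log(result.localized, result.stale);
```

With this approach a change to one string only disables that one translation, instead of the whole localization feed going dark until its version number is bumped.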
Hi, all.
Here is my list of the parts of Ubiquity that need localizing under the
current parser (version 1).
1. HTML pages
* Just translate the page. :)
2. Nountypes
* Each nountype's name.
* Some nountypes have parts of their code that need to be localized
(e.g. LanguageCodes and noun_type_date).
3. Verbs
* Each verb's name.
* The names of the modifiers ("to", "in", and so on) embedded in each
verb command.
I think this is the biggest obstacle to localizing on parser 1.
(The new parser 2 solves this issue.)
4. Command previews
* Need support for left-branching languages (like Japanese).
5. Language-specific parser
* As you know, this is the most difficult part to localize.
* Even mitcho's new parser 2 needs more flexible logic, not
algorithms.
Still, the new parser 2 is better than parser 1 for localizing. :)
No, the list only indicates the places that need localizing. Using the DTD system is the better way.
FYI, the old in-product Firefox help contained complete HTML pages for each
locale.
On May 1, 2:03 AM, Blair McBride <unfocu...@gmail.com> wrote:
> You mean, copy all markup and change the text content? The usual DTD
> system seems easier and less problematic (for both Ubiq devs and
> localizers).
> - Blair
> On 30/4/09 2:07 AM, marsf wrote:
> > 1. HTML pages
> > * Just translate the page.:)
I just posted on my blog regarding some of the difficulties with strongly case-marked languages. Highlights include the idea that, for languages like German which mark case on the article, we can just pretend they're prepositions, but otherwise we'll most likely want to require users to use prepositions/postpositions. I believe this is a good compromise between the user and the parser.
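To illustrate the "pretend the articles are prepositions" idea, here is a small sketch. The names (`ARTICLE_CASES`, `CASE_TO_ROLE`, `splitArguments`) and the mapping from grammatical case to role are assumptions made for this example and not Parser 2 code; only a few article forms are listed.

```javascript
// Assumed mapping from grammatical case to a role name.
const CASE_TO_ROLE = new Map([
  ["accusative", "object"],
  ["dative", "goal"],
]);

// A few German articles and the cases they can mark. "den" is
// ambiguous: accusative masculine singular or dative plural.
const ARTICLE_CASES = new Map([
  ["den", ["accusative", "dative"]],
  ["dem", ["dative"]],
  ["einen", ["accusative"]],
  ["einem", ["dative"]],
]);

// Treat each known article as a delimiter that opens a new argument,
// exactly as a preposition would. Words before the first article
// (typically the verb) are ignored in this sketch.
function splitArguments(input) {
  const args = [];
  let current = null;
  for (const word of input.split(/\s+/).filter(Boolean)) {
    const cases = ARTICLE_CASES.get(word.toLowerCase());
    if (cases) {
      current = {
        marker: word,
        roles: cases.map((c) => CASE_TO_ROLE.get(c)),
        words: [],
      };
      args.push(current);
    } else if (current) {
      current.words.push(word);
    }
  }
  return args;
}

console.log(JSON.stringify(splitArguments("schick dem Freund den Link")));
```

Ambiguous articles simply yield more than one candidate role, and the later stages of the parser would have to pick between them.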