crazy verb matching

"mitcho (Michael 芳貴 Erlewine)"

Jul 1, 2009, 12:13:02 PM
to ubiqui...@googlegroups.com, ubiqui...@googlegroups.com
We have what I believe is a big verb matching issue in Japanese (a no-
space language)... here's the pattern. If you can figure it out, I'll
buy you a beer: (Testing this requires the latest changes to the ja
localizations I just committed.)

Background: we have a verb in Japanese called "Flickrで検索する",
which is "search with flickr"

1. Input: "flick", we get the flickr verb as an option (it's kind of
low, but okay)
2. Input: "flickr", we get the flickr verb as an option (kind of low,
same as (1))
3. Input: "花火をflick", we get the flickr verb. OK. (searching
flickr for fireworks)
4. Input: "花火をflickr", we *DO NOT* get the flickr verb. AT ALL.

(I recommend you try all of these in the playpen so you can see lots
of suggestions... the scoring is not the issue... some parses aren't
being generated at all.)

If you use the playpen and turn on the "display parse info", you'll
see that in input 4, "Flickrで検索する" is not showing up as a
verb match at all.

It looks like in input (4), the final "r" is being picked up as the
first letter of "run selector-selector" and it thus isn't finding the
longer "flickr" match. This is a huge bug. We should be getting *both*
of those verb matches... the "flickr" match should simply get a higher
verb score.

mitcho

--
mitcho (Michael 芳貴 Erlewine)
mit...@mitcho.com
http://mitcho.com/
linguist, coder, teacher

Aza

Jul 1, 2009, 5:17:56 PM
to ubiqui...@googlegroups.com
I know that this is a suboptimal solution, but how about requiring spaces in Japanese...

-- aza | ɐzɐ --

satyr

Jul 1, 2009, 5:18:02 PM
to ubiqui...@googlegroups.com
> 3. Input: "花火をflick", we get the flickr verb. OK. (searching
> flickr for fireworks)
> 4. Input: "花火をflickr", we *DO NOT* get the flickr verb. AT ALL.

I don't get parses at all in playpen with these inputs. The parsing
seems to stop before step 9 somehow.

Anyhow, this seems like an inherent problem of trying to do the final
match with a single regexp.

demoParserInterface.currentParser._patternCache.verbFinalTest("花火をflickr")
// => ["花火をflickr", "花火をflick", "r"]

I think we want suffixes for verb-final matching. Not prefixes.
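
To make the single-regexp problem concrete, here is a minimal sketch that reproduces the split shown above. The verb list and the shape of the pattern are assumptions for illustration; this is not the actual pattern behind `verbFinalTest`.

```javascript
// Hypothetical verb names; not the real Ubiquity ja verb list.
const verbNames = ["flickr", "run selector-selector"];

// Every non-empty prefix of a verb name: "f", "fl", ..., "flickr".
function prefixesOf(name) {
  const out = [];
  for (let i = 1; i <= name.length; i++) out.push(name.slice(0, i));
  return out;
}

// Escape regexp metacharacters so verb names are matched literally.
function escapeRe(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const alternation = verbNames.flatMap(prefixesOf).map(escapeRe).join("|");

// One regexp for all verbs: a greedy argument group followed by any verb prefix.
const verbFinal = new RegExp("^(.*)(" + alternation + ")$", "i");

// The greedy (.*) swallows as much as it can, so the verb group only
// ever receives the shortest suffix that happens to be a verb prefix.
const m = verbFinal.exec("花火をflickr");
console.log(Array.from(m));
// => ["花火をflickr", "花火をflick", "r"]
```

Whatever the quantifiers, one `exec` call returns one split, so the "flickr" reading and the "r" reading cannot both come out of a single match.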

satyr

Jul 1, 2009, 5:45:19 PM
to ubiqui...@googlegroups.com
> I don't get parses at all in playpen with these inputs. The parsing
> seems to stop before step 9 somehow.

Update on this behavior. I get this error when it stops:

Error: [Exception... "'JavaScript component does not have a method
named: "onTextEntered"' when calling method:
[nsIAutoCompleteInput::onTextEntered]" nsresult: "0x80570030
(NS_ERROR_XPC_JSOBJECT_HAS_NO_FUNCTION_NAMED)" location: "<unknown>"
data: no]

Looks like this is related to Utils.history and nouns using it
(noun_type_url/noun_type_awesomebar).

Jono DiCarlo

Jul 1, 2009, 6:04:20 PM
to Ubiquity i18n
 スペース は 日本語 に 入力 が 難しい です。 ("Spaces are hard to type in Japanese.")
The spacebar is used to switch between different character matches to
the romaji you input. Inputting spaces is not impossible, but it's
like asking an English typist to put a tab character between each word
-- it throws off your typing really badly. That's not just suboptimal,
it's a complete non-solution.
--Jono

Jono DiCarlo

Jul 1, 2009, 6:05:55 PM
to Ubiquity i18n
Mitcho: Can you clarify -- is this a problem that happens only when a
user uses romaji to enter the verb name?
Isn't an input like "花火をflick" very unnatural, because the user would
have to switch input modes mid-sentence?
--Jono

"mitcho (Michael 芳貴 Erlewine)"

Jul 1, 2009, 7:53:04 PM
to ubiqui...@googlegroups.com
Japanese keyboards have "English" and "Japanese" keys on them (though
mine doesn't) so many people are used to entering text like this.

I think satyr might be right: what this might mean is that we just
can't use one single regexp to pick out all possible verb matches... we
might have to have separate ones for each verb and run them on the
input. I don't know how much of a performance hit that will be...
probably not much, actually, as we'd cache each of those regexps ahead
of time.
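
A rough sketch of what per-verb cached regexps could look like. The verb list, `buildVerbFinalRe`, `verbFinalCache` and `matchVerbFinal` are all hypothetical names for illustration, not existing parser code.

```javascript
// Hypothetical verb names; not the real Ubiquity ja verb list.
const verbNames = ["flickr", "run selector-selector"];

// Escape regexp metacharacters so verb names are matched literally.
function escapeRe(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build "^(.*?)(f|fl|fli|...|flickr)$" for a single verb name.
function buildVerbFinalRe(name) {
  const prefixes = [];
  for (let i = 1; i <= name.length; i++) {
    prefixes.push(escapeRe(name.slice(0, i)));
  }
  // The lazy (.*?) finds the earliest split point, i.e. the longest
  // stretch of input that is still a prefix of this verb.
  return new RegExp("^(.*?)(" + prefixes.join("|") + ")$", "i");
}

// Compile once, ahead of time; reuse on every keystroke.
const verbFinalCache = new Map(
  verbNames.map((name) => [name, buildVerbFinalRe(name)])
);

// Run every cached regexp against the input and collect all hits.
function matchVerbFinal(input) {
  const hits = [];
  for (const [verb, re] of verbFinalCache) {
    const m = re.exec(input);
    if (m) hits.push({ verb, args: m[1], typed: m[2] });
  }
  return hits;
}

console.log(matchVerbFinal("花火をflickr"));
// => [ { verb: "flickr", args: "花火を", typed: "flickr" },
//      { verb: "run selector-selector", args: "花火をflick", typed: "r" } ]
```

Each verb contributes at most one match (its longest), so both readings of input 4 survive and scoring can pick between them.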

m

"mitcho (Michael 芳貴 Erlewine)"

Jul 1, 2009, 8:59:00 PM
to ubiqui...@googlegroups.com
Alright y'all, the immediate tide has been stemmed... there was a
legit bug in the verb final matching regexp so it was trying to find
the verb at the end in a non-greedy fashion, preferring the match "r"
as a verb to "flickr" in things like "花火をflickr". I think that's
why the input "flickr" was working, as then it was being caught by the
verb-initial match regexp.

This still means that it won't pick up matches of different lengths
and try them all against the verbs... something to think about for the
near future.
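
For the "matches of different lengths" part, one possible approach is to skip the regexp and try every split point directly. This is only a sketch: the verb list and `verbFinalCandidates` are made up for illustration, and the ranking by typed length is a stand-in for real verb scoring.

```javascript
// Hypothetical verb names; not the real Ubiquity ja verb list.
const verbNames = ["flickr", "run selector-selector"];

// Try every split point of the input: the text before the split is the
// argument string, the text after it is what the user typed of the verb.
function verbFinalCandidates(input, names) {
  const candidates = [];
  for (let i = 0; i < input.length; i++) {
    const typed = input.slice(i);
    const typedLower = typed.toLowerCase();
    for (const name of names) {
      if (name.toLowerCase().startsWith(typedLower)) {
        candidates.push({ verb: name, args: input.slice(0, i), typed });
      }
    }
  }
  // Illustrative ranking only: longer typed verb text comes first.
  return candidates.sort((a, b) => b.typed.length - a.typed.length);
}

console.log(verbFinalCandidates("花火をflickr", verbNames));
// => [ { verb: "flickr", args: "花火を", typed: "flickr" },
//      { verb: "run selector-selector", args: "花火をflick", typed: "r" } ]
```

This is O(input length × number of verbs) per parse, which may or may not be acceptable; it does return every split rather than just one.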

mitcho