Search-only senses

Kim Ahlström

Dec 6, 2023, 2:34:00 AM

to EDICT-JMdict

Hi folks,

A Jisho.org user recently emailed me to ask why he could not find 着陸 when searching for "to land". Since it's tagged as a noun and suru verb, the definition is written in noun form (landing; alighting; touch down).

The editorial policy specifically states that these entries should not include verb glosses, but allows them for entries where the verb sense cannot be easily derived from the noun sense, and for vs entries that are not also n.

Is the intent here that verb senses could be derived computationally by dictionary software for vs+n entries to make them findable as verbs? A computational approach seems within the realm of possibility, but a human-curated approach would be more accurate.

Since we now have search-only readings, could we introduce search-only senses or glosses to make finding these vs+n entries easier when searching in English using verb forms?

Cheers

Kim

Jim Breen

Dec 7, 2023, 1:31:27 AM

to edict-...@googlegroups.com

Thanks, Kim, for raising this.

Support for E->J lookups has always been a thing of interest, and we
have often included glosses that can assist. That said, it's recognized
that the practice of not including verb or adjective glosses for
(n,vs) and (n,adj-*) entries can make such lookups difficult. You
won't easily find 料理 by looking up "to cook". About 20 years ago I
did some experimenting within WWWJDIC with taking a search key such as
"to XXXX" and converting it to possible targets such as "XXXXing". It
was noisy and only partially successful, and eventually I gave up.
(Ironically it would have worked with 着陸.)
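
A minimal sketch of the kind of key rewriting described above, in Python. It is not a reconstruction of the WWWJDIC experiment: the function name and the spelling rules (drop a final "e", double a final consonant) are assumptions chosen for illustration.

```python
VOWELS = "aeiou"

def gerund_candidates(query: str) -> list[str]:
    """Turn a key such as 'to land' into possible '-ing' search targets."""
    words = query.lower().split()
    if len(words) != 2 or words[0] != "to":
        return []  # only handles the simple "to XXXX" pattern
    verb = words[1]
    candidates = [verb + "ing"]  # land -> landing
    if verb.endswith("e") and not verb.endswith("ee"):
        candidates.append(verb[:-1] + "ing")  # make -> making
    if (len(verb) >= 3 and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        candidates.append(verb + verb[-1] + "ing")  # shop -> shopping
    return candidates

print(gerund_candidates("to land"))
print(gerund_candidates("to shop"))
```

The rules over-generate on purpose (a key like "to shop" also yields a misspelled candidate), which is one source of the noise mentioned above. A key like "to marry" only produces "marrying", so an entry glossed as "marriage" would still be missed.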

Certainly adding verb glosses, either as new senses or within the
existing senses would help, but it would be a major task - about
13,000 entries are of the "n,vs" variety. I hadn't even thought about
"hidden glosses", but it's an interesting concept. Rough versions
could be created automatically, but I think human involvement would be
needed to get any reliability, and if work is going to be needed the
results may as well be visible.

If you look at the 着陸 entry in GG5, it has:
(a) landing; alighting; 〔接地〕 a touchdown.
～する land; make a landing; alight; 〔接地〕 touch [put, set] down.

You could envisage the current JMdict glosses ("landing; alighting;
touch down") being extended with something like:
"{vf} land; alight; set down". That would allow dictionary systems to
respond to keys such as "to land". A sense extension of this form
would not upset the sense numbering.

Anyway, food for thought, and thanks for raising it. I'll be
interested to see what the community thinks.

Jim

> --
> You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/02d86fe4-7412-4352-91da-f3125a55dc3an%40googlegroups.com.

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

Chris Vasselli

Dec 7, 2023, 8:25:13 PM

to 'Paul Blay' via EDICT-JMdict

Personally my instinct is that this should be handled by clients computationally, but I could be convinced otherwise!

It seems like part of a broader problem of matching user queries that use words in a different form from how they're written in the dictionary. For example, someone might still search for "made a landing" instead of "make a landing", even if "make a landing" were added to the dictionary. So you still need to deal with transforming user queries in some way. In my iOS app I use the "porter" tokenizer of sqlite3 for this for English, and Apple's built-in natural language lemma support for non-English languages.
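
As a rough illustration of the stemming approach mentioned above, here is a sketch using the "porter" tokenizer through Python's built-in sqlite3 module. It assumes an SQLite build with FTS5 enabled; the table and column names are made up for the example and are not taken from any real app.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 table whose tokenizer applies Porter stemming to both the
# indexed glosses and the query terms.
conn.execute(
    "CREATE VIRTUAL TABLE glosses "
    "USING fts5(headword UNINDEXED, gloss, tokenize='porter')"
)
conn.execute(
    "INSERT INTO glosses VALUES (?, ?)",
    ("着陸", "landing; alighting; touch down"),
)

def search(query: str) -> list[str]:
    """Return headwords whose glosses match the (stemmed) query."""
    rows = conn.execute(
        "SELECT headword FROM glosses WHERE glosses MATCH ?", (query,)
    )
    return [row[0] for row in rows]

print(search("land"))
print(search("landed"))
```

Both "land" and "landed" reduce to the same stem as "landing", so a single-word query finds the entry. Multi-word queries are combined with an implicit AND in FTS5, so any extra word in the query also has to appear in the gloss.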

Granted, it's not perfect, and I just checked and my app also fails to find 着陸 for "to land". But I feel like trying to solve this with manual additions to JMdict will only solve one small part of a larger problem that kind of inherently needs a computational solution.

Just my initial thought though, curious to hear what others think.

Chris

Kim Ahlström

Dec 8, 2023, 2:24:59 AM

to edict-...@googlegroups.com

I wonder if there are two separate problems to solve here, with some overlap: having glosses that explain the full range of uses of a headword, and handling user queries in a form that’s not in the dictionary data.

I think my original idea of search-only senses only solves the latter, and I agree that computational tools can go a long way here. As Jim mentioned, it can be hard to get this working well though. For example, I think it would be hard for a lemmatizer to find 結婚 when searching for “marry”, since the noun entry is “marriage” and they are separate lemmas. Ironically some stemmers do a better job here, but I’ve generally avoided them since the output is not always natural language.

I think there are several upsides to adding verb forms to n,vs entries. It would make common Japanese words findable using common English words. It would benefit all clients using JMdict, not just the systems that implement linguistic smarts. It would also clarify word usage. Someone searching for “to marry” and finding 結婚/“marriage” would not necessarily know that this is the most common way, or a way at all, to write “to marry”. This could be especially hard for non-native English speakers.

I quite like Jim’s idea of delineating verb forms with something like {vf}, since it would allow clients to format the entry as they prefer - ~する like GG5, or maybe an English explanation like “as a verb:”, without requiring changes to the XML schema. It would still require some language smarts to turn “to land” queries into “land”, but would be simple enough that clients could do it by brute force without a separate stemmer/lemmatizer.
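
A minimal sketch of how a client might handle this, assuming the {vf} marker sits inside the gloss text and separates noun glosses from verb glosses as in Jim's example. The marker placement, the function names, and the "to " stripping rule are all assumptions, not an agreed format.

```python
def split_gloss(text: str) -> tuple[list[str], list[str]]:
    """Split gloss text into (noun glosses, verb glosses) at the {vf} marker."""
    noun_part, _, verb_part = text.partition("{vf}")

    def clean(part: str) -> list[str]:
        return [g.strip() for g in part.split(";") if g.strip()]

    return clean(noun_part), clean(verb_part)

def normalize_query(query: str) -> str:
    """Brute-force handling of infinitive queries: 'to land' -> 'land'."""
    q = query.strip().lower()
    return q[3:] if q.startswith("to ") else q

def matches(query: str, gloss_text: str) -> bool:
    nouns, verbs = split_gloss(gloss_text)
    q = normalize_query(query)
    return q in nouns or q in verbs

entry = "landing; alighting; touch down {vf} land; alight; set down"
nouns, verbs = split_gloss(entry)
print(nouns)
print("～する " + "; ".join(verbs))  # GG5-style display of the verb glosses
print(matches("to land", entry))
```

Because the verb glosses come back as a separate list, each client can choose its own presentation (～する, "as a verb:", or hiding them entirely) while still using them for search.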

Since adding the verb forms would be quite an undertaking, maybe a combined approach could be used: a one-time computational process to add verb forms as hidden glosses. These could then be looked over bit by bit by editors, starting with more common words, turning them into well-written visible glosses. Yes, I’m aware that I’m asking for a lot from the editorial group here 😅 But the more I think about this approach the less I like it. I shudder a bit at the thought of having machine-made text inside JMdict, even if it would be hidden data.

Cheers

Kim

Adam Nohejl

Dec 8, 2023, 3:46:47 AM

to edict-...@googlegroups.com

Hi everyone,

I also feel that it's better to think about this as the two problems outlined by Kim, even though there is potentially a lot of overlap, and definitely a lot of room for fuzzy search. Let me add a few points:

Why adding English verb glosses to n,vs entries is a good idea:
- All other dictionaries I know do that (at least if the verb usage is common), people expect that.
- Many of these words are used as verbs in most of their occurrences, and noun (gerund) glosses often feel a little forced (e.g. 就職: finding employment; getting a job).
- Adding the verb glosses would make "n,vs" entries consistent with vs-only entries, which already have them.
- In addition to improving E->J search, verb glosses would be a clear and conspicuous way of stating that the word can be used as a verb (compared to the POS tags, which are not very conspicuous in most apps and I assume most users don't bother to read them).
- Not all applications are going to do elaborate client-side processing of search queries. High-quality on-the-fly processing (beyond lemmatization/stemming/rules) would require either a large language model or a small on-device fine-tuned model.

What might be a good way to start adding the glosses:
- Use a corpus to determine vs+n that are frequently used as verbs (e.g. have high N+する frequency, ignoring other occurrences), so that we know which entries need the verb senses the most.
- Optionally, use a "computational method" to annotate them with provisional verb glosses to be reviewed by humans.

As for the "computational method": With very little money it would be easy to do using a commercial LLM. I tried ChatGPT on a dozen examples and it did a pretty good (i.e. educated human-grade) job, given that I used only the English glosses. Adding the Japanese words may (or may not) improve it. GPT 3.5 now costs $0.001/$0.002 for 1K input/output tokens, so potentially we could get provisional glosses for a thousand entries for a few dollars. The glosses would still require human review, but this would save a ton of work. It seems that there are only 13,969 "vs" entries (I guess most of them "n,vs"), correct?
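
The cost arithmetic can be sketched as follows. The prices are the ones quoted above; the per-entry token counts (200 input, 50 output) are pure assumptions for illustration and would need to be measured on real prompts.

```python
INPUT_PRICE_PER_1K = 0.001   # USD per 1K input tokens, as quoted above
OUTPUT_PRICE_PER_1K = 0.002  # USD per 1K output tokens, as quoted above

def estimate_cost(entries: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one LLM call per entry."""
    input_cost = entries * input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = entries * output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

# Assumed token counts per entry: 200 in (prompt + glosses), 50 out.
print(f"1,000 entries:  ${estimate_cost(1000, 200, 50):.2f}")
print(f"13,969 entries: ${estimate_cost(13969, 200, 50):.2f}")
```

The result scales linearly with both the number of entries and the prompt length, so a longer prompt (for example, one that includes few-shot examples) raises the estimate proportionally.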

--
Adam Nohejl

Chris Vasselli

Dec 9, 2023, 5:55:23 PM

to 'Paul Blay' via EDICT-JMdict

I like this approach. LLMs seem like they could make quick work of this, but are too expensive/slow to run client-side. Moving that to the database generation layer, and then having human review, seems like a good idea.

Chris

Kim Ahlström

Jan 14, 2024, 8:36:08 PM

to edict-...@googlegroups.com

Thanks Adam, I agree with your points.

I have also played around with ChatGPT a bit and was pleasantly surprised that it handled queries like “Turn this into verb form: shopping; purchased goods”, and returned “to shop”, ignoring the (in this case nonsensical) “purchased goods”.

Jim, how much work would it be to add your {vf} suggestion to JMdict? Do you imagine it as an attribute on the gloss tag? Or add these as text additions inside the current gloss tags, like this?

<gloss>marriage, {vf} marry</gloss>

Cheers

Kim

Jim Breen

Jan 18, 2024, 2:05:13 AM

to edict-...@googlegroups.com

> Jim, how much work would it be to add your {vf} suggestion to JMdict? Do you imagine it as an attribute on the gloss tag?
> Or add these as text additions inside the current gloss tags, like this?
>
> <gloss>marriage, {vf} marry</gloss>

I was thinking more along the lines of:

<gloss>marriage</gloss>
<gloss g_type="vf">marry</gloss> or <gloss g_type="vf">to marry</gloss>
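
A minimal sketch of how a client could read such glosses, using Python's standard xml.etree.ElementTree. The "vf" value for g_type is only the proposal above, not part of the current data, and the sample sense is a simplified fragment rather than a full JMdict entry.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sense fragment using the proposed g_type="vf".
SENSE_XML = """
<sense>
  <gloss>marriage</gloss>
  <gloss g_type="vf">to marry</gloss>
</sense>
"""

def split_glosses(sense: ET.Element) -> tuple[list[str], list[str]]:
    """Return (ordinary glosses, verb-form glosses) for one sense."""
    ordinary, verb_forms = [], []
    for gloss in sense.findall("gloss"):
        target = verb_forms if gloss.get("g_type") == "vf" else ordinary
        target.append(gloss.text or "")
    return ordinary, verb_forms

ordinary, verb_forms = split_glosses(ET.fromstring(SENSE_XML))
print(ordinary)
print(verb_forms)
```

A client that does not know about the new value would most likely treat both glosses alike, which is part of why a change like this affects every consumer of the file.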

It's a major step, and I'm not really sure it's one we should make.
There are alternatives, such as having an auxiliary file with
something like:

1254790 1 to marry

which apps could use to map "to marry" to entry 1254790, sense 1.
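
A sketch of how an app might load such an auxiliary file, assuming one record per line with whitespace-separated fields (entry number, sense number, then the gloss). The exact file layout is an assumption based on the single example line above.

```python
def load_verb_index(lines):
    """Map each verb gloss to a list of (entry number, sense number) pairs."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seq, sense, gloss = line.split(maxsplit=2)
        index.setdefault(gloss.lower(), []).append((int(seq), int(sense)))
    return index

index = load_verb_index(["1254790 1 to marry"])
print(index.get("to marry"))
```

Keeping the lookup keyed on the gloss text means the app can resolve a query such as "to marry" to an entry and sense without any change to the main JMdict XML.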

I probably should raise this as an issue in the GitHub set.

Jim

Kim Ahlström

Feb 1, 2024, 4:43:51 AM

to edict-...@googlegroups.com

Thanks Jim. Makes sense that this is a big change. Maybe better suited for JMdict NG?

I’ve been thinking of the separate file approach as well - it’s what I’m planning on doing to align Jreibun with JMdict (no updates on when that is happening.)

The one problem is that if this data is edited outside of the regular JMdict editing workflow, it’s quite likely to drift when JMdict is edited. I suppose if a computational/ChatGPT approach is taken, it’s possible to build a system that auto-edits the file when JMdict is edited.

I posted to this email list purely out of habit. Should I be using the GitHub list in the future?

Cheers
Kim
