Backwards cross-reference functionality using JMdict XML data


Stephen Kraus

Feb 25, 2023, 5:05:50 PM
to EDICT-JMdict
There is some discussion taking place in the comments on JMdict entry #2761770 regarding backwards references in applications which use JMdict data.

Brian Krznarich wrote:
> Usually it doesn't seem a great loss that jisho.org doesn't display backwards references, but this feels like an exception.

Robin Scott wrote:
> I think back-references are helpful for editing purposes but not so helpful for learners. And they'd introduce a lot of clutter on some entries.

Jim Breen wrote:
> Just a quick comment about "jisho.org doesn't display backwards references". You are only seeing them here because you are getting a maintenance view of a fairly complex relational database. The distributed dictionary file (jmdict.xml) only contains forward references. AFAIK no app tries to do backward ones - it would be very difficult to implement.

I have some experience attempting to implement this functionality using only the data provided in the JMdict XML file, so I just thought I'd share some thoughts.

The cross-references in the XML file do not contain the sequence numbers (entry IDs) of the referenced entries, so in many cases the exact identity of the entry being referenced is ambiguous. For example, entry 2579910 (ご本) contains a reference to 「本」, but the reading is not specified. So in principle this could be a reference to either 本・ほん (entry 1260670) or 本・もと (entry 1522150). Similarly, entry 2853252 contains a reference to sense #2 of 「いかん」 without a kanji form specified. There are 13 different entries in JMdict which have いかん as a valid reading.
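
To make the ambiguity concrete, here is a rough sketch of splitting the text of a reference into its parts. It assumes the reference text is a kanji form and/or reading, optionally followed by a sense number, with the fields separated by a centre dot (・). The names `XRef`, `is_kana` and `parse_xref` are made up for this illustration and are not part of any JMdict tool.

```python
from dataclasses import dataclass
from typing import Optional

SEPARATOR = "\u30fb"  # the centre dot "・" assumed to separate the fields


@dataclass
class XRef:
    kanji: Optional[str] = None
    reading: Optional[str] = None
    sense: Optional[int] = None


def is_kana(text: str) -> bool:
    """True if every character lies in the hiragana or katakana blocks."""
    return all("\u3041" <= ch <= "\u30ff" for ch in text)


def parse_xref(text: str) -> XRef:
    """Split reference text into kanji form, reading and sense number.

    Caveat: a headword that itself contains a centre dot would be split
    incorrectly by this naive approach.
    """
    parts = text.split(SEPARATOR)
    ref = XRef()
    # A trailing all-digit field is taken to be the sense number.
    if len(parts) > 1 and parts[-1].isdigit():
        ref.sense = int(parts.pop())
    if len(parts) == 2:
        ref.kanji, ref.reading = parts
    elif len(parts) == 1:
        # A lone field is a reading if it is all kana, else a kanji form.
        if is_kana(parts[0]):
            ref.reading = parts[0]
        else:
            ref.kanji = parts[0]
    else:
        raise ValueError(f"unexpected reference format: {text!r}")
    return ref
```

Whichever fields come back as `None` are the ones that have to be guessed: the reference to 「本」 has no reading, and the reference to 「いかん」 has no kanji form.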

So which of those entries do we insert backwards references into? The easy answer is "all of them," although that is clearly not ideal. We can do better by applying some heuristics to filter out the entries that are unlikely to be the intended destinations of the references. I applied four different criteria in my filter:

  1. The number of senses. Even though 13 entries have 「いかん」 as a reading, only two of these entries contain more than one sense. Since the reference is to sense #2, we can rule out 11 of the entries.
  2. The position of the headword in the entry. Since 本 is the first headword in entry #1260670 (ほん) and the second headword in entry #1522150 (もと), we can assume that the reference is to the former rather than the latter.
  3. Frequency tags. If one headword is tagged as a priority term and the other is not, then we can prioritize the former.
  4. Finally, the sequence numbers (IDs) of the entries. These numbers don't actually provide any useful information, but if we select the entry with the smallest ID (all else being equal), we can always narrow our search down to a single entry.

This method can't produce 100% accurate results, but I found that it worked correctly the vast majority of the time.
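
The four criteria can be sketched roughly as follows. This is a simplified illustration, not the actual yomichan-import code: the `Entry` structure, the `position` and `resolve` helpers, and the order in which the criteria are applied as tie-breakers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Entry:
    seq: int                      # sequence number (entry ID)
    kanji: List[str]              # kanji forms, in entry order
    readings: List[str]           # readings, in entry order
    sense_count: int
    priority_forms: Set[str] = field(default_factory=set)  # forms with frequency tags


def position(entry: Entry, term: str) -> Optional[int]:
    """Index of the term among the entry's kanji forms, else among its readings."""
    for forms in (entry.kanji, entry.readings):
        if term in forms:
            return forms.index(term)
    return None


def resolve(entries: List[Entry], term: str, sense: Optional[int] = None) -> Optional[Entry]:
    """Pick the most plausible target entry for a reference to `term`."""
    ranked = []
    for entry in entries:
        pos = position(entry, term)
        if pos is None:
            continue
        # Criterion 1: the entry must have at least as many senses as referenced.
        if sense is not None and entry.sense_count < sense:
            continue
        # Criteria 2-4 as a sort key: earlier headword position first, then
        # priority-tagged forms (False sorts before True), then smallest ID.
        key = (pos, term not in entry.priority_forms, entry.seq)
        ranked.append((key, entry))
    if not ranked:
        return None
    return min(ranked, key=lambda pair: pair[0])[1]


# Toy data: the IDs are the ones mentioned above, but the other fields
# (first headword of the もと entry, sense counts, tags) are made up.
entries = [
    Entry(1522150, ["元", "本"], ["もと"], sense_count=3),
    Entry(1260670, ["本"], ["ほん"], sense_count=3, priority_forms={"本"}),
]
target = resolve(entries, "本")
```

With this toy data, `target` is the ほん entry, because 本 is its first headword. Since the entry ID is the last element of the sort key, two candidates can never tie.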

Last year I produced a test version of JMdict for Yomichan which contained these backwards references. After using it for many months, I came around to Robin's opinion that these notes mostly just clutter entries rather than provide useful information.

However, there's still a good purpose for this disambiguation method. If we want to display a glossary preview of a referenced entry inside of the entry which references it, we need to know exactly which entry is being referenced. I implemented this functionality in my latest version of JMdict for Yomichan. (Unfortunately, Yomichan is now abandoned as of today, so I guess not many people are going to be able to use it.)

Of course, if sequence numbers were included explicitly in the reference elements of the JMdict XML file, then none of this would be necessary. This has already been noted on the JMdict: Next Generation page.

I didn't think it would be appropriate to post this in the comments on JMdictDB or as a new Github issue, so I thought I'd post it here. Hopefully someone else might find this interesting or useful.

Stuart McGraw

Feb 25, 2023, 5:57:19 PM
to edict-...@googlegroups.com
As you noted, the new xml format will have entry sequence numbers in the xrefs. Unfortunately I am the blocking factor on getting the new format implemented. For various personal reasons my life became unexpectedly complicated during the pandemic and I have been finding it hard to get enough time to finish up the implementation. It has been proceeding though, albeit slowly (a prerequisite, the switch of jmdictdb from cgi to wsgi at edrdg.org was mostly complete around the end of last year).

-- Stuart


On 2/25/23 15:05, Stephen Kraus wrote:
> There is some discussion taking place in the comments on JMdict entry #2761770 <https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2761770.1> regarding backwards references in applications which use JMdict data.
>
> Brian Krznarich wrote:
> > Usually it doesn't seem a great loss that jisho.org doesn't display backwards references, but this feels like an exception.
>
> Robin Scott wrote:
> > I think back-references are helpful for editing purposes but not so helpful for learners. And they'd introduce a lot of clutter on some entries.
>
> Jim Breen wrote:
> > Just a quick comment about "jisho.org doesn't display backwards references". You are only seeing them here because you are getting a maintenance view of a fairly complex relational database. The distributed dictionary file (jmdict.xml) only contains forward references. AFAIK no app tries to do backward ones - it would be very difficult to implement.
>
> I have some experience attempting to implement this functionality using only the data provided in the JMdict XML file, so I just thought I'd share some thoughts.
>
> The cross references in the XML file do not contain the sequence numbers (Entry IDs) of the referenced entries, so in many cases the exact identity of the entry being referenced is ambiguous. For example, entry 2579910 <https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2177701> (ご本) contains a reference to 「本」, but the reading is not specified. So in principle this could be a reference to either 本・ほん (entry 1260670)  or 本・もと (entry 1522150).  Similarly, entry 2853252 <https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2175783> contains a reference to sense #2 of 「いかん」 without a kanji form specified. There are 13 different entries in JMdict which have いかん as a valid reading.
>
> So which of those entries do we insert backwards references into? The easy answer is "all of them," although that is clearly not ideal. We can do better by applying some heuristics to filter out the entries that are unlikely to be the intended destinations of the references. I applied four different criteria in my filter: <https://github.com/FooSoft/yomichan-import/blob/00dc44386e68850d7e55ea01ed7da868f1021256/jmdict_references.go#L140-L170>
>
> 1. The number of senses. Even though 13 entries have 「いかん」 as a reading, only two of these entries contain more than one sense. Since the reference is to sense #2, we can rule out 11 of the entries.
> 2. The position of the headword in the entry. Since 本 is the first headword in entry #1260670 (ほん) and the second headword in entry #1522150 (もと), we can assume that the reference is to the former rather than the latter.
> 3. Frequency tags. If one headword is tagged as a priority term and the other is not, then we can prioritize the former.
> 4. Finally, the sequence numbers (IDs) of the entries. These numbers don't actually provide any useful information, but if we select the entry with the smallest ID (all else being equal), we can always narrow our search down to a single entry.
>
> This method can't produce 100% accurate results, but I found that it worked correctly the vast majority of the time.
>
> Last year I produced a test version of JMdict for Yomichan which contained these backwards references <https://github.com/FooSoft/yomichan/issues/1165#issuecomment-1082441593>. After using it for many months, I came around to share Robin's opinion that these notes mostly just clutter entries rather than provide useful information.
>
> However, there's still a good purpose for this disambiguation method. If we want to display a glossary preview of a referenced entry inside of the entry which references it, we need to know exactly which entry is being referenced. I implemented this functionality in my latest version of JMdict for Yomichan <https://github.com/FooSoft/yomichan-import/pull/40#issue-1561034131>. (Unfortunately, Yomichan is now abandoned as of today <https://foosoft.net/posts/sunsetting-the-yomichan-project/>, so I guess not many people are going to be able to use it.)
>
> Of course, if sequence numbers were included explicitly in the reference elements of the JMdict XML file, then none of this would be necessary. This has already been noted on the JMdict: Next Generation <https://www.edrdg.org/wiki/index.php/JMdict:_Next_Generation#Cross-References> page.