Problem with version 1.10 with new field 'info'


Fryderyk Mazurek

May 11, 2026, 8:21:58 AM
to edict-...@googlegroups.com
Hello!

I have a question: why doesn't the new "info" field have an xml:lang attribute like the "gloss" field does? This means that it's impossible to add information in a language other than English, even when a "gloss" exists in that language. I think the "info" element should have an xml:lang attribute.

Best regards,
Fryderyk

Jim Breen

May 11, 2026, 11:48:15 PM
to edict-...@googlegroups.com
Thanks for raising this. It's something we'll have to consider.

The new entry-wide <info> field is in many ways a partner to the
sense-level <s_inf> field. That field also lacks an xml:lang
attribute. Quite possibly they should both have them.

Jim
> --
> You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
> To view this discussion visit https://groups.google.com/d/msgid/edict-jmdict/CAK%2ByPMnF7dDzYnTVBAt4Z_RZJ1oEndowz2k8sJiGs7sgkN3dzA%40mail.gmail.com.



--
Jim Breen
https://www.edrdg.org/~jwb/ http://www.jimbreen.org/

Stuart McGraw

May 12, 2026, 3:58:48 PM
to edict-...@googlegroups.com
Regarding the entry notes (<info> elements):
How will multi-language notes be used? Will it be expected that the
same information will be provided in different languages (a note could have
multiple translations)? Or will notes be independent with a German note
saying something totally unrelated to what is in an adjacent English note?
If the latter is ok will there be a need to indicate when the case is the
former (ie, translations grouped into sets)?

Regarding sense notes (<s_inf> elements):
It appears (from a quick look) that the multi-lingual JMdict XML file
puts each language group of glosses in a separate sense. If this is
a general rule then would it be more reasonable to put the language
attribute on the <sense> elements (so it would apply to the sense note
as well as the glosses) and remove them from the glosses? Since the XML
is changing due to XML-NG, this would be a good time to make such a change.
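
A minimal sketch of how one might test the "one language per sense" rule mentioned above, assuming Python's standard xml.etree.ElementTree. The sample entry is invented for illustration (it is not taken from the JMdict file), and the "eng" fallback for glosses without xml:lang is an assumption.

```python
import xml.etree.ElementTree as ET

# Attribute key ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry: each language group of glosses sits in its own sense.
SAMPLE = """<entry>
<sense><gloss>used</gloss><gloss>second-hand</gloss></sense>
<sense><gloss xml:lang="fre">occasion</gloss><gloss xml:lang="fre">deuxième main</gloss></sense>
</entry>"""


def sense_languages(entry):
    """Return, for each <sense>, the set of languages used by its glosses."""
    return [
        # Assumption: a gloss without xml:lang counts as English ("eng").
        {gloss.get(XML_LANG, "eng") for gloss in sense.findall("gloss")}
        for sense in entry.findall("sense")
    ]


entry = ET.fromstring(SAMPLE)
langs = sense_languages(entry)
# True when no sense mixes languages, i.e. the attribute could move to <sense>.
all_monolingual = all(len(sense_langs) == 1 for sense_langs in langs)
print(langs, all_monolingual)
```

If every sense in the real file passes this check, a sense-level language attribute would lose no information; any sense that fails it would need to be split first.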

-- Stuart

Jim Breen

May 17, 2026, 1:23:20 AM
to edict-...@googlegroups.com
Thanks Stuart for the reply to Fryderyk.

The question and response highlight a slightly messy situation with
the multilingual version of JMdict.

My original intention was to have the glosses for all the languages in
the one relevant sense, and each tagged with the relevant attribute
(xml:lang="ger", etc.). Since the database is compiled by aggregating
the entries from a heap of bilingual sources, many of which did not
have multiple senses or did not follow the sense structure of the
Japanese-English JMdict, I applied an expedient of putting the glosses
for each language in one or more distinct senses. The only exception
(originally) was the Japanese-French file, which started off
sense-aligned with the JE version. Soon they began to drift apart, so
I changed the Japanese-French handling to be the same as the other
languages.

Ideally, I would like things to be as originally planned, but the task
of aligning the glosses is huge and is unlikely to happen any time
soon.

That brings me to language tags associated with the <info> and <s_inf> elements.

It would be great to cater for the possible implementation of the
original scheme for multilingual glosses. In such a situation, it's
quite possible that there would be <info> and <s_inf> elements
associated with entries in more than one language. I think the most
appropriate approach would be to have a separate element for each
language with the relevant tagging.

In terms of the DTD, I think this means we'd need something like:

<!ELEMENT info (#PCDATA)>
<!ATTLIST info xml:lang CDATA "eng">

and

<!ELEMENT s_inf (#PCDATA)>
<!ATTLIST s_inf xml:lang CDATA "eng">

The raw entry data might look a little odd, as the <info> and <s_inf>
elements could be some distance from the glosses they relate to, but
that's an issue for the apps and systems using the data.
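
As a rough illustration of what that might mean for apps, here is a sketch that groups <info> and <s_inf> notes by language, assuming Python's standard xml.etree.ElementTree. The sample entry and its note texts are invented, and the "eng" fallback simply mirrors the default in the DTD fragment above.

```python
import xml.etree.ElementTree as ET

# Attribute key ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry using the proposed attributes; the texts are placeholders.
SAMPLE = """<entry>
<info>illustrative entry note</info>
<info xml:lang="ger">illustrative German entry note</info>
<sense>
<s_inf xml:lang="fre">illustrative French sense note</s_inf>
<gloss xml:lang="fre">occasion</gloss>
</sense>
</entry>"""


def notes_by_language(entry, tag):
    """Group the text of every <tag> element in the entry by its xml:lang."""
    grouped = {}
    for element in entry.iter(tag):
        # Fall back to "eng" when the attribute is absent (the proposed default).
        lang = element.get(XML_LANG, "eng")
        grouped.setdefault(lang, []).append(element.text)
    return grouped


entry = ET.fromstring(SAMPLE)
info_notes = notes_by_language(entry, "info")
sense_notes = notes_by_language(entry, "s_inf")
print(info_notes)
print(sense_notes)
```

With notes keyed by language, an app can show the note that matches the glosses it is displaying, however far apart the elements sit in the raw entry.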

HTH

Jim

Jean-Marc D

May 22, 2026, 10:12:57 AM
to edict-...@googlegroups.com
On Sun, May 17, 2026 at 07:23, Jim Breen <jimb...@gmail.com> wrote:
> The only exception
> (originally) was the Japanese-French file, which started off
> sense-aligned with the JE version. Soon they began to drift apart, so
> I changed the Japanese-French handling to be the same as the other
> languages.
 
It's something I really regret, but dico.fj was actually not sense-aligned from the start.
Unfortunately, I got the original data from JMdict shortly before the split into senses occurred.
Over the years I have considered linking all the French meanings back to the correct senses; I don't think it would be that much work.
The major issue for me is which format I should use to send back the results, so that they can be integrated into the main database easily.

Jim Breen

May 23, 2026, 6:34:40 PM
to edict-...@googlegroups.com
You did provide sense numbers at some stage. I wrote on this list in 2007:

> For all languages other than French [....], the
> glosses are put at the end of the English ones.
>
> For French (the FR1 set), Jean-Marc Desperrier marked them with sense numbers
> so they are added to the sense he marked. However his markup was done several
> years ago and since then a lot of extra senses have been marked up. Entry
> 1366410 had only one sense then. Now it has two.

Cheers 

Jim

Chris Lindsay

May 23, 2026, 11:39:10 PM
to 'Paul Blay' via EDICT-JMdict
I actually have a little more information I can share about the French definitions in JMdict.

Over the years, I occasionally got reports from users of my iOS app Nihongo about mistakes in the French definitions, far more frequently than for the other languages in JMdict. I usually handled them as one-off fixes, but got curious about what was going on, so I recently asked AI to analyze the French JMdict entries.

I ran an AI-assisted audit over 15,301 JMdict entries containing French glosses, comparing Japanese headwords, English senses, and French glosses. The audit flagged 1,040 likely errors and 2,613 entries needing review. The main pattern was missing core meanings: French often preserved a secondary or rare sense while omitting the common meaning.

A couple examples:

1424150 中古
JMdict French had only: Moyen Âge
But the common dictionary meaning is: used; second-hand; pre-owned

1597040 立つ
JMdict French had senses like: se trouver (par ex. en position difficile) and partir (en avion, en train, etc.)
But it missed the basic physical meaning of the verb.

So I suspect there's something going on with the mapping of senses. Perhaps the drift has grown large over the years, causing more problems? I'm happy to share the results of this audit if anyone would like to look into this more.

Chris

Jean-Marc D

May 25, 2026, 4:26:11 PM
to edict-...@googlegroups.com
I'm interested in the results of your audit, Chris.
About what you cited:

> 1424150 中古
> JMdict French had only: Moyen Âge
> But the common dictionary meaning is: used; second-hand; pre-owned
The 2002 version of the dico.fj dictionary is still online at dico.fj.free.fr/fj.utf as a UTF-8 EDICT-format file, with the JMdict entry numbers at the end of each line.
For this entry, it has:
中古 [ちゅうこ] /(n-t) occasion/deuxième main//Moyen-Âge/1424150
中古 [ちゅうぶる] /(n-t) occasion/deuxième main//Moyen-Âge/1424150

> 1597040 立つ
> JMdict French had senses like: se trouver (par ex. en position difficile) and partir (en avion, en train, etc.)
> But it missed the basic physical meaning of the verb.
For this one it has:
立つ [たつ] /(v5t) tenir debout/être levé/être dressé/être érigé//partir/démarrer/1597040
建つ [たつ] /(v5t) tenir debout/être levé/être dressé/être érigé//partir/démarrer/1597040
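
A minimal Python sketch of how lines like these could be split back into senses. It assumes, based only on the four lines quoted above, that an empty field ("//") marks a sense boundary, that the last field is the JMdict entry number, and that a leading "(...)" on the first gloss is the part of speech. parse_line is a hypothetical helper, not an existing tool.

```python
import re

# headword, reading in square brackets, then the slash-separated fields
LINE_RE = re.compile(r"^(?P<head>\S+) \[(?P<reading>[^\]]+)\] /(?P<body>.*)$")
POS_RE = re.compile(r"^\((?P<pos>[^)]+)\)\s*(?P<rest>.*)$")


def parse_line(line):
    """Parse one line in the format quoted above into a dict."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    fields = match.group("body").split("/")
    seq = fields.pop()  # assumption: the last field is the entry number
    senses, current = [], []
    for field in fields:
        if field == "":  # assumption: "//" separates senses
            if current:
                senses.append(current)
                current = []
        else:
            current.append(field)
    if current:
        senses.append(current)
    pos = None
    if senses:
        pos_match = POS_RE.match(senses[0][0])
        if pos_match:  # strip the leading "(pos)" marker from the first gloss
            pos = pos_match.group("pos")
            senses[0][0] = pos_match.group("rest")
    return {
        "headword": match.group("head"),
        "reading": match.group("reading"),
        "pos": pos,
        "senses": senses,
        "seq": seq,
    }


parsed = parse_line(
    "立つ [たつ] /(v5t) tenir debout/être levé/être dressé/être érigé//partir/démarrer/1597040"
)
print(parsed["senses"])
```

Glosses that themselves contain a slash would need extra handling; none of the quoted lines do.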

Jim has also obtained some French glosses from at least one other source, so I'm not surprised that some of the French entries in JMdict were not in dico.fj, but it seems that here there are several cases where the first meaning has somehow been erased.
I could add those missing senses again, and check for alignment.

Given that the pos etc. are only present on the English senses, I think it would be really useful to be able to realign the senses for all languages.
I think even a very crude version wouldn't be worse than what people currently have to do in order to get the pos information when they display a non-English dictionary.

Can an xml:lang attribute be added to s_inf to show that it applies to a non-English language? There are many cases where I would love to move the information given in parentheses to an s_inf element instead.

Jim Breen

May 26, 2026, 3:42:10 AM
to edict-...@googlegroups.com
I was puzzled by those meanings being missing. On checking, I see that
in sorting out some versions about 12 years ago, I seem to have
deleted quite a few senses from Jean-Marc's file.

It may take a day or two, but I'll try to put it back together
properly. The deletion has also impacted the version used on the
WWWJDIC server.

Apologies, everyone, and especially to Jean-Marc.

Jim

Chris Lindsay

May 26, 2026, 8:26:14 PM
to 'Paul Blay' via EDICT-JMdict
Sorry for the delay. I couldn't find the original audit output, but I still had the code so I ran it again. I've attached the raw output. Take it with a grain of salt, as this is AI-generated and I was mostly just trying to understand the nature of the problem. I viewed the ones with a status of "likely_error" and an issue_type of "missing_core_meaning" as being the most concerning.

Hope this is useful!

Chris

flagged_entries_with_needs_review.csv
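
A minimal sketch of the filter described above, assuming Python's standard csv module. Only the column names status and issue_type and the values "likely_error" and "missing_core_meaning" come from the message; the seq column, the "other" issue type, and every sample row are invented, since the attachment's real layout is not shown here.

```python
import csv
import io

# Invented rows standing in for the attachment; "seq" is a hypothetical column.
SAMPLE_CSV = """seq,status,issue_type
0000001,likely_error,missing_core_meaning
0000002,likely_error,other
0000003,needs_review,missing_core_meaning
0000004,likely_error,missing_core_meaning
"""


def most_concerning(rows):
    """Keep rows that are both likely errors and missing a core meaning."""
    return [
        row
        for row in rows
        if row["status"] == "likely_error"
        and row["issue_type"] == "missing_core_meaning"
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
flagged = most_concerning(rows)
print([row["seq"] for row in flagged])
```

For the real file, replacing io.StringIO(SAMPLE_CSV) with open(path, newline="", encoding="utf-8") should work as long as the header row uses those two column names.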

Jim Breen

May 28, 2026, 2:45:05 AM
to edict-...@googlegroups.com
I have rebuilt the set of French glosses to go in JMdict. They will go
into tomorrow's build of the full JMdict file. Once that's done, it
would be good if you could rerun your script, Chris.

For entry 1424150 (中古) it should have the two senses:
(1) deuxième main; occasion
(2) Moyen-Âge

For 立つ/建つ things have moved a bit over the years. 建つ is now entry
1597045 and will have one gloss "être érigé; être fondé".
立つ, which is entry 1597040 and may need some work, will have:
(1) être érigé; être dressé; être levé; tenir debout
(2) démarrer; partir

Cheers

Jim

Chris Lindsay

May 28, 2026, 3:26:10 AM
to 'Paul Blay' via EDICT-JMdict
Sure thing. I'll check for the new JMdict file tomorrow.

Chris

Chris Lindsay

May 29, 2026, 2:43:50 AM
to 'Paul Blay' via EDICT-JMdict
The results are mixed. It marks those specific entries I called out as correct now, but the overall count that it flagged as potentially problematic went up. It also calls out that some sequence numbers seem to have ended up in the glosses themselves. So there may just be some new importing bugs.

I've attached the CSV, and I'll include here the AI's analysis of why the overall count went up:

The increase seems mostly real, not just noise from the two fixed examples.
The strongest signal is that the May 29 JMdict rebuild changed a lot of French gloss text, not just 中古 / 立つ / 建つ. Among entries common to both audits:
  • 3,197 entries had changed French glosses.
  • For those changed entries, statuses shifted from:
    • old: 2,581 ok, 395 needs_review, 221 likely_error
    • new: 1,815 ok, 901 needs_review, 481 likely_error
So the changed French data alone added about +766 flagged entries. That more than explains the overall increase.
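
The arithmetic behind that figure can be checked with a few lines of Python, using only the status counts quoted above and treating needs_review plus likely_error as "flagged":

```python
# Status counts for the 3,197 entries whose French glosses changed.
old = {"ok": 2581, "needs_review": 395, "likely_error": 221}
new = {"ok": 1815, "needs_review": 901, "likely_error": 481}


def flagged(counts):
    """Entries the audit flagged: everything that is not "ok"."""
    return counts["needs_review"] + counts["likely_error"]


# Both breakdowns should account for the same 3,197 entries.
totals_match = sum(old.values()) == sum(new.values()) == 3197
delta = flagged(new) - flagged(old)
print(flagged(old), flagged(new), delta)  # 616 1382 766
```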
The pattern I saw in examples: some rebuilt glosses appear shorter, less sense-complete, or contain artifacts.
Examples:
  • さあ: went from several French senses to only allez!; va!.
  • ちゃんと: went from multiple meanings to only correctement.
  • もう: went from several meanings to only déja.
  • カバー: went from several senses to only couverture.
  • Some entries now include numeric artifacts like 1288820, 1006800, 1429350, 1078050, etc.
There is also some audit/model variability. About half of individual “worsened” transitions had unchanged French text, so I would not interpret every row-level change as meaningful. But net-wise, unchanged entries did not drive the increase; changed French glosses did.
So my read is:
  1. The specific reported fixes landed correctly.
  2. The rebuild likely changed many other French glosses.
  3. Some of those changes made entries more incomplete or introduced formatting/import artifacts.
  4. The audit is also somewhat noisy, but the direction of the increase is probably reflecting real data changes in the rebuilt French set.
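
A minimal sketch of how the numeric artifacts could be detected, assuming they always appear as standalone 7-digit numbers like the ones listed above. The sample gloss strings are invented for illustration.

```python
import re

# Exactly seven digits, not part of a longer run of digits.
SEQ_LIKE = re.compile(r"(?<!\d)\d{7}(?!\d)")


def find_seq_artifacts(gloss):
    """Return every 7-digit number embedded in a gloss string."""
    return SEQ_LIKE.findall(gloss)


# Hypothetical gloss texts; only the number itself comes from the list above.
samples = ["couverture 1288820", "correctement", "code 12345678"]
hits = {gloss: find_seq_artifacts(gloss) for gloss in samples}
print(hits)
```

A gloss that legitimately contains a 7-digit number would be a false positive, so matches are best treated as candidates for review.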
suspect_and_needs_review_entries_may_29.csv

Jim Breen

May 29, 2026, 7:53:03 AM
to edict-...@googlegroups.com
Thanks Chris,

Clearly my attempt to revive some of the missing translations was only
partly successful. I've tidied up a couple of easy errors with the
compilation, e.g. the inclusion of those odd sequence numbers, but it
will take a few days to work through the rest.

I'll let you know when it's a good time to run your analysis again.

Jim