Example sentences in JMdict


Jim Breen

May 26, 2021, 9:02:40 PM
to edict-...@googlegroups.com
I'm raising this topic after seeing a comment on the interesting page at:
https://learnjapanese.moe/monolingual/#getting-and-using-monolingual-dictionaries
where, in some remarks on JMdict, it has: "The biggest flaw of this
dictionary (apart from it being bilingual) is how it has no example
sentences." This is of course true, as I have concentrated on using
the collection of sentence pairs (now) in the Tatoeba project, which
are linked at the entry level by systems such as WWWJDIC. In the
original JMdict design I included example sentences in the structure,
but removed them later.

It occurs to me that it would actually be quite possible to add the
30K "priority" sentence pairs to JMdict as an option. For example the
糖分 entry could become:
<entry>
<ent_seq>1449650</ent_seq>
<k_ele><keb>糖分</keb></k_ele>
<r_ele><reb>とうぶん</reb></r_ele>
<sense>
<pos>&n;</pos>
<gloss>amount of sugar</gloss>
<gloss>sugar content</gloss>
<example>
<ex_sent xml:lang="jpn">私は、糖分のあるものは食べてはいけないのです。</ex_sent>
<ex_sent xml:lang="eng">I shouldn't eat food that has sugar in it.</ex_sent>
</example>
</sense>
</entry>

This version could easily be generated on-the-fly, and would be an
alternative to the regular Japanese-English edition (JMdict_e). The
sentences would still come weekly from Tatoeba and would not go into
the JMdict maintenance database itself.
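
A minimal sketch of how a client might read the proposed structure, assuming Python's standard xml.etree.ElementTree. The function name examples_by_sense and the wrapping <JMdict> root are choices made for this illustration, and the inline DTD declares only the &n; entity (with assumed text) so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The entry from the message, wrapped in a root element. The inline DTD
# declares only &n; so the fragment is parseable; the entity text is an
# assumption made for this sketch.
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict>
<entry>
<ent_seq>1449650</ent_seq>
<k_ele><keb>糖分</keb></k_ele>
<r_ele><reb>とうぶん</reb></r_ele>
<sense>
<pos>&n;</pos>
<gloss>amount of sugar</gloss>
<gloss>sugar content</gloss>
<example>
<ex_sent xml:lang="jpn">私は、糖分のあるものは食べてはいけないのです。</ex_sent>
<ex_sent xml:lang="eng">I shouldn't eat food that has sugar in it.</ex_sent>
</example>
</sense>
</entry>
</JMdict>"""


def examples_by_sense(entry):
    """Return one list per <sense>, each holding (jpn, eng) sentence pairs."""
    per_sense = []
    for sense in entry.findall("sense"):
        pairs = []
        for example in sense.findall("example"):
            by_lang = {s.get(XML_LANG): s.text for s in example.findall("ex_sent")}
            pairs.append((by_lang.get("jpn"), by_lang.get("eng")))
        per_sense.append(pairs)
    return per_sense


entry = ET.fromstring(SAMPLE).find("entry")
for sense_no, pairs in enumerate(examples_by_sense(entry), start=1):
    for jpn, eng in pairs:
        print(sense_no, jpn, eng)
```

Because the examples sit inside <sense>, a consumer that ignores unknown elements should be unaffected, while one that wants the sentences gets them already attached to the right sense.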

The question, which is really directed to app and website developers,
is whether this would be useful. I might just do it anyway, but it
would be useful to get some feedback on the idea.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

Chase Colburn

May 26, 2021, 9:11:14 PM
to edict-...@googlegroups.com
Hi Jim,

I'm the developer of Kanji Study for Android. I think it would be great to have example sentences in the dictionary and I am sure others would very much appreciate it. Something to keep in mind though is that a lot of the sentences in Tatoeba contain mistakes or are taken out of context and have very strange translations. They also don't have the proper furigana marked up which would be useful. That being said, this can all be improved over time with the help of the community which is basically what I have been doing for the past 5 years. 

Just my two cents. I look forward to any and all developments. 

-Chase

--
You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/CABHGxq5jxm9ANKYHoae_Fsh-a1FN4cP57JDMgQvWy0-8P0U8rw%40mail.gmail.com.

Brian Birtles

May 26, 2021, 9:41:53 PM
to edict-...@googlegroups.com
On Thu, 27 May 2021 at 10:11, Chase Colburn <kanjis...@gmail.com> wrote:
> On Thu, May 27, 2021 at 10:02 AM Jim Breen <jimb...@gmail.com> wrote:
>> This question, which is really directed to app and website developers,
>> is whether this would be useful? I might just do it anyway, but it
>> would useful to get some feedback on the idea.

Hi Jim,

I'm the developer of Rikaichamp and Hikibiki, both of which (currently) rely on storing the dictionary data offline.

I don't think I would use this data since, as Chase mentions, my experience with the Tatoeba project is that the quality of the example sentences is often low and wouldn't justify the extra download size. That said, having it upstream would not cause problems either.

Best regards,

Brian

Jim Breen

May 26, 2021, 9:59:54 PM
to edict-...@googlegroups.com
Thanks for the comments, Chase. A couple of specific points:
> Something to keep in mind though is that a lot of the sentences in Tatoeba contain mistakes or are taken out of context and have very strange translations.

True. The ~30k "priority" sentence pairs (the ones that show up in
WWWJDIC under the entry display) have often been selected for
appropriateness, etc., although in many cases they are the only
sentence containing the term. The priority tag can always be shifted
to another sentence, or a new sentence pair entered.

> They also don't have the proper furigana marked up which would be useful.

That furigana markup would not be included. Within Tatoeba it is
automatically generated (I think they use MeCab/IPADIC), but I think
they have a way of overriding it with manual corrections.

Jim

Jim Breen

May 29, 2021, 12:26:26 AM
to edict-...@googlegroups.com
I have compiled a proof-of-concept version of JMdict with the 30k
examples embedded. It can be downloaded from:
http://ftp.edrdg.org/pub/Nihongo/JMdict_e_examp.gz

The examples appear at the end of the relevant sense as follows
(example - 火照る entry):
<example>
<ex_text>ほてった</ex_text>
<ex_sent xml:lang="jpn">私は気まずい思いで体がほてった。</ex_sent>
<ex_sent xml:lang="eng">I was feverish with embarrassment.</ex_sent>
</example>

The production of this version is now scripted so it would be fairly
easy to make it a regular daily release.

Jim

Chris Vasselli

May 29, 2021, 6:08:21 PM
to edict-...@googlegroups.com
Thanks Jim, this looks interesting.

I am using Tatoeba sentences in my iOS app Nihongo already, but picking the best example sentence to display for each word can be tough. So, if I’m understanding correctly and these are 30k handpicked sentences that should be good illustrations of how to use the word, that information could definitely be interesting to me. I could imagine prioritizing these sentences in the UI somehow.

In my case, having the Tatoeba sentence identifiers in the data would be really useful, so I can correlate them to the data I’m importing from Tatoeba, like the sentence indices. Maybe an additional attribute on ex_sent, like “tatoeba_id=1234567”?

Chris

Jim Breen

May 29, 2021, 11:09:43 PM
to edict-...@googlegroups.com
On Sun, 30 May 2021 at 08:08, Chris Vasselli <clin...@gmail.com> wrote:
> I am using Tatoeba sentences in my iOS app Nihongo already, but picking the best example sentence to display for each word can be tough. So, if I’m understanding correctly and these are 30k handpicked sentences that should be good illustrations of how to use the word, that information could definitely be interesting to me. I could imagine prioritizing these sentences in the UI somehow.

The 30k subset are the sentences where one or more of the index terms
is tagged with a "~". See
https://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking
Paul Blay did the original tagging.
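
A rough sketch of picking out the "~"-tagged index terms, assuming the term layout described on the linked wiki page: headword, then optionally (reading), [sense number], {form in sentence} and a trailing "~", in that order. The names IndexTerm, parse_term and priority_terms are hypothetical, the sample line was written for this illustration rather than copied from the data, and other markers that may occur in the real index are not handled.

```python
import re
from typing import NamedTuple, Optional

# headword, then optional (reading), [sense], {form}, and a final "~".
TERM = re.compile(
    r"^(?P<headword>[^(\[{~\s]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<priority>~)?$"
)


class IndexTerm(NamedTuple):
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    form: Optional[str]
    priority: bool


def parse_term(token):
    """Parse one space-delimited index term; return None if it does not fit."""
    m = TERM.match(token)
    if m is None:
        return None
    sense = m.group("sense")
    return IndexTerm(
        m.group("headword"),
        m.group("reading"),
        int(sense) if sense else None,
        m.group("form"),
        m.group("priority") is not None,
    )


def priority_terms(index_line):
    """Return only the terms of an index line that carry the "~" tag."""
    terms = (parse_term(tok) for tok in index_line.split())
    return [t for t in terms if t is not None and t.priority]


# Illustrative index line written for this sketch, not taken from the real file.
LINE = "私(わたし)[01] は 気まずい 思い で 体 が 火照る{ほてった}~"
print(priority_terms(LINE))
```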

> In my case, having the Tatoeba sentence identifiers in the data would be really useful, so I can correlate them to the data I’m importing from Tatoeba, like the sentence indices. Maybe an additional attribute on ex_sent, like “tatoeba_id=1234567”?

I'll take that on board. It's fairly easy to add to the <example> element.

Thanks

Jim

Jim Breen

Jun 4, 2021, 2:51:15 AM
to edict-...@googlegroups.com
As suggested by Chris Vasselli, I have added some linking information
to the example elements. The example for the 火照る entry I mentioned
before is now:
<example>
<ex_srce exsrc_type="tat">157736</ex_srce>
<ex_text>ほてった</ex_text>
<ex_sent xml:lang="jpn">私は気まずい思いで体がほてった。</ex_sent>
<ex_sent xml:lang="eng">I was feverish with embarrassment.</ex_sent>
</example>

The <ex_srce> element indicates that the example is from Tatoeba,
where its identification number is 157736.
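
A small sketch of how an app might use <ex_srce> to correlate an example with data it has already imported from Tatoeba. The Example dataclass, parse_example and the tatoeba_rows table are hypothetical names invented for this illustration; only the element and attribute names come from the message above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

EXAMPLE_XML = """<example>
<ex_srce exsrc_type="tat">157736</ex_srce>
<ex_text>ほてった</ex_text>
<ex_sent xml:lang="jpn">私は気まずい思いで体がほてった。</ex_sent>
<ex_sent xml:lang="eng">I was feverish with embarrassment.</ex_sent>
</example>"""


@dataclass
class Example:
    source_type: Optional[str]  # "tat" marks a Tatoeba sentence
    source_id: Optional[str]    # identifier within that source
    text: Optional[str]         # form of the term as it appears in the sentence
    jpn: Optional[str]
    eng: Optional[str]


def parse_example(elem):
    """Turn one <example> element into an Example record."""
    srce = elem.find("ex_srce")
    by_lang = {s.get(XML_LANG): s.text for s in elem.findall("ex_sent")}
    return Example(
        source_type=srce.get("exsrc_type") if srce is not None else None,
        source_id=srce.text if srce is not None else None,
        text=elem.findtext("ex_text"),
        jpn=by_lang.get("jpn"),
        eng=by_lang.get("eng"),
    )


example = parse_example(ET.fromstring(EXAMPLE_XML))

# Hypothetical stand-in for rows an app imported from Tatoeba, keyed by sentence id.
tatoeba_rows = {"157736": {"note": "placeholder row"}}
row = tatoeba_rows.get(example.source_id) if example.source_type == "tat" else None
print(example, row)
```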

I've updated the example version at:
http://ftp.edrdg.org/pub/Nihongo/JMdict_e_examp.gz

Jim

Chris Vasselli

Jun 4, 2021, 9:13:12 AM
to edict-...@googlegroups.com
Thanks Jim!

Chris

Kim Ahlström

Jun 4, 2021, 10:48:14 PM
to edict-...@googlegroups.com
Hi everyone,

Just want to add my two cents from Jisho.org, specifically on furigana.

The way I add furigana to example sentences is by using the metadata from the wwwjdic.csv file. Since that data is manually created I make the assumption that the word-splitting is better than Mecab or other morphological analyzers can do, and that it's generally aligned with JMdict headwords. I take the surface form of the word from the metadata and look it up in JMdict to get the reading from there, and then assign it as furigana. Only in a couple of edge cases do I fall back to Mecab. (I wrote that code over ten years ago, so it took me quite a bit just now to figure out how it actually works.)
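
The lookup order described above might be sketched roughly as follows. This is not the Jisho code: JMDICT_READINGS is a hypothetical miniature table, pick_reading is an invented name, preferring a reading given in the index is an assumption, and the fallback argument stands in for a morphological analyzer such as MeCab without calling one.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical miniature stand-in for a headword -> readings table built from JMdict.
JMDICT_READINGS: Dict[str, List[str]] = {
    "私": ["わたし", "わたくし"],
    "体": ["からだ"],
    "火照る": ["ほてる"],
}


def pick_reading(
    headword: str,
    index_reading: Optional[str] = None,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Choose a reading for a headword taken from the sentence index."""
    if index_reading:
        # Assumption: a reading supplied in the index disambiguates the headword.
        return index_reading
    readings = JMDICT_READINGS.get(headword)
    if readings:
        return readings[0]  # first-listed dictionary reading
    if fallback is not None:
        return fallback(headword)  # e.g. a morphological analyzer
    return None


print(pick_reading("体"))
print(pick_reading("私", index_reading="わたくし"))
```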

For that reason I would need the metadata to also be included in JMdict for me to be able to use the sentences. But that will increase the file size by quite a bit, so not sure it's worth it if I'm the only one using the metadata :)

I also import all the sentences from the wwwjdic.csv file, not just the assigned sentences, so I would probably not end up using this regardless.

That being said, I think this is a great idea. It lowers the barrier for projects that use JMdict to also show example sentences, without them having to download a second file and then match things up with JMdict.

Cheers,
Kim




Jim Breen

Jun 5, 2021, 1:26:37 AM
to edict-...@googlegroups.com
Thanks, Kim. A couple of comments:

On Sat, 5 Jun 2021 at 12:48, Kim Ahlström <kim.ah...@gmail.com> wrote:
> The way I add furigana to example sentences is by using the metadata from the wwwjdic.csv file. Since that data is manually created I make the assumption that the word-splitting is better than Mecab or other morphological analyzers can do, and that it's generally aligned with JMdict headwords. I take the surface form of the word from the metadata and look it up in JMdict to get the reading from there, and then assign it as furigana. Only in a couple of edge cases do I fall back to Mecab. (I wrote that code over ten years ago, so it took me quite a bit just now to figure out how it actually works.)

Interesting. I didn't know you used that approach to do the furigana.
(If people are wondering about "the wwwjdic.csv file", it's the
download file of example pairs and indices from the Tatoeba project.
It's generated weekly.)

> For that reason I would need the metadata to also be included in JMdict for me to be able to use the sentences. But that will increase the file size by quite a bit, so not sure it's worth it if I'm the only one using the metadata :)

It would add a bit, but not a great deal if there was a demand for it.
[...]

> That being said, I think this is a great idea. It lowers the barrier for projects that use JMdict to also show example sentences, without them having to download a second file and then match things up with JMdict.

Yes, indeed. No rush, but I might push ahead with making it an
alternative distributed form of JMdict. Most of the programming is
done. I'll put the XML extensions into the standard DTD, and then I
just need to set up some cron jobs. The programs that align the
sentences with the JMdict entries and senses threw up about 300
mismatches. I've cleared the 100 or so that related to the sense
numbers and am about halfway through the others. Most are to do with
entries either being modified or removed.
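
The kind of check that surfaces such mismatches could look roughly like this. It is a sketch under stated assumptions, not the alignment programs themselves: find_mismatches, the tuple layout of the links and the sample data are all invented for illustration.

```python
def find_mismatches(links, sense_counts):
    """Report links whose target entry or sense is missing.

    links: iterable of (sentence_id, headword, sense_number_or_None).
    sense_counts: mapping of headword -> number of senses in the current dictionary.
    """
    problems = []
    for sentence_id, headword, sense in links:
        total = sense_counts.get(headword)
        if total is None:
            # The entry was removed, or its headword was changed.
            problems.append((sentence_id, headword, "entry not found"))
        elif sense is not None and not 1 <= sense <= total:
            # The sense number no longer exists in the entry.
            problems.append((sentence_id, headword, "sense out of range"))
    return problems


# Hypothetical sample data: sentence ids and sense counts are placeholders.
links = [(1, "糖分", 1), (2, "糖分", 3), (3, "消えた語", None)]
sense_counts = {"糖分": 1}
print(find_mismatches(links, sense_counts))
```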

Cheers

Jim

Kim Ahlström

Jun 7, 2021, 1:09:43 AM
to edict-...@googlegroups.com
Hi Jim,

On Fri, 4 Jun 2021 at 22:26, Jim Breen <jimb...@gmail.com> wrote:
> Thanks, Kim. A couple of comments:
>
> On Sat, 5 Jun 2021 at 12:48, Kim Ahlström <kim.ah...@gmail.com> wrote:
>> The way I add furigana to example sentences is by using the metadata from the wwwjdic.csv file. Since that data is manually created I make the assumption that the word-splitting is better than Mecab or other morphological analyzers can do, and that it's generally aligned with JMdict headwords. I take the surface form of the word from the metadata and look it up in JMdict to get the reading from there, and then assign it as furigana. Only in a couple of edge cases do I fall back to Mecab. (I wrote that code over ten years ago, so it took me quite a bit just now to figure out how it actually works.)
>
> Interesting. I didn't know you used that approach to do the furigana.
> (If people are wondering about "the wwwjdic.csv file", it's the
> download file of example pairs and indices from the Tatoeba project.
> It's generated weekly.)

Of course I made a typo above (is it a typo when it's an entire word?): "surface form" should be "headword".

This approach seems to have worked ok so far, but it's fairly processing heavy since it requires looking up each word in the sentence in JMdict and possibly also in Mecab. There's also some funky code in there that aligns the headword with the surface form. But it's not very smart, so compound words like 走り出す end up with the furigana はしりだ instead of just はし and だ. I'll probably revisit this at some point.
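
One naive alignment that shows the same limitation is to strip the kana tail shared by the headword and its reading and attach what is left of the reading to the remaining prefix as a single unit. The sketch below is an assumption about the general shape of such code, not the Jisho implementation, and split_stem is an invented name.

```python
def split_stem(headword, reading):
    """Split off the kana tail shared by headword and reading.

    Returns (stem, stem_reading, tail). The stem reading is one
    undivided unit, so kana inside the stem are not separated out.
    """
    shared = 0
    limit = min(len(headword), len(reading))
    while shared < limit and headword[-1 - shared] == reading[-1 - shared]:
        shared += 1
    cut = len(headword) - shared
    return headword[:cut], reading[: len(reading) - shared], headword[cut:]


print(split_stem("食べる", "たべる"))
print(split_stem("走り出す", "はしりだす"))
```

For 食べる this puts た over 食, but for 走り出す the whole of はしりだ lands over 走り出, because only the final す is shared. Getting はし and だ separately needs per-kanji readings or a second pass over the kana inside the stem.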
 
> Yes, indeed. No rush, but I might push ahead with making it an
> alternative distributed form of JMdict. Most of the programming is
> done. I'll put the XML extensions into the standard DTD, and then I
> just need to set up some cron jobs. The programs that align the
> sentences with the JMdict entries and senses threw up about 300
> mismatches. I've cleared the 100 or so that related to the sense
> numbers and am about halfway through the others. Most are to do with
> entries either being modified or removed.

Fantastic news. That cleanup will be great for other projects as well. I get the occasional email from users who have found a mismatch between the example sentence and the sense it's attached to.

Cheers,
Kim
 

Ahmed Fasih

Jun 7, 2021, 4:59:47 PM
to edict-...@googlegroups.com
Hi Kim! When you say,

> This approach seems to have worked ok so far, but it's fairly processing heavy since it requires looking up each word in the sentence in JMdict and possibly also in Mecab. There's also some funky code in there that aligns the headword with the surface form. But it's not very smart, so compound words like 走り出す end up with the furigana はしりだ instead of just はし and だ. I'll probably revisit this at some point.

Have you by any chance tried using the JmdictFurigana project? If I understand you correctly (big if 😅), it solves just this problem:

https://github.com/Doublevil/JmdictFurigana

Jim Breen

Jun 9, 2021, 4:05:55 AM
to edict-...@googlegroups.com
On Sat, 5 Jun 2021 at 15:26, Jim Breen <jimb...@gmail.com> wrote:
> [...] I might push ahead with making it an
> alternative distributed form of JMdict. Most of the programming is
> done. I'll put the XML extensions into the standard DTD, and then I
> just need to set up some cron jobs.

This has been done now. The version is called "JMdict_e_examp", and is
available from http://ftp.edrdg.org/pub/Nihongo/
It will be updated daily.

Mentioned also at
https://www.edrdg.org/wiki/index.php/JMdict-EDICT_Dictionary_Project#CURRENT_VERSION_.26_DOWNLOAD
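
A sketch of streaming through a downloaded copy without loading it all into memory, assuming Python's standard gzip and xml.etree.ElementTree modules and that the file is a gzip-compressed XML document containing <entry> elements as in the examples earlier in the thread. The function name iter_entries_with_examples and the local path are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_entries_with_examples(path):
    """Yield (ent_seq, number_of_examples) for entries that carry examples."""
    with gzip.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "entry":
                continue
            count = len(elem.findall("sense/example"))
            if count:
                yield elem.findtext("ent_seq"), count
            # Drop the finished entry's children to limit memory use.
            elem.clear()
```

Called as iter_entries_with_examples("JMdict_e_examp.gz") on a local copy, this gives a quick count of how many entries actually carry an example.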

Jim Breen

Jun 9, 2021, 4:16:26 AM
to edict-...@googlegroups.com
On Tue, 8 Jun 2021 at 06:59, Ahmed Fasih <ah...@aldebrn.me> wrote:
> Have you by any chance tried using the JmdictFurigana project? If I understand you correctly (big if 😅), it solves just this problem:
>
> https://github.com/Doublevil/JmdictFurigana

I'd quite forgotten about that project. It may help, but I think Kim's
issue is with getting the furigana into the sentences rather than the
dictionary entries (I could be wrong). The sentences on Tatoeba
have furigana somewhere - I think it's initially generated
automatically but corrections can be made manually. I don't know if
they can be downloaded.

BTW, I looked at the JSON version of the JMdict-with-furigana, sort-of
expecting that the JMdict sequence numbers would be included but
they're not. Is there any way they can be added? It's a challenge to
align it with JMdict without them - is 川柳 せんりゅう or かわやぎ?
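
Without sequence numbers, the best available join key is the (headword, reading) pair, since the headword alone is ambiguous. The sketch below assumes each furigana record has "text", "reading" and "furigana" fields, which should be checked against the real file. The sequence numbers are placeholders, not real JMdict values, and attach_sequence_numbers is an invented name.

```python
import json
from collections import defaultdict

# Assumed record shape for the furigana JSON; verify against the real file.
FURIGANA_JSON = """[
 {"text": "川柳", "reading": "せんりゅう",
  "furigana": [{"ruby": "川", "rt": "せん"}, {"ruby": "柳", "rt": "りゅう"}]},
 {"text": "川柳", "reading": "かわやぎ",
  "furigana": [{"ruby": "川", "rt": "かわ"}, {"ruby": "柳", "rt": "やぎ"}]}
]"""

# Placeholder sequence numbers, NOT real JMdict ent_seq values.
JMDICT_PAIRS = [(9000001, "川柳", "せんりゅう"), (9000002, "川柳", "かわやぎ")]


def attach_sequence_numbers(records, jmdict_pairs):
    """Add an "ent_seq" list to each record by matching (headword, reading)."""
    by_pair = defaultdict(list)
    for ent_seq, keb, reb in jmdict_pairs:
        by_pair[(keb, reb)].append(ent_seq)
    return [
        dict(rec, ent_seq=by_pair.get((rec["text"], rec["reading"]), []))
        for rec in records
    ]


records = attach_sequence_numbers(json.loads(FURIGANA_JSON), JMDICT_PAIRS)
for rec in records:
    print(rec["text"], rec["reading"], rec["ent_seq"])
```

A pair can still map to more than one entry, or to none if the two files are from different dates, which is why the result is a list rather than a single number.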

Marcella M. MARIOTTI

Jun 9, 2021, 4:18:19 AM
to edict-...@googlegroups.com
Dear Jim
dear all,

just to say Hi from Italy/Venice.
We are always reading you and will be there hopefully with good news soon.

Best wishes as always to you and to Jim,
Marcella
*`*`*`*`*``*`*`*`**`
Marcella MARIOTTI, Ph.D.
Associate Professor
Dpt. of Asian and North African Studies
Ca' Foscari University of Venice






Kim Ahlström

Jun 14, 2021, 2:37:35 AM
to edict-...@googlegroups.com
On Mon, 7 Jun 2021 at 13:59, Ahmed Fasih <ah...@aldebrn.me> wrote:

Hi Ahmed,

> Have you by any chance tried using the JmdictFurigana project? If I understand you correctly (big if 😅), it solves just this problem:
>
> https://github.com/Doublevil/JmdictFurigana

I have seen it, but by that time it was too late to use in the current version of Jisho :)

As Jim pointed out though, the approach I described was specifically for applying furigana to sentences. For words I have another home-grown algorithm that uses readings from Kanjidic2 to apply furigana. But it's far from perfect and has quite a few mis-alignments. Better headword furigana is high on my list of things to fix.

So I might take another look at the JmdictFurigana project at some point to see if I can use it in a future version of the site. But I agree with Jim that it would be great if it was aligned with JMdict sequence numbers.

Cheers,
Kim

Jim Breen

Jun 15, 2021, 8:13:09 AM
to edict-...@googlegroups.com
Hi Marcella.

Great to hear from you. Hope all is well with you.

[Marcella and her colleagues/students are responsible for the ItaDict
Japanese-Italian dictionary. She has visited us in Melbourne, and we
met up in Venice a couple of years ago.]

Cheers

Jim

Marcella M. MARIOTTI

Jun 15, 2021, 9:27:16 AM
to edict-...@googlegroups.com
😁 Hi Jim! Hi all.
Hopefully, after 9 years, by the end of June, I'll let you know more about the Japanese-Italian version.
It is always so nice to hear from you,
Marcella
*`*`*`*`*``*`*`*`**`
Marcella MARIOTTI, Ph.D.
Associate Professor
Dpt. of Asian and North African Studies
Ca' Foscari University of Venice
Department Delegate for Students Placement & Internships

AbleMind

Jun 15, 2021, 2:14:30 PM
to edict-...@googlegroups.com
> Jim wrote on Jun 9, 2021, 4:16 AM
> BTW, I looked at the JSON version of the JMdict-with-furigana, sort-of
> expecting that the JMdict sequence numbers would be included but
> they're not. Is there any way they can be added. It's a challenge to
> align it with JMdict without them - is 川柳 せんりゅう or かわやぎ?

> Kim wrote on Jun 14, 2021, 2:37 AM
> So I might take another look at the JmdictFurigana project at some point to see if I can use it in a future version of the site. But I agree with Jim that it would be great if it was
> aligned with JMdict sequence numbers.

Hey all, pleasure to join the chat.

I'm a programmer who's been tinkering with a Japanese study application that uses JMdict, amongst other resources mentioned before. I'm trying to gain more exposure and experience in contributing to open-source projects. I took a glance at the JmdictFurigana project and adding the sequence numbers seems like it should be possible, although there is the question of what to do when an entry doesn't have a corresponding id in JMdict. If this is indeed of interest, I'd be happy to take a shot at it.

-
Cameron Chambers

Ahmed Fasih

Jun 15, 2021, 2:45:34 PM
to edict-...@googlegroups.com
On Tue, Jun 15, 2021, at 11:14, AbleMind wrote:
> I took a glance at the JmdictFurigana project and it seems like it should be possible

Cameron, I saw your PR in JmdictFurigana 👏! I opened an issue to track Jim's request. Doublevil has been super-responsive over the years, so feel free to jump in:


Thank you!

Jim Breen

Jun 15, 2021, 6:52:41 PM
to edict-...@googlegroups.com
I'm going to move the furigana-related emails to a specific thread to
make it a bit easier to find later. I'll copy in the ones so far
below. Jim
From: Kim Ahlström <kim.ah...@gmail.com>
Date: Mon, 14 Jun 2021 at 16:37
Subject: Re: Example sentences in JMdict
To: <edict-...@googlegroups.com>


On Mon, 7 Jun 2021 at 13:59, Ahmed Fasih <ah...@aldebrn.me> wrote:

Hi Ahmed,

> Have you by any chance tried using the JmdictFurigana project? If I understand you correctly (big if 😅), it solves just this problem:
>
> https://github.com/Doublevil/JmdictFurigana

I have seen it, but by that time it was too late to use in the current
version of Jisho :)

As Jim pointed out though, the approach I described was specifically
for applying furigana to sentences. For words I have another
home-grown algorithm that uses readings from Kanjidic2 to apply
furigana. But it's far from perfect and has quite a few
mis-alignments. Better headword furigana is high on my list of things
to fix.

So I might take another look at the JmdictFurigana project at some
point to see if I can use it in a future version of the site. But I
agree with Jim that it would be great if it was aligned with JMdict
sequence numbers.

Cheers,
Kim