Jreibun, new sentences database

Kim Ahlström

Jul 2, 2021, 5:11:21 AM
to edict-...@googlegroups.com
Hi everyone, Kim from Jisho.org here.

[Jim has kindly allowed me to post this]

Professor Suzuki Tomomi at Tokyo University of Foreign Studies, together with a research team, is embarking on a project to create an open database of high-quality Japanese-English example sentences geared towards Japanese learning apps and sites.

The project is based on research done at TUFS around study resources used by Japanese learners. 

To quote the project:

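"Have you ever looked at the dictionary apps that Japanese learners use and wished their example sentences were better? In this research project we will build a bank of high-quality example sentences, judged from the perspective of Japanese language education and intended for use in app and website development, and release it as open data."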

They are doing their first seminar for the project over Zoom on Sunday July 18 at 1pm (Japan time) to explain the background and aim of the project, and to solicit volunteers. I’m attaching the event flyer. Signups close on July 10th.

Hopefully this project will be useful for many of you on this mailing list, and if it sounds interesting I encourage you to join the seminar.

My connection to this project comes from studying under Suzuki many years ago, and helping out in the early planning stage of the project by answering technical questions around how Jisho uses example sentences from the Tanaka corpus.

I have high hopes for Jreibun, and have already agreed to use the sentences in Jisho. I'm also considering creating an open set of JMdict-Jreibun mappings similar to the "good sentences" (the ones marked ~) in the Tanaka corpus.

Cheers,
Kim


Jreibun_第1回公開研究会210718広報チラシ.pdf

Paul Blay

Jul 2, 2021, 6:44:26 AM
to edict-...@googlegroups.com
> Hi everyone, Kim from Jisho.org here.

Hi there,

I'm not dead. Although I'm not much healthier either.
I'd be interested to know how you/they are going to be dealing with
linking words and senses to actual example sentences. One of the
things I dealt with when I had more energy was updating on a monthly
basis the link between jmdict and example sentences, so I know some of
the 'gotcha's waiting for you.

The theoretical ideal is that every part of the Japanese example
sentence would be linked to a unique word/phrase in the dictionary, or
would be explicitly excluded as 'white space / junk text'. In a sort
of Orwellian "Everything not forbidden is compulsory" sense. Keeping
track of junk text / white space was something that I did in my (sadly
obsolete) files which wasn't really handled elsewhere and is useful to
avoid spurious matches (which can lead to the wrong words being
highlighted in example sentences and other problems).
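
The "every part is either linked or explicitly excluded" ideal can be checked mechanically. The following is only a minimal sketch in Python, assuming each sentence is annotated with character-offset spans; the `Span` type, the `entry-A` style IDs, and the offset representation are hypothetical and are not the Tanaka corpus index format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    start: int               # inclusive character offset
    end: int                 # exclusive character offset
    entry_id: Optional[str]  # hypothetical dictionary ID; None marks junk text


def uncovered_ranges(sentence: str, spans: List[Span]) -> List[Tuple[int, int]]:
    """Return character ranges that are neither linked nor marked as junk."""
    gaps = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError(f"overlapping span at offset {span.start}")
        if span.start > cursor:
            gaps.append((cursor, span.start))
        cursor = span.end
    if cursor < len(sentence):
        gaps.append((cursor, len(sentence)))
    return gaps


sentence = "猫が好きです。"
spans = [
    Span(0, 1, "entry-A"),  # 猫
    Span(1, 2, "entry-B"),  # が
    Span(2, 4, "entry-C"),  # 好き
    Span(6, 7, None),       # 。 explicitly excluded as junk
]
print(uncovered_ranges(sentence, spans))  # [(4, 6)], i.e. "です" is unaccounted for
```

An empty result would mean the sentence is fully accounted for, so nothing is left over to produce a spurious match.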

One point that was a lot of trouble to work with, and not handled as
well as it could have been, was tracking when words in the dictionary
had new senses added, merged, or removed. I would recommend being
aware of this issue from the start as it happens a lot.

Best regards,

Paul Blay

Kim Ahlström

Jul 3, 2021, 3:35:19 AM
to edict-...@googlegroups.com
> On 2 Jul 2021, at 03:44, 'Paul Blay' via EDICT-JMdict <edict-...@googlegroups.com> wrote:

Hi Paul,

Thank you for sharing your experiences and for your work on the Tanaka corpus. I make heavy use of it in Jisho, and used the indices as a reference when talking to Suzuki about the technical details of a sentence corpus.

> I'd be interested to know how you/they are going to be dealing with
> linking words and senses to actual example sentences. One of the
> things I dealt with when I had more energy was updating on a monthly
> basis the link between jmdict and example sentences, so I know some of
> the 'gotcha's waiting for you.

I haven’t thought much about the practicalities yet to be honest. The project is very much in its infancy.

> The theoretical ideal is that every part of the Japanese example
> sentence would be linked to a unique word/phrase in the dictionary, or
> would be explicitly excluded as 'white space / junk text'. In a sort
> of Orwellian "Everything not forbidden is compulsory" sense. Keeping
> track of junk text / white space was something that I did in my (sadly
> obsolete) files which wasn't really handled elsewhere and is useful to
> avoid spurious matches (which can lead to the wrong words being
> highlighted in example sentences and other problems).

For showing sentences next to dictionary headwords I’ll most likely keep a manual map between sentences and JMdict entries. I was planning on releasing these mappings, but the more I think about it, the more I realise that the choice of which sentence shows on which headword is highly editorial, and other projects might prefer different mapping choices. I still feel that it would be useful to have this as an open data set though. Maybe something that can support mappings for multiple projects.

Something like this:
JMdict:123;Jreibun:111;Jisho,WWWJDIC
JMdict:124;Jreibun:222;Jisho
JMdict:126;Jreibun:333;WWWJDIC

It would probably also be good to keep track of the surface form of the word in the sentence directly in the mapping. But I can see how that will entail a lot of manual work and keeping track of text segments to avoid the spurious matches issue.
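
A minimal sketch of how lines in the format above might be read, in Python. It assumes exactly three semicolon-separated fields per line and a comma-separated project list, as in the three sample lines; the `Mapping` type and both function names are made up for illustration, and no such file format has been published.

```python
from typing import Dict, Iterable, List, NamedTuple


class Mapping(NamedTuple):
    jmdict_id: str
    jreibun_id: str
    projects: List[str]


def parse_mapping(line: str) -> Mapping:
    """Parse one line such as 'JMdict:123;Jreibun:111;Jisho,WWWJDIC'."""
    entry_field, sentence_field, project_field = line.strip().split(";")
    entry_source, jmdict_id = entry_field.split(":", 1)
    sentence_source, jreibun_id = sentence_field.split(":", 1)
    if entry_source != "JMdict" or sentence_source != "Jreibun":
        raise ValueError(f"unexpected field labels in {line!r}")
    return Mapping(jmdict_id, jreibun_id, project_field.split(","))


def sentences_for(project: str, lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group sentence IDs by JMdict entry ID for a single project."""
    result: Dict[str, List[str]] = {}
    for line in lines:
        mapping = parse_mapping(line)
        if project in mapping.projects:
            result.setdefault(mapping.jmdict_id, []).append(mapping.jreibun_id)
    return result


lines = [
    "JMdict:123;Jreibun:111;Jisho,WWWJDIC",
    "JMdict:124;Jreibun:222;Jisho",
    "JMdict:126;Jreibun:333;WWWJDIC",
]
print(sentences_for("Jisho", lines))    # {'123': ['111'], '124': ['222']}
print(sentences_for("WWWJDIC", lines))  # {'123': ['111'], '126': ['333']}
```

A surface-form field could be added as a fourth column in the same way, at the cost of the extra manual work already mentioned.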

> One point that was a lot of trouble to work with, and not handled as
> well as it could have been, was tracking when words in the dictionary
> had new senses added, merged, or removed. I would recommend being
> aware of this issue from the start as it happens a lot.

Good point. I’m working on a new version of Jisho with a much-improved database that should hopefully make it easier to keep track of changes to senses. I could use that to alert me when sentence mappings need to be reviewed.
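
One way such an alert might work, as a rough sketch in Python: compare the sense lists of two dictionary snapshots and flag every mapping that points at an entry whose senses changed. The snapshot shape (entry ID to ordered list of glosses), the IDs, and the function names are assumptions for illustration only and do not describe the Jisho database.

```python
from typing import Dict, List, Set, Tuple

Senses = Dict[str, List[str]]  # entry ID -> ordered sense glosses (assumed shape)


def changed_entries(old: Senses, new: Senses) -> Set[str]:
    """Entries whose sense list was edited, reordered, or removed."""
    return {
        entry_id
        for entry_id, senses in old.items()
        if new.get(entry_id) != senses
    }


def mappings_to_review(
    mappings: List[Tuple[str, str]], old: Senses, new: Senses
) -> List[Tuple[str, str]]:
    """Return the (entry ID, sentence ID) pairs that point at a changed entry."""
    changed = changed_entries(old, new)
    return [(entry_id, sentence_id) for entry_id, sentence_id in mappings if entry_id in changed]


old = {"123": ["cat"], "124": ["to like", "to be fond of"]}
new = {"123": ["cat"], "124": ["to like"], "125": ["dog"]}
mappings = [("123", "111"), ("124", "222")]
print(mappings_to_review(mappings, old, new))  # [('124', '222')]
```

Entries that exist only in the newer snapshot are ignored here, since no existing mapping can point at them yet.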

Cheers
Kim

Jim Breen

Jul 3, 2021, 8:26:33 AM
to edict-...@googlegroups.com
Thanks for passing this on, Kim. I'll try and join in the seminar. I don't have the opportunity to comment in detail (I'm touring in what Australians call The Red Centre), but I certainly have some views based on our experience with the Tanaka Corpus.

Building fit-for-purpose examples from scratch is actually quite a challenging task. Several of my acquaintances worked on this aspect of the GG5 and found it quite demanding.

More later.

Jim


Jim Breen

Jul 3, 2021, 8:46:58 AM
to edict-...@googlegroups.com
Paul Blay wrote:
> I'd be interested to know how you/they are going to be dealing with
> linking words and senses to actual example sentences. One of the
> things I dealt with when I had more energy was updating on a monthly
> basis the link between jmdict and example sentences, so I know some of
> the 'gotcha's waiting for you.

I'm interested too. The clunky system I put together in 2002/2003 for the Tanaka Corpus is probably past its use-by date.

> The theoretical ideal is that every part of the Japanese example
> sentence would be linked to a unique word/phrase in the dictionary, or
> would be explicitly excluded as 'white space / junk text'. In a sort
> of Orwellian "Everything not forbidden is compulsory" sense. Keeping
> track of junk text / white space was something that I did in my (sadly
> obsolete) files which wasn't really handled elsewhere and is useful to
> avoid spurious matches (which can lead to the wrong words being
> highlighted in example sentences and other problems).

It's quite a task. At present a (mercifully small) group of NSJs are busily tidying Tanaka sentences, which means each week between 20 and 70 sentences need their indices adjusted. I run a checker which identifies the changes.


> One point that was a lot of trouble to work with, and not handled as
> well as it could have been, was tracking when words in the dictionary
> had new senses added, merged, or removed. I would recommend being
> aware of this issue from the start as it happens a lot.

Yes, that's a bugger of a problem. I try and monitor dictionary changes and adjust the senses in the indices. I'm sure there are many that are wrong. The recent development of a JMdict version with embedded examples identified quite a few broken sense assignments.

Jim
