> On 2 Jul 2021, at 03:44, 'Paul Blay' via EDICT-JMdict <edict-...@googlegroups.com> wrote:
Hi Paul,
Thank you for sharing your experiences and for your work on the Tanaka corpus. I make heavy use of it in Jisho, and used the indices as a reference when talking to Suzuki about the technical details of a sentence corpus.
> I'd be interested to know how you/they are going to be dealing with
> linking words and senses to actual example sentences. One of the
> things I dealt with when I had more energy was updating on a monthly
> basis the link between jmdict and example sentences, so I know some of
> the 'gotcha's waiting for you.
To be honest, I haven’t thought much about the practicalities yet. The project is very much in its infancy.
> The theoretical ideal is that every part of the Japanese example
> sentence would be linked to a unique word/phrase in the dictionary, or
> would be explicitly excluded as 'white space / junk text'. In a sort
> of Orwellian "Everything not forbidden is compulsory" sense. Keeping
> track of junk text / white space was something that I did in my (sadly
> obsolete) files which wasn't really handled elsewhere and is useful to
> avoid spurious matches (which can lead to the wrong words being
> highlighted in example sentences and other problems).
For showing sentences next to dictionary headwords, I’ll most likely keep a manual map between sentences and JMdict entries. I was planning on releasing these mappings, but the more I think about it, the clearer it is that the choice of which sentence appears under which headword is highly editorial, and other projects might prefer different mappings. I still feel it would be useful to have this as an open data set, though, perhaps one that can support mappings for multiple projects.
Something like this:
JMdict:123;Jreibun:111;Jisho,WWWJDIC
JMdict:124;Jreibun:222;Jisho
JMdict:126;Jreibun:333;WWWJDIC
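
To make the idea concrete, here is a minimal Python sketch of how lines in that format could be read and filtered per project. The format itself is only a proposal, and the names `SentenceMapping`, `parse_mapping_line` and `mappings_for_project` are made up for illustration; nothing like this exists yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceMapping:
    """One line of the proposed mapping file (hypothetical structure)."""
    entry_id: int          # JMdict entry number
    sentence_id: int       # Jreibun sentence number
    projects: frozenset    # projects that want this sentence on this entry


def parse_mapping_line(line):
    """Parse a line such as 'JMdict:123;Jreibun:111;Jisho,WWWJDIC'."""
    entry_field, sentence_field, projects_field = line.strip().split(";")
    entry_source, entry_id = entry_field.split(":")
    sentence_source, sentence_id = sentence_field.split(":")
    # The sketch assumes the first two fields are always JMdict and Jreibun.
    if entry_source != "JMdict" or sentence_source != "Jreibun":
        raise ValueError(f"unexpected sources in line: {line!r}")
    projects = frozenset(p for p in projects_field.split(",") if p)
    return SentenceMapping(int(entry_id), int(sentence_id), projects)


def mappings_for_project(lines, project):
    """Return only the mappings that a given project has opted into."""
    parsed = (parse_mapping_line(line) for line in lines)
    return [m for m in parsed if project in m.projects]


EXAMPLE_LINES = [
    "JMdict:123;Jreibun:111;Jisho,WWWJDIC",
    "JMdict:124;Jreibun:222;Jisho",
    "JMdict:126;Jreibun:333;WWWJDIC",
]

jisho_mappings = mappings_for_project(EXAMPLE_LINES, "Jisho")
```

Each project would then read the shared file and keep only its own rows, so editorial choices can differ without forking the data.
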
It would probably also be good to record the surface form of the word, as it appears in the sentence, directly in the mapping. I can see, though, that this will entail a lot of manual work, plus keeping track of text segments to avoid the spurious-match issue you describe.
> One point that was a lot of trouble to work with, and not handled as
> well as it could have been, was tracking when words in the dictionary
> had new senses added, merged, or removed. I would recommend being
> aware of this issue from the start as it happens a lot.
Good point. I’m working on a new version of Jisho with a much improved database, which should hopefully make it easier to keep track of changes to senses. I could use that to alert me when sentence mappings need to be reviewed.
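
As a rough sketch of what such an alert might look like in Python: compare the senses of each mapped entry between two dictionary snapshots and flag any entry whose senses differ or which has disappeared. The snapshot shape (a dict from entry ID to a list of gloss strings) and the function name `entries_needing_review` are assumptions for illustration only, not how Jisho's database actually works.

```python
def entries_needing_review(old_senses, new_senses, mapped_entry_ids):
    """Return the mapped entry IDs whose senses changed between snapshots.

    old_senses / new_senses: dict of entry ID -> list of sense glosses
    (a hypothetical, simplified snapshot format).
    mapped_entry_ids: entry IDs that currently have example sentences mapped.
    """
    flagged = []
    for entry_id in sorted(mapped_entry_ids):
        old = old_senses.get(entry_id)
        new = new_senses.get(entry_id)
        # A difference covers senses being added, merged, removed or
        # reordered, as well as the whole entry being deleted (new is None).
        if old != new:
            flagged.append(entry_id)
    return flagged


# Made-up snapshot data, purely to show the call.
old_snapshot = {123: ["to eat"], 124: ["to drink"], 126: ["to run"]}
new_snapshot = {123: ["to eat", "to live on"], 124: ["to drink"]}

needs_review = entries_needing_review(old_snapshot, new_snapshot, {123, 124, 126})
```

This only says that something changed, not what; deciding whether a sentence still belongs on a given sense would remain a manual review step.
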
Cheers
Kim