--
You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/CABHGxq5jxm9ANKYHoae_Fsh-a1FN4cP57JDMgQvWy0-8P0U8rw%40mail.gmail.com.
On Thu, May 27, 2021 at 10:02 AM Jim Breen <jimb...@gmail.com> wrote:
> This question, which is really directed to app and website developers,
> is whether this would be useful? I might just do it anyway, but it
> would be useful to get some feedback on the idea.
Thanks, Kim. A couple of comments:
On Sat, 5 Jun 2021 at 12:48, Kim Ahlström <kim.ah...@gmail.com> wrote:
> The way I add furigana to example sentences is by using the metadata from the wwwjdic.csv file. Since that data is manually created I make the assumption that the word-splitting is better than Mecab or other morphological analyzers can do, and that it's generally aligned with JMdict headwords. I take the surface form of the word from the metadata and look it up in JMdict to get the reading from there, and then assign it as furigana. Only in a couple of edge cases do I fall back to Mecab. (I wrote that code over ten years ago, so it took me quite a bit just now to figure out how it actually works.)
Interesting. I didn't know you used that approach to do the furigana.
(If people are wondering about "the wwwjdic.csv file", it's the
download file of example pairs and indices from the Tatoeba project.
It's generated weekly.)
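For readers who want to picture the lookup Kim describes, here is a minimal sketch in Python. It is not Kim's code: `TOY_JMDICT`, `reading_for` and `annotate` are hypothetical names, the dictionary is a two-entry stand-in for a real JMdict index, and the fallback is an arbitrary callable standing in for a morphological analyzer such as Mecab.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a JMdict index: surface form -> reading.
TOY_JMDICT: Dict[str, str] = {
    "日本語": "にほんご",
    "勉強": "べんきょう",
}

def reading_for(
    surface: str,
    lookup: Dict[str, str],
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the dictionary reading for a surface form.

    The fallback (e.g. a wrapper around a morphological analyzer) is only
    consulted when the dictionary has no entry for the surface form.
    """
    reading = lookup.get(surface)
    if reading is None and fallback is not None:
        reading = fallback(surface)
    return reading

def annotate(
    surfaces: List[str],
    lookup: Dict[str, str],
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Pair each surface form from the sentence index with a reading."""
    return [(s, reading_for(s, lookup, fallback)) for s in surfaces]
```

Calling `annotate(["日本語", "を", "勉強"], TOY_JMDICT)` leaves the particle without a reading, since it is not in the toy dictionary and no fallback was supplied.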
Yes, indeed. No rush, but I might push ahead with making it an
alternative distributed form of JMdict. Most of the programming is
done. I'll put the XML extensions into the standard DTD, and then I
just need to set up some cron jobs. The programs that align the
sentences with the JMdict entries and senses threw up about 300
mismatches. I've cleared the 100 or so that related to the sense
numbers and am about halfway through the others. Most are to do with
entries either being modified or removed.
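The kind of check that produces those two groups of mismatches could look roughly like the sketch below. It is only an illustration, not the actual alignment programs: `find_mismatches` is a hypothetical name, the sentence references are assumed to be (headword, sense number) pairs, and the dictionary is assumed to be reduced to a map from headword to number of senses.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Ref = Tuple[str, Optional[int]]  # (headword, 1-based sense number or None)

def find_mismatches(
    refs: Iterable[Ref],
    sense_counts: Dict[str, int],
) -> Tuple[List[Ref], List[Ref]]:
    """Split bad references into missing entries and out-of-range senses."""
    missing_entry: List[Ref] = []
    bad_sense: List[Ref] = []
    for headword, sense in refs:
        if headword not in sense_counts:
            # The entry was removed, or its headword was modified.
            missing_entry.append((headword, sense))
        elif sense is not None and not 1 <= sense <= sense_counts[headword]:
            # The entry exists, but it has no sense with this number.
            bad_sense.append((headword, sense))
    return missing_entry, bad_sense
```

References without a sense number are only checked for the existence of the entry.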
This approach seems to have worked ok so far, but it's fairly processing heavy since it requires looking up each word in the sentence in JMdict and possibly also in Mecab. There's also some funky code in there that aligns the headword with the surface form. But it's not very smart, so compound words like 走り出す end up with the furigana はしりだ instead of just はし and だ. I'll probably revisit this at some point.
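One possible way to get はし and だ separately for 走り出す is to use the okurigana as anchors when splitting the reading. The sketch below is an assumption about how that might be done, not the code used on any existing site: `split_furigana` and its helpers are hypothetical names, kanji detection covers only the basic CJK block plus 々, and anything that cannot be aligned falls back to one reading for the whole word.

```python
import re
from typing import List, Optional, Tuple

def is_kanji(ch: str) -> bool:
    # Basic CJK Unified Ideographs block plus the iteration mark.
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"

def kata_to_hira(text: str) -> str:
    # Katakana ァ..ヶ sit 0x60 code points above hiragana ぁ..ゖ.
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)

def split_furigana(surface: str, reading: str) -> List[Tuple[str, Optional[str]]]:
    """Split a reading across the kanji runs of a surface form.

    Returns (text, furigana) pairs; kana runs get None as furigana.
    """
    # Group the surface form into alternating kanji and non-kanji runs.
    runs: List[List] = []
    for ch in surface:
        kanji = is_kanji(ch)
        if runs and runs[-1][0] == kanji:
            runs[-1][1] += ch
        else:
            runs.append([kanji, ch])

    # Kanji runs become lazy capture groups; kana runs must match literally.
    pattern = "".join(
        "(.+?)" if kanji else re.escape(kata_to_hira(text))
        for kanji, text in runs
    )
    match = re.fullmatch(pattern, kata_to_hira(reading))
    if match is None:
        # Could not align: keep one reading over the whole word.
        return [(surface, reading)]

    groups = iter(match.groups())
    return [(text, next(groups) if kanji else None) for kanji, text in runs]
```

With this, `split_furigana("走り出す", "はしりだす")` yields はし over 走 and だ over 出. A word made of a single kanji run, such as 今日, still gets one reading for the whole run, so splitting readings per character inside a run would need extra data.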
Have you by any chance tried using the JmdictFurigana project? If I understand you correctly (big if 😅), it solves just this problem:
I took a glance at the JmdictFurigana project and it seems like it should be possible