For the first problem, you might want to look into using the CMU
Pronouncing Dictionary! It's included as a corpus in NLTK; you can use
it to look up the pronunciation, as a sequence of phonemes, for lots of
words. There are examples for that in
This leaves you with two interesting problems:
- You could use this to look up *exact* homophones, no problem -- just
find words that have the exact same pronunciation in the dictionary,
maybe after stripping the stress markers. But! How do you find
near-homophones? Now you've got to come up with some kind of similarity
measure over pronunciations.
- How do you get pronunciations for other languages?
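
To make the first bullet concrete, here's a rough sketch of both lookups. It assumes a pronunciation dictionary shaped like what `nltk.corpus.cmudict.dict()` returns (word -> list of pronunciations, each a list of ARPAbet phonemes); the `SAMPLE` dictionary and all the function names are made up for illustration. For near-homophones it uses plain edit distance over phonemes, which is just one possible similarity measure, not necessarily a good one.

```python
# Tiny hand-written sample in the same shape as nltk.corpus.cmudict.dict():
# word -> list of pronunciations, each a list of ARPAbet phonemes.
# With NLTK installed and the corpus downloaded, you could instead use:
#   from nltk.corpus import cmudict; prondict = cmudict.dict()
SAMPLE = {
    "beach": [["B", "IY1", "CH"]],
    "beech": [["B", "IY1", "CH"]],
    "peach": [["P", "IY1", "CH"]],
    "bead": [["B", "IY1", "D"]],
    "hot": [["HH", "AA1", "T"]],
}


def strip_stress(phones):
    """Drop the trailing stress digits (0/1/2) that mark vowels."""
    return tuple(p.rstrip("012") for p in phones)


def exact_homophones(word, prondict):
    """Other words sharing at least one stress-free pronunciation with `word`."""
    targets = {strip_stress(p) for p in prondict.get(word, [])}
    return sorted(
        other
        for other, prons in prondict.items()
        if other != word and any(strip_stress(p) in targets for p in prons)
    )


def phone_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete pa
                           cur[j - 1] + 1,             # insert pb
                           prev[j - 1] + (pa != pb)))  # substitute
        prev = cur
    return prev[-1]


def near_homophones(word, prondict, max_dist=1):
    """(distance, word) pairs within `max_dist` phoneme edits, closest first."""
    mine = [strip_stress(p) for p in prondict.get(word, [])]
    if not mine:
        return []
    hits = []
    for other, prons in prondict.items():
        if other == word or not prons:
            continue
        # Compare every pronunciation pair and keep the closest match.
        d = min(phone_distance(a, strip_stress(b)) for a in mine for b in prons)
        if d <= max_dist:
            hits.append((d, other))
    return sorted(hits)
```

With the sample above, `exact_homophones("beach", SAMPLE)` gives `["beech"]`, and `near_homophones("beach", SAMPLE)` also picks up "bead" and "peach" at distance 1. Note that this treats every phoneme substitution as equally bad, so B/P counts the same as B/HH; a better measure would probably weight substitutions by how similar the sounds are.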
The second problem, as you might guess, is called compound-splitting,
and you can find tools for that; not sure if there's such a thing
built into NLTK. A quick search came up with this one:
. Not sure if it's
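
In case it helps to see the idea, here's a naive dictionary-based splitter. To be clear, this is my own toy sketch and not how any particular tool works: `split_compound`, `VOCAB`, and the `min_len` parameter are all made up for illustration, and for a real vocabulary you'd want a proper word list (NLTK's `words` corpus might be a starting point).

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real one would be a full word list.
VOCAB = {"star", "fish", "foot", "ball", "football", "starfish"}


def split_compound(word, vocab, min_len=3):
    """Split `word` into the fewest vocabulary words, or return [word]."""
    word = word.lower()
    # Exclude the word itself so that a known compound still gets split.
    parts_vocab = set(vocab) - {word}

    @lru_cache(maxsize=None)
    def segment(rest):
        # Fewest-parts segmentation of `rest`, or None if there isn't one.
        if not rest:
            return []
        best = None
        for i in range(min_len, len(rest) + 1):
            head = rest[:i]
            if head in parts_vocab:
                tail = segment(rest[i:])
                if tail is not None and (best is None or len(tail) + 1 < len(best)):
                    best = [head] + tail
        return best

    return segment(word) or [word]
```

Here `split_compound("starfish", VOCAB)` gives `["star", "fish"]`, and a word with no valid split comes back unchanged as a one-element list. A purely dictionary-based approach like this will happily produce silly splits with a big vocabulary, which is why real tools usually add frequency information or other scoring.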
On Sun, Jul 4, 2021 at 11:00 PM Dominic Mcg <nether...@gmail.com> wrote:
> I'm very comfortable with Python, but completely new to NLTK, and broadly speaking computational linguistics as well, and I was wondering whether I could use NLTK to search for homophones both within a language, for example searching "beach" in English would return "beech", and between languages, for example "beater" in English might return "beter" in Turkish or "hot" in English might return "hotz" in Basque.
> On another note, can NLTK parse out compound words, for example "starfish" -> ["star", "fish"] or "football" -> ["foot", "ball"]?