ACTUALLY using CMUDict to programmatically generate translation dictionaries

Martin Sherman-Marks

Jul 18, 2016, 11:40:32 AM
to Plover
The previous thread on this subject... meandered a bit, into a pretty deep (and very compelling!) debate that may someday be reprinted as a foundational text of the new science of stenology. (The study of... narrowness? Don't overthink it.) I think that thread is worth keeping as a place to argue about how many disambiguation keys we really need (I'm firmly on Team +*) and the best ways to develop steno theories for Rotokas and !Xóõ. This thread, though, I'd like to keep focused on the original challenge: can an algorithm generate a useful basic steno dictionary using only the word's spelling, its pronunciation, and its frequency?

We have a file that maps spelling to pronunciation (from the CMUDict) and frequency (in the American National Corpus). Here are the steps I see the algorithm needing to take. I'd like to invite comment on each step of the algorithm. I haven't written any code yet; this is me planning out what would actually be required to do so.
  1. Strip off any prefixes or suffixes; those will be written in using a replacement table. For example, "criticism"/"K R IH1 T IH0 S IH2 Z AH0 M" should be analyzed as "K R IH1 T IH0 S" + the -ism stroke EUFPL.
  2. Determine the canonical definition. This is the most predictable possible stroke for the word - not necessarily the best, just the most predictable. No matter how uncommon the word is, ideally we want this definition to make it in. Note that for many reasonably common words, the canonical stroke might not be the one that actually gets used the most in practice.
    1. Syllabify the word using a predictable algorithm, probably maximal onset: "criticism" becomes "K R IH1 | T IH0 S" + EUFPL. (A rough Python sketch of this step and step 2.3 follows this list.)
    2. Map each of the syllable nuclei to the orthographic representation of the word to check if the vowel is spelled <ea>, <oo>, or <oa>, and substitute in the appropriate vowel disambiguator as needed. For example, both "bear" and "bare" are "B EH1 R", but the former should get the vowel AE and the latter should get AEU. This is a tricky problem, given the joys of English vowel orthography.
    3. Convert the word to steno: "criticism" becomes KREU/TEUS/EUFPL, "bear" becomes PWAER. (If you're about to say "it should be KREUT/SEUFPL!", please note - we're still working on the canonical definition here.)
    4. Confirm that it works within steno order; if it doesn't, figure out how to break it into two strokes. (Inversion is fine, but again, not in a canonical definition.) For example, "worst" should be "WOR/*FT". This is also a difficult problem; how do you avoid hitting common briefs? ("WOR/-FT", without the asterisk, would map to "wore of the".)
  3. Determine alternate definitions.
    1. Alternate syllabifications: e.g. "K R IH1 T | IH0 S" + EUFPL.
    2. Inversions: e.g. "WOFRT" instead of "WOR/*FT".
    3. Dropping unstressed vowels: e.g. "K R IH1 T S" + EUFPL.
    4. Versions without homophone disambiguators: e.g. "heart" as HART in addition to the canonical definition HAERT.
    5. Multiple combinations of the above! This is where you'd get strokes like KREUT/SEUFPL, by combining #1 and #3.
    6. Give each possible alternate definition a score from 0.01 to 0.99. An inversion that reduces the number of strokes required, like "WOFRT", would have a pretty high score (meaning we really want that to make it into the dictionary). A version with a different disambiguator vowel, like "HART", would have a pretty low score (meaning it's great if it makes it in, but it's not something we're going to push hard for).
  4. When you've finished making canonical definitions and alternate definitions for every word, resolve conflicts and build the dictionary. (A sketch of two of these tie-breakers closes this post.)
    1. If two canonical definitions collide, try adding an asterisk to the less frequent one if possible. For example, "lest" is less common than "left", so "left" gets HREFT and "lest" gets HR*EFT.
    2. If adding an asterisk doesn't work, take the highest-scoring alternate definition of the less frequent word and try again using that as the canonical definition.
    3. A canonical definition always trumps an alternate definition. For example, "hart" has the canonical definition HART, so "heart" shouldn't get to keep the alternate definition HART. (Note that this isn't the case in the current Plover dictionary!)
    4. If two alternate definitions are in conflict, the one with the highest (score * frequency) stays, and the other is discarded.
    5. I'll probably figure out more conflict-solving rules as I go.
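
To make that concrete, here's a minimal Python sketch of steps 2.1 and 2.3. The LEGAL_ONSETS and PHONEME_TO_KEYS tables are toy stand-ins with just enough entries for the "criticism" example; a real pass would need the full English onset inventory, the step-2.2 vowel disambiguators, and the step-2.4 steno-order check. Maximal onset here just means: of the consonants between two vowels, hand the longest legal cluster to the syllable on the right.

LEGAL_ONSETS = {(), ("T",), ("S",), ("K", "R")}  # toy subset of legal English onsets

PHONEME_TO_KEYS = {        # toy mapping: (left-bank form, right-bank form)
    "K": ("K", "BG"), "R": ("R", "R"), "T": ("T", "T"), "S": ("S", "S"),
    "IH": ("EU", None),    # vowels only ever use the first slot
}

def is_vowel(ph):
    return ph[-1].isdigit()              # CMUDict vowels end in a stress digit

def syllabify(phones):
    """Step 2.1: split a CMUDict phoneme list by maximal onset."""
    nuclei = [i for i, ph in enumerate(phones) if is_vowel(ph)]
    syllables, start = [], 0
    for this, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[this + 1:nxt]   # consonants between two nuclei
        # hand the next syllable the longest legal onset from the cluster
        for split in range(len(cluster) + 1):
            if tuple(cluster[split:]) in LEGAL_ONSETS:
                break
        syllables.append(phones[start:this + 1 + split])
        start = this + 1 + split
    syllables.append(phones[start:])
    return syllables

def syllable_to_stroke(syl):
    """Step 2.3: map one syllable to steno keys (no steno-order check yet)."""
    n = next(i for i, ph in enumerate(syl) if is_vowel(ph))
    left = "".join(PHONEME_TO_KEYS[ph][0] for ph in syl[:n])
    vowel = PHONEME_TO_KEYS[syl[n][:-1]][0]
    right = "".join(PHONEME_TO_KEYS[ph][1] for ph in syl[n + 1:])
    return left + vowel + right

word = "K R IH1 T IH0 S".split()         # "criticism" minus the -ism stroke
print("/".join(syllable_to_stroke(s) for s in syllabify(word)))  # KREU/TEUS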

Thoughts? I see... a lot of problems. I'm curious if other people see the same ones or not. If someone wants to propose a completely different algorithm, I'm all ears.
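
P.S. For concreteness while you think it over, here's roughly how tie-breakers 4.1 and 4.4 might look in code. The add_star rule is a toy that only knows how to place the asterisk before an E or U vowel, and the scores and frequencies are invented:

def add_star(stroke):
    """Tie-breaker 4.1: disambiguate with the asterisk (HREFT -> HR*EFT).
    Toy version: only handles strokes whose vowels include E or U."""
    if "*" in stroke:
        return None                      # already starred; rule 4.2 takes over
    for i, key in enumerate(stroke):
        if key in "EU":                  # * sits between AO and EU in steno order
            return stroke[:i] + "*" + stroke[i:]
    return None

def best_alternate(claims):
    """Tie-breaker 4.4: of the alternate definitions fighting over one
    stroke, the highest score * frequency wins; the rest are discarded."""
    return max(claims, key=lambda c: c["score"] * c["freq"])

print(add_star("HREFT"))                 # HR*EFT -- "lest" from example 4.1
print(best_alternate([                   # invented scores and frequencies
    {"word": "worst", "score": 0.90, "freq": 12000},
    {"word": "wore",  "score": 0.15, "freq": 9000},
])["word"])                              # worst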

Zack Brown

Jul 23, 2016, 12:23:56 PM
to ploversteno
Hi Martin,

In step 1, I don't think it's good to make the assumption that the
user will know the prefix and suffix replacement tables. Typically,
those are among the last things a Plover user learns. The canonical
dictionary file should assume that users are not relying on the lookup
tables, but are simply sounding everything out by ear.

I think step 2 puts too strong an emphasis on syllabification. I know
linguistics is your field, Martin, but I think steno has some natural
characteristics that require letting go of certain basic linguistic
ideas, such as syllabification.

In steno, the 'most predictable' stroke is not the one that follows
syllabification, it's the one that jams the most possible sounds into
a single stroke, without dropping any vowels or inverting any
consonants. In other words, I think a stroke is 'most predictable' if
the user can place each finger on the keyboard from left to right,
asking themselves, "is there another sound I can fit before I run out
of fingers?" and placing each finger down if there is another
available sound that can be added. That's just the nature of steno -
it's not truly a syllable-based thing.
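
That "fill fingers left to right" idea is easy to state as code. A
rough sketch, assuming the word's sounds have already been mapped to
key groups in order (the position-based scan over the steno-order
string is a simplification):

STENO_ORDER = "STKPWHRAO*EUFRPBLGTSDZ"

def greedy_stroke(chunks):
    """Pack as many key groups as fit into one stroke without violating
    steno order; return the stroke plus whatever is left over."""
    stroke, pos = "", -1
    for i, chunk in enumerate(chunks):
        nxt = pos
        for key in chunk:
            nxt = STENO_ORDER.find(key, nxt + 1)
            if nxt == -1:                # this sound won't fit; stop here
                return stroke, chunks[i:]
        stroke, pos = stroke + chunk, nxt
    return stroke, []

# "criticism" as key groups: the greedy stroke is KREUT, not KREU
print(greedy_stroke(["K", "R", "EU", "T", "EU", "S"]))  # ('KREUT', ['EU', 'S'])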

In step 2.2 (and related steps), I don't think you should bother
trying to algorithmically implement disambiguators. Any disambiguator
is most properly thought of as a 'brief'. Disambiguation is what steno
people actually consider to be their steno 'theory'. This is an idea I
got wrong originally when I started learning from Mirabai, when I
thought that the keyboard layout was the 'theory'. In fact, briefs are
the theory. So the difference between Plover and other systems such as
Phoenix is simply their approach to disambiguation. Because of that,
I'd leave that whole question out, and simply come up with the
canonical set of dictionary entries that would be essentially the same
for all theories that use the same keyboard layout.

So basically I'm suggesting that you simplify your algorithm, and
don't even try to resolve any ambiguities. Just construct the
canonical dictionary file, and let each user design their own theory
of brief forms on their own. This would have the benefit of being
extremely useful for generating dictionaries for other languages.

If you want to go beyond that and also calculate disambiguation
entries, I'd suggest making your approach generic, i.e. just give
people the option to define their own set of disambiguation rules,
such as using spelling to disambiguate between 'heart' and 'hart', and
have your code implement their rules to produce a working dictionary
file. For English, this would have the tremendous benefit of making it
easy for people to design and test their own theories of brief forms.
But it would also make it easy for people using other languages to
construct their own complete steno theory.
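
One shape that generic interface could take, as a sketch (the rule
name and signature here are made up, not anything Plover provides): a
"theory" is just an ordered list of functions, each proposing an
alternate stroke or giving up, and the engine tries them in order
until one lands on a free stroke.

def spelling_disambiguator(word, stroke):
    """Hypothetical user rule: mark <ea> spellings with AE (heart -> HAERT)."""
    return stroke.replace("A", "AE", 1) if "ea" in word else None

def resolve(word, stroke, taken, rules):
    """Try each rule in order until one produces an unclaimed stroke."""
    for rule in rules:
        alt = rule(word, stroke)
        if alt and alt not in taken:
            return alt
    return None                          # no rule worked; leave the word out

print(resolve("heart", "HART", {"HART"}, [spelling_disambiguator]))  # HAERT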

The problem with designing a new steno theory by hand (i.e. a theory
of brief forms) is that you can think everything's going so well,
until you suddenly run up against some horrible contradiction that
invalidates some of your basic rules. That's one reason why Plover and
the other steno systems are so great - for the most part they manage
to resolve conflicts and produce briefs without contradicting too many
of their own rules.

Be well,
Zack

Martin Sherman-Marks

Jul 23, 2016, 12:32:01 PM
to plove...@googlegroups.com

But what's the utility of thousands of entries if nothing is done to resolve conflicts? A human going through that and trying to make it into a useful dictionary file would a) die of boredom and b) not be able to maintain a systematic method of disambiguation.


Zack Brown

Jul 23, 2016, 12:46:11 PM
to ploversteno
Ideally, the canonical dictionary file would have just two types of entries:
1) entries that have no natural conflicts with any other words, and
2) entries for words that do conflict, but win the tie by being
more popular than the other words.

So basically you'll have a dictionary file with many thousands of
canonical entries, and no conflicts. Any words that conflicted with
more popular words would be left out.
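
In code, that rule is nearly a one-liner over the candidate list,
something like this (the word frequencies are invented):

from collections import defaultdict

def build_canonical(candidates):
    """candidates: (word, stroke, corpus_frequency) triples.
    The most frequent word claims each stroke; the rest are left out."""
    by_stroke = defaultdict(list)
    for word, stroke, freq in candidates:
        by_stroke[stroke].append((freq, word))
    return {stroke: max(claims)[1] for stroke, claims in by_stroke.items()}

print(build_canonical([("left", "HREFT", 3100), ("lest", "HREFT", 80)]))
# {'HREFT': 'left'} -- "lest" just doesn't appear in the canonical file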

When a user wants to start adding new words that conflict with
existing entries, they'll have to start coming up with their own
'theory' for resolving those conflicts. So if it were me, I'd come up
with "Zacksteno Theory", which would have all the new entries that
followed the rules I'd come up with for resolving conflicts and
creating briefs.

That's where the generic form of your conflict resolution code would
come into play. Using it, I would be able to come up with an idea for
conflict resolution, use your code to generate dictionary entries for
any word conflicts that it would resolve, and then see how well my
idea worked. If it worked well, I'd keep it and come up with
additional ideas for conflict resolution.

Be well,
Zack

Steven Tammen

Aug 1, 2016, 9:53:53 PM
to Plover
I was going to make another thread for this purpose, but I'm not sad you got to it first, since you're the one with more relevant experience. We got too sidetracked in the last thread for it to be helpful to people interested in the original topic, even though bits and pieces related to it were sprinkled throughout all the interesting (but tangential) conversations.

Here are some other things that we'll want to think about (other than what you mentioned, which hit most of the problems I'd come up with):
  • Dealing with proper nouns, place names, and so on. IIRC, you had already worked out something related to this problem in the prior thread.
  • Whether or not we'll want to have some words that are accessed with briefs only.
  • Whether it will be possible to add most possible permutations of syllabifications of common words to the dictionary without causing conflicts (rather than just a couple alternatives).
  • Whether we'll want to sacrifice consistency to allow for common existing briefs or briefing patterns, like the example you give.
  • Whether we'll want to ignore very uncommon words (like "hart") and instead use their canonical definitions as alternate definitions for much more common words. For example, we could set a minimum corpus frequency below which a word's canonical definition gets appropriated as an alternate definition for a more common word, provided that word is above a second, higher frequency cutoff. (See the sketch after this list.)
  • The ability to let users decide what they want to be "canonical". (This is only important after we have something that works given a certain set of assumptions and want to expand the algorithm to account for different tastes).
  • The ability to choose which symbols are used to disambiguate in different situations (in case someone wants to use different symbols for different purposes: one for homophone disambiguation, one for capitalization forms, etc.)
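
For the appropriation idea a few bullets up, a sketch with invented cutoffs (both thresholds would need tuning against the ANC):

RARE, COMMON = 1e-7, 1e-5    # invented relative-frequency cutoffs

def stroke_owner(canonical, alternates):
    """canonical: (word, freq) that owns the stroke by the normal rules;
    alternates: [(word, freq)] that want it as an alternate definition."""
    word, freq = canonical
    best = max(alternates, key=lambda a: a[1], default=(None, 0.0))
    if freq < RARE and best[1] >= COMMON:
        return best[0]       # e.g. "heart" takes HART away from "hart"
    return word

print(stroke_owner(("hart", 2e-8), [("heart", 4e-4)]))   # heart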
I think this task is significantly more difficult than I had initially thought it to be, and I'm sure I'll continue to think of more things over time, especially now that I'm actually learning stenography.

Keep up the good work!