Using CMUDict to programmatically generate translation dictionaries


Steven Tammen

May 29, 2016, 4:21:57 PM
to Plover
Over the last couple of weeks I've been considering which theory I'm going to pick up when my SOFT/HRUF comes. Then it occurred to me that I liked some things from different theories without necessarily liking everything about any one of them; that is to say, I realized that what I really wanted to do was steal certain bits from different theories and combine them to make my own. The problem is, I don't think there's any good way to do this at this point in time.

I wrote up an idea that I had that might allow this sort of freedom, and I'd really like to hear people's feedback on it (particularly from a feasibility perspective on the backend). I'm no programmer, so I couldn't do anything like this without the support of the devs and the Plover community at large.

Questions to start:
  • How valuable do you think something like this would be in relation to other possible additions to Plover?
  • What would the demand for this be? How many other people want to make their own theories, or change things about existing ones?
  • (Targeted at experienced stenographers) If you could change things about how you currently write, what would they be? How could these ideas help contribute to a project like this where you get to handcraft your own theory from scratch?
  • (Targeted at Ted and Benoit) How hard would this be to implement? Would it take a long time and detract from other development goals?
I think it would be good to keep much of the discussion on the Google group for reasons of permanence, but I'll be on Discord too.

Thoughts?

Theodore Morin

May 29, 2016, 4:41:41 PM
to Plover

I think you'll find that no matter what theory you start with, you will customize it to your taste. Just do you.

Stanographer said that his dictionary has tripled in size since he started, and he has changed base theories multiple times.

Plover's is a solid base, and you can really get into customizing briefs after 50 WPM, once you've developed your own taste.


Steven Tammen

May 29, 2016, 4:50:05 PM
to plove...@googlegroups.com
I'm not really talking about briefs per se. Even Phoenix people use briefs. I'm talking about customizing the underpinnings, the "guts" of a theory, so to speak. Things like what key combinations you use to represent phonemes.

I was planning to start out with Plover anyhow, I'm just trying to think more long-term and ideal. Like Colemak/Workman/etc. vs QWERTY.

How would you go about customizing a theory at present?


Tony Wright

May 29, 2016, 5:09:39 PM
to plove...@googlegroups.com
I want to take just a snip from your write up:

"What I have in mind is a program that reads in the Carnegie Mellon University Pronouncing Dictionary and outputs a translation dictionary according to the preferences of an individual stenographer."

This is exactly what I have been dreaming of for a while. I'm very familiar with the CMU dictionary. It's a resource that stenography should be exploiting in many ways, and this is an important one. The ability to automatically generate a dictionary that would contain every reasonably frequent word in English, including proper names, would be huge.

I don't have the programming ability to do something like this on my own, but I'm a linguist, and I'd be glad to help develop the rules for phoneme-to-grapheme mappings that users could choose as options.

--Tony


Steven Tammen

May 29, 2016, 5:26:33 PM
to plove...@googlegroups.com
Exactly! To be quite honest, I did a good bit of Googling before I spent time on this because I couldn't believe that I was the first person to look at CMUDict and go "well, that'd be useful for stenography".

In terms of the mappings, I think it would be prudent to work on "reconstructing" common mappings (e.g., those of Plover's theory and Phoenix) before getting to more specialized options. Like I said in the piece, I think this is going to be the hardest part for spelling-dependent theories, because the graphemes will change based on context (i.e., the same sound can be stroked multiple ways depending on how the word is spelled).

What are some of the other steno-related things you were thinking about using CMUDict for? 




JustLisnin2

May 29, 2016, 9:16:47 PM
to Plover
Hi everyone,

I'm neither linguist nor programmer nor professional stenographer, but I do have some thoughts to add if they're of any value. I've always loved the idea of consistency in dictionary definitions, so this is definitely an interesting discussion. Some points, though:

1. Are most frequently used words briefs? My intuition is to brief common words and handle any longer/derivative words with word parts. It's always been the technical/medical terms that I've had difficulty defining entries for. Would it be constructive to compare the phonetic dictionary to Plover's main dictionary, or any of the proprietary steno dictionaries, and see just how many of the entries are in fact briefs? Or are you proposing this dictionary format so that learners can have consistent, non-brief forms as they're learning?

2. From my own personal steno experience, how comfortable a stroke is for my hands plays a far larger role than any consistency in definition. If the stroke isn't comfortable, I just won't use it.

3. If you already have a well-defined vision of how this might translate, can you give a few examples of words in the Plover main dictionary that would change based on the phonetic dictionary you're suggesting? Just to clarify (thanks)


Nat

Gavan Browne

May 29, 2016, 9:36:14 PM
to Plover
I think that's an awesome idea, but as a non-programmer and non-linguist I'm not sure I have much to add to the conversation. I did something similar-ish to programmatically generate text expander abbreviations from a large word list (12dicts). I had a list of about 11,000 abbreviations gifted from another transcriber, but I decided I wanted more and now have about 60,000 or so. It generated stuff like sstnblte for sustainability, which isn't great, but a lot of what it churned out was good and usable as is. If I had more free time I might have delved a bit deeper and tried to use phonetics to generate better results.

As a QWERTY typist, giving precedence to higher-frequency words for the shorter abbreviations is important, as is generally keeping all abbreviations as short as possible. To that end I've started using any unused two-letter pairs, so uf = contribute, which is a lot better than kntrbt. That requires long-term memory, though, and has to be learned. That's a balance between logic/consistency and efficiency, I guess. As a full-time typist I'd trade consistency for efficiency every time, but a learner or novice would find that confusing and off-putting, I'd imagine. Is there currently a way to generate a list of unused strokes in the base Plover dictionary? Would you want to fill those up and sacrifice consistency? A good example is S-G for something. It's perfect and easy to use and makes sense, but it's perhaps less consistent than something like STH-EUPBG.

I'd also say the shortest possible way to stroke a word isn't necessarily the easiest way. For example I have trouble stroking words ending -FPB so I just stroke them -FRPB and thankfully it works for the few I've encountered so far.

One other thing I'm wondering is whether pronunciations can be converted to steno strokes without modification. I'm thinking of "obliged", which would be stroked OB/BLIG/D (not real steno), but would that be represented in phonetics as O/BLIG/D?

I imagine what you're saying can be done, and probably done well, with a linguist and a programmer though. A further and final thought: instead of trying to mass convert a huge quantity of words, maybe an app that a user can input a word into; the software goes off and finds the corresponding phonetic pronunciation, generates a list of every possible way to stroke that word in steno (excluding any conflicts within a dictionary the user may specify), presents the list to the user, and allows them to choose which brief they want to use. Something like that would suit me, because let's say I could tick a box that says "include unused strokes" and get a really short stroke for a long word; and it would suit the person who prefers consistency, because they could choose the one that makes most sense to them.

That sounds easy but I can imagine there might be a huge number of ways to stroke a given word. I lied about the final thought above. Let's imagine the algorithm determines the best way to stroke "something" is SEUG which is literally "sing" if you were to convert it back to English. That's not a problem because sing is defined as something else in the dictionary. The problem is confusion for the person who strokes SEUG expecting sing but who gets something, if that makes sense. I guess maybe a consideration is any programmatically generated steno stroke/brief should avoid this or be flagged in some way.

I'll stop rambling now.

JustLisnin2

May 29, 2016, 9:42:28 PM
to Plover
Looks like we had some similar ideas, Gavan. I forgot about word boundary errors, though. That's very important.

Nat

Steven Tammen

May 29, 2016, 10:15:22 PM
to Plover

Hi Nat,


Of course everyone's thoughts are valuable. I'm actually like you: neither a linguist nor a programmer nor a professional stenographer (nor a stenographer of any sort, really -- still need to get something to practice on). The more people we have participating in this discussion, the better!


1) You are correct that most common words are briefed. My idea in doing this is actually entirely separate from briefs, and I had attempted to make that clear. The thing I am interested in here is giving some thought to stenography without briefs -- "everything else", so to speak. Even adherents of theories like Magnum have to stroke stuff out fairly frequently, and that's compounded all the more for new people who don't have thousands of briefs in muscle memory. So, the logic goes, shouldn't we try to optimize this portion of stenography as much as we can as well?


My main motivation for having something like this is to make everything that is not briefed as efficient as possible, in a way that lets people do something that makes sense to them instead of drilling someone else's theory by rote -- people make their own dictionaries instead of learning someone else's. So it is in a way related to learning, but it's also a matter of pure efficiency, letting people do what works for them. (And I know from firsthand experience that if something "doesn't work for me", I do far better building something for myself rather than trying to force someone else's thought processes on myself.)


Being able to tweak theories easily is really impossible currently, AFAIK. Being able to generate different dictionaries to “test out” changes is another primary motivation behind this idea. I come from a background of custom-designing my own 6+ layer keyboard layout, so not being able to change stuff is a major downside to stenography in its current form, in my opinion.


2) I think this is in relation to briefs again. For briefs, everyone in fact must do what makes sense for them, otherwise the briefs will never stick. What I'm talking about is just the equivalent of this for the rest of stenography -- doing what's comfortable for you instead of having to "learn" something someone else came up with.


3) Just comparing to Phoenix (which is a form of phonetic theory), we can look at a few non-briefed words:

Word: Neither
Plover's theory: TPHAOE/THER or TPHAOEU/THER (among other definitions, see here)
Phoenix: TPHAOEURGT

Word: Excesses
Plover's theory: EBGS/SES/-S
Phoenix: KPES/-Z

Word: Metallurgy
Plover's theory: PHET/A*L/AOURPBLG/SKWREU (among other definitions, see here)
Phoenix: PHET/HRAERPBL


I don’t really have a fine-grained vision in mind yet because I wanted to see what other people thought first. Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound).

Keep the thoughts coming!

-Steven

Steven Tammen

May 29, 2016, 10:50:48 PM
to Plover

Hi Gavan,


You bring up some good points. There is always a tension between shortness (or "efficiency") and consistency. This is actually the primary difference between phonetic theories like Phoenix and brief-heavy theories like Magnum: the former tries to be consistent and sacrifices short writing because of it, and the latter tries to be short and sacrifices consistent writing because of it.


I’m not convinced this has to be an either/or, however. You can have an efficient phonetic theory base for writing out uncommon/nasty words, and still brief like crazy. The two aren’t mutually exclusive. What this project would be focused on, however, is the former: getting that theory base for writing out words independent of briefs in a form that makes sense to individuals rather than trying to adopt someone else’s base for “consistency”. Your thoughts on briefs are spot on, but that’s a whole different subject.


To take your “something” example, what I had in mind here was a program that would take the individual sounds in the word (known as phonemes) and let an individual choose how to stroke them, either phonetically (as in Phoenix) or based on spelling (as in Plover’s theory). This would give users flexibility with regard to their non-briefed dictionary entries, which is actually the part that we don’t have control over right now. We can brief stuff out to our heart's content, but changing how you write normally — external to briefing — is a much different task.


—————————


On syllable division, this is a linguistics problem. One option for us would be to follow the maximum onset principle as it is classically defined. You can read about it here (less than you probably want) or here (more than you probably want). The onset is the beginning part of a syllable, and the coda is the ending part. Pretty much, if you always stick as many consonants as you can in the onset instead of the coda (so long as it is phonotactically allowed in your language), you won’t run into as many problems of syllabification.


Basically, if we followed this rule in how we split up syllables, we could stroke the words in the same way we split up the syllables, and we wouldn’t have this problem because it would be consistent. Perhaps someone more knowledgeable about linguistics than I could explain better.


—————————


I’m afraid I don’t follow your last bit on having the program spit out a “list” of possible ways to stroke something out. If we allow people to define their own strokes for phonemes, theoretically there are many different ways to stroke the same thing, but only one that follows any given person’s phonemic map. People would certainly be free to brief on top of the consistent definition for their personal theory, but I think it would be a mistake to make words only accessible by briefs except for a select few that are extremely common.


On the other hand, if what you’re suggesting is a program that suggests briefs for words based on what’s available, then I think that is another fantastic idea — but it is different from the one I am forwarding here.


Good stuff! Keep the ideas coming.


-Steven



JustLisnin2

May 29, 2016, 11:06:57 PM
to Plover
Hi Steven,

1. I see. That's what I thought. The goal of this effort would be to create an optimal, consistent dictionary for non-brief forms. I understood that you had wanted to make this entirely separate from briefs, but I was wondering how constructive it would be considering that, as a learner, most of the first words I picked were, in fact, brief forms. But it does make sense to optimize the rest of stenography, especially since there are still some words that may not even be defined in non-brief forms.

2. When I said "comfort", I meant physical comfort for my hands, not so much my comfort with the theory. It sounds like you're looking at this solely from a theoretical/linguistic standpoint, and I just wanted to add the practical side as well. But I understand that people can default to their own briefs/definitions as they need to for comfort.

3. Can you explain this part to me? "Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound)." I'm not sure I understand. Are you suggesting going as far as changing the definition of the "ch" sound, etc.? Are you referring to the treatment of vowels among the different steno theories? Or what phonemes do you feel the existing theories are restricting you to? Also, I think I misunderstood the "what" portion of this project as well. I thought that you meant creating a standard dictionary based on the Carnegie Mellon dictionary, but now it sounds like you're suggesting an entry generator, so to speak? If that's the case, and the Carnegie Mellon dictionary has consistent, phonetic rules, how will this result in multiple entries for learners to choose from?

4. I replied to Gavan earlier with a mention of word boundary errors. While having the freedom to choose your own strokes instead of learning someone else's theory by rote sounds liberating, one of the reasons why I, personally, was reluctant to add my own entries to the dictionary when I first started learning was because I had no intuition for which word boundary errors would wreak havoc on my writing. Aside from Learn Plover, there's really not a lot of free formal study material to go alongside Plover. It was trial and error. I made months of changes to the main dictionary before I realized that I should've been adding entries to my own personal dictionary. As I said, I'm no linguist, so if any linguists can chime in: is there a way to define a set of rules that could uncover all possible conflicts that could occur? How do you ensure that this new, customized theory is conflict-free, especially since you're targeting new learners?

My final thoughts on this for tonight :) Good night!
Nat

Steven Tammen

May 29, 2016, 11:49:37 PM
to Plover

Hi Nat,


1) I see what you’re saying. There would need to be some minimal set of briefs that are not a part of the normal dictionary generation (“the”, “and”, “he”, “or”, etc.). Of course these could be variable too, but then we get into the subjective issues of briefs discussed above. I hadn’t really thought about this too much (no doubt because I haven't really learned steno yet).


2) Wow I totally misread what you were saying, haha. Agreed. The difficulty of strokes from a physical standpoint should be taken into account as well (holding down 2 keys is easier than 6, for example).


3) I was thinking of pretty much opening up how phonemes are stroked entirely. Of course most everyone would probably leave the “ch” sound as it is… but what if someone didn’t want to, and wanted to move stuff around on the steno keyboard? Well now they’d have the option to. The differences in vowel sounds between theories were a primary motivator of this consideration, but another one I was thinking of was how things are stroked depending on how the word is spelled. Unless I’m totally misinformed, most theories might stroke the same sound different ways if it is made with different letters in English. Letting people choose these sounds was something else I had in mind.


The generated dictionary will be “standard” (i.e., consistent) according to the preferences that the user specifies — it could be totally different from a dictionary that someone else generates based on their preferences. What exactly do you mean by “entry generator”?


4) I had kinda mentioned this — albeit vaguely and not in a very good way — in my write-up:


“It [the generator] will automatically take out medial schwa, roll in suffixes, and create disambiguation briefs to the extent possible without creating conflicts. Problematic words will be displayed for either further programmatic processing (e.g., if a word ends in -ing without it being a suffix, do ____ to add on -ing), or hand-correction.”


From what I’ve read, totally conflict free writing is a myth. This is always a game of compromise. I’m not qualified to comment on the specifics of this (in fact you probably know more than me because you’ve actually been learning steno for a while), so it would be good if someone knowledgeable helped think of ways to deal with this problem. What I do know is that long words tend to have less word-boundary problems than short words, and that briefing very common short words can solve many of these problems.


-Steven

JustLisnin2

May 30, 2016, 10:26:07 AM
to Plover
Hi Steven,

The picture is becoming clearer now, thanks. So you're suggesting an entire dictionary that's generated based on preset preferences. I was thinking along the lines of a word-by-word generator based on the Carnegie Mellon dictionary, where the user would have the option to choose between multiple dictionary entry suggestions and pick the entry that fits their preference. So if you want an entry for a word, you would type that word into the generator and get a list of choices, then you could pick the one you want. This led me to think that an unwary user might mix and match among entries, thinking only in terms of individual words and not thinking about potential word boundary errors that could arise. You're right; an entirely conflict-free theory is a myth. But an entire dictionary makes much more sense to me in terms of conflicts. You're sort of generating your own theory that fits your personal writing style, and, as you said, it's a way to test out potential new theories and pick the most efficient one on an individual basis. That sounds so cool :)

Nat

Steven Tammen

May 30, 2016, 11:13:28 AM
to plove...@googlegroups.com
Yes, that's it. 

I think letting people choose from options like that isn't a terrible idea, but it would need some sort of conflict detection (at least ideally) for it to work well. It would serve the purpose of letting people "try out" different briefs for words without necessarily having to come up with them haphazardly every time they want to brief something. So long as it could filter based on individual people's dictionaries, people could see what strokes are "available" to them, keeping their briefing conflict-free (see the sketch below). I believe this is what Gavan was thinking of above. To extend the idea even further, if we got the permission of established theories (Magnum, etc.), we could even have this hypothetical "brief generator" display the briefs those established theories use for words as well.
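A very small sketch of that kind of availability filter, assuming a Plover-style JSON dictionary (outline strings mapped to translations) and a hand-picked list of candidate briefs; generating the candidates themselves from a pronunciation is the hard part and isn't shown:

    import json

    # The user's personal dictionary; Plover dictionaries are JSON maps from
    # outline strings (e.g. "S-G" or "PHET/HRAERPBL") to text.
    with open("user.json", encoding="utf-8") as f:
        taken = set(json.load(f))

    # Hypothetical candidate briefs for "something" -- placeholders only.
    candidates = ["S-G", "SOG", "SOFG", "STH-EUPBG"]

    for outline in candidates:
        print(outline, "is", "taken" if outline in taken else "available")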

-Steven


Steven Bhardwaj

May 31, 2016, 1:04:21 AM
to Plover
Hi All,

On the subject of Plover and Linguistics,

This thread makes me think of the idea of using a chorded keyboard system for inputting raw IPA transcriptions...
IPA keyboard for English: http://ipa.typeit.org/
IPA for English Audio key: http://www.antimoon.com/how/pronunc-soundsipa.htm
IPA keyboard with all symbols: http://ipa.typeit.org/full/

I expect it would be helpful, to retain any vestige of efficiency, to have a language-specific IPA theory. But the theory probably ought to be an orthographic theory like Mecatipia, Jackdaw, or Kinglet, making it (I suppose) easier to create. CMUdict would probably be helpful in designing a reasonably well optimized theory, but it might be more flexible for this application to use an orthographic-style dictionary rather than a regular word-centric steno theory.

An English setup might be fun if I ever wanted to learn how to imitate different English accents! Although more serious uses would include transcribing endangered languages from DAT tapes, etc.

:)
Steven

Steven Tammen

May 31, 2016, 10:35:01 PM
to Plover
Hi Steven,

I had toyed with the idea of basing the generation itself on IPA, but I couldn't seem to find a good English dictionary in IPA. CMUDict uses ARPAbet, which is English-specific and also entirely ASCII-based (it requires no Unicode support -- IPA does).

We could use CMUDict to generate IPA transcriptions like you say, but I wanted to keep the goal focused for the first little bit (i.e., getting a working dictionary generator for "normal" English stenography). Over time, something else I thought we could do was take in IPA as a language independent source, and get steno support for languages without developed theories yet (assuming we could find pronunciations for words in said language in IPA). This would help open stenography up to more people who wouldn't otherwise have access to it in their native language.

I think you may be right that orthographic input systems have the upper hand for full IPA. There are a lot more sounds than we use in English, for example, and it would be difficult to fit them all on a typical 22-key stenotype.

-Steven

Martin Sherman-Marks

Jun 1, 2016, 4:45:16 PM
to Plover
This conversation is relevant to my interests! CMUDict is a good starting point, but I see a few problems you'll need to solve along the way:
  • Syllabification: CMUDict doesn't define syllable boundaries, which are critical for any steno theory. (As I know very well from my own experience as a learner. Most of the time, when I can't figure out how to stroke a word, I eventually realize it's because the definition in Plover is based on a different syllabification than the one in my head.) And unfortunately, there is no particularly good rule for syllabification in English, especially when you're etymology-blind. The TeX hyphenation algorithm has been ported to Python, and that might be a good starting point - but note that "where to stick a hyphen" and "where to stick a syllable break" are related but different problems. (The TeX algorithm won't hyphenate "project", for example; see the first sketch after this list.) I'm not saying it's going to be impossible to syllabify the CMUDict algorithmically, but it'll present some interesting challenges.
  • Morphemes: CMUDict is morphology-blind; it has separate line entries for "ABANDON", "ABANDONS", "ABANDONING", "ABANDONED", and "ABANDONMENT", for example, with no way to know that those words are all connected. Before you start trying to run through CMUDict, you'll want a prefix/suffix dictionary, which will almost necessarily be non-phonetic. (For example, Plover uses "*PLT" for "-ment", not just because it's shorter but also so that "PHEPBT" is available for the first syllable of "mental". Otherwise, how would you talk about your friend Abandonment Al? Poor Abandonment Al. He's got some problems.) Oh, and going back to my earlier point, you'll have to make sure that your syllabification algorithm sees "ABANDON/ING" rather than "ABANDO/NING" - but still sees "SING", not "S/ING" - or you'll have a disaster on your hands.
  • Conflicts: I threw together a quick Python script to count how many homophones there are in the CMUDict. I found 13,015! (A rough version of that count appears after this list.) (Admittedly, many of them, like "beet" and "beat", can probably be dealt with using the Plover theory's built-in disambiguation rules. I didn't account for that.) So conflict resolution definitely isn't a "figure it out manually" kind of problem, unless you intend to pore over a hell of a lot of dictionary entries. Unfortunately, conflict resolution relies in large part on something CMUDict won't tell you: word frequency. Which is more common: the word "accord" or the acronym "ACORD" (the Association for Cooperative Operations Research and Development, naturally)? You can answer that very quickly; CMUDict can't. And that means that you need some other test to tell you which should get the rule-following stroke A/KORD and which should get a rule-breaking stroke like A/KO*RD. (Or, in this case, which is so uncommon that maybe it should just be quietly ignored.) Plus, you need to teach your script how to craft a good rule-breaking stroke. It's easy enough to say "just throw an asterisk in there", but remember that Plover theory uses S* for initial /z/ and *T for final /th/, so your word may already have an asterisk in it. You can also change out the vowels, or repeat the stroke more than once to cycle through related options. (Repeating A*PB to switch between {^an}/Anne/Ann is one of the more remarkable versions of this in the Plover default dictionary.)
  • Capitalization: Another thing the CMUDict doesn't have: lowercase letters! The Plover dictionary has PHARBG for "mark" and PHA*RBG for "Mark"; that sort of thing is very common. If I hadn't looked up "ACORD" in the last example, I wouldn't have had any way to know it wasn't "acord" (not a word). Even a smart algorithm that was reading through the CMUDict would have surely given me a dictionary entry "A/KO*RD: acord" for a word that doesn't actually exist!
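A quick sketch of the hyphenation point from the first bullet. It assumes the third-party pyphen package (one Python implementation of Liang-style, TeX-descended hyphenation patterns) is installed; it only shows where hyphenation points fall, which is related to, but not the same as, syllable boundaries:

    import pyphen

    dic = pyphen.Pyphen(lang="en_US")
    for word in ["project", "dancing", "abandoning", "metallurgy"]:
        # .inserted() marks hyphenation points, e.g. "aban-don-ing";
        # these are typesetting hyphens, not phonological syllables.
        print(word, "->", dic.inserted(word))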

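And a rough version of the homophone count from the third bullet, assuming a local copy of the plain-text CMUDict (e.g. cmudict-0.7b, where comment lines start with ";;;" and alternate pronunciations look like "WORD(1)"); exact totals will depend on how variants are counted:

    from collections import defaultdict

    words_by_pron = defaultdict(set)
    with open("cmudict-0.7b", encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue
            word, pron = line.split(None, 1)
            word = word.split("(")[0]        # strip "(1)"-style variant markers
            words_by_pron[pron.strip()].add(word)

    homophone_sets = [ws for ws in words_by_pron.values() if len(ws) > 1]
    print(len(homophone_sets), "pronunciations are shared by 2+ distinct words")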
I think a common solution to many of these problems is to incorporate more than one wordlist. For example, a word frequency table would help with conflict resolution at the very least - though you'd need a big one from a good corpus. Step one would be to write some kind of script that turned the CMUDict into a more complete dictionary with a format like:

word    W ER1 D    245

That's the word in its normal capitalization, pronunciation, and then frequency rank. You'd still have morphology and syllabification problems to think about, but that would be a good step one.
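A minimal sketch of that "step one" merge; the file names and the frequency-list format (one word per line, most frequent first, as in the word-frequency list Zack links below) are assumptions rather than settled choices:

    # Rank words by their position in a frequency list, then emit
    # "word <TAB> PRONUNCIATION <TAB> rank" for every CMUDict entry
    # we have frequency data for.
    freq_rank = {}
    with open("word_frequencies.txt", encoding="utf-8") as f:
        for rank, line in enumerate(f, start=1):
            freq_rank.setdefault(line.split()[0].lower(), rank)

    with open("cmudict-0.7b", encoding="latin-1") as src, \
         open("cmudict_with_freq.txt", "w", encoding="utf-8") as out:
        for line in src:
            if line.startswith(";;;") or not line.strip():
                continue
            word, pron = line.split(None, 1)
            base = word.split("(")[0].lower()
            if base in freq_rank:
                out.write("{}\t{}\t{}\n".format(base, pron.strip(), freq_rank[base]))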

Zack Brown

Jun 1, 2016, 5:20:30 PM
to ploversteno
For syllabification, even if there's no good rule for it, there may be
a good rule to identify the range of possibilities. Any new dictionary
will probably want to have entries for as many possible
syllabifications of words as it can, to account for everyone's
personal tastes (similar to what Plover does now). Also bear in mind
that you will probably want to include things like dropping unstressed
vowels, and the inversion rule. This messes with syllabification a bit
as well. You'll probably need to come up with a whole new approach to
syllabification, based on making those assumptions.

Also, for any programmatic analysis of CMUDict, I'd recommend
prioritizing words based on frequency of use. Peter Norvig's word
frequency table is at http://norvig.com/google-books-common-words.txt

I'd also suggest programmatically coming up with a set of prefix and
suffix strokes, similar to what Plover has. The idea would be for no
word to end with the keys used in any prefix stroke, and for no word
to begin with the keys used in any suffix stroke, to avoid word
boundary errors.
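One very literal reading of that rule can be checked mechanically against a Plover-style JSON dictionary (outline strings mapped to text); the dictionary path and the candidate affix strokes below are placeholders, not real proposals:

    import json

    with open("main.json", encoding="utf-8") as f:
        outlines = json.load(f)              # {"STROKE/STROKE": "translation", ...}

    candidate_prefix_strokes = {"EBGS", "KAUPB"}   # hypothetical prefix strokes
    candidate_suffix_strokes = {"-G", "*PLT"}      # hypothetical suffix strokes

    first_strokes = {o.split("/")[0] for o in outlines}
    last_strokes = {o.split("/")[-1] for o in outlines}

    # Words that begin with a suffix stroke, or end with a prefix stroke,
    # are potential word-boundary trouble under this rule.
    print("suffix strokes that also start words:", candidate_suffix_strokes & first_strokes)
    print("prefix strokes that also end words:  ", candidate_prefix_strokes & last_strokes)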

Another thing to bear in mind is that in steno (although I don't use
this lingo in Learn Plover), the "theory" is generally considered to
be the particular approach to constructing briefs. The whole set of
standard and repeating rules governing consonant and vowel sounds, and
things like the inversion rule and so on, is not called 'theory'
because it's considered so fundamental that it's not even questioned -
all English language steno systems use those same basic chords and
rules, for the most part. At least that's my understanding.

But I would suggest changing that. Anyone coming up with a new
dictionary should truly start fresh. Use CMUDict and the Norvig files,
and come up with an entirely new set of keys and chords for all the
different English sounds. I think if you do that, it may be possible
to improve on Ward Stone Ireland's original keyboard layout. At that
point, it might be possible to significantly reduce word conflicts,
and fit a far greater number of multi-syllable words into single
strokes.

Ward Ireland's keyboard was designed 100 years ago, with virtually no
statistical calculation to guide him. Additionally, it was designed to
be entirely syllabic. There were no briefs because there were no
lookup files. It was only in the 1980s that the proprietary steno
companies introduced dictionary files and briefs. Given that kind of
chaotic history, I think there's a very good chance that a much better
solution exists than the one that's come down to us. I think whoever
works on this is very likely to find a much cleaner, sharper system
than any of the steno systems currently in existence.

Be well,
Zack
--
Zack Brown

Martin Sherman-Marks

Jun 2, 2016, 9:17:41 AM
to Plover
Zack, I was thinking about that very idea of "identifying the range of possibilities"; part of the challenge will be determining how much a particular stroke should be allowed to "spread" in the dictionary. Not just for syllabification, but for misstrokes too - the algorithm will have to think about how hard a particular stroke is, what the likely misstrokes are, and then will have to weigh how frequently the word is used against the space that the likely misstrokes will take up. A fairly complicated and nuanced process!

I've been trying to find a word frequency list that is case-aware, but with no luck so far. The American National Corpus - which, I'm pleased to note, contains among other things 3 million words from a Buffy the Vampire Slayer fan forum - has frequency data, which doesn't differentiate by case but does differentiate by part of speech, including proper nouns. (It also includes bare lemmas for plural nouns, which I suspect may be helpful down the line.) I'll attempt to pull it together into a case-aware word frequency list on the assumption that pretty much all proper nouns are capitalized. The next step after that will be combining it with the CMUDict to add in pronunciation, which should be fairly straightforward, I hope. (There is a larger, cleaner word list, from the 30x larger Corpus of Contemporary American English, but that costs $250, or $125 if we can claim academic use. If the ANC wordlist works, then it would be fairly trivial to modify the script to use the CoCAE data when I'm feeling wealthier.)

With regard to what Zack was saying about developing new ground-up principles of steno from this - I think he may well be right, and it's something I'm interested in exploring. Unfortunately we need to conquer syllabification first. Once we have that, we can develop a complete list of syllable onsets, nuclei, and codas in American English (and their frequency!) - that's the point where we can start rethinking the keyboard.

Martin Sherman-Marks

Jun 2, 2016, 11:10:51 AM
to Plover
Yikes. Okay. The ANC list has some issues. My assumption that anything flagged as a proper noun should be capitalized has run into the issue that they flagged a lot of words as proper nouns. The word "accent", for example, occurs 449 times in the corpus, and is flagged as a proper noun 23 of those times. Not super helpful. In total, it looks like about 40% of the words that occur more than twice in the sample are flagged as proper nouns at least once, which is... ugh. There are more proper nouns in the dataset than improper ones!

I was able to improve things by using an SSA dataset to generate a complete list of all 95k first names registered since 1880, then only capitalizing entries if they're flagged as proper nouns and are in that dataset and are in the CMUDict. (I'm downloading GNIS/GNS datasets now so I can add geographical names as well - they're huge datasets, naturally, but by limiting the list to the intersection with CMUDict, and by stripping all data but the placenames themselves, I'll be able to make a fairly small file of geographical names.) This greatly helps - though it still thinks that the first name "Zeppelin" is 267% more common than the actual word "zeppelin" (since "zeppelin" is tagged as a proper noun 24 times in the dataset and as a typical noun only 9 times). There's no way to address that short of using a better corpus, which I'm going to continue looking for.
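A sketch of that filtering step, assuming the SSA national baby-name files (yob1880.txt through the latest year, one "Name,Sex,Count" record per line) have been unpacked into a names/ directory; the helper function name is hypothetical:

    import glob

    # All first names the SSA has ever registered, lowercased.
    first_names = set()
    for path in glob.glob("names/yob*.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                first_names.add(line.split(",")[0].lower())

    # All words that actually appear in CMUDict, lowercased.
    cmu_words = set()
    with open("cmudict-0.7b", encoding="latin-1") as f:
        for line in f:
            if line.strip() and not line.startswith(";;;"):
                cmu_words.add(line.split(None, 1)[0].split("(")[0].lower())

    def should_capitalize(word, tagged_proper_noun):
        # Capitalize only if the corpus tags it as a proper noun AND it is a
        # known first name AND it has a CMUDict pronunciation.
        return tagged_proper_noun and word in first_names and word in cmu_words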

My first draft of a case-sensitive dictionary with pronunciation information and word frequency information is attached. Anyone who wants to play around with it, or who wants to see any of the source files I'm using, let me know.
(Attachment: newDict)

Steven Tammen

Jun 2, 2016, 11:35:41 AM
to plove...@googlegroups.com
This is great stuff guys, and exactly the sort of thing I thought might come up once you got under the hood, so to speak. My training in linguistics has been limited to several hours of casual reading on Google and my knowledge of steno is about equivalent (no NKRO keyboard + no SOFT/HRUF yet = lack of steno skills). If I say something really silly... that's probably why.

I had initially had the idea of rebuilding steno from the ground up in mind, but decided that I'm not the one to do this (though it would make a great thesis topic for someone in a relevant field). However, I would be most supportive of such an effort, and in fact I think it should be considered somewhat of a priority compared to many other features. All the cool stuff that Zack and Jennifer have been doing with Kinglet and Jackdaw need not be limited to orthographic input systems.

On the other hand, I do think there is value in making the system easily accessible for people still in more traditional forms of steno. On a practical level, we're going to have to convince people that all this complicated stuff is worth doing, which means it has to be usable by them as well. There are plenty of advantages to having a dictionary that is algorithmically generated rather than hand-crafted, with some obvious ones being that it's much easier to tweak, and could be easily regenerated if something happens to the main one.

-----------------------------

Let me see if I can get a handle on some of the issues in play:

1) Syllabification 

Even though things like the maximum onset principle exist, there is no great consensus on how words are split. Furthermore, any abstract pontification about syllabification ignores the reality that different people will split syllables in a way that makes sense to them (even if it's not "canonical"), and therefore there is no one-size-fits-all answer. We will have to account for as many different syllabifications as reasonably possible, just like misstrokes.

2) Morphemes and Suffixes

To get related words connected (verb conjugations, for example: tag, tagged, tagging), we will have to figure out a way to 1) parse this data out of CMUDict, and 2) use it somehow in the resulting dictionary. This is further complicated by the fact that semantic matching will have to occur using syllabification that results in normal suffixes such as -ing, -ation, and so forth, while ignoring words like sing and nation.

3) Homophones

Hand correction is out of the question, and it would be inconsistent anyhow. Disambiguating the conflicts should rely on frequency data and be done consistently if possible.

4) Capitalization

Not present in CMUDict initially. Will probably be easiest to add using word-lists of names, places, etc. (proper nouns). Dealing with some words that exist in both capitalized and uncapitalized forms (as above: Zeppelin as in "Led Zeppelin" vs. zeppelin as in the Hindenburg) will present a challenge.

-----------------------------

@Martin, your dictionary looks good at first glance, but not all of the "word bases" (which is what the second column is, I take it) look right. For example:

absolute absolute
absolutely absolutely
absoluteness absoluteness
absolutes absolute
absolution absolution
absolutism absolutism
absolutist absolutist

Unless I'm misunderstanding the purpose of that column, more of these words than just "absolute" and "absolutes" are related.

What do you think the next step is?


Steven Tammen

Jun 2, 2016, 11:52:03 AM
to Plover
Oh, one other thing to think about:

As we are going through this process (if we are going through this process?), something to think about is documenting things more than usual. If we can figure out the issues in English, we can generalize our solutions to other languages and different phonetic transcriptions (at least somewhat/mostly). Stenography is still limited in its support for many languages, and if we get a system in place that can generate a framework based on frequency statistics, ease of combinations, etc., we might enable others to build efficient stenographic frameworks without ever giving an inefficient framework a chance to take hold.

Martin Sherman-Marks

Jun 2, 2016, 12:24:41 PM
to Plover
Steven, my next step is to continue trying to refine this wordlist. I don't think I'm going to find a significantly better corpus without shelling out for the CoCAE wordlist, which I'm certainly not ready to do, so I'm going to try to take whatever steps I can to improve this one (without, you know, going through it all by hand or anything). I may ultimately not be able to match word frequency to proper nouns. And that might be okay! I can certainly generate a pretty decent list of proper nouns which occur in CMUDict - I've got first names already, will soon have geo names, and just need last names to round out the set - and we can just treat proper nouns as if they were middle-of-the-pack words. We're not going to find a corpus that's actually designed for the crazy-ass purpose we're putting it to, so we'll have to avoid making the perfect the enemy of the good.

The second column in the dictionary (which comes straight from ANC data) is actually the lemma, which is to say (roughly speaking) the word form that you look up in the dictionary. So "is" has the lemma "be", but "absolutism" doesn't get the lemma "absolute" even though it's clearly related to "absolute". The "-ism", "-ly", "-ness", etc. suffixes are what we call derivational morphemes: they change the meaning or part of speech of a word, so the word gets a new lemma. The "-s" and "-ed" suffixes are inflectional morphemes; they modify a noun's number or a verb's tense, but they don't change the word in any more fundamental sense, so the lemma remains the same. The lemmas may or may not turn out to be helpful for us in the long run; I kept them in my dictionary because I figured they wouldn't hurt, basically.

Apart from generating the wordlist, we need a general syllabification algorithm. Even if we're going to "spread out" strokes to account for alternate syllabifications, we'll need a starting point. I think the best rule I've seen is that consonants on a syllable border go with the more heavily stressed vowel. ARPAbet stress values go, somewhat confusingly, 1 (most stress), 2, 0 (least stress). (ARPAbet defines a few syllabic nasals and liquids too (EM, EN, ENG, EL) but it looks like CMUDict doesn't use those, so we can just look at numbered vowels.) This gets us a lot of the way there, but there are still issues: consider "abandoning" [AH0 B AE1 N D AH0 N IH0 NG] - the algorithm won't know where to put the <n>, but we want to make sure it winds up in the penultimate syllable so we can recognize the "ing" suffix on the end. This is actually a case where the lemma column may be useful. (Note that this isn't the final syllabification algorithm for mapping pronunciation to steno: that will have to factor in a whole mess of other problems, which Zach alluded to above.)
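A toy version of that "consonants go to the more heavily stressed vowel" rule over ARPAbet phone lists, as a starting point only: whole intervening clusters are assigned to one side or the other, ties go to the left vowel, and real phonotactic constraints are ignored.

    STRESS_RANK = {"1": 3, "2": 2, "0": 1}   # primary > secondary > unstressed

    def is_vowel(phone):
        return phone[-1] in STRESS_RANK      # CMUDict vowels end in a stress digit

    def stress(phone):
        return STRESS_RANK[phone[-1]]

    def syllabify(phones):
        vowels = [i for i, p in enumerate(phones) if is_vowel(p)]
        if not vowels:
            return [phones]
        sylls = [[] for _ in vowels]
        sylls[0].extend(phones[:vowels[0]])          # word-initial onset
        for k, vi in enumerate(vowels):
            sylls[k].append(phones[vi])
            nxt = vowels[k + 1] if k + 1 < len(vowels) else None
            cluster = phones[vi + 1:nxt]
            if nxt is None:
                sylls[k].extend(cluster)             # word-final coda
            elif stress(phones[vi]) >= stress(phones[nxt]):
                sylls[k].extend(cluster)             # more-stressed vowel wins; ties go left
            else:
                sylls[k + 1].extend(cluster)
        return sylls

    print(syllabify("AH0 B AE1 N D AH0 N IH0 NG".split()))
    # With ties going left this prints
    # [['AH0'], ['B', 'AE1', 'N', 'D'], ['AH0', 'N'], ['IH0', 'NG']],
    # i.e. the "abandon/ing" boundary discussed above.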

With that algorithm and my wordlist, we can get a full list of syllable onsets, nuclei, and codas. I keep coming back to that, but it's the heart of English steno. The layout of the left side of the keyboard starts to make total sense when you think about all the English syllables that start with /str/ or /spr/ - as well as all the English syllables that don't start with /pw/. (A few foreign words like "pueblo" and "Poitier" do, but for the most part it's very safe to map PW to /b/.) If we intend to reinvent the wheel, that's the kind of data we'll need.
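Building on the syllabify() sketch above, tallying onsets (and, symmetrically, codas) across a syllabified wordlist is then just a counting exercise; `pronunciations` is assumed to be a list of ARPAbet phone lists read from the merged wordlist:

    from collections import Counter

    onset_counts = Counter()
    coda_counts = Counter()
    for phones in pronunciations:                 # assumed input, see above
        for syllable in syllabify(phones):
            vowel_positions = [i for i, p in enumerate(syllable) if is_vowel(p)]
            if not vowel_positions:
                continue
            onset_counts[tuple(syllable[:vowel_positions[0]])] += 1
            coda_counts[tuple(syllable[vowel_positions[-1] + 1:])] += 1

    print(onset_counts.most_common(20))
    print(coda_counts.most_common(20))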

There are other fronts we can be attacking this on, like putting together our list of prefixes and suffixes, but the above are what I'm seeing as the most critical issues.

In regard to internationalization of this system... well, keep in mind that English has much better corpora than most languages. True, you might not need a CMUDict for, say, Spanish (because the writing system is so phonetic) but you'll still need word frequency data at the very least, as well as a thorough list of proper nouns.

Martin Sherman-Marks

Jun 2, 2016, 12:31:28 PM
to Plover
Oh, and I'd be wary of using the maximal onset principle for syllabification. Phonologically, it may be fairly accurate, but it plays fast and loose with morphology. Maximal onset says that "dancing" is syllabified dan/cing, but we want it to obey the morphological boundary: danc/ing. The "hungry stressed syllable" idea would give us the correct answer (and just about always will, since "-ing" is never stressed - the only problem arises where the preceding syllable is also not stressed, like "abandoning".)

Martin Sherman-Marks

Jun 2, 2016, 12:33:40 PM
to Plover
(Also, I have a linguistics degree and tend to use jargon without explaining it. Please feel free to ask if there's anything you don't understand or need me to define.)

Steven Tammen

Jun 2, 2016, 1:10:16 PM
to Plover
You're fine haha. I think I actually got most all of it.

I'll try to chip in where I can, but I think I'm going to get eclipsed here, having a background in neither CS nor linguistics, nor, practically speaking, stenography itself.

> we'll have to avoid making the perfect the enemy of the good.

Now this is the problem I have. https://xkcd.com/1445/ 

Jennifer Brien

Jun 2, 2016, 3:11:58 PM
to Plover


On Thursday, 2 June 2016 17:31:28 UTC+1, Martin Sherman-Marks wrote:
Oh, and I'd be wary of using the maximal onset principle for syllabification. Phonologically, it may be fairly accurate, but it plays fast and loose with morphology. Maximal onset says that "dancing" is syllabified dan/cing, but we want it to obey the morphological boundary: danc/ing. The "hungry stressed syllable" idea would give us the correct answer (and just about always will, since "-ing" is never stressed - the only problem arises where the preceding syllable is also not stressed, like "abandoning".)

The nice thing about orthography is that you can split your words anywhere you like, because there is no dictionary. I'm inclined, where possible, to have an extra key for each of the main suffixes, so they can be folded into the main stroke whenever you get the chance. Also, even with a dictionary, it would be good if there were a way to mark (as with Velotype's No Space) whether a stroke is a complete word, a prefix or a suffix. That would mean you could find a multi-stroke word in the dictionary no matter how it was split, and it means that Abandonment Al works out just fine.

Discounting homophones (granted, that's a big discounting!), a system based on CMUdict would be rather like an orthographic system for English with Simplified Spelling. It might be a bit faster than one for Standard spelling - provided your own pronunciation is sufficiently Standard. I don't think it would be that great for sight-transcribing unfamiliar words. 

I don't do real-time audio transcription and I probably never shall (I have done quite a bit of tape transcribing), so I don't know what is ideal for that purpose, but I'm very wary of the idea of a Big Comprehensive Dictionary. ISTM that once a corpus of words exceeds a few hundred, it becomes quite obvious where its bias lies. I want to be able to write any word (even ones that I have invented) without having to spell it out letter-by-letter, and if it's a long word that I'm likely to need again, I want to be able to quickly make a brief for it. If it's something that only comes up once in several thousand words, why lose sleep over spending an extra stroke?

To make this efficient I need to be able to stroke the most common onsets and codas as they are spelled, in the most straightforward manner. I'm not interested in word frequencies or even syllable frequencies, but I am interested in the frequencies of consonant sequences. If such a sequence precedes a vowel, it's an onset to be keyed by the left hand; if it follows a vowel, it's a coda; and if it has a vowel at each end, it's a coda followed by an onset and you can divide it by the maximum onset principle. It would also be useful to record the adjacent vowels. Jackdaw's leading A and trailing E and Y/I seem to save a lot of strokes, but I wonder how it compares with giving more space to consonant combinations?

The basic principle is: use the easiest keys for the most common sequences, whether they be consonants or phrases. If they are natural prefixes, arrange if possible for them to be stroked solely by the left hand (or by the right if they are natural suffixes) so that more can be included in the same stroke. I think this principle is also widely used in Magnum Steno, but allowing the output of different parts of the keyboard to be combined, as I outlined here - https://groups.google.com/d/msg/ploversteno/mo7OF0D6UM0/s4YZItf0EwAJ - avoids dictionary inflation.

Steven Tammen

Jun 2, 2016, 5:24:14 PM
to Plover
Well, it looks like we're on our own. Ted thought it was a decent idea, but neither he nor Mirabai was completely sold. Something along the lines of: an awful lot of upfront work for questionable payoff.

I still think it would be a great thing to have eventually.

Zack Brown

Jun 2, 2016, 7:00:34 PM
to ploversteno
Heh, I could've told you Mirabai would disagree. We had many
discussions about that while I was working on Learn Plover. She
represents the position that Ward Ireland really knocked it out of the
park, and any possible improvement will be minimal at best. She could
be right. But if someone really did come up with an improved
dictionary, I'm sure she'd acknowledge it. She just has a lot of faith
in Plover's dictionary, for good reason - it's her own personal
system, that she developed over years.

The thing about the Ward Ireland keyboard layout is this: to improve
upon it, you need to find a layout that can produce a wider variety of
words in a single stroke, without using any briefs, than the Ireland
keyboard. On top of that, any briefs that are used for disambiguation
have to rely on a simpler set of general guidelines than Plover
(https://sites.google.com/site/ploverdoc/lesson-9-designing-briefs).
Also, any briefs that are *not* used for disambiguation but instead
are simply for speed ("the", "of", "that", "as", etc), have to be at
least as easy to type as the Plover versions, because that will have a
strong aggregate effect on typing speed.

BTW, regarding a syllabification algorithm - I don't think it's as
important as other folks seem to. The reason is this: the new keyboard
layout will define a new "steno order". Its value will lie in its
ability to cram more words into that order than traditional steno
order does (otherwise the new system will not offer a speed
improvement over Plover). Since that's the case, syllabification
doesn't matter as much as the ability to cram a word into the new
steno order. Steno has never really been about syllables anyway - as
witnessed by the vowel-dropping rule. So, personally, I believe the
hunt for syllabification algorithms will be a time-wasting red
herring. I'd recommend focusing on identifying the most
all-encompassing steno order instead. Let stroke-breaks take care of
themselves.

Be well,
Zack

Theodore Morin

Jun 2, 2016, 8:19:59 PM
to Plover

I support you in the sense that I think it's worth trying/doing ☺️ just not something that I'd like to put effort into myself.

Plover will definitely be there to support you technically, including a different steno order and more keys if need be.

Zack Brown

Jun 2, 2016, 9:22:13 PM
to ploversteno
Excellent! So at least if anything does come out of this, it'll have a home in the software.

So, is anyone actually pursuing a new steno dictionary as a real project - or in particular, the software to construct a solid language-agnostic dictionary for anyone who has a phonetic dictionary file and frequency stats in a given language?


Be well,
Zack


Martin Sherman-Marks

Jun 2, 2016, 10:15:55 PM
to plove...@googlegroups.com

It's funny, I was saying to Mirabai about a week before this thread started that I didn't really think that any computer-generated dictionary could be as good as a human-built one. I'm still not at all convinced it can! I'm enjoying working on the problem but am fully prepared for it to be a fool's errand.

My gut says that we're unlikely to find any massive improvement over the Ward Ireland model. It's a good model! I have quibbles (in particular, I feel that the asterisk is overloaded, and that there must be a better solution for, e.g., final -th) but I don't think we're going to upend anything. His steno order makes a great deal of intuitive sense to me. I'd be fascinated to be proven wrong!

I will disagree with you, Zack, in that I think you need syllable information - in particular a list of onsets and codas with their frequency. Otherwise, what information would you even have to question steno order?

I did successfully create a first pass at a "hungry stressed vowel" algorithm. However, I'm not super happy with it, and may end up eating my words and going with maximal onset after all. Switching between the two is fairly easy. I'll update more on that tomorrow.


Steven Tammen

Jun 2, 2016, 10:38:47 PM6/2/16