Using CMUDict to programmatically generate translation dictionaries

Steven Tammen

May 29, 2016, 4:21:57 PM
to Plover
Over the last couple of weeks, I've been considering what theory I was going to pick up when my SOFT/HRUF comes. Then it occurred to me that I liked some things from different theories without necessarily liking everything about any one theory; that is to say, I realized that what I really wanted to do was steal certain bits from different theories and combine them all to make my own. The problem is, I don't think there's any good way to do this at this point in time.

I wrote up an idea that I had that might allow this sort of freedom, and I'd really like to hear people's feedback on it (particularly from a feasibility perspective on the backend). I'm no programmer, so I couldn't do anything like this without the support of the devs and the Plover community at large.

Questions to start:
  • How valuable do you think something like this would be in relation to other possible additions to Plover?
  • What would the demand for this be? How many other people want to make their own theories, or change things about existing ones?
  • (Targeted at experienced stenographers) If you could change things about how you currently write, what would they be? How could these ideas help contribute to a project like this where you get to handcraft your own theory from scratch?
  • (Targeted at Ted and Benoit) How hard would this be to implement? Would it take a long time and detract from other development goals?
I think it would be good to keep much of the discussion on the Google group for reasons of permanence, but I'll be on Discord too.

Thoughts?

Theodore Morin

May 29, 2016, 4:41:41 PM
to Plover

I think you'll find that no matter what theory you start with, you will customize it to your taste. Just do you.

Stanographer said that his dictionary has tripled in size since he started, and that he has changed base theories multiple times.

Plover's is a solid base, and you can really get into customizing briefs after 50 WPM, by which point you'll have developed your taste.


Steven Tammen

May 29, 2016, 4:50:05 PM
to plove...@googlegroups.com
I'm not really talking about briefs per se. Even Phoenix people use briefs. I'm talking about customizing the underpinnings, the "guts" of a theory, so to speak: things like which letter combinations you use to generate phonemes.

I was planning to start out with Plover anyhow, I'm just trying to think more long-term and ideal. Like Colemak/Workman/etc. vs QWERTY.

How would you go about customizing a theory at present?


Tony Wright

May 29, 2016, 5:09:39 PM
to plove...@googlegroups.com
I want to take just a snip from your write up:

"What I have in mind is a program that reads in the Carnegie Mellon University Pronouncing Dictionary and outputs a translation dictionary according to the preferences of an individual stenographer."

This is exactly what I have been dreaming of for a while. I'm very familiar with the CMU dictionary. It's a resource that stenography should be exploiting in many ways, and this is an important one. The ability to automatically generate a dictionary that would contain every reasonably frequent word in English, including proper names, would be huge.

I don't have the programming ability to do something like this on my own, but I'm a linguist, and I'd be glad to help develop the rules for phoneme-to-grapheme mappings that users could choose as options.

--Tony


Steven Tammen

May 29, 2016, 5:26:33 PM
to plove...@googlegroups.com
Exactly! To be quite honest, I did a good bit of Googling before I spent time on this because I couldn't believe that I was the first person to look at CMUDict and go "well, that'd be useful for stenography".

In terms of the mappings, I think it would be prudent to work on "reconstructing" common mappings (e.g., those of Plover's theory and Phoenix) before getting to more specialized options. Like I said in the piece, I think this is going to be the hardest part for spelling-dependent theories, because the graphemes will change based on context (i.e., the same sound can be stroked multiple ways depending on how the word is spelled).

What are some of the other steno-related things you were thinking about using CMUDict for? 




JustLisnin2

May 29, 2016, 9:16:47 PM
to Plover
Hi everyone,

I'm neither linguist nor programmer nor professional stenographer, but I do have some thoughts to add if they're of any value. I've always loved the idea of consistency in dictionary definitions, so this is definitely an interesting discussion. Some points, though:

1. Are most frequently used words briefs? My intuition is to brief common words and let word parts handle any longer/derivative words. It's always been the technical/medical terms that I've had difficulty defining entries for. Would it be constructive to compare the phonetic dictionary to Plover's main dictionary or any of the proprietary steno dictionaries and see just how many of the entries are in fact briefs? Or are you proposing this dictionary format so that learners can have consistent, non-brief forms as they're learning?

2. From my own personal steno experience, how comfortable a stroke is for my hands plays a far larger role than any consistency in definition. If the stroke isn't comfortable, then I just won't use it.

3. If you already have a well-defined vision of how this might translate, can you give a few examples of words in the Plover main dictionary that would change based on the phonetic dictionary you're suggesting? Just to clarify (thanks!).


Nat

Gavan Browne

May 29, 2016, 9:36:14 PM
to Plover
I think that's an awesome idea, but as a non-programmer and non-linguist I'm not sure I have much to add to the conversation. I did something similar-ish to programmatically generate text-expander abbreviations from a large word list (12dicts). I had a list of about 11,000 abbreviations gifted by another transcriber, but I decided I wanted more and now have about 60,000 or so. I generated stuff like sstnblte for sustainability, which isn't great, but a lot of what it churned out was good and usable as is. If I had more free time I might have attempted to delve a bit deeper and try to use phonetics to generate better results.
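
A toy Python version of that kind of consonant-skeleton rule (a reconstruction for illustration, not the actual script; its output differs slightly from the examples above):

# Toy rule: keep the first letter, then drop every later vowel.
def skeleton(word):
    return word[0] + "".join(c for c in word[1:] if c not in "aeiouy")

for w in ("sustainability", "contribute"):
    print(w, "->", skeleton(w))   # -> sstnblt, cntrbt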

As a QWERTY typist, giving precedence to higher-frequency words for the shorter abbreviations is important, as is generally keeping all abbreviations as short as possible. To that end I've started using any unused two-letter pairs, so uf = contribute, which is a lot better than kntrbt. That requires long-term memory, though, and has to be learned. That's a balance between logic/consistency and efficiency, I guess. As a full-time typist I'd trade consistency for efficiency every time, but a learner or novice would find that confusing and off-putting, I'd imagine. Is there currently a way to generate a list of unused strokes in the base Plover dictionary? Would you want to fill those up and sacrifice consistency? A good example is S-G for "something". It's perfect and easy to use and makes sense, but it's perhaps less consistent than something like STH-EUPBG.

I'd also say the shortest possible way to stroke a word isn't necessarily the easiest way. For example, I have trouble stroking words ending in -FPB, so I just stroke them -FRPB, and thankfully it works for the few I've encountered so far.

One other thing I'm wondering: can pronunciations be converted to steno strokes without modification? I'm thinking of "obliged", which would be stroked OB/BLIG/D (not real steno), but would that be represented in phonetics as O/BLIG/D?

I imagine what you're saying can be done, and probably done well, with a linguist and a programmer though. A further (and final) thought: instead of trying to mass-convert a huge quantity of words, maybe build an app that a user can input a word into; the software goes off and finds the corresponding phonetic pronunciation, generates a list of every possible way to stroke that word in steno (excluding any conflicts within a dictionary the user may specify), presents the list to the user, and lets them choose which brief they want to use. Something like that would suit me because let's say I could tick a box that says "include unused strokes" and get a really short stroke for a long word, and it would suit the person who prefers consistency because they could choose the one that makes most sense to them.

That sounds easy but I can imagine there might be a huge number of ways to stroke a given word. I lied about the final thought above. Let's imagine the algorithm determines the best way to stroke "something" is SEUG which is literally "sing" if you were to convert it back to English. That's not a problem because sing is defined as something else in the dictionary. The problem is confusion for the person who strokes SEUG expecting sing but who gets something, if that makes sense. I guess maybe a consideration is any programmatically generated steno stroke/brief should avoid this or be flagged in some way.

I'll stop rambling now.

JustLisnin2

May 29, 2016, 9:42:28 PM
to Plover
Looks like we had some similar ideas, Gavan. I forgot about word boundary errors, though. That's very important.

Nat

Steven Tammen

May 29, 2016, 10:15:22 PM
to Plover

Hi Nat,


Of course everyone's thoughts are valuable. I'm actually like you: neither a linguist nor a programmer nor a professional stenographer (nor a stenographer of any sort, really -- still need to get something to practice on). The more people we have participating in this discussion, the better!


1) You are correct that most common words are briefed. My idea in doing this is actually entirely separate from briefs, and I had attempted to make that clear. The thing I am interested in here is giving some thought to stenography without briefs -- "everything else", so to speak. Even adherents of theories like Magnum have to stroke stuff out fairly frequently, compounded all the more for new people who don't have thousands of briefs in muscle memory. So, the logic goes, shouldn't we try to optimize for this portion of stenography as much as we can as well?


My main motivation for having something like this is making everything that is not briefed as efficient as possible, in a way that lets people do something that makes sense to them instead of drilling someone else’s theory by rote — people make their own dictionaries instead of learning someone else’s. So it is in a way related to learning, but it’s also a matter of pure efficiency, letting people do what works for them. (And I know from first-hand experience that when something “doesn’t work for me,” I do far better building something for myself than trying to force someone else’s thought processes on myself.)


Being able to tweak theories easily is really impossible currently, AFAIK. Being able to generate different dictionaries to “test out” changes is another primary motivation behind this idea. I come from a background of custom-designing my own 6+ layer keyboard layout, so not being able to change stuff is a major downside to stenography in its current form, in my opinion.


2) I think this is in relation to briefs again. For briefs, everyone in fact must do what makes sense for them; otherwise they’ll never stick. What I’m talking about is just the equivalent of this for the rest of stenography — doing what’s comfortable for you instead of having to “learn” something someone else came up with.


3) Just comparing to Phoenix (which is a form of phonetic theory), we can look at a few non-briefed words:


Word: Neither
Plover’s theory: TPHAOE/THER or TPHAOEU/THER (among other definitions)
Phoenix: TPHAOEURGT

Word: Excesses
Plover’s theory: EBGS/SES/-S
Phoenix: KPES/-Z

Word: Metallurgy
Plover’s theory: PHET/A*L/AOURPBLG/SKWREU (among other definitions)
Phoenix: PHET/HRAERPBL


I don’t really have a fine-grained vision in mind yet because I wanted to see what other people thought first. Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound).

Keep the thoughts coming!

-Steven

Steven Tammen

May 29, 2016, 10:50:48 PM
to Plover

Hi Gavan,


You bring up some good points. There is always a tension between shortness (or “efficiency”) and consistency. This is actually the primary difference between phonetic theories like Phoenix and brief-heavy theories like Magnum: the former tries to be consistent and sacrifices short writing because of it, and the latter tries to be short but sacrifices consistent writing because of it.


I’m not convinced this has to be an either/or, however. You can have an efficient phonetic theory base for writing out uncommon/nasty words, and still brief like crazy. The two aren’t mutually exclusive. What this project would be focused on, however, is the former: getting that theory base for writing out words independent of briefs in a form that makes sense to individuals rather than trying to adopt someone else’s base for “consistency”. Your thoughts on briefs are spot on, but that’s a whole different subject.


To take your “something” example, what I had in mind here was a program that would take the individual sounds in the word (known as phonemes) and let an individual choose how to stroke them, either phonetically (as in Phoenix) or based on spelling (as in Plover’s theory). This would give users flexibility with regard to their non-briefed dictionary entries, which is actually the part that we don’t have control over right now. We can brief stuff out to our heart's content, but changing how you write normally — external to briefing — is a much different task.
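
To make that concrete, the core of such a generator might look something like this rough Python sketch. The three-entry phoneme-to-key table is a made-up fragment (a real one would cover all of ARPAbet and respect steno order), and the CMUDict file name is an assumption:

# Sketch only: read CMUDict and apply a user-chosen phoneme-to-key map.
PHONEME_TO_STENO = {"B": "PW", "AE": "A", "T": "-T"}   # toy fragment

def load_cmudict(path="cmudict.dict"):
    """Parse lines like 'bat B AE1 T' into {word: [phonemes]}."""
    entries = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):              # skip comment lines
                continue
            word, *phones = line.split()
            word = word.split("(")[0].lower()       # fold alternate pronunciations
            # strip stress digits: 'AE1' -> 'AE'
            entries.setdefault(word, [p.rstrip("012") for p in phones])
    return entries

def to_stroke(phonemes):
    """Naive phoneme-by-phoneme translation into steno keys."""
    return "".join(PHONEME_TO_STENO.get(p, "?") for p in phonemes)

entries = load_cmudict()
print(to_stroke(entries["bat"]))    # -> PWA-T (modulo real steno rules)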


—————————


On syllable division, this is a linguistics problem. One option for us would be to follow the maximum onset principle as it is classically defined. The onset is the beginning part of a syllable, and the coda is the ending part. Pretty much, if you always stick as many consonants as you can in the onset instead of the coda (so long as it is phonotactically allowed in your language), you won’t run into as many problems of syllabification.


Basically, if we followed this rule in how we split up syllables, we could stroke the words in the same way we split up the syllables, and we wouldn’t have this problem because it would be consistent. Perhaps someone more knowledgeable about linguistics than I could explain better.
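
As a rough illustration, maximum-onset splitting over a toy onset inventory might look like this in Python (a real version would need the full phonotactic inventory of English):

import re

# Toy inventory of legal onsets; real English phonotactics has far more.
LEGAL_ONSETS = {"", "b", "bl", "d", "dr", "k", "m", "n", "p", "pl",
                "pr", "s", "st", "str", "t"}

def syllabify(word):
    """Greedy maximum-onset split of a phonetically spelled word."""
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word)
    syllables, current = [], ""
    for i, run in enumerate(runs):
        if run[0] in "aeiou":
            current += run
        elif not current:
            current = run                  # word-initial consonants
        elif i == len(runs) - 1:
            current += run                 # word-final consonants become the coda
        else:
            # give the next syllable the longest legal onset available
            cut = 0
            while cut < len(run) and run[cut:] not in LEGAL_ONSETS:
                cut += 1
            current += run[:cut]
            syllables.append(current)
            current = run[cut:]
    if current:
        syllables.append(current)
    return syllables

print(syllabify("instrument"))   # -> ['in', 'stru', 'ment']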


—————————


I’m afraid I don’t follow your last bit on having the program spit out a “list” of possible ways to stroke something out. If we allow people to define their own strokes for phonemes, theoretically there are many different ways to stroke the same thing, but only one that follows any given person’s phonemic map. People would certainly be free to brief on top of the consistent definition for their personal theory, but I think it would be a mistake to make words only accessible by briefs except for a select few that are extremely common.


On the other hand, if what you’re suggesting is a program that suggests briefs for words based on what’s available, then I think that is another fantastic idea — but it is different from the one I am forwarding here.


Good stuff! Keep the ideas coming.


-Steven


JustLisnin2

May 29, 2016, 11:06:57 PM
to Plover
Hi Steven,

1. I see. That's what I thought. The goal of this effort would be to create an optimal, consistent dictionary for non-brief forms. I understood that you had wanted to make this entirely separate from briefs, but I was wondering how constructive it would be considering that, as a learner, most of the first words I picked were, in fact, brief forms. But it does make sense to optimize the rest of stenography, especially since there are still some words that may not even be defined in non-brief forms.

2. When I said "comfort", I meant physical comfort on my hands not so much my comfort with the theory. It sounds like you're looking at this solely from a theoretical/linguistic standpoint, and I just wanted to add the practical as well. But I understand that people can default to their own briefs/definitions as they need to for comfort.

3. Can you explain this part to me? "Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound)." I'm not sure I understand. Are you suggesting going as far as changing the definition of the "ch" sound, etc.? Are you referring to the treatment of vowels among the different steno theories? Or what phonemes do you feel the existing theories are restricting you to? Also, I think I misunderstood the "what" portion of this project as well. I thought that you meant creating a standard dictionary based on the Carnegie Mellon dictionary, but now it sounds like you're suggesting an entry generator, so to speak? If that's the case, and the Carnegie Mellon dictionary has consistent, phonetic rules, how will this result in multiple entries for learners to choose from?

4. I replied to Gavan earlier with a mention of word boundary errors. While having the freedom to choose your own strokes instead of learning someone else's theory by rote sounds liberating, one of the reasons why I, personally, was reluctant to add my own entries to the dictionary when I first started learning was because I had no intuition for which word boundary errors would wreak havoc on my writing. Aside from Learn Plover, there's really not a lot of free formal study material to go alongside Plover. It was trial and error. I made months of changes to the main dictionary before I realized that I should've been adding entries to my own personal dictionary. As I said, I'm no linguist, so if any linguists can chime in: is there a way to define a set of rules that could uncover all possible conflicts that could occur? How do you ensure that this new, customized theory is conflict-free, especially since you're targeting new learners?

My final thoughts on this for tonight :) Good night!
Nat

Steven Tammen

May 29, 2016, 11:49:37 PM
to Plover

Hi Nat,


1) I see what you’re saying. There would need to be some minimal set of briefs that are not a part of the normal dictionary generation (“the”, “and”, “he”, “or”, etc.). Of course these could be variable too, but then we get into the subjective issues of briefs discussed above. I hadn’t really thought about this too much (no doubt because I haven't really learned steno yet).


2) Wow I totally misread what you were saying, haha. Agreed. The difficulty of strokes from a physical standpoint should be taken into account as well (holding down 2 keys is easier than 6, for example).


3) I was thinking of pretty much opening up how phonemes are stroked entirely. Of course most everyone would probably leave the “ch” sound as it is… but what if someone didn’t want to, and wanted to move stuff around on the steno keyboard? Well now they’d have the option to. The differences in vowel sounds between theories were a primary motivator of this consideration, but another one I was thinking of was how things are stroked depending on how the word is spelled. Unless I’m totally misinformed, most theories might stroke the same sound different ways if it is made with different letters in English. Letting people choose these sounds was something else I had in mind.


The generated dictionary will be “standard” (i.e., consistent) according to the preferences that the user specifies — it could be totally different from a dictionary that someone else generates based on their preferences. What exactly do you mean by “entry generator”?


4) I had kinda mentioned this — albeit vaguely and not in a very good way — in my write-up:


“It [the generator] will automatically take out medial schwa, roll in suffixes, and create disambiguation briefs to the extent possible without creating conflicts. Problematic words will be displayed for either further programmatic processing (e.g., if a word ends in -ing without it being a suffix, do ____ to add on -ing), or hand-correction.”


From what I’ve read, totally conflict-free writing is a myth. This is always a game of compromise. I’m not qualified to comment on the specifics of this (in fact, you probably know more than me because you’ve actually been learning steno for a while), so it would be good if someone knowledgeable helped think of ways to deal with this problem. What I do know is that long words tend to have fewer word-boundary problems than short words, and that briefing very common short words can solve many of these problems.


-Steven

JustLisnin2

May 30, 2016, 10:26:07 AM
to Plover
Hi Steven,

The picture is becoming clearer now, thanks. So you're suggesting an entire dictionary that's generated based on preset preferences. I was thinking along the lines of a word-by-word generator based on the Carnegie Mellon dictionary, where the user would have the option to choose between multiple dictionary entry suggestions and pick the entry that fits their preference. So if you want an entry for a word, you would type that word into the generator and get a list of choices, then you could pick the one you want. This led me to think that an unwary user might mix and match among entries, thinking only in terms of individual words and not thinking about potential word boundary errors that could arise. You're right; an entirely conflict-free theory is a myth. But an entire dictionary makes much more sense to me in terms of conflicts. You're sort of generating your own theory that fits your personal writing style, and, as you said, it's a way to test out potential new theories and pick the most efficient one on an individual basis. That sounds so cool :)

Nat

Steven Tammen

May 30, 2016, 11:13:28 AM
to plove...@googlegroups.com
Yes, that's it. 

I think letting people choose from options like that isn't a terrible idea, but it would need some sort of conflict detection (at least ideally) for it to work well. It would serve a purpose of letting people "try out" different briefs for words without necessarily having to come up with them haphazardly every time they want to brief something. So long as it could filter based on individual people's dictionaries, people could see what strokes are "available" for them to keep their briefing conflict free. I believe this is what Gavan was thinking of above. To extend the idea even further, if we got the permission of established theories (Magnum, etc.), we could even have this hypothetical "brief generator" display the briefs these established theories use for words as well. 
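
Even the "what strokes are still available?" check seems tractable; here's a rough sketch, assuming Plover-style JSON dictionaries that map strokes to translations (the candidate strokes and file names below are made up):

import json

def available(candidates, dict_paths):
    """Return the candidate strokes not already defined in any dictionary."""
    taken = set()
    for path in dict_paths:
        with open(path) as f:
            taken.update(json.load(f).keys())
    return [s for s in candidates if s not in taken]

# Hypothetical candidates and dictionary files:
print(available(["S-G", "STH-G", "STH-EUPBG"], ["main.json", "user.json"]))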

-Steven


Steven Bhardwaj

May 31, 2016, 1:04:21 AM
to Plover
Hi All,

On the subject of Plover and Linguistics,

This thread makes me think of the idea of using a chorded keyboard system for inputting raw IPA transcriptions...
IPA keyboard for English: http://ipa.typeit.org/
IPA for English Audio key: http://www.antimoon.com/how/pronunc-soundsipa.htm
IPA keyboard with all symbols: http://ipa.typeit.org/full/

I expect it would be helpful, to retain any vestige of efficiency, to have a language-specific IPA theory. But the theory probably ought to be an orthographic theory like Mecatipia, Jackdaw, or Kinglet, making it (I suppose) easier to create. CMUdict would probably be helpful in designing a reasonably well-optimized theory, but it might be more flexible for this application to use an orthographic-style dictionary rather than a regular word-centric steno theory.

An English setup might be fun if I ever wanted to learn how to imitate different English accents! Although more serious uses would include transcribing endangered languages from DAT tapes, etc.

:)
Steven

Steven Tammen

May 31, 2016, 10:35:01 PM
to Plover
Hi Steven,

I had toyed with the idea of basing the generation itself off of IPA, but I couldn't seem to find a good English dictionary in IPA. CMUDict uses ARPAbet, which is English-specific and entirely ASCII-based (it requires no Unicode support, whereas IPA does).

We could use CMUDict to generate IPA transcriptions like you say, but I wanted to keep the goal focused for the first little bit (i.e., getting a working dictionary generator for "normal" English stenography). Over time, something else I thought we could do was take in IPA as a language-independent source, and get steno support for languages without developed theories yet (assuming we could find pronunciations for words in said language in IPA). This would help open stenography up to more people who wouldn't otherwise have access to it in their native language.

I think you may be right that orthographic input systems have the upper hand for full IPA. There are a lot more sounds than we use in English, for example, and it would be difficult to fit them all on a typical 22-key stenotype.

-Steven

Martin Sherman-Marks

Jun 1, 2016, 4:45:16 PM
to Plover
This conversation is relevant to my interests! CMUDict is a good starting point, but I see a few problems you'll need to solve along the way:
  • Syllabification: CMUDict doesn't define syllable boundaries, which are critical for any steno theory. (As I know very well from my own experience as a learner. Most of the time, when I can't figure out how to stroke a word, I eventually realize it's because the definition in Plover is based on a different syllabification than the one in my head.) And unfortunately, there is no particularly good rule for syllabification in English, especially when you're etymology-blind. The TeX hyphenation algorithm has been ported to Python, and that might be a good starting point - but note that "where to stick a hyphen" and "where to stick a syllable break" are related but different problems. (The TeX algorithm won't hyphenate "project", for example.) I'm not saying it's going to be impossible to syllabify the CMUDict algorithmically, but it'll present some interesting challenges.
  • Morphemes: CMUDict is morphology-blind; it has separate line entries for "ABANDON", "ABANDONS", "ABANDONING", "ABANDONED", and "ABANDONMENT", for example, with no way to know that those words are all connected. Before you start trying to run through CMUDict, you'll want a prefix/suffix dictionary, which will almost necessarily be non-phonetic. (For example, Plover uses "*PLT" for "-ment", not just because it's shorter but also so that "PHEPBT" is available for the first syllable of "mental". Otherwise, how would you talk about your friend Abandonment Al? Poor Abandonment Al. He's got some problems.) Oh, and going back to my earlier point, you'll have to make sure that your syllabification algorithm sees "ABANDON/ING" rather than "ABANDO/NING" - but still sees "SING", not "S/ING" - or you'll have a disaster on your hands.
  • Conflicts: I threw together a quick Python script (sketched after this list) to count how many homophones there are in the CMUDict. I found 13,015! (Admittedly, many of them, like "beet" and "beat", can probably be dealt with using the Plover theory's built-in disambiguation rules. I didn't account for that.) So conflict resolution definitely isn't a "figure it out manually" kind of problem, unless you intend to pore over a hell of a lot of dictionary entries. Unfortunately, conflict resolution relies in large part on something CMUDict won't tell you: word frequency. Which is more common: the word "accord" or the acronym "ACORD" (the Association for Cooperative Operations Research and Development, naturally)? You can answer that very quickly; CMUDict can't. And that means that you need some other test to tell you which should get the rule-following stroke A/KORD and which should get a rule-breaking stroke like A/KO*RD. (Or, in this case, which is so uncommon that maybe it should just be quietly ignored.) Plus, you need to teach your script how to craft a good rule-breaking stroke. It's easy enough to say "just throw an asterisk in there", but remember that Plover theory uses S* for initial /z/ and *T for final /th/, so your word may already have an asterisk in it. You can also change out the vowels, or repeat the stroke more than once to cycle through related options. (Repeating A*PB to switch between {^an}/Anne/Ann is one of the more remarkable versions of this in the Plover default dictionary.)
  • Capitalization: Another thing the CMUDict doesn't have: lowercase letters! The Plover dictionary has PHARBG for "mark" and PHA*RBG for "Mark"; that sort of thing is very common. If I hadn't looked up "ACORD" in the last example, I wouldn't have had any way to know it wasn't "acord" (not a word). Even a smart algorithm that was reading through the CMUDict would have surely given me a dictionary entry "A/KO*RD: acord" for a word that doesn't actually exist!
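
Roughly, that kind of homophone count boils down to grouping words by pronunciation. A sketch (not the exact script; the file name is an assumption, and the total will vary by CMUDict version):

# Group CMUDict words by pronunciation and count the homophones.
from collections import defaultdict

by_pron = defaultdict(set)
with open("cmudict.dict", encoding="latin-1") as f:
    for line in f:
        if line.startswith(";;;"):
            continue
        word, *phones = line.split()
        by_pron[" ".join(phones)].add(word.split("(")[0])

homophones = [ws for ws in by_pron.values() if len(ws) > 1]
print(sum(len(ws) for ws in homophones))   # on the order of the 13,015 above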

I think a common solution to many of these problems is to incorporate more than one wordlist. For example, a word frequency table would help with conflict resolution at the very least - though you'd need a big one from a good corpus. Step one would be to write some kind of script that turned the CMUDict into a more complete dictionary with a format like:

word    W ER1 D    245

That's the word in its normal capitalization, pronunciation, and then frequency rank. You'd still have morphology and syllabification problems to think about, but that would be a good step one.
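
A sketch of that step one (assuming a wordlist.txt with one word per line, most frequent first, and made-up file names):

# Join CMUDict pronunciations with a rank-ordered wordlist into
# 'word <tab> pronunciation <tab> frequency rank' lines.
pron = {}
with open("cmudict.dict", encoding="latin-1") as f:
    for line in f:
        if not line.startswith(";;;"):
            word, *phones = line.split()
            pron.setdefault(word.split("(")[0].lower(), " ".join(phones))

with open("wordlist.txt") as words, open("newdict.tsv", "w") as out:
    for rank, w in enumerate(words, start=1):
        w = w.strip().lower()
        if w in pron:
            out.write(f"{w}\t{pron[w]}\t{rank}\n")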

Zack Brown

Jun 1, 2016, 5:20:30 PM
to ploversteno
For syllabification, even if there's no good rule for it, there may be
a good rule to identify the range of possibilities. Any new dictionary
will probably want to have entries for as many possible
syllabifications of words as it can, to account for everyone's
personal tastes (similar to what Plover does now). Also bear in mind
that you will probably want to include things like dropping unstressed
vowels, and the inversion rule. This messes with syllabification a bit
as well. You'll probably need to come up with a whole new approach to
syllabification, based on making those assumptions.

Also, for any programmatic analysis of CMUDict, I'd recommend
prioritizing words based on frequency of use. Peter Norvig's word
frequency table is at http://norvig.com/google-books-common-words.txt

I'd also suggest programmatically coming up with a set of prefix and
suffix strokes, similar to what Plover has. The idea would be for no
word to end with the keys used in any prefix stroke, and for no word
to begin with the keys used in any suffix stroke, to avoid word
boundary errors.

Another thing to bear in mind is that in steno (although I don't use
this lingo in Learn Plover), the "theory" is generally considered to
be the particular approach to constructing briefs. The whole set of
standard and repeating rules governing consonant and vowel sounds, and
things like the inversion rule and so on, is not called 'theory'
because it's considered so fundamental that it's not even questioned -
all English language steno systems use those same basic chords and
rules, for the most part. At least that's my understanding.

But I would suggest changing that. Anyone coming up with a new
dictionary should truly start fresh: use CMUDict and the Norvig files,
and come up with an entirely new set of keys and chords for all the
different English sounds. I think if you do that, it may be possible
to improve on Ward Stone Ireland's original keyboard layout. At that
point, it might be possible to significantly reduce word conflicts,
and fit a far greater number of multi-syllable words into single
strokes.

Ward Ireland's keyboard was designed 100 years ago, with virtually no
statistical calculation to guide him. Additionally, it was designed to
be entirely syllabic. There were no briefs because there were no
lookup files. It was only in the 1980s that the proprietary steno
companies introduced dictionary files and briefs. Given that kind of
chaotic history, I think there's a very good chance that a much better
solution exists than the one that's come down to us. I think whoever
works on this is very likely to find a much cleaner, sharper system
than any of the steno systems currently in existence.

Be well,
Zack

Martin Sherman-Marks

Jun 2, 2016, 9:17:41 AM
to Plover
Zack, I was thinking about that very idea of "identifying the range of possibilities"; part of the challenge will be determining how much a particular stroke should be allowed to "spread" in the dictionary. Not just for syllabification, but for misstrokes too - the algorithm will have to think about how hard a particular stroke is, what the likely misstrokes are, and then will have to weigh how frequently the word is used against the space that the likely misstrokes will take up. A fairly complicated and nuanced process!

I've been trying to find a word frequency list that is case-aware, but with no luck so far. The American National Corpus - which, I'm pleased to note, contains among other things 3 million words from a Buffy the Vampire Slayer fan forum - has frequency data, which doesn't differentiate by case but does differentiate by part of speech, including proper nouns. (It also includes bare lemmas for plural nouns, which I suspect may be helpful down the line.) I'll attempt to pull it together into a case-aware word frequency list on the assumption that pretty much all proper nouns are capitalized. The next step after that will be combining it with the CMUDict to add in pronunciation, which should be fairly straightforward, I hope. (There is a larger, cleaner word list, from the 30x larger Corpus of Contemporary American English, but that costs $250, or $125 if we can claim academic use. If the ANC wordlist works, then it would be fairly trivial to modify the script to use the CoCAE data when I'm feeling wealthier.)

With regard to what Zack was saying about developing new ground-up principles of steno from this - I think he may well be right, and it's something I'm interested in exploring. Unfortunately we need to conquer syllabification first. Once we have that, we can develop a complete list of syllable onsets, nuclei, and codas in American English (and their frequency!) - that's the point where we can start rethinking the keyboard.

Martin Sherman-Marks

Jun 2, 2016, 11:10:51 AM
to Plover
Yikes. Okay. The ANC list has some issues. My assumption that anything flagged as a proper noun should be capitalized has run into the issue that they flagged a lot of words as proper nouns. The word "accent", for example, occurs 449 times in the corpus, and is flagged as a proper noun 23 of those times. Not super helpful. In total, it looks like about 40% of the words that occur more than twice in the sample are flagged as proper nouns at least once, which is... ugh. There are more proper nouns in the dataset than improper ones!

I was able to improve things by using an SSA dataset to generate a complete list of all 95k first names registered since 1880, then only capitalizing entries if they're flagged as proper nouns and are in that dataset and are in the CMUDict. (I'm downloading GNIS/GNS datasets now so I can add geographical names as well - they're huge datasets, naturally, but by limiting the list to the intersection with CMUDict, and by stripping all data but the placenames themselves, I'll be able to make a fairly small file of geographical names.) This greatly helps - though it still thinks that the first name "Zeppelin" is 267% more common than the actual word "zeppelin" (since "zeppelin" is tagged as a proper noun 24 times in the dataset and as a typical noun only 9 times). There's no way to address that short of using a better corpus, which I'm going to continue looking for.

My first draft of a case-sensitive dictionary with pronunciation information and word frequency information is attached. Anyone who wants to play around with it, or who wants to see any of the source files I'm using, let me know.
(Attachment: newDict)

Steven Tammen

Jun 2, 2016, 11:35:41 AM
to plove...@googlegroups.com
This is great stuff, guys, and exactly the sort of thing I thought might come up once you got under the hood, so to speak. My training in linguistics has been limited to several hours of casual reading on Google, and my knowledge of steno is about equivalent (no NKRO keyboard + no SOFT/HRUF yet = lack of steno skills). If I say something really silly... that's probably why.

I had initially had the idea of rebuilding steno from the ground up in mind, but decided that I'm not the one to do this (though it would make a great thesis topic for someone in a relevant field). However, I would be most supportive of such an effort, and in fact I think it should be considered somewhat of a priority compared to many other features. All the cool stuff that Zack and Jennifer have been doing with Kinglet and Jackdaw need not be limited to orthographic input systems.

On the other hand, I do think there is value in making the system easily accessible for people still in more traditional forms of steno. On a practical level, we're going to have to convince people that all this complicated stuff is worth doing, which means it has to be usable by them as well. There are plenty of advantages to having a dictionary that is algorithmically generated rather than hand-crafted, with some obvious ones being that it's much easier to tweak, and could be easily regenerated if something happens to the main one.

-----------------------------

Let me see if I can get a handle on some of the issues in play:

1) Syllabification 

Even though things like the maximum onset principle exist, there is no great consensus on how words are split. Furthermore, any abstract pontification about syllabification ignores the reality that different people will split syllables in a way that makes sense to them (even if it's not "canonical"), and therefore there is no one-size-fits-all answer. We will have to account for as many different syllabifications as reasonably possible, just like misstrokes.

2) Morphemes and Suffixes

To get related words connected (verb conjugations, for example: tag, tagged, tagging), we will have to figure out a way to 1) parse this data out of CMUDict, and 2) use it somehow in the resulting dictionary. This is further complicated by the fact that semantic matching will have to occur using syllabification that results in normal suffixes such as -ing, -ation, and so forth, while ignoring words like sing and nation.

3) Homophones

Hand correction is out of the question, and it would be inconsistent anyhow. Disambiguating the conflicts should rely on frequency data and be done consistently if possible.

4) Capitalization

Not present in CMUDict initially. Will probably be easiest to add using word lists of names, places, etc. (proper nouns). Dealing with words that exist in both capitalized and uncapitalized forms (as above: Zeppelin as in "Led Zeppelin" vs. zeppelin as in the Hindenburg) will present a challenge.

-----------------------------

@Martin, your dictionary looks good at first glance, but not all of the "word bases" (what the second column is, I take it) look right. For example:

absolute absolute
absolutely absolutely
absoluteness absoluteness
absolutes absolute
absolution absolution
absolutism absolutism
absolutist absolutist

Unless I'm misunderstanding the purpose of that column, more of these words than just "absolute" and "absolutes" are related.

What do you think the next step is?


Steven Tammen

Jun 2, 2016, 11:52:03 AM
to Plover
Oh, one other thing to think about:

As we are going through this process (if we are going through this process?), something to think about is documenting things more than usual. If we can figure out the issues in English, we can generalize our solutions to other languages and different phonetic transcriptions (at least somewhat/mostly). Stenography is still limited in its support for many languages, and if we get a system in place that can generate a framework based on frequency statistics, ease of combinations, etc., we might enable others to build efficient stenographic frameworks without ever giving an inefficient framework a chance to take hold.

Martin Sherman-Marks

Jun 2, 2016, 12:24:41 PM
to Plover
Steven, my next step is to continue trying to refine this wordlist. I don't think I'm going to find a significantly better corpus without shelling out for the CoCAE wordlist, which I'm certainly not ready to do, so I'm going to try to take whatever steps I can to improve this one (without, you know, going through it all by hand or anything). I may ultimately not be able to match word frequency to proper nouns. And that might be okay! I can certainly generate a pretty decent list of proper nouns which occur in CMUDict - I've got first names already, will soon have geo names, and just need last names to round out the set - and we can just treat proper nouns as if they were middle-of-the-pack words. We're not going to find a corpus that's actually designed for the crazy-ass purpose we're putting it to, so we'll have to avoid making the perfect the enemy of the good.

The second column in the dictionary (which comes straight from ANC data) is actually the lemma, which is to say (roughly speaking) the word form that you look up in the dictionary. So "is" has the lemma "be", but "absolutism" doesn't get the lemma "absolute" even though it's clearly related to "absolute". The "-ism", "-ly", "-ness", etc. suffixes are what we call derivational morphemes: they change the meaning or part of speech of a word, so the word gets a new lemma. The "-s" and "-ed" suffixes are inflectional morphemes; they modify a noun's number or a verb's tense, but they don't change the word in any more fundamental sense, so the lemma remains the same. The lemmas may or may not turn out to be helpful for us in the long run; I kept them in my dictionary because I figured they wouldn't hurt, basically.

Apart from generating the wordlist, we need a general syllabification algorithm. Even if we're going to "spread out" strokes to account for alternate syllabifications, we'll need a starting point. I think the best rule I've seen is that consonants on a syllable border go with the more heavily stressed vowel. ARPAbet stress values go, somewhat confusingly, 1 (most stress), 2, 0 (least stress). (ARPAbet defines a few syllabic nasals and liquids too (EM, EN, ENG, EL) but it looks like CMUDict doesn't use those, so we can just look at numbered vowels.) This gets us a lot of the way there, but there are still issues: consider "abandoning" [AH0 B AE1 N D AH0 N IH0 NG] - the algorithm won't know where to put the <n>, but we want to make sure it winds up in the penultimate syllable so we can recognize the "ing" suffix on the end. This is actually a case where the lemma column may be useful. (Note that this isn't the final syllabification algorithm for mapping pronunciation to steno: that will have to factor in a whole mess of other problems, which Zack alluded to above.)
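
A first pass at that rule in Python might look like the sketch below. It simplifies by moving whole consonant clusters to one side, with ties going left (which happens to give the right answer for "abandoning"):

# "Hungry stressed vowel" sketch: each consonant cluster between two
# vowels attaches to the more stressed side. ARPAbet vowels carry a
# stress digit (1 > 2 > 0); consonants don't.
STRESS_RANK = {"1": 2, "2": 1, "0": 0}

def stress(p):
    return STRESS_RANK[p[-1]] if p[-1].isdigit() else -1

def syllabify(phones):
    vowels = [i for i, p in enumerate(phones) if p[-1].isdigit()]
    cuts = []
    for a, b in zip(vowels, vowels[1:]):
        # the cluster phones[a+1:b] goes wholly to one side; ties go left
        cuts.append(b if stress(phones[a]) >= stress(phones[b]) else a + 1)
    out, prev = [], 0
    for c in cuts + [len(phones)]:
        out.append(phones[prev:c])
        prev = c
    return out

print(syllabify("AH0 B AE1 N D AH0 N IH0 NG".split()))
# -> [['AH0'], ['B', 'AE1', 'N', 'D'], ['AH0', 'N'], ['IH0', 'NG']]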

With that algorithm and my wordlist, we can get a full list of syllable onsets, nuclei, and codas. I keep coming back to that, but it's the heart of English steno. The layout of the left side of the keyboard starts to make total sense when you think about all the English syllables that start with /str/ or /spr/ - as well as all the English syllables that don't start with /pw/. (A few foreign words like "pueblo" and "Poitier" do, but for the most part it's very safe to map PW to /b/.) If we intend to reinvent the wheel, that's the kind of data we'll need.

There are other fronts we can be attacking this on, like putting together our list of prefixes and suffixes, but these are what I'm seeing as the most critical issues.

In regard to internationalization of this system... well, keep in mind that English has much better corpora than most languages. True, you might not need a CMUDict for, say, Spanish (because the writing system is so phonetic) but you'll still need word frequency data at the very least, as well as a thorough list of proper nouns.

Martin Sherman-Marks

Jun 2, 2016, 12:31:28 PM
to Plover
Oh, and I'd be wary of using the maximal onset principle for syllabification. Phonologically, it may be fairly accurate, but it plays fast and loose with morphology. Maximal onset says that "dancing" is syllabified dan/cing, but we want it to obey the morphological boundary: danc/ing. The "hungry stressed syllable" idea would give us the correct answer (and just about always will, since "-ing" is never stressed - the only problem arises where the preceding syllable is also not stressed, like "abandoning".)

Martin Sherman-Marks

Jun 2, 2016, 12:33:40 PM
to Plover
(Also, I have a linguistics degree and tend to use jargon without explaining it. Please feel free to ask if there's anything you don't understand or need me to define.)

Steven Tammen

Jun 2, 2016, 1:10:16 PM
to Plover
You're fine haha. I think I actually got most all of it.

I'll try to chip in where I can, but I think I'm going to get eclipsed here, having a background in neither CS nor linguistics nor, practically speaking, stenography itself.

> we'll have to avoid making the perfect the enemy of the good.

Now this is the problem I have. https://xkcd.com/1445/ 

Jennifer Brien

Jun 2, 2016, 3:11:58 PM
to Plover



The nice thing about orthography is that you can split your words anywhere you like, because there is no dictionary. I'm inclined, where possible, to have an extra key for each of the main suffixes, so they can be folded into the main stroke whenever you get the chance. Also, even with a dictionary, it would be good if there was a way to mark (as with Velotype's No Space) whether a stroke is a complete word, a prefix, or a suffix. That would mean you could find a multi-stroke word in the dictionary no matter how it was split, and it means that Abandonment Al works out just fine.

Discounting homophones (granted, that's a big discounting!), a system based on CMUdict would be rather like an orthographic system for English with Simplified Spelling. It might be a bit faster than one for Standard spelling - provided your own pronunciation is sufficiently Standard. I don't think it would be that great for sight-transcribing unfamiliar words. 

I don't do real-time audio transcription and I probably never shall (I have done quite a bit of tape transcribing), so I don't know what is ideal for that purpose, but I'm very wary of the idea of a Big Comprehensive Dictionary. ISTM that once a corpus of words exceeds a few hundred, it becomes quite obvious where its bias lies. I want to be able to write any word (even ones that I have invented) without having to spell it out letter by letter, and if it's a long word that I'm likely to need again, I want to be able to quickly make a brief for it. If it's something that only comes up once in several thousand words, why lose sleep over wasting a stroke?

To make this efficient I need to be able to stroke the most common onsets and codas as they are spelled, in the most straightforward manner. I'm not interested in word frequencies or even syllable frequencies, but I am interested in the frequencies of consonant sequences. If such a sequence precedes a vowel, it's an onset to be keyed by the left hand; if it follows a vowel, it's a coda; and if it has a vowel at each end, it's a coda followed by an onset, and you can divide it by the maximum onset principle. It would also be useful to record the adjacent vowels. Jackdaw's leading A and trailing E and Y/I seem to save a lot of strokes, but I wonder how it compares with giving more space to consonant combinations?

The basic principle is: use the easiest keys for the most common sequences, whether they be consonants or phrases. If they are natural prefixes, arrange if possible for them to be stroked solely by the left hand (or by the right if they are natural suffixes) so that more can be included in the same stroke. I think this principle is also widely used in Magnum Steno, but allowing the output of different parts of the keyboard to be combined, as I outlined here (https://groups.google.com/d/msg/ploversteno/mo7OF0D6UM0/s4YZItf0EwAJ), avoids dictionary inflation.

Steven Tammen

Jun 2, 2016, 5:24:14 PM
to Plover
Well, it looks like we're on our own. Ted thought it was a decent idea, but neither he nor Mirabai was completely sold. Something along the lines of a whole awful lot of upfront work for questionable payoff.

I still think it would be a great thing to have eventually.

Zack Brown

Jun 2, 2016, 7:00:34 PM
to ploversteno
Heh, I could've told you Mirabai would disagree. We had many
discussions about that while I was working on Learn Plover. She
represents the position that Ward Ireland really knocked it out of the
park, and any possible improvement will be minimal at best. She could
be right. But if someone really did come up with an improved
dictionary, I'm sure she'd acknowledge it. She just has a lot of faith
in Plover's dictionary, for good reason - it's her own personal
system that she developed over years.

The thing about the Ward Ireland keyboard layout is this: to improve
upon it, you need to find a layout that can produce a wider variety of
words in a single stroke, without using any briefs, than the Ireland
keyboard. On top of that, any briefs that are used for disambiguation
have to rely on a simpler set of general guidelines than Plover
(https://sites.google.com/site/ploverdoc/lesson-9-designing-briefs).
Also, any briefs that are *not* used for disambiguation but instead
are simply for speed ("the", "of", "that", "as", etc), have to be at
least as easy to type as the Plover versions, because that will have a
strong aggregate effect on typing speed.

BTW, regarding a syllabification algorithm - I don't think it's as
important as other folks seem to. The reason is this: the new keyboard
layout will define a new "steno order". Its value will lie in its
ability to cram more words into that order than traditional steno
order does (otherwise the new system will not offer a speed
improvement over Plover). Since that's the case, syllabification
doesn't matter as much as the ability to cram a word into the new
steno order. Steno has never really been about syllables anyway - as
witnessed by the vowel-dropping rule. So, personally, I believe the
hunt for syllabification algorithms will be a time-wasting red
herring. I'd recommend focusing on identifying the most
all-encompassing steno order instead. Let stroke-breaks take care of
themselves.

Be well,
Zack

Theodore Morin

Jun 2, 2016, 8:19:59 PM
to Plover

I support you in the sense that I think it's worth trying/doing ☺️, just not something that I'd like to put effort into myself.

Plover will definitely be there to support you technically, including a different steno order and more keys if need be.

Zack Brown

Jun 2, 2016, 9:22:13 PM
to ploversteno
Excellent! So at least if anything does come out of this, it'll have a home in the software.

So, is anyone actually pursuing a new steno dictionary as a real project - or in particular, the software to construct a solid language-agnostic dictionary for anyone who has a phonetic dictionary file and frequency stats in a given language?


Be well,
Zack


Martin Sherman-Marks

Jun 2, 2016, 10:15:55 PM
to plove...@googlegroups.com

It's funny, I was saying to Mirabai about a week before this thread started that I didn't really think that any computer-generated dictionary could be as good as a human-built one. I'm still not at all convinced it can! I'm enjoying working on the problem but am fully prepared for it to be a fool's errand.

My gut says that we're unlikely to find any massive improvement over the Ward Ireland model. It's a good model! I have quibbles (in particular, I feel that the asterisk is overloaded, and that there must be a better solution for, e.g., final -th) but I don't think we're going to upend anything. His steno order makes a great deal of intuitive sense to me. I'd be fascinated to be proven wrong!

I will disagree with you, Zack, in that I think you need syllable information - in particular a list of onsets and codas with their frequency. Otherwise, what information would you even have to question steno order?

I did successfully create a first pass at a "hungry stressed vowel" algorithm. However, I'm not super happy with it, and may end up eating my words and going with maximal onset after all. Switching between the two is fairly easy. I'll update more on that tomorrow.


Steven Tammen

Jun 2, 2016, 10:38:47 PM
to plove...@googlegroups.com
Regarding the asterisk, this was one of the things that really bothered me the first time I saw typical steno layouts. It's used for so much. One thing I am absolutely convinced about, regardless of how the rest of this endeavor turns out, is the addition of at least one more disambiguator to steno. The favored location, in my opinion, would be on one of the split S keys. Mirabai is of the opinion that having two S keys is useful because it makes stroking S- comfortable no matter what other keys you have to press on your left hand. I think this argument is valid, but not strong enough to keep two S keys instead of making one into another "asterisk-like" key.

This would help eliminate some of the more obtuse methods of handling conflicts (cycling by repeating, for example). It would also actually be in a better location than the asterisk itself, requiring no lateral finger movement (it's on the "home row", if you will). Ted said that it's already theoretically possible to do this with protocols that would support it, if you can update dictionaries accordingly.

What do you think of such a thing? It doesn't really nuke the foundation, but it has the potential to solve one of the bigger flaws in the current system.

Zack Brown

Jun 2, 2016, 10:40:54 PM
to ploversteno
Onsets and codas seem fairly easy to determine. Once you drop everything but the stressed vowel, the onset is whatever is before the vowel, and the coda is whatever is after it.

To calculate the frequency of a given coda seems easy too: first you take all the words that have that particular coda, then sum the frequencies associated with those words. That sum is the frequency of the coda.

So for example, if you have a two-word dictionary consisting of "have" and "five", the coda is the 'v' phoneme. If your text corpus has 10 occurrences of 'have' and 20 occurrences of 'five', the frequency of the 'v' coda is 10+20=30.
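Here's that calculation as a quick Python sketch (the inputs are assumed to already be parsed into dicts; the names are just illustrative):

# Sketch of the coda-frequency idea. Assumes two pre-built dicts:
#   pronunciations: word -> list of ARPAbet phonemes (e.g. from CMUDict)
#   word_freq:      word -> corpus frequency count
from collections import defaultdict

def coda_of(phonemes):
    # Everything after the last primary-stressed vowel (marker '1').
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i].endswith("1"):
            return tuple(phonemes[i + 1:])
    return tuple(phonemes)  # no stressed vowel; treat the whole word as coda

def coda_frequencies(pronunciations, word_freq):
    freqs = defaultdict(int)
    for word, phonemes in pronunciations.items():
        freqs[coda_of(phonemes)] += word_freq.get(word, 0)
    return dict(freqs)

# The two-word example from above:
prons = {"have": ["HH", "AE1", "V"], "five": ["F", "AY1", "V"]}
counts = {"have": 10, "five": 20}
print(coda_frequencies(prons, counts))  # -> {('V',): 30}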

Be well,
Zack


Zack Brown

unread,
Jun 2, 2016, 10:47:54 PM6/2/16
to ploversteno
Martin - about syllables, I don't see how any algorithm can work. Here's the result of some phonetic transcriptions, after we process out all the unstressed vowels:

IMPOUNDMENT   M P AW1 N D M  N T
IMPOUNDMENTS   M P AW1 N D M  N T S
IMPOUNDS   M P AW1 N D S
IMPOVERISHED   M P AA1 V R  SH T
IMPOVERISH   M P AA1 V R  SH
IMPOVERISHING   M P AA1 V R  SH  NG
IMPOVERISHMENT   M P AA1 V R  SH M  N T
IMPRACTICABLE   M P R AE1 K T  K  B  L
IMPRACTICAL   M P R AE1 K T  K  L
IMPRECISE  IH1 M P R  S  S
IMPREGNABLE   M P R EH1 G N  B  L
IMPREGNATED   M P R EH1 G N  T  D
IMPREGNATE   M P R EH1 G N  T
IMPREGNATES   M P R EH1 G N  T S
IMPREGNATING   M P R EH1 G N  T  NG
IMPREGNATION   M P R EH1 G N  SH  N
IMPRESARIO   M P R  S AA1 R  
IMPRESSED   M P R EH1 S T
IMPRESSES   M P R EH1 S  S
IMPRESS   M P R EH1 S
IMPRESSING   M P R EH1 S  NG
IMPRESSIONABLE   M P R EH1 SH  N  B  L
IMPRESSION   M P R EH1 SH  N
IMPRESSIONISM   M P R EH1 SH  N  Z  M
IMPRESSIONISTIC   M P R  SH  N IH1 S T  K
IMPRESSIONIST   M P R EH1 SH  N  S T

After dropping the unstressed vowels, there isn't really anything left to make syllables with. So for a word like 'impregnation', you just want to get a steno order that can handle 'mpr' at the start of a word and 'gnshn' at the end.

Am I missing something? I'm not trained in linguistics, but I know Plover drops all those unstressed vowels, so any new dictionary file should do the same if it wants to be speed-competitive.
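In code, the dropping step I'm using is just this (note that I'm keeping only primary-stressed vowels, marker '1'; whether the '2's should also survive is an open question):

# Sketch: strip unstressed vowels from a CMUDict pronunciation.
# CMUDict vowels end in a stress digit: 0, 1, or 2.
def drop_unstressed(phonemes):
    return [p for p in phonemes
            if not p[-1].isdigit() or p[-1] == "1"]

print(drop_unstressed(["IH2", "M", "P", "AW1", "N", "D", "M", "AH0", "N", "T"]))
# -> ['M', 'P', 'AW1', 'N', 'D', 'M', 'N', 'T'], matching IMPOUNDMENT above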

Be well,
Zack


Martin Sherman-Marks

unread,
Jun 2, 2016, 11:10:21 PM6/2/16
to plove...@googlegroups.com

Zack, you're getting a bit ahead of me; I'm still a long way from dropping unstressed vowels. The short answer is that "impregnation" might be a single stroke, if we can figure out a steno order that can go mprgn-a-shn. But since I doubt we'll ever find a steno order that can stroke "mprgn" with the left hand while also handling the other 79,999 words in the dictionary, it seems likely that "impregnation" will be a multi-stroke word regardless of keyboard layout.

My starting point is to find the lists of onsets and codas in natural syllables. Everything has to start with a theory that can handle natural syllables, after all! Once we've got some data there - which I think I can have tomorrow - then we can start seriously questioning steno order.

Zack Brown

unread,
Jun 3, 2016, 6:49:11 AM6/3/16
to ploversteno
I'm sure you'll find out some cool stuff. My first prediction: something other than 'S' belongs at the extreme left.

Martin Sherman-Marks

unread,
Jun 3, 2016, 8:01:05 AM6/3/16
to plove...@googlegroups.com

I'd bet money that the left-hand S won't be dethroned. It can precede just about any other letter in an onset (stop, scop, spot, swop, shop, slop...), is the only sound that can start a three-letter onset (strong), and is pretty much never preceded by another letter, except in foreign words like "tsar" and "psych" where the other letter isn't pronounced. No, I'm quite certain that Ward Ireland got that one right.

Martin Sherman-Marks

unread,
Jun 3, 2016, 10:38:45 AM6/3/16
to Plover
Attached is a dictionary featuring syllabification using the maximal onset principle. A sample:

impoundment impoundment IH2 M | P AW1 N D | M AH0 N T 3
impoverish impoverish IH2 M | P AA1 | V R IH0 SH 4
impoverished impoverish IH2 M | P AA1 | V R IH0 SH T 81
impoverishment impoverishment IH2 M | P AA1 | V R IH0 | SH M AH0 N T 7
impracticable impracticable IH2 M | P R AE1 K | T IH0 | K AH0 | B AH0 L 18
impractical impractical IH2 M | P R AE1 K | T AH0 | K AH0 L 86
imprecise imprecise IH1 M | P R AH0 | S AY2 S 30
impregnable impregnable IH2 M | P R EH1 G | N AH0 | B AH0 L 18
impregnate impregnate IH2 M | P R EH1 G | N EY2 T 3
impregnated impregnated IH2 M | P R EH1 G | N EY2 | T AH0 D 20
impregnating impregnate IH2 M | P R EH1 G | N EY2 | T IH0 NG 2
impregnation impregnation IH2 M | P R EH1 G | N EY1 | SH AH0 N 4
impresario impresario IH2 M | P R IH0 | S AA1 | R IY0 | OW2 22
impress impress IH2 M | P R EH1 S 122
impressed impress IH2 M | P R EH1 S T 472
impresses impress IH2 M | P R EH1 | S IH0 Z 19
impressing impress IH2 M | P R EH1 | S IH0 NG 9
impression impression IH2 M | P R EH1 | SH AH0 N 656
impressionable impressionable IH2 M | P R EH1 | SH AH0 | N AH0 | B AH0 L 23
impressionism impressionism IH2 M | P R EH1 | SH AH0 | N IH2 | Z AH0 M 15
impressionist impressionist IH2 M | P R EH1 | SH AH0 | N AH0 S T 43

On the whole, I find the results to be generally consistent with my instincts, which is slightly annoying since I really wanted to dislike the maximal onset principle, but probably a good thing overall.
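For the curious, the core of the syllabifier is roughly the following (a toy sketch: LEGAL_ONSETS here is a tiny stand-in, where the real list is harvested from the data itself):

# Toy maximal-onset syllabifier.
LEGAL_ONSETS = {(), ("M",), ("P",), ("R",), ("SH",), ("P", "R")}

def is_vowel(p):
    return p[-1].isdigit()  # CMUDict vowels carry a stress digit

def syllabify(phonemes):
    vowels = [i for i, p in enumerate(phonemes) if is_vowel(p)]
    if not vowels:
        return [phonemes]  # no vowel at all; leave as one chunk
    syllables, start = [], 0
    for v, nxt in zip(vowels, vowels[1:]):
        cluster = phonemes[v + 1:nxt]
        # Give the next syllable the longest cluster suffix that is a
        # legal onset; whatever is left stays as this syllable's coda.
        for k in range(len(cluster) + 1):
            if tuple(cluster[k:]) in LEGAL_ONSETS:
                break
        syllables.append(phonemes[start:v + 1 + k])
        start = v + 1 + k
    syllables.append(phonemes[start:])
    return syllables

print(syllabify("IH2 M P R EH1 SH AH0 N".split()))
# -> [['IH2', 'M'], ['P', 'R', 'EH1'], ['SH', 'AH0', 'N']], as in the sample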

This is just a first pass, of course. I'm going to need to deal with things like the -tions; it'd be tricky to argue for any steno order that didn't allow the -tion suffix to be folded into the preceding stroke. And I need to put together a list of suffixes that get snipped off first. Ultimately, im/pre/sio/ni/sm should be more like im/presion^ism. But one thing at a time.

Also attached are three JSON dictionaries, one each for syllable onsets, syllable nuclei, and syllable codas. Each is ranked from lowest frequency to highest. There are 121 unique onsets, 15 unique nuclei (ignoring different stress levels), and 257 unique codas, for a total of... well, quite a lot of possible syllables. Thanks, English! You can immediately see that it's much more important to be able to handle the onset "S T" (237,658 uses of that onset in the ANC) than it is to handle "JH F" (one use, in the word "jfet", which is... uh...)
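(If anyone wants to regenerate these files, the counting is nothing fancy; it's roughly this, with toy stand-ins where the real inputs are the syllabified dictionary and the ANC counts:)

import json
from collections import Counter

def is_vowel(p):
    return p[-1].isdigit()

def onset_of(syllable):
    # The onset is every phoneme before the syllable's vowel.
    onset = []
    for p in syllable:
        if is_vowel(p):
            break
        onset.append(p)
    return " ".join(onset)

# Toy stand-ins for the real syllabified dictionary and frequency counts:
syllabified = {"impression": [["IH2", "M"], ["P", "R", "EH1"], ["SH", "AH0", "N"]]}
word_freq = {"impression": 656}

onset_counts = Counter()
for word, syllables in syllabified.items():
    for syl in syllables:
        onset_counts[onset_of(syl)] += word_freq.get(word, 0)

# Ranked lowest frequency to highest, like the attached onsetList.json:
with open("onsetList.json", "w") as f:
    json.dump(sorted(onset_counts.items(), key=lambda kv: kv[1]), f, indent=2)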

Looking over the onset file, I definitely should have told Zack to pony up. By my count, the ANC contains well over a million uses of a syllable where S comes before another letter (I'm including SH here; it's a separate phoneme, but I strongly doubt we'll decide to put a separate ʃ key on the board). There are 171 uses of a syllable starting with "T S" (like "tsetse"), all of them foreign words, and 3 uses of a syllable starting with "L K S", thanks to the very useful word "lxi" (which, even if it were a word, would be pronounced /lək.si/ or at best /l̩k.si/ by an English speaker, and I'll fight everyone at Carnegie Mellon if I have to). Since S occurs before T or TH thousands of times more often than T before S, there's no argument to be made that I can see for knocking S off its pedestal.

To my eye, pretty much all the onsets used less than a couple hundred times are unintegrated foreign words that can be largely ignored - I might draw the dividing line at "SH M", since I think "shmear" is now generally pretty well-integrated into English, though it's still recognized as a foreign sound. (It's possible that I just hang out with a lot of New Yorkers.)

All the nuclei are, unsurprisingly, used a lot - /ɔɪ/ is used the least, /ə/ the most, but obviously we need to handle all of them. (That said, I think we might be able to come up with a way to stroke the vowels that involves fewer multi-key presses for common sounds; /ɪ/, for example, which is stroked EU, is three times as common as /a/, stroked A.)

The coda file is more complicated than the onsets, partly because I'm seeing a lot of errors in the CMUDict showing up here. ("NG K D" should be "NG K T", "N K" should be "NG K", etc.) But even the one-off "K S TH S" is a valid pronunciation of the actual English word "sixths". Note, too, that "TH" occurs in codas almost 75,000 times in the ANC corpus. Granted, it occurs in 230,000 onsets, but still, that's an awful lot of syllables that get an asterisk not for disambiguation but just for the final TH sound.

I'll leave this here for people to think about for now, and to run their own analyses on if they'd like.
autodidict_v0-1b
codaList.json
nucleusList.json
onsetList.json

Steven Tammen

unread,
Jun 3, 2016, 11:16:59 AM6/3/16
to plove...@googlegroups.com
Good job!

I think it's worth pointing out that we might be able to rearrange vowels and mess with -TH without necessarily changing everything else about steno order. Something this does, aside from being very interesting and possibly useful in the future, is give us a good feel for what might need to be improved in the Ireland layout itself.

Also (my previous post appears to have gotten buried), what do you think of having another "asterisk-like" key on one of the split S's, that could, among other things, make stroking -TH easier?


Zack Brown

unread,
Jun 3, 2016, 11:47:56 AM6/3/16
to ploversteno
My preliminary code doesn't have S on the left. Steno order for onsets seems to be:
K S R P N D T M B F G L V J TH H W Z SH NG CH

This is not definitive at all and I expect it to change. But it's based on the following:

1) Use CMUDict and Norvig's frequency file.

2) Remove the 1,000 most often used words from our calculations, on the assumption that we'll have briefs for them.

3) Make some basic phonetic conversions. For example, convert voiced 'th' to generic 'th', and voiced 'zh' to generic 'sh'. For 'z' at the end of words, convert to 's'.

4) Drop unstressed vowels, including 'y' in words that have stressed vowels.

5) Go through each word in the Norvig file. Split the onset into ordered phonemes. For each phoneme that occurs to the left of another, increment the frequency value of that phoneme by the frequency value of the word that contained it.

6) Sort all phonemes by the total frequency values obtained.

Doing the same for codas produces the second half of steno order:
N R T S K L M SH D P B V F Z G J NG CH W TH H
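In code, my reading of steps 5 and 6 is roughly the following (toy inputs; in the real run, 'onsets' and 'word_freq' come out of steps 1-4):

from collections import defaultdict

# Toy stand-ins: word -> ordered onset phonemes, word -> frequency.
onsets = {"strap": ["S", "T", "R"], "trap": ["T", "R"]}
word_freq = {"strap": 120, "trap": 300}

left_of = defaultdict(int)
for word, onset in onsets.items():
    # Step 5: every phoneme except the last occurs to the left of
    # another, so credit it with the containing word's frequency.
    for phoneme in onset[:-1]:
        left_of[phoneme] += word_freq[word]

# Step 6: sort phonemes by total credited frequency, highest first.
onset_order = sorted(left_of, key=left_of.get, reverse=True)
print(" ".join(onset_order))  # -> "T S" for this toy input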

Of course, this assumes one key per phoneme, which won't work. This steno order still has to be packed into a 22-key keyboard, with appropriate chord selections for all phonemes that don't have their own key.

Also I'm not satisfied that any of the above is correct. It's just where I'm at right now.

But Martin - the bet's on! I don't think S will end up on the left. So whichever one of us admits defeat buys the other dinner the next time we're in the same city! In my case, NYC. Deal? :-)

Be well,
Zack




Zack Brown

unread,
Jun 3, 2016, 11:50:57 AM6/3/16
to ploversteno
Steven, FWIW I agree - another asterisk-like key would come in handy. But I'd favor keeping both of them in the center of the keyboard. So the central asterisk would be split into two keys, and also the left S would be split into two keys. But the extra key on the left (I'd suggest) would be for a phoneme rather than a control key.


Martin Sherman-Marks

unread,
Jun 3, 2016, 12:11:25 PM6/3/16
to Plover
Zack, what syllables have a K before S? Are you syllabifying "extra" as "e/kstra"? Because most English speakers would say "ek/stra." (This is why we have /z/ylophones, not /ks/ylophones.)

Also, looking back at your earlier email: I think you were taking out waaay too many vowels in removing unstressed vowels. Unstressed means it has a 0; you were taking out some 2s as well.

Steven Tammen

unread,
Jun 3, 2016, 12:28:07 PM6/3/16
to plove...@googlegroups.com
It's an interesting thought. I guess I've been thinking of the problem with respect to Ireland's layout, and in that case, it would have to be a control key or one of the less common onset phonemes that are currently chorded (J-, Z-). In this circumstance I believe the control key would be more useful than having a dedicated Z- key, for example.

With respect to the layout we are designing, I think it would be foolish to limit ourselves to the exact same physical layout that Ireland steno follows. Here is what I view as an ideal physical layout (crude, but you get the idea; blue keys are "home position"):

[Inline image: proposed physical key layout]

The Infinity Ergonomic is a stenotype that supports this many keys (or at least most of them), as will the newer LightSpeed writer from Stenovations, if it ever gets released (see here). There's no reason to cripple the layout because of hardware tradition.

From my experiences designing keyboard layouts, there are a few factors we should keep in mind when placing phonemes. First off, for our normal fingers (not thumbs), vertical flexion and contraction are far more ergonomic than lateral or diagonal movement. Practically, this means that movement to the extension keys for index fingers and pinkies should be minimized to the extent possible. If we could roll all coda and onset phonemes into chords of the blue keys it would be ideal, but it's probably not entirely realistic (especially for the coda, which is why Ireland has -D and -Z).

The middle keys on each side could be set to the same thing, or we could make them all different to help disambiguate or perform other functions (on-the-fly brief recording, mode switching, etc.). The far-left pinky keys could be used in various ways as well, perhaps as "initial vowel" keys for wrapping in strokes (A-, O-), or as "prefix keys" that could serve as placeholders for things like inter-, contra-, and so forth (different depending on the definitions chosen by users).

What do you guys think?

Martin Sherman-Marks

unread,
Jun 3, 2016, 12:37:18 PM6/3/16
to Plover
Let's think about how a second disambiguator might work. We'll call Steven's suggestion + and put it on the upper left S key, at the start of the steno order. We'll call Zack's suggestion % and put it on the upper asterisk key.

For starters, let's take a set of super-fun words: mat, math, Matt, matte, meat, meet, mete, Meath (the county in Ireland), and, just for fun, let's throw in meeth (an archaic name for mead). (Plover strokes: PHAT, PHA*T, PHAT/PHAT, PHA*ET, PHAET, PHAOET, PHAO*ET; Meath and meeth are not included and it's not super clear to me how they could be assigned any rule-following stroke.) There's clearly something to be said there for more disambiguators, right?

Since the + would be on the far left, and therefore would be associated with onsets, it would be pretty weird to use +-T, keys at opposite ends of the keyboard, for -th. Instead, we'll still use -*T for -th and + will replace * as the standard disambiguator: PHAT, PHA*T, +PHAT, +PHAET, PHAET, PHAOET, +PHAOET, PHA*ET, PHAO*ET.

If we used % instead, we could use -%T for -th and continue using the asterisk for most disambiguation: PHAT, PHA%T, PHA*T, PHA*ET, PHAET, PHAOET, PHAO*ET, PHA%ET, PHAO%ET.
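As dictionary entries, that second scheme would look something like this (hypothetical, of course, since % isn't a key that exists today):

{
"PHAT": "mat",
"PHA%T": "math",
"PHA*T": "Matt",
"PHA*ET": "matte",
"PHAET": "meat",
"PHAOET": "meet",
"PHAO*ET": "mete",
"PHA%ET": "Meath",
"PHAO%ET": "meeth"
}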

Both can handle that word set fairly predictably, but from where I'm sitting, the second option works better - the centrally-located * is more convenient for disambiguation than a + way out in the left-hand boonies. The current Plover layout uses * for four coda combinations (-th, -mp, -lk, and -nk); % would probably pay for itself just with those.

I would hesitate to add more than one extra disambiguator, and I would strongly recommend that disambiguator have one very clear and well-defined purpose. Otherwise, the question of which disambiguator to use will itself become ambiguous. That's exactly the problem with the current overuse of asterisk. If you want to give yourself a nice headache, sit down for a while and really think about the strokes "north" (TPHORT) and "North" (TPHO*RT)... and then tell me what "east" should be.

Steven Tammen

unread,
Jun 3, 2016, 12:53:09 PM6/3/16
to plove...@googlegroups.com
Why exactly do we need to preserve steno order with additional disambiguators? Sure, it's philosophically pleasing to have the symbol next to the character you're modifying, but I don't think placing disambiguators elsewhere is functionally inferior. If we were talking about read-back, it would make a difference, because then we would have to be able to look at the outline and get the word. But we don't do that anymore. If you make Plover a black box, you stroke some combination, and out pops a word. Short of thinking "OK, I'm stroking this with a disambiguator because ____", I don't see how the location of the disambiguator has any bearing on the fact that you are using it. It's all muscle memory.

I'd also be curious to hear why you think more disambiguators are bad. I agree that there would need to be consistent use of them across dictionaries, but being able to use more of them for narrower purposes would decrease the ambiguity, not increase it. It would also give people more flexibility in defining their own briefs in a way that made sense to them, which was one of the things I had in mind with this whole project anyhow.

Martin Sherman-Marks

unread,
Jun 3, 2016, 1:12:43 PM6/3/16
to Plover
Steven, I would definitely find it pretty weird to be stroking part of my coda with my left hand. Regardless of how you write the steno order, typing +-T means that the final TH is being written with your left pinky and right pinky. Definitely weird.

It's not that more disambiguators are automatically bad, but more disambiguators that work like asterisk does would definitely be bad. Right now, Plover has some disambiguators with a defined function: AE rather than AOE is (basically) always used for the <ea> digraph, so it gives you "meat" instead of "meet". Very predictable. But at its heart the asterisk just means "give me a slightly different stroke from what I would otherwise expect". Maybe the word is capitalized, maybe it's a less common homonym, maybe it's a brief, who knows? If you have two unpredictable disambiguators, you've got to remember: did I define this brief with *, +, %, or some combination thereof? Too much memorization completely divorced from phonology and orthography.

(You know what + above the left S would actually be great for, though? Initial capitalization. If "north" was TPHO%RT and "North" was +TPHO%RT, that would be very handy, and it would be 100% predictable. I certainly wouldn't mind retiring the ol' KPA stroke.)

I won't go too deep down the rabbit hole of keyboard layout - it'd be nice if we could make any system we come up with work with the keys on a typical steno machine, since, you know, they already exist.

But I do think there are four keys that any steno keyboard that wants to replace a QWERTY keyboard needs: Num Lock, Shift Lock, Ctrl Lock, and Navigation Lock.

Num Lock would turn the keyboard into a typical number pad, e.g. digits 1-8 on the onset keys, 9-0 on the left vowel keys, and common mathematical symbols (maybe + - * / % . , : ' " ( )) on the coda keys and right vowels. (I know I couldn't function in my work using only the steno system for number entry!)

Shift Lock and Ctrl Lock would do exactly what they say on the tin. (I don't think you need separate buttons for Alt Lock or Win/Command/Super Lock, because those are used for one-off keystrokes like ALT-Tab and WIN-E, which you can just brief. I can't think of anything, other than ALT codes for diacritic entry, where you need to hold ALT or WIN down. Also, ALT-click and WIN-click aren't really a thing, whereas Shift-click and Ctrl-click are.)

Navigation Lock would turn the keyboard into navigation mode; right now you can use STPH-R for "left", for example, but if you're moving around to any serious extent, you want to be able to just hit R. (Or S, if you prefer the left hand.)
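For what it's worth, the navigation strokes I mentioned are just ordinary dictionary entries; in Plover's JSON dictionary format they look something like this (I'm quoting from memory, so treat the exact entries as approximate):

{
"STPH-R": "{#Left}",
"STPH-P": "{#Up}",
"STPH-B": "{#Down}",
"STPH-G": "{#Right}"
}

A Navigation Lock mode would essentially remap the whole board to entries like these, without the STPH- prefix on every stroke. Okay, end of tangent.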

Martin Sherman-Marks

unread,
Jun 3, 2016, 1:17:50 PM6/3/16
to Plover
Oh, and there's also the question of how easy it is to hit +S. I don't have a proper steno keyboard yet - still using QWERTY - but I know I find it much more difficult to stroke -TS and -DZ than I do, say, -PB. That would be substantially improved by a proper keyboard (it's on my list!) but the pinky is always going to be weaker than the index finger. Not a deal-breaker, but given the choice, another reason to prefer a key in the middle over a key under the pinky.

Mirabai Knight

unread,
Jun 3, 2016, 1:39:05 PM6/3/16
to ploversteno

Why not make NumLock, ScrollLock, etc. command strokes rather than dedicated keys?

Martin Sherman-Marks

unread,
Jun 3, 2016, 1:57:27 PM6/3/16
to Plover
...not a particularly good reason! I probably just have the classic QWERTY-user "I wish there were a key for that" syndrome. You'd also have to be able to come out of the Number and Navigation modes again to enter text normally, of course, but you could just define a different stroke to return to normal. I don't think anyone's ever going to intentionally need to hit .,:' or ←↓→End (each of which would map to -RBGS on the standard layout), so those would be perfectly good "return to normal mode" strokes.

Martin Sherman-Marks

unread,
Jun 3, 2016, 2:00:05 PM6/3/16
to Plover
In either case, software is a larger limiting factor than hardware - Plover would need a pretty significant enhancement to make the modes I'm suggesting possible. (One reason I'm enjoying hacking away at this CMUDict project is that it's an opportunity to brush up on my Python - I'd almost forgotten how pleasant a language it is! - so that I might eventually be ready to contribute to Plover. But that's a way off, so I'll just keep dreaming for now.)

JustLisnin2

unread,
Jun 3, 2016, 2:10:42 PM6/3/16
to Plover
I agree with Mirabai that those keys would be better off as commands than dedicated keys. Have you been keeping up with the changes to Plover, Martin? The newest dictionaries on Stenodict as well as the new commands and functions added by our new devs? If not, you might want to check out the new software changes and commands available. But I love, love, love the idea of retiring the KPA stroke and replacing it with an initial caps key. Someone make it so!!!

Best,
Nat