Using CMUDict to programmatically generate translation dictionaries


Steven Tammen

May 29, 2016, 4:21:57 PM
to Plover
Over the last couple of weeks I've been considering which theory I'm going to pick up when my SOFT/HRUF comes. Then it occurred to me that I liked some things from different theories without necessarily liking everything about any one of them; that is to say, I realized that what I really wanted to do was steal certain bits from different theories and combine them to make my own. The problem is, I don't think there's any good way to do this at this point in time.

I wrote up an idea that I had that might allow this sort of freedom, and I'd really like to hear people's feedback on it (particularly from a feasibility perspective on the backend). I'm no programmer, so I couldn't do anything like this without the support of the devs and the Plover community at large.

Questions to start:
  • How valuable do you think something like this would be in relation to other possible additions to Plover?
  • What would the demand for this be? How many other people want to make their own theories, or change things about existing ones?
  • (Targeted at experienced stenographers) If you could change things about how you currently write, what would they be? How could these ideas help contribute to a project like this where you get to handcraft your own theory from scratch?
  • (Targeted at Ted and Benoit) How hard would this be to implement? Would it take a long time and detract from other development goals?
I think it would be good to keep much of the discussion on the Google group for reasons of permanence, but I'll be on Discord too.

Thoughts?

Theodore Morin

May 29, 2016, 4:41:41 PM
to Plover

I think you'll find that no matter what theory you start with, you will customize it to your taste. Just do you.

Stanographer said that his dictionary has tripled in size since he started, and he has changed base theories multiple times.

Plover's is a solid base, and you can really get into customizing briefs after 50 WPM, once you've developed your own taste.


Steven Tammen

May 29, 2016, 4:50:05 PM
to plove...@googlegroups.com
I'm not really talking about briefs per se. Even Phoenix people use briefs. I'm talking about customizing the underpinnings, the "guts" of a theory, so to speak. Things like what key combinations you use to represent phonemes.

I was planning to start out with Plover anyhow, I'm just trying to think more long-term and ideal. Like Colemak/Workman/etc. vs QWERTY.

How would you go about customizing a theory at present?


Tony Wright

May 29, 2016, 5:09:39 PM
to plove...@googlegroups.com
I want to take just a snip from your write up:

"What I have in mind is a program that reads in the Carnegie Mellon University Pronouncing Dictionary and outputs a translation dictionary according to the preferences of an individual stenographer."

This is exactly what I have been dreaming of for a while. I'm very familiar with the CMU dictionary. It's a resource that stenography should be exploiting in many ways, and this is an important one. The ability to automatically generate a dictionary that would contain every reasonably frequent word in English, including proper names, would be huge.

I don't have the programming ability to do something like this on my own, but I'm a linguist, and I'd be glad to help develop the rules for phoneme-to-grapheme mappings that users could choose as options.

--Tony


Steven Tammen

May 29, 2016, 5:26:33 PM
to plove...@googlegroups.com
Exactly! To be quite honest, I did a good bit of Googling before I spent time on this because I couldn't believe that I was the first person to look at CMUDict and go "well, that'd be useful for stenography".

In terms of the mappings, I think it would be prudent to work on "reconstructing" common mappings (e.g., those of Plover's theory and Phoenix) before getting to more specialized options. Like I said in the piece, I think this is going to be the hardest part for spelling-dependent theories, because the graphemes will change based on context (i.e., the same sound can be stroked multiple ways depending on how the word is spelled).

What are some of the other steno-related things you were thinking about using CMUDict for? 




JustLisnin2

May 29, 2016, 9:16:47 PM
to Plover
Hi everyone,

I'm neither linguist nor programmer nor professional stenographer, but I do have some thoughts to add if they're of any value. I've always loved the idea of consistency in dictionary definitions, so this is definitely an interesting discussion. Some points, though:

1. Are most frequently used words briefs? My intuition is to brief common words and handle any longer/derivative words with word parts. It's always been the technical/medical terms that I've had difficulty defining entries for. Would it be constructive to compare the phonetic dictionary to Plover's main dictionary, or any of the proprietary steno dictionaries, and see just how many of the entries are in fact briefs? Or are you proposing this dictionary format so that learners can have consistent, non-brief forms as they're learning?

2. From my own personal steno experience, how comfortable a stroke is for my hands plays a far larger role than any consistency in definition. If the stroke isn't comfortable, I just won't use it.

3. If you already have a well-defined vision of how this might translate, can you give a few examples of words in the Plover main dictionary that would change based on the phonetic dictionary you're suggesting? Just to clarify (thanks)


Nat

Gavan Browne

May 29, 2016, 9:36:14 PM
to Plover
I think that's an awesome idea, but as a non-programmer and non-linguist I'm not sure I have much to add to the conversation. I did something similar-ish to programmatically generate text expander abbreviations from a large word list (12dicts). I had a list of about 11,000 abbreviations gifted from another transcriber, but I decided I wanted more and now have about 60,000 or so. It generated stuff like sstnblte for sustainability, which isn't great, but a lot of what it churned out was good and usable as is. If I had more free time I might have delved a bit deeper and tried to use phonetics to generate better results.

As a QWERTY typist, giving precedence to higher-frequency words for the shorter abbreviations is important, as is generally keeping all abbreviations as short as possible. To that end I've started using any unused two-letter pairs, so uf = contribute, which is a lot better than kntrbt. That requires long-term memory, though, and has to be learned. That's a balance between logic/consistency and efficiency, I guess. As a full-time typist I'd trade consistency for efficiency every time, but a learner or novice would find that confusing and off-putting, I'd imagine. Is there currently a way to generate a list of unused strokes in the base Plover dictionary? Would you want to fill those up and sacrifice consistency? A good example is S-G for something. It's perfect and easy to use and makes sense, but it's perhaps less consistent than something like STH-EUPBG.

I'd also say the shortest possible way to stroke a word isn't necessarily the easiest way. For example I have trouble stroking words ending -FPB so I just stroke them -FRPB and thankfully it works for the few I've encountered so far.

One other thing I'm wondering is whether pronunciations can be converted to steno strokes without modification. I'm thinking of "obliged", which would be stroked OB/BLIG/D (not real steno), but would that be represented in phonetics as O/BLIG/D?

I imagine what you're saying can be done, and probably done well, with a linguist and a programmer though. A further and final thought: instead of trying to mass convert a huge quantity of words, maybe an app that a user can input a word into; the software goes off and finds the corresponding phonetic pronunciation, generates a list of every possible way to stroke that word in steno (excluding any conflicts within a dictionary the user may specify), presents the list to the user, and allows them to choose which brief they want to use. Something like that would suit me, because let's say I could tick a box that says "include unused strokes" and get a really short stroke for a long word; and it would suit the person who prefers consistency, because they could choose the one that makes most sense to them.

That sounds easy but I can imagine there might be a huge number of ways to stroke a given word. I lied about the final thought above. Let's imagine the algorithm determines the best way to stroke "something" is SEUG which is literally "sing" if you were to convert it back to English. That's not a problem because sing is defined as something else in the dictionary. The problem is confusion for the person who strokes SEUG expecting sing but who gets something, if that makes sense. I guess maybe a consideration is any programmatically generated steno stroke/brief should avoid this or be flagged in some way.

I'll stop rambling now.

JustLisnin2

May 29, 2016, 9:42:28 PM
to Plover
Looks like we had some similar ideas, Gavan. I forgot about word boundary errors, though. That's very important.

Nat

Steven Tammen

May 29, 2016, 10:15:22 PM
to Plover

Hi Nat,


Of course everyone's thoughts are valuable. I'm actually like you: neither a linguist nor a programmer nor a professional stenographer (nor a stenographer of any sort, really -- still need to get something to practice on). The more people we have participating in this discussion, the better!


1) You are correct that most common words are briefed. My idea in doing this is actually entirely separate from briefs, and I had attempted to make that clear. The thing I am interested in here is giving some thought to stenography without briefs -- "everything else", so to speak. Even adherents of theories like Magnum have to stroke stuff out fairly frequently, and that's compounded all the more for new people who don't have thousands of briefs in muscle memory. So, the logic goes, shouldn't we try to optimize this portion of stenography as much as we can as well?


My main motivation for having something like this is to make everything that is not briefed as efficient as possible, in a way that lets people do something that makes sense to them instead of drilling someone else's theory by rote -- people make their own dictionaries instead of learning someone else's. So it is in a way related to learning, but it's also a matter of pure efficiency, letting people do what works for them. (And I know from firsthand experience that if something "doesn't work for me", I do far better building something for myself rather than trying to force someone else's thought processes on myself.)


Being able to tweak theories easily is really impossible currently, AFAIK. Being able to generate different dictionaries to “test out” changes is another primary motivation behind this idea. I come from a background of custom-designing my own 6+ layer keyboard layout, so not being able to change stuff is a major downside to stenography in its current form, in my opinion.


2) I think this is in relation to briefs again. For briefs, everyone in fact must do what makes sense for them, otherwise the briefs will never stick. What I'm talking about is just the equivalent of this for the rest of stenography -- doing what's comfortable for you instead of having to "learn" something someone else came up with.


3) Just comparing to Phoenix (which is a form of phonetic theory), we can look at a few non-briefed words:

Word: Neither
Plover's theory: TPHAOE/THER or TPHAOEU/THER (among other definitions, see here)
Phoenix: TPHAOEURGT

Word: Excesses
Plover's theory: EBGS/SES/-S
Phoenix: KPES/-Z

Word: Metallurgy
Plover's theory: PHET/A*L/AOURPBLG/SKWREU (among other definitions, see here)
Phoenix: PHET/HRAERPBL


I don’t really have a fine-grained vision in mind yet because I wanted to see what other people thought first. Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound).

Keep the thoughts coming!

-Steven

Steven Tammen

May 29, 2016, 10:50:48 PM
to Plover

Hi Gavan,


You bring up some good points. There is always a tension between shortness (or "efficiency") and consistency. This is actually the primary difference between phonetic theories like Phoenix and brief-heavy theories like Magnum: the former tries to be consistent and sacrifices short writing because of it, and the latter tries to be short and sacrifices consistent writing because of it.


I’m not convinced this has to be an either/or, however. You can have an efficient phonetic theory base for writing out uncommon/nasty words, and still brief like crazy. The two aren’t mutually exclusive. What this project would be focused on, however, is the former: getting that theory base for writing out words independent of briefs in a form that makes sense to individuals rather than trying to adopt someone else’s base for “consistency”. Your thoughts on briefs are spot on, but that’s a whole different subject.


To take your “something” example, what I had in mind here was a program that would take the individual sounds in the word (known as phonemes) and let an individual choose how to stroke them, either phonetically (as in Phoenix) or based on spelling (as in Plover’s theory). This would give users flexibility with regard to their non-briefed dictionary entries, which is actually the part that we don’t have control over right now. We can brief stuff out to our heart's content, but changing how you write normally — external to briefing — is a much different task.


—————————


On syllable division, this is a linguistics problem. One option for us would be to follow the maximum onset principle as it is classically defined. You can read about it here (less than you probably want) or here (more than you probably want). The onset is the beginning part of a syllable, and the coda is the ending part. Pretty much, if you always stick as many consonants as you can in the onset instead of the coda (so long as it is phonotactically allowed in your language), you won’t run into as many problems of syllabification.


Basically, if we followed this rule in how we split up syllables, we could stroke the words in the same way we split up the syllables, and we wouldn’t have this problem because it would be consistent. Perhaps someone more knowledgeable about linguistics than I could explain better.


—————————


I’m afraid I don’t follow your last bit on having the program spit out a “list” of possible ways to stroke something out. If we allow people to define their own strokes for phonemes, theoretically there are many different ways to stroke the same thing, but only one that follows any given person’s phonemic map. People would certainly be free to brief on top of the consistent definition for their personal theory, but I think it would be a mistake to make words only accessible by briefs except for a select few that are extremely common.


On the other hand, if what you’re suggesting is a program that suggests briefs for words based on what’s available, then I think that is another fantastic idea — but it is different from the one I am forwarding here.


Good stuff! Keep the ideas coming.


-Steven



JustLisnin2

May 29, 2016, 11:06:57 PM
to Plover
Hi Steven,

1. I see. That's what I thought. The goal of this effort would be to create an optimal, consistent dictionary for non-brief forms. I understood that you had wanted to make this entirely separate from briefs, but I was wondering how constructive it would be considering that, as a learner, most of the first words I picked were, in fact, brief forms. But it does make sense to optimize the rest of stenography, especially since there are still some words that may not even be defined in non-brief forms.

2. When I said "comfort", I meant physical comfort for my hands, not so much my comfort with the theory. It sounds like you're looking at this solely from a theoretical/linguistic standpoint, and I just wanted to add the practical side as well. But I understand that people can default to their own briefs/definitions as they need to for comfort.

3. Can you explain this part to me? "Ideally, we wouldn’t be limited to just choices between existing theories, but we could choose our own strokes for a particular phoneme (sound)." I'm not sure I understand. Are you suggesting going as far as changing the definition of the "ch" sound, etc.? Are you referring to the treatment of vowels among the different steno theories? Or what phonemes do you feel the existing theories are restricting you to? Also, I think I misunderstood the "what" portion of this project as well. I thought that you meant creating a standard dictionary based on the Carnegie Mellon dictionary, but now it sounds like you're suggesting an entry generator, so to speak? If that's the case, and the Carnegie Mellon dictionary has consistent, phonetic rules, how will this result in multiple entries for learners to choose from?

4. I replied to Gavan earlier with a mention of word boundary errors. While having the freedom to choose your own strokes instead of learning someone else's theory by rote sounds liberating, one of the reasons why I, personally, was reluctant to add my own entries to the dictionary when I first started learning was because I had no intuition for which word boundary errors would wreak havoc on my writing. Aside from Learn Plover, there's really not a lot of free formal study material to go alongside Plover. It was trial and error. I made months of changes to the main dictionary before I realized that I should've been adding entries to my own personal dictionary. As I said, I'm no linguist, so if any linguists can chime in: is there a way to define a set of rules that could uncover all possible conflicts that could occur? How do you ensure that this new, customized theory is conflict-free, especially since you're targeting new learners?

My final thoughts on this for tonight :) Good night!
Nat

Steven Tammen

May 29, 2016, 11:49:37 PM
to Plover

Hi Nat,


1) I see what you’re saying. There would need to be some minimal set of briefs that are not a part of the normal dictionary generation (“the”, “and”, “he”, “or”, etc.). Of course these could be variable too, but then we get into the subjective issues of briefs discussed above. I hadn’t really thought about this too much (no doubt because I haven't really learned steno yet).


2) Wow I totally misread what you were saying, haha. Agreed. The difficulty of strokes from a physical standpoint should be taken into account as well (holding down 2 keys is easier than 6, for example).


3) I was thinking of pretty much opening up how phonemes are stroked entirely. Of course most everyone would probably leave the “ch” sound as it is… but what if someone didn’t want to, and wanted to move stuff around on the steno keyboard? Well now they’d have the option to. The differences in vowel sounds between theories were a primary motivator of this consideration, but another one I was thinking of was how things are stroked depending on how the word is spelled. Unless I’m totally misinformed, most theories might stroke the same sound different ways if it is made with different letters in English. Letting people choose these sounds was something else I had in mind.


The generated dictionary will be “standard” (i.e., consistent) according to the preferences that the user specifies — it could be totally different from a dictionary that someone else generates based on their preferences. What exactly do you mean by “entry generator”?


4) I had kinda mentioned this — albeit vaguely and not in a very good way — in my write-up:


“It [the generator] will automatically take out medial schwa, roll in suffixes, and create disambiguation briefs to the extent possible without creating conflicts. Problematic words will be displayed for either further programmatic processing (e.g., if a word ends in -ing without it being a suffix, do ____ to add on -ing), or hand-correction.”


From what I’ve read, totally conflict free writing is a myth. This is always a game of compromise. I’m not qualified to comment on the specifics of this (in fact you probably know more than me because you’ve actually been learning steno for a while), so it would be good if someone knowledgeable helped think of ways to deal with this problem. What I do know is that long words tend to have less word-boundary problems than short words, and that briefing very common short words can solve many of these problems.


-Steven

JustLisnin2

May 30, 2016, 10:26:07 AM
to Plover
Hi Steven,

The picture is becoming clearer now, thanks. So you're suggesting an entire dictionary that's generated based on preset preferences. I was thinking along the lines of a word-by-word generator based on the Carnegie Mellon dictionary, where the user would have the option to choose between multiple dictionary entry suggestions and pick the entry that fits their preference. So if you want an entry for a word, you would type that word into the generator and get a list of choices, then you could pick the one you want. This led me to think that an unwary user might mix and match among entries, thinking only in terms of individual words and not thinking about potential word boundary errors that could arise. You're right; an entirely conflict-free theory is a myth. But an entire dictionary makes much more sense to me in terms of conflicts. You're sort of generating your own theory that fits your personal writing style, and, as you said, it's a way to test out potential new theories and pick the most efficient one on an individual basis. That sounds so cool :)

Nat

Steven Tammen

May 30, 2016, 11:13:28 AM
to plove...@googlegroups.com
Yes, that's it. 

I think letting people choose from options like that isn't a terrible idea, but it would need some sort of conflict detection (at least ideally) for it to work well. It would serve the purpose of letting people "try out" different briefs for words without necessarily having to come up with them haphazardly every time they want to brief something. So long as it could filter based on individual people's dictionaries, people could see what strokes are "available" to them, keeping their briefing conflict-free (see the sketch below). I believe this is what Gavan was thinking of above. To extend the idea even further, if we got the permission of established theories (Magnum, etc.), we could even have this hypothetical "brief generator" display the briefs those established theories use for words as well.
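A very small sketch of that kind of availability filter, assuming a Plover-style JSON dictionary (outline strings mapped to translations) and a hand-picked list of candidate briefs; generating the candidates themselves from a pronunciation is the hard part and isn't shown:

    import json

    # The user's personal dictionary; Plover dictionaries are JSON maps from
    # outline strings (e.g. "S-G" or "PHET/HRAERPBL") to text.
    with open("user.json", encoding="utf-8") as f:
        taken = set(json.load(f))

    # Hypothetical candidate briefs for "something" -- placeholders only.
    candidates = ["S-G", "SOG", "SOFG", "STH-EUPBG"]

    for outline in candidates:
        print(outline, "is", "taken" if outline in taken else "available")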

-Steven


Steven Bhardwaj

May 31, 2016, 1:04:21 AM
to Plover
Hi All,

On the subject of Plover and Linguistics,

This thread makes me think of the idea of using a chorded keyboard system for inputting raw IPA transcriptions...
IPA keyboard for English: http://ipa.typeit.org/
IPA for English Audio key: http://www.antimoon.com/how/pronunc-soundsipa.htm
IPA keyboard with all symbols: http://ipa.typeit.org/full/

I expect it would be helpful, to retain any vestige of efficiency, to have a language-specific IPA theory. But the theory probably ought to be an orthographic theory like Mecatipia, Jackdaw, or Kinglet, making it (I suppose) easier to create. CMUdict would probably be helpful in designing a reasonably well optimized theory, but it might be more flexible for this application to use an orthographic-style dictionary rather than a regular word-centric steno theory.

An English setup might be fun if I ever wanted to learn how to imitate different English accents! Although more serious uses would include transcribing endangered languages from DAT tapes, etc.

:)
Steven

Steven Tammen

May 31, 2016, 10:35:01 PM
to Plover
Hi Steven,

I had toyed with the idea of basing the generation itself on IPA, but I couldn't seem to find a good English dictionary in IPA. CMUDict uses ARPAbet, which is English-specific and also entirely ASCII-based (it requires no Unicode support -- IPA does).

We could use CMUDict to generate IPA transcriptions like you say, but I wanted to keep the goal focused for the first little bit (i.e., getting a working dictionary generator for "normal" English stenography). Over time, something else I thought we could do was take in IPA as a language independent source, and get steno support for languages without developed theories yet (assuming we could find pronunciations for words in said language in IPA). This would help open stenography up to more people who wouldn't otherwise have access to it in their native language.

I think you may be right that orthographic input systems have the upper hand for full IPA. There are a lot more sounds than we use in English, for example, and it would be difficult to fit them all on a typical 22-key stenotype.

-Steven

Martin Sherman-Marks

Jun 1, 2016, 4:45:16 PM
to Plover
This conversation is relevant to my interests! CMUDict is a good starting point, but I see a few problems you'll need to solve along the way:
  • Syllabification: CMUDict doesn't define syllable boundaries, which are critical for any steno theory. (As I know very well from my own experience as a learner. Most of the time, when I can't figure out how to stroke a word, I eventually realize it's because the definition in Plover is based on a different syllabification than the one in my head.) And unfortunately, there is no particularly good rule for syllabification in English, especially when you're etymology-blind. The TeX hyphenation algorithm has been ported to Python, and that might be a good starting point - but note that "where to stick a hyphen" and "where to stick a syllable break" are related but different problems. (The TeX algorithm won't hyphenate "project", for example; see the first sketch after this list.) I'm not saying it's going to be impossible to syllabify the CMUDict algorithmically, but it'll present some interesting challenges.
  • Morphemes: CMUDict is morphology-blind; it has separate line entries for "ABANDON", "ABANDONS", "ABANDONING", "ABANDONED", and "ABANDONMENT", for example, with no way to know that those words are all connected. Before you start trying to run through CMUDict, you'll want a prefix/suffix dictionary, which will almost necessarily be non-phonetic. (For example, Plover uses "*PLT" for "-ment", not just because it's shorter but also so that "PHEPBT" is available for the first syllable of "mental". Otherwise, how would you talk about your friend Abandonment Al? Poor Abandonment Al. He's got some problems.) Oh, and going back to my earlier point, you'll have to make sure that your syllabification algorithm sees "ABANDON/ING" rather than "ABANDO/NING" - but still sees "SING", not "S/ING" - or you'll have a disaster on your hands.
  • Conflicts: I threw together a quick Python script to count how many homophones there are in the CMUDict. I found 13,015! (A rough version of that count appears after this list.) (Admittedly, many of them, like "beet" and "beat", can probably be dealt with using the Plover theory's built-in disambiguation rules. I didn't account for that.) So conflict resolution definitely isn't a "figure it out manually" kind of problem, unless you intend to pore over a hell of a lot of dictionary entries. Unfortunately, conflict resolution relies in large part on something CMUDict won't tell you: word frequency. Which is more common: the word "accord" or the acronym "ACORD" (the Association for Cooperative Operations Research and Development, naturally)? You can answer that very quickly; CMUDict can't. And that means that you need some other test to tell you which should get the rule-following stroke A/KORD and which should get a rule-breaking stroke like A/KO*RD. (Or, in this case, which is so uncommon that maybe it should just be quietly ignored.) Plus, you need to teach your script how to craft a good rule-breaking stroke. It's easy enough to say "just throw an asterisk in there", but remember that Plover theory uses S* for initial /z/ and *T for final /th/, so your word may already have an asterisk in it. You can also change out the vowels, or repeat the stroke more than once to cycle through related options. (Repeating A*PB to switch between {^an}/Anne/Ann is one of the more remarkable versions of this in the Plover default dictionary.)
  • Capitalization: Another thing the CMUDict doesn't have: lowercase letters! The Plover dictionary has PHARBG for "mark" and PHA*RBG for "Mark"; that sort of thing is very common. If I hadn't looked up "ACORD" in the last example, I wouldn't have had any way to know it wasn't "acord" (not a word). Even a smart algorithm that was reading through the CMUDict would have surely given me a dictionary entry "A/KO*RD: acord" for a word that doesn't actually exist!
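A quick sketch of the hyphenation point from the first bullet. It assumes the third-party pyphen package (one Python implementation of Liang-style, TeX-descended hyphenation patterns) is installed; it only shows where hyphenation points fall, which is related to, but not the same as, syllable boundaries:

    import pyphen

    dic = pyphen.Pyphen(lang="en_US")
    for word in ["project", "dancing", "abandoning", "metallurgy"]:
        # .inserted() marks hyphenation points, e.g. "aban-don-ing";
        # these are typesetting hyphens, not phonological syllables.
        print(word, "->", dic.inserted(word))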

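And a rough version of the homophone count from the third bullet, assuming a local copy of the plain-text CMUDict (e.g. cmudict-0.7b, where comment lines start with ";;;" and alternate pronunciations look like "WORD(1)"); exact totals will depend on how variants are counted:

    from collections import defaultdict

    words_by_pron = defaultdict(set)
    with open("cmudict-0.7b", encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue
            word, pron = line.split(None, 1)
            word = word.split("(")[0]        # strip "(1)"-style variant markers
            words_by_pron[pron.strip()].add(word)

    homophone_sets = [ws for ws in words_by_pron.values() if len(ws) > 1]
    print(len(homophone_sets), "pronunciations are shared by 2+ distinct words")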
I think a common solution to many of these problems is to incorporate more than one wordlist. For example, a word frequency table would help with conflict resolution at the very least - though you'd need a big one from a good corpus. Step one would be to write some kind of script that turned the CMUDict into a more complete dictionary with a format like:

word    W ER1 D    245

That's the word in its normal capitalization, pronunciation, and then frequency rank. You'd still have morphology and syllabification problems to think about, but that would be a good step one.
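A minimal sketch of that "step one" merge; the file names and the frequency-list format (one word per line, most frequent first, as in the word-frequency list Zack links below) are assumptions rather than settled choices:

    # Rank words by their position in a frequency list, then emit
    # "word <TAB> PRONUNCIATION <TAB> rank" for every CMUDict entry
    # we have frequency data for.
    freq_rank = {}
    with open("word_frequencies.txt", encoding="utf-8") as f:
        for rank, line in enumerate(f, start=1):
            freq_rank.setdefault(line.split()[0].lower(), rank)

    with open("cmudict-0.7b", encoding="latin-1") as src, \
         open("cmudict_with_freq.txt", "w", encoding="utf-8") as out:
        for line in src:
            if line.startswith(";;;") or not line.strip():
                continue
            word, pron = line.split(None, 1)
            base = word.split("(")[0].lower()
            if base in freq_rank:
                out.write("{}\t{}\t{}\n".format(base, pron.strip(), freq_rank[base]))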

Zack Brown

Jun 1, 2016, 5:20:30 PM
to ploversteno
For syllabification, even if there's no good rule for it, there may be
a good rule to identify the range of possibilities. Any new dictionary
will probably want to have entries for as many possible
syllabifications of words as it can, to account for everyone's
personal tastes (similar to what Plover does now). Also bear in mind
that you will probably want to include things like dropping unstressed
vowels, and the inversion rule. This messes with syllabification a bit
as well. You'll probably need to come up with a whole new approach to
syllabification, based on making those assumptions.

Also, for any programmatic analysis of CMUDict, I'd recommend
prioritizing words based on frequency of use. Peter Norvig's word
frequency table is at http://norvig.com/google-books-common-words.txt

I'd also suggest programmatically coming up with a set of prefix and
suffix strokes, similar to what Plover has. The idea would be for no
word to end with the keys used in any prefix stroke, and for no word
to begin with the keys used in any suffix stroke, to avoid word
boundary errors.
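One very literal reading of that rule can be checked mechanically against a Plover-style JSON dictionary (outline strings mapped to text); the dictionary path and the candidate affix strokes below are placeholders, not real proposals:

    import json

    with open("main.json", encoding="utf-8") as f:
        outlines = json.load(f)              # {"STROKE/STROKE": "translation", ...}

    candidate_prefix_strokes = {"EBGS", "KAUPB"}   # hypothetical prefix strokes
    candidate_suffix_strokes = {"-G", "*PLT"}      # hypothetical suffix strokes

    first_strokes = {o.split("/")[0] for o in outlines}
    last_strokes = {o.split("/")[-1] for o in outlines}

    # Words that begin with a suffix stroke, or end with a prefix stroke,
    # are potential word-boundary trouble under this rule.
    print("suffix strokes that also start words:", candidate_suffix_strokes & first_strokes)
    print("prefix strokes that also end words:  ", candidate_prefix_strokes & last_strokes)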

Another thing to bear in mind is that in steno (although I don't use
this lingo in Learn Plover), the "theory" is generally considered to
be the particular approach to constructing briefs. The whole set of
standard and repeating rules governing consonant and vowel sounds, and
things like the inversion rule and so on, is not called 'theory'
because it's considered so fundamental that it's not even questioned -
all English language steno systems use those same basic chords and
rules, for the most part. At least that's my understanding.

But I would suggest changing that. Anyone coming up with a new
dictionary should truly start fresh. Use CMUDict and the Norvig files,
and come up with an entirely new set of keys and chords for all the
different English sounds. I think if you do that, it may be possible
to improve on Ward Stone Ireland's original keyboard layout. At that
point, it might be possible to significantly reduce word conflicts,
and fit a far greater number of multi-syllable words into single
strokes.

Ward Ireland's keyboard was designed 100 years ago, with virtually no
statistical calculation to guide him. Additionally, it was designed to
be entirely syllabic. There were no briefs because there were no
lookup files. It was only in the 1980s that the proprietary steno
companies introduced dictionary files and briefs. Given that kind of
chaotic history, I think there's a very good chance that a much better
solution exists than the one that's come down to us. I think whoever
works on this is very likely to find a much cleaner, sharper system
than any of the steno systems currently in existence.

Be well,
Zack
--
Zack Brown

Martin Sherman-Marks

Jun 2, 2016, 9:17:41 AM
to Plover
Zack, I was thinking about that very idea of "identifying the range of possibilities"; part of the challenge will be determining how much a particular stroke should be allowed to "spread" in the dictionary. Not just for syllabification, but for misstrokes too - the algorithm will have to think about how hard a particular stroke is, what the likely misstrokes are, and then will have to weigh how frequently the word is used against the space that the likely misstrokes will take up. A fairly complicated and nuanced process!

I've been trying to find a word frequency list that is case-aware, but with no luck so far. The American National Corpus - which, I'm pleased to note, contains among other things 3 million words from a Buffy the Vampire Slayer fan forum - has frequency data, which doesn't differentiate by case but does differentiate by part of speech, including proper nouns. (It also includes bare lemmas for plural nouns, which I suspect may be helpful down the line.) I'll attempt to pull it together into a case-aware word frequency list on the assumption that pretty much all proper nouns are capitalized. The next step after that will be combining it with the CMUDict to add in pronunciation, which should be fairly straightforward, I hope. (There is a larger, cleaner word list, from the 30x larger Corpus of Contemporary American English, but that costs $250, or $125 if we can claim academic use. If the ANC wordlist works, then it would be fairly trivial to modify the script to use the CoCAE data when I'm feeling wealthier.)

With regard to what Zack was saying about developing new ground-up principles of steno from this - I think he may well be right, and it's something I'm interested in exploring. Unfortunately we need to conquer syllabification first. Once we have that, we can develop a complete list of syllable onsets, nuclei, and codas in American English (and their frequency!) - that's the point where we can start rethinking the keyboard.

Martin Sherman-Marks

Jun 2, 2016, 11:10:51 AM
to Plover
Yikes. Okay. The ANC list has some issues. My assumption that anything flagged as a proper noun should be capitalized has run into the issue that they flagged a lot of words as proper nouns. The word "accent", for example, occurs 449 times in the corpus, and is flagged as a proper noun 23 of those times. Not super helpful. In total, it looks like about 40% of the words that occur more than twice in the sample are flagged as proper nouns at least once, which is... ugh. There are more proper nouns in the dataset than improper ones!

I was able to improve things by using an SSA dataset to generate a complete list of all 95k first names registered since 1880, then only capitalizing entries if they're flagged as proper nouns and are in that dataset and are in the CMUDict. (I'm downloading GNIS/GNS datasets now so I can add geographical names as well - they're huge datasets, naturally, but by limiting the list to the intersection with CMUDict, and by stripping all data but the placenames themselves, I'll be able to make a fairly small file of geographical names.) This greatly helps - though it still thinks that the first name "Zeppelin" is 267% more common than the actual word "zeppelin" (since "zeppelin" is tagged as a proper noun 24 times in the dataset and as a typical noun only 9 times). There's no way to address that short of using a better corpus, which I'm going to continue looking for.
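A sketch of that filtering step, assuming the SSA national baby-name files (yob1880.txt through the latest year, one "Name,Sex,Count" record per line) have been unpacked into a names/ directory; the helper function name is hypothetical:

    import glob

    # All first names the SSA has ever registered, lowercased.
    first_names = set()
    for path in glob.glob("names/yob*.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                first_names.add(line.split(",")[0].lower())

    # All words that actually appear in CMUDict, lowercased.
    cmu_words = set()
    with open("cmudict-0.7b", encoding="latin-1") as f:
        for line in f:
            if line.strip() and not line.startswith(";;;"):
                cmu_words.add(line.split(None, 1)[0].split("(")[0].lower())

    def should_capitalize(word, tagged_proper_noun):
        # Capitalize only if the corpus tags it as a proper noun AND it is a
        # known first name AND it has a CMUDict pronunciation.
        return tagged_proper_noun and word in first_names and word in cmu_words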

My first draft of a case-sensitive dictionary with pronunciation information and word frequency information is attached. Anyone who wants to play around with it, or who wants to see any of the source files I'm using, let me know.
(Attachment: newDict)

Steven Tammen

Jun 2, 2016, 11:35:41 AM
to plove...@googlegroups.com
This is great stuff guys, and exactly the sort of thing I thought might come up once you got under the hood, so to speak. My training in linguistics has been limited to several hours of casual reading on Google and my knowledge of steno is about equivalent (no NKRO keyboard + no SOFT/HRUF yet = lack of steno skills). If I say something really silly... that's probably why.

I had initially had the idea of rebuilding steno from the ground up in mind, but decided that I'm not the one to do this (though it would make a great thesis topic for someone in a relevant field). However, I would be most supportive of such an effort, and in fact I think it should be considered somewhat of a priority compared to many other features. All the cool stuff that Zack and Jennifer have been doing with Kinglet and Jackdaw need not be limited to orthographic input systems.

On the other hand, I do think there is value in making the system easily accessible for people still in more traditional forms of steno. On a practical level, we're going to have to convince people that all this complicated stuff is worth doing, which means it has to be usable by them as well. There are plenty of advantages to having a dictionary that is algorithmically generated rather than hand-crafted, with some obvious ones being that it's much easier to tweak, and could be easily regenerated if something happens to the main one.

-----------------------------

Let me see if I can get a handle on some of the issues in play:

1) Syllabification 

Even though things like the maximum onset principle exist, there is no great consensus on how words are split. Furthermore, any abstract pontification about syllabification ignores the reality that different people will split syllables in a way that makes sense to them (even if it's not "canonical"), and therefore there is no one-size-fits-all answer. We will have to account for as many different syllabifications as reasonably possible, just like misstrokes.

2) Morphemes and Suffixes

To get related words connected (verb conjugations, for example: tag, tagged, tagging), we will have to figure out a way to 1) parse this data out of CMUDict, and 2) use it somehow in the resulting dictionary. This is further complicated by the fact that semantic matching will have to occur using syllabification that results in normal suffixes such as -ing, -ation, and so forth, while ignoring words like sing and nation.

3) Homophones

Hand correction is out of the question, and it would be inconsistent anyhow. Disambiguating the conflicts should rely on frequency data and be done consistently if possible.

4) Capitalization

Not present in CMUDict initially. Will probably be easiest to add using word-lists of names, places, etc. (proper nouns). Dealing with some words that exist in both capitalized and uncapitalized forms (as above: Zeppelin as in "Led Zeppelin" vs. zeppelin as in the Hindenburg) will present a challenge.

-----------------------------

@Martin, your dictionary looks good at first glance, but not all of the "word bases" (which is what the second column is, I take it) look right. For example:

absolute absolute
absolutely absolutely
absoluteness absoluteness
absolutes absolute
absolution absolution
absolutism absolutism
absolutist absolutist

Unless I'm misunderstanding the purpose of that column, more of these words than just "absolute" and "absolutes" are related.

What do you think the next step is?


Steven Tammen

Jun 2, 2016, 11:52:03 AM
to Plover
Oh, one other thing to think about:

As we are going through this process (if we are going through this process?), something to think about is documenting things more than usual. If we can figure out the issues in English, we can generalize our solutions to other languages and different phonetic transcriptions (at least somewhat/mostly). Stenography is still limited in its support for many languages, and if we get a system in place that can generate a framework based on frequency statistics, ease of combinations, etc., we might enable others to build efficient stenographic frameworks without ever giving an inefficient framework a chance to take hold.

Martin Sherman-Marks

Jun 2, 2016, 12:24:41 PM
to Plover
Steven, my next step is to continue trying to refine this wordlist. I don't think I'm going to find a significantly better corpus without shelling out for the CoCAE wordlist, which I'm certainly not ready to do, so I'm going to try to take whatever steps I can to improve this one (without, you know, going through it all by hand or anything). I may ultimately not be able to match word frequency to proper nouns. And that might be okay! I can certainly generate a pretty decent list of proper nouns which occur in CMUDict - I've got first names already, will soon have geo names, and just need last names to round out the set - and we can just treat proper nouns as if they were middle-of-the-pack words. We're not going to find a corpus that's actually designed for the crazy-ass purpose we're putting it to, so we'll have to avoid making the perfect the enemy of the good.

The second column in the dictionary (which comes straight from ANC data) is actually the lemma, which is to say (roughly speaking) the word form that you look up in the dictionary. So "is" has the lemma "be", but "absolutism" doesn't get the lemma "absolute" even though it's clearly related to "absolute". The "-ism", "-ly", "-ness", etc. suffixes are what we call derivational morphemes: they change the meaning or part of speech of a word, so the word gets a new lemma. The "-s" and "-ed" suffixes are inflectional morphemes; they modify a noun's number or a verb's tense, but they don't change the word in any more fundamental sense, so the lemma remains the same. The lemmas may or may not turn out to be helpful for us in the long run; I kept them in my dictionary because I figured they wouldn't hurt, basically.

Apart from generating the wordlist, we need a general syllabification algorithm. Even if we're going to "spread out" strokes to account for alternate syllabifications, we'll need a starting point. I think the best rule I've seen is that consonants on a syllable border go with the more heavily stressed vowel. ARPAbet stress values go, somewhat confusingly, 1 (most stress), 2, 0 (least stress). (ARPAbet defines a few syllabic nasals and liquids too (EM, EN, ENG, EL) but it looks like CMUDict doesn't use those, so we can just look at numbered vowels.) This gets us a lot of the way there, but there are still issues: consider "abandoning" [AH0 B AE1 N D AH0 N IH0 NG] - the algorithm won't know where to put the <n>, but we want to make sure it winds up in the penultimate syllable so we can recognize the "ing" suffix on the end. This is actually a case where the lemma column may be useful. (Note that this isn't the final syllabification algorithm for mapping pronunciation to steno: that will have to factor in a whole mess of other problems, which Zach alluded to above.)
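A toy version of that "consonants go to the more heavily stressed vowel" rule over ARPAbet phone lists, as a starting point only: whole intervening clusters are assigned to one side or the other, ties go to the left vowel, and real phonotactic constraints are ignored.

    STRESS_RANK = {"1": 3, "2": 2, "0": 1}   # primary > secondary > unstressed

    def is_vowel(phone):
        return phone[-1] in STRESS_RANK      # CMUDict vowels end in a stress digit

    def stress(phone):
        return STRESS_RANK[phone[-1]]

    def syllabify(phones):
        vowels = [i for i, p in enumerate(phones) if is_vowel(p)]
        if not vowels:
            return [phones]
        sylls = [[] for _ in vowels]
        sylls[0].extend(phones[:vowels[0]])          # word-initial onset
        for k, vi in enumerate(vowels):
            sylls[k].append(phones[vi])
            nxt = vowels[k + 1] if k + 1 < len(vowels) else None
            cluster = phones[vi + 1:nxt]
            if nxt is None:
                sylls[k].extend(cluster)             # word-final coda
            elif stress(phones[vi]) >= stress(phones[nxt]):
                sylls[k].extend(cluster)             # more-stressed vowel wins; ties go left
            else:
                sylls[k + 1].extend(cluster)
        return sylls

    print(syllabify("AH0 B AE1 N D AH0 N IH0 NG".split()))
    # With ties going left this prints
    # [['AH0'], ['B', 'AE1', 'N', 'D'], ['AH0', 'N'], ['IH0', 'NG']],
    # i.e. the "abandon/ing" boundary discussed above.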

With that algorithm and my wordlist, we can get a full list of syllable onsets, nuclei, and codas. I keep coming back to that, but it's the heart of English steno. The layout of the left side of the keyboard starts to make total sense when you think about all the English syllables that start with /str/ or /spr/ - as well as all the English syllables that don't start with /pw/. (A few foreign words like "pueblo" and "Poitier" do, but for the most part it's very safe to map PW to /b/.) If we intend to reinvent the wheel, that's the kind of data we'll need.
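Building on the syllabify() sketch above, tallying onsets (and, symmetrically, codas) across a syllabified wordlist is then just a counting exercise; `pronunciations` is assumed to be a list of ARPAbet phone lists read from the merged wordlist:

    from collections import Counter

    onset_counts = Counter()
    coda_counts = Counter()
    for phones in pronunciations:                 # assumed input, see above
        for syllable in syllabify(phones):
            vowel_positions = [i for i, p in enumerate(syllable) if is_vowel(p)]
            if not vowel_positions:
                continue
            onset_counts[tuple(syllable[:vowel_positions[0]])] += 1
            coda_counts[tuple(syllable[vowel_positions[-1] + 1:])] += 1

    print(onset_counts.most_common(20))
    print(coda_counts.most_common(20))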

There are other fronts we can be attacking this on, like putting together our list of prefixes and suffixes, but the above are what I'm seeing as the most critical issues.

In regard to internationalization of this system... well, keep in mind that English has much better corpora than most languages. True, you might not need a CMUDict for, say, Spanish (because the writing system is so phonetic) but you'll still need word frequency data at the very least, as well as a thorough list of proper nouns.

Martin Sherman-Marks

Jun 2, 2016, 12:31:28 PM
to Plover
Oh, and I'd be wary of using the maximal onset principle for syllabification. Phonologically, it may be fairly accurate, but it plays fast and loose with morphology. Maximal onset says that "dancing" is syllabified dan/cing, but we want it to obey the morphological boundary: danc/ing. The "hungry stressed syllable" idea would give us the correct answer (and just about always will, since "-ing" is never stressed - the only problem arises where the preceding syllable is also not stressed, like "abandoning".)

Martin Sherman-Marks

Jun 2, 2016, 12:33:40 PM
to Plover
(Also, I have a linguistics degree and tend to use jargon without explaining it. Please feel free to ask if there's anything you don't understand or need me to define.)

Steven Tammen

Jun 2, 2016, 1:10:16 PM
to Plover
You're fine haha. I think I actually got most all of it.

I'll try to chip in where I can, but I think I'm going to get eclipsed here, having a background in neither CS nor linguistics, nor, practically speaking, stenography itself.

> we'll have to avoid making the perfect the enemy of the good.

Now this is the problem I have. https://xkcd.com/1445/ 

Jennifer Brien

Jun 2, 2016, 3:11:58 PM
to Plover


On Thursday, 2 June 2016 17:31:28 UTC+1, Martin Sherman-Marks wrote:
Oh, and I'd be wary of using the maximal onset principle for syllabification. Phonologically, it may be fairly accurate, but it plays fast and loose with morphology. Maximal onset says that "dancing" is syllabified dan/cing, but we want it to obey the morphological boundary: danc/ing. The "hungry stressed syllable" idea would give us the correct answer (and just about always will, since "-ing" is never stressed - the only problem arises where the preceding syllable is also not stressed, like "abandoning".)

The nice thing about orthography is that you can split your words anywhere you like, because there is no dictionary. I'm inclined, where possible, to have an extra key for each of the main suffixes, so they can be folded into the main stroke whenever you get the chance. Also, even with a dictionary, it would be good if there were a way to mark (as with Velotype's No Space) whether a stroke is a complete word, a prefix or a suffix. That would mean you could find a multi-stroke word in the dictionary no matter how it was split, and it means that Abandonment Al works out just fine.

Discounting homophones (granted, that's a big discounting!), a system based on CMUdict would be rather like an orthographic system for English with Simplified Spelling. It might be a bit faster than one for Standard spelling - provided your own pronunciation is sufficiently Standard. I don't think it would be that great for sight-transcribing unfamiliar words. 

I don't do real-time audio transcription and I probably never shall (I have done quite a bit of tape transcribing), so I don't know what is ideal for that purpose, but I'm very wary of the idea of a Big Comprehensive Dictionary. ISTM that once a corpus of words exceeds a few hundred, it becomes quite obvious where its bias lies. I want to be able to write any word (even ones that I have invented) without having to spell it out letter-by-letter, and if it's a long word that I'm likely to need again, I want to be able to quickly make a brief for it. If it's something that only comes up once in several thousand words, why lose sleep over spending an extra stroke?

To make this efficient I need to be able to stroke the most common onsets and codas as they are spelled, in the most straightforward manner. I'm not interested in word frequencies or even syllable frequencies, but I am interested in the frequencies of consonant sequences. If such a sequence precedes a vowel, it's an onset to be keyed by the left hand; if it follows a vowel, it's a coda; and if it has a vowel at each end, it's a coda followed by an onset and you can divide it by the maximum onset principle. It would also be useful to record the adjacent vowels. Jackdaw's leading A and trailing E and Y/I seem to save a lot of strokes, but I wonder how it compares with giving more space to consonant combinations?

The basic principle is: use the easiest keys for the most common sequences, whether they be consonants or phrases. If they are natural prefixes, arrange if possible for them to be stroked solely by the left hand (or by the right if they are natural suffixes) so that more can be included in the same stroke. I think this principle is also widely used in Magnum Steno, but allowing the output of different parts of the keyboard to be combined, as I outlined here - https://groups.google.com/d/msg/ploversteno/mo7OF0D6UM0/s4YZItf0EwAJ - avoids dictionary inflation.

Steven Tammen

Jun 2, 2016, 5:24:14 PM
to Plover
Well, it looks like we're on our own. Ted thought it was a decent idea, but neither he nor Mirabai was completely sold. Something along the lines of: an awful lot of upfront work for questionable payoff.

I still think it would be a great thing to have eventually.

Zack Brown

Jun 2, 2016, 7:00:34 PM
to ploversteno
Heh, I could've told you Mirabai would disagree. We had many
discussions about that while I was working on Learn Plover. She
represents the position that Ward Ireland really knocked it out of the
park, and any possible improvement will be minimal at best. She could
be right. But if someone really did come up with an improved
dictionary, I'm sure she'd acknowledge it. She just has a lot of faith
in Plover's dictionary, for good reason - it's her own personal
system, that she developed over years.

The thing about the Ward Ireland keyboard layout is this: to improve
upon it, you need to find a layout that can produce a wider variety of
words in a single stroke, without using any briefs, than the Ireland
keyboard. On top of that, any briefs that are used for disambiguation
have to rely on a simpler set of general guidelines than Plover
(https://sites.google.com/site/ploverdoc/lesson-9-designing-briefs).
Also, any briefs that are *not* used for disambiguation but instead
are simply for speed ("the", "of", "that", "as", etc), have to be at
least as easy to type as the Plover versions, because that will have a
strong aggregate effect on typing speed.

BTW, regarding a syllabification algorithm - I don't think it's as
important as other folks seem to. The reason is this: the new keyboard
layout will define a new "steno order". Its value will lie in its
ability to cram more words into that order than traditional steno
order does (otherwise the new system will not offer a speed
improvement over Plover). Since that's the case, syllabification
doesn't matter as much as the ability to cram a word into the new
steno order. Steno has never really been about syllables anyway - as
witnessed by the vowel-dropping rule. So, personally, I believe the
hunt for syllabification algorithms will be a time-wasting red
herring. I'd recommend focusing on identifying the most
all-encompassing steno order instead. Let stroke-breaks take care of
themselves.

Be well,
Zack

Theodore Morin

Jun 2, 2016, 8:19:59 PM
to Plover

I support you in the sense that I think it's worth trying/doing ☺️ just not something that I'd like to put effort into myself.

Plover will definitely be there to support you technically, including a different steno order and more keys if need be.

Zack Brown

Jun 2, 2016, 9:22:13 PM
to ploversteno
Excellent! So at least if anything does come out of this, it'll have a home in the software.

So, is anyone actually pursuing a new steno dictionary as a real project - or in particular, the software to construct a solid language-agnostic dictionary for anyone who has a phonetic dictionary file and frequency stats in a given language?


Be well,
Zack


Martin Sherman-Marks

Jun 2, 2016, 10:15:55 PM
to plove...@googlegroups.com

It's funny, I was saying to Mirabai about a week before this thread started that I didn't really think that any computer-generated dictionary could be as good as a human-built one. I'm still not at all convinced it can! I'm enjoying working on the problem but am fully prepared for it to be a fool's errand.

My gut says that we're unlikely to find any massive improvement over the Ward Ireland model. It's a good model! I have quibbles (in particular, I feel that the asterisk is overloaded, and that there must be a better solution for, e.g., final -th) but I don't think we're going to upend anything. His steno order makes a great deal of intuitive sense to me. I'd be fascinated to be proven wrong!

I will disagree with you, Zack, in that I think you need syllable information - in particular a list of onsets and codas with their frequency. Otherwise, what information would you even have to question steno order?

I did successfully create a first pass at a "hungry stressed vowel" algorithm. However, I'm not super happy with it, and may end up eating my words and going with maximal onset after all. Switching between the two is fairly easy. I'll update more on that tomorrow.


Steven Tammen

Jun 2, 2016, 10:38:47 PM6/2/16