Word frequency lists - word forms vs. lexemes

Drew Neil

Dec 27, 2015, 6:23:40 PM

to plove...@googlegroups.com

This list on Wikipedia of the most common words in English ranks the word “be” as number 2 (after “the”). But then it goes on to note that

> the lexeme "be" listed below, includes occurrences of "are", "is", "were", "was", etc.

That means that the words “are”, “is”, and so on don’t appear in the word frequency list. I can see how that would be useful if you’re learning to speak or read a language and want to build your vocabulary by working through the most common words. In that case, you would also be studying grammar, and you would learn that “be” is an irregular verb with many forms. But if you’re learning to type steno, it seems to me that it would be more useful if the word frequency list treated “be”, “are”, “is”, etc. as entirely different things. These words sound different and they are stroked differently.

Does anyone know of any good word frequency lists like this? To be precise, I’m looking for a word frequency list compiled from word forms, not from lexemes.

That terminology is new to me, but I found this definition of lexemes and word forms on Wikipedia:

> The distinction between these two senses of "word" is arguably the most important one in morphology. The first sense of "word", the one in which dog and dogs are "the same word", is called a lexeme. The second sense is called "word form". Dog and dogs are thus considered different forms of the same lexeme. Dog and dog catcher, on the other hand, are different lexemes, as they refer to two different kinds of entities. The form of a word that is chosen conventionally to represent the canonical form of a word is called a lemma, or citation form.
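To make the difference concrete, here is a minimal Python sketch that counts the same tokens once as word forms and once as lexemes. The `BE_FORMS` table and both function names are hypothetical and written only for this illustration; a real lemmatized list would rely on a full morphological dictionary rather than a hand-made table for one verb.

```python
from collections import Counter

# Hypothetical, hand-written lemma table covering only the forms of "be".
BE_FORMS = {
    "am": "be", "are": "be", "is": "be", "was": "be",
    "were": "be", "been": "be", "being": "be",
}

def count_word_forms(tokens):
    """Count each surface form separately."""
    return Counter(tokens)

def count_lexemes(tokens, lemma_table=BE_FORMS):
    """Count tokens after mapping each one to its lemma, when one is known."""
    return Counter(lemma_table.get(token, token) for token in tokens)

tokens = "the dog is here and the cats are there and i was late".split()
forms = count_word_forms(tokens)
lexemes = count_lexemes(tokens)

print(forms["is"], forms["are"], forms["was"], forms["be"])  # 1 1 1 0
print(lexemes["be"], lexemes["is"])                          # 3 0
```

For steno practice the first count is the relevant one, since “is”, “are”, and “was” each need their own stroke.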

Cheers,

Drew

Andrew Schort

Dec 27, 2015, 8:18:05 PM

to Plover

If you like that, you should check this out! http://www.wordcount.org/main.php

Tony Wright

Dec 27, 2015, 9:22:43 PM

to plove...@googlegroups.com

Drew, I don't have a definitive answer to your question, but one excellent resource I'd like to point out to you is the Corpus of Contemporary American English: http://corpus.byu.edu/coca/

There are lots of word lists out there, some by lemma, some by literal surface word-form, but you specified that you wanted a good word frequency list, and by good, I assume you mean one from a good, large corpus of actual English utterances. COCA is one such corpus and it's got excellent search tools associated with it.

If you look at the already-compiled word frequency lists associated with this corpus, you will note that, yes, it does list lemmas (e.g., 'be'), not inflected forms of words (e.g., 'is', 'am', 'are'). But I know that you could search the corpus for inflected forms. You might need to write to the administrator, Mark Davies, whose contact information should be under "Contact Us" on that website. He could tell you how to produce the kind of frequency list you need.

There's a lot you could do with access to COCA if you learn how to use it. N-gram searches are also possible, so you could see which two- and three-word phrases occur most commonly in the corpus as well.
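For anyone who wants to try the n-gram idea on their own text, here is a rough Python sketch. It is not the COCA search interface, just a local illustration; the function names `tokenize` and `ngram_counts` and the sample sentence are made up for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

sample = "I want to go. I want to stay. You want to go."
tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams.most_common(1))           # [(('want', 'to'), 3)]
print(trigrams[("i", "want", "to")])    # 2
```

Note that this sketch ignores sentence boundaries, so it also counts pairs such as ("go", "i") that span two sentences. Splitting the text into sentences first would avoid that.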

--Tony

--
You received this message because you are subscribed to the Google Groups "Plover" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ploversteno...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Di

Dec 28, 2015, 12:13:07 AM

to Plover

What are you using the word list for? You could try this:

https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Contemporary_fiction

> This is a word frequency list, based on over 9,379,000 words of contemporary fiction gathered online.
> Regular plurals are combined with their singular forms (tree, trees; box, boxes). Variations of a verb ending in -ed, -ing or -(e)s are lumped together with their root verb (smile, smiled, smiling, smiles). Adjective forms ending in -er or -est are included with their positive form (sad, sadder, saddest). And words ending in -'s are grouped with the form without the apostrophe (boy, boy's; everything, everything's), except for a few common contractions (it's; that's).

... which is apparently what Mr Munroe (XKCD) based Up-Goer on, according to this: http://splasho.com/blog/2013/01/17/a-bit-more-about-the-up-goer-five-text-editor/

That link also refers to other word lists, tools, and approaches.

-Di

Mirabai Knight

Dec 28, 2015, 2:14:44 PM

to ploversteno

Unlemmatized word frequency lists:
https://www.kilgarriff.co.uk/bnc-readme.html#raw

--
Mirabai Knight, CCP, RDR
StenoKnight CART Services
917 576 4989
m...@stenoknight.com
http://stenoknight.com

Drew Neil

Dec 29, 2015, 10:27:39 AM

to plove...@googlegroups.com

Thanks for all of your suggestions!

COCA and the BNC look like especially useful resources. It looks as though wordcount.org uses the unlemmatized word frequency list that Mirabai linked to, which looks really useful.

Cheers,

Drew

gavan....@gmail.com

Dec 31, 2015, 10:49:17 AM

to Plover

Hey,

I've been doing online transcription for about 6 months (qwerty) and I've collected most of my transcripts into a single .txt file so that I could analyze the word frequency and figure out which words I should prioritize as text abbreviations. It's a better way of doing it than just learning/abbreviating ten random words a day. At last count I was up to 1.8 million words and about 13.3k minutes of audio.

Here's a list of ~7500 words sorted by frequency. American English, includes proper names, all lowercase. Words occurring two or fewer times are excluded.

https://docs.google.com/spreadsheets/d/1se1aZmRl_b7FjZlFdc3UgeL4kNFPUTWVE5sIUQ3bDpc/edit?usp=sharing

I used a free program called kfNgram to analyze the text. Transcripts are typically two-person interviews, single-speaker presentations, and a handful of sermons. It's a decent word frequency list based on how people actually talk as opposed to, say, a list compiled from literature. I spell most numbers out.
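For anyone without kfNgram, a similar list can be sketched in a few lines of Python. This is not how kfNgram works internally, just an approximation of the steps described above: lowercase everything, count each word form, and drop words that occur fewer than `min_count` times. The function name and the file name in the usage note are placeholders.

```python
from collections import Counter
import re

def word_form_frequencies(text, min_count=3):
    """Return (word, count) pairs, most frequent first, skipping rare words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

sample = "is it is it is or are they are not"
print(word_form_frequencies(sample, min_count=2))
# [('is', 3), ('it', 2), ('are', 2)]
```

To run it on a real transcript file, something like `word_form_frequencies(open("transcripts.txt", encoding="utf-8").read())` should work, where `transcripts.txt` stands in for your own file. The default `min_count=3` matches the rule of excluding words that occur two or fewer times.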

Now I need a strategy to help me increase my steno vocab. I'm up to about 200 words but it is slow, and I'm not sure at what point I can jump onto Typeracer or whatever.

Anyway, hope that list is of some use.

Gavan

Jennifer Brien

Dec 31, 2015, 1:51:27 PM

to Plover

On Thursday, 31 December 2015 15:49:17 UTC, gavan....@gmail.com wrote:

> Hey,
>
> I've been doing online transcription for about 6 months (qwerty) and I've collected most of my transcripts into a single .txt file so that I could analyze the word frequency in order to figure out the words I should prioritize as text abbreviations. It's a better way of doing it than just randomly learning/abbreviating ten random words a day. Last count I was up to 1.8 million words and about 13.3k minutes of audio.
>
> Here's a list of ~7500 words sorted by freq. American English, includes proper names, all lowercase. Words occurring two or less times excluded.
>
> https://docs.google.com/spreadsheets/d/1se1aZmRl_b7FjZlFdc3UgeL4kNFPUTWVE5sIUQ3bDpc/edit?usp=sharing
>
> I used a free program called kfNgram to analyze the text. Transcripts are typically 2 person interview, single speaker presentations, a handful of sermons. It's a decent word frequency list based on how people actually talk as opposed to, say, a list compiled from literature. I spell most numbers out.

That's interesting. Over 80% of your vocabulary is 'one-in-a-hundred-thousand' words, though I expect many of these cluster in a single transcript, where the best thing to do might be to abbreviate them on the fly and search-and-replace afterwards.

It would be interesting to have a cumulative total in column C, divided by the sum of column B, so you could see at a glance how many distinct words (as opposed to lexemes) it takes to make up a given percentage of the total text. I suspect the numbers are quite stable over a wide range of corpora, though the actual vocabulary may differ widely once you get further down the list.
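The cumulative column could be sketched like this in Python, assuming the counts from column B have been exported as a list sorted most frequent first. The numbers below are made up for illustration and are not taken from the spreadsheet; `cumulative_coverage` and `words_needed` are hypothetical helper names.

```python
from itertools import accumulate

def cumulative_coverage(counts):
    """Given counts sorted most frequent first, return the running share of the total."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

def words_needed(counts, target):
    """Smallest number of top-ranked words whose combined share reaches target."""
    for rank, share in enumerate(cumulative_coverage(counts), start=1):
        if share >= target:
            return rank
    return len(counts)

counts = [50, 25, 15, 5, 3, 2]  # made-up stand-in for column B
print(cumulative_coverage(counts))  # [0.5, 0.75, 0.9, 0.95, 0.98, 1.0]
print(words_needed(counts, 0.9))    # 3
```

One caveat: because the shared list leaves out words that occur two or fewer times, the sum of column B is somewhat smaller than the full word count of the transcripts, so the percentages would come out slightly high.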

gavan....@gmail.com

Dec 31, 2015, 2:55:28 PM

to Plover

Yeah, I tend to abbreviate on the fly for specific jobs. I put those on 0-9 (which is why I spell out most numbers). My current job includes: augmentative, speech therapist, Proloquo2Go. I might add stuff like '1blt' for 'banking license team' on the fly too. Proper names, topic-specific stuff, and speaker idiosyncrasies are perfect candidates for on-the-fly abbrevs.

I was lucky to have a list of about 10k abbreviations, based on a phonetic system, shared with me. Some of the phonetic abbreviations are long though, and I far prefer a less intuitive system that relies on memory. 'Dskrb' for 'describe' isn't great, but dkb, dks, dkx, dkg, dkv, et cetera is much better. I haven't gotten into creating briefs with Plover yet, but I think I'd definitely prefer dropping as many letters or sounds as possible. I'm still struggling through the top 700 word list in the lessons and I would not dare to try to transcribe with it yet!
