Unless otherwise specified, the frequency lists linked from here count distinct orthographic words (not lemmas), including inflected and some capitalised forms. For example, the verb "to be" is represented by "is", "are", "were", and so on.
Frequency lists have many applications in second language acquisition and beyond. One use for such lists in the context of the Wiktionary project is as an aid in identifying missing terms of high frequency and thus, it is assumed, of high priority. Since the English Wiktionary aims to be not merely a database of lemmas but a multi-directional, multilingual dictionary for English-speaking users, there are certain advantages to lists which include inflected forms as well. These forms reflect words as they are likely to be encountered, and thus as they may be used in lookup.
Feel free to add definitions for words on these lists if you know the languages involved! Even better if you can include usage citations and references. If you are involved in another non-English language edition of Wiktionary, you might also consider implementing or expanding on this idea, if there is not already something similar in place. If you see a word in this list that is clearly out of place (wrong language, punctuation, superfluous capitalisation), you are welcome to remove it. While creating entries for words, please leave valid bluelinks in place as these pages may be copied for use with other language projects in the future.
However, this system is far from perfect due to the variable quality of the source data and the automated nature of the processing. Thus a word's presence in any of these lists is merely an invitation for further investigation as to whether an entry is warranted. Please be mindful that there will be many words which do not merit entries at all.
Collocations may or may not warrant their own individual entries, and not necessarily in the exact form they appear here. As an aid to navigating this list, consider enabling the OrangeLinks.js gadget to reveal headword pages which exist (and so still show a blue link) but which do not yet contain an entry for the relevant language. Please be mindful too that not all of the resources listed here are suitable for use directly in Wiktionary, mainly due to licensing incompatibilities.
The methods I have found so far use either Counter or dictionaries, which we have not learned. I have already created the list of all the words from the file but do not know how to find the frequency of each word in the list. I know I will need a loop to do this but cannot figure it out.
The ideal way is to use a dictionary that maps a word to its count. But if you can't use that, you might want to use two lists: one storing the words, and the other storing the counts. Note that the order of words and counts must correspond. Implementing this is more awkward and not very efficient.
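For illustration, here is a minimal sketch of the two-list approach described above, assuming the words have already been read from the file into a plain Python list (all names are illustrative):

    def count_words(words):
        """Count word frequencies using two parallel lists
        (no dictionary or collections.Counter needed)."""
        unique_words = []  # each distinct word, in order of first appearance
        counts = []        # counts[i] is the frequency of unique_words[i]
        for word in words:
            if word in unique_words:
                counts[unique_words.index(word)] += 1
            else:
                unique_words.append(word)
                counts.append(1)
        return unique_words, counts

    unique_words, counts = count_words(["is", "are", "is", "were", "is"])
    for word, count in zip(unique_words, counts):
        print(word, count)  # prints: is 3, are 1, were 1

The list.index lookup makes this approach quadratic in the number of words, which is why a dictionary mapping each word to its count (or collections.Counter) is the usual choice once those tools are available.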
The Display as option allows additional attributes to be included in the results. For example, a frequency list of word forms can be generated with lemmas displayed alongside the word forms. Up to three attributes are allowed.
When the Display as option is used, a concordance of all tokens matching the criteria is generated first, and the frequencies are computed from that concordance. The size of this concordance is subject to a technical limit of 10 million concordance lines. As a result, if the corpus is very large, the frequency data may not be computed from the whole corpus.
Sketch Engine is specifically designed to handle large corpora with speed. Any search will only take a few seconds to complete if the corpus size is under a billion words. It might take a bit of extra time for corpora over 1 billion words. Complex regular-expression criteria used on large corpora might require a several-minute wait.
The only requirement is a tokenized corpus. The results will be more representative if the list is generated from a large corpus. There is, however, no minimum corpus size required for the wordlist to work.
The Fry word list, or "instant words", is widely accepted to contain the most frequently used words in reading and writing. The sight-word list is divided into ten levels and then into groups of twenty-five words, based on frequency of use and difficulty.
It is important for young readers to instantly recognize these high-frequency words by sight in order to build up their reading fluency. It is also important for readers to practice words in meaningful context through phrase and sentence reading practice. As a follow-up activity, students can practice writing short sentences including Fry words.
Thanks, Phil. Yes, the concordance should work; I had thought there might be a way to list the most frequent words in order of occurrence. My guess is to set the parameters and then run the data. That should work, shouldn't it? Someone has done a lot of work in creating Logos.
Frequency lists for the whole BNC (version 1), for the spoken versus written components, for the conversational (i.e. demographic) versus task-oriented (i.e. context-governed) parts of the spoken component, and for the imaginative versus informative parts of the written component. Also: ranked frequency word lists according to parts of speech (e.g. all nouns, all conjunctions) based on the whole BNC corpus (version 1), as well as frequencies for individual part-of-speech tags (e.g. NN1, VDG) based on the BNC Sampler.
Although the frequency lists for this book were based on all 4,124 files of the original BNC version 1 corpus, the text classifications and POS tags used were the updated and more accurate ones implemented in the BNC World Edition.
** For those who want a user-friendly word list (i.e. without frequency figures) based on the entire BNC, I am making one available here (all word forms occurring at least 10 times per million words, alphabetically arranged).
Select any of 70+ registers/genres, and then get a frequency listing for that genre. Just enter "*" (without quotation marks) for a general frequency listing for the selected genre, "[nn1]" for singular nouns in that genre, etc. You can also easily compare word frequency in one genre (or set of genres) against another, e.g. sermons vs. spoken, tabloids vs. broadsheet, medical vs. academic, etc.
Word + frequency lists based on the Brown corpus (not disambiguated by part of speech) may be found at the Brandeis University Computational Memory Lab or at the Psycholinguistic Database at Rutherford Appleton Laboratory.
570 word families assumed to reflect the shared vocabulary of written academic English as used in a wide variety of disciplines (28 in total, 125K words from each) in an Academic Corpus of 3.5m words.
Selection was based on the principles of range, frequency and dispersion, using a specially compiled academic corpus of journal articles, book chapters, course workbooks, laboratory manuals, and course notes.
Sadly, though, the corpus composition was heavily skewed, a fact that affects its representativeness immensely. However, even these days, many people still appear not to have cottoned on to this, as the list still keeps getting cited as a model ;-)
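As a rough illustration of the selection principles mentioned above, here is a minimal sketch of how range, frequency, and dispersion might be computed for a single word from its per-subcorpus counts. The function is hypothetical and uses Juilland's D, one common dispersion measure; it is not the exact procedure used to compile this particular list:

    import math

    def range_frequency_dispersion(subcorpus_counts):
        """Return (range, total frequency, Juilland's D) for one word,
        given its occurrence counts in each subcorpus.

        range: number of subcorpora the word occurs in
        frequency: total occurrences across all subcorpora
        Juilland's D: 1 - (coefficient of variation / sqrt(n - 1));
        values near 1 mean the word is spread evenly across subcorpora.
        """
        n = len(subcorpus_counts)
        total = sum(subcorpus_counts)
        word_range = sum(1 for c in subcorpus_counts if c > 0)
        mean = total / n
        sd = math.sqrt(sum((c - mean) ** 2 for c in subcorpus_counts) / n)
        d = 1 - (sd / mean) / math.sqrt(n - 1) if total > 0 else 0.0
        return word_range, total, d

    # e.g. counts of one word in four discipline subcorpora
    print(range_frequency_dispersion([12, 9, 14, 11]))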
Everything to do with Charles Kay Ogden's 1930s classic Basic English vocabulary list, including the electronic version of Basic English: International Second Language. New York: Harcourt, Brace & World Inc./Orthological Institute.
Based on a corpus of modern Russian fiction and political texts (more than 35 million words). The list includes about 33,000 words whose frequency is greater than 1 ipm (instances per million words). A shorter selection of the 5,000 most frequent words is also available. The list provides word rank, frequency (per million), and part of speech. Some analytical information about the lexical stock is provided, such as coverage of total language use by word bands, e.g. the first 3,000 lemmas cover 76.6824% of the total number of word forms. The corpus, tools for working with it, as well as an aligned parallel English-Russian corpus, are discussed in: Sharoff, Serge (2002). Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics. Proc. of Language Resources and Evaluation Conference (LREC02). May 2002, Las Palmas, Spain.
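As an aside, a coverage figure like the 76.6824% quoted above can be recomputed from any ranked frequency list. A minimal sketch, assuming freqs is a list of raw token counts sorted in descending order (the counts below are made up for illustration):

    def coverage(freqs, top_n):
        """Fraction of all tokens accounted for by the top_n most
        frequent items in a descending-sorted list of counts."""
        return sum(freqs[:top_n]) / sum(freqs)

    freqs = sorted([120, 45, 30, 8, 5, 2, 1, 1], reverse=True)
    print(f"{coverage(freqs, 3):.2%}")  # share of all tokens covered by the top 3 items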
For a quick-and-easy frequency listing/index of words in your own texts, try the following programs. For pedagogical software and vocabulary analysis programs, see the Teaching and Miscellaneous Links page.
Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one conducted on the Oxford English Corpus (OEC), a massive corpus of English-language text.
In total, the texts in the Oxford English Corpus contain more than 2 billion words.[1] The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard's Parliamentary Debates, blogs, chat logs, and emails.[2]
Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis.
According to The Reading Teacher's Book of Lists, the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 make up about half of all written English.[3] According to a study cited by Robert McCrum in The Story of English, all of the hundred most common words in English are of Old English origin,[4] except for "people", ultimately from Latin "populus", and "because", in part from Latin "causa".