Word Frequency English Language


Maya Malbon

Aug 4, 2024, 8:44:09 PM

to congnaredte

Unless otherwise specified, the frequency lists linked from here count distinct orthographic words (not lemmas), including inflected and some capitalised forms. For example, the verb "to be" is represented by "is", "are", "were", and so on.

Frequency lists have many applications in the realm of second language acquisition and beyond. One use for such lists in the context of the Wiktionary project is as an aid in identifying missing terms of high frequency and thus, it is assumed, of high priority. Since English Wiktionary aims to be not a mere database of lemmas but a multi-directional, multi-lingual dictionary aimed at English-speaking users, there are certain advantages to lists which include inflected forms as well. These forms reflect words as they are likely to be encountered and thus as they may be used in lookup.

Feel free to add definitions for words on these lists if you know the languages involved! Even better if you can include usage citations and references. If you are involved in another non-English language edition of Wiktionary, you might also consider implementing or expanding on this idea, if there is not already something similar in place. If you see a word in this list that is clearly out of place (wrong language, punctuation, superfluous capitalisation), you are welcome to remove it. While creating entries for words, please leave valid bluelinks in place as these pages may be copied for use with other language projects in the future.

However, this system is far from perfect due to the variable quality of the source data and the automated nature of processing. Thus a word's presence in any of these lists is merely an invitation for further investigation as to whether an entry is warranted. Please be mindful that there will be many words which

Collocations may or may not warrant their own individual entries, and not necessarily in the exact form they appear here. As an aid to navigating this list, consider enabling the OrangeLinks.js gadget to reveal headword pages which exist (and so will still show a blue link) but which do not yet contain an entry for the relevant language. Please be mindful too that not all of the resources listed here are suitable for use directly in Wiktionary, mainly due to problems with licensing compatibilities.

Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle.

I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together and not have to perform any task yourself, if you don't want to. If you want to perform a task yourself, I suggest you look at stemming; it's the easiest of all.

If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.

Ah, btw, the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to look for document frequency for terms. In order to do it properly, you won't get around using multidimensional vector matrices.
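For a rough feel of what tf-idf computes, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the plain textbook weighting (term frequency times the log of inverse document frequency); the function name `tf_idf` and the sample sentences are made up for illustration, and real toolkits usually add smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Illustrative sample documents (not from the original thread).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)
```

A term that appears in every document (here "the") scores zero, while a term unique to one document (such as "mat") scores highest there, which is the filtering effect you want for a "general feel" of the comments.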

There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:

As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VBZ) of "exist." It also identified "a" and "the" as "determiners (DT)" -- another word for article. As you can see, the POS tagger also tokenized the punctuation.

To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.
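The filter-and-count step could be sketched roughly like this. The `tagged` list is hand-written stand-in data in (word, tag, base form) shape, not real tagger output, and the set of tag prefixes to keep is an assumption you would tune to your own needs.

```python
from collections import Counter

# Hand-written (word, tag, base form) triples standing in for tagger output.
# Tags follow the Penn Treebank style mentioned above (NNS, VBZ, DT, ...).
tagged = [
    ("the", "DT", "the"),
    ("algorithms", "NNS", "algorithm"),
    ("count", "VBP", "count"),
    ("words", "NNS", "word"),
    (".", ".", "."),
    ("a", "DT", "a"),
    ("algorithm", "NN", "algorithm"),
    ("exists", "VBZ", "exist"),
    (".", ".", "."),
]

def count_base_forms(tagged_tokens, keep_prefixes=("NN", "VB", "JJ")):
    """Count base forms, keeping only nouns, verbs, and adjectives."""
    return Counter(
        base
        for _word, tag, base in tagged_tokens
        if tag.startswith(keep_prefixes)
    )

base_counts = count_base_forms(tagged)
```

Because the count is keyed on the base form, "algorithms" and "algorithm" land in the same bucket, and determiners and punctuation drop out.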

To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
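Counting n-grams is only a small step up from counting single words. A minimal sketch, assuming the text is already split into tokens (the helper name `ngrams` and the sample phrase are just for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))
```

Here the bigram ("to", "be") is counted twice and every other bigram once. If the text has fewer than n tokens, the helper simply returns an empty list.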

The first line just imports libraries that help with parts of the problem, as in the second line, where urllib2 downloads a copy of Ambrose Bierce's "Devil's Dictionary". The next lines make a list of all the words in the text, without punctuation. Then you create a hash table, which in this case is like a list of unique words, each associated with a number. The for loop goes over each word in the Bierce book: if there is already a record of that word in the table, each new occurrence adds one to the value associated with it; if the word hasn't appeared yet, it gets added to the table with a value of 1 (meaning one occurrence). For the cases you are talking about, you would want to pay much more attention to detail, for example using capitalization to help identify proper nouns only in the middle of sentences. This is very rough, but it expresses the concept.
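The listing that paragraph walks through is not reproduced in this thread, so here is a minimal sketch of the same idea rather than the original code. It assumes Python 3 and counts words in a short inline sample string instead of downloading a book (urllib2 is the Python 2 module; its Python 3 counterpart is urllib.request). The function name `count_words` and the sample text are made up.

```python
import re

def count_words(text):
    """Build a hash table (dict) mapping each word to its number of occurrences."""
    # Lowercase the text and keep runs of letters and apostrophes,
    # which drops punctuation and digits.
    words = re.findall(r"[a-z']+", text.lower())

    table = {}
    for word in words:
        if word in table:
            table[word] += 1   # seen before: add one to its count
        else:
            table[word] = 1    # first occurrence
    return table

sample = "The cat saw the dog. The dog saw the cat, too."
word_table = count_words(sample)

# Most frequent words first.
ranked = sorted(word_table.items(), key=lambda item: item[1], reverse=True)
```

In practice `collections.Counter` does the same bookkeeping in one call; the explicit loop is kept here because it mirrors the description above.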

The first part of your question doesn't sound so bad. All you basically need to do is read each word from the file (or stream, or whatever) and place it into a prefix tree, and each time you happen upon a word that already exists, you increment the value associated with it. Of course, you would also have an ignore list of everything you'd like left out of your calculations.

If you use a prefix tree, you ensure that finding any word takes O(N), where N is the maximum length of a word in your data set. The advantage of a prefix tree in this situation is that if you want to look for plurals and stemming, you can check in O(M+1) if that's even possible for the word, where M is the length of the word without stem or plurality (is that a word? hehe). Once you've built your prefix tree, I would reanalyze it for the stems and such and condense it down so that the root word is what holds the results.
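A bare-bones sketch of such a prefix tree with counts and an ignore list might look like this. The class and method names (`TrieNode`, `WordTrie`, `add`, `lookup`) are made up for illustration, and the condensing of stems described above is left out.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the next TrieNode
        self.count = 0      # how many times a word ended at this node

class WordTrie:
    def __init__(self, ignore=()):
        self.root = TrieNode()
        self.ignore = set(ignore)  # words left out of the counts

    def add(self, word):
        """Insert a word, or increment its count if it is already present."""
        if word in self.ignore:
            return
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def lookup(self, word):
        """Return the count for a word in time proportional to its length."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

trie = WordTrie(ignore={"the"})
for w in "the cat chased the cats and the cat".split():
    trie.add(w)

singular = trie.lookup("cat")
plural = trie.lookup("cat" + "s")  # one extra step beyond the base word
```

Checking for a plural is just a lookup of the base word plus one more character, which is where the O(M+1) figure comes from.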

Another option for the semantic analysis could be modeling each sentence as a tree of relationships between subject, verb, etc. (a sentence has a subject and a verb, a subject has a noun and an adjective, and so on). Once you've broken all of your text up in this way, it seems like it might be fairly easy to run through and get a quick count of the different appropriate pairings that occurred.

You can use the WordNet dictionary to get basic information about the question keyword, such as its part of speech, and to extract synonyms. You can also do the same for your document to create an index for it. Then you can easily match the keyword against the index file, rank the document, and then summarize it.

If the list of topics is pre-determined and not huge, you may even go further: build a classification model that will predict the topic. Let's say you have 10 subjects. You collect sample sentences or texts. You load them into another product: Prodigy. Using its great interface, you quickly assign subjects to the samples. And finally, using the categorized samples, you train the spaCy model to predict the subject of the texts or sentences.

Zipf's law is named after the American linguist George Kingsley Zipf,[3][4][5] and is still an important concept in quantitative linguistics. It has been found to apply to many other types of data studied in the physical and social sciences.

In mathematical statistics, the concept has been formalized as the Zipfian distribution: a family of related discrete probability distributions whose rank-frequency distribution is an inverse power law relation. They are related to Benford's law and the Pareto distribution.
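For reference, the inverse power law mentioned here is usually written as follows. The symbols are a common textbook notation chosen for this note, not taken from the text above: r or k is the rank, s > 0 is the exponent (close to 1 for word frequencies in natural-language text), and N is the number of distinct items.

```latex
% Empirical form: the frequency of the item at rank r
f(r) \propto \frac{1}{r^{s}}

% Normalised Zipfian distribution over ranks k = 1, \dots, N
P(k; s, N) = \frac{k^{-s}}{H_{N,s}},
\qquad
H_{N,s} = \sum_{n=1}^{N} n^{-s}
```

With s = 1 this says the second most frequent word occurs about half as often as the most frequent one, the third about a third as often, and so on.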

Zipf's law had been discovered before Zipf,[a] by the French stenographer Jean-Baptiste Estoup in Gammes Stenographiques (4th ed.) in 1916,[7] by G. Dewey in 1923,[8] and by E. Condon in 1928.[9]

The same relationship was found to occur in many other contexts, and for other variables besides frequency.[1] For example, when corporations are ranked by decreasing size, their sizes are found to be inversely proportional to the rank.[12] The same relation is found for personal incomes (where it is called the Pareto principle[13]), the number of people watching the same TV channel,[14] notes in music,[15] cells' transcriptomes,[16][17] and more.

In 1992, bioinformatician Wentian Li published a short paper[18] showing that Zipf's law emerges even in randomly generated texts. It included a proof that the power law form of Zipf's law was a byproduct of ordering words by rank.

Although Zipf's law holds for most natural languages, and even some non-natural ones like Esperanto[21] and Toki Pona,[22] the reason is still not well understood.[23] Recent reviews of generative processes for Zipf's law include Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions",[24] and Simkin, "Re-inventing Willis".[25]

However, it may be partly explained by statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" with different lengths follow the macro-trend of Zipf's law (the more probable words are the shortest and have equal probability).[26] In 1959, Vitold Belevitch observed that if any of a large class of well-behaved statistical distributions (not only the normal distribution) is expressed in terms of rank and expanded into a Taylor series, the first-order truncation of the series results in Zipf's law. Further, a second-order truncation of the Taylor series results in Mandelbrot's law.[27][28]
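The random-text observation is easy to try out. The following is a rough simulation sketch, not Li's actual procedure: it assumes a tiny four-letter alphabet plus a space, a fixed seed for repeatability, and an arbitrary text length, and the function name `random_text_word_counts` is made up.

```python
import random
from collections import Counter

def random_text_word_counts(n_chars, alphabet="abcd", seed=0):
    """Draw characters uniformly from alphabet plus space, then count the 'words'."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choices(symbols, k=n_chars))
    return Counter(text.split())

counts = random_text_word_counts(200_000)

# Average count per distinct word, grouped by word length.
by_length = {}
for word, c in counts.items():
    by_length.setdefault(len(word), []).append(c)
mean_by_length = {
    length: sum(values) / len(values)
    for length, values in sorted(by_length.items())
}

# The most frequent "words" overall.
top_words = [word for word, _count in counts.most_common(4)]
```

With these settings the four single-letter "words" come out on top with roughly equal counts, and the average count drops steeply with each extra letter, which matches the macro-trend described above.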