Unless otherwise specified, the frequency lists linked from here count distinct orthographic words (not lemmas), including inflected and some capitalised forms. For example, the verb "to be" is represented by "is", "are", "were", and so on.
Frequency lists have many applications in the realm of second language acquisition and beyond. One use for such lists in the context of the Wiktionary project is as an aid in identifying missing high-frequency terms, which are thus, it is assumed, of high priority. Since English Wiktionary aims to be not just a mere database of lemmas but a multi-directional, multilingual dictionary aimed at English-speaking users, there are certain advantages to lists which include inflected forms as well. These forms reflect words as they are likely to be encountered, and thus as they may be used in lookup.
Feel free to add definitions for words on these lists if you know the languages involved! Even better if you can include usage citations and references. If you are involved in another non-English language edition of Wiktionary, you might also consider implementing or expanding on this idea, if there is not already something similar in place. If you see a word in this list that is clearly out of place (wrong language, punctuation, superfluous capitalisation), you are welcome to remove it. While creating entries for words, please leave valid bluelinks in place as these pages may be copied for use with other language projects in the future.
However, this system is far from perfect due to the variable quality of the source data and the automated nature of the processing. A word's presence in any of these lists is therefore merely an invitation for further investigation as to whether an entry is warranted. Please be mindful that there will be many words and collocations which may or may not warrant their own individual entries, and not necessarily in the exact form they appear here. As an aid to navigating these lists, consider enabling the OrangeLinks.js gadget to reveal headword pages which exist (and so will still show a blue link) but which do not yet contain an entry for the relevant language. Please be mindful too that not all of the resources listed here are suitable for use directly in Wiktionary, mainly due to problems with licensing compatibilities.
You are welcome, and thank you. I stumbled upon Wiktionary when I was looking for word lists, and I know it's not easy to find many free / extensive sources. Since I only frequented English Wiktionary, I only posted links on the English version of the page.
Well, my overall subtitle corpus for all languages was a 53 GB compressed archive. Unfortunately I deleted everything except the original archive. Let me open it and I can at least give you an idea of the number of files. Based on my tests, frequency lists generated from a decent amount of data should be comparable. I assure you that there were a lot more entries than the 50k I used and provided for download.
I suggest you lemmatize your wordlists rather than presenting them only as word forms (group verb forms walks, walked under walk_V, and noun forms a walk, the walks under walk_N), and similarly for the other languages. Here is an overview of software to do so: _Grammar
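As a rough illustration of what that grouping would look like (the lemma map below is a hypothetical hand-written stand-in for what a real lemmatizer, such as one of the tools linked above, would produce per language):

```python
from collections import defaultdict

# Hypothetical miniature lemma map; a real lemmatizer would supply
# these form-to-lemma mappings for each language.
LEMMAS = {
    "walks": "walk_V",
    "walked": "walk_V",
    "walking": "walk_V",
}

def lemmatize_counts(form_counts):
    """Collapse per-form frequencies into per-lemma frequencies.

    Forms not covered by the lemma map are kept as-is.
    """
    lemma_counts = defaultdict(int)
    for form, count in form_counts.items():
        lemma_counts[LEMMAS.get(form, form)] += count
    return dict(lemma_counts)
```

For example, `lemmatize_counts({"walks": 10, "walked": 7, "cat": 3})` folds the two verb forms into a single `walk_V` entry with count 17.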
Well, I can do that, but the word lists I consume are for a keyboard app and I need raw words to match user input. In fact, when I started I came across a few word lists and could not use them for my requirement: depending on user input I would want to show "walked" in my app, and lemmatised word lists would make loading a lot slower.
Well, the format I used for the word lists is a sort of generic one I found around. It goes like this:
word1 wordfrequency1
word2 wordfrequency2
word3 wordfrequency3
The word comes first, followed by its frequency as a number, with a space in between.
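A minimal sketch of reading that format, assuming exactly one `word frequency` pair per non-empty line:

```python
def parse_frequency_list(lines):
    """Parse 'word frequency' lines into (word, count) pairs.

    Splits on the last space so the count is always the final field.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, _, freq = line.rpartition(" ")
        entries.append((word, int(freq)))
    return entries
```

So `parse_frequency_list(["the 123", "of 99"])` yields `[("the", 123), ("of", 99)]`.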
You were correct. I re-ran the word list generator a couple of times and I found the mistake I made in computing the total count. The other details are correct; however, the total word count came to 765703147 and not 690788712769.
The total word count is the count used for the frequency list.
The overall word count was the actual word count; some words had junk characters or were of length 1 and were ignored, hence the lower total word count.
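The distinction could be sketched like this (the exact junk filter is an assumption; here anything of length 1 or containing non-letter characters is dropped):

```python
from collections import Counter

def count_words(tokens):
    """Return (overall_count, total_count, frequencies).

    overall_count: every token seen in the corpus.
    total_count:   only tokens kept for the frequency list, i.e. after
                   dropping length-1 tokens and tokens with junk characters.
    """
    freqs = Counter()
    overall = 0
    for tok in tokens:
        overall += 1
        # Assumed junk filter: single characters and non-alphabetic tokens.
        if len(tok) <= 1 or not tok.isalpha():
            continue
        freqs[tok] += 1
    total = sum(freqs.values())
    return overall, total, dict(freqs)
```

On `["a", "the", "the", "x1", "cat"]` the overall count is 5 but the total count is 3, since "a" and "x1" are filtered out.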
Luigi,
The word lists I have generated ignore one-letter words like "a" and "i". It's difficult to validate a single-character word across multiple languages unless you know the language or can spend time tuning the rules per language. I know a bit about this, as I have done something similar for accents across various Latin-based European languages. If you really want one, I can generate a one-off and email it to you.
Having said that, I will try to generate torrent files: one that references all the 50k zips and another that references all the full zips. Once I generate these, I will update this page with the torrent files.
Well, I used to have an excellent package which would give me tons of bandwidth and allow me to host a couple of gigs of data; however, I was not using it. I don't even know if it still works (actually, I will check in a bit). Eventually I moved my email hosting to Microsoft Live a while back and moved my web hosting there as well. Worse is WordPress: they allow you to upload tons of things, including movies, but not zipped files.
Well, I have a bit of C# code that churns through files. What format of files do you have? Are they UTF-8 / Unicode text files? Are they XML files? I have two sets of routines: one deals with data in text files and another with specialised XML files.
Did you really get the Chinese word list to work? I downloaded zh_50K.txt, but no matter what options I choose when opening it in Microsoft Word and OpenOffice (both on Mac), it just displays corrupted characters. Any way around this?
So, I need a list of English words (probably around 4000 words) to create a database that will be used for the add-in. I intend to share the add-in with the public; it is freeware and open source. Can I get your permission to use the list of English words from your word frequency list?
Thank you.
I am using some of the English word lists as a marker against a dictionary word list in my word game to determine the difficulty level for individual words. It is an indirect use of your work. I would like to know how I can acknowledge/credit you?
Dave, thanks for your work; it can be put to so many uses. I have recently learned that the creator of a smartphone keyboard (they all use word lists for prediction/validation) used your lists (as one input among others). I have now noticed that there seems to be an unusually high number of words that are incorrectly spelt in lower case instead of with capitals. In many languages this only affects proper nouns (which is bad enough), but for languages which use capitals for regular nouns (like German), your list needs a lot of cleaning up before it can be relied on.
So my question is: do you do any processing that can cause this effect, or are all these errors really in the subtitle files?
Feel free to answer by e-mail if you like.
For that reason, I force all words to lower case to build the frequency word list. I myself used it in Slydr (a keyboard-like app on Windows Phone) that I created last year. I can look further, but again, without language-specific input, I am helpless.
You make a fine point. It's easy to rework the code to compute frequencies in lower case but persist the case-specific word. I will, however, need a few days. Thanks for persisting and pushing your logic in a clear manner.
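One way to do that, sketched in Python rather than the C# actually used: count case-insensitively, but report each word in its most frequent surface casing, so that e.g. German nouns keep their capital.

```python
from collections import Counter, defaultdict

def case_aware_frequencies(tokens):
    """Count word frequencies case-insensitively, but emit each word
    in its most common original casing."""
    casings = defaultdict(Counter)
    for tok in tokens:
        casings[tok.lower()][tok] += 1
    result = {}
    for variants in casings.values():
        # Pick the surface form seen most often as the canonical spelling.
        best, _ = variants.most_common(1)[0]
        result[best] = sum(variants.values())
    return result
```

For `["Haus", "Haus", "haus", "der"]` this yields a count of 3 attributed to the capitalised "Haus", since that casing dominates.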
Thanks a lot for the lists, Hermit! Now, three brief questions:
1) What version of the OpenSubtitles corpus did you use? Did you use only this source for the lists?
2) In the end, did you use all the available translations for each movie or just one?
3) For some reason, the Hebrew list seems to have a high degree of dissimilarity with equivalent lists from purely written language. Any idea why?
That is correct. However, I tend to rebuild them all at the same time, and I am sure I did add single-character entries after some discussion here. Let me check it tomorrow and, if required, rerun the code again.
Hi Dave,
Thanks for the frequency lists! I am writing a little class term paper (totally nonbinding) and need to explain the source of the Bulgarian corpus (besides the info you put in the log text file). Do you know where you took the Bulgarian frequency lists from? Was it from the Bulgarian National Corpus (which is written), or did you also get data from a spoken corpus? How do I cite your frequency lists? Thanks for your help!
Hi Dave. Awesome list. For each corpus did you only use subtitles for movies that were in their native language (i.e., only French film subtitles like Amélie for the French corpus)? Or did you also include subtitles that were translated from different languages?
How did you do these? Did you use a script for it? I ask because I would like to find somewhere, or make myself, such lists, but for a specific purpose: narrow subjects like the most frequent words for nurses, lawyers, construction workers and other such groups. I would very much appreciate any help in establishing such lists, or any tools or advice on how to do it in a relatively easy, fast and cheap way.
Hoping to hear from you soon, I wish you all the best.
Thank you very much for your app. I have finally succeeded in running it; it worked this time. All the .txt files were scanned and I got a frequency list of all of them. It will facilitate my teaching job a lot! Would it be possible in any way to use it for PDF documents, as I have a lot of books in PDF format, or to make a frequency list from some websites? For instance, I would like to prepare frequency lists for my students from some online journals like Le Figaro or Le Monde.
Glad it worked. The problem with PDF is manifold: it can be text + image, image only, etc., which is difficult to work out. The easier solution is to extract the text from the PDF and operate on the extracted data.
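A sketch of that split, assuming the text has already been pulled out of the PDF by an external extractor (for example the `pdftotext` utility from poppler); this function only handles the counting once plain text is available:

```python
from collections import Counter
import re

def frequency_list_from_text(text, top=50):
    """Build 'word frequency' lines from already-extracted plain text.

    Tokenisation here is a simple assumption: runs of Unicode letters,
    lower-cased, with digits and punctuation discarded.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return [f"{w} {n}" for w, n in counts.most_common(top)]
```

For example, `frequency_list_from_text("Le monde, le figaro. Le monde!")` produces `["le 3", "monde 2", "figaro 1"]`, in the same `word frequency` format as the downloadable lists.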