Huge dataset of common words


beforever

Oct 17, 2013, 10:20:07 AM
to sporcle-u...@googlegroups.com
I have an idea for a new series of quizzes, but I lack the raw data needed for it. I basically want 5,000, 10,000 or some other figure of very common English words - maybe the commonest ones, or just a random sample of everyday ones, it doesn't matter.

So if anybody knows of any sites with such a list that they could direct me to, or has spreadsheets on their computer they'd be willing to share with me, I'd be so, so, so grateful for the help.

Fusty

Oct 17, 2013, 10:53:25 AM
to sporcle-u...@googlegroups.com
I've had an idea in my head for a while but have never found a good source. I might email you, though, just in case it's along the same lines.

TheCleverone

Oct 17, 2013, 12:47:40 PM
to sporcle-u...@googlegroups.com
Is there anyone at dictionary.com that you might be able to e-mail? I suppose it's worth enquiring if you can find a lead.

Mateo56

Oct 17, 2013, 1:23:47 PM
to sporcle-u...@googlegroups.com
Funnily enough, I tried to find exactly what you seem to be looking for a few months ago, again without success.

stanford0008

Oct 17, 2013, 2:24:31 PM
to sporcle-u...@googlegroups.com
I believe this should work: http://norvig.com/google-books-common-words.txt

(It's the source rockgolf used for this quiz: http://www.sporcle.com/games/rockgolf/not-with-a-bang)
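
For anyone who wants to pull the first few thousand entries out of a list like that, here is a minimal sketch. It assumes each line holds a word and a count separated by whitespace, which should be checked against the actual file. The function name `top_words` and the sample data are made up for illustration.

```python
def top_words(lines, n):
    """Return the n highest-count words from 'word count' lines."""
    entries = []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than guessing at them.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        entries.append((parts[0].lower(), int(parts[1])))
    # Sort by count, highest first, in case the input is not already ordered.
    entries.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in entries[:n]]


# Made-up sample in the assumed format (these are not real counts).
sample = ["THE\t500", "OF\t300", "", "broken line here", "AND\t400"]
print(top_words(sample, 2))
```

With a downloaded copy of the list, the same function could be fed an open file object instead of the sample list.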

beforever

Oct 17, 2013, 7:13:37 PM
to sporcle-u...@googlegroups.com
That is really cool. Thanks so much, stanford.

Man, things start getting really weird once you scroll about two-thirds of the way down that list.

Amazingjosh Sporcle

Oct 17, 2013, 7:20:33 PM
to sporcle-u...@googlegroups.com
And now I have an idea: "which word is more common?" Unless that's beforever's idea, I may follow through on that.

beforever

Oct 17, 2013, 7:38:51 PM
to sporcle-u...@googlegroups.com
Nope, not my idea. Yours sounds great!

iglew

Oct 17, 2013, 11:16:23 PM
to sporcle-u...@googlegroups.com
There's no such thing as an absolute list of most common words.  "Most common" is a function of what sample you're using.  Even if it were possible to define what constitutes the entire language, it would be impossible to count all of it, so we have to rely on some sort of sampling. 
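
A toy illustration of that point, as a rough sketch only: the two "samples" below are invented sentences standing in for a written and a spoken corpus, and `most_common_words` is a hypothetical helper, not part of any corpus tool.

```python
from collections import Counter
import re


def most_common_words(text, n):
    """Return the n most frequent words in text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]


# Invented stand-ins for two different kinds of sample.
written = "the court held that the statute applies to the case"
spoken = "yeah you know I mean you just go and you see"

print(most_common_words(written, 1))
print(most_common_words(spoken, 1))
```

The top word differs between the two samples, which is the whole problem in miniature: the ranking describes the sample, not the language.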

Two of the most respected word corpora are the Oxford English Corpus and BYU's Corpus of Contemporary American English.  Both are available with a paid subscription, though partial samples are available from their sites.

Google has recently made available word lists based on its Google Books collection. It has a lot of built-in biases -- in particular, it's heavily tilted toward written language as opposed to spoken, but it also tilts toward the sort of books which are more likely to be digitized by Google.  I've noticed that people are starting to use Google's list more and more, and to treat it as if it's a universal standard, not because it's actually a better source (which it's not) but simply because it's free.  Such, alas, is the Internet.
Message has been deleted

Ubbiebubbie

Oct 18, 2013, 9:24:25 AM
to sporcle-u...@googlegroups.com
While going through that one, I happened to see "health-care" being repeated four times in a row.

On Friday, October 18, 2013 12:25:43 AM UTC-4, RedBengalTiger (Funnyfavorer101) wrote:
I like this one: 5,000 Most Common Words

puckett86

Oct 18, 2013, 9:38:51 AM
to sporcle-u...@googlegroups.com
iglew is right. There is no such thing as "most common words". There has to be a qualifier to go along with that. For example, "most common words in the Bible". That makes sense because the words in the Bible are generally agreed upon and there's a finite number of them. I've seen lists of most common words that are taken from thousands of documents. That type of list would be somewhat reliable because it comes from a finite source, but still incomplete.

Both lists posted on here are questionable. Ubbiebubbie already gave a good example of the second list's dubiousness. Just glancing over the first one, I see "s", "de", and "ii" are all in the first couple hundred words. The third word isn't even a word at all; it's a Roman numeral.
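
A rough sketch of how entries like "s" and "ii" could be screened out before a list is used for a quiz. Everything here is an assumption for illustration: the helper name `looks_like_word`, the rule that only "a" and "i" count as one-letter words, and the small hand-kept exception set. It will not catch foreign words such as "de", and it also drops hyphenated entries, so it is a first pass at best.

```python
import re

# Strict Roman-numeral shape (1 to 3999), lower case.
ROMAN = re.compile(r"^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")
ONE_LETTER_WORDS = {"a", "i"}
# Real words that happen to be valid numerals; extend by hand as needed.
NUMERAL_LOOKALIKES = {"mix"}


def looks_like_word(token):
    """Heuristic check: True if token is plausibly an ordinary word."""
    token = token.lower()
    if not token.isalpha():
        return False  # drops digits, hyphenated forms, punctuation
    if len(token) == 1:
        return token in ONE_LETTER_WORDS
    if ROMAN.match(token) and token not in NUMERAL_LOOKALIKES:
        return False  # multi-letter Roman numerals such as "ii"
    return True


tokens = ["the", "s", "de", "ii", "a", "mix", "health-care"]
kept = [t for t in tokens if looks_like_word(t)]
print(kept)
```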

beforever

Oct 18, 2013, 10:11:49 AM
to sporcle-u...@googlegroups.com
For the purposes of my idea, what's been linked here has fulfilled my needs already. So I'm grateful for stanford0008's one, and funnyfavorer101's will be good as a back-up. My quizzes won't focus so much on the commonality of words... I really just needed a great, big list of different words, on the order of thousands or tens of thousands.

To those who are using them to feature common usage of words, though - I would heed carefully the points brought up by iglew and puckett86 if they're relevant to your ideas. This stuff is tricky.

Good luck to y'all, whatever projects you'll be working on!

puckett86

Oct 18, 2013, 10:18:09 AM
to sporcle-u...@googlegroups.com
beforever: If you just need a list of words, I'd suggest the Scrabble Tournament Word List or something like that. This is what I use (warning: zip file):
http://www.isc.ro/lists/twl06.zip

beforever

Oct 18, 2013, 10:28:02 AM
to sporcle-u...@googlegroups.com
I also wanted them to be everyday words... excluding really obscure ones. There's no way to judge what's 'everyday' objectively other than knowing the vocabulary that I know, but... yeah...

I'll keep your .zip as well, 'cause I think there could be interesting quizzes made using that. It's a really cool list - thanks!

Sammie Wiegand

Oct 18, 2013, 10:49:49 AM
to Sporcle University
Yes, while the Scrabble words are undoubtedly very interesting/thorough, they do tend to be very obscure. I don't know how my dad remembers them all!


--
You received this message because you are subscribed to a topic in the Google Groups "Sporcle University" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sporcle-university/pHeMcxRlG7k/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sporcle-univers...@googlegroups.com.
To post to this group, send email to sporcle-u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sporcle-university/307e5ae2-5e90-4c82-bf2b-c745411e0dcf%40googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.



--
√-1 23 π . It was delicious.

Fusty

Oct 18, 2013, 11:43:25 AM
to sporcle-u...@googlegroups.com
On the topic of words, is a demonym (American, Chinese) a proper noun? I'm writing a description of what I am not including in a certain quiz and wasn't entirely sure.

My English language skills aren't so good, just to let you know.

TheCleverone

Oct 18, 2013, 3:36:19 PM
to sporcle-u...@googlegroups.com
Demonyms are always written with a capital letter in English, whether it's French, Spanish, Kyrgyz or Sporcle-ese (even made-up demonyms like Sporcle-ese would count). However, whether they are classified as proper nouns is a different question. I think that they are, and a quick Googling agrees with me, but I'm sure an expert may correct me (what defines a proper noun is the only area of grammar that I am fuzzy on, really).

Sammie Wiegand

Oct 18, 2013, 7:53:22 PM
to Sporcle University
I have some experience on the subject, but someone correct me if I'm wrong.

I think that they are proper nouns in the same way that their countries would be proper nouns; while that type of word wouldn't usually be a proper noun, given the relationship between the country and the demonym, I think they are proper nouns. You should do some more research if you're really interested in the topic.



iglew

Oct 21, 2013, 8:44:41 PM
to sporcle-u...@googlegroups.com
If you just want a large collection of relatively common words and you don't require it to be comprehensive, you might get good use from the free sample file derived from BYU's COCA.  It includes every 10th word up through the top 60,000, so you get 6,000 words.  It comes in Excel with various data attached, which makes it easy to sort.
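
The arithmetic behind that sample can be sketched as below. This only shows the general idea of keeping every 10th entry from a ranked list. Exactly which ranks the real sample file keeps is an assumption here, and `every_nth` and the placeholder word list are made up.

```python
def every_nth(ranked_words, n, limit):
    """Keep every n-th entry among the first `limit` ranked words."""
    # Starting the slice at n - 1 picks ranks n, 2n, 3n, ... (counting from 1).
    return ranked_words[n - 1:limit:n]


# Placeholder ranked list: "word1" is rank 1, "word2" is rank 2, and so on.
ranked = [f"word{rank}" for rank in range(1, 100_001)]
sample = every_nth(ranked, 10, 60_000)

print(len(sample))
print(sample[0], sample[-1])
```

Because the entries are spread evenly across the ranking, a sample like this still covers both very common and fairly rare words, just with gaps.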

I've occasionally made use of that sample to do selective searches when looking for words for a language quiz (this one, for example), though it's not a final source for the quiz.

(Some day I'll probably give in and buy a paid subscription, but I'm kind of a language geek.)