I wrote a post earlier about brainstorming models for language processing, and I'd like to clarify the content.
All words or entities usable online constitute a very large set. Any given body of words (a speech, blog post, news article, or book) contains a subset of that set, with each word occurring a specific number of times. One should easily be able to generate an alphabetical list of all words used in a given article, along with the number of times each word is used in it. Merely computing this information wouldn't be particularly interesting. To use word-usage counts to say something interesting about a given text, you'd need some sort of baseline to compare the usage against, to filter out the noise and see which words are used more often than would be expected.
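As a rough illustration of the alphabetical word list described above, here is a minimal Python sketch. The tokenizer (lowercase runs of letters and apostrophes) and the function names are assumptions made for the example, not part of any existing tool.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how many times each word occurs in text (case-insensitive)."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def alphabetical_listing(counts):
    """Return (word, count) pairs sorted alphabetically by word."""
    return sorted(counts.items())

# Toy article text, invented for illustration.
article = "The cat sat on the mat. The mat was flat."
listing = alphabetical_listing(word_counts(article))
```

Printing `listing` gives one (word, count) pair per unique word, which is the raw material for everything that follows.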
There is no ultimate standard which sets how often words are expected to be used, so an artificial comparison is necessary in order to identify interesting things about how often a text (or blog, or politician, or whatever) uses particular words. In order to make useful comparisons about how often words are used, a baseline of normalized usage would have to be established first, perhaps by inventing a usage-likelihood score among various bodies of text. (I understand similar things exist for phonemes and letters.)
For example, one could take all fifth grade reading books in the US, or all New York Times front page articles, and make a database of the text. Every unique word could get its own entry (row?) in the database, with the number of times that it appears associated with it. With just this information, one could calculate the usage-likelihood for every word within this defined context. You could define the usage-likelihood as perhaps "occurrences of the word divided by total words in the corpus", and that would give a quantitative linguistic definition for a social context. Another way of expressing usage-likelihood would be, for example: In The New York Times, one can expect one out of every 138 words to be "Bush", one out of 27 to be "the", and one out of 655,987 words to be "Wonderlich". (I made those up.)
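A minimal sketch of that calculation, assuming usage-likelihood means a word's occurrences divided by the total number of words in the corpus. The counts below are invented, just like the numbers in the paragraph above, and the function names are hypothetical.

```python
from collections import Counter

def usage_likelihood(corpus_counts):
    """Map each word to (occurrences of that word) / (total words in corpus)."""
    total = sum(corpus_counts.values())
    return {word: n / total for word, n in corpus_counts.items()}

def one_in_n(likelihood):
    """Express a likelihood as 'one out of every N words', rounded to an integer N."""
    return round(1 / likelihood)

# Toy counts for a 1,000-word corpus, invented for illustration.
corpus_counts = Counter({"the": 50, "bush": 10, "wonderlich": 1, "other": 939})
likelihood = usage_likelihood(corpus_counts)
```

Here `one_in_n(likelihood["bush"])` expresses the same number in the "one out of every N words" form used above.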
After defining (and normalizing) the likelihood that words appear in text, you could start making comparisons between bodies of work, and creating interesting tag-cloudish visualizations of what distinguishes some text you'd like to analyze. You could build a widget for your blog that says "the following are the words that are more than 25% more likely to be used on this blog than they are to be used in New York Times cover stories", or, "here are recent news stories that also have similarly unlikely words used."
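The "more than 25% more likely" widget could be sketched roughly like this. It assumes both inputs are dictionaries mapping words to usage-likelihoods, and it simply skips words that never appear in the baseline, since their ratio is undefined; that choice, like the function name, is an assumption of the example.

```python
def overused_words(text_likelihood, baseline_likelihood, threshold=1.25):
    """Return {word: ratio} for words whose likelihood in the text is more
    than `threshold` times their likelihood in the baseline.

    Words missing from the baseline are skipped (an assumption made here),
    because dividing by a zero likelihood is undefined.
    """
    result = {}
    for word, p in text_likelihood.items():
        base = baseline_likelihood.get(word)
        if base and p / base > threshold:
            result[word] = p / base
    return result

# Toy likelihoods, invented for illustration.
blog = {"the": 0.05, "congress": 0.02, "cat": 0.01}
nyt = {"the": 0.05, "congress": 0.01}
flagged = overused_words(blog, nyt)
```

With these toy numbers only "congress" is flagged: "the" is used at the baseline rate, and "cat" has no baseline to compare against.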
For any given block of text, you could output a list of the words it uses that are most unexpected (again, compared to an artificial standard). This should enable automated calculation of linguistic deviance, which should be highly idiosyncratic, and should lead to all sorts of other interesting comparisons.
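One possible way to rank the "most unexpected" words is by the logarithm of the ratio between a word's likelihood in the text and its likelihood in the baseline. The sketch below uses add-one smoothing so that words absent from the baseline still get a finite score. Both the scoring formula and the smoothing are assumptions chosen for illustration, not an established definition of linguistic deviance.

```python
import math
from collections import Counter

def most_unexpected(text_counts, baseline_counts, top=10):
    """Rank words in the text by log(text likelihood / baseline likelihood).

    Add-one smoothing (an assumed choice) keeps words that are absent
    from the baseline from causing a division by zero.
    """
    vocab = set(text_counts) | set(baseline_counts)
    text_total = sum(text_counts.values()) + len(vocab)
    base_total = sum(baseline_counts.values()) + len(vocab)
    scores = {}
    for word in text_counts:
        p_text = (text_counts[word] + 1) / text_total
        p_base = (baseline_counts.get(word, 0) + 1) / base_total
        scores[word] = math.log(p_text / p_base)
    # Highest score first: these are the most over-represented words.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

# Toy counts, invented for illustration.
text_counts = Counter({"the": 5, "tortoise": 4})
baseline_counts = Counter({"the": 50, "and": 30})
ranking = most_unexpected(text_counts, baseline_counts)
```

In this toy case "tortoise" ranks first because it never appears in the baseline, while "the" gets a negative score because it is slightly less common in the text than in the baseline.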
I don't know how people usually do cloud visualizations, but if I were
making a word cloud, that's *precisely* what I would do --- i.e. this is
probably how people do it.
See:
http://en.wikipedia.org/wiki/TFIDF
http://en.wikipedia.org/wiki/Latent_Semantic_Indexing
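For reference, here is a minimal sketch of one common TF-IDF variant,
with term frequency taken as count divided by document length and
inverse document frequency as log(N / number of documents containing
the word). Other weightings exist; the function name and the toy
documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per
    document, using tf = count / document length and
    idf = log(N / number of documents containing the word)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            word: (n / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return weights

# Toy documents, invented for illustration.
docs = [["the", "senate", "votes"], ["the", "cat", "sat"]]
weights = tf_idf(docs)
```

A word that appears in every document, like "the" here, gets weight
zero, which is the "filter out the noise" effect the baseline idea is
after.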
Now, the thing is that word counts actually don't get you very much
information. Remember the days before Google: search engines gave you
back documents by matching words and returning the documents where your
search terms appeared most frequently. Then Google came along and
ranked documents differently, and we all saw how *awful* word frequency
was for determining relevance to a query.
So the question is what you would use word counts *for*. Clouds are
nice, but look for cases where words aren't exactly the appropriate
level of chunking to identify relevance. (And, you will see this in most
word clouds.) Articles back in 2004 about the Democratic ticket might
have used the word "John" an exceptional amount owing to the dynamic
duo's shared first name, but "John" in a word cloud isn't very
informative. You'd want to chunk whole names together, but that's a
difficult problem in itself.
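To make the chunking problem concrete, here is a deliberately naive
sketch that treats each run of consecutive capitalized words as one
chunk. It is a crude heuristic, not real named-entity recognition: it
also picks up sentence-initial words and misses names like "McCain".
The function name and sample text are invented for the example.

```python
import re
from collections import Counter

def chunk_capitalized(text):
    """Count runs of consecutive capitalized words as single chunks.

    Crude heuristic only: sentence-initial words are counted too, and
    names with internal capitals or punctuation are missed.
    """
    pattern = r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b"
    return Counter(re.findall(pattern, text))

# Toy text, invented for illustration.
text = "John Kerry picked John Edwards. Voters like John Kerry."
chunks = chunk_capitalized(text)
```

Here "John Kerry" and "John Edwards" are counted as units and a bare
"John" never shows up, but "Voters" sneaks in as a chunk, which shows
why this is harder than it looks.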
Note also for comparing documents that the frequency of a word isn't
very indicative of a word's prominence in a text, and if you have a
profile (i.e. vector) of word frequencies for two documents, it's not
immediately obvious how you would compare profiles to arrive at whatever
result you want. (Not to say there aren't ways to do it, but that there
are many ways to do it.)
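As one example among those many ways, cosine similarity compares the
direction of two frequency vectors while ignoring document length. The
sketch below assumes each profile is a dictionary from word to
frequency; it is an illustration of one option, not a recommendation
over the others.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two {word: frequency} profiles.

    Close to 1.0 when the profiles point the same way; 0.0 when they
    share no words (or when either profile is empty).
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy profiles, invented for illustration.
doc_a = {"john": 2, "kerry": 1, "ticket": 1}
doc_b = {"john": 4, "kerry": 2, "ticket": 2}
doc_c = {"cat": 3, "mat": 1}
```

Note that `doc_b` is just `doc_a` doubled, so the two score as
essentially identical, whereas `doc_a` and `doc_c` share no words and
score 0.0.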
--
- Josh Tauberer
- GovTrack.us
"Yields falsehood when preceded by its quotation! Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Godel, Escher, Bach" by Douglas Hofstadter)