more on word likeliness comparisons

John Wonderlich

Mar 9, 2008, 10:01:36 PM

to openhous...@googlegroups.com

More on Word Likeliness Comparisons

March 9th, 2008 by John Wonderlich

I wrote a post earlier about brainstorming models for language processing, and I'd like to clarify the content.

All words or entities usable online constitute a very large set. Any given body of words (a speech, blog post, news article, or book) contains a subset of that set, with each word occurring a specific number of times. It should be easy to generate an alphabetical list of all words used in a given article, along with the number of times each word is used. Merely computing this information wouldn't be particularly interesting. To use word-usage counts to say something interesting about a given text, you'd need some sort of baseline to compare the usage against, to filter out the noise and see which words are used more often than would be expected.

There is no ultimate standard which sets how often words are expected to be used, so an artificial comparison is necessary in order to identify interesting things about how often a text (or blog, or politician, or whatever) uses particular words. In order to make useful comparisons about how often words are used, a baseline of normalized usage would have to be established first, perhaps by inventing a usage-likelihood score among various bodies of text. (I understand similar things exist for phonemes and letters.)

For example, one could take all fifth grade reading books in the US, or all New York Times front page articles, and make a database of the text. Every unique word could get its own entry (row?) in the database, with the number of times that it appears associated with it. With just this information, one could calculate the usage likelihood for every word, within this defined context. You could define the usage-likelihood as perhaps "unique instances divided by total words in corpus", and that would give a quantitative linguistic definition for a social context. Another way of defining word-usage-likeliness would be, for example: In The New York Times, one can expect one out of every 138 words to be "Bush", one out of 27 to be "the", and one out of 655,987 words to be "Wonderlich". (I made those up.)
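As a minimal sketch of the usage-likelihood idea, the Python below counts how often each word occurs in a body of text and divides by the total number of words. The tokenizer, the function names, and the sample sentence are all assumptions made for illustration; a real corpus would need more careful word splitting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def usage_likelihood(text):
    """Map each word to (occurrences of the word) / (total words in the text)."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Made-up sample text standing in for a real corpus.
corpus = "the cat sat on the mat and the dog sat too"
likelihood = usage_likelihood(corpus)

print(likelihood["the"])       # "the" is 3 of the 11 words
print(1 / likelihood["sat"])   # the "one out of every N words" form
```

Inverting a likelihood, as in the last line, gives the "one out of every N words" phrasing used above.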

After defining (and normalizing) the likelihood that words appear in text, you could start making comparisons between bodies of work, and creating interesting tag-cloudish visualizations of what distinguishes some text you'd like to analyze. You could build a widget for your blog that says "the following are the words that are more than 25% more likely to be used on this blog than they are to be used in New York Times cover stories", or, "here are recent news stories that also have similarly unlikely words used."
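The "more than 25% more likely" widget could be sketched roughly as follows. This is only an illustration: the function names, the tiny word lists, and the decision to skip words that never appear in the baseline (where the ratio is undefined) are all assumptions, not part of the original proposal.

```python
from collections import Counter

def relative_freq(words):
    """Relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def overused_words(text_words, baseline_words, threshold=1.25):
    """Words whose relative frequency in the text exceeds `threshold` times
    their relative frequency in the baseline, most overused first."""
    text_freq = relative_freq(text_words)
    base_freq = relative_freq(baseline_words)
    ratios = {}
    for word, freq in text_freq.items():
        base = base_freq.get(word)
        if base is None:
            continue  # assumption: words unseen in the baseline are skipped
        ratio = freq / base
        if ratio > threshold:
            ratios[word] = ratio
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))

# Made-up word lists standing in for a blog and a baseline corpus.
blog = "congress votes on the bill and congress debates the bill".split()
baseline = "the senate and the house met while the congress waited on the bill".split()

overused = overused_words(blog, baseline)
print(overused)
```

Here "congress" and "bill" come out well above the threshold, while "the" does not, because it is proportionally more common in the baseline than in the blog text.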

For any given block of text, you could output a list of the words used that are most unexpected (again, compared to an artificial standard). This should enable automated calculation of linguistic deviance, which should be highly idiosyncratic, and lead to all sorts of other interesting comparisons.

--
John Wonderlich

Program Director
The Sunlight Foundation
(202) 742-1520 ext. 234

James Turk

Mar 9, 2008, 11:49:19 PM

to Open House Project

I just spent some time thinking about interesting ways to store the
data that would allow interesting comparisons to be made. I also
looked around to see what was out there.

One idea I had would be to define a set of sources (as granular as you
wish), each of which can also be a member of one or more broader
groupings.

For example let's say we had the following sources:

NY Times/Politics (member of: Politics, NY Times)
NY Times/Sports (member of: Sports, NY Times)
BBC/Politics (member of: Politics, BBC)
BBC/Sports (member of: Sports, BBC)
Project Gutenberg's Public Domain Books
OpenHouseProject (member of: Politics)

Initially we'd take a large sample of documents by pulling a large set
(as close to complete as we can get) from each source.

Words would be stored in a table with each row containing a word, a
source, and a number of occurrences. Instead of just giving each word
its own row in the database, each word would have its own row for each
source; these rows could then be combined as needed.

From this raw format it'd be simple to calculate relative usage per
source as well as usage per "domain" (i.e., in the example I defined,
to find out how well an article fits into the Politics "domain", it'd
be compared against the combination of BBC/Politics, NY Times/Politics,
and the OHP blog).
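A rough Python sketch of the word/source/count layout and the per-domain combination might look like this. The membership map follows the example source list; the row values, variable names, and function names are invented for illustration, and a real version would presumably live in a database table rather than a Python list.

```python
from collections import Counter

# Source -> groups it belongs to, following the example list of sources.
MEMBERSHIP = {
    "NY Times/Politics": {"Politics", "NY Times"},
    "NY Times/Sports": {"Sports", "NY Times"},
    "BBC/Politics": {"Politics", "BBC"},
    "BBC/Sports": {"Sports", "BBC"},
    "OpenHouseProject": {"Politics"},
}

# One row per (word, source, occurrences). The counts are made up.
rows = [
    ("senate", "NY Times/Politics", 40),
    ("senate", "BBC/Politics", 10),
    ("senate", "OpenHouseProject", 25),
    ("goal", "BBC/Sports", 30),
    ("goal", "NY Times/Sports", 20),
    ("the", "NY Times/Politics", 500),
    ("the", "BBC/Sports", 400),
]

def counts_for_group(rows, group):
    """Sum word counts over every source that is a member of `group`."""
    totals = Counter()
    for word, source, count in rows:
        if group in MEMBERSHIP.get(source, set()):
            totals[word] += count
    return totals

def relative_usage(totals):
    """Turn raw counts into each word's share of all counted words."""
    total = sum(totals.values())
    return {w: c / total for w, c in totals.items()} if total else {}

politics = counts_for_group(rows, "Politics")
print(politics["senate"])                  # 40 + 10 + 25
print(relative_usage(politics)["senate"])  # share of the Politics "domain"
```

Keeping one row per word per source means any grouping (a single paper, a topic, or everything) can be computed later without rescanning the documents.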

I'd imagine that the 'related stories' feature on many news sites uses
a similar concept, unless they are manually assigning related stories.
I think it'll be interesting to see what sort of things we can find
out with simple analysis on this word frequency data. (One obvious
example that occurs to me is finding out that paper X talks about Sen.
McCain Y% more than the average)

-James

Josh Tauberer

Mar 10, 2008, 8:04:29 AM

to openhous...@googlegroups.com

John Wonderlich wrote:
> After defining (and normalizing) the likelihood that words appear in
> text, you could start making comparisons between bodies of work, and
> creating interesting tag-cloudish visualizations of what distinguishes
> some text you'd like to analyze. You could build a widget for your blog
> that says "the following are the words that are more than 25% more
> likely to be used on this blog than they are to be used in New York
> Times cover stories", or, "here are recent news stories that also have
> similarly unlikely words used."

I don't know how people usually do cloud visualizations, but if I were
making a word cloud, that's *precisely* what I would do --- i.e. this is
probably how people do it.

See:
http://en.wikipedia.org/wiki/TFIDF
http://en.wikipedia.org/wiki/Latent_Semantic_Indexing

Now, the thing is that word counts actually don't get you very much
information. Remember back to the days before Google: search engines
gave you back documents by matching words and returning documents where
your search terms appeared most frequently. Then Google came along and
ranked documents differently and we all saw how *awful* word frequency
was for determining relevance to a query.

So the question is what you would use word counts *for*. Clouds are
nice, but look for cases where words aren't exactly the appropriate
level of chunking to identify relevance. (And, you will see this in most
word clouds.) Articles back in 2004 about the Democratic ticket might
have used the word "John" an exceptional amount owing to the dynamic
duo's shared first name, but "John" in a word cloud isn't very
informative. You'd want to chunk whole names together, but that's a
difficult problem in itself.

Note also for comparing documents that the frequency of a word isn't
very indicative of a word's prominence in a text, and if you have a
profile (i.e. vector) of word frequencies for two documents, it's not
immediately obvious how you would compare profiles to arrive at whatever
result you want. (Not to say there aren't ways to do it, but that there
are many ways to do it.)
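One of those many ways, shown here purely as an illustration and not as a recommendation from this thread, is cosine similarity between two word-count vectors. The function name and the sample sentences are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (dicts or Counters).
    Close to 1.0 means the same word proportions; 0.0 means no shared words."""
    shared = a.keys() & b.keys()
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up documents, reduced to word counts.
doc_a = Counter("the bill passed the house".split())
doc_b = Counter("the house passed the bill".split())
doc_c = Counter("goal scored late".split())

print(cosine_similarity(doc_a, doc_b))  # same words, different order
print(cosine_similarity(doc_a, doc_c))  # no words in common
```

Note that this measure ignores word order entirely and lets very common words dominate, which is one reason raw counts are often reweighted before comparison.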

--
- Josh Tauberer
- GovTrack.us

http://razor.occams.info

"Yields falsehood when preceded by its quotation! Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Godel, Escher, Bach" by Douglas Hofstadter)

John Wonderlich

Mar 10, 2008, 8:37:53 AM

to openhous...@googlegroups.com

The second Wikipedia link describes latent semantic indexing as very similar to what I described:

LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents, typically stemmed words that appear in the documents. A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the element of the matrix is proportional to the number of times the terms appear in each document, where rare terms are upweighted to reflect their relative importance.

And this:

There are many different formulas used to calculate tf–idf. The term frequency (TF) is the number of times the word appears in a document divided by the number of total words in the document.

...looks a whole lot like this: "You could define the usage-likelihood as perhaps 'unique instances divided by total words in corpus'".
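To make the comparison concrete, here is a small tf-idf sketch. The term frequency follows the definition quoted above (occurrences divided by total words in the document). The inverse document frequency used here, the natural log of the number of documents over the number of documents containing the word, is just one common variant and is an assumption, as are the sample documents and names.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute one tf-idf variant for a list of documents (each a list of words).
    Returns one {word: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # count each word once per document
    scores = []
    for words in documents:
        counts = Counter(words)
        total = len(words)
        scores.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores

# Made-up documents for illustration.
docs = [
    "the bill passed the house".split(),
    "the senate debated the bill".split(),
    "the game ended in a draw".split(),
]
scores = tf_idf(docs)

print(scores[0]["the"])    # appears in every document, so its score is 0
print(scores[0]["house"])  # appears in only one document, so it scores highest
```

The difference from the plain usage-likelihood idea is the second factor: the other documents in the collection act as the baseline, so words that appear everywhere are pushed toward zero.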

As to your point about the usefulness of word counts for search engines, I agree. I miss the lines and lines of white text on white background at the bottom of webpages seeking to game the search engines.

I don't think that search engines' former gamability should render "term frequency" an irrelevant tool for text evaluation; I think we should conclude that there are contexts where it is better suited than others. A search bot may need complex algorithms to infer meaning from term frequency, but a human user has those semantic algorithms already built in... We use heuristics to scan text and determine relevance constantly (especially as info-crammed net users), so why not deploy additional tools? The clip of text that accompanies a Google search result could be considered a similar content induction device: it doesn't pretend to summarize an entire page, but tries to add scrutability and help determine an entry's relevance.

If we use tag searches to aggregate trends and results, why not normalized word frequency counts to characterize texts?

louis...@gmail.com

Mar 12, 2008, 10:32:27 AM

to Open House Project

I was delighted to come across this thread as I've been thinking about
something quite similar (but rudimentary) over the last couple of days
for the historical Hansard prototype site
(http://hansard.millbanksystems.com/). We're experimenting with ways of
making 200 years' worth of Hansard, the official report of the UK
Parliament, available online.

We're using Solr (http://lucene.apache.org/solr/) for search, which
gives us some limited access to term frequency (tf-idf) statistics for
each speech in our database made by a member of parliament. These term
frequencies are used internally by Solr to generate search results and
to provide 'more like this' functionality. I have been trying to see
if we can get a nice tag-like 'frequently occurring term' set for each
debate using the tf-idf data.

In order to approach this, I've set up Solr with a large 'stop word'
list (indicating words that will be ignored for the purpose of
generating these term sets). I populated the stop word list with
common non-noun terms from written English word frequency lists
(imagining that generally nouns may convey 'aboutness' better). I also
added a boost for words appearing in the title of the debate.
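The stop-word and title-boost idea can be illustrated outside of Solr with a few lines of plain Python. This is not Solr configuration or Solr's scoring, and it uses raw counts rather than tf-idf; the stop-word list, the boost factor, the sample debate, and the function name are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "of", "to", "and", "that", "is", "be", "in", "this"}
TITLE_BOOST = 2.0  # hypothetical multiplier for words that appear in the title

def term_set(title, body, top_n=3):
    """Return the top_n non-stop-word terms in the body, ranked by count,
    with words that also appear in the title multiplied by TITLE_BOOST."""
    title_words = {w for w in title.lower().split() if w not in STOP_WORDS}
    counts = Counter(w for w in body.lower().split() if w not in STOP_WORDS)
    scores = {
        w: c * (TITLE_BOOST if w in title_words else 1.0)
        for w, c in counts.items()
    }
    # Highest score first; ties are broken alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]

# Made-up debate title and text.
title = "Corn Laws"
body = ("the corn laws harm the poor and the price of corn is high "
        "and bread is dear to the poor")

terms = term_set(title, body)
print(terms)
```

With this input the boosted title words "corn" and "laws" rank ahead of, or level with, "poor", even though "laws" occurs only once in the body.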

It's pretty early days, but the term sets produced this way seem
interesting enough to keep experimenting for a bit. One way I was
thinking of using these terms is to try to characterize periods of
time in our historical data set as a way of making it more browsable
without editorializing. We have a timeline you can use to drill down
from the decade level to the day level. At the moment it's quite
sparse, only really showing how much information we have on the site
for a period of time. It seems like it might be useful to show the
frequently occurring terms for a period as well to give a flavor of
the topics being discussed during that period.

Louise

Jennifer Bell

Mar 18, 2008, 2:41:28 PM

to Open House Project

I finished the O'Reilly book 'Programming Collective Intelligence' a
few days ago. It includes, among other things, a gentle introduction
to Bayesian filtering, SVM categorization, and likeness algorithms.
Just reading it triggered all sorts of ideas for potential
applications...

John Wonderlich

Mar 20, 2008, 11:01:26 AM

to openhous...@googlegroups.com

I just spotted Micah's post on techPresident that features a word counting / tag cloud tool.

This is very similar to what we were discussing, except that the baseline the words are compared against is either generic, or computed by simply ignoring some set of words.

John Wonderlich

Mar 24, 2008, 10:53:23 AM

to openhous...@googlegroups.com

I'm just realizing how useful this Hansard search tool is.

Compare it to our equivalent Congressional Record search. I know there are a lot of plans to upgrade access to the Congressional Record, and that digitizing back into the archives is costly, but the Hansard example (even though it's a work in progress) should serve as a good motivator and example of what we should be shooting for.
