It seems like a lot of web tools focus on search (e.g. google, dmoz), amassing a large amount of data and allowing users to tap into that source.
I always think it is cool to have a large amount of data (mostly links) and then push that data out in the form of news aggregation (google actually does a good job of this too with news.google.com, as does yahoo.finance; and then there are the user-driven favorites, reddit.com and digg.com).
My first question: does anyone have any knowledge of how news aggregation sites like news.google.com automatically push data out at the right time? With google it is probably easy; they can just go off whatever users are interested in at that particular time (looking at popular search terms and the associated links).
My real question, though: are there approaches for actually looking at page content and ascertaining whether a page contains interesting content? For example, a wikipedia article may have more text structure than a spam page that just has "viagra" displayed 500 times. Do you guys know any algorithms that would aid in determining good vs. bad content? There are Bayesian filters, but I don't know whether they would work that well for articles like the ones you might find on nytimes or wikipedia. Anyway, that is the project I am most interested in. I have spent about a year looking at ways to do this. I have a tool now that just reads RSS feeds and applies a simple set of metrics to decide whether an item is spam or not; a rough sketch of the kind of metric I mean is below.
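To give an idea of the kind of naive metric I mean, here is a rough Python sketch; the feed URL, the thresholds, and the feedparser usage are just placeholders for illustration, not my actual tool:

```python
# Rough sketch of a naive "is this item spam?" metric over an RSS feed.
# feedparser, the feed URL, and the thresholds are assumptions for illustration.
import re
from collections import Counter

import feedparser  # third-party: pip install feedparser

def is_probably_spam(text, max_repeat_ratio=0.2, min_words=50):
    """Flag text where a single word dominates (e.g. "viagra" repeated 500
    times) or where there is barely any text at all."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return True  # too little content to be a real article
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_repeat_ratio

feed = feedparser.parse("http://example.com/feed.rss")  # hypothetical feed
for entry in feed.entries:
    text = entry.get("title", "") + " " + entry.get("summary", "")
    print(entry.get("link"), "spam" if is_probably_spam(text) else "ok")
```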
And what I have found so far:
1. http://www.cs.uic.edu/~liub/WebMiningBook.html - this guy has done good work on text content mining.
2. http://svmlight.joachims.org/ - a support vector machine library; can be trained to determine whether content A is similar to content B.
3. http://en.wikipedia.org/wiki/Bayesian_spam_filtering (see the classifier sketch after this list)
4. Rocchio method - an algorithm for calculating the distance between two documents (see the similarity sketch after this list).
5. http://wordnet.princeton.edu/ - a lexical word database.
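To make items 2 and 3 concrete, here is a minimal sketch of a good-vs-bad text classifier. I'm using scikit-learn's naive Bayes purely as a stand-in for SVMlight or a hand-rolled Bayesian filter, and the tiny training corpus is made up for illustration:

```python
# Sketch of Bayesian filtering applied to "good article vs. junk" classification.
# scikit-learn is a stand-in library; the tiny training corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "a long, structured article about the history of the printing press",
    "encyclopedia entry describing the structure and function of DNA",
    "viagra cheap viagra buy now viagra viagra discount pills",
    "win money fast click here free prizes click here now",
]
train_labels = ["good", "good", "bad", "bad"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["detailed analysis of this quarter's economic numbers"]))
print(model.predict(["cheap pills buy now free free free"]))
```

With a real corpus (say, nytimes articles as "good" and known spam pages as "bad"), the same pipeline would just be fit on many more examples.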
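And for item 4, a rough sketch of the vector-space idea the Rocchio method builds on: represent documents as TF-IDF vectors and measure how close they are with cosine similarity. Again, scikit-learn is an assumed stand-in and the three documents are placeholders:

```python
# Sketch of document-to-document distance with TF-IDF vectors and cosine
# similarity, the same vector-space idea the Rocchio method builds on.
# The three documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "News aggregation sites rank stories by watching what users click on."
doc_b = "Aggregators decide which news stories to push by tracking user clicks."
doc_c = "viagra viagra cheap pills buy now"

vectors = TfidfVectorizer().fit_transform([doc_a, doc_b, doc_c])
sims = cosine_similarity(vectors[0], vectors[1:])
print("a vs b:", sims[0][0])  # related documents, higher similarity
print("a vs c:", sims[0][1])  # unrelated spam, similarity near zero
```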
I am just an amateur, not a researcher or anything, but it is still fun stuff.