It seems like a lot of web tools focus on search (e.g. google, dmoz), amassing a large amount of data and allowing users to tap into that source.
I always think it is cool to have a large amount of data (mostly links) and then push that data out in the form of news aggregation (google actually does a good job of this too with news.google.com, as does yahoo.finance; and then there are the user-driven favorites, reddit.com and digg.com).
My first question: does anyone have any knowledge of how news aggregation sites like news.google.com automatically push data out at the right time? With google it is probably easy; they can just go off whatever users are interested in at that particular time (looking at popular search terms and the associated links).
My real question, though: are there approaches for actually looking at page content and ascertaining whether a page contains interesting content? For example, a wikipedia article may have more text structure than a spam page that just has "viagra" displayed 500 times. Do you guys know any algorithms that would aid in determining good vs. bad content? There are Bayesian filters, but I don't know whether they would work that well for articles like the ones you might find on nytimes or wikipedia. Anyway, that is the project I am most interested in. I have spent about a year looking at ways to do this. I have a tool now that just reads RSS feeds and applies a simple set of metrics to decide whether an item is spam or not; a rough sketch of the kind of metric I mean is below.
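To give an idea of the kind of naive metric I mean, here is a rough Python sketch; the feed URL, the thresholds, and the feedparser usage are just placeholders for illustration, not my actual tool:

```python
# Rough sketch of a naive "is this item spam?" metric over an RSS feed.
# feedparser, the feed URL, and the thresholds are assumptions for illustration.
import re
from collections import Counter

import feedparser  # third-party: pip install feedparser

def is_probably_spam(text, max_repeat_ratio=0.2, min_words=50):
    """Flag text where a single word dominates (e.g. "viagra" repeated 500
    times) or where there is barely any text at all."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return True  # too little content to be a real article
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > max_repeat_ratio

feed = feedparser.parse("http://example.com/feed.rss")  # hypothetical feed
for entry in feed.entries:
    text = entry.get("title", "") + " " + entry.get("summary", "")
    print(entry.get("link"), "spam" if is_probably_spam(text) else "ok")
```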
And what I have found so far:
1. http://www.cs.uic.edu/~liub/WebMiningBook.html - this guy has done good work on text content mining.
2. http://svmlight.joachims.org/ - a support vector machine library; can be trained to determine whether content A is similar to content B.
3. http://en.wikipedia.org/wiki/Bayesian_spam_filtering (see the classifier sketch after this list)
4. Rocchio method - an algorithm for calculating the distance between two documents (see the similarity sketch after this list).
5. http://wordnet.princeton.edu/ - a lexical word database.
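To make items 2 and 3 concrete, here is a minimal sketch of a good-vs-bad text classifier. I'm using scikit-learn's naive Bayes purely as a stand-in for SVMlight or a hand-rolled Bayesian filter, and the tiny training corpus is made up for illustration:

```python
# Sketch of Bayesian filtering applied to "good article vs. junk" classification.
# scikit-learn is a stand-in library; the tiny training corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "a long, structured article about the history of the printing press",
    "encyclopedia entry describing the structure and function of DNA",
    "viagra cheap viagra buy now viagra viagra discount pills",
    "win money fast click here free prizes click here now",
]
train_labels = ["good", "good", "bad", "bad"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["detailed analysis of this quarter's economic numbers"]))
print(model.predict(["cheap pills buy now free free free"]))
```

With a real corpus (say, nytimes articles as "good" and known spam pages as "bad"), the same pipeline would just be fit on many more examples.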
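And for item 4, a rough sketch of the vector-space idea the Rocchio method builds on: represent documents as TF-IDF vectors and measure how close they are with cosine similarity. Again, scikit-learn is an assumed stand-in and the three documents are placeholders:

```python
# Sketch of document-to-document distance with TF-IDF vectors and cosine
# similarity, the same vector-space idea the Rocchio method builds on.
# The three documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "News aggregation sites rank stories by watching what users click on."
doc_b = "Aggregators decide which news stories to push by tracking user clicks."
doc_c = "viagra viagra cheap pills buy now"

vectors = TfidfVectorizer().fit_transform([doc_a, doc_b, doc_c])
sims = cosine_similarity(vectors[0], vectors[1:])
print("a vs b:", sims[0][0])  # related documents, higher similarity
print("a vs c:", sims[0][1])  # unrelated spam, similarity near zero
```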
I am just an amateur, not a researcher or anything, but it is still fun stuff.