Hi dudes,
Hope everybody had a great Christmas and New Year!
Back at the Hackfest I was chatting with a few of you about entity
extraction and also "Readability" style body text extraction.
It's still an area I am interested in and, as I wanted an excuse to
try out Scala on the new Heroku stack, I put together a little app ...
http://juicer.herokuapp.com/
TL;DR - You post a URL to the API, it extracts the page title, body
text, keywords, etc., then runs the lot through a named entity
extractor. Essentially Readability + Stanford NER. You can also just
post a load of plain text and it will do the NER on that.
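
A rough sketch of what a client call might look like in Python, in case it helps picture the two modes (URL or plain text). The endpoint path and the form field names (`url`, `text`) are assumptions made up for illustration, not taken from the app, so check the site for the real ones. The sketch only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

JUICER_BASE = "http://juicer.herokuapp.com/"
# Hypothetical path: the real API route is not given here.
HYPOTHETICAL_PATH = "api/extract"


def build_juicer_request(endpoint, url=None, text=None):
    """Build (but do not send) a POST carrying either a page URL or plain text.

    The field names "url" and "text" are assumed, not confirmed.
    """
    if (url is None) == (text is None):
        raise ValueError("pass exactly one of url or text")
    field = ("url", url) if url is not None else ("text", text)
    body = urllib.parse.urlencode([field]).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, method="POST")


# Mode 1: ask the service to fetch a page and extract from it.
req = build_juicer_request(JUICER_BASE + HYPOTHETICAL_PATH,
                           url="http://example.com/story")

# Mode 2: send plain text straight to the entity extractor.
text_req = build_juicer_request(JUICER_BASE + HYPOTHETICAL_PATH,
                                text="Matt met the BBC in London")
```

Sending either request would be a call to `urllib.request.urlopen(req)`; the shape of the response is not described here, so it is left out.
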
Please feel free to use it / hack it / break it / etc., with the caveat
that it's still a "toy" project at this stage. If you'd like to include
this type of functionality in any meta-meta APIs I'd be happy to get
involved there. As it runs on Heroku, you can boot up your own instance
for free in about 10 mins!
If anybody wants to install it or mess about with some Scala insanity
then the GitHub URL is ...
http://github.com/matth/juicer/
Cheers, and Happy New Year!
Matt