Is there anything for Python + NLTK that is like readability.js?


Emre Sevinç

May 27, 2010, 8:47:52 AM
to nltk-users
Hi,

I'm looking for a Python equivalent of Arc90's readability.js:

http://lab.arc90.com/experiments/readability/
http://lab.arc90.com/experiments/readability/js/readability.js

so that I can give it some input.html and get back a cleaned-up
version of that HTML. I want this so that I can use it on the server
side (unlike the JS version, which runs only in the browser).

Any ideas?

PS: I have tried Rhino + env.js. That combination works, but the
performance is unacceptable: it takes minutes to clean up most of the
HTML content :( (I still couldn't find out why there is such a big
performance difference).

Drush D'Costa

May 27, 2010, 9:45:27 PM
to nltk-users
If you are talking about cleaning up the HTML document,
such as removing the tags,
try the Beautiful Soup module for Python:
http://www.crummy.com/software/BeautifulSoup/
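
For the plain "remove the tags" case, here is a minimal sketch of the idea. Note that it is an assumption-laden stand-in: it uses the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything, and the names TextExtractor and strip_tags are made up for this example. It only pulls out visible text; it does not try to find the main article the way readability.js does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling strip_tags(open("input.html").read()) would give the text of the page with markup, scripts, and styles dropped; navigation menus and footers would still be in there.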

Also, please refer to pages 81-82 of the NLTK book by Steven Bird et al.


Hope it helps.



Christopher Crowner

May 27, 2010, 10:40:05 PM
to nltk-...@googlegroups.com

Try this:


Readability is awesome. html2text.py (its author, Aaron Swartz, also worked on reddit, which is in Python, BTW) may serve some of your needs, though without the precise functionality of readability.

I use BeautifulSoup to extract text and structured portions of web pages whose general structure I already know, but using it to reproduce the functionality of readability would be a chore, IMHO.
(I would like to dig into readability.js to see what is going on, and would love to see its functionality ported to Python!)
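
To give a feel for what such a port would have to do, here is a very rough sketch. To be clear, this is NOT the readability.js algorithm (I have not dug into it); it is a simple stand-in heuristic that keeps <p> blocks that are reasonably long and not mostly link text. The names ParagraphCollector and extract_main_text, and the thresholds min_chars and max_link_density, are all assumptions made up for illustration, and it uses only the standard library's html.parser.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Record, for each <p>, its text and how many characters sit inside links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # list of (text, link_chars) tuples
        self._buffer = None    # text chunks of the <p> being read, if any
        self._link_chars = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed previous <p> ends here
            self._buffer = []
            self._link_chars = 0
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)
            if self._in_link:
                self._link_chars += len(" ".join(data.split()))

    def _flush(self):
        if self._buffer is not None:
            # Normalize runs of whitespace to single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append((text, self._link_chars))
        self._buffer = None

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_chars=40, max_link_density=0.5):
    """Keep paragraphs that are long enough and not dominated by links."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    kept = [
        text
        for text, link_chars in collector.paragraphs
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n\n".join(kept)
```

Something this crude will miss content that is not in <p> tags and will need its thresholds tuned per site, which is part of why doing this properly is a chore.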

Document structure analysis, conversion, etc. (in this case the documents are web pages) is not part of NLP proper. NLTK does a good job with its CorpusReaders
(there is not one for web pages, though). Preparing inputs for NLP processing is generally an ad hoc, tedious, error-prone, etc. process, too ugly for NLP algorithms ;-)
I would love to be convinced I'm wrong on this.

You might find K. Summers's dissertation on Logical Document Structure, http://portal.acm.org/citation.cfm?id=866985, interesting as a framework for how documents
could be analyzed (of course, there are articles on analyzing web pages that may be more directly relevant to you).

Good luck



