You should look into Beautiful Soup.
> Can anyone tell me how to get the text from an HTML file? I am trying to
> display the text of an HTML file in a TextView (in Glade). If I display the
> file directly, it shows up with HTML tags, attributes, etc. in the TextView.
> I don't want that. I just want the text.
[Parent article is unavailable on gmane, so my reply isn't quite in
the right place in the tree]
I generally just use something like this:
from subprocess import Popen, PIPE
Popen(['w3m', '-dump', filename], stdout=PIPE).stdout.read()
I'm sure there are more complex ways...
--
Grant Edwards               grant.b.edwards at gmail.com
Yow! I'm having fun HITCHHIKING to CINCINNATI or FAR ROCKAWAY!!
E.g. using lxml.html:
import lxml.html as H
html = H.parse("the_html_file.html")
print(H.tostring(html, method="text", encoding="unicode"))
Stefan
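If neither w3m nor lxml is available, the same idea can be roughed out with only the standard library. This is a minimal sketch, not anything posted in the thread: the names TextExtractor and html_to_text are made up for illustration, and it assumes Python 3 and that collapsing all whitespace is acceptable for display in a TextView.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = "<html><body><p>Hello <b>world</b> &amp; friends</p></body></html>"
    print(html_to_text(sample))
```

It loses all layout (paragraph breaks, lists, tables), which w3m -dump preserves, so it is only a fallback for simple pages.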
For more complex parsing, Beautiful Soup is definitely the way to go.
However, if all you want to do is strip the HTML and keep all the
remaining text, I'd recommend the pyparsing package with this short script:
Why would a library that even the author has lost interest in be "the way
to go"?
Stefan
Emile
Sure, if the library is still being maintained. I can't think of too
many open-source projects where somebody else hasn't taken over from
the original author.
--
Grant Edwards               grant.b.edwards at gmail.com
Yow! I'm dressing up in an ill-fitting IVY-LEAGUE SUIT!! Too late...
Interesting, even the web site has had a revamp.
Nice - I like competition. ;)
Stefan