HTML to Text renderer

Ian Bicking

Nov 2, 2004, 6:07:03 PM
to pytho...@python.org
Does anyone know of a module that can render HTML to text? Just a
subset of HTML, really; I'd like to compose emails using <p> tags and
whatnot, fill in all the values in the email template, then apply word
wrapping and other formatting. Also, it'll make using Zope Page
Templates with email easier.

Even if all it supports is <p> and <br> that would be enough, but I'm
hoping there's something even more complete out there. I don't need
something as general as, say, Lynx; these templates would be written
with a specific renderer in mind.

Thanks.

--
Ian Bicking / ia...@colorstudy.com / http://blog.ianbicking.org

Robert Brewer

Nov 2, 2004, 6:43:59 PM
to Ian Bicking, pytho...@python.org
Ian Bicking wrote:
> Does anyone know of a module that can render HTML to text? Just a
> subset of HTML, really; I'd like to compose emails using <p> tags and
> whatnot, fill in all the values in the email template, then apply word
> wrapping and other formatting. Also, it'll make using Zope Page
> Templates with email easier.
>
> Even if all it supports is <p> and <br> that would be enough, but I'm
> hoping there's something even more complete out there. I don't need
> something as general as, say, Lynx; these templates would be written
> with a specific renderer in mind.

To clarify: you don't want the HTML tags merely stripped; you want to
replace e.g. br with a line break and p with, say, two line breaks?


FuManChu

Ian Bicking

Nov 3, 2004, 12:36:43 AM
to Robert Brewer, pytho...@python.org

Right. And word wrapping too. Some other tags would also be
interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
to control alignment (e.g., <p align="">).
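
A minimal sketch of the behavior described above, assuming Python 3 and only the standard library (html.parser and textwrap). The names SubsetRenderer and render are hypothetical, not an existing module. It covers <p> (with an align attribute), <br>, <blockquote>, and <hr>; <pre> and <table> are left out.

```python
import textwrap
from html.parser import HTMLParser


class SubsetRenderer(HTMLParser):
    """Render <p>, <br>, <blockquote> and <hr> to wrapped plain text."""

    def __init__(self, width=72):
        super().__init__(convert_charrefs=True)
        self.width = width
        self.blocks = []      # finished blocks, later joined by blank lines
        self.lines = [""]     # lines of the block being collected
        self.align = "left"   # alignment of the current block
        self.quote_depth = 0  # nesting level of <blockquote>

    def flush(self):
        """Wrap and store the current block, then start a new one."""
        indent = "    " * self.quote_depth
        width = max(self.width - len(indent), 1)
        out = []
        for line in self.lines:
            text = " ".join(line.split())  # collapse runs of whitespace
            for wrapped in textwrap.wrap(text, width):
                if self.align == "center":
                    wrapped = wrapped.center(width).rstrip()
                elif self.align == "right":
                    wrapped = wrapped.rjust(width)
                out.append(indent + wrapped)
        if out:
            self.blocks.append("\n".join(out))
        self.lines = [""]
        self.align = "left"

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "blockquote", "hr"):
            self.flush()
        if tag == "p":
            self.align = (dict(attrs).get("align") or "left").lower()
        elif tag == "blockquote":
            self.quote_depth += 1
        elif tag == "hr":
            self.blocks.append("-" * self.width)
        elif tag == "br":
            self.lines.append("")  # forced line break inside a block

    def handle_endtag(self, tag):
        if tag in ("p", "blockquote"):
            self.flush()
        if tag == "blockquote":
            self.quote_depth = max(self.quote_depth - 1, 0)

    def handle_data(self, data):
        self.lines[-1] += data


def render(html, width=72):
    """Return the HTML fragment as text wrapped to the given width."""
    parser = SubsetRenderer(width)
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up text after the last closing tag
    return "\n\n".join(parser.blocks)
```

With this sketch, render("<p>Hello<br>world</p>") gives two lines, and paragraphs are separated by one blank line. A template would be filled in first and passed through render() just before sending.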

Roger Binns

Nov 3, 2004, 1:53:24 AM
Ian Bicking wrote:
> Right. And word wrapping too. Some other tags would also be
> interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
> to control alignment (e.g., <p align="">).

Usually I resort to one of the text-based browsers (e.g. lynx, links, w3m),
which all have a mode that dumps plain text formatted in that way.
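
A rough sketch of driving such a browser from Python 3, assuming lynx or w3m is installed and on PATH. The helper names find_dump_command and html_to_text are hypothetical; the flags are the dump options of those two browsers for reading HTML on standard input.

```python
import shutil
import subprocess

# Flags that make each browser read HTML on stdin and write formatted
# plain text to stdout.
DUMP_COMMANDS = {
    "lynx": ["lynx", "-dump", "-stdin", "-nolist"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
}


def find_dump_command():
    """Return the command for the first browser found on PATH, or None."""
    for name, command in DUMP_COMMANDS.items():
        if shutil.which(name):
            return command
    return None


def html_to_text(html):
    """Pipe the HTML through a text-mode browser and return its output."""
    command = find_dump_command()
    if command is None:
        raise RuntimeError("neither lynx nor w3m was found on PATH")
    result = subprocess.run(
        command, input=html, capture_output=True, text=True, check=True
    )
    return result.stdout
```

The trade-off is an external dependency and one process per message, in exchange for much more complete HTML support than a hand-written renderer.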

Roger


Marc Christiansen

Nov 8, 2004, 8:12:53 PM
Ian Bicking <ia...@colorstudy.com> wrote:

> Robert Brewer wrote:
>> To clarify: you don't want the HTML tags merely stripped; you want to
>> replace e.g. br with a line break and p with, say, two line breaks?
>
> Right. And word wrapping too. Some other tags would also be
> interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
> to control alignment (e.g., <p align="">).

Have a look at htmllib.HTMLParser and formatter in the standard Python
lib (but also look at the source of htmllib). Maybe they provide what
you need.

HTH
Marc

Ivo Woltring

Nov 9, 2004, 2:07:07 PM

"Marc Christiansen" <to...@jupiter.solar-empire.de> wrote in message
news:l5i562-...@halut.solar-empire.de...

Look at this code:

===CUT BELOW===
# Python 2 code: sgmllib was removed in Python 3.
from sgmllib import SGMLParser

class html2txt(SGMLParser):
    """html2txt()
    """
    def reset(self):
        """reset() --> initialize the parser"""
        SGMLParser.reset(self)
        self.pieces = []

    def handle_data(self, text):
        """handle_data(text) --> appends the pieces to self.pieces
        handles all normal data not between brackets "<>"
        """
        self.pieces.append(text)

    def handle_entityref(self, ref):
        """called for each entity reference, e.g. for "&copy;", ref will be
        "copy". Only "&amp;" is translated (to "&"); other entity
        references are dropped.
        """
        if ref=='amp':
            self.pieces.append("&")

    def output(self):
        """Return processed HTML as a single string"""
        return " ".join(self.pieces)

if __name__=="__main__":
    html="""<h1>just a piece of html</h1>
<div class="toc">
<ul>
<li><span class="section"><a
href="index.html#install.choosing">1.1. Which Python is right for
you?</a></span></li>
<li><span class="section"><a href="windows.html">1.2. Python
on Windows</a></span></li>
<li><span class="section"><a href="macosx.html">1.3. Python
on Mac OS X</a></span></li>
<li><span class="section"><a href="macos9.html">1.4. Python
on Mac OS 9</a></span></li>
<li><span class="section"><a href="redhat.html">1.5. Python
on RedHat Linux</a></span></li>
<li><span class="section"><a href="debian.html">1.6. Python
on Debian GNU/Linux</a></span></li>
<li><span class="section"><a href="source.html">1.7. Python
Installation from Source</a></span></li>
<li><span class="section"><a href="shell.html">1.8. The
Interactive Shell</a></span></li>
<li><span class="section"><a href="summary.html">1.9.
Summary</a></span></li>
</ul>
</div>
"""
    parser = html2txt()
    parser.reset()
    parser.feed(html)
    parser.close()
    print parser.output()
=== END CUT ===

The html2txt class is of course extendable and changeable. For me it was
important to convert HTML to text, but the behavior of the class can be
adjusted to make tags do other things. Hope it helps.
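
For readers on Python 3, where sgmllib no longer exists, a rough port of the same idea might look like the following. It is only a sketch: it assumes html.parser.HTMLParser as the replacement base class, and the names Html2Txt and html2txt are hypothetical. Whitespace normalization in output() is an addition that the code above does not have.

```python
from html.parser import HTMLParser


class Html2Txt(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before the
        # text reaches handle_data, so no handle_entityref is needed.
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        """Return the collected text as one whitespace-normalized string."""
        # Pieces are joined with spaces, as in the code above; split/join
        # then collapses newlines and repeated spaces.
        return " ".join(" ".join(self.pieces).split())


def html2txt(html):
    parser = Html2Txt()
    parser.feed(html)
    parser.close()
    return parser.output()


if __name__ == "__main__":
    print(html2txt("<h1>just a piece of html</h1><p>Tom &amp; Jerry</p>"))
```

Tag-specific behavior can be added by overriding handle_starttag and handle_endtag, for example appending a newline to self.pieces when a <br> tag is seen.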

Ivo.

