Well... it's hard to truncate exactly, as there's all that annoying
nesting stuff. An untested attempt with lxml:
def truncate(doc, chars):
    """Truncate the document in-place to the given number of
    visible characters."""
    length = len(doc.text_content())
    if length > chars:
        _truncate_tail(doc, length - chars)

def _truncate_tail(doc, strip):
    # Work from the end: first the tail text, then the children (last
    # one first), then the element's own text.  Returns how many
    # characters are still left to strip.
    doc.tail, strip = strip_chars(doc.tail, strip)
    while strip:
        if not len(doc):
            break
        strip = _truncate_tail(doc[-1], strip)
        if strip:
            # The last child was consumed entirely; drop it.
            del doc[-1]
    if strip:
        doc.text, strip = strip_chars(doc.text, strip)
    return strip

def strip_chars(string, strip):
    # Remove up to `strip` characters from the end of `string`; return
    # the shortened string and how many characters remain to strip.
    if string is None:
        return None, strip
    if len(string) > strip:
        return string[:len(string) - strip], 0
    else:
        return '', strip - len(string)
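A quick check of the intended behavior (hypothetical usage, since the
code above is untested):

import lxml.html

doc = lxml.html.fromstring('<div>Hello <b>world</b>!</div>')
truncate(doc, 8)
print(lxml.html.tostring(doc))
# roughly b'<div>Hello <b>wo</b></div>' -- 8 visible characters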
If you are inclined to finish this and make some tests (doctest-style) I
could add it to lxml.html, I guess to lxml.html.clean (which also has
functions for wordwrapping and linking, which seem related).
--
Ian Bicking : ia...@colorstudy.com : http://blog.ianbicking.org
I think he just wants to make sure the HTML is well-formed, not strip
the tags completely. That said, strip_tags() is something WebHelpers
should provide; I've noticed the lack a couple of times. I'm just not
sure of the best implementation.
- sgmllib (used in cleanhtml.py): not in Python 3. Can cleanhtml.py
  be ported to HTMLParser?
- lxml: hard to install on Mac and Windows due to C dependencies.
- BeautifulSoup: has the best ability to parse real-world (i.e.,
  malformed) HTML. However, it's a largish library, so I'm not sure
  any helper should depend on it.
- Simplicity vs. speed: would routines that depend only on the Python
  standard library be fast enough? A rough sketch of that approach
  follows below.
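Something along these lines, for instance (a stdlib-only sketch; the
helper name is made up, and script/style contents leak through as
plain text):

from html.parser import HTMLParser  # Python 3 spelling

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become plain text
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b> &amp; friends</p>'))
# -> 'Hello world & friends'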
As for Matt's case of truncating HTML without making it malformed,
would this be widely enough used to justify making a webhelper for it?
--
Mike Orr <slugg...@gmail.com>
strip_tags should be easy enough to implement with some regexes -- you
just have to remove <.*?>, then resolve any entities.
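In rough form (fine for simple markup, though a real parser copes
better with comments, script blocks, and '>' inside attribute values):

import re
from html import unescape  # Python 3; Python 2 would use htmlentitydefs

def strip_tags(html):
    text = re.sub(r'<.*?>', '', html)  # drop anything tag-shaped
    return unescape(text)              # then resolve entities like &amp;

print(strip_tags('<p>Fish &amp; chips</p>'))  # -> 'Fish & chips'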
This code does some fairly simplistic rendering of HTML (but better than
what strip_tags would likely do), and might have a better home in
WebHelpers:
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py
Put it in the WebHelpers "unfinished" directory and opened ticket #458
to integrate it.
--
Mike Orr <slugg...@gmail.com>
> Well... it's hard to truncate exactly, as there's all that annoying
> nesting stuff. An untested attempt with lxml:
It would be fun to write a SAX handler that permits all tags, and
counts all characters. It would stop permitting additional characters
once it reached a certain limit.
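Something like this, though with HTMLParser rather than a literal SAX
handler (plain xml.sax won't swallow real-world HTML); attribute
quoting, entity re-escaping, and void elements are only roughly
handled:

from html.parser import HTMLParser

VOID = {'br', 'hr', 'img', 'input', 'link', 'meta'}  # tags that never close

class TruncatingParser(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit   # visible characters still allowed
        self.out = []        # output chunks
        self.stack = []      # currently open tags

    def handle_starttag(self, tag, attrs):
        if self.limit <= 0:
            return
        attr_text = ''.join(' %s="%s"' % (k, v) for k, v in attrs
                            if v is not None)
        self.out.append('<%s%s>' % (tag, attr_text))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.limit > 0 and tag in self.stack:
            while self.stack:
                open_tag = self.stack.pop()
                self.out.append('</%s>' % open_tag)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        if self.limit <= 0:
            return
        self.out.append(data[:self.limit])
        self.limit -= len(data)

    def close(self):
        super().close()
        while self.stack:  # close whatever is still open at the limit
            self.out.append('</%s>' % self.stack.pop())

def truncate_html(html, chars):
    parser = TruncatingParser(chars)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(truncate_html('<p>Hello <b>world</b>!</p>', 8))
# -> '<p>Hello <b>wo</b></p>'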
-jj
--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/
Just to confirm, I'm planning to use Ian's code for the WebHelpers
HTML-to-text renderer because it uses HTMLParser and has no external
dependencies. It's currently in WebHelpers/unfinished/htmlrender.py
in the 0.6 source and at
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py.
Noah offered an alternative using BeautifulSoup, and Matt recommended
something from Django (which would mean deleting unnecessary Django
dependencies). If somebody can tell me what these can do that Ian's
code can't, I might reconsider.
Although again, we have two issues. One is HTML-to-text (essentially
lynx-as-a-function). The other is truncating an HTML string while
keeping it well-formed (which means not stopping in the middle of a
tag and closing any open tags).
--
Mike Orr <slugg...@gmail.com>
I agree with you; I'm not convinced this is a broad enough need to
warrant a webhelper. But some significant use cases would help
convince me.
> 2) strip all HTML tags (without an interest in text formatting)
> 3) html2text (trying to keep text formatting with p, block, etc.)
Ian's code handles p and div, and treats blockquote as p. Other tags
are stripped and ignored. We can extend it if we want more
sophisticated formatting. Actually, indented blocks would be useful,
and optionally displaying the hrefs. (Lynx does this with footnotes.)
> 4) sanitizing HTML (not directly discussed here, but a good
> implementation of this will be helpful, increase security, and should
> be able to be extended trivially to provide #2, stripping all HTML
> tags).
What exactly do you mean by sanitizing? Stripping all except a few
formatting tags? This would be good for WebHelpers if somebody can
provide an implementation, ideally one not depending on non-stdlib
packages.
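For instance, a toy whitelist sketch along those lines -- stdlib-only,
emphatically not security-reviewed, and much cruder than a real
cleaner (script *text* still comes through as escaped data):

from html import escape
from html.parser import HTMLParser

ALLOWED = {'p', 'b', 'i', 'em', 'strong', 'blockquote'}

class Whitelister(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append('<%s>' % tag)  # drop all attributes

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(html):
    parser = Whitelister()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(sanitize('<p onclick="evil()">hi <b>there</b></p>'))
# -> '<p>hi <b>there</b></p>'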
--
Mike Orr <slugg...@gmail.com>
I think blockquote might be handled, and anchors do show their links.
Possibly also lists?
The code was originally written for creating text alternatives to HTML
email.
>> 4) sanitizing HTML (not directly discussed here, but a good
>> implementation of this will be helpful, increase security, and should
>> be able to be extended trivially to provide #2, stripping all HTML
>> tags).
>
> What exactly do you mean by sanitizing? Stripping all except a few
> formatting tags? This would be good for WebHelpers if somebody can
> provide an implementation. One not depending on non-stdlib packages.
I think Jon Rosebaugh (aka Chairos) ported lxml.html.clean to
BeautifulSoup. You couldn't do it without some kind of HTML parser, but
BS is an easy install (or even include it, it's just one file).
feedparser also includes a cleaner, but IMHO it's a bit more crude.
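For reference, the lxml cleaner is a one-liner to use (in recent lxml
releases it has moved out to the separate lxml_html_clean package):

from lxml.html.clean import clean_html

print(clean_html('<p onclick="evil()">hi <script>bad()</script>there</p>'))
# the script element and the onclick handler are both removed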