Truncating an html string safely

Matt Feifarek

Jun 5, 2008, 1:59:38 PM
to pylons-...@googlegroups.com
I'd like to use something like the "truncate" feature of webhelpers on HTML data that's being pulled in from an Atom feed.

If I just use a simple truncate, it might leave some html tags opened (like a <div> without a </div>) which is Bad.

I figured that this was a common-enough task that I'd ask some experts before trying to roll my own solution. It seems like the kind of thing that might be hidden within the standard library somewhere, below my nose, but outside of my ability to discover.

I've found this:
http://code.djangoproject.com/browser/django/trunk/django/utils/text.py

Looks to be about the right thing, but I'd rather not be dependent on all of Django to do this.

Perhaps some ElementTree or lxml wizard knows a quick hack?

Thanks!

Ian Bicking

Jun 5, 2008, 2:36:46 PM
to pylons-...@googlegroups.com

Well... it's hard to truncate exactly, as there's all that annoying
nesting stuff. An untested attempt with lxml:

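    def truncate(doc, chars):
        """Truncate the document in-place to the given number of
        visible characters"""
        length = len(doc.text_content())
        if length > chars:
            _truncate_tail(doc, length - chars)

    def _truncate_tail(doc, strip):
        # Work backwards from the end: this element's tail first,
        # then its children (last to first), then its own text.
        # Assumes the top-level element has no tail of its own.
        doc.tail, strip = strip_chars(doc.tail, strip)
        while strip:
            if not len(doc):
                break
            strip = _truncate_tail(doc[-1], strip)
            if strip:
                # The last child was used up entirely, so drop it.
                # (lxml elements have no pop(); del does the job.)
                del doc[-1]
        if strip:
            doc.text, strip = strip_chars(doc.text, strip)
        return strip

    def strip_chars(string, strip):
        # Returns (shortened string, characters still to remove).
        if string is None:
            return None, strip
        if len(string) > strip:
            return string[:len(string) - strip], 0
        else:
            return '', strip - len(string)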


If you are inclined to finish this and make some tests (doctest-style) I
could add it to lxml.html, I guess to lxml.html.clean (which also has
functions for wordwrapping and linking, which seem related).
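
For anyone who wants to try the idea without installing lxml, here is a rough port of the same approach to the standard library's xml.etree.ElementTree, with a small check in the doctest spirit. It is only a sketch: the is_root flag is my own addition (so the top element's tail is left alone), ElementTree needs well-formed XML rather than real-world HTML, and itertext() stands in for lxml's text_content().

```python
import xml.etree.ElementTree as ET


def strip_chars(string, strip):
    """Remove up to `strip` characters from the end of `string`.

    Returns (new_string, characters_still_to_remove).
    """
    if string is None:
        return None, strip
    if len(string) > strip:
        return string[:len(string) - strip], 0
    return '', strip - len(string)


def _truncate_tail(el, strip, is_root=False):
    # The root's own tail is not part of its visible text, so skip it.
    if not is_root:
        el.tail, strip = strip_chars(el.tail, strip)
    while strip and len(el):
        strip = _truncate_tail(el[-1], strip)
        if strip:
            del el[-1]  # last child fully consumed, drop it
    if strip:
        el.text, strip = strip_chars(el.text, strip)
    return strip


def truncate(el, chars):
    """Truncate `el` in place to at most `chars` visible characters."""
    length = len(''.join(el.itertext()))
    if length > chars:
        _truncate_tail(el, length - chars, is_root=True)


doc = ET.fromstring('<div><p>Hello <b>big</b> world</p><p>Bye</p></div>')
truncate(doc, 8)
print(ET.tostring(doc, encoding='unicode'))
# prints <div><p>Hello <b>bi</b></p></div>
```

Note that an element whose text is consumed exactly is left behind as an empty element rather than removed.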

--
Ian Bicking : ia...@colorstudy.com : http://blog.ianbicking.org

TJ Ninneman

Jun 5, 2008, 2:56:57 PM
to pylons-...@googlegroups.com
I've had excellent luck stripping HTML with the following:


I use it to strip out all the html leaving a nice plain string.  It does the best job of any solutions I've seen.

TJ

Mike Orr

Jun 5, 2008, 3:50:25 PM
to pylons-...@googlegroups.com

I think he just wants to make sure the HTML is well-formed, not strip
the tags completely. However, strip_tags() is something WebHelpers
should provide. I've noticed the lack a couple times. However, I'm
not sure of the best implementation.

- sgmllib: (used in cleanhtml.py): not in Python 3. Can
cleanhtml.py be ported to HTMLParser?

- lxml: hard to install on Mac and Windows due to C dependencies.

- BeautifulSoup: has the best ability to parse real-world (i.e.,
malformed) HTML. However, it's a largish library so I'm not sure any
helper should depend on it.

- Simplicity vs speed. Would routines that depend only on the
Python standard library be fast enough?

As for Matt's case of truncating HTML without making it malformed,
would this be widely enough used to justify making a webhelper for it?

--
Mike Orr <slugg...@gmail.com>

Ian Bicking

Jun 5, 2008, 4:01:53 PM
to pylons-...@googlegroups.com

strip_tags should be easy enough to implement with some regexes -- you
just have to remove <.*?>, then resolve any entities.
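
A minimal sketch of that regex approach, assuming a modern Python where html.unescape is available (the function name strip_tags is just the one used in this thread, not an existing WebHelpers API). It is deliberately naive: a ">" inside an attribute value, comments, and script/style contents will all trip it up.

```python
import html
import re

# Non-greedy match of anything between "<" and the next ">".
_TAG_RE = re.compile(r'<.*?>', re.DOTALL)


def strip_tags(markup):
    """Remove anything that looks like a tag, then resolve entities."""
    text = _TAG_RE.sub('', markup)
    # Unescape after removing tags, so "&lt;b&gt;" survives as literal text.
    return html.unescape(text)


print(strip_tags('<p>Fish &amp; <b>chips</b></p>'))
# prints Fish & chips
```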

This code does some fairly simplistic rendering of HTML (but better than
what strip_tags would likely do), and might have a better home in
WebHelpers:
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

Mike Orr

Jun 5, 2008, 5:03:00 PM
to pylons-...@googlegroups.com

Put in the WebHelpers "unfinished" directory and opened ticket #458 to
integrate it.

--
Mike Orr <slugg...@gmail.com>

Noah Gift

Jun 5, 2008, 9:59:01 PM
to pylons-...@googlegroups.com
I have some boilerplate multi-threaded examples of using BeautifulSoup here:

Matt Feifarek

Jun 7, 2008, 10:24:40 AM
to pylons-...@googlegroups.com
Oops; replied from the wrong address.

---------- Forwarded message ----------

On Thu, Jun 5, 2008 at 2:36 PM, Ian Bicking <ia...@colorstudy.com> wrote:

> Well... it's hard to truncate exactly, as there's all that annoying
> nesting stuff.  An untested attempt with lxml:

Exactly. Thanks for the lead.

I'm not sure I'm up to the challenge, but if I do get it working, I'll get it back to you, in case it's good enough to be added to lxml (or whatever).

Mike:
Seems like if we have the truncate function in webhelpers, a truncate that handles html would be wise... since we're, err, making html, usually, with Pylons.

Since the Django code doesn't seem to depend on anything (but some Django cruft, which seems to be frosting really) MAYBE it would be better to start with.

But I'll poke around a bit today.

Shannon -jj Behrens

Jun 12, 2008, 2:21:59 AM
to pylons-...@googlegroups.com

It would be fun to write a SAX handler that permits all tags, and
counts all characters. It would stop permitting additional characters
once it reached a certain limit.
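
One way that could look, sketched with the standard library's event-driven html.parser.HTMLParser standing in for a true SAX handler. The class and function names are made up for illustration. Text is re-escaped on output, whitespace between tags counts toward the limit, and script/style contents are not treated specially, so take it as a starting point only.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}


class TruncatingParser(HTMLParser):
    """Pass tags through, count text characters, stop at `limit`."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = limit <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close anything left open inside this element, then the element.
        while True:
            top = self.open_tags.pop()
            self.out.append('</%s>' % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True  # limit reached: ignore everything after


def truncate_html(markup, limit):
    parser = TruncatingParser(limit)
    parser.feed(markup)
    parser.close()
    # Close whatever is still open, innermost first.
    closers = ''.join('</%s>' % t for t in reversed(parser.open_tags))
    return ''.join(parser.out) + closers


print(truncate_html('<div><p>Hello <b>big</b> world</p><p>Bye</p></div>', 8))
# prints <div><p>Hello <b>bi</b></p></div>
```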

-jj

--
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

Mike Orr

Jun 12, 2008, 5:13:40 AM
to pylons-...@googlegroups.com

Just to confirm, I'm planning to use Ian's code for WebHelpers
HTML-to-text renderer because it uses HTMLParser and has no external
dependencies. It's currently in WebHelpers/unfinished/htmlrender.py
in the 0.6 source and at
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py.

Noah offered an alternative using BeautifulSoup, and Matt recommended
something from Django (which would mean deleting unnecessary Django
dependencies). If somebody can tell me what these can do that Ian's
code can't, I might reconsider.

Although again, we have two issues. One is HTML-to-text (essentially
lynx-as-a-function). The other is truncating an HTML string while
keeping it well-formed (which means not stopping in the middle of a
tag and closing any open tags).

--
Mike Orr <slugg...@gmail.com>

rcs_comp

Jun 12, 2008, 10:19:41 AM
to pylons-discuss


On Jun 12, 5:13 am, "Mike Orr" <sluggos...@gmail.com> wrote:
> Although again, we have two issues. One is HTML-to-text (essentially
> lynx-as-a-function). The other is truncating an HTML string while
> keeping it well-formed (which means not stopping in the middle of a
> tag and closing any open tags).

You might also want to look here:

http://www.zope.org/Members/chrisw/StripOGram
http://www.gnome.org/~jdub/bzr/planet/2.0/planet/sanitize.py

My $0.02 is that truncating HTML while ensuring it is well-formed is
not something worth spending time implementing in a web helper.
Take this example for instance:

<h1>My Page Subject</h1>
<div>
<p>Lorem Ipsum...[another 200 characters]</p>
<p>Lorem Ipsum...[another 200 characters]</p>
<p>Lorem Ipsum...[another 200 characters]</p>
<p>Lorem Ipsum...[another 200 characters]</p>
<p>Lorem Ipsum...[another 200 characters]</p>
</div>

Let's say that I want the first 150 characters, what is going to
happen? I am going to get 1000+ characters b/c of the <div> that is
wrapping everything OR I will get nothing but the header. Neither is
what I want.

Whenever I have come across the need to truncate HTML, I have always
been able to just do a strip-tags first. Most of the time I am just
trying to display a "summary" of a larger HTML formatted page/document
and losing formatting for summary purposes is usually not that big of
a deal.
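
The strip-then-truncate approach described above might be sketched like this with only the standard library. The names are hypothetical. Joining the text chunks with a space keeps "Title" and "Body" from running together across block tags, at the cost of splitting a word that an inline tag interrupts. Script/style contents are not filtered out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def summarize(markup, limit, suffix='...'):
    """Strip tags, collapse whitespace, then cut at a word boundary."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()
    text = ' '.join(' '.join(extractor.chunks).split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the previous space unless the cut already ends a word.
    if text[limit] != ' ' and ' ' in cut:
        cut = cut.rsplit(' ', 1)[0]
    return cut.rstrip() + suffix


page = '<h1>My Page Subject</h1>\n<div><p>Lorem ipsum dolor sit amet</p></div>'
print(summarize(page, 20))
# prints My Page Subject...
```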

Is there a possible need/use for truncating HTML and leaving it
well-formed? Maybe. Is it a trivial enough implementation to put in a
web helper? Not IMO.

rcs_comp

Jun 12, 2008, 10:37:56 AM
to pylons-discuss


On Jun 12, 5:13 am, "Mike Orr" <sluggos...@gmail.com> wrote:

> Although again, we have two issues. One is HTML-to-text (essentially
> lynx-as-a-function). The other is truncating an HTML string while
> keeping it well-formed (which means not stopping in the middle of a
> tag and closing any open tags).


Here is another sanitizer (I think from something having to do with
Zope):

http://www.koders.com/python/fidFB51F4D2D89CC1397608213E09F11404D9B21059.aspx

rcs_comp

Jun 12, 2008, 10:55:29 AM
to pylons-discuss


On Jun 12, 5:13 am, "Mike Orr" <sluggos...@gmail.com> wrote:
> Although again, we have two issues. One is HTML-to-text (essentially
> lynx-as-a-function). The other is truncating an HTML string while
> keeping it well-formed (which means not stopping in the middle of a
> tag and closing any open tags).

Actually, I think we may have four issues...?

1) truncate HTML and end up with well-formed HTML.
2) strip all HTML tags (without an interest in text formatting)
3) html2text (trying to keep text formatting with p, block, etc.)
4) sanitizing HTML (not directly discussed here, but a good
implementation of this will be helpful, increase security, and should
be able to be extended trivially to provide #2, stripping all HTML
tags).

rcs_comp

Jun 12, 2008, 10:57:46 AM
to pylons-discuss


On Jun 12, 10:19 am, rcs_comp <rsyr...@gmail.com> wrote:
> Let's say that I want the first 150 characters, what is going to
> happen? I am going to get 1000+ characters b/c of the <div> that is
> wrapping everything OR I will get nothing but the header. Neither is
> what I want.

I suppose you could keep track of unclosed tags and close them:

<h1>My Page Subject</h1>
<div>
<p>Lorem Ipsum...[another ~125 characters]

then insert manually:

</p></div>

Still seems like a pain though.
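
A rough sketch of that bookkeeping, assuming the string has already been cut somewhere and using the standard library's HTMLParser to track what is still open. The names are made up. It drops a tag that was cut in half, but does not repair a half-cut entity such as "&am", and a stray "<" or ">" in the text would confuse the half-tag check.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}


class _OpenTagTracker(HTMLParser):
    """Keep a stack of elements that have been opened but not closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Discard everything opened after the matching start tag.
            while self.stack.pop() != tag:
                pass


def close_open_tags(fragment):
    # Drop a tag that was cut in half, e.g. '<p>text</' or '<a hre'.
    if fragment.rfind('<') > fragment.rfind('>'):
        fragment = fragment[:fragment.rfind('<')]
    tracker = _OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    closers = ''.join('</%s>' % t for t in reversed(tracker.stack))
    return fragment + closers


cut = '<h1>My Page Subject</h1>\n<div>\n<p>Lorem Ipsum'
print(close_open_tags(cut))
```

For the example above this appends "</p></div>", matching the manual fix.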

Mike Orr

Jun 12, 2008, 4:21:15 PM
to pylons-...@googlegroups.com
On Thu, Jun 12, 2008 at 7:55 AM, rcs_comp <rsy...@gmail.com> wrote:
>
>
>
> On Jun 12, 5:13 am, "Mike Orr" <sluggos...@gmail.com> wrote:
>> Although again, we have two issues. One is HTML-to-text (essentially
>> lynx-as-a-function). The other is truncating an HTML string while
>> keeping it well-formed (which means not stopping in the middle of a
>> tag and closing any open tags).
>
> Actually, I think we may have four issues...?
>
> 1) truncate HTML and end up with well-formed HTML.

I agree with you; I'm not convinced this is a broad enough need to
warrant a webhelper. But some significant use cases would help
convince me.

> 2) strip all HTML tags (without an interest in text formatting)
> 3) html2text (trying to keep text formatting with p, block, etc.)

Ian's code handles p and div, and treats block as p. Other tags are
stripped and ignored. We can extend it if we want more sophisticated
formatting. Actually, indented blocks would be useful. And
optionally displaying the hrefs. (Lynx does this with footnotes.)


> 4) sanitizing HTML (not directly discussed here, but a good
> implementation of this will be helpful, increase security, and should
> be able to be extended trivially to provide #2, stripping all HTML
> tags).

What exactly do you mean by sanitizing? Stripping all except a few
formatting tags? This would be good for WebHelpers if somebody can
provide an implementation. One not depending on non-stdlib packages.

--
Mike Orr <slugg...@gmail.com>

rcs_comp

Jun 12, 2008, 4:43:30 PM
to pylons-discuss


On Jun 12, 4:21 pm, "Mike Orr" <sluggos...@gmail.com> wrote:
> On Thu, Jun 12, 2008 at 7:55 AM, rcs_comp <rsyr...@gmail.com> wrote:

> > 4) sanitizing HTML (not directly discussed here, but a good
> > implementation of this will be helpful, increase security, and should
> > be able to be extended trivially to provide #2, stripping all HTML
> > tags).
>
> What exactly do you mean by sanitizing? Stripping all except a few
> formatting tags? This would be good for WebHelpers if somebody can
> provide an implementation. One not depending on non-stdlib packages.

Yes, I think the best way to implement something like this is to have
a whitelist of approved tags and attributes. I am new to Python so I
don't know if the things I suggested above depend on non-stdlib
packages. However, an example of what I have in mind, written in PHP,
is here:

http://htmlpurifier.org/
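
To make the whitelist idea concrete, here is a small stdlib-only sketch. The tag and attribute lists, the scheme check, and all names are illustrative assumptions, not a vetted policy. It is nowhere near what HTML Purifier does: it does not balance tags or fix nesting, for one.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of permitted attribute names.
ALLOWED = {
    'a': {'href', 'title'},
    'b': set(), 'i': set(), 'em': set(), 'strong': set(),
    'p': set(), 'br': set(), 'ul': set(), 'ol': set(), 'li': set(),
}
SAFE_SCHEMES = ('http://', 'https://', 'mailto:')
DROP_CONTENT = {'script', 'style'}  # discard these tags and their contents


class WhitelistSanitizer(HTMLParser):
    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in self.allowed:
            return
        kept = []
        for name, value in attrs:
            if name not in self.allowed[tag] or value is None:
                continue
            link = value.strip().lower()
            if name == 'href' and not link.startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append('<%s%s>' % (tag, ''.join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in self.allowed and tag != 'br':
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def sanitize(markup, allowed=ALLOWED):
    parser = WhitelistSanitizer(allowed)
    parser.feed(markup)
    parser.close()
    return ''.join(parser.out)


dirty = ('<p onclick="x()">Hi <a href="javascript:alert(1)">there</a>'
         '<script>alert(1)</script></p>')
print(sanitize(dirty))
# prints <p>Hi <a>there</a></p>
```

Passing an empty whitelist, sanitize(markup, allowed={}), removes every tag, which is the "strip all tags" case from point 2; the remaining text stays HTML-escaped.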

Ian Bicking

Jun 12, 2008, 6:57:58 PM
to pylons-...@googlegroups.com
Mike Orr wrote:
> On Thu, Jun 12, 2008 at 7:55 AM, rcs_comp <rsy...@gmail.com> wrote:
>> On Jun 12, 5:13 am, "Mike Orr" <sluggos...@gmail.com> wrote:
>>> Although again, we have two issues. One is HTML-to-text (essentially
>>> lynx-as-a-function). The other is truncating an HTML string while
>>> keeping it well-formed (which means not stopping in the middle of a
>>> tag and closing any open tags).
>> Actually, I think we may have four issues...?
>>
>> 1) truncate HTML and end up with well-formed HTML.
>
> I agree with you; I'm not convinced this is a broad enough need to
> warrant a webhelper. But some significant use cases would help
> convince me.
>
>> 2) strip all HTML tags (without an interest in text formatting)
>> 3) html2text (trying to keep text formatting with p, block, etc.)
>
> Ian's code handles p and div, and treats block as p. Other tags are
> stripped and ignored. We can extend it if we want more sophisticated
> formatting. Actually, indented blocks would be useful. And
> optionally displaying the hrefs. (Lynx does this with footnotes.)

I think blockquote might be handled, and anchors do show their links.
Possibly also lists?

The code was originally written for creating text alternatives to HTML
email.

>> 4) sanitizing HTML (not directly discussed here, but a good
>> implementation of this will be helpful, increase security, and should
>> be able to be extended trivially to provide #2, stripping all HTML
>> tags).
>
> What exactly do you mean by sanitizing? Stripping all except a few
> formatting tags? This would be good for WebHelpers if somebody can
> provide an implementation. One not depending on non-stdlib packages.

I think Jon Rosebaugh (aka Chairos) ported lxml.html.clean to
BeautifulSoup. You couldn't do it without some kind of HTML parser, but
BS is an easy install (or even include it, it's just one file).

feedparser also includes a cleaner, but IMHO it's a bit more crude.
