[Django] #30686: Truncator.chars splits HTML entities

Django

Aug 6, 2019, 7:22:01 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities
--------------------------------------+------------------------
Reporter: tdhooper | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
--------------------------------------+------------------------
I'm using Truncator to truncate wikis, and it sometimes truncates in the
middle of &quot; entities, resulting in 'some text &qu'
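
For illustration, the symptom can be reproduced with plain string slicing. This is a minimal sketch and does not use Django's `Truncator`; the sample string and the cut-off of 13 characters are assumptions chosen to match the reported output.

```python
# Illustration only: plain slicing stands in for character-based truncation
# that counts each character of an entity separately.
text = "some text &quot;quoted&quot; and more"

# Cutting after 13 characters lands inside the first &quot; entity.
truncated = text[:13]
print(truncated)  # some text &qu
```

A browser would render the leftover `&qu` literally, which is the broken output described above.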

--
Ticket URL: <https://code.djangoproject.com/ticket/30686>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

Aug 6, 2019, 7:22:22 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

---------------------------+--------------------------------------

Reporter: tdhooper | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

---------------------------+--------------------------------------
Description changed by tdhooper:

Old description:

> I'm using Truncator to truncate wikis, and it sometimes truncates in the
> middle of &quot; entities, resulting in 'some text &qu'

New description:

I'm using Truncator.chars to truncate wikis, and it sometimes truncates in
the middle of &quot; entities, resulting in 'some text &qu'

--

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:1>

Django

Aug 6, 2019, 7:48:04 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------

Comment (by Carlton Gibson):

Hi Thomas. Any chance of an example string (hopefully minimal) that
creates the behaviour so we can have a look?

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:2>

Django

Aug 6, 2019, 8:31:46 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------

Comment (by Florian Apolloner):

I think now that the security releases are out, let's just add bleach as a
dependency on master and be done with it?

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:3>

Django

Aug 6, 2019, 9:12:11 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------

Comment (by Thomas Hooper):

Here's an example https://repl.it/@tdhooper/Django-truncate-entities-bug

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:4>

Django

Aug 6, 2019, 9:15:16 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------

Comment (by Florian Apolloner):

btw I confused `truncator` with `strip_tags`. So in this case the answer
would be to rewrite the parser using `html5lib`, while `strip_tags` would
use `bleach`, which in turn uses `html5lib` as well.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:5>

Django

Aug 6, 2019, 10:54:34 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------

Comment (by Thomas Hooper):

Looks like it can be fixed with this regex change
https://github.com/django/django/pull/11633/files
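
As a rough sketch of the kind of regex change being discussed, a tokenizing pattern can match a whole entity as one unit so that it is never cut in half. The pattern and the `truncate_chars` helper below are hypothetical and are not taken from the pull request; the sketch also does not close tags left open at the cut.

```python
import re

# Hypothetical pattern (not the one from the pull request): match a tag,
# a complete entity such as &quot; or &#39;, or any single character.
TOKEN_RE = re.compile(
    r"<[^>]*>|&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);|.",
    re.S,
)


def truncate_chars(html, limit):
    """Keep at most `limit` visible characters; an entity counts as one."""
    out = []
    count = 0
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        if len(token) > 1 and token[0] == "<":
            # Tags are copied through and do not count towards the limit.
            out.append(token)
            continue
        if count == limit:
            break
        out.append(token)
        count += 1
    return "".join(out)


print(truncate_chars("some text &quot;quoted&quot;", 11))  # some text &quot;
```

The pattern still misreads input such as `a < b and c > d` as containing a tag, which is the sort of edge case that makes a regex-only approach fragile.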

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:6>

Django

Aug 7, 2019, 5:12:43 AM

to django-...@googlegroups.com

#30686: Truncator.chars splits HTML entities

-------------------------------+--------------------------------------
Reporter: Thomas Hooper | Owner: nobody

Type: Bug | Status: new
Component: Utilities | Version: 2.2

Severity: Normal | Resolution:

Keywords: | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+--------------------------------------
Changes (by Carlton Gibson):

* Attachment "possible-html5lib-truncator-implementation.patch" added.

Example implementation of _truncate_html() using html5lib, by Florian
Apolloner

Django

Aug 7, 2019, 5:17:50 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------
Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------
Changes (by Carlton Gibson):

* version: 2.2 => master
* stage: Unreviewed => Accepted

Old description:

> I'm using Truncator.chars to truncate wikis, and it sometimes truncates
> in the middle of &quot; entities, resulting in 'some text &qu'

New description:

Original description:

> I'm using Truncator.chars to truncate wikis, and it sometimes truncates
> in the middle of &quot; entities, resulting in 'some text &qu'

This is a limitation of the regex based implementation (which has had
security issues, and presents an intractable problem).

Better to move to an HTML parser for Truncator and strip_tags(), via
html5lib and bleach.

--

Comment:

Right, good news is this isn't a regression from
7f65974f8219729c047fbbf8cd5cc9d80faefe77.

* The new example case fails on v2.2.3 &co.
* The suggestion for the regex change is in the part not changed as part
of 7f65974f8219729c047fbbf8cd5cc9d80faefe77. (Which is why the new case
fails, I suppose :)

I don't want to accept a tweaking of the regex here. Rather, we should
move to using `html5lib` as Florian suggests.
Possibly this would entail small changes in behaviour around edge cases,
to be called out in release notes, but it would be a big win overall.

This has previously been discussed by the Security Team as the required
way forward.
I've updated the title/description and will Accept accordingly.

I've attached an initial WIP patch by Florian of an `html5lib`
implementation of the core `_truncate_html()` method.

An implementation of `strip_tags()` using `bleach` would go something
like:

{{{
bleach.clean(text, tags=[], strip=True, strip_comments=True)
}}}

Thomas, would taking on making changes like these be something you'd be
willing/keen to do? If so, I'm very happy to input to assist in any way.
:)

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:7>

Django

Aug 7, 2019, 11:12:58 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Thomas Hooper):

Hi Carlton, that would be fun, but this is bigger than I have time for
now. It looks like you all have it in hand.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:8>

Django

Aug 7, 2019, 11:13:26 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Claude Paroz):

Do we want to make both html5lib and bleach required dependencies of
Django?
The latest html5lib release was 20 months ago, and when I read issues like
https://github.com/html5lib/html5lib-python/issues/419 without any
maintainer feedback, I'm a bit worried. What about the security report
workflow for those libs? What if a security issue is discovered in
html5lib and the maintainers are unresponsive? Sorry to sound a bit negative,
but I think those questions must be asked.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:9>

Django

Aug 7, 2019, 1:35:34 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Carlton Gibson):

Yep Claude, absolutely.

I think there are two difficulties we could face:

* trying to successfully sanitize HTML with regexes.
* (Help) Make sure html5lib-python is maintained.

The first of these is intractable. The second not. 🙂

I've put out some feelers to try and find out more.

* This is pressing for Python and pip **now**, not for us for a while yet.
* If we look at https://github.com/html5lib/html5lib-python/issues/361 it
seems there's some money on the table from tidelift potentially.
* We COULD allocate some time in a pinch I think.
* AND it's **just** a wrapper around the underlying C library, so whilst
20 months seems a long time, I'm not sure the release cadence is really an
issue.

BUT, yes, absolutely. Let's hammer this out properly before we commit. 👍
I will open a mailing list thread when I know more.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:10>

Django

Aug 7, 2019, 1:50:02 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Carlton Gibson):

> AND it's just (even with the emphasis, cough) a wrapper around the
> underlying C library, so whilst 20 months seems a long time, I'm not sure
> the release cadence is really an issue.

OK, that last one isn't at all true. (Looking at the source it's the
entire implementation.)

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:11>

Django

Aug 7, 2019, 2:36:34 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Claude Paroz):

To be clear, I'm also convinced parsing is more reliable than regexes. I
just think we have to think twice before adding a dependency because, as
the name implies, we depend on it and therefore must be able to trust
its maintainers. We should obtain some guarantees about the security process
and fixes for serious bugs. Without that, we are just outsourcing
problems.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:12>

Django

Aug 8, 2019, 2:17:14 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Carlton Gibson):

@Claude: 💯👍 Totally agree.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:13>

Django

Aug 12, 2019, 3:31:45 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Carlton Gibson):

Duplicate in #30700, with [https://github.com/django/django/pull/11660
failing test case provided].

I've tried contacting the maintainers of html5lib with no success.

I've re-opened https://github.com/django/django/pull/11633 (original regex
based suggestion) so we can at least assess it as a possible stop-gap.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:14>

Django

Aug 20, 2019, 1:27:03 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------
Changes (by Carlton Gibson):

* cc: Jon Dufresne (added)

Comment:

Paging Jon, to ask his opinion on this.

Hey Jon, I see you've made a number of PRs to both html5lib, and bleach.

To me, at this point, html5lib essentially looks unmaintained. I don't
have personal capacity to give to it, as cool as it is as a project.
Arguably we (Fellows) could allocate it _some_ time, since we spend a fair
bit already messing around with regexes but that would be small, and we
couldn't take it on whole, so can I ask your thoughts?

Is html5lib in trouble? If so, as a user, what are your plans, if any? And
from that, what do you think about Django adopting it? What's the
alternative?

Thanks for the thought and insight.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:15>

Django

Aug 24, 2019, 1:42:39 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Jon Dufresne):

> To me, at this point, html5lib essentially looks unmaintained.

I agree with this observation. The previous main maintainer looks to have
stopped working on the project. Responses to issues and PRs have stopped.

> Is html5lib in trouble? If so, as a user, what are your plans, if any?
> And from that, what do you think about Django adopting it? What's the
> alternative?

For my own projects, I'll probably continue using html5lib until its
staleness creates an observable bug for me. I haven't hit that point yet.

Bleach, on the other hand, looks like maintenance has slowed, but not
stopped. I believe they have vendored html5lib to allow them to make
changes internally. FWIW, I also still use Bleach.

---

I'm not familiar with all the details of this ticket, but would the stdlib
HTML parser be sufficient?

https://docs.python.org/3/library/html.parser.html
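
To make the stdlib option concrete, here is a minimal sketch of entity-aware truncation built on `html.parser.HTMLParser`. The `TruncatingParser` class, the `truncate_html` function and the `VOID` set are hypothetical names for illustration, not Django code; comments and doctypes are simply dropped.

```python
from html.parser import HTMLParser

# Assumed subset of void elements (elements with no closing tag).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TruncatingParser(HTMLParser):
    """Hypothetical helper: keep `limit` text characters, track open tags."""

    def __init__(self, limit):
        # convert_charrefs=False delivers entities whole via the
        # handle_entityref/handle_charref callbacks.
        super().__init__(convert_charrefs=False)
        self.remaining = limit
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if self.remaining > 0:
            self.out.append(self.get_starttag_text())
            if tag not in VOID:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.remaining > 0 and tag in self.open_tags:
            # Close anything left open inside this tag, then the tag itself.
            while self.open_tags:
                open_tag = self.open_tags.pop()
                self.out.append(f"</{open_tag}>")
                if open_tag == tag:
                    break

    def handle_data(self, data):
        if self.remaining > 0:
            piece = data[: self.remaining]
            self.out.append(piece)
            self.remaining -= len(piece)

    def handle_entityref(self, name):
        if self.remaining > 0:
            self.out.append(f"&{name};")  # an entity counts as one character
            self.remaining -= 1

    def handle_charref(self, name):
        if self.remaining > 0:
            self.out.append(f"&#{name};")
            self.remaining -= 1


def truncate_html(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    # Close whatever was still open when the limit was reached.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<p>some text &quot;quoted&quot;</p>", 11))
```

With the sample input, the entity is kept whole and the `<p>` element is closed after the cut.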

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:16>

Django

Aug 26, 2019, 5:34:56 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------

Reporter: Thomas Hooper | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+------------------------------------

Comment (by Carlton Gibson):

Hi Jon,

Thank you for the comments. I will email Will, the maintainer of Bleach,
and ask his thoughts too. Bleach has slowed down, but that's because it's
Stable/Mature now I would have thought.

> ...would the stdlib HTML parser be sufficient?

Yes. Maybe. Ideally we just thought to bring in Bleach, and with it
html5lib, since in theory that's already working code. (Florian already
had a Truncator prototype...)

Anyhow... will follow-up.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:17>

Django

Jan 3, 2023, 11:00:17 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.

-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev

Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------
Changes (by David Smith):

* owner: nobody => David Smith
* status: new => assigned

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:18>

Django

Jan 3, 2023, 11:28:26 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------
Changes (by Carlton Gibson):

* cc: Matthias Kestenholz (added)

Comment:

Adding some detail after the last post, since you're looking at it David.

There was a discussion (with various folks from html5lib, and Mozilla, and
...) about whether html5lib could be put on a better footing.
I'm not sure how that panned out in the medium term. (I didn't check what
the rhythm looks like now.)

There was alternate talk about whether bleach (or an alternate) could
build off `html5ever` which is the HTML parser from the Mozilla servo
project.

* https://github.com/servo/html5ever
* https://github.com/SimonSapin/html5ever-python (Py03 bindings.)

That would be pretty cool, but it was clearly a lot of work, and then 2020
happened, so...

The other candidate in this space is Matthias' html-sanitizer:
https://github.com/matthiask/html-sanitizer — which is built on `lxml`.

That's just to lay down the notes I had gathered. I'm not sure of the way
forward, but hopefully it's helpful.
Very open to ideas though! Thanks for picking it up.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:19>

Django

Jan 3, 2023, 11:32:45 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------
Changes (by Jon Dufresne):

* cc: Jon Dufresne (removed)

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:20>

Django

Jan 3, 2023, 11:47:43 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------

Comment (by Matthias Kestenholz):

Hi all

lxml is quite a heavy dependency. It works **very** well but you'll wait
for the compilation a long time if you do not have wheels. (see
https://pypi.org/project/lxml/#files) I think Python packaging is almost a
non-issue these days except when it comes to transitive dependencies, and
I wouldn't want to be in charge of specifying and updating the supported
range of lxml versions. That being said, I encountered almost no breaking
changes in lxml since
[https://github.com/feincms/feincms/commit/0ec8e834dd2e0927bb23d46ee9102716c7735add
~2009], I use lxml in almost all projects and I can heartily recommend it
to anyone.

I'm sure that the regex-based solution has some problems; I'm sorry to
admit I haven't read the full thread but I just cannot imagine a situation
where using `|strip_tags` without `|safe` would lead to a security issue,
and why would you want to combine these? There's no point to mark a string
as safe after stripping all tags. So it's only about the fact that the
output sometimes isn't nice, something which may be fixed by converting as
many entities to their unicode equivalents as possible and only truncating
afterwards?
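
The "convert entities first, truncate afterwards" idea can be sketched with the standard library alone. This assumes the input is already tag-free (for example after `strip_tags`); `truncate_plain` and its `ellipsis` default are hypothetical names chosen for illustration.

```python
import html


def truncate_plain(text_with_entities, limit, ellipsis="..."):
    """Decode entities, truncate the plain text, then re-escape for output."""
    plain = html.unescape(text_with_entities)
    if len(plain) > limit:
        plain = plain[:limit] + ellipsis
    # Re-escaping keeps the result safe to place in HTML.
    return html.escape(plain)


print(truncate_plain("some text &quot;quoted&quot;", 11))  # some text &quot;...
```

Because the cut happens on decoded characters, an entity can never be split; the cost is that the output is re-escaped rather than byte-for-byte identical to the input.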

Last but not least: I haven't benchmarked it ever, but I have the
suspicion that running bleach or html-sanitizer during rendering may be
wasteful in terms of CPU cycles. I only ever use the sanitizer when
saving, never when rendering. `|strip_tags` is obviously applied when
rendering and performs well enough in many situations.

So, to me `strip_tags` is a clear case of
[https://hachyderm.io/@matthiask/109545192841761218 a simple
implementation with "worse is better" characteristics].

I truly hope this is helpful and not just a cold shower (sorry for using
"just" here)

Thanks,
Matthias

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:21>

Django

Jan 3, 2023, 1:21:30 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------

Comment (by Carlton Gibson):

Hey Matthias, that's very useful input. Thank you.

> So, to me strip_tags is a clear case of a simple implementation with
"worse is better" characteristics.

Let me review what happened here tomorrow (it was a **long** while ago)
but assuming it makes sense, `wontfix` + ''We're not accepting any
complications to the algorithm — use ... if you need more sophistication''
may be the neatest way all round.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:22>

Django

Jan 3, 2023, 4:04:02 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------

Comment (by David Smith):

[https://github.com/django/django/pull/16421 PR ]

I was thinking about Jon's suggestion of using the HTMLParser from the
standard library. Since the last comments on this ticket, Adam Johnson has
also written a blog post on truncating HTML with Python's HTMLParser, which
helped inspire my PR; see
[https://adamj.eu/tech/2021/09/23/truncating-my-blog-posts-with-html-parser/
blog post]. (I'd cc Adam as I've mentioned his name, but I'm not sure how to
do that?!)

While my PR still needs more work I thought it worth sharing as it may be
helpful to Carlton when reviewing tomorrow.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:23>

Django

Jan 4, 2023, 3:06:46 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------+---------------------------------------

Comment (by Carlton Gibson):

Hey David — great stuff. With that in play I won't rush to resolve.

Thinking about Matthias' comment:

> Last but not least: I haven't benchmarked it ever, but I have the
> suspicion that...

Could I ask you to do a minimal `timeit`/`pyperf` comparison, so we can
get a rough measure on the table? (It doesn't need to be perfect.)
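
A rough comparison along those lines might look like the following `timeit` sketch. It compares a naive regex tag stripper with an `HTMLParser`-based one; both functions, the sample markup and the repeat count are assumptions for illustration and are not the code from the PR, so the numbers only indicate relative cost.

```python
import re
import timeit
from html.parser import HTMLParser

# Made-up sample markup, repeated to give the timers something to chew on.
SAMPLE = "<p>Some <b>bold</b> text &amp; an <a href='#'>anchor</a>.</p>" * 50

TAG_RE = re.compile(r"<[^>]*>")


def strip_regex(value):
    """Naive regex-based tag stripping."""
    return TAG_RE.sub("", value)


class _Stripper(HTMLParser):
    """Collect text and entities, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_parser(value):
    """Parser-based tag stripping using the stdlib HTMLParser."""
    stripper = _Stripper()
    stripper.feed(value)
    stripper.close()
    return "".join(stripper.parts)


regex_time = timeit.timeit(lambda: strip_regex(SAMPLE), number=200)
parser_time = timeit.timeit(lambda: strip_parser(SAMPLE), number=200)
print(f"regex:  {regex_time:.4f}s")
print(f"parser: {parser_time:.4f}s")
```

For anything beyond a first impression, `pyperf` would give more stable measurements than a single `timeit` run.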

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:24>

Django

Feb 8, 2023, 2:53:16 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 1 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by David Smith):

* has_patch: 0 => 1

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:25>

Django

Mar 8, 2023, 5:54:20 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 1

Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by Carlton Gibson):

* needs_better_patch: 0 => 1

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:26>

Django

Jul 14, 2023, 3:57:22 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by David Smith):

* needs_better_patch: 1 => 0

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:27>

Django

Jul 14, 2023, 5:24:08 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Comment (by Mariusz Felisiak <felisiak.mariusz@…>):

In [changeset:"1d0dfc0b920916900d2770f3b5640da08af36d97" 1d0dfc0]:
{{{
#!CommitTicketReference repository=""
revision="1d0dfc0b920916900d2770f3b5640da08af36d97"
Refs #30686 -- Moved Parser.SELF_CLOSING_TAGS to
django.utils.html.VOID_ELEMENTS
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:29>

Django

Jul 14, 2023, 5:24:09 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Comment (by Mariusz Felisiak <felisiak.mariusz@…>):

In [changeset:"6f1b8c00d8151059cc1902a92f067bbdc1673779" 6f1b8c00]:
{{{
#!CommitTicketReference repository=""
revision="6f1b8c00d8151059cc1902a92f067bbdc1673779"
Refs #30686 -- Moved add_truncation_text() helper to a module level.
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:28>

Django

Jul 16, 2023, 1:30:29 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 1
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by Mariusz Felisiak):

* needs_better_patch: 0 => 1

Comment:

Based on
[https://github.com/django/django/pull/16421#discussion_r1263859265
comment].

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:30>

Django

Aug 11, 2023, 12:02:34 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by Sage Abdullah):

* needs_better_patch: 1 => 0

Comment:

The PR seems to have been rebased and updated since the last update on
this ticket. Unsetting "Patch needs improvement" if that's OK.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:31>

Django

Aug 11, 2023, 12:04:54 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 1
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------
Changes (by Sage Abdullah):

* needs_better_patch: 0 => 1

Comment:

Oops, sorry, should've looked more closely. The
[https://github.com/django/django/pull/16421#discussion_r1263859265
comment linked above] hasn't been resolved.

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:32>

Django

Feb 3, 2024, 10:51:13 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Changes (by David Smith):

* needs_better_patch: 1 => 0

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:33>

Django

Feb 6, 2024, 2:13:17 PM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------

Comment (by Mariusz Felisiak <felisiak.mariusz@…>):

In [changeset:"48a469395191e87d3b84ad35bae2c8b53d91ed61" 48a46939]:
{{{#!CommitTicketReference repository=""
revision="48a469395191e87d3b84ad35bae2c8b53d91ed61"
Refs #30686 -- Improved test coverage of Truncator.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:34>

Django

Feb 7, 2024, 2:12:06 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+---------------------------------------
Reporter: Thomas Hooper | Owner: David Smith
Type: Bug | Status: assigned
Component: Utilities | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------+---------------------------------------
Comment (by Mariusz Felisiak <felisiak.mariusz@…>):

In [changeset:"70f39e46f86b946c273340d52109824c776ffb4c" 70f39e4]:
{{{#!CommitTicketReference repository=""
revision="70f39e46f86b946c273340d52109824c776ffb4c"
Refs #30686 -- Fixed text truncation for negative or zero lengths.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:35>

Django

Feb 7, 2024, 3:47:26 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.

-------------------------------------+-------------------------------------
Keywords: | Triage Stage: Ready for
| checkin
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Mariusz Felisiak):

* stage: Accepted => Ready for checkin

--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:36>

Django

Feb 7, 2024, 4:58:17 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------------+-------------------------------------
Reporter: Thomas Hooper | Owner: David
| Smith
Type: Bug | Status: closed
Component: Utilities | Version: dev
Severity: Normal | Resolution: fixed

Changes (by Mariusz Felisiak <felisiak.mariusz@…>):

* status: assigned => closed
* resolution: => fixed

Comment:

In [changeset:"6ee37ada3241ed263d8d1c2901b030d964cbd161" 6ee37ad]:
{{{#!CommitTicketReference repository=""
revision="6ee37ada3241ed263d8d1c2901b030d964cbd161"
Fixed #30686 -- Used Python HTMLParser in utils.text.Truncator.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:37>
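
For readers following the thread, here is a minimal sketch of why a parser-based approach avoids the original report's 'some text &qu' output. It is not the code from the changeset above: the names `TruncatingParser`, `truncate_html_chars` and the local `VOID_ELEMENTS` set are hypothetical, and the "…" suffix is an assumption. The only library calls used are the standard library's `html.parser.HTMLParser` and `html.escape`. Character references are decoded before counting and re-escaped on output, so an entity is either emitted whole or not at all.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical local copy of the HTML void elements (tags with no closing tag).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
})


class TruncatingParser(HTMLParser):
    """Collect output until `limit` text characters have been emitted."""

    def __init__(self, limit):
        # convert_charrefs=True hands handle_data() decoded text,
        # so "&quot;" counts as one character and cannot be split.
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        # Only well-nested closing tags are handled in this sketch.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        chunk = data[: self.remaining]
        # Re-escape so that entities are written out whole.
        self.out.append(escape(chunk))
        self.remaining -= len(chunk)
        if len(data) > len(chunk):
            self.out.append("…")  # assumed truncation marker
            self.done = True


def truncate_html_chars(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Close whatever was still open at the truncation point.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html_chars("<p>some text &quot;quoted&quot;</p>", 11))
```

The sketch ignores comments, doctypes and badly nested markup, and it counts only text characters toward the limit.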

Django

Feb 15, 2024, 2:39:24 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------------+-------------------------------------
Reporter: Thomas Hooper | Owner: David
| Smith
Type: Bug | Status: closed
Component: Utilities | Version: dev
Severity: Normal | Resolution: fixed
Keywords: | Triage Stage: Ready for
| checkin
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by GitHub <noreply@…>):

In [changeset:"3cadeea077a98367a4ed344d645df0aff243de91" 3cadeea]:
{{{#!CommitTicketReference repository=""
revision="3cadeea077a98367a4ed344d645df0aff243de91"
Refs #30686 -- Removed unused regexes in django.utils.text.

Unused since 6ee37ada3241ed263d8d1c2901b030d964cbd161.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:38>

Django

Mar 14, 2024, 12:56:31 AM

to django-...@googlegroups.com

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------------+-------------------------------------
Reporter: Thomas Hooper | Owner: David
| Smith
Type: Bug | Status: closed
Component: Utilities | Version: dev
Severity: Normal | Resolution: fixed
Keywords: | Triage Stage: Ready for
| checkin
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Comment (by GitHub <noreply@…>):

In [changeset:"95ae37839c907d7d030f1387a003a5776593d7d7" 95ae3783]:
{{{#!CommitTicketReference repository=""
revision="95ae37839c907d7d030f1387a003a5776593d7d7"
Refs #30686 -- Made django.utils.html.VOID_ELEMENTS a frozenset.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:39>
