For example:
In [2]: strip_tags('Some <p
onclick="$(\'#id\').html(\'<i>input</i>\')">text</p> here')
Out[2]: u'Some input\')">text here'
Should return "Some text here"
Already patched it with tests in my github repo - khoomeister/django but
would like to go through the proper processes!
Chris
--
Ticket URL: <https://code.djangoproject.com/ticket/19237>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:1>
* status: new => assigned
* owner: nobody => anonymous
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:2>
* owner: anonymous => Chris Khoo <chris.khoo@…>
* status: assigned => new
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:3>
* owner: Chris Khoo <chris.khoo@…> => anonymous
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:4>
* owner: anonymous => khoomeister
* status: assigned => new
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:5>
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:6>
Comment (by khoomeister):
Doh - sorry, this is my first time using this!
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:7>
* type: Uncategorized => Bug
* component: Uncategorized => Core (Other)
* stage: Unreviewed => Accepted
Comment:
Link to commit:
https://github.com/khoomeister/django/commit/605277ee32f60689e9a82a2e99767aef449c98ec
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:8>
Comment (by khoomeister):
Current code only works with Python 2.7 - updated it to work with Python
2.5 & 2.6
Link to commit:
https://github.com/khoomeister/django/commit/da1ba5fd79c28f0643a1eb0ce18feb51dfe4a499
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:9>
* cc: thiderman (added)
Comment:
I can confirm that this works and the tests pass on Python 2.6.5.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:10>
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:11>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"bf1871d874a371ad0ae6c7e098e7665a468dca16"]:
{{{
#!CommitTicketReference repository=""
revision="bf1871d874a371ad0ae6c7e098e7665a468dca16"
Fixed #19237 -- Improved strip_tags utility
The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:12>
Comment (by Claude Paroz <claude@…>):
In [changeset:"9efe1a7210ee161d5688f66a759bcd8d89d33142"]:
{{{
#!CommitTicketReference repository=""
revision="9efe1a7210ee161d5688f66a759bcd8d89d33142"
[1.5.x] Fixed #19237 -- Improved strip_tags utility
The previous pattern didn't properly addressed cases where '>'
was present inside quoted tag content.
Backport of bf1871d87 from master.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:13>
* status: closed => new
* resolution: fixed =>
Comment:
I just ran out a really basic case where this fails:
{{{
>>> from django.utils.html import strip_tags
>>> strip_tags('<strong>foo</strong><a href="http://example.com">bar</a>')
u'bar'
}}}
But the result should be 'foobar'. I think this should be fixed for the
1.5 release, should I mark it as blocker?
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:14>
* needs_better_patch: 0 => 1
* severity: Normal => Release blocker
* easy: 1 => 0
* stage: Ready for checkin => Accepted
Comment:
Yes, either we quickly find the flaw in the new regex or we should revert
the commit.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:15>
* needs_better_patch: 1 => 0
Comment:
I may have found it. Regex gurus needed :-)
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:16>
Comment (by Pablo Recio <pablo.recioquijano@…>):
I'm not a regex guru, but it seems good for me. Although, I feel that this
stuff should be either reverted or adding a really big batch of tests for
releasing this into 1.5. Just my humble thought.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:17>
Comment (by claudep):
It seems the current agreement among Django devs is to revert the change
for 1.5 and switch to a non-regex technique for 1.6 (or maybe fallback to
regex when content is not valid HTML).
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:18>
Comment (by Claude Paroz <claude@…>):
In [changeset:"20ac33100cd20c17d1ceab075d96698974cc4778"]:
{{{
#!CommitTicketReference repository=""
revision="20ac33100cd20c17d1ceab075d96698974cc4778"
Partially revert 9efe1a721, strip_tags improvements
The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:19>
Comment (by Claude Paroz <claude@…>):
In [changeset:"d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e"]:
{{{
#!CommitTicketReference repository=""
revision="d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e"
Improved regex in strip_tags
Thanks Pablo Recio for the report. Refs #19237.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:20>
* severity: Release blocker => Normal
Comment:
So I reverted the fix in 1.5, and commit my proposed fix in 1.6.
Surely, the next step is to provide more tests. I have no religion about
the use of regex or an HTML parser to strip tags, only tests will convince
me!
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:21>
Comment (by Claude Paroz <claude@…>):
In [changeset:"20ac33100cd20c17d1ceab075d96698974cc4778"]:
{{{
#!CommitTicketReference repository=""
revision="20ac33100cd20c17d1ceab075d96698974cc4778"
Partially revert 9efe1a721, strip_tags improvements
The new regex seems not stable enough for being released. Stripping
with regex might need reevaluation for the next release.
Refs #19237.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:22>
Comment (by jacob):
FTR, the new regex committed in bf1871d874a371ad0ae6c7e098e7665a468dca16
has severe performance issues on longer strings, please consider this a
veto on re-committing it.
We should probably avoid using regexes here entirely; I think HTMLParser
might be a better choice.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:23>
Comment (by jacob):
As does the "improved" version in
d7504a3d7b8645bdb979bab7ded0e9a9b6dccd0e; both are broken on larger
strings.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:24>
* severity: Normal => Release blocker
Comment:
Reading the comment, it isn't clear to me that the regex currently used in
master doesn't suffer from performance issues. Marking as a release
blocker until this is sorted out.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:26>
Comment (by Claude Paroz <claude@…>):
In [changeset:"a01361b5ae9df5220c0ce7b42c5ff36094d5718c"]:
{{{
#!CommitTicketReference repository=""
revision="a01361b5ae9df5220c0ce7b42c5ff36094d5718c"
Added more tests for strip_tags utility
Refs #19237.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:27>
Comment (by claudep):
I just committed tests that show the current implementation has apparently
no performance issues on longer content. Anyone is still free to try a
parser-based implementation, of course!
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:28>
Comment (by simon29):
Finally, after nearly two weeks I've found the reason why my servers kept
crashing on a high traffic, multi server production site :-(
This new regex is prone to catastrophic backtracking; which, when given
the right string, spins the CPU into an infinite loop. Performance issues
is an understatement. Sorry, but this regex literally crashes sites and
absolutely cannot be included.
http://www.regular-expressions.info/catastrophic.html
http://stackoverflow.com/questions/6230489/100-cpu-usage-with-a-regexp-
depending-on-input-length
We need to revert this change until we have a parser-based implementation.
I'll attach a sample "problem string" for y'all to play with.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:29>
* cc: simon@… (added)
Comment:
I've attached a simple patch using HTMLParser from stdlib. It's the only
way, as any regex based solution won't properly support multiple attribute
tags anyway.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:30>
* easy: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:31>
Comment (by claudep):
I wonder which code version you are using, because in your attached test
(problem_string.py), you are still using the regex which was clearly
broken and which has been fixed in [d7504a3d7b8645].
Thanks for your `HTMLParser`-based version, I will test it ASAP.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:32>
* needs_better_patch: 0 => 1
* easy: 1 => 0
Comment:
I just posted an improved version of Simon's patch. So far, I found one
important different behaviour with the HTMLParser-based solution:
unterminated/not-escaped `<` are removed with any following content. I'm
not sure that this is an acceptable regression.
Tests with both file-based content (84Kb and 8Kb) show that the parsing
version is somewhat slower than the regex one, but I think this remains
acceptable.
And FWIW, I'm still waiting to receive a string which really fails with
the current regex.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:33>
* needs_better_patch: 1 => 0
Comment:
I think I found a solution for unterminated/not-escaped `<` in that latest
patch.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:34>
Comment (by simon29):
My mistake, I had the previous regex. The new one works flawlessly; and
yes, a compiled "C" regex will always outperform a python parser. Unless
we can find a string to break the new one I'm now leaning in favour of the
regex solution.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:35>
* stage: Accepted => Ready for checkin
Comment:
The patch looks good to me.
Even though no one managed to break the current regex yet, it's a fragile
technique, and bugs could easily escalate into security issues. So let's
play safe!
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:36>
* status: new => closed
* resolution: => fixed
Comment:
In [changeset:"dc51ec8bc214cf60ebb99732363624c23df8005f"]:
{{{
#!CommitTicketReference repository=""
revision="dc51ec8bc214cf60ebb99732363624c23df8005f"
Fixed #19237 -- Used HTML parser to strip tags
The regex method used until now for the strip_tags utility is fast,
but subject to flaws and security issues. Consensus and good
practice lead use to use a slower but safer method.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:37>
* status: closed => new
* resolution: fixed =>
Comment:
HTMLParser behaviour changed between 2.6-2.7, 3.2-3.3. So now we have
probably no other choice than ignoring content after any unclosed tag.
That means the behaviour is changed comparing to the regex-based solution.
I'm joining the proposed fix.
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:38>
Comment (by claudep):
Another patch after discussing with Anssi on IRC. Backwards-incompatible
due to `<` and `>` encoding, but still safer and less risk of content
disappearing...
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:39>
* status: new => closed
* resolution: => fixed
Comment:
In [changeset:"b664cb818d2e5896df2763299ea2c61a9af069a8"]:
{{{
#!CommitTicketReference repository=""
revision="b664cb818d2e5896df2763299ea2c61a9af069a8"
Fixed #19237 (again) - Made strip_tags consistent between Python versions
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19237#comment:40>