A critique of cgi.escape

Lawrence D'Oliveiro

Sep 23, 2006, 8:00:16 AM

The "escape" function in the "cgi" module escapes characters with special
meanings in HTML. The ones that need escaping are '<', '&' and '"'.
However, cgi.escape only escapes the quote character if you pass a second
argument of True (the default is False):

>>> cgi.escape("the \"quick\" & <brown> fox")
'the "quick" &amp; &lt;brown&gt; fox'
>>> cgi.escape("the \"quick\" & <brown> fox", True)
'the &quot;quick&quot; &amp; &lt;brown&gt; fox'

This seems to me to be dumb. The default option should be the safe one: that
is, escape _all_ the potentially troublesome characters. The only time you
can get away with NOT escaping the quote character is outside of markup,
e.g.

<TEXTAREA>
unescaped "quotes" allowed here
</TEXTAREA>

Nevertheless, even in that situation, escaped quotes are acceptable.
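To make the difference between the two contexts concrete, here is a minimal sketch. The escape() function below is a hand-written stand-in that mirrors the documented behaviour of cgi.escape (an assumption for illustration, not the library source), and the input string is made up.

```python
def escape(s, quote=False):
    """Stand-in for cgi.escape as documented: &, <, > always; " only on request."""
    s = s.replace("&", "&amp;")  # must be done first
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

user_input = 'x" onmouseover="alert(1)'

# Element content: a literal quote cannot end anything here.
in_text = "<TEXTAREA>%s</TEXTAREA>" % escape(user_input)

# Attribute value without quote escaping: the value ends at the first quote.
unsafe_attr = '<INPUT VALUE="%s">' % escape(user_input)

# Attribute value with quote escaping: the whole string stays inside VALUE.
safe_attr = '<INPUT VALUE="%s">' % escape(user_input, True)

print(in_text)
print(unsafe_attr)
print(safe_attr)
```

The second line printed contains a stray onmouseover attribute; the third does not.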

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

Can changing the default break existing scripts? I don't see how. It might
even fix a few lurking bugs out there.

Fredrik Lundh

Sep 23, 2006, 2:19:19 PM

to pytho...@python.org

Lawrence D'Oliveiro wrote:

> So I think the default for the second argument to cgi.escape should be
> changed to True. Or alternatively, the second argument should be removed
> altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.

> Can changing the default break existing scripts? I don't see how. It might
> even fix a few lurking bugs out there.

I'm not sure this "every time I don't immediately understand something,
I'll write a change proposal instead of reading the library reference"
approach is healthy, really.

</F>

Lawrence D'Oliveiro

Sep 23, 2006, 6:41:02 PM

In message <mailman.499.11590355...@python.org>, Fredrik
Lundh wrote:

> Lawrence D'Oliveiro wrote:
>
>> So I think the default for the second argument to cgi.escape should be
>> changed to True. Or alternatively, the second argument should be removed
>> altogether, and quotes should always be escaped.
>
> you're confused: cgi.escape(s) is designed to be used for ordinary text,
> cgi.escape(s, True) is designed for attributes.

What works for attributes also works for ordinary text.

Jon Ribbens

Sep 23, 2006, 10:28:17 PM

In article <mailman.499.11590355...@python.org>, Fredrik Lundh wrote:
> Lawrence D'Oliveiro wrote:
>> So I think the default for the second argument to cgi.escape should be
>> changed to True. Or alternatively, the second argument should be removed
>> altogether, and quotes should always be escaped.
>
> you're confused: cgi.escape(s) is designed to be used for ordinary text,
> cgi.escape(s, True) is designed for attributes. if you use the code the
> way it's intended to be used, it works perfectly fine.

He's not confused, he's correct; the author of cgi.escape is the
confused one. The optional extra parameter is completely unnecessary
and achieves nothing except to make it easier for people to end up
with bugs in their code.

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character. This is essentially
identical to the quote character in HTML, so any code which is escaping
one should always be escaping the other.
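A short sketch of the apostrophe problem, assuming a stand-in escape() that mirrors the documented cgi.escape behaviour (not the library source) and a made-up attribute value:

```python
def escape(s, quote=False):
    """Stand-in for cgi.escape as documented: never touches the apostrophe."""
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

name = "O'Reilly"

# Double-quoted attribute: the apostrophe is harmless.
double_quoted = '<a title="%s">' % escape(name, True)

# Single-quoted attribute: the unescaped apostrophe ends the value early.
single_quoted = "<a title='%s'>" % escape(name, True)

print(double_quoted)
print(single_quoted)
```

The second tag contains three apostrophes, so a parser sees the title as just "O".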

Lawrence D'Oliveiro

Sep 24, 2006, 12:49:22 AM

In message <slrnehbra1.k...@snowy.squish.net>, Jon Ribbens wrote:

> In article <mailman.499.11590355...@python.org>, Fredrik
> Lundh wrote:
>> Lawrence D'Oliveiro wrote:
>>>
>>> So I think the default for the second argument to cgi.escape should be
>>> changed to True. Or alternatively, the second argument should be removed
>>> altogether, and quotes should always be escaped.
>>
>> you're confused: cgi.escape(s) is designed to be used for ordinary text,
>> cgi.escape(s, True) is designed for attributes. if you use the code the
>> way it's intended to be used, it works perfectly fine.
>
> He's not confused, he's correct; the author of cgi.escape is the
> confused one.

Thanks for backing me up. :)

> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character. This is essentially
> identical to the quote character in HTML, so any code which escaping
> one should always be escaping the other.

I must confess I did a double-take on this. But I rechecked the HTML spec
(HTML 4.0, section 3.2.2, "Attributes"), and you're right--single quotes
ARE allowed as an alternative to double quotes. It's just I've never used
them as quotes. :)

Fredrik Lundh

Sep 24, 2006, 4:38:36 AM

to pytho...@python.org

Lawrence D'Oliveiro wrote:

> What works for attributes also works for ordinary text.

attributes and ordinary text are two different things in HTML and XML.
you're arguing that it's a good idea for *everyone* to bloat down
ordinary text just because you're too lazy to use a piece of code in the
intended way.

</F>

Fredrik Lundh

Sep 24, 2006, 4:48:54 AM

to pytho...@python.org

Jon Ribbens wrote:

> Making cgi.escape always escape the '"' character would not break
> anything, and would probably fix a few bugs in existing code. Yes,
> those bugs are not cgi.escape's fault, but that's no reason not to
> be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.

> One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character.

it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes. again, punishing people who
actually read the docs and understand them is not a very good way to
maintain software.

btw, you're both missing that cgi.escape isn't good enough for general
use anyway, since it doesn't deal with encodings at all. if you want a
general purpose function that can be used for everything that can be put
in an HTML file, you need more than just a modified cgi.escape. feel
free to propose a general-purpose replacement (which should have a new
name), but make sure you think through *all* the issues before you do that.

</F>

Lawrence D'Oliveiro

Sep 24, 2006, 6:07:26 AM

In message <mailman.518.11590877...@python.org>, Fredrik
Lundh wrote:

> Jon Ribbens wrote:
>
>> Making cgi.escape always escape the '"' character would not break
>> anything, and would probably fix a few bugs in existing code. Yes,
>> those bugs are not cgi.escape's fault, but that's no reason not to
>> be helpful. It's a minor improvement with no downside.
>
> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way, and will also
> break unit tests.

I don't understand this "bloat down" nonsense. Any tests that would break
are obviously testing the wrong thing.

> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> > does not encode the apostrophe (') character.
>
> it's intentional, of course: you're supposed to use " if you're using
> cgi.escape(s, True) to escape attributes.

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway, since it doesn't deal with encodings at all.

Why does it need to?

Fredrik Lundh

Sep 24, 2006, 6:35:32 AM

to pytho...@python.org

Lawrence D'Oliveiro wrote:

> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.

do you ever think before you post?

</F>

Georg Brandl

Sep 24, 2006, 6:41:14 AM

Lawrence D'Oliveiro wrote:
> In message <mailman.518.11590877...@python.org>, Fredrik
> Lundh wrote:
>
>> Jon Ribbens wrote:
>>
>>> Making cgi.escape always escape the '"' character would not break
>>> anything, and would probably fix a few bugs in existing code. Yes,
>>> those bugs are not cgi.escape's fault, but that's no reason not to
>>> be helpful. It's a minor improvement with no downside.
>>
>> the "improvement with no downside" would bloat down the output for
>> everyone who's using the function in the intended way, and will also
>> break unit tests.
>
> I don't understand this "bloat down" nonsense. Any tests that would break
> are obviously testing the wrong thing.

&quot; is 5 characters more than ".

>> > One thing that is flat-out wrong, by the way, is that cgi.escape()
>> > does not encode the apostrophe (') character.
>>
>> it's intentional, of course: you're supposed to use " if you're using
>> cgi.escape(s, True) to escape attributes.
>
> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.

A function is broken if its implementation doesn't match the documentation.

As a courtesy, I've pasted it below.

escape(s[, quote])
Convert the characters "&", "<" and ">" in string s to HTML-safe sequences.
Use this if you need to display text that might contain such characters in HTML.
If the optional flag quote is true, the quotation mark character (""") is also
translated; this helps for inclusion in an HTML attribute value, as in <A
HREF="...">. If the value to be quoted might include single- or double-quote
characters, or both, consider using the quoteattr() function in the
xml.sax.saxutils module instead.
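For reference, a small sketch of the quoteattr() function the documentation points to. Unlike cgi.escape, it returns the value together with its delimiters and picks the delimiter based on the content. The sample values are made up.

```python
from xml.sax.saxutils import quoteattr

samples = ['plain', 'say "hi"', "O'Reilly", 'both " and \'']

for value in samples:
    # quoteattr() supplies the surrounding quotes itself,
    # so the template must not add its own.
    print("<A TITLE=%s>" % quoteattr(value))
```

A value containing only double quotes is wrapped in single quotes; a value containing both kinds is wrapped in double quotes with the inner double quotes replaced by &quot;.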

Now, do you still think cgi.escape is broken?

Georg

Fredrik Lundh

Sep 24, 2006, 6:53:32 AM

to pytho...@python.org

Georg Brandl wrote:

> A function is broken if its implementation doesn't match the documentation.

or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...

</F>

Jon Ribbens

Sep 24, 2006, 8:17:20 PM

In article <ef5ncc$uus$1...@news.albasani.net>, Georg Brandl wrote:
>> Attributes can be quoted with either single or double quotes. That's what
>> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
>> cgi.escape is broken. QED.
>
> A function is broken if its implementation doesn't match the documentation.

Or if the design, as described in the documentation, is flawed in some
way.

> As a courtesy, I've pasted it below.
>

[...]

>
> Now, do you still think cgi.escape is broken?

Yes.

Jon Ribbens

Sep 24, 2006, 8:50:23 PM

In article <mailman.518.11590877...@python.org>, Fredrik Lundh wrote:
>> Making cgi.escape always escape the '"' character would not break
>> anything, and would probably fix a few bugs in existing code. Yes,
>> those bugs are not cgi.escape's fault, but that's no reason not to
>> be helpful. It's a minor improvement with no downside.
>
> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way,

By a minuscule degree. That is a very weak argument by any standard.

> and will also break unit tests.

Er, so change the unit tests at the same time?

> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> > does not encode the apostrophe (') character.
>
> it's intentional, of course:

I noticed. That doesn't mean it isn't wrong.

> you're supposed to use " if you're using cgi.escape(s, True) to
> escape attributes. again, punishing people who actually read the
> docs and understand them is not a very good way to maintain
> software.

In what way is anyone being "punished"? Deliberately retaining flaws
and misfeatures that can easily be fixed without damaging
backwards-compatibility is not a very good way to maintain software
either.

> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway,

I'm sorry, I didn't realise this was a general thread about any and
all inadequacies of Python's cgi module.

> since it doesn't deal with encodings at all.

Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.

By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.

Lawrence D'Oliveiro

Sep 25, 2006, 3:04:41 AM

In message <mailman.524.11590953...@python.org>, Fredrik
Lundh wrote:

_We_ certainly have noticed it.

Fredrik Lundh

Sep 25, 2006, 8:41:37 AM

to pytho...@python.org

Lawrence D'Oliveiro wrote:

>> Georg Brandl wrote:
>>
>>> A function is broken if its implementation doesn't match the
>>> documentation.
>>
>> or if it doesn't match the designer's intent. cgi.escape is old enough
>> that we would have noticed that, by now...
>
> _We_ certainly have noticed it.

you're not the designer, you're just some random guy who thinks that if you
don't understand something at first, it has to be changed, even if that change
would break things for others. maybe you haven't done software long enough
to understand that software works better if you use it the way it was intended
to be used, but that's no excuse for being stupid.

</F>

Fredrik Lundh

Sep 25, 2006, 8:43:52 AM

to pytho...@python.org

Jon Ribbens wrote:

> Or if the design, as described in the documentation, is flawed in some
> way.

it does exactly what it says, and is perfectly usable as is, if you bother to
use it the way it was intended to be used.

(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).

</F>

Fredrik Lundh

Sep 25, 2006, 9:16:06 AM

to pytho...@python.org

Jon Ribbens wrote:

>> since it doesn't deal with encodings at all.
>
> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.

If you're really serious about making things easier to use, shouldn't
you look at the whole picture? HTML documents are byte streams, so
any transformation from internal character data to HTML must take both
escaping and encoding into account. If you and Lawrence have a hard
time remembering how to use the existing cgi.escape function, despite
its utter simplicity, surely it would make your life even easier if
there was an alternative API that would handle both the easy part
(escaping) and the hard part (encoding) ?

> By the way, if you could try and put across your proposed arguments as
> to why you don't favour this suggested change without the insults and
> general rudeness, it would be appreciated.

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.

</F>

Jon Ribbens

Sep 25, 2006, 9:30:52 AM

In article <mailman.559.11591881...@python.org>, Fredrik Lundh wrote:
> maybe you haven't done software long enough to understand that
> software works better if you use it the way it was intended to be
> used, but that's no excuse for being stupid.

So what's your excuse?

Jon Ribbens

Sep 25, 2006, 9:46:02 AM

In article <mailman.563.11591903...@python.org>, Fredrik Lundh wrote:
> If you're really serious about making things easier to use, shouldn't
> you look at the whole picture? HTML documents are byte streams, so
> any transformation from internal character data to HTML must take both
> escaping and encoding into account.

Ever heard of modular programming? I would suggest that you do indeed
take a step back and look at the whole picture - it's the whole
picture that needs to take escaping and encoding into account. There's
nothing to say that cgi.escape should take them both into account in
the one function, and in fact as you yourself have already commented,
good reasons for it not to, in that it would make it excessively
complicated.

> If you and Lawrence have a hard time remembering how to use the
> existing cgi.escape function, despite its utter simplicity, surely
> it would make your life even easier if there was an alternative API
> that would handle both the easy part (escaping) and the hard part
> (encoding) ?

You seem to be arguing that because, in an ideal world, it would be
better to throw away the 'cgi' module completely and start again, it
is not worth making minor improvements in what we already have.
I would suggest that this is, to put it mildly, not a good argument.

> I've already explained that, but since you're convinced that your use
> case is more important than other use cases, and you don't care about
> things like stability and respect for existing users of an API, nor
> the cost for others to update their code and unit tests, I don't see
> much need to repeat myself.

You are merely compounding your bad manners. All of your above
allegations are outright lies. I am not sure if you are simply not
understanding the simple points I am making, or are deliberately
trying to mislead people for some bizarre reason of your own.

> Breaking things just because you think you can simply isn't the
> Python way of doing things.

Your hyperbole is growing more extravagant. To begin with, you were
claiming that the suggested change would make things (minusculely)
less efficient, now you're claiming it will "break" unspecified
things. What precisely do you think it would "break"?

Duncan Booth

Sep 25, 2006, 9:50:07 AM

Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:
>> and will also break unit tests.
>
> Er, so change the unit tests at the same time?

It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would probably
break many tests for a large number of people.

If the change were seen as a good thing, then a backwards compatible change
(e.g. introducing a function with a different name) might be considered,
but if so it should address the whole issue: the current lack of support
for encodings is IMHO a far bigger problem than whether or not a quote mark is
escaped.

> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.

If I have a unicode string such as: u'\u201d' (right double quote), then I
want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
is better). For many purposes I could just encode it in the encoding to be
used for the page, typically latin1 or utf8, but sometimes that isn't
possible e.g. if you don't know the encoding at the point when you produce
the string, or if there is no translation for the character in the desired
encoding. The character reference will work whatever encoding is used for
the page.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;
at present I need to call both cgi.escape and s.encode to get the desired
effect.
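The two-call approach described above might look like the following in Python 3, where every str is already Unicode. This is only a sketch: escape() is a hand-written stand-in for the documented cgi.escape behaviour, and escape_to_ascii is a hypothetical helper name.

```python
def escape(s, quote=False):
    """Stand-in for cgi.escape as documented."""
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

def escape_to_ascii(text, quote=False):
    """Hypothetical helper: escape markup characters first, then turn every
    non-ASCII character into a numeric character reference while encoding."""
    return escape(text, quote).encode("ascii", "xmlcharrefreplace")

result = escape_to_ascii("<b>\u201d")
print(result)
```

Markup escaping has to run before the encode step; otherwise the "&" of each character reference would itself be escaped. Note that the result is a byte string, which is the side effect being complained about.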

Jon Ribbens

Sep 25, 2006, 9:54:16 AM

In article <mailman.561.11591883...@python.org>, Fredrik Lundh wrote:
> (still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
> easier to piss on others than to actually contribute something useful).

Well, yes, you certainly seem to be good at the "pissing on others"
part, even if you have to lie to do it. You have had the "enhanced
escape" proposal all along - it was the post which started this
thread! If you are referring to your strawman argument about
encodings, you have yet to show that it's relevant.

If it'll make you any happier, here's the code for the 'cgi.escape'
equivalent that I usually use:

import re

# Characters to replace, and the entity each one maps to
_html_encre = re.compile("[&<>\"'+]")
_html_encodes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;",
                  "'": "&#39;", "+": "&#43;" }

def html_encode(raw):
    return re.sub(_html_encre, lambda m: _html_encodes[m.group(0)], raw)

Max M

Sep 25, 2006, 10:00:45 AM

Fredrik Lundh skrev:
> Jon Ribbens wrote:

>> By the way, if you could try and put across your proposed arguments as
>> to why you don't favour this suggested change without the insults and
>> general rudeness, it would be appreciated.
>
> I've already explained that, but since you're convinced that your use
> case is more important than other use cases, and you don't care about
> things like stability and respect for existing users of an API, nor
> the cost for others to update their code and unit tests, I don't see
> much need to repeat myself. Breaking things just because you think
> you can simply isn't the Python way of doing things.

This thread is highly entertaining but perhaps not that productive.

Lawrence is right that the escape method doesn't work the way he expects
it to.

Rewriting a library module simply because a developer is surprised is a
*very* bad idea. It would break just about every web app out there that
uses the escape module and uses testing. Which is probably most of them.
That could mean several man years of wasted time. It also makes the
escaped html harder to read for standard cases.

Fredrik is right that doing so is utterly ... well let us call it
"unproductive". Stupid is such a harsh word ;-)

Whether someone finds the bloat minuscule and thus a small enough change
to warrant the rewrite does not really matter.

Lawrence is free to write a wrapper and use that instead.

my_escape = lambda st: cgi.escape(st, 1)

So. Lawrence is happy, and the escape works as expected. Several man
years have been saved.
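The same wrapper can also be written with functools.partial, which gives it a proper name. A minimal sketch, assuming a stand-in escape() that mirrors the documented cgi.escape behaviour; escape_attr is a made-up name.

```python
import functools

def escape(s, quote=False):
    """Stand-in for cgi.escape as documented."""
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

# Always escape the double quote, without touching the library function.
escape_attr = functools.partial(escape, quote=True)

print(escape_attr('the "quick" & <brown> fox'))
```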

Max M

Fredrik Lundh

Sep 25, 2006, 10:00:35 AM

to pytho...@python.org

Jon Ribbens wrote:

> There's nothing to say that cgi.escape should take them both into account
> in the one function

so what exactly are you using cgi.escape for in your code ?

> What precisely do you think it would "break"?

existing code, and existing tests.

</F>

Jon Ribbens

Sep 25, 2006, 10:05:26 AM

In article <Xns984996E6BA...@127.0.0.1>, Duncan Booth wrote:
> It is generally a principle of Python that new releases maintain backward
> compatibility. An incompatible change such as the one proposed here would probably
> break many tests for a large number of people.

Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.

> There should be a one-stop shop where I can take my unicode text and
> convert it into something I can safely insert into a generated html page;

I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.
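A sketch of the separation argued for here, written for Python 3 and assuming a stand-in escape() that mirrors the documented cgi.escape behaviour. The sample text and the choice of UTF-8 are made up for illustration.

```python
def escape(s, quote=False):
    """Stand-in for cgi.escape as documented."""
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

# Step 1: escaping works on character strings and knows nothing about bytes.
body = "<p>%s</p>" % escape('caf\u00e9 & "bar" <b>')

# Step 2: encoding happens once, at the point where the page is sent out.
payload = body.encode("utf-8")

print(body)
print(payload)
```

Swapping UTF-8 for another output encoding only changes step 2; the escaping code is untouched.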

Jon Ribbens

Sep 25, 2006, 10:08:23 AM

In article <mailman.569.11591928...@python.org>, Fredrik Lundh wrote:
>> There's nothing to say that cgi.escape should take them both into account
>> in the one function
>
> so what exactly are you using cgi.escape for in your code ?

To escape characters so that they will be treated as character data
and not control characters in HTML.

>> What precisely do you think it would "break"?
>
> existing code, and existing tests.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Fredrik Lundh

Sep 25, 2006, 10:20:52 AM

to pytho...@python.org

Max M wrote:

> It also makes the escaped html harder to read for standard cases.

and slows things down a bit.

(cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
obvious for anyone who's looked at the code).

</F>

Georg Brandl

Sep 25, 2006, 10:24:26 AM

Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Georg

Duncan Booth

Sep 25, 2006, 10:25:51 AM

Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:

> In article <Xns984996E6BA...@127.0.0.1>, Duncan Booth
> wrote:
>> It is generally a principle of Python that new releases maintain
>> backward compatibility. An incompatible change such as the one proposed here
>> would probably break many tests for a large number of people.
>
> Why is the suggested change incompatible? What code would it break?
> I agree that it would be a bad idea if it did indeed break backwards
> compatibility - but it doesn't.

I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.
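The kind of test meant here can be sketched as follows. Both escape functions are hand-written stand-ins (the current documented default and the proposed always-escape-quotes behaviour), and render() is a hypothetical page fragment generator.

```python
def escape_current(s):
    """Stand-in for the documented default: &, < and > only."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_proposed(s):
    """Stand-in for the proposed default: also escape the double quote."""
    return escape_current(s).replace('"', "&quot;")

def render(text, escape):
    """Hypothetical generator for a fragment of a page."""
    return "<p>%s</p>" % escape(text)

# Expected value recorded while the current behaviour was in place.
expected = '<p>He said "hi" &amp; left</p>'

passes_today = render('He said "hi" & left', escape_current) == expected
passes_after_change = render('He said "hi" & left', escape_proposed) == expected

print(passes_today, passes_after_change)
```

The two pages render identically in a browser, but the byte-for-byte comparison no longer holds after the change.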

>
>> There should be a one-stop shop where I can take my unicode text and
>> convert it into something I can safely insert into a generated html
>> page;
>
> I disagree. I think that doing it in one is muddled thinking and
> liable to lead to bugs. Why not keep your output as unicode until it
> is ready to be output to the browser, and encode it as appropriate
> then? Character encoding and character escaping are separate jobs with
> separate requirements that are better off handled by separate code.

Sorry, convert into something I can safely insert wasn't meant to imply
encoding: just entity escaping.

To be clear:

I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.

The resulting string might be a byte string or it might still be unicode:
the point being that the conversion I want is from unescaped to entity
escaped, not from unicode to byte encoded. Right now the only way the
Python library gives me to do the entity escaping properly has a side
effect of encoding the string. I should be able to do the escaping without
having to encode the string at the same time.
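Escaping to character references without encoding could be sketched like this in Python 3. The escape() function is a stand-in for the documented cgi.escape behaviour, and escape_to_charrefs is a hypothetical helper, not anything in the standard library.

```python
def escape(s, quote=False):
    """Stand-in for cgi.escape as documented."""
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

def escape_to_charrefs(text, quote=False):
    """Hypothetical helper: text in, text out. Markup characters become
    entities and non-ASCII characters become numeric character references."""
    escaped = escape(text, quote)
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in escaped)

result = escape_to_charrefs("a & \u201d")
print(result)
```

The result is still a character string, so the choice of output encoding can be deferred.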

Jon Ribbens

Sep 25, 2006, 10:37:01 AM

In article <ef8oqr$9pt$1...@news.albasani.net>, Georg Brandl wrote:
>> I'm sorry, that's not good enough. How, precisely, would it break
>> "existing code"? Can you come up with an example, or even an
>> explanation of how it *could* break existing code?
>
> Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
> code that expects it not to do so would break.

Sorry, that's still not good enough. Why would any code expect such a
thing?

Max M

Sep 25, 2006, 10:48:03 AM

Jon Ribbens skrev:

Some examples are:

- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.

- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.

- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )

Any change in Python that has these consequences will rightfully be
considered a bug. So what you are suggesting is to knowingly introduce a
bug in the standard library!

You are right that the html generated by cgi.escape() would (probably)
have the same visual appearance in the browsers. But that is a *very*
narrow definition of being bug free and not breaking stuff.

If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!

Max M

Jon Ribbens

Sep 25, 2006, 10:50:32 AM

In article <Xns98499CF9DC...@127.0.0.1>, Duncan Booth wrote:
> I guess you've never seen anyone write tests which retrieve some generated
> html and compare it against the expected value. If the page contains any
> unescaped quotes then this change would break it.

You're right - I've never seen anyone do such a thing. It sounds like
a highly dubious and very fragile sort of test to me, of very limited
use.

> I'm talking about encoding certain characters as entity references. It
> doesn't matter whether it's the character ampersand or right double quote,
> they both want to be converted to entities. Same operation.

This is that muddled thinking I was talking about. They are *not* the
same operation. You want to encode "<", for example, because it must
always be encoded to prevent it being treated as an HTML control
character. This has nothing to do with character encodings.

You might sometimes want to escape "right double quote" because it may
or may not be available in the character encoding you are using to output
to the browser. Yes, this might sometimes seem a bit similar to the
"<" escaping described above, because one of the ways you could avoid
the character encoding issue would be to use numeric entities, but it
is actually a completely separate issue and is none of the business of
cgi.escape.

By your argument, cgi.escape should in fact escape *every single*
character as a numeric entity, and even that wouldn't work properly
since "&", "#", ";" and the digits might not be in their usual
positions in the output encoding.

> Right now the only way the Python library gives me to do the entity
> escaping properly has a side effect of encoding the string. I should
> be able to do the escaping without having to encode the string at
> the same time.

I'm getting lost here - the opposite of what you say above is true.
cgi.escape does the escaping properly (modulo failing to escape
quotes) without encoding.

Max M

Sep 25, 2006, 10:59:26 AM

Jon Ribbens skrev:

Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?

This function which is correct by current documentation will be broken
by your change.

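def hasSomeWord(someword):
    import urllib
    # urllib.urlopen is the Python 2 call for fetching a URL
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    # True if the word appears in literal double quotes in the page
    return '"%s"' % someword in content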

You might think that it is stupid code that should be changed to take
escaped quotes into account. But that is really not your business to
decide if the other behaviour is documented and correct.

I find it amazing that you cannot understand this. I will stop replying
in this thread now.

Max M

Jon Ribbens

Sep 25, 2006, 11:02:16 AM

In article <4517ec24$0$13947$edfa...@dread15.news.tele.dk>, Max M wrote:
>> I'm sorry, that's not good enough. How, precisely, would it break
>> "existing code"? Can you come up with an example, or even an
>> explanation of how it *could* break existing code?
>
> Some examples are:
>
> - Possibly any code that tests for string equality in a rendered
> html/xml page. Testing is a preferred development tool these days.

Testing is good, but only if done correctly.

> - Code that generates cgi.escaped() markup and (rightfully) for some
> reason expects the old behaviour to be used.

That's begging the question again ("an example of code that would
break is code that would break").

> - 3. party code that parses/scrapes content from cgi.escaped() markup.
> (you could even break Java code this way :-s )

I'm sorry, I don't understand that one. What is "party code"? Code
that is scraping content from web sites already has to cope with
entities etc.

Your comment about Java is a little ironic given that I persuaded the
Java Struts people to make the exact same change we're talking about
here, back in 2002 (even if it did take 11 months) ;-)

> If you cannot think of other examples for yourself where your change
> would introduce breakage, you are certainly not an experienced enough
> programmer to suggest changes in the standard lib!

I'll take my own opinion on that over yours, thanks.

and-g...@doxdesk.com

unread,

Sep 25, 2006, 11:08:43 AM9/25/06

to

Jon Ribbens wrote:

> I'm sorry, that's not good enough. How, precisely, would it break
> "existing code"?

('owdo Mr. Ribbens!)

It's possible there could be software that relies on ' not being
escaped, for example:

# Auto-markup links to O'Reilly, everyone's favourite
# example name with an apostrophe in it
#
URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)

Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.

A similar change to str.encode('unicode-escape') in Python 2.5 caused a
number of similar subtle problems. (In this case the old documentation
was a bit woolly so didn't prescribe the exact older behaviour.)

I'm not saying that the cgi.escape interface is *good*, just that it's
too late to change it.

I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
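
For anyone wanting to reproduce the breakage described in the post above on a current Python, here is a minimal sketch. It assumes `html.escape` as a stand-in for `cgi.escape` (which no longer exists in recent Python versions), and the sample text and the `link_after_escape` helper are made up for illustration. Note that `html.escape(..., quote=True)` escapes the apostrophe as well as the double quote, which is exactly the change being debated.

```python
import html

URI = 'http://www.oreilly.com/'
text = "Books from O'Reilly & Associates"   # made-up sample input


def link_after_escape(escaped):
    # Same search-and-replace on already-escaped text as in the post above.
    return escaped.replace("O'Reilly", '<a href="%s">O\'Reilly</a>' % URI)


# Apostrophe left alone: the plain-text search still matches.
old_style = link_after_escape(html.escape(text, quote=False))

# Apostrophe escaped to &#x27;: the search finds nothing, no link is added,
# and no error is raised either.
new_style = link_after_escape(html.escape(text, quote=True))

print(old_style)
print(new_style)
```

The second result is still valid HTML; it just silently lacks the link, which is the "hard to detect" kind of failure the post is worried about.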

Jon Ribbens

unread,

Sep 25, 2006, 11:13:30 AM9/25/06

to

In article <4517eecf$0$14036$edfa...@dread15.news.tele.dk>, Max M wrote:
> Oh ... because you cannot see a use case for that *documented*
> behaviour, it must certainly be wrong?

No, but if nobody else can find one either, that's a clue that maybe
it's safe to change.

Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to. Even by your own argument, therefore, code is not
entitled to rely on the output of cgi.escape being any particular
exact string.

> This function, which is correct by the current documentation, will be
> broken by your change.
>
> def hasSomeWord(someword):
>     import urllib
>     f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
>     content = f.read()
>     f.close()
>     return '"%s"' % someword in content

That function is broken already, no change required.

Duncan Booth

unread,

Sep 25, 2006, 11:35:51 AM9/25/06

to

Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:

It's easy enough to come up with examples which might. For example, I
have doctests which evaluate tal expressions. I don't think I currently
have any which depend on quotes, but I can easily create one (I just
did, and it passes):

>>> print T('''<tal:x tal:content="python:'It\\'s a \\x22tal\\x22 string'" />''')
It's a "tal" string
>>> print T('''<x tal:attributes="title python:'It\\'s a \\x22tal\\x22 string'" />''')
<x title="It's a &quot;tal&quot; string" />

More likely I might output a field value and just happen to have used a quote
in it.

FWIW, in zope tal, the value of tal:content is escaped using the equivalent of
cgi.escape(s, False), and attribute values are escaped using
cgi.escape(s, True).

The function T I use is defined as:

def T(template, **kw):
    """Create and render a page template."""
    pt = PageTemplate()
    pt.pt_edit(template, 'text/html')
    return pt.pt_render(extra_context=kw).strip('\n')
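
The content-versus-attribute split described above can be sketched without Zope. This assumes `html.escape` from a current Python as a stand-in for `cgi.escape`; unlike `cgi.escape(s, True)`, it also escapes the apostrophe when `quote=True`.

```python
import html

value = 'It\'s a "tal" string'

# Element content: only &, < and > need escaping, quotes can stay as they are.
as_content = html.escape(value, quote=False)

# Attribute value: quote characters must be escaped too, otherwise the
# first '"' in the data would terminate the attribute.
as_attribute = html.escape(value, quote=True)

print('<x>%s</x>' % as_content)
print('<x title="%s" />' % as_attribute)
```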

Fredrik Lundh

unread,

Sep 25, 2006, 11:45:34 AM9/25/06

to pytho...@python.org

Jon Ribbens wrote:

> Sorry, that's still not good enough.

that's not up to you to decide, though.

</F>

Jon Ribbens

unread,

Sep 25, 2006, 11:49:52 AM9/25/06

to

In article <1159196923.5...@i42g2000cwa.googlegroups.com>, and-g...@doxdesk.com wrote:
>> I'm sorry, that's not good enough. How, precisely, would it break
>> "existing code"?
>
> ('owdo Mr. Ribbens!)

Good afternoon Mr Glover ;-)

> URI= 'http://www.oreilly.com/'
> html= cgi.escape(text)
> html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)
>
> Sure this may be rare, but it's what the documentation says, and
> changing it may not only fix things but also subtly break things in
> ways that are hard to detect.

I'm not sure about "subtly break things", but you're right that the
above code would break. I could argue that it's broken already,
(since it's doing a plain-text search on HTML data) but given
real-world considerations it's reasonable enough that I won't be that
pedantic ;-)

> I personally think the entire function should be deprecated, firstly
> because it's insufficient in some corner cases (apostrophes as you
> pointed out, and XHTML CDATA), and secondly because it's in the wrong
> place: HTML-escaping is nothing to do with the CGI interface. A good
> template library should deal with escaping more smoothly and correctly
> than cgi.escape. (It may be able to deal with escape-or-not-bother and
> character encoding issues automatically, for example.)

I agree that in most situations you should probably be using a
template library, but sometimes a simple CGI-and-manual-HTML system
suffices, and I think (a fixed version of) cgi.escape should exist at
a low level of the web application stack.

Filip Salomonsson

unread,

Sep 25, 2006, 11:51:03 AM9/25/06

to pytho...@python.org

On 25 Sep 2006 15:13:30 GMT, Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:
>
> Here's a point for you - the documentation for cgi.escape says that
> the characters "&", "<" and ">" are converted, but not what they are
> converted to.

If the documentation isn't clear enough, that means the documentation
should be fixed.

It does _not_ mean "you are free to introduce new behavior because
nobody should trust what this function does anyway".
--
filip salomonsson

Jon Ribbens

unread,

Sep 25, 2006, 11:52:22 AM9/25/06

to

In article <mailman.579.11591992...@python.org>, Fredrik Lundh wrote:
>> Sorry, that's still not good enough.
>
> that's not up to you to decide, though.

It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.

Jon Ribbens

unread,

Sep 25, 2006, 11:54:26 AM9/25/06

to

In article <mailman.580.11591994...@python.org>, Filip Salomonsson wrote:
>> Here's a point for you - the documentation for cgi.escape says that
>> the characters "&", "<" and ">" are converted, but not what they are
>> converted to.
>
> If the documentation isn't clear enough, that means the documentation
> should be fixed.

Incorrect - documentation can and frequently does leave certain
behaviours undefined. This is deliberate and (among other things)
is to allow for the behaviour to change in future versions without
breaking backwards-compatibility.

Fredrik Lundh

unread,

Sep 25, 2006, 12:03:18 PM9/25/06

to pytho...@python.org

Jon Ribbens wrote:

> It's up to me to decide whether or not an argument is good enough to
> convince me, thank you very much.

not if you expect anyone to take anything you say seriously.

</F>

Jon Ribbens

unread,

Sep 25, 2006, 12:17:18 PM9/25/06

to

Now you're just being ridiculous. In this thread you have been rude,
evasive, insulting, vague, hypocritical, and have failed to answer
substantive points in favour of sarcastic and erroneous sniping - I'd
suggest it's you that needs to worry about being taken seriously.

Brian Quinlan

unread,

Sep 25, 2006, 1:11:18 PM9/25/06

to Jon Ribbens, pytho...@python.org

Actually, at least in the context of this mailing list, Fredrik doesn't
have to worry about that at all. Why? Because he is one of the most
prolific contributors to the Python language and libraries and his
contributions have been of consistently high quality.

You, on the other hand, are "just some guy" and people don't have a lot
of incentive to convince you of anything.

I have no opinion on the actual debate though. Just trying to help with
the social analysis :-)

Cheers,
Brian

Jon Ribbens

unread,

Sep 25, 2006, 1:20:41 PM9/25/06

to

In article <mailman.585.11592042...@python.org>, Brian Quinlan wrote:
>> Now you're just being ridiculous. In this thread you have been rude,
>> evasive, insulting, vague, hypocritical, and have failed to answer
>> substantive points in favour of sarcastic and erroneous sniping - I'd
>> suggest it's you that needs to worry about being taken seriously.
>
> Actually, at least in the context of this mailing list, Fredrik doesn't
> have to worry about that at all. Why? Because he is one of the most
> prolific contributors to the Python language and libraries

I would have hoped that people don't treat that as a licence to be
obnoxious, though. I am aware of Fredrik's history, which is why I
was somewhat surprised and disappointed that he was being so rude
and unpleasant in this thread. He is not living up to his reputation
at all. Maybe he's having a bad day ;-)

Georg Brandl

unread,

Sep 25, 2006, 2:02:58 PM9/25/06

to

Jon Ribbens wrote:
> In article <4517eecf$0$14036$edfa...@dread15.news.tele.dk>, Max M wrote:
>> Oh ... because you cannot see a use case for that *documented*
>> behaviour, it must certainly be wrong?
>
> No, but if nobody else can find one either, that's a clue that maybe
> it's safe to change.
>
> Here's a point for you - the documentation for cgi.escape says that
> the characters "&", "<" and ">" are converted, but not what they are
> converted to.

It says "to HTML-safe sequences". That's reasonably clear without the need
to reproduce the exact replacements for each character.

If anyone doesn't know what is meant by this, he shouldn't really write apps
using the cgi module before doing a basic HTML course.

Or use the source.

Georg

Dan Bishop

unread,

Sep 25, 2006, 2:20:05 PM9/25/06

to

Fredrik Lundh wrote:
> Jon Ribbens wrote:
>
> > Making cgi.escape always escape the '"' character would not break
> > anything, and would probably fix a few bugs in existing code. Yes,
> > those bugs are not cgi.escape's fault, but that's no reason not to
> > be helpful. It's a minor improvement with no downside.
>
> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way,

"Unless" "your" "CGI" "scripts" "output" "text" "like" "this," "I"
"think" "it's" "absurd" "to" "consider" "the" "bloat" "significant."

Jon Ribbens

unread,

Sep 25, 2006, 7:41:48 PM9/25/06

to

In article <ef95kk$oan$1...@news.albasani.net>, Georg Brandl wrote:
>> Here's a point for you - the documentation for cgi.escape says that
>> the characters "&", "<" and ">" are converted, but not what they are
>> converted to.
>
> It says "to HTML-safe sequences". That's reasonably clear without the need
> to reproduce the exact replacements for each character.
>
> If anyone doesn't know what is meant by this, he shouldn't really write apps
> using the cgi module before doing a basic HTML course.

So would you like to explain the difference between &quot; and &#34;,
or do you need to go on a "basic HTML course" first?

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:02:18 PM9/25/06

to

In message <Xns984996E6BA...@127.0.0.1>, Duncan Booth wrote:

> If I have a unicode string such as: u'\u201d' (right double quote), then I
> want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
> is better).

Right-double-quote is not an HTML special, so there's no need to quote it.
I'm only concerned here with characters that have special meanings in HTML
markup.

> There should be a one-stop shop where I can take my unicode text and
> convert it into something I can safely insert into a generated html page;
> at present I need to call both cgi.escape and s.encode to get the desired
> effect.

What you're really asking for is a version of cgi.escape that a) fixes the
bugs discussed in this thread, and b) copes with different encodings while
doing so.

To handle b), you would need to pass it some indication of what the encoding
of the string is. In any case, converting a literal right-double-quote to
&#8221; is not relevant to the purpose of cgi.escape.

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:39:54 PM9/25/06

to

In message <4517e10e$0$13929$edfa...@dread15.news.tele.dk>, Max M wrote:

> Lawrence is right that the escape method doesn't work the way he expects
> it to.
>
> Rewriting a library module simply because a developer is surprised is a
> *very* bad idea.

I'm not surprised. Disappointed, yes. Verging on disgust at some comments in
this thread, yes. But "surprised" is what a lot of users of the existing
cgi.escape function are going to be when they discover their code isn't
doing what they thought it was.

> It would break just about every web app out there that
> uses the escape module...

How will it break them? Give an example.

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:41:30 PM9/25/06

to

In message <mailman.570.11591941...@python.org>, Fredrik
Lundh wrote:

What you're doing is adding to the reasons why the existing cgi.escape
function is stupidly designed and implemented. The True case is by far the
most common, so to make that the slow case, as well as being the
non-default case, is doubly brain-dead.

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:45:23 PM9/25/06

to

In message <4517ec24$0$13947$edfa...@dread15.news.tele.dk>, Max M wrote:

> Jon Ribbens skrev:
>> In article <mailman.569.11591928...@python.org>, Fredrik
>> Lundh wrote:
>>>> There's nothing to say that cgi.escape should take them both into
>>>> account in the one function
>>> so what exactly are you using cgi.escape for in your code ?
>>
>> To escape characters so that they will be treated as character data
>> and not control characters in HTML.
>>
>>>> What precisely do you think it would "break"?
>>> existing code, and existing tests.
>>
>> I'm sorry, that's not good enough. How, precisely, would it break
>> "existing code"? Can you come up with an example, or even an
>> explanation of how it *could* break existing code?
>
>
> Some examples are:
>
> - Possibly any code that tests for string equality in a rendered
> html/xml page.

You've got to be kidding. Any programmer knows that, to test two strings for
equality, you should do that on a canonical (non-encoded) representation.

> - Code that generates cgi.escaped() markup and (rightfully) for some
> reason expects the old behaviour to be used.

Whenever I use a channel-coding function, I expect the resulting output to
be only fit for feeding into the channel. I do NOT expect to do anything
else with it. Any kind of data manipulation I do, I do BEFORE feeding it
into the output channel, which means BEFORE putting it through the channel
coding.

> - 3. party code that parses/scrapes content from cgi.escaped() markup.
> (you could even break Java code this way :-s )

If that code follows the HTML rules, it will work.

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:48:16 PM9/25/06

to

In message <mailman.579.11591992...@python.org>, Fredrik
Lundh wrote:

> In article <ef8oqr$9pt$1...@news.albasani.net>, Georg Brandl wrote:
>>> I'm sorry, that's not good enough. How, precisely, would it break
>>> "existing code"? Can you come up with an example, or even an
>>> explanation of how it could break existing code?
>>
>> Is that so hard to see? If cgi.escape replaced "'" with an entity
>> reference, code that expects it not to do so would break.
>
> Sorry, that's still not good enough. Why would any code expect such a
> thing?
>>

> that's not up to you to decide, though.

Yes it is. An HTML-quoting function converts a string to its HTML-compatible
representation. Since it is now HTML-compatible, any code that tries to
work with it afterwards has got to expect it to be HTML-compatible. Which
means it has to allow for what HTML allows.

Lawrence D'Oliveiro

unread,

Sep 25, 2006, 11:53:34 PM9/25/06

to

In message <mailman.559.11591881...@python.org>, Fredrik
Lundh wrote:

> Lawrence D'Oliveiro wrote:
>
>>> Georg Brandl wrote:
>>>
>>>> A function is broken if its implementation doesn't match the
>>>> documentation.
>>>
>>> or if it doesn't match the designer's intent. cgi.escape is old enough
>>> that we would have noticed that, by now...
>>
>> _We_ certainly have noticed it.
>
> you're not the designer...

I don't have to be. Whoever the designer was, they had not properly thought
through the uses of this function. That's quite obvious already, to anybody
who works with HTML a lot. So the function is broken and needs to be fixed.

If you're worried about changing the semantics of a function that keeps the
same "cgi.escape" name, then fine. We delete the existing function and add
a new, properly-designed one. _That_ will be a wake-up call to all the
users of the existing function to fix their code.

Steven D'Aprano

unread,

Sep 26, 2006, 12:43:24 AM9/26/06

to

On Mon, 25 Sep 2006 16:48:03 +0200, Max M wrote:

> Any change in Python that has these consequences will rightfully be
> considered a bug. So what you are suggesting is to knowingly introduce a
> bug in the standard library!

It isn't like there have never been backwards _in_compatible changes to
the standard library before.

Ten seconds of googling finds
http://www.python.org/download/releases/2.3/highlights/:

int() - this can now return a long when converting a string with many
digits, rather than raising OverflowError. (New in 2.3a2: issues a
FutureWarning when sign-folding an unsigned hex or octal literal.)

Bastion and rexec - these modules are disabled, because they aren't
safe in Python 2.3 (nor in Python 2.2). (New in 2.3a2.)

Hex/oct literals prefixed with a minus sign were handled
inconsistently. This has been fixed in accordance with PEP 237. (New
in 2.3a2.)

Passing a float to C functions expecting an integer now issues a
DeprecationWarning; in the future this will become a TypeError. (New
in 2.3a2.)

None - assignment to variables or attributes named None will now
trigger a warning. In the future, None may become a keyword.

And more, all from one release.

If the behaviour of cgi.escape is "broken", or incomplete, or misleading,
then Python has a great mechanism for introducing incompatible changes
slowly: warnings.

It isn't good enough to say that the function does what it says it does,
if what it does is dangerous and misleading. Artificial example:

def sqr(x):
    """Returns the square of almost all numbers."""
    if x != 1: return x**2
    else: return -1

The function does exactly what it says, and yet still has badly dangerous
behaviour that risks introducing serious bugs. If people are relying on
unit tests which include specific tests for that behaviour, then the
function and the code needs to be fixed in parallel. That's what the
warnings module is for.

So any arguments about "breaking code" are a red herring: if cgi.escape
does the wrong thing (and that's arguable), and code relies on that
behaviour, then the code is already broken and needs to be fixed in
parallel with the function. So can we accept that:

(1) *if* there is a problem with cgi.escape it needs to be fixed;

(and, dear gods, I would hope that nobody here wants to argue that Python
should make backwards compatibility a higher virtue than correctness!)

(2) it doesn't need to be fixed *immediately* without warning;

(3) but it can be fixed through a gradual process with warning; and

(4) unit tests and code that expect the (presumed) bad behaviour can be
fixed gradually?

Now that we've got that out of the way, can we CALMLY and RATIONALLY
discuss whether cgi.escape is or isn't broken?

Or, more specifically, UNDER WHAT CIRCUMSTANCES it does the wrong thing?

--
Steven D'Aprano

Gabriel G

unread,

Sep 26, 2006, 12:18:57 AM9/26/06

to pytho...@python.org

At Monday 25/9/2006 11:08, Jon Ribbens wrote:

> >> What precisely do you think it would "break"?
> >
> > existing code, and existing tests.
>
>I'm sorry, that's not good enough. How, precisely, would it break
>"existing code"? Can you come up with an example, or even an
>explanation of how it *could* break existing code?

FWIW, a *lot* of unit tests on *my* generated html code would break,
and I imagine a *lot* of other people's code would break too. So
changing the defaults is not a good idea.
But if you want, import this on sitecustomize.py and pretend it said
quote=True:

import cgi
cgi.escape.func_defaults = (True,)
del cgi

Gabriel Genellina
Softlab SRL
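
On Python 3 the `func_defaults` attribute is spelled `__defaults__`. A minimal sketch of the same trick, using a hypothetical stand-in function with the old `cgi.escape` signature rather than patching anything in the standard library:

```python
def escape(s, quote=False):
    """Hypothetical stand-in with the same signature as the old cgi.escape."""
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s


before = escape('say "hi"')

# Rebind the tuple of default argument values: quote now defaults to True.
escape.__defaults__ = (True,)

after = escape('say "hi"')
```

Changing a default this way affects every caller in the process, so it is better suited to a personal sitecustomize.py, as suggested above, than to shared library code.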

Steve Holden

unread,

Sep 26, 2006, 2:46:33 AM9/26/06

to pytho...@python.org

I generally find that Fredrik's rudeness quotient is satisfactorily
biased towards discouraging ill-informed comment. As far as rudeness
goes, I've found your approach to this discussion to be pretty
obnoxious, and I'm generally known as someone with a high tolerance for
idiotic behaviour.

If your intention was to troll you could not have crafted your
contributions in a better way.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Dan Bishop

unread,

Sep 26, 2006, 2:57:14 AM9/26/06

to

How exactly would you make s = s.replace('"', "&quot;") faster than
*not* doing the replacement?

Duncan Booth

unread,

Sep 26, 2006, 3:00:09 AM9/26/06

to

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

>> (cgi.escape(s, True) is slower than cgi.escape(s), for reasons that
>> are obvious for anyone who's looked at the code).
>
> What you're doing is adding to the reasons why the existing cgi.escape
> function is stupidly designed and implemented. The True case is by far
> the most common, so to make that the slow case, as well as being the
> non-default case, is doubly brain-dead.

It is slightly slower because it does more. Both cases are about 15 times
faster than the regular expression implementation someone posted to this
thread yesterday.

Duncan Booth

unread,

Sep 26, 2006, 3:00:10 AM9/26/06

to

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

> In message <Xns984996E6BA...@127.0.0.1>, Duncan Booth
> wrote:
>
>> If I have a unicode string such as: u'\u201d' (right double quote),
>> then I want that encoded in my html as '&#8221;' (or &rdquo; but the
>> numeric form is better).
>
> Right-double-quote is not an HTML special, so there's no need to quote
> it. I'm only concerned here with characters that have special meanings
> in HTML markup.

There is no need to quote " or ' either except in particular situations.

Would you care to suggest how you get a right double quote into any iso-
8859-1 encoded web page without quoting it? Even if the page is utf-8
encoded quoting it can be a good idea.

>
>> There should be a one-stop shop where I can take my unicode text and
>> convert it into something I can safely insert into a generated html
>> page; at present I need to call both cgi.escape and s.encode to get
>> the desired effect.
>
> What you're really asking for is a version of cgi.escape that a) fixes
> the bugs discussed in this thread, and b) copes with different
> encodings while doing so.
>
> To handle b), you would need to pass it some indication of what the
> encoding of the string is. In any case, converting a literal
> right-double-quote to &#8221; is not relevant to the purpose of
> cgi.escape.
>

You don't seem to understand about html entity escapes. &#8221; is a valid
way to express right double quote whatever the page encoding. There is no
need to know the encoding of the page in order to escape entities, just
escape anything which can be problematic.
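
The two-step "escape, then encode" combination mentioned in this exchange can be sketched as follows, assuming a current Python with `html.escape` standing in for `cgi.escape`; the sample text is made up. The `xmlcharrefreplace` error handler turns any character the target encoding cannot represent into a numeric character reference.

```python
import html

text = 'He said \u201dyes\u201d & <left>'   # made-up sample with U+201D quotes

# Step 1: escape the characters that are special in HTML markup.
escaped = html.escape(text, quote=False)

# Step 2: encode for the page; characters missing from iso-8859-1
# become numeric character references instead of raising an error.
as_latin1 = escaped.encode('iso-8859-1', 'xmlcharrefreplace')

print(as_latin1)
```

The order matters: the markup escaping runs on text, before encoding, so the ampersands introduced by the character references are never escaped a second time.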

Lawrence D'Oliveiro

unread,

Sep 26, 2006, 3:15:22 AM9/26/06

to

In message <1159253834.5...@m7g2000cwm.googlegroups.com>, Dan
Bishop wrote:

Wrong answer. Correctness comes first, then we worry about efficiency.

Lawrence D'Oliveiro

unread,

Sep 26, 2006, 3:16:24 AM9/26/06

to

In message <mailman.633.11592516...@python.org>, Gabriel G
wrote:

> At Monday 25/9/2006 11:08, Jon Ribbens wrote:
>
>> >> What precisely do you think it would "break"?
>> >
>> > existing code, and existing tests.
>>
>>I'm sorry, that's not good enough. How, precisely, would it break
>>"existing code"? Can you come up with an example, or even an
>>explanation of how it *could* break existing code?
>

> FWIW, a *lot* of unit tests on *my* generated html code would break...

Why did you write your code that way?

Georg Brandl

unread,

Sep 26, 2006, 3:20:12 AM9/26/06

to

What about the users who don't need to "fix" their code since it's working fine
and flawlessly with the current cgi.escape?

Georg

Georg Brandl

unread,

Sep 26, 2006, 3:26:44 AM9/26/06

to

Lawrence D'Oliveiro wrote:
> In message <4517e10e$0$13929$edfa...@dread15.news.tele.dk>, Max M wrote:
>
>> Lawrence is right that the escape method doesn't work the way he expects
>> it to.
>>
>> Rewriting a library module simply because a developer is surprised is a
>> *very* bad idea.
>
> I'm not surprised. Disappointed, yes. Verging on disgust at some comments in
> this thread, yes. But "surprised" is what a lot of users of the existing
> cgi.escape function are going to be when they discover their code isn't
> doing what they thought it was.

Why should they be surprised? The documentation states clearly what cgi.escape()
does (as does the docstring).

Georg

Lawrence D'Oliveiro

unread,

Sep 26, 2006, 3:31:24 AM9/26/06

to

Documentation frequently states stupid things. Doesn't mean it should be
treated as sacrosanct.

Lawrence D'Oliveiro

unread,

Sep 26, 2006, 3:32:51 AM9/26/06

to

They're just lucky, I guess, that the bugs haven't bitten them--yet.

Max M

unread,

Sep 26, 2006, 3:56:43 AM9/26/06

to

Lawrence D'Oliveiro skrev:

Stop feeding the troll.

Jon Ribbens

unread,

Sep 26, 2006, 5:41:33 AM9/26/06

to

In article <mailman.636.11592531...@python.org>, Steve Holden wrote:
>> I would have hoped that people don't treat that as a licence to be
>> obnoxious, though. I am aware of Fredrik's history, which is why I
>> was somewhat surprised and disappointed that he was being so rude
>> and unpleasant in this thread. He is not living up to his reputation
>> at all. Maybe he's having a bad day ;-)
>
> I generally find that Fredrik's rudeness quotient is satisfactorily
> biased towards discouraging ill-informed comment.

It's a pity he's being rude when presented with well-informed comment
then.

> As far as rudeness goes, I've found your approach to this discussion
> to be pretty obnoxious, and I'm generally known as someone with a
> high tolerance for idiotic behaviour.

Why do you say that? I have confined myself to simple logical
arguments, and been frankly very restrained when presented with
rudeness and misunderstanding from other thread participants.
In what way should I have modified my postings?

Georg Brandl

unread,

Sep 26, 2006, 5:55:18 AM9/26/06

to

That's not the point. The point is that someone using cgi.escape() will hardly
be surprised by what it does and doesn't do.

Georg

Jim

unread,

Sep 26, 2006, 6:15:32 AM9/26/06

to

Jon Ribbens wrote:
> You're right - I've never seen anyone do such a thing. It sounds like
> a highly dubious and very fragile sort of test to me, of very limited
> use.
I have code that checks to see if my CGI scripts generate the pages
that I expect. That code would break. (Whether I should not have
written them that way is a different point, but it would break.)

Jim

Lawrence D'Oliveiro

unread,

Sep 26, 2006, 7:10:53 AM9/26/06

to

And this surprise, or lack of it, is relevant to the argument how, exactly?

Steve Holden

unread,

Sep 26, 2006, 7:12:04 AM9/26/06

to pytho...@python.org, jon+u...@unequivocal.co.uk

Please allow me to apologise. I have clearly been confusing you with
someone else. A review of your contributions to the thread confirms your
assertion.

Sion Arrowsmith

unread,

Sep 26, 2006, 7:16:03 AM9/26/06

to

Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:
>In article <Xns98499CF9DC...@127.0.0.1>, Duncan Booth wrote:
>> I guess you've never seen anyone write tests which retrieve some generated
>> html and compare it against the expected value. If the page contains any
>> unescaped quotes then this change would break it.

>You're right - I've never seen anyone do such a thing. It sounds like
>a highly dubious and very fragile sort of test to me, of very limited
>use.

So what sort of test would you use, that doesn't involve comparing
actual output against expected output?

--
\S -- si...@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump

Christophe

unread,

Sep 26, 2006, 7:39:51 AM9/26/06

to

Sion Arrowsmith a écrit :

> Jon Ribbens <jon+u...@unequivocal.co.uk> wrote:
>> In article <Xns98499CF9DC...@127.0.0.1>, Duncan Booth wrote:
>>> I guess you've never seen anyone write tests which retrieve some generated
>>> html and compare it against the expected value. If the page contains any
>>> unescaped quotes then this change would break it.
>> You're right - I've never seen anyone do such a thing. It sounds like
>> a highly dubious and very fragile sort of test to me, of very limited
>> use.
>
> So what sort of test would you use, that doesn't involve comparing
> actual output against expected output?

Well, one could say that the expected output is the one as it will be
interpreted by the HTML browser. And thus, the test should HTML-unescape
the string and compare it to the original string instead of mandating a
specific encoding.
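
A test written along those lines might look like this minimal sketch. It assumes a current Python, where `html.escape` and `html.unescape` play the roles of the escaping function and the browser's decoding; the sample string is made up.

```python
import html

original = 'the "quick" & <brown> fox\'s den'   # made-up sample input

for quote in (False, True):
    escaped = html.escape(original, quote=quote)
    # Compare what a browser would recover, not the exact escaped text,
    # so the test passes whichever characters the escaper chooses to escape.
    assert html.unescape(escaped) == original
```

A test of this shape keeps passing if the escaping function starts escaping more characters, whereas a byte-for-byte comparison against stored output would not.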

Jon Ribbens

unread,

Sep 26, 2006, 7:53:16 AM9/26/06

to

In article <mailman.667.11592691...@python.org>, Steve Holden wrote:
>> Why do you say that? I have confined myself to simple logical
>> arguments, and been frankly very restrained when presented with
>> rudeness and misunderstanding from other thread participants.
>> In what way should I have modified my postings?
>
> Please allow me to apologise. I have clearly been confusing you with
> someone else. A review of your contributions to the thread confirms your
> assertion.

Oh, ok! You had me worried for a minute there ;-)

Fredrik Lundh

unread,

Sep 26, 2006, 9:00:49 AM9/26/06

to pytho...@python.org

Jon Ribbens wrote:

> This has nothing to do with character encodings.

it has *everything* to do with encoding of existing data into HTML so it can be
safely transported to, and recreated by, an HTML-aware client.

does the word "information set" mean anything to you?

</F>

Steve Holden

unread,

Sep 26, 2006, 9:07:02 AM9/26/06

to pytho...@python.org

Is there *any* branch of this thread that won't end with some snippy
remark from you?

Fredrik Lundh

unread,

Sep 26, 2006, 9:08:47 AM9/26/06

to pytho...@python.org

Lawrence D'Oliveiro wrote:

>> (cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
>> obvious for anyone who's looked at the code).
>
> What you're doing is adding to the reasons why the existing cgi.escape
> function is stupidly designed and implemented. The True case is by far the
> most common

really? most HTML attributes cannot even contain things that would need to
be escaped, while *all* element content needs escaping. and the web contains
a lot of element content, as should be obvious to anyone who's been there...

</F>
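
For readers following the element-content versus attribute distinction, here is a small sketch. It assumes Python 3's html.escape, where quote=False roughly corresponds to cgi.escape(s) and quote=True to cgi.escape(s, True) (with the difference that html.escape also escapes the single quote). The function render_link and its arguments are hypothetical:

```python
import html


def render_link(url, label):
    """Build an <a> element, escaping each value for the context it lands in."""
    # Attribute value: quotes must be escaped, otherwise a quote in the
    # data would end the attribute early.
    href = html.escape(url, quote=True)
    # Element content: only &, < and > need escaping; quotes are harmless.
    text = html.escape(label, quote=False)
    return '<a href="{}">{}</a>'.format(href, text)


print(render_link('/search?q="fox"&lang=en', 'the "quick" & <brown> fox'))
```

The quotes inside the href value come out as &quot;, while the quotes in the link text are left alone.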

Georg Brandl

Sep 26, 2006, 9:13:45 AM

to

Lawrence D'Oliveiro wrote:
> In message <efate6$ilf$1...@news.albasani.net>, Georg Brandl wrote:
>
>> Lawrence D'Oliveiro wrote:
>>> In message <efaknl$867$2...@news.albasani.net>, Georg Brandl wrote:
>>>
>>>> Lawrence D'Oliveiro wrote:
>>>>> In message <4517e10e$0$13929$edfa...@dread15.news.tele.dk>, Max M
>>>>> wrote:
>>>>>
>>>>>> Lawrence is right that the escape method doesn't work the way he
>>>>>> expects it to.
>>>>>>
>>>>>> Rewriting a library module simply because a developer is surprised is
>>>>>> a *very* bad idea.
>>>>>
>>>>> I'm not surprised. Disappointed, yes. Verging on disgust at some
>>>>> comments in this thread, yes. But "surprised" is what a lot of users of

^^^^^^^^^^^

>>>>> the existing cgi.escape function are going to be when they discover
>>>>> their code isn't doing what they thought it was.
>>>>
>>>> Why should they be surprised? The documentation states clearly what
>>>> cgi.escape() does (as does the docstring).
>>>
>>> Documentation frequently states stupid things. Doesn't mean it should be
>>> treated as sacrosanct.
>>
>> That's not the point. The point is that someone using cgi.escape() will
>> hardly be surprised of what it does and doesn't do.
>
> And this surprise, or lack of it, is relevant to the argument how, exactly?

Which argument? You said users were going to be surprised, I told you why they
aren't.

Georg

(Okay, this is my last posting to this thread)

Fredrik Lundh

Sep 26, 2006, 9:15:11 AM

to pytho...@python.org

Georg Brandl wrote:

> It says "to HTML-safe sequences". That's reasonably clear without the need
> to reproduce the exact replacements for each character.

the same documentation tells people what function to use if they want to quote
*everything* that might need to be quoted, so if people did actually understand everything that
was written in a reasonably clear way, this thread wouldn't even exist.

</F>

Jon Ribbens

Sep 26, 2006, 9:19:33 AM

to

In article <mailman.672.11592757...@python.org>, Fredrik Lundh wrote:
>> This has nothing to do with character encodings.
>
> it has *everything* to do with encoding of existing data into HTML
> so it can be safely transported to, and recreated by, an HTML-aware
> client.

I can't tell if you're disagreeing or not. You escape the character
"<" as the sequence of characters "&lt;", for example, because
otherwise the HTML user agent will treat it as the start of a tag and
not as character data. You will notice that the character encoding is
utterly irrelevant to this.

> does the word "information set" mean anything to you?

You would appear to be talking about either game theory, or XML,
neither of which have anything to do with HTML.
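
A rough illustration of the two positions, assuming Python 3 and html.escape as a stand-in for cgi.escape. Markup escaping works on characters, so the escaped text survives a round trip through any encoding that can represent it. The last step shows the related case where the target encoding cannot represent a character and numeric character references are used instead (the standard "xmlcharrefreplace" error handler):

```python
import html

text = "caf\u00e9 < th\u00e9"

# Escaping operates on characters, before any byte encoding is chosen.
escaped = html.escape(text, quote=False)
assert escaped == "caf\u00e9 &lt; th\u00e9"

# The same escaped text round-trips through several encodings unchanged.
for encoding in ("utf-8", "latin-1", "utf-16"):
    data = escaped.encode(encoding)
    assert html.unescape(data.decode(encoding)) == text

# When the encoding cannot represent a character (ASCII here), the
# character itself has to be written as a numeric character reference.
ascii_data = escaped.encode("ascii", "xmlcharrefreplace")
assert ascii_data == b"caf&#233; &lt; th&#233;"
assert html.unescape(ascii_data.decode("ascii")) == text
```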

Fredrik Lundh

Sep 26, 2006, 9:20:59 AM

to pytho...@python.org

Jon Ribbens wrote:

> It's a pity he's being rude when presented with well-informed comment
> then.

since when is the output of

import random, sys
messages = [
    "that's irrelevant",
    "then their code is broken already",
    "that's not good enough",
    "then their tests are broken already",
    "you're rude",
]
for x in xrange(sys.maxint):
    print random.choice(messages)

well-informed? heck, it doesn't even pass the turing test ;-)

</F>

Jon Ribbens

Sep 26, 2006, 9:25:07 AM

to

In article <mailman.676.11592765...@python.org>, Fredrik Lundh wrote:
> the same documentation tells people what function to use if they
> want to quote *everything* that might need to be quoted, so if
> people did actually understand everything that was written in a
> reasonably clear way, this thread wouldn't even exist.

The fact that you don't understand that that's not true is the reason
you've been getting into such a muddle in this thread.

Fredrik Lundh

Sep 26, 2006, 9:32:07 AM

to pytho...@python.org

Jon Ribbens wrote:

>> does the word "information set" mean anything to you?
>
> You would appear to be talking about either game theory, or XML,
> neither of which have anything to do with HTML.

you see no connection between XML's concept of information set and
HTML? (hint: what's XHTML?)

</F>

Jon Ribbens

Sep 26, 2006, 9:41:57 AM

to

In article <mailman.678.11592770...@python.org>, Fredrik Lundh wrote:
>> It's a pity he's being rude when presented with well-informed comment
>> then.
>
> since when is the output of
>

[snip code]

>
> well-informed? heck, it doesn't even pass the turing test ;-)

Since when did that bear any resemblance to what I have said?

Are you going to grow up and start addressing the substantial points
raised, rather than making puerile sarcastic remarks?

An apology from you would not go amiss.

Jon Ribbens

Sep 26, 2006, 9:48:20 AM

to

In article <mailman.680.11592779...@python.org>, Fredrik Lundh wrote:
> Jon Ribbens wrote:
>
>>> does the word "information set" mean anything to you?
>>
>> You would appear to be talking about either game theory, or XML,
>> neither of which have anything to do with HTML.

I notice that yet again you've snipped the substantial point and
failed to answer it, presumably because you don't know how.

> you see no connection between XML's concept of information set and
> HTML? (hint: what's XHTML?)

I am perfectly well aware of what XHTML is. If you're trying to make
a point, please get to it, rather than going off on irrelevant
tangents. What do XML Information Sets have to do with escaping
control characters in HTML?

Fredrik Lundh

Sep 26, 2006, 9:47:37 AM

to pytho...@python.org

Jon Ribbens wrote:

>> the same documentation tells people what function to use if they
>> want to quote *everything* that might need to be quoted, so if
>> people did actually understand everything that was written in a
>> reasonably clear way, this thread wouldn't even exist.
>
> The fact that you don't understand that that's not true is the reason
> you've been getting into such a muddle in this thread.

it's a fact that it's not true that the documentation points to the function
that it points to ? exactly what definitions of the words "fact" and "true"
are you using here ?

</F>

Jon Ribbens

Sep 26, 2006, 10:02:39 AM

to

You misunderstand again. The second half of the sentence is the untrue
bit ("if people did ... understand ... this thread wouldn't even exist"),
not the first.

Fredrik Lundh

Sep 26, 2006, 10:13:08 AM

to pytho...@python.org

Jon Ribbens wrote:

> I notice that yet again you've snipped the substantial point and
> failed to answer it, presumably because you don't know how.

cute.

> What do XML Information Sets have to do with escaping control
> characters in HTML?

figure out the connection, and you'll have the answer to your "substantial
point".

</F>

Jon Ribbens

Sep 26, 2006, 10:21:40 AM

to

In article <mailman.687.11592803...@python.org>, Fredrik Lundh wrote:
>> What do XML Information Sets have to do with escaping control
>> characters in HTML?
>
> figure out the connection, and you'll have the answer to your "substantial
> point".

If you don't know the answer, you can say so y'know. There's no shame
in it.

Fredrik Lundh

Sep 26, 2006, 10:44:14 AM

to pytho...@python.org

Jon Ribbens wrote:

> If you don't know the answer, you can say so y'know.

I know the answer. I'm pretty sure everyone else who's actually read my posts
to this thread might have figured it out by now, too. But since you're still trying
to "win" the debate, long after it's over, I think it's safest to end this thread right
now. *plonk*

Jon Ribbens

Sep 26, 2006, 10:58:35 AM

to

In article <mailman.690.11592821...@python.org>, Fredrik Lundh wrote:
> I know the answer. I'm pretty sure everyone else who's actually
> read my posts to this thread might have figured it out by now, too.
> But since you're still trying to "win" the debate, long after it's
> over, I think it's safest to end this thread right now. *plonk*

It's sad to see a grown man throw his toys out of his pram, just
because he's losing an argument...

Brian Quinlan

Sep 26, 2006, 11:17:52 AM

to pytho...@python.org

A summary of this pointless argument:

Why cgi.escape should be changed to escape double quote (and maybe
single quote) characters by default:
o escaping should be very aggressive by default to avoid subtle bugs
o over-escaping is not likely to harm most programs significantly
o people who do not read the documentation may be surprised by its
  behavior

Why cgi.escape should NOT be changed:
o it is currently used in lots of code and changing it will almost
  certainly break some of it, test suites at minimum e.g.
    assert my_template_system("<p>{foo}</p>", foo='"') == '<p>"</p>'
o escaping attribute values is less common than escaping element
  text so people should not be punished with:
  - harder to read output
  - (slightly) increased file size
  - (slightly) decreased performance
o cgi.escape is not meant for serious web application development, so
  either roll your own (trivial) function to do escaping how you want
  it or use the one provided by your framework (if it is not automatic)
o the documentation describes the current behavior precisely and
  suggests solutions that provide more aggressive escaping, so arguing
  about surprising behavior is not reasonable
o it doesn't even make sense for an escape function to exist in the cgi
  module, so it should only be used by old applications for
  compatibility reasons

Cheers,
Brian
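
As a rough idea of what a "roll your own (trivial) function" could look like, here is a sketch. The name escape_html and its quote parameter are hypothetical, not part of any library, and whether quotes should be escaped by default is exactly the choice debated in this thread:

```python
def escape_html(text, quote=True):
    """Escape text for use in HTML (hypothetical helper, for illustration)."""
    # "&" has to be replaced first; otherwise the ampersands in the
    # entities added below would themselves be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    if quote:
        # Needed when the text ends up inside a quoted attribute value.
        text = text.replace('"', "&quot;")
        text = text.replace("'", "&#39;")
    return text


assert escape_html('"') == "&quot;"
assert escape_html('"', quote=False) == '"'
assert escape_html("a & b <c>") == "a &amp; b &lt;c&gt;"
```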

Paul Rubin

Sep 26, 2006, 11:22:45 AM

to

Brian Quinlan <br...@sweetapp.com> writes:
> o cgi.escape is not meant for serious web application development,

What is it meant for then? Why should the library ever implement
anything in a half-assed way unsuitable for serious application
development, if it can supply a robust implementation instead?

Your other points are reasonable. I like the idea of adding an option
to escape single quotes, but I don't care much what the defaults are.

I notice that the options for pickle.dump/dumps changed incompatibly
between Python 2.2 and 2.3, and nobody really cared.

Jon Ribbens

Sep 26, 2006, 11:26:12 AM

to

In article <mailman.698.11592838...@python.org>, Brian Quinlan wrote:
> A summary of this pointless argument:

Your summary seems pretty reasonable, but please note that later on,
the thread was not about cgi.escape escaping (or not) quote
characters (as described in your summary), but about Fredrik arguing,
somewhat incoherently, that it should have to take character encodings
into consideration.

Brian Quinlan

Sep 26, 2006, 11:36:28 AM

to Paul Rubin, pytho...@python.org

Paul Rubin wrote:
> Brian Quinlan <br...@sweetapp.com> writes:
>> o cgi.escape is not meant for serious web application development,
>
> What is it meant for then? Why should the library ever implement
> anything in a half-assed way unsuitable for serious application
> development, if it can supply a robust implementation instead?

I'd have to dig through the revision history to be sure, but I imagine
that cgi.escape was originally only used in the cgi module (and there
only in its various print_* functions). Then it started being used by
other core Python modules e.g. cgitb, DocXMLRPCServer.

The "mistake", if there was one, was probably that escape wasn't spelled
_escape and got documented in the LaTeX documentation system.

All of this is just speculation though.

Cheers,
Brian

George Sakkis

Sep 26, 2006, 11:37:20 AM

to

Lawrence D'Oliveiro wrote:

> Fredrik Lundh wrote:
> > you're not the designer...
>
> I don't have to be. Whoever the designer was, they had not properly thought
> through the uses of this function. That's quite obvious already, to anybody
> who works with HTML a lot. So the function is broken and needs to be fixed.
>
> If you're worried about changing the semantics of a function that keeps the
> same "cgi.escape" name, then fine. We delete the existing function and add
> a new, properly-designed one. _That_ will be a wake-up call to all the
> users of the existing function to fix their code.

Wow. Are you always that arrogant for things you know very little
about, or just plain stupid ?

Brian Quinlan

Sep 26, 2006, 11:43:47 AM

to Jon Ribbens, pytho...@python.org

And, of course, about you telling people that their explanations are not
good enough :-)

BTW, I am curious about how you do unit testing. The example that I used
in my summary is a very common pattern but would break if cgi.escape
changed its semantics. What do you do instead?

Cheers,
Brian

Jon Ribbens

Sep 26, 2006, 11:53:46 AM

to

In article <mailman.704.11592854...@python.org>, Brian Quinlan wrote:
>> Your summary seems pretty reasonable, but please note that later on,
>> the thread was not about cgi.escape escaping (or not) quote
>> characters (as described in your summary), but about Fredrik arguing,
>> somewhat incoherently, that it should have to take character encodings
>> into consideration.
>
> And, of course, about you telling people that their explanations are not
> good enough :-)

I guess, if you mean the part of the thread which went "it'll break
existing code", "what existing code"? "existing code" "but what
existing code?" "i dunno, just, er, code" "ok *how* will it break it?"
"i dunno, it just will"?

> BTW, I am curious about how you do unit testing. The example that I used
> in my summary is a very common pattern but would break if cgi.escape
> changed its semantics. What do you do instead?

To be honest I'm not sure what *sort* of code people test this way. It
just doesn't seem appropriate at all for web page generating code. Web
pages need to be manually viewed in web browsers, and validated, and
checked for accessibility. Checking they're equal to a particular
string just seems bizarre (and where does that string come from
anyway?)

Brian Quinlan

Sep 26, 2006, 12:11:13 PM

to Jon Ribbens, pytho...@python.org

Jon Ribbens wrote:
> I guess, if you mean the part of the thread which went "it'll break
> existing code", "what existing code"? "existing code" "but what
> existing code?" "i dunno, just, er, code" "ok *how* will it break it?"
> "i dunno, it just will"?

See below for a possible example.

>> BTW, I am curious about how you do unit testing. The example that I used
>> in my summary is a very common pattern but would break if cgi.escape
>> changed its semantics. What do you do instead?
>
> To be honest I'm not sure what *sort* of code people test this way. It
> just doesn't seem appropriate at all for web page generating code.

Well, there are dozens (hundreds?) of templating systems for Python.
Here is a (simplified/modified) unit test for my company's system (yeah,
we lifted some ideas from Django):

test.html
---------
<p>{foo | escape}</p>

test.py
-------
t = Template("test.html")
t['foo'] = 'Brian -> "Hi!"'
assert str(t) == '<p>Brian -&gt; "Hi!"</p>'

So how would you test our template system?

> Web
> pages need to be manually viewed in web browsers, and validated, and
> checked for accessibility.

True.

> Checking they're equal to a particular
> string just seems bizarre (and where does that string come from
> anyway?)

Maybe, which is why I'm asking you how you do it. Some of our web
applications contain 100s of script generated pages. Testing each one by
hand after making a change would be completely impossible. So we use
HTTP scripting for testing purposes i.e. send this request, grab the
results, verify that the text in the element with id="username" equals
"Brian Quinlan", etc. The test also validates that each page is well
formed. We also view each page at some point but not every time a
developer makes a change that might (i.e. everything) affect the entire
system.

Cheers,
Brian
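
The element-level check described above could be sketched roughly as follows. This is not Brian's actual test harness: the function text_of_element and the sample page are invented for illustration, the page is supplied as a string rather than fetched over HTTP, and xml.etree.ElementTree is assumed to be an acceptable parser because the pages are said to be well formed.

```python
import xml.etree.ElementTree as ET


def text_of_element(page, element_id):
    """Parse a well-formed page and return the text of the element with this id."""
    # ET.fromstring raises ET.ParseError if the page is not well formed,
    # so parsing doubles as the well-formedness check.
    root = ET.fromstring(page)
    for element in root.iter():
        if element.get("id") == element_id:
            return element.text
    raise LookupError("no element with id %r" % element_id)


page = (
    "<html><body>"
    '<p id="username">Brian Quinlan</p>'
    '<p id="quote">Brian -&gt; &quot;Hi!&quot;</p>'
    "</body></html>"
)

assert text_of_element(page, "username") == "Brian Quinlan"
# The parser decodes the entities, so the comparison is against the
# original text, whichever characters the template chose to escape.
assert text_of_element(page, "quote") == 'Brian -> "Hi!"'
```

One limitation of this sketch: ElementTree only knows the predefined XML entities, so a page using HTML-only named entities such as &nbsp; would need a different parser.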