>>> cgi.escape("the \"quick\" & <brown> fox")
'the "quick" &amp; &lt;brown&gt; fox'
>>> cgi.escape("the \"quick\" & <brown> fox", True)
'the &quot;quick&quot; &amp; &lt;brown&gt; fox'
This seems to me to be dumb. The default option should be the safe one: that
is, escape _all_ the potentially troublesome characters. The only time you
can get away with NOT escaping the quote character is outside of markup,
e.g.
<TEXTAREA>
unescaped "quotes" allowed here
</TEXTAREA>
Nevertheless, even in that situation, escaped quotes are acceptable.
So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.
Can changing the default break existing scripts? I don't see how. It might
even fix a few lurking bugs out there.
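The failure mode being described can be sketched as follows. This is a minimal illustration only: escape_like_cgi is a hypothetical stand-in written from the documented behaviour of cgi.escape (not the library source), and the input string is made up.

```python
def escape_like_cgi(s, quote=False):
    """Hypothetical stand-in for the documented behaviour of cgi.escape:
    always escape &, < and >; escape the double quote only if quote is true."""
    s = s.replace("&", "&amp;")  # must run first so later entities are not re-escaped
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

# Made-up input containing a double quote.
user_input = 'x" onmouseover="alert(1)'

# Default call: the quote survives and closes the attribute early.
unsafe = '<input value="%s">' % escape_like_cgi(user_input)
# With quote=True the whole value stays inside the attribute.
safe = '<input value="%s">' % escape_like_cgi(user_input, True)

print(unsafe)
print(safe)
```

In ordinary text between tags, both variants render the same in a browser; only the attribute case differs.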
> So I think the default for the second argument to cgi.escape should be
> changed to True. Or alternatively, the second argument should be removed
> altogether, and quotes should always be escaped.
you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.
> Can changing the default break existing scripts? I don't see how. It might
> even fix a few lurking bugs out there.
I'm not sure this "every time I don't immediately understand something,
I'll write a change proposal instead of reading the library reference"
approach is healthy, really.
</F>
> Lawrence D'Oliveiro wrote:
>
>> So I think the default for the second argument to cgi.escape should be
>> changed to True. Or alternatively, the second argument should be removed
>> altogether, and quotes should always be escaped.
>
> you're confused: cgi.escape(s) is designed to be used for ordinary text,
> cgi.escape(s, True) is designed for attributes.
What works for attributes also works for ordinary text.
He's not confused, he's correct; the author of cgi.escape is the
confused one. The optional extra parameter is completely unnecessary
and achieves nothing except to make it easier for people to end up
with bugs in their code.
Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.
One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character. This is essentially
identical to the quote character in HTML, so any code which is escaping
one should always be escaping the other.
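A small sketch of the apostrophe point, assuming an attribute delimited by single quotes. Both helper names are hypothetical and written for illustration; neither is a library function.

```python
def escape_double_only(s):
    """Escape &, <, > and the double quote, but leave apostrophes alone."""
    return (s.replace("&", "&amp;")
             .replace("<", "&lt;")
             .replace(">", "&gt;")
             .replace('"', "&quot;"))

def escape_both_quotes(s):
    """Same as above, plus the apostrophe as a numeric character reference."""
    return escape_double_only(s).replace("'", "&#39;")

# Attribute delimited with single quotes, which HTML permits.
template = "<a title='%s'>link</a>"
value = "x' onclick='alert(1)"

broken = template % escape_double_only(value)   # apostrophe ends the attribute early
intact = template % escape_both_quotes(value)   # value stays inside the attribute

print(broken)
print(intact)
```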
> In article <mailman.499.11590355...@python.org>, Fredrik
> Lundh wrote:
>> Lawrence D'Oliveiro wrote:
>>>
>>> So I think the default for the second argument to cgi.escape should be
>>> changed to True. Or alternatively, the second argument should be removed
>>> altogether, and quotes should always be escaped.
>>
>> you're confused: cgi.escape(s) is designed to be used for ordinary text,
>> cgi.escape(s, True) is designed for attributes. if you use the code the
>> way it's intended to be used, it works perfectly fine.
>
> He's not confused, he's correct; the author of cgi.escape is the
> confused one.
Thanks for backing me up. :)
> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character. This is essentially
> identical to the quote character in HTML, so any code which is escaping
> one should always be escaping the other.
I must confess I did a double-take on this. But I rechecked the HTML spec
(HTML 4.0, section 3.2.2, "Attributes"), and you're right--single quotes
ARE allowed as an alternative to double quotes. It's just I've never used
them as quotes. :)
> What works for attributes also works for ordinary text.
attributes and ordinary text are two different things in HTML and XML.
you're arguing that it's a good idea for *everyone* to bloat down
ordinary text just because you're too lazy to use a piece of code in the
intended way.
</F>
> Making cgi.escape always escape the '"' character would not break
> anything, and would probably fix a few bugs in existing code. Yes,
> those bugs are not cgi.escape's fault, but that's no reason not to
> be helpful. It's a minor improvement with no downside.
the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.
> One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character.
it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes. again, punishing people who
actually read the docs and understand them is not a very good way to
maintain software.
btw, you're both missing that cgi.escape isn't good enough for general
use anyway, since it doesn't deal with encodings at all. if you want a
general purpose function that can be used for everything that can be put
in an HTML file, you need more than just a modified cgi.escape. feel
free to propose a general-purpose replacement (which should have a new
name), but make sure you think through *all* the issues before you do that.
</F>
> Jon Ribbens wrote:
>
>> Making cgi.escape always escape the '"' character would not break
>> anything, and would probably fix a few bugs in existing code. Yes,
>> those bugs are not cgi.escape's fault, but that's no reason not to
>> be helpful. It's a minor improvement with no downside.
>
> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way, and will also
> break unit tests.
I don't understand this "bloat down" nonsense. Any tests that would break
are obviously testing the wrong thing.
> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> > does not encode the apostrophe (') character.
>
> it's intentional, of course: you're supposed to use " if you're using
> cgi.escape(s, True) to escape attributes.
Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.
> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway, since it doesn't deal with encodings at all.
Why does it need to?
> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.
do you ever think before you post?
</F>
&quot; is 5 characters more than ".
>> > One thing that is flat-out wrong, by the way, is that cgi.escape()
>> > does not encode the apostrophe (') character.
>>
>> it's intentional, of course: you're supposed to use " if you're using
>> cgi.escape(s, True) to escape attributes.
>
> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.
A function is broken if its implementation doesn't match the documentation.
As a courtesy, I've pasted it below.
escape(s[, quote])
Convert the characters "&", "<" and ">" in string s to HTML-safe sequences.
Use this if you need to display text that might contain such characters in HTML.
If the optional flag quote is true, the quotation mark character (") is also
translated; this helps for inclusion in an HTML attribute value, as in <A
HREF="...">. If the value to be quoted might include single- or double-quote
characters, or both, consider using the quoteattr() function in the
xml.sax.saxutils module instead.
Now, do you still think cgi.escape is broken?
Georg
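For reference, a short sketch of the quoteattr() function that the documentation points to. Unlike an escape function, it returns the value already wrapped in quotes and picks the delimiter based on which quote characters the value contains. The sample values are made up.

```python
from xml.sax.saxutils import quoteattr

# Made-up attribute values covering each combination of quote characters.
values = ["plain", "O'Reilly", 'say "hi"', 'it\'s "both"']

# quoteattr() supplies the surrounding quotes itself, so none appear here.
tags = ["<a title=%s>" % quoteattr(v) for v in values]

for tag in tags:
    print(tag)
```

Only when a value contains both kinds of quote does it fall back to escaping the double quotes.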
> A function is broken if its implementation doesn't match the documentation.
or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...
</F>
Or if the design, as described in the documentation, is flawed in some
way.
> As a courtesy, I've pasted it below.
>
[...]
>
> Now, do you still think cgi.escape is broken?
Yes.
By a minuscule degree. That is a very weak argument by any standard.
> and will also break unit tests.
Er, so change the unit tests at the same time?
> > One thing that is flat-out wrong, by the way, is that cgi.escape()
> > does not encode the apostrophe (') character.
>
> it's intentional, of course:
I noticed. That doesn't mean it isn't wrong.
> you're supposed to use " if you're using cgi.escape(s, True) to
> escape attributes. again, punishing people who actually read the
> docs and understand them is not a very good way to maintain
> software.
In what way is anyone being "punished"? Deliberately retaining flaws
and misfeatures that can easily be fixed without damaging
backwards-compatibility is not a very good way to maintain software
either.
> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway,
I'm sorry, I didn't realise this was a general thread about any and
all inadequacies of Python's cgi module.
> since it doesn't deal with encodings at all.
Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.
By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.
>> Georg Brandl wrote:
>>
>>> A function is broken if its implementation doesn't match the
>>> documentation.
>>
>> or if it doesn't match the designer's intent. cgi.escape is old enough
>> that we would have noticed that, by now...
>
> _We_ certainly have noticed it.
you're not the designer, you're just some random guy who thinks that if you
don't understand something at first, it has to be changed, even if that change
would break things for others. maybe you haven't done software long enough
to understand that software works better if you use it the way it was intended
to be used, but that's no excuse for being stupid.
</F>
> Or if the design, as described in the documentation, is flawed in some
> way.
it does exactly what it says, and is perfectly usable as is, if you bother to
use it the way it was intended to be used.
(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).
</F>
>> since it doesn't deal with encodings at all.
>
> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.
If you're really serious about making things easier to use, shouldn't
you look at the whole picture? HTML documents are byte streams, so
any transformation from internal character data to HTML must take both
escaping and encoding into account. If you and Lawrence have a hard
time remembering how to use the existing cgi.escape function, despite
its utter simplicity, surely it would make your life even easier if
there was an alternative API that would handle both the easy part
(escaping) and the hard part (encoding) ?
> By the way, if you could try and put across your proposed arguments as
> to why you don't favour this suggested change without the insults and
> general rudeness, it would be appreciated.
I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.
</F>
So what's your excuse?
Ever heard of modular programming? I would suggest that you do indeed
take a step back and look at the whole picture - it's the whole
picture that needs to take escaping and encoding into account. There's
nothing to say that cgi.escape should take them both into account in
the one function, and in fact as you yourself have already commented,
good reasons for it not to, in that it would make it excessively
complicated.
> If you and Lawrence have a hard time remembering how to use the
> existing cgi.escape function, despite its utter simplicity, surely
> it would make your life even easier if there was an alternative API
> that would handle both the easy part (escaping) and the hard part
> (encoding) ?
You seem to be arguing that because, in an ideal world, it would be
better to throw away the 'cgi' module completely and start again, it
is not worth making minor improvements in what we already have.
I would suggest that this is, to put it mildly, not a good argument.
> I've already explained that, but since you're convinced that your use
> case is more important than other use cases, and you don't care about
> things like stability and respect for existing users of an API, nor
> the cost for others to update their code and unit tests, I don't see
> much need to repeat myself.
You are merely compounding your bad manners. All of your above
allegations are outright lies. I am not sure if you are simply not
understanding the simple points I am making, or are deliberately
trying to mislead people for some bizarre reason of your own.
> Breaking things just because you think you can simply isn't the
> Python way of doing things.
Your hyperbole is growing more extravagant. To begin with, you were
claiming that the suggested change would make things (minusculely)
less efficient, now you're claiming it will "break" unspecified
things. What precisely do you think it would "break"?
It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would probably
break many tests for a large number of people.
If the change were seen as a good thing, then a backwards compatible change
(e.g. introducing a function with a different name) might be considered,
but if so it should address the whole issue: the current lack of support
for encodings is IMHO a far bigger problem than whether or not a quote mark is
escaped.
> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.
If I have a unicode string such as: u'\u201d' (right double quote), then I
want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
is better). For many purposes I could just encode it in the encoding to be
used for the page, typically latin1 or utf8, but sometimes that isn't
possible e.g. if you don't know the encoding at the point when you produce
the string, or if there is no translation for the character in the desired
encoding. The character reference will work whatever encoding is used for
the page.
There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;
at present I need to call both cgi.escape and s.encode to get the desired
effect.
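The two-call approach mentioned here can be sketched roughly like this. The function name to_html_bytes is hypothetical; the error handler "xmlcharrefreplace" is the standard codec option that turns unencodable characters into numeric character references.

```python
def to_html_bytes(text, encoding="ascii"):
    """Hypothetical helper: escape markup characters, then encode.

    Characters the target encoding cannot represent become numeric
    character references via the 'xmlcharrefreplace' error handler.
    """
    escaped = (text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;"))
    return escaped.encode(encoding, "xmlcharrefreplace")

ascii_result = to_html_bytes("a < b \u201d")
latin1_result = to_html_bytes("caf\u00e9 \u201d", "latin-1")

print(ascii_result)
print(latin1_result)
```

The escaping has to happen before the encoding step; otherwise the ampersands in the generated references would be escaped a second time.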
Well, yes, you certainly seem to be good at the "pissing on others"
part, even if you have to lie to do it. You have had the "enhanced
escape" proposal all along - it was the post which started this
thread! If you are referring to your strawman argument about
encodings, you have yet to show that it's relevant.
If it'll make you any happier, here's the code for the 'cgi.escape'
equivalent that I usually use:
import re

_html_encre = re.compile("[&<>\"'+]")
_html_encodes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;",
                  "'": "&#39;", "+": "&#43;" }

def html_encode(raw):
    return re.sub(_html_encre, lambda m: _html_encodes[m.group(0)], raw)
>> By the way, if you could try and put across your proposed arguments as
>> to why you don't favour this suggested change without the insults and
>> general rudeness, it would be appreciated.
>
> I've already explained that, but since you're convinced that your use
> case is more important than other use cases, and you don't care about
> things like stability and respect for existing users of an API, nor
> the cost for others to update their code and unit tests, I don't see
> much need to repeat myself. Breaking things just because you think
> you can simply isn't the Python way of doing things.
This thread is highly entertaining but perhaps not that productive.
Lawrence is right that the escape method doesn't work the way he expects
it to.
Rewriting a library module simply because a developer is surprised is a
*very* bad idea. It would break just about every web app out there that
uses the escape module and uses testing. Which is probably most of them.
That could mean several man years of wasted time. It also makes the
escaped html harder to read for standard cases.
Fredrik is right that doing so is utterly ... well let us call it
"unproductive". Stupid is such a harsh word ;-)
Whether someone finds the bloat minuscule and thus a small enough change
to warrant the rewrite does not really matter.
Lawrence is free to write a wrapper and use that instead.
my_escape = lambda st: cgi.escape(st, 1)
So. Lawrence is happy, and the escape works as expected. Several man
years have been saved.
Max M
> There's nothing to say that cgi.escape should take them both into account
> in the one function
so what exactly are you using cgi.escape for in your code ?
> What precisely do you think it would "break"?
existing code, and existing tests.
</F>
Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.
> There should be a one-stop shop where I can take my unicode text and
> convert it into something I can safely insert into a generated html page;
I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.
To escape characters so that they will be treated as character data
and not control characters in HTML.
>> What precisely do you think it would "break"?
>
> existing code, and existing tests.
I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?
> It also makes the escaped html harder to read for standard cases.
and slows things down a bit.
(cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
obvious for anyone who's looked at the code).
</F>
Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.
Georg
> In article <Xns984996E6BA...@127.0.0.1>, Duncan Booth
> wrote:
>> It is generally a principle of Python that new releases maintain
>> backward compatibility. An incompatible change such as the one proposed here
>> would probably break many tests for a large number of people.
>
> Why is the suggested change incompatible? What code would it break?
> I agree that it would be a bad idea if it did indeed break backwards
> compatibility - but it doesn't.
I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.
>
>> There should be a one-stop shop where I can take my unicode text and
>> convert it into something I can safely insert into a generated html
>> page;
>
> I disagree. I think that doing it in one is muddled thinking and
> liable to lead to bugs. Why not keep your output as unicode until it
> is ready to be output to the browser, and encode it as appropriate
> then? Character encoding and character escaping are separate jobs with
> separate requirements that are better off handled by separate code.
Sorry, convert into something I can safely insert wasn't meant to imply
encoding: just entity escaping.
To be clear:
I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.
The resulting string might be a byte string or it might still be unicode:
the point being that the conversion I want is from unescaped to entity
escaped, not from unicode to byte encoded. Right now the only way the
Python library gives me to do the entity escaping properly has a side
effect of encoding the string. I should be able to do the escaping without
having to encode the string at the same time.
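A rough sketch of the string-to-string conversion being asked for, with no byte encoding involved. The function name escape_to_entities and the choice to convert everything above ASCII are assumptions made for illustration.

```python
def escape_to_entities(text):
    """Hypothetical helper: return a str in which markup characters are
    escaped and every non-ASCII character is a numeric character reference."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))  # decimal code point
        else:
            out.append(ch)
    return "".join(out)

result = escape_to_entities("R&D \u201d")
print(result)
```

The result is still a character string, so it can be encoded later in whatever encoding the page ends up using.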
Sorry, that's still not good enough. Why would any code expect such a
thing?
Some examples are:
- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.
- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.
- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )
Any change in Python that has these consequences will rightfully be
considered a bug. So what you are suggesting is to knowingly introduce a
bug in the standard library!
You are right that the html generated by cgi.escape() would (probably)
have the same visual appearance in the browsers. But that is a *very*
narrow definition of being bug free and not breaking stuff.
If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!
Max M
You're right - I've never seen anyone do such a thing. It sounds like
a highly dubious and very fragile sort of test to me, of very limited
use.
> I'm talking about encoding certain characters as entity references. It
> doesn't matter whether it's the character ampersand or right double quote,
> they both want to be converted to entities. Same operation.
This is that muddled thinking I was talking about. They are *not* the
same operation. You want to encode "<", for example, because it must
always be encoded to prevent it being treated as an HTML control
character. This has nothing to do with character encodings.
You might sometimes want to escape "right double quote" because it may
or may not be available in the character encoding you are using to output
to the browser. Yes, this might sometimes seem a bit similar to the
"<" escaping described above, because one of the ways you could avoid
the character encoding issue would be to use numeric entities, but it
is actually a completely separate issue and is none of the business of
cgi.escape.
By your argument, cgi.escape should in fact escape *every single*
character as a numeric entity, and even that wouldn't work properly
since "&", "#", ";" and the digits might not be in their usual
positions in the output encoding.
> Right now the only way the Python library gives me to do the entity
> escaping properly has a side effect of encoding the string. I should
> be able to do the escaping without having to encode the string at
> the same time.
I'm getting lost here - the opposite of what you say above is true.
cgi.escape does the escaping properly (modulo failing to escape
quotes) without encoding.
Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?
This function, which is correct by the current documentation, will be broken
by your change.
def hasSomeWord(someword):
    import urllib
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    return '"%s"' % someword in content
You might think that it is stupid code that should be changed to take
escaped quotes into account. But that is really not your business to
decide if the other behaviour is documented and correct.
I find it amazing that you cannot understand this. I will stop replying
in this thread now.
Max M
Testing is good, but only if done correctly.
> - Code that generates cgi.escaped() markup and (rightfully) for some
> reason expects the old behaviour to be used.
That's begging the question again ("an example of code that would
break is code that would break").
> - 3. party code that parses/scrapes content from cgi.escaped() markup.
> (you could even break Java code this way :-s )
I'm sorry, I don't understand that one. What is "party code"? Code
that is scraping content from web sites already has to cope with
entities etc.
Your comment about Java is a little ironic given that I persuaded the
Java Struts people to make the exact same change we're talking about
here, back in 2002 (even if it did take 11 months) ;-)
> If you cannot think of other examples for yourself where your change
> would introduce breakage, you are certainly not an experienced enough
> programmer to suggest changes in the standard lib!
I'll take my own opinion on that over yours, thanks.
> I'm sorry, that's not good enough. How, precisely, would it break
> "existing code"?
('owdo Mr. Ribbens!)
It's possible there could be software that relies on ' not being
escaped, for example:
# Auto-markup links to O'Reilly, everyone's favourite
# example name with an apostrophe in it
#
URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)
Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.
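To see the breakage concretely, here is a minimal sketch comparing the two behaviours. escape_current and escape_proposed are hypothetical stand-ins for the existing and the suggested escaping; the sample text is made up.

```python
def escape_current(s):
    """Stand-in for the existing default: only &, < and > are escaped."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_proposed(s):
    """Stand-in for the suggested change: quotes and apostrophes too."""
    return (escape_current(s)
            .replace('"', "&quot;")
            .replace("'", "&#39;"))

def link_oreilly(html):
    """Plain-text search and replace on already-escaped HTML."""
    uri = "http://www.oreilly.com/"
    return html.replace("O'Reilly", '<a href="%s">O\'Reilly</a>' % uri)

text = "Books by O'Reilly & friends"
before = link_oreilly(escape_current(text))   # link is inserted
after = link_oreilly(escape_proposed(text))   # search string no longer matches

print(before)
print(after)
```

Nothing raises an error in the second case; the link simply stops appearing.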
A similar change to str.encode('unicode-escape') in Python 2.5 caused a
number of similar subtle problems. (In this case the old documentation
was a bit woolly so didn't prescribe the exact older behaviour.)
I'm not saying that the cgi.escape interface is *good*, just that it's
too late to change it.
I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)
--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
No, but if nobody else can find one either, that's a clue that maybe
it's safe to change.
Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to. Even by your own argument, therefore, code is not
entitled to rely on the output of cgi.escape being any particular
exact string.
> This function, which is correct by the current documentation, will be broken
> by your change.
>
> def hasSomeWord(someword):
>     import urllib
>     f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
>     content = f.read()
>     f.close()
>     return '"%s"' % someword in content
That function is broken already, no change required.
It's easy enough to come up with examples which might. For example, I
have doctests which evaluate tal expressions. I don't think I currently
have any which depend on quotes, but I can easily create one (I just
did, and it passes):
>>> print T('''<tal:x tal:content="python:'It\\'s a \\x22tal\\x22 string'" />''')
It's a "tal" string
>>> print T('''<x tal:attributes="title python:'It\\'s a \\x22tal\\x22 string'" />''')
<x title="It's a &quot;tal&quot; string" />
More likely I might output a field value and just happen to have used a quote
in it.
FWIW, in zope tal, the value of tal:content is escaped using the equivalent of
cgi.escape(s, False), and attribute values are escaped using
cgi.escape(s, True).
The function T I use is defined as:
def T(template, **kw):
    """Create and render a page template."""
    pt = PageTemplate()
    pt.pt_edit(template, 'text/html')
    return pt.pt_render(extra_context=kw).strip('\n')
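The content-versus-attribute split described above can be sketched without any template library. This is an illustration only, not how Zope implements it; escape and render are hypothetical names.

```python
def escape(s, quote=False):
    """Stand-in for the documented cgi.escape behaviour."""
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

def render(tag, attrs, content):
    """Escape attribute values with quote=True and element content without."""
    attr_text = "".join(
        ' %s="%s"' % (name, escape(value, True)) for name, value in attrs.items()
    )
    return "<%s%s>%s</%s>" % (tag, attr_text, escape(content), tag)

value = 'It\'s a "tal" string'
html = render("x", {"title": value}, value)
print(html)
```

A test that compares this output as a string depends on which of the two escaping modes was applied where.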
> Sorry, that's still not good enough.
that's not up to you to decide, though.
</F>
Good afternoon Mr Glover ;-)
> URI= 'http://www.oreilly.com/'
> html= cgi.escape(text)
> html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)
>
> Sure this may be rare, but it's what the documentation says, and
> changing it may not only fix things but also subtly break things in
> ways that are hard to detect.
I'm not sure about "subtly break things", but you're right that the
above code would break. I could argue that it's broken already,
(since it's doing a plain-text search on HTML data) but given
real-world considerations it's reasonable enough that I won't be that
pedantic ;-)
> I personally think the entire function should be deprecated, firstly
> because it's insufficient in some corner cases (apostrophes as you
> pointed out, and XHTML CDATA), and secondly because it's in the wrong
> place: HTML-escaping is nothing to do with the CGI interface. A good
> template library should deal with escaping more smoothly and correctly
> than cgi.escape. (It may be able to deal with escape-or-not-bother and
> character encoding issues automatically, for example.)
I agree that in most situations you should probably be using a
template library, but sometimes a simple CGI-and-manual-HTML system
suffices, and I think (a fixed version of) cgi.escape should exist at
a low level of the web application stack.
If the documentation isn't clear enough, that means the documentation
should be fixed.
It does _not_ mean "you are free to introduce new behavior because
nobody should trust what this function does anyway".
--
filip salomonsson
It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.
Incorrect - documentation can and frequently does leave certain
behaviours undefined. This is deliberate and (among other things)
is to allow for the behaviour to change in future versions without
breaking backwards-compatibility.
> It's up to me to decide whether or not an argument is good enough to
> convince me, thank you very much.
not if you expect anyone to take anything you say seriously.
</F>
Now you're just being ridiculous. In this thread you have been rude,
evasive, insulting, vague, hypocritical, and have failed to answer
substantive points in favour of sarcastic and erroneous sniping - I'd
suggest it's you that needs to worry about being taken seriously.
Actually, at least in the context of this mailing list, Fredrik doesn't
have to worry about that at all. Why? Because he is one of the most
prolific contributors to the Python language and libraries and his
contributions have been of consistent high quality.
You, on the other hand, are "just some guy" and people don't have a lot
of incentive to convince you of anything.
I have no opinion on the actual debate though. Just trying to help with
the social analysis :-)
Cheers,
Brian
I would have hoped that people don't treat that as a licence to be
obnoxious, though. I am aware of Fredrik's history, which is why I
was somewhat surprised and disappointed that he was being so rude
and unpleasant in this thread. He is not living up to his reputation
at all. Maybe he's having a bad day ;-)
It says "to HTML-safe sequences". That's reasonably clear without the need
to reproduce the exact replacements for each character.
If anyone doesn't know what is meant by this, he shouldn't really write apps
using the cgi module before doing a basic HTML course.
Or use the source.
Georg
"Unless" "your" "CGI" "scripts" "output" "text" "like" "this," "I"
"think" "it's" "absurd" "to" "consider" "the" "bloat" "significant."
So would you like to explain the difference between &quot; and &#34;,
or do you need to go on a "basic HTML course" first?
> If I have a unicode string such as: u'\u201d' (right double quote), then I
> want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
> is better).
Right-double-quote is not an HTML special, so there's no need to quote it.
I'm only concerned here with characters that have special meanings in HTML
markup.
> There should be a one-stop shop where I can take my unicode text and
> convert it into something I can safely insert into a generated html page;
> at present I need to call both cgi.escape and s.encode to get the desired
> effect.
What you're really asking for is a version of cgi.escape that a) fixes the
bugs discussed in this thread, and b) copes with different encodings while
doing so.
To handle b), you would need to pass it some indication of what the encoding
of the string is. In any case, converting a literal right-double-quote to
&#8221; is not relevant to the purpose of cgi.escape.
> Lawrence is right that the escape method doesn't work the way he expects
> it to.
>
> Rewriting a library module simply because a developer is surprised is a
> *very* bad idea.
I'm not surprised. Disappointed, yes. Verging on disgust at some comments in
this thread, yes. But "surprised" is what a lot of users of the existing
cgi.escape function are going to be when they discover their code isn't
doing what they thought it was.
> It would break just about every web app out there that
> uses the escape module...
How will it break them? Give an example.
What you're doing is adding to the reasons why the existing cgi.escape
function is stupidly designed and implemented. The True case is by far the
most common, so to make that the slow case, as well as being the
non-default case, is doubly brain-dead.
> Jon Ribbens skrev:
>> In article <mailman.569.11591928...@python.org>, Fredrik
>> Lundh wrote:
>>>> There's nothing to say that cgi.escape should take them both into
>>>> account in the one function
>>> so what exactly are you using cgi.escape for in your code ?
>>
>> To escape characters so that they will be treated as character data
>> and not control characters in HTML.
>>
>>>> What precisely do you think it would "break"?
>>> existing code, and existing tests.
>>
>> I'm sorry, that's not good enough. How, precisely, would it break
>> "existing code"? Can you come up with an example, or even an
>> explanation of how it *could* break existing code?
>
>
> Some examples are:
>
> - Possibly any code that tests for string equality in a rendered
> html/xml page.
You've got to be kidding. Any programmer knows that, to test two strings for
equality, you should do that on a canonical (non-encoded) representation.
> - Code that generates cgi.escaped() markup and (rightfully) for some
> reason expects the old behaviour to be used.
Whenever I use a channel-coding function, I expect the resulting output to
be only fit for feeding into the channel. I do NOT expect to do anything
else with it. Any kind of data manipulation I do, I do BEFORE feeding it
into the output channel, which means BEFORE putting it through the channel
coding.
> - Third-party code that parses/scrapes content from cgi.escaped() markup.
> (you could even break Java code this way :-s )
If that code follows the HTML rules, it will work.
> In article <ef8oqr$9pt$1...@news.albasani.net>, Georg Brandl wrote:
>>> I'm sorry, that's not good enough. How, precisely, would it break
>>> "existing code"? Can you come up with an example, or even an
>>> explanation of how it could break existing code?
>>
>> Is that so hard to see? If cgi.escape replaced "'" with an entity
>> reference, code that expects it not to do so would break.
>
> Sorry, that's still not good enough. Why would any code expect such a
> thing?
>>
> that's not up to you to decide, though.
Yes it is. An HTML-quoting function converts a string to its HTML-compatible
representation. Since it is now HTML-compatible, any code that tries to
work with it afterwards has got to expect it to be HTML-compatible. Which
means it has to allow for what HTML allows.
> Lawrence D'Oliveiro wrote:
>
>>> Georg Brandl wrote:
>>>
>>>> A function is broken if its implementation doesn't match the
>>>> documentation.
>>>
>>> or if it doesn't match the designer's intent. cgi.escape is old enough
>>> that we would have noticed that, by now...
>>
>> _We_ certainly have noticed it.
>
> you're not the designer...
I don't have to be. Whoever the designer was, they had not properly thought
through the uses of this function. That's quite obvious already, to anybody
who works with HTML a lot. So the function is broken and needs to be fixed.
If you're worried about changing the semantics of a function that keeps the
same "cgi.escape" name, then fine. We delete the existing function and add
a new, properly-designed one. _That_ will be a wake-up call to all the
users of the existing function to fix their code.
> Any change in Python that has these consequences will rightfully be
> considered a bug. So what you are suggesting is to knowingly introduce a
> bug in the standard library!
It isn't like there have never been backwards _in_compatible changes to
the standard library before.
Ten seconds of googling finds
http://www.python.org/download/releases/2.3/highlights/:
int() - this can now return a long when converting a string with many
digits, rather than raising OverflowError. (New in 2.3a2: issues a
FutureWarning when sign-folding an unsigned hex or octal literal.)
Bastion and rexec - these modules are disabled, because they aren't
safe in Python 2.3 (nor in Python 2.2). (New in 2.3a2.)
Hex/oct literals prefixed with a minus sign were handled
inconsistently. This has been fixed in accordance with PEP 237. (New
in 2.3a2.)
Passing a float to C functions expecting an integer now issues a
DeprecationWarning; in the future this will become a TypeError. (New
in 2.3a2.)
None - assignment to variables or attributes named None will now
trigger a warning. In the future, None may become a keyword.
And more, all from one release.
If the behaviour of cgi.escape is "broken", or incomplete, or misleading,
then Python has a great mechanism for introducing incompatible changes
slowly: warnings.
It isn't good enough to say that the function does what it says it does,
if what it does is dangerous and misleading. Artificial example:
def sqr(x):
    """Returns the square of almost all numbers."""
    if x != 1: return x**2
    else: return -1
The function does exactly what it says, and yet still has badly dangerous
behaviour that risks introducing serious bugs. If people are relying on
unit tests which include specific tests for that behaviour, then the
function and the code needs to be fixed in parallel. That's what the
warnings module is for.
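[A minimal sketch of such a warnings-based transition, not code from the thread. The `escape` function below is a hypothetical stand-in for cgi.escape, and the choice of FutureWarning and the wording of the message are assumptions made for illustration only.]

```python
import warnings


def escape(s, quote=None):
    """Hypothetical stand-in for cgi.escape, showing a gradual default change."""
    if quote is None:
        # The caller relied on the default: warn, but keep the old behaviour
        # for now so that existing code and tests continue to pass.
        warnings.warn(
            "escape() will escape double quotes by default in a future "
            "version; pass quote=True or quote=False explicitly",
            FutureWarning,
            stacklevel=2,
        )
        quote = False
    s = s.replace("&", "&amp;")  # must come first, or later entities get re-escaped
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
```

Callers that pass `quote` explicitly see no warning, so they can migrate at their own pace before the default is switched.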
So any arguments about "breaking code" are a red herring: if cgi.escape
does the wrong thing (and that's arguable), and code relies on that
behaviour, then the code is already broken and needs to be fixed in
parallel with the function. So can we accept that:
(1) *if* there is a problem with cgi.escape it needs to be fixed;
(and, dear gods, I would hope that nobody here wants to argue that Python
should make backwards compatibility a higher virtue than correctness!)
(2) it doesn't need to be fixed *immediately* without warning;
(3) but it can be fixed through a gradual process with warning; and
(4) unit tests and code that expect the (presumed) bad behaviour can be
fixed gradually?
Now that we've got that out of the way, can we CALMLY and RATIONALLY
discuss whether cgi.escape is or isn't broken?
Or, more specifically, UNDER WHAT CIRCUMSTANCES it does the wrong thing?
--
Steven D'Aprano
> >> What precisely do you think it would "break"?
> >
> > existing code, and existing tests.
>
>I'm sorry, that's not good enough. How, precisely, would it break
>"existing code"? Can you come up with an example, or even an
>explanation of how it *could* break existing code?
FWIW, a *lot* of unit tests on *my* generated html code would break,
and I imagine a *lot* of other people's code would break too. So
changing the defaults is not a good idea.
But if you want, import this on sitecustomize.py and pretend it said
quote=True:
import cgi
cgi.escape.func_defaults = (True,)
del cgi
Gabriel Genellina
Softlab SRL
I generally find that Fredrik's rudeness quotient is satisfactorily
biased towards discouraging ill-informed comment. As far as rudeness
goes, I've found your approach to this discussion to be pretty
obnoxious, and I'm generally known as someone with a high tolerance for
idiotic behaviour.
If your intention was to troll you could not have crafted your
contributions in a better way.
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
How exactly would you make s = s.replace('"', "&quot;") faster than
*not* doing the replacement?
>> (cgi.escape(s, True) is slower than cgi.escape(s), for reasons that
>> are obvious for anyone who's looked at the code).
>
> What you're doing is adding to the reasons why the existing cgi.escape
> function is stupidly designed and implemented. The True case is by far
> the most common, so to make that the slow case, as well as being the
> non-default case, is doubly brain-dead.
It is slightly slower because it does more. Both cases are about 15 times
faster than the regular expression implementation someone posted to this
thread yesterday.
> In message <Xns984996E6BA...@127.0.0.1>, Duncan Booth
> wrote:
>
>> If I have a unicode string such as: u'\u201d' (right double quote),
>> then I want that encoded in my html as '&#8221;' (or &rdquo; but the
>> numeric form is better).
>
> Right-double-quote is not an HTML special, so there's no need to quote
> it. I'm only concerned here with characters that have special meanings
> in HTML markup.
There is no need to quote " or ' either except in particular situations.
Would you care to suggest how you get a right double quote into any iso-
8859-1 encoded web page without quoting it? Even if the page is utf-8
encoded quoting it can be a good idea.
>
>> There should be a one-stop shop where I can take my unicode text and
>> convert it into something I can safely insert into a generated html
>> page; at present I need to call both cgi.escape and s.encode to get
>> the desired effect.
>
> What you're really asking for is a version of cgi.escape that a) fixes
> the bugs discussed in this thread, and b) copes with different
> encodings while doing so.
>
> To handle b), you would need to pass it some indication of what the
> encoding of the string is. In any case, converting a literal
> right-double-quote to &#8221; is not relevant to the purpose of
> cgi.escape.
>
You don't seem to understand about html entity escapes. &#8221; is a valid
way to express right double quote whatever the page encoding. There is no
need to know the encoding of the page in order to escape entities, just
escape anything which can be problematic.
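[A sketch of the "one-stop shop" being asked for, not code from the thread. It is written for modern Python 3, where `html.escape` has replaced cgi.escape; the helper name `escape_for_any_encoding` is hypothetical. The standard `xmlcharrefreplace` error handler turns every non-ASCII character into a numeric character reference, so the result is plain ASCII and can be embedded in a page of any ASCII-compatible encoding.]

```python
import html


def escape_for_any_encoding(text):
    """Escape HTML specials, then replace non-ASCII characters with
    numeric character references (hypothetical helper)."""
    escaped = html.escape(text, quote=True)
    # Encoding to ASCII with xmlcharrefreplace yields references such as
    # &#8221; for U+201D; decoding back gives an all-ASCII str.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

For example, `escape_for_any_encoding('\u201d')` gives `'&#8221;'`, with no knowledge of the page encoding required.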
Wrong answer. Correctness comes first, then we worry about efficiency.
> At Monday 25/9/2006 11:08, Jon Ribbens wrote:
>
>> >> What precisely do you think it would "break"?
>> >
>> > existing code, and existing tests.
>>
>>I'm sorry, that's not good enough. How, precisely, would it break
>>"existing code"? Can you come up with an example, or even an
>>explanation of how it *could* break existing code?
>
> FWIW, a *lot* of unit tests on *my* generated html code would break...
Why did you write your code that way?
What about the users who don't need to "fix" their code since it's working fine
and flawlessly with the current cgi.escape?
Georg
Why should they be surprised? The documentation states clearly what cgi.escape()
does (as does the docstring).
Georg
Documentation frequently states stupid things. Doesn't mean it should be
treated as sacrosanct.
They're just lucky, I guess, that the bugs haven't bitten them--yet.
Stop feeding the troll.
It's a pity he's being rude when presented with well-informed comment
then.
> As far as rudeness goes, I've found your approach to this discussion
> to be pretty obnoxious, and I'm generally known as someone with a
> high tolerance for idiotic behaviour.
Why do you say that? I have confined myself to simple logical
arguments, and been frankly very restrained when presented with
rudeness and misunderstanding from other thread participants.
In what way should I have modified my postings?
That's not the point. The point is that someone using cgi.escape() will hardly
be surprised by what it does and doesn't do.
Georg
Jim
And this surprise, or lack of it, is relevant to the argument how, exactly?
Please allow me to apologise. I have clearly been confusing you with
someone else. A review of your contributions to the thread confirms your
assertion.
So what sort of test would you use, that doesn't involve comparing
actual output against expected output?
--
\S -- si...@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Well, one could say that the expected output is the one as it'll be
interpreted by the HTML browser. And thus, the test should HTML-unescape
the string and compare it to the original string instead of
mandating a specific encoding.
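[A sketch of that kind of test, not code from the thread. It assumes modern Python 3, where `html.escape` and `html.unescape` are available; the helper name `assert_same_text` is hypothetical. The check passes whether or not the quotes were escaped, because it compares what a browser would recover rather than the exact byte sequence.]

```python
import html


def assert_same_text(rendered, expected_text):
    """Compare the text an HTML client would recover, not the raw markup."""
    assert html.unescape(rendered) == expected_text


original = 'Brian -> "Hi!"'
# Both escaping policies decode to the same text, so both pass.
assert_same_text(html.escape(original, quote=False), original)
assert_same_text(html.escape(original, quote=True), original)
```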
Oh, ok! You had me worried for a minute there ;-)
> This has nothing to do with character encodings.
it has *everything* to do with encoding of existing data into HTML so it can be
safely transported to, and recreated by, an HTML-aware client.
does the word "information set" mean anything to you?
</F>
Is there *any* branch of this thread that won't end with some snippy
remark from you?
>> (cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
>> obvious for anyone who's looked at the code).
>
> What you're doing is adding to the reasons why the existing cgi.escape
> function is stupidly designed and implemented. The True case is by far the
> most common
really? most HTML attributes cannot even contain things that would need to
be escaped, while *all* element content needs escaping. and the web contains
a lot of element content, as should be obvious to anyone who's been there...
</F>
Which argument? You said users were going to be surprised, I told you why they
aren't.
Georg
(Okay, this is my last posting to this thread)
> It says "to HTML-safe sequences". That's reasonably clear without the need
> to reproduce the exact replacements for each character.
the same documentation tells people what function to use if they want to quote
*everything* that might need to be quoted, so if people did actually understand
everything that was written in a reasonably clear way, this thread wouldn't even exist.
</F>
I can't tell if you're disagreeing or not. You escape the character
"<" as the sequence of characters "&lt;", for example, because
otherwise the HTML user agent will treat it as the start of a tag and
not as character data. You will notice that the character encoding is
utterly irrelevant to this.
> does the word "information set" mean anything to you?
You would appear to be talking about either game theory, or XML,
neither of which have anything to do with HTML.
> It's a pity he's being rude when presented with well-informed comment
> then.
since when is the output of
import random, sys
messages = [
    "that's irrelevant",
    "then their code is broken already",
    "that's not good enough",
    "then their tests are broken already",
    "you're rude",
]
for x in xrange(sys.maxint):
    print random.choice(messages)
well-informed? heck, it doesn't even pass the turing test ;-)
</F>
The fact that you don't understand that that's not true is the reason
you've been getting into such a muddle in this thread.
>> does the word "information set" mean anything to you?
>
> You would appear to be talking about either game theory, or XML,
> neither of which have anything to do with HTML.
you see no connection between XML's concept of information set and
HTML? (hint: what's XHTML?)
</F>
Since when did that bear any resemblance to what I have said?
Are you going to grow up and start addressing the substantial points
raised, rather than making puerile sarcastic remarks?
An apology from you would not go amiss.
I notice that yet again you've snipped the substantial point and
failed to answer it, presumably because you don't know how.
> you see no connection between XML's concept of information set and
> HTML? (hint: what's XHTML?)
I am perfectly well aware of what XHTML is. If you're trying to make
a point, please get to it, rather than going off on irrelevant
tangents. What do XML Information Sets have to do with escaping
control characters in HTML?
>> the same documentation tells people what function to use if they
>> want to quote *everything* that might need to be quoted, so if
>> people did actually understand everything that was written in a
>> reasonably clear way, this thread wouldn't even exist.
>
> The fact that you don't understand that that's not true is the reason
> you've been getting into such a muddle in this thread.
it's a fact that it's not true that the documentation points to the function
that it points to ? exactly what definitions of the words "fact" and "true"
are you using here ?
</F>
You misunderstand again. The second half of the sentence is the untrue
bit ("if people did ... understand ... this thread wouldn't even exist"),
not the first.
> I notice that yet again you've snipped the substantial point and
> failed to answer it, presumably because you don't know how.
cute.
> What do XML Information Sets have to do with escaping control
> characters in HTML?
figure out the connection, and you'll have the answer to your "substantial
point".
</F>
If you don't know the answer, you can say so y'know. There's no shame
in it.
> If you don't know the answer, you can say so y'know.
I know the answer. I'm pretty sure everyone else who's actually read my posts
to this thread might have figured it out by now, too. But since you're still trying
to "win" the debate, long after it's over, I think it's safest to end this thread right
now. *plonk*
It's sad to see a grown man throw his toys out of his pram, just
because he's losing an argument...
Why cgi.escape should be changed to escape double quote (and maybe
single quote) characters by default:
o escaping should be very aggressive by default to avoid subtle bugs
o over-escaping is not likely to harm most programs significantly
o people who do not read the documentation may be surprised by its
behavior
Why cgi.escape should NOT be changed:
o it is currently used in lots of code and changing it will almost
certainly break some of it, test suites at minimum e.g.
assert my_template_system("<p>{foo}</p>", foo='"') == '<p>"</p>'
o escaping attribute values is less common than escaping element
text so people should not be punished with:
- harder to read output
- (slightly) increased file size
- (slightly) decreased performance
o cgi.escape is not meant for serious web application development, so
either roll your own (trivial) function to do escaping how you want
it or use the one provided by your framework (if it is not automatic)
o the documentation describes the current behavior precisely and
suggests solutions that provide more aggressive escaping, so arguing
about surprising behavior is not reasonable
o it doesn't even make sense for an escape function to exist in the cgi
module, so it should only be used by old applications for
compatibility reasons
Cheers,
Brian
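[A sketch of the "roll your own (trivial) function" option from the summary above, not code from the thread. The name `escape_all` and the choice of `&#39;` for the single quote are assumptions made for illustration.]

```python
def escape_all(s):
    """Escape every character that is special in HTML text or in a
    quoted attribute value (hypothetical helper)."""
    return (
        s.replace("&", "&amp;")   # must come first, or later entities get re-escaped
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")
    )
```

The result is safe in element content and inside either single- or double-quoted attribute values, at the cost of slightly longer output.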
What is it meant for then? Why should the library ever implement
anything in a half-assed way unsuitable for serious application
development, if it can supply a robust implementation instead?
Your other points are reasonable. I like the idea of adding an option
to escape single quotes, but I don't care much what the defaults are.
I notice that the options for pickle.dump/dumps changed incompatibly
between Python 2.2 and 2.3, and nobody really cared.
Your summary seems pretty reasonable, but please note that later on,
the thread was not about cgi.escape escaping (or not) quote
characters (as described in your summary), but about Fredrik arguing,
somewhat incoherently, that it should have to take character encodings
into consideration.
I'd have to dig through the revision history to be sure, but I imagine
that cgi.escape was originally only used in the cgi module (and there
only in its various print_* functions). Then it started being used by
other core Python modules e.g. cgitb, DocXMLRPCServer.
The "mistake", if there was one, was probably that escape wasn't spelled
_escape and got documented in the LaTeX documentation system.
All of this is just speculation though.
Cheers,
Brian
> Fredrik Lundh wrote:
> > you're not the designer...
>
> I don't have to be. Whoever the designer was, they had not properly thought
> through the uses of this function. That's quite obvious already, to anybody
> who works with HTML a lot. So the function is broken and needs to be fixed.
>
> If you're worried about changing the semantics of a function that keeps the
> same "cgi.escape" name, then fine. We delete the existing function and add
> a new, properly-designed one. _That_ will be a wake-up call to all the
> users of the existing function to fix their code.
Wow. Are you always that arrogant for things you know very little
about, or just plain stupid ?
And, of course, about you telling people that their explanations are not
good enough :-)
BTW, I am curious about how you do unit testing. The example that I used
in my summary is a very common pattern but would break if cgi.escape
changed its semantics. What do you do instead?
Cheers,
Brian
I guess, if you mean the part of the thread which went "it'll break
existing code", "what existing code"? "existing code" "but what
existing code?" "i dunno, just, er, code" "ok *how* will it break it?"
"i dunno, it just will"?
> BTW, I am curious about how you do unit testing. The example that I used
> in my summary is a very common pattern but would break if cgi.escape
> changed its semantics. What do you do instead?
To be honest I'm not sure what *sort* of code people test this way. It
just doesn't seem appropriate at all for web page generating code. Web
pages need to be manually viewed in web browsers, and validated, and
checked for accessibility. Checking they're equal to a particular
string just seems bizarre (and where does that string come from
anyway?)
See below for a possible example.
>> BTW, I am curious about how you do unit testing. The example that I used
>> in my summary is a very common pattern but would break if cgi.escape
>> changed its semantics. What do you do instead?
>
> To be honest I'm not sure what *sort* of code people test this way. It
> just doesn't seem appropriate at all for web page generating code.
Well, there are dozens (hundreds?) of templating systems for Python.
Here is a (simplified/modified) unit test for my company's system (yeah,
we lifted some ideas from Django):
test.html
---------
<p>{foo | escape}</p>
test.py
-------
t = Template("test.html")
t['foo'] = 'Brian -> "Hi!"'
assert str(t) == '<p>Brian -&gt; "Hi!"</p>'
So how would you test our template system?
> Web
> pages need to be manually viewed in web browsers, and validated, and
> checked for accessibility.
True.
> Checking they're equal to a particular
> string just seems bizarre (and where does that string come from
> anyway?)
Maybe, which is why I'm asking you how you do it. Some of our web
applications contain 100s of script generated pages. Testing each one by
hand after making a change would be completely impossible. So we use
HTTP scripting for testing purposes i.e. send this request, grab the
results, verify that the text in the element with id="username" equals
"Brian Quinlan", etc. The test also validates that each page is well
formed. We also view each page at some point but not every time a
developer makes a change that might (i.e. everything) affect the entire
system.
Cheers,
Brian
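[A sketch of the "verify the text in the element with id=..." style of check described above, not the poster's actual test harness. It assumes modern Python 3 and its standard `html.parser` module; the class `TextById` and function `text_by_id` are hypothetical names. Because the parser decodes character references, the check does not depend on whether quotes were escaped.]

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id.

    Simplification: assumes every start tag inside the element has a
    matching end tag (no bare void elements such as <br>).
    """

    def __init__(self, wanted_id):
        super().__init__(convert_charrefs=True)
        self.wanted_id = wanted_id
        self.depth = 0      # greater than 0 while inside the wanted element
        self.done = False   # True once the wanted element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(page, wanted_id):
    parser = TextById(wanted_id)
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)
```

A test can then assert on `text_by_id(page, "username")` and get the same answer for `&quot;` and a literal `"` in the page source.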