escaping/encoding/formatting in python

Steve Howell

Apr 5, 2012, 9:56:08 PM
One of the biggest nuisances for programmers, just beneath date/time
APIs in the pantheon of annoyances, is that we are constantly dealing
with escaping/encoding/formatting issues.

I wrote this little program as a cheat sheet for myself and others.
Hope it helps.

# escaping quotes
legal_string = ['"', "'", "'\"", '"\'', """ '" """]
for s in legal_string:
    print("[" + s + "]")

# formatting
print 'Hello %s' % 'world'
print "Hello %s" % 'world'
planet = 'world'
print "Hello {planet}".format(**locals())
print "Hello {planet}".format(planet=planet)
print "Hello {0}".format(planet)

# Unicode
s = u"\u0394"
print s # prints a triangle
print repr(s) == "u'\u0394'" # True
print s.encode("utf-8") == "\xce\x94" # True
# other examples/resources???

# Web encodings
import urllib
s = "~foo ~bar"
print urllib.quote_plus(s) == '%7Efoo+%7Ebar' # True
print urllib.unquote_plus(urllib.quote_plus(s)) == s # True
import cgi
s = "x < 4 & x > 5"
print cgi.escape(s) == 'x &lt; 4 &amp; x &gt; 5' # True

# JSON
import json
h = {'foo': 'bar'}
print json.dumps(h) == '{"foo": "bar"}' # True
try:
    bad_json = "{'foo': 'bar'}"
    json.loads(bad_json)
except ValueError:
    # json.loads rejects single-quoted strings
    print 'Must use double quotes in your JSON'

It's tested under Python3.2. I didn't dare to cover regexes. It
would be great if somebody could flesh out the Unicode examples or
remind me (and others) of other common APIs that are useful to have in
your bag of tricks.
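
For anyone running a current Python 3, where print is a function and some of these helpers have moved, a rough equivalent of the checks above might look like the following. This is a sketch rather than a tested port of the script above, and the sample query string is chosen purely for illustration.

```python
import html
import json
import urllib.parse

# Unicode: str is already Unicode text; encoding it yields bytes.
delta = "\u0394"
assert delta.encode("utf-8") == b"\xce\x94"
assert b"\xce\x94".decode("utf-8") == delta

# Web encodings: quote_plus lives in urllib.parse, and html.escape
# takes over the job that cgi.escape did.
query = "a b&c=d"
quoted = urllib.parse.quote_plus(query)
assert quoted == "a+b%26c%3Dd"
assert urllib.parse.unquote_plus(quoted) == query
assert html.escape("x < 4 & x > 5") == "x &lt; 4 &amp; x &gt; 5"

# JSON: single-quoted strings are rejected; JSONDecodeError is a ValueError.
assert json.dumps({"foo": "bar"}) == '{"foo": "bar"}'
try:
    json.loads("{'foo': 'bar'}")
except ValueError:
    print("Must use double quotes in your JSON")
```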



rusi

Apr 6, 2012, 12:59:17 AM
On Apr 6, 6:56 am, Steve Howell <showel...@yahoo.com> wrote:
> One of the biggest nuisances for programmers, just beneath date/time
> APIs in the pantheon of annoyances, is that we are constantly dealing
> with escaping/encoding/formatting issues.

[OT for this list]
If you run
$ find /usr/share/emacs/23.3/lisp/ -name '*.gz'|xargs zgrep '\\\\\\\\\\\\\\\\'
you can get quite a few results.

[Suitable assumptions: linux box with emacs installed]

Steve Howell

Apr 6, 2012, 1:13:24 AM
You've one-upped me with 2-to-the-N backslash escaping. I've written
useful scripts before with "\\\\\\\\" (scripts that went through three
levels of interpretation), but four is setting a new bar. My use of
three exponentially increasing levels of backslashes back in the day
was like Beamon's jump in the Mexico City Olympics. An amazing feat
for its time, but every record eventually gets broken. Well done.
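
The doubling is easy to check in Python. The sketch below assumes that each level of interpretation turns a doubled backslash back into a single one and does nothing else, so protecting a string through N levels means doubling its backslashes N times. The names escape_once and escape_levels are made up for this illustration.

```python
def escape_once(s):
    """Double every backslash so the text survives one un-escaping pass."""
    return s.replace("\\", "\\\\")


def escape_levels(s, levels):
    """Apply escape_once repeatedly, once per level of interpretation."""
    for _ in range(levels):
        s = escape_once(s)
    return s


# Show how many backslashes one literal backslash turns into per level.
for n in range(5):
    print(n, len(escape_levels("\\", n)))
```

Under that assumption, three levels give the 8 backslashes mentioned above and four levels give 16.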

rusi

Apr 6, 2012, 1:28:19 AM
There was a competition here?!
If so I can break my own record -- double the number of backslashes
and you still get hits.
It's just that I was unsure of my ability at typing 32 backslashes (and
making a reasonable post).

On a more serious note this indicates that it is (may be?) a bad idea
for old-fashioned languages (like elisp and C) to have only 1
string-quoter.

May-be-question-mark because programming language experience tells us
that avoiding recursion (in its infinite guises) by special-casing is
usually a bad idea.

All this mess would vanish if the string-literal-starter and ender
were different.
[You don't need to escape an open-paren in a lisp sexp]

Nobody

Apr 6, 2012, 4:52:57 AM
On Thu, 05 Apr 2012 22:28:19 -0700, rusi wrote:

> All this mess would vanish if the string-literal-starter and ender
> were different.

You still need an escape character in order to be able to embed an
unbalanced end character.

Tcl and PostScript use mirrored string delimiters (braces for Tcl,
parentheses for PostScript), which results in the worst of both worlds:
they still need an escape character (backslash, in both cases) but now you
can't match tokens with a regexp/DFA.
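
To make that concrete, here is a rough Python sketch of a scanner for a Tcl-like braced string. It works under simplified assumptions (braces nest, and a backslash makes the next character literal) and is not Tcl's actual parser; scan_braced is a hypothetical helper. The depth counter is the part a plain regexp/DFA cannot express, and the backslash branch is still needed to embed an unbalanced brace.

```python
def scan_braced(text, start=0):
    """Return the index just past the brace that closes text[start].

    Simplified rules (assumed for this sketch): braces nest, and a
    backslash makes the following character literal.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at position %d" % start)
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated braced string")


print(scan_braced("{a {b} c} rest"))  # nested braces balance
print(scan_braced(r"{a \} b} rest"))  # escaped, unbalanced brace
```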

rusi

Apr 6, 2012, 9:22:13 AM
Yes. I hand it to you that I missed the case of explicitly unbalanced
strings.
But are not such cases rare?
For example code such as:
print '"'
print str(something)
print '"'

could better be written as
print '"%s"' % str(something)

Nobody

Apr 7, 2012, 1:36:14 AM
On Fri, 06 Apr 2012 06:22:13 -0700, rusi wrote:

> But are not such cases rare?

They exist, therefore they have to be supported somehow.

> For example code such as:
> print '"'
> print str(something)
> print '"'
>
> could better be written as
> print '"%s"' % str(something)

Not if the text between the delimiters is large.

Consider:

print 'static const char * const data[] = {'
for line in infile:
    print '\t"%s",' % line.rstrip()
print '};'

Versus:

text = '\n'.join('\t"%s",' % line.rstrip() for line in infile)
print 'static const char * const data[] = {\n%s\n};' % text

C++11 solves the problem to an extent by providing raw strings with
user-defined delimiters (up to 16 printable characters excluding
spaces, parentheses and backslash), e.g.:

R"delim(quote: " backslash: \ rparen: ))delim"

evaluates to the string:

quote: " backslash: \ rparen: )

The only sequence which cannot appear in such a string is )delim" (i.e. a
right parenthesis followed by the chosen delimiter string followed by a
double quote). The delimiter can be chosen either by analysing the string
or by choosing a string at random and relying upon a collision
being statistically improbable.
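
Since the examples above generate C source from Python, here is a hedged Python sketch of the "analyse the string" option: try candidate delimiters until a right parenthesis followed by the delimiter and a double quote no longer occurs in the text. The names pick_delimiter and cxx_raw_literal are hypothetical, and the candidate scheme (the empty delimiter first, then x, xx, and so on) is just one possible choice.

```python
MAX_DELIM = 16  # C++11 limit on raw-string delimiter length


def pick_delimiter(text):
    """Return the shortest run of 'x' for which ')' + delim + '"' is absent."""
    for n in range(MAX_DELIM + 1):
        delim = "x" * n
        if ")" + delim + '"' not in text:
            return delim
    raise ValueError("no delimiter of up to 16 'x' characters is safe")


def cxx_raw_literal(text):
    """Wrap text in a C++11 raw string literal using a safe delimiter."""
    delim = pick_delimiter(text)
    return 'R"%s(%s)%s"' % (delim, text, delim)


print(cxx_raw_literal('quote: " backslash: \\ rparen: )'))
print(cxx_raw_literal('tricky: )" and )x"'))
```

The random-delimiter option would replace the loop with a randomly generated candidate and skip the scan, trading certainty for speed.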

rusi

May 22, 2012, 1:59:31 AM
On Apr 6, 10:13 am, Steve Howell <showel...@yahoo.com> wrote:
> On Apr 5, 9:59 pm,rusi<rustompm...@gmail.com> wrote:
On a (somewhat distantly) related note, found this old fortune:

Wouldn't the sentence 'I want to put a hyphen between the words Fish
and And and And and Chips in my Fish-And-Chips sign' have been clearer
if quotation marks had been placed before Fish, and between Fish and
and, and and and And, and And and and, and and and And, and And and
and, and and and Chips, as well as after Chips?

John Nagle

May 23, 2012, 2:19:55 PM
On 4/5/2012 10:10 PM, Steve Howell wrote:
> On Apr 5, 9:59 pm, rusi<rustompm...@gmail.com> wrote:
>> On Apr 6, 6:56 am, Steve Howell<showel...@yahoo.com> wrote:

> You've one-upped me with 2-to-the-N backspace escaping.

Early attempts at UNIX word processing, "nroff" and "troff",
suffered from that problem, due to a badly designed macro system.

A question in language design is whether to escape or quote.
Do you write

"X = %d" % (n,)

or

"X = " + str(n)

In general, for anything but output formatting, the second scales
better. Regular expressions have a bad case of the first.
For a quoted alternative to regular expression syntax, see
SNOBOL or Icon. SNOBOL allows naming patterns, and those patterns
can then be used as components of other patterns. SNOBOL
is obsolete, but that approach produced much more readable
code.
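
Python's re module has no SNOBOL-style pattern values, but the naming idea can be loosely approximated by building a regex out of named pieces. The sketch below is only that approximation; the pattern names and the date format are invented for illustration.

```python
import re

# Name the small pieces once...
YEAR = r"(?P<year>\d{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"
DAY = r"(?P<day>0[1-9]|[12]\d|3[01])"

# ...then compose them into a larger pattern by concatenation.
ISO_DATE = re.compile(YEAR + "-" + MONTH + "-" + DAY)

match = ISO_DATE.fullmatch("2012-05-23")
print(match.groupdict() if match else None)
```

The pieces are still written in escaped regex syntax, so this borrows only the composition half of the SNOBOL approach.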

John Nagle