Embedding a literal "\u" in a unicode raw string.

Romano Giannetti

Feb 25, 2008, 6:39:14 AM

to pytho...@python.org

Hi,

while writing some LaTeX preprocessing code, I stumbled into this problem: (I
have a -*- coding: utf-8 -*- line, obviously)

s = ur"añado $\uparrow$"

Which gave an error because the \u escape is interpreted in raw unicode strings,
too. So I found that the only way to solve this is to write:

s = unicode(r"añado $\uparrow$", "utf-8")

or

s = ur"añado $\u005cuparrow$"

The second one is too ugly to live, while the first is at least acceptable; but
looking around the Python 3.0 doc, I saw that the first one will fail, too.
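
For readers on Python 3, where the ur prefix no longer exists, the behaviour described above can be approximated with the raw_unicode_escape codec. This is only a sketch: it assumes the codec applies the same rules Python 2 used for ur"..." literals, and the helper name decode_like_ur is made up for illustration.

```python
def decode_like_ur(raw_bytes):
    """Hypothetical helper: decode bytes roughly the way Python 2 read ur"..." literals."""
    return raw_bytes.decode("raw_unicode_escape")


# "\u" followed by non-hex characters is rejected, as with ur"$\uparrow$".
try:
    decode_like_ur(rb"$\uparrow$")
except UnicodeDecodeError:
    failed = True
else:
    failed = False

# Spelling the backslash itself as \u005c decodes fine, but is hard to read.
workaround = decode_like_ur(rb"$\u005cuparrow$")

print(failed)
print(workaround)
```

The bytes literals are kept ASCII-only here because the codec treats non-escaped bytes as Latin-1, which would not match a UTF-8 source file.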

Am I doing something wrong here, or is there another solution for this?

Romano

Diez B. Roggisch

Feb 25, 2008, 7:46:38 AM

Romano Giannetti wrote:

Why don't you rid yourself of the raw-string? Then you need to do

s = u"añado $\\uparrow$"

which is considerably easier to read than both other variants above.
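
A quick way to convince yourself that the doubled backslash yields the intended text, written as a sketch for a current Python 3 (where every str is Unicode and a raw literal leaves \u alone); the variable names are only illustrative:

```python
# Non-raw literal: the backslash is doubled so that \u is not read as an escape.
escaped = "añado $\\uparrow$"

# Raw literal: on Python 3 the backslash is kept exactly as typed.
raw = r"añado $\uparrow$"

print(escaped == raw)
print(escaped)
```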

Diez

Thinker

Feb 25, 2008, 7:58:04 AM

to Romano Giannetti, pytho...@python.org

Romano Giannetti wrote:
> Hi,
>
> while writing some LaTeX preprocessing code, I stumbled into this
> problem: (I have a -*- coding: utf-8 -*- line, obviously)
>
> s = ur"añado $\uparrow$"
>
> Which gave an error because the \u escape is interpreted in raw
> unicode strings, too. So I found that the only way to solve this is
> to write:
>
> s = unicode(r"añado $\uparrow$", "utf-8")
>
> or
>
> s = ur"añado $\u005cuparrow$"
>
> The second one is too ugly to live, while the first is at least
> acceptable; but looking around the Python 3.0 doc, I saw that the
> first one will fail, too.
>
> Am I doing something wrong here, or is there another solution for
> this?
>

> Romano
>
The backslash '\' is a meta-character that starts an escape sequence in
a string literal. You can escape the backslash itself by inserting
another '\' before it, as in the following string:
u"....\\u...."

Romano Giannetti

Feb 25, 2008, 8:13:38 AM

to pytho...@python.org

Thinker <thinker <at> branda.to> writes:

>
>
> > s = ur"añado $\uparrow$"
> >
> > Which gave an error because the \u escape is interpreted in raw
> > unicode strings, too. So I found that the only way to solve this is
> > to write:
> >
> > s = unicode(r"añado $\uparrow$", "utf-8")
> >
> > or
> >
> > s = ur"añado $\u005cuparrow$"
> >
> >

> The backslash '\' is a meta-character that starts an escape sequence in
> a string literal. You can escape the backslash itself by inserting
> another '\' before it, as in the following string:
> u"....\\u...."
>

(Answering this and the other off thread answer by Diez)

Well, I have simplified too much. The problem is that, when writing LaTeX
snippets, a lot of backslashes are involved. So the un-raw string is difficult
to read because of all those doubled \\, and the raw string is just handy.
Moreover, that way I can copy-and-paste LaTeX code between ur""" """ marks.

Searching more, I even found a thread in python-dev where Guido himself seemed
convinced that this "\u" interpretation in raw strings is at least a bit
disappointing:

http://mail.python.org/pipermail/python-dev/2007-May/073042.html

but I saw later that it will still be there in 3.0. That means that all my
unicode(r"\uparrow", "utf-8") will break... sigh.

Thanks anyway,

Romano

romano.g...@gmail.com

Feb 25, 2008, 11:21:58 AM

> unicode(r"\uparrow", "utf-8") will break... sigh.
>

Moreover, I checked with 2to3.py, and it says (similar case):

-ok_preamble = unicode(r"""
+ok_preamble = str(r"""
\usepackage[utf8]{inputenc}
\begin{document}
Añadidos:
""", "utf-8")

which AFAIK will give an error for the \u in \usepackage. Hmmm...
should I dare ask on the developer list? :-)
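
A small check of what that converted code does on a released Python 3, as a sketch only (the name preamble and the shortened text are illustrative, not the original file). As far as I can tell, the raw literal with \usepackage parses without complaint there, and the failure comes instead from str() refusing to decode something that is already a str:

```python
# Raw literal containing \u: accepted as-is by Python 3.
preamble = r"""
\usepackage[utf8]{inputenc}
\begin{document}
Añadidos:
"""

# Python 3 str objects are already Unicode, so asking str() to decode one fails.
try:
    str(preamble, "utf-8")
except TypeError as exc:
    decode_error = exc
else:
    decode_error = None

print(type(decode_error).__name__)
```

So the str(..., "utf-8") wrapper that 2to3 leaves behind has to be dropped by hand; the plain raw literal is enough.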

OKB (not okblacke)

Feb 25, 2008, 12:03:53 PM

Romano Giannetti wrote:

I too encountered this problem, in the same situation (making
strings that contain LaTeX commands). One possibility is to separate
out just the bit that has the \u, and use string juxtaposition to attach
it to the others:

s = ur"añado " u"$\\uparrow$"

It's not ideal, but I think it's easier to read than your solution
#2.
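
The juxtaposition trick relies on adjacent string literals being joined into one string, and raw and non-raw pieces can be mixed freely. A minimal sketch in Python 3 syntax (no u prefix; the \alpha fragment is added here only to give the raw part a backslash of its own):

```python
# Only the fragment containing \u needs the doubled backslash;
# the rest stays raw and can be pasted straight from LaTeX.
s = r"añado $\alpha$ y " "$\\uparrow$"

print(s)
```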

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown

romano.g...@gmail.com

Feb 25, 2008, 12:24:20 PM

On Feb 25, 6:03 pm, "OKB (not okblacke)" <brenNOSPAMb...@NObrenSPAMbarn.net> wrote:
>
> I too encountered this problem, in the same situation (making
> strings that contain LaTeX commands). One possibility is to separate
> out just the bit that has the \u, and use string juxtaposition to attach
> it to the others:
>
> s = ur"añado " u"$\\uparrow$"
>
> It's not ideal, but I think it's easier to read than your solution
> #2.
>

Yes, I think I will do something like that, although... I really do
not understand why \x5c is not interpreted in a raw string but \u005c
is interpreted in a unicode raw string... it is, well, not elegant. Raw
should be raw...

Thanks anyway

"Martin v. Löwis"

Feb 25, 2008, 5:27:32 PM

to romano.g...@gmail.com

> Yes, I think I will do something like that, although... I really do
> not understand why \x5c is not interpreted in a raw string but \u005c
> is interpreted in a unicode raw string... it is, well, not elegant. Raw
> should be raw...

Right. IMO, this is just a plain design mistake in the Python Unicode
handling. Unfortunately, there was discussion about this specific issue
in the past, and the proponent of the status quo always defended it,
with the rationale (IIUC) that a) without that, you can't put arbitrary
Unicode characters into a raw string, and b) the semantics of \u in Java
and C are such that \u gets processed before tokenization even starts, and
it should be the same in Python.

Regards,
Martin

rmano

Feb 25, 2008, 5:45:54 PM

On Feb 25, 11:27 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > Raw
> > should be raw...
>
> Right. IMO, this is just a plain design mistake in the Python Unicode
> handling. Unfortunately, there was discussion about this specific issue
> in the past, and the proponent of the status quo always defended it,
> with the rationale (IIUC) that a) without that, you can't put arbitrary
> Unicode characters into a raw string, and b) the semantics of \u in Java
> and C are such that \u gets processed before tokenization even starts, and
> it should be the same in Python.

Well, I do not know Java, but C AFAIK has no raw strings, so you have
to use double backslashes there anyway. Raw strings are a handy shorthand
when you can generate the characters with your keyboard, and this asymmetry
quite defeats it.

Is it decided, or is it possible to lobby for a change? :-)

Thanks,
Romano

BTW, I think 2to3.py should warn when it finds a raw string (not unicode)
with \u in it. I tried it and it seems to ignore the problem...

NickC

Mar 4, 2008, 7:00:17 AM

On Feb 26, 8:45 am, rmano <romano.gianne...@gmail.com> wrote:
> BTW, I think 2to3.py should warn when it finds a raw string (not unicode)
> with \u in it. I tried it and it seems to ignore the problem...

Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> r"\u"
'\\u'
>>> r"\uparrow"
'\\uparrow'
>>> r"\u005c"
'\\u005c'
>>> r"\N{REVERSE SOLIDUS}"
'\\N{REVERSE SOLIDUS}'
>>> "\u005c"
'\\'
>>> "\N{REVERSE SOLIDUS}"
'\\'

2to3.py may be ignoring a problem, but existing raw 8-bit string
literals containing a '\u' aren't going to be it. If anything is going
to have a problem with conversion to Py3k at this point, it is raw
Unicode literals that contain a Unicode escape.
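
To make that last point concrete, here is a rough comparison, assuming the raw_unicode_escape codec is a fair stand-in for how Python 2 read ur"..." literals (the variable names are illustrative):

```python
# What Python 2's ur"\u2191" produced, approximated with the codec:
# the escape is decoded into a single arrow character.
py2_style = rb"\u2191".decode("raw_unicode_escape")

# The same text as a Python 3 raw literal: backslash, 'u' and four digits.
py3_raw = r"\u2191"

print(len(py2_style))  # 1
print(len(py3_raw))    # 6
```

A literal like this silently changes meaning when the u prefix is dropped, which is the case a conversion tool would need to flag.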

rmano

Mar 7, 2008, 10:18:11 AM

On Mar 4, 1:00 pm, NickC <ncogh...@gmail.com> wrote:
>
> Python 3.0a3+ (py3k:61229, Mar 4 2008, 21:38:15)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> r"\u"
> '\\u'
> >>> r"\uparrow"
> '\\uparrow'

Nice to know... so it seems that the 3.0 doc was not updated. I think
this is the correct behaviour. Thanks