How to convert a raw string r'\xdd' to '\xdd' more gracefully?

Jach Feng

Dec 6, 2022, 9:23:20 PM

s0 = r'\x0a'
At the moment I convert it like this:

import re

def to1byte(matchobj):
    return chr(int('0x' + matchobj.group(1), 16))

s1 = re.sub(r'\\x([0-9a-fA-F]{2})', to1byte, s0)

But is it really this difficult to do such a simple thing?

--Jach

MRAB

Dec 6, 2022, 10:04:43 PM

You could try this:

>>> import ast
>>> s0 = r'\x0a'
>>> ast.literal_eval('"%s"' % s0)
'\n'

Jach Feng

Dec 6, 2022, 10:38:10 PM

MRAB wrote on Wednesday, December 7, 2022 at 11:04:43 AM [UTC+8]:

It doesn't work on my system :-(

Python 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:08:11) [MSC v.1928 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s0 = r'\x0a'
>>> import ast
>>> ast.literal_eval("%s" % s0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Jach\AppData\Local\Programs\Python\Python38-32\lib\ast.py", line 59, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "C:\Users\Jach\AppData\Local\Programs\Python\Python38-32\lib\ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    \x0a
       ^
SyntaxError: unexpected character after line continuation character

Thomas Passin

Dec 6, 2022, 11:51:32 PM

I'm not totally clear on what you are trying to do here. But:

s1 = r'\xdd' # s1[2:] = 'dd'
n1 = int(s1[2:], 16) # = 221 decimal or 0xdd in hex
# So
chr(n1) == 'Ý' # True
# and
'\xdd' == 'Ý' # True

So the conversion you want seems to be chr(int(s1[2:], 16)).

Of course, this will only work if the input string is exactly four
characters long, and the first two characters are r'\x', and the
remaining two characters are going to be a hex string representation of
a number small enough to fit into a byte.

If you know for sure that will be the case, then the conversion above
seems to be about as simple as it could be. If those conditions may not
always be met, then you need to work out exactly what strings you may
need to convert, and what they should be converted to.
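
A minimal sketch of the slicing idea with the stated conditions checked explicitly, instead of assumed. The helper name single_escape_to_char is hypothetical; the pattern is the same one used in the original post.

```python
import re

# Exactly: backslash, 'x', two hex digits. Nothing before, nothing after.
_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')

def single_escape_to_char(s):
    # fullmatch() rejects anything that is not exactly one escape,
    # so the slicing assumptions can never be silently violated.
    m = _HEX_ESCAPE.fullmatch(s)
    if m is None:
        raise ValueError(f"not a single \\xdd escape: {s!r}")
    return chr(int(m.group(1), 16))

print(repr(single_escape_to_char(r'\x0a')))
print(single_escape_to_char(r'\xdd') == '\xdd')
```

Raising ValueError on anything else makes the "exactly four characters" restriction visible to the caller rather than producing a wrong character.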

Jach Feng

Dec 7, 2022, 2:40:26 AM

Thomas Passin wrote on Wednesday, December 7, 2022 at 12:51:32 PM [UTC+8]:

Thank you for reminding me that the '0x' + in the to1byte() definition is redundant :-)

I'm just not sure whether there is a better way to do it than chr(int(...)).
Yes, for this specific case, slicing is much simpler than re.sub().

Roel Schroeven

Dec 7, 2022, 3:42:48 AM

On 7/12/2022 at 4:37, Jach Feng wrote:

You missed a pair of quotes. They are easily overlooked but very
important. The point is to wrap your string in another pair of quotes so
it becomes a valid Python string literal in a Python string which can
then be passed to ast.literal_eval(). Works for me:

In [7]: s0 = r'\x0a'

In [8]: import ast

In [9]: ast.literal_eval('"%s"' % s0)
Out[9]: '\n'
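
A small sketch of the quoting trick wrapped in a function, together with one limitation worth knowing about. The function name is hypothetical; the assumption illustrated here is that the input contains no double quote, because one would end the synthesized literal early.

```python
import ast

def unescape_with_literal_eval(s):
    # Wrap the text in double quotes so it becomes the source code
    # of a Python string literal, then let literal_eval parse it.
    return ast.literal_eval('"%s"' % s)

print(repr(unescape_with_literal_eval(r'\x0a')))

# An embedded double quote terminates the literal too early,
# so the synthesized source is no longer valid Python.
try:
    unescape_with_literal_eval(r'say "hi"\x0a')
except SyntaxError as exc:
    print("rejected:", type(exc).__name__)
```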

--
"Experience is that marvelous thing that enables you to recognize a
mistake when you make it again."
-- Franklin P. Jones

Peter Otten

Dec 7, 2022, 4:17:59 PM

>>> import codecs
>>> codecs.decode(r"\x68\x65\x6c\x6c\x6f\x0a", "unicode-escape")
'hello\n'

Jach Feng

Dec 7, 2022, 8:16:22 PM

Roel Schroeven wrote on Wednesday, December 7, 2022 at 4:42:48 PM [UTC+8]:

Thank you for pointing that out. I did notice the extra quotes in MRAB's post, but didn't figure out what they were for at the time :-(

Jach Feng

Dec 7, 2022, 8:17:35 PM

Peter Otten wrote on Thursday, December 8, 2022 at 5:17:59 AM [UTC+8]:

Thank you. What I really want is to handle any r'\xdd'. The r'\x0a' was just an example. Sorry I didn't describe it clearly :-)

Jach Feng

Dec 8, 2022, 3:56:55 AM

Jach Feng wrote on Wednesday, December 7, 2022 at 10:23:20 AM [UTC+8]:

I found another answer on the web.

>>> s0 = r'\x0a'
>>> s0.encode('Latin-1').decode('unicode-escape')
'\n'

Weatherby,Gerard

Dec 8, 2022, 8:40:07 AM

I'm not understanding the task. The sample code given appears to convert the input r'\x0a' to a newline.

import re

def exam(z):
    print(f"examine {type(z)} {z}")
    for c in z:
        print(f"{ord(c)} {c}")

s0 = r'\x0a'

def to1byte(matchobj):
    return chr(int('0x' + matchobj.group(1), 16))

s1 = re.sub(r'\\x([0-9a-fA-F]{2})', to1byte, s0)

exam(s0)
exam(s1)

---
examine <class 'str'> \x0a
92 \
120 x
48 0
97 a
examine <class 'str'>

10


Thomas Passin

Dec 8, 2022, 9:12:57 AM

The original post started out with r'\x0a' but then talked about '\xdd'.
I assumed that there was a pattern here, a raw string containing "\x"
and two more characters, and made a suggestion for converting any string
with that pattern. But the OP was very unclear what the task really
was, so here we all are, making a variety of guesses.


Peter Otten

Dec 8, 2022, 11:20:48 AM

Hm, codecs.decode() does work for arbitrary escapes. It will produce the
same result for r"\xdd"-type raw strings where each d is a hex digit (0...F).
It will also convert other escapes, for example:

>>> codecs.decode(r"\t", "unicode-escape")
'\t'
>>> codecs.decode(r"\u5728", "unicode-escape")
'在'

moi

Dec 8, 2022, 2:12:43 PM

PS C:\humour> py38 sysargwithliteral.py abc\x80œ cp1252
abc€œ
PS C:\humour> py38 sysargwithliteral.py abc\xe1œ€ cp1253
abcαœ€
PS C:\humour> py38 sysargwithliteral.py abc\xe1\xe2\xe3z cp1253
abcαβγz
PS C:\humour> py38 sysargwithliteral.py abc\xe1\xe2\xe3z cp437
abcßΓπz
PS C:\humour> py38 sysargwithliteral.py abc\xe1\xe2\xe3z cp850
abcßÔÒz
PS C:\humour> py38 sysargwithliteral.py abc\u03b1\u03b2\u03b3z unicode
abcαβγz
PS C:\humour> py38 sysargwithliteral.py abc\u03b1\u03b2\u03b3z unicode
abcαβγz
PS C:\humour> py38 sysargwithliteral.py abc\\ cp1252
abc\

Anyway, interpreting escapes in a command line may lead to nonsense.
The same goes for piping.

PS C:\humour> py38 sysargwithliteral.py x:\xffb.html cp1252
x:ÿb.html
PS C:\humour> py38 sysargwithliteral.py x:\\xffb.html cp1252
x:\xffb.html
PS C:\humour>

moi

Dec 8, 2022, 4:08:51 PM

On Wednesday, December 7, 2022 at 22:17:59 UTC+1, Peter Otten wrote:

> >>> import codecs
> >>> codecs.decode(r"\x68\x65\x6c\x6c\x6f\x0a", "unicode-escape")
> 'hello\n'

Rejected.

It works correctly only by chance, because you are using ASCII.
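
A short sketch of what goes wrong outside ASCII, and one possible workaround. When codecs.decode() is given a str with "unicode-escape", the text is first turned into UTF-8 bytes, and each of those bytes is then read as one Latin-1 character. Encoding with Latin-1 and the backslashreplace error handler first is an assumption of this sketch, not something proposed in the thread.

```python
import codecs

raw = r'caf\xe9 ö'   # one escape plus one real non-ASCII character

# Naive call: the escape is decoded, but 'ö' is mangled into two characters.
naive = codecs.decode(raw, 'unicode-escape')
print(repr(naive))

# Workaround sketch: Latin-1 keeps U+0000..U+00FF as single bytes, and
# backslashreplace turns anything else into a \uXXXX escape that the
# unicode-escape decoder restores afterwards.
safe = raw.encode('latin-1', 'backslashreplace').decode('unicode-escape')
print(repr(safe))
```

The workaround still interprets every backslash sequence in the input, so it does not help when a backslash is meant literally.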

Jach Feng

Dec 8, 2022, 9:05:15 PM

Jach Feng wrote on Wednesday, December 7, 2022 at 10:23:20 AM [UTC+8]:

The whole story is this:

I have a script which accepts a positional argument via argparse. I would like this argument to be able to contain embedded control characters when required. So I posted "How to enter escape character in a positional string argument from the command line?" on Dec 5, but there was no response. I assumed that there is no way of doing it, and that I have to convert the string after I get it from the command line.

I did this conversion using the chr(int(...)) method but was not satisfied with it. That is why I made this post.

At the moment the conversion is done in almost the same way as Peter's codecs.decode() method, but without the need to import the codecs module :-)

def to1byte(matchobj):
    return matchobj.group(0).encode().decode("unicode-escape")
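
For completeness, a sketch of how that callback might be wired into re.sub(). The wrapper name unescape_hex is hypothetical, and the pattern is assumed to be the one from the first post (without the capture group, since group(0) is used). Because only the matched escape is passed through unicode-escape, non-ASCII text elsewhere in the string is left alone.

```python
import re

def to1byte(matchobj):
    # group(0) is the whole match, e.g. the four characters \x0a.
    # It is pure ASCII, so encode() followed by unicode-escape is safe here.
    return matchobj.group(0).encode().decode("unicode-escape")

def unescape_hex(s):
    # Replace every \xdd escape; everything else passes through unchanged.
    return re.sub(r'\\x[0-9a-fA-F]{2}', to1byte, s)

print(repr(unescape_hex(r'a\x0ab ö')))
```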

Weatherby,Gerard

Dec 9, 2022, 8:36:18 AM

That's actually more of a shell question than a Python question. How you pass certain control characters is going to depend on the shell, the operating system, and possibly the keyboard you're using (e.g. https://www.alt-codes.net).

Here's a sample program. The dashes help show the boundaries of the string.

#!/usr/bin/env python3
import argparse

parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('data')
args = parser.parse_args()
print(f'Input\n: -{args.data}- length {len(args.data)}')
# Print the code point of every character received from the shell
for c in args.data:
    print(f'{ord(c)} ', end='')
print()

Using bash on Linux:

./cl.py '^M
'
Input
-
- length 3
13 32 10
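
When typing the control character in the shell is impractical, one alternative is to let the script itself translate \xdd escapes. The sketch below uses the type= parameter of add_argument(), which accepts any callable taking the raw string; the converter name unescape_hex is hypothetical, and the argument list is passed explicitly only so the example runs without a command line.

```python
import argparse
import re

def unescape_hex(s):
    # Turn each \xdd escape into the character with that code point.
    return re.sub(r'\\x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), s)

parser = argparse.ArgumentParser()
# argparse calls unescape_hex on the raw argument before storing it.
parser.add_argument('data', type=unescape_hex)

# Simulate: ./cl.py 'a\x0ab'
args = parser.parse_args([r'a\x0ab'])
print([ord(c) for c in args.data])
```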


moi

Dec 9, 2022, 10:41:20 AM

PS C:\humour> py38 sysargwithliteral.py a\x0ab\x09c\x0a\x80uro\x0ax\x08z cp1252
a
b c
€uro
z

PS C:\humour> $a = py38 sysargwithliteral.py a\x0ab\x09c\x0a\x80uro\x0ax\x08z cp1252

PS C:\humour> licp($a)
a U+0061
b U+0062
U+0009
c U+0063
€ U+20AC
u U+0075
r U+0072
o U+006F
x U+0078
U+0008
z U+007A

PS C:\humour>

PS C:\humour> py38 sysargwithliteral.py a\u000ab\u0009c\u000a\u20acuro\u000ax\u0008z\u000aend\U0001f60a unicode
a
b c
€uro
z
end😊

PS C:\humour>

PS C:\humour> py38 sysargwithliteral.py a\x0ab\x09c\x0a\x80uro\x0ax\x08z cp1252 | py38 -c "import sys; s = sys.stdin.read(); print(s.rstrip())"
a
b c
€uro
z

PS C:\humour>
Note: In a terminal "\t" is correct.

Jach Feng

Dec 9, 2022, 9:07:10 PM

Weatherby,Gerard wrote on Friday, December 9, 2022 at 9:36:18 PM [UTC+8]:

You are right. That's why I later found it easier to enter the string using an agreed escape pattern. But there is one case, as moi mentioned in his previous post, that will cause a failure: when a Windows path containing something of the form \xdd happens to appear in the string :-(
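
One way to sidestep the path problem, sketched under the assumption that the user is willing to double any backslash that should stay literal (the convention moi's x:\\xffb.html example relies on). The names below are hypothetical. A single regular expression consumes either a doubled backslash or a \xdd escape, so a backslash that has already been consumed as literal can never start an escape.

```python
import re

# Either an escaped backslash (\\) or a hex escape (\xdd).
_TOKEN = re.compile(r'\\(\\|x[0-9a-fA-F]{2})')

def _replace(m):
    body = m.group(1)
    if body == '\\':
        return '\\'                  # doubled backslash: keep one, literally
    return chr(int(body[1:], 16))    # hex escape: convert to a character

def unescape(s):
    return _TOKEN.sub(_replace, s)

print(unescape(r'x:\xffb.html'))     # escape is interpreted
print(unescape(r'x:\\xffb.html'))    # backslash is kept, no escape
```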

Jach Feng

Dec 9, 2022, 9:12:54 PM

moi wrote on Friday, December 9, 2022 at 11:41:20 PM [UTC+8]:

Where can I find sysargwithliteral.py?

moi

Dec 11, 2022, 9:05:27 AM

I have limited PowerShell experience. I did something wrong in licp().

PS C:\humour> $a =py38x sysargwithliteral.py '\xc5\x81uckasz\x20pays\x0ain\x20\xe2\x82\xacuro' utf8
PS C:\humour> $a
Łuckasz pays
in €uro
PS C:\humour> licp2 $a
Ł U+0141
u U+0075
c U+0063
k U+006B
a U+0061
s U+0073
z U+007A
U+0020
p U+0070
a U+0061
y U+0079
s U+0073
U+000D

U+000A
i U+0069
n U+006E
U+0020

€ U+20AC
u U+0075
r U+0072
o U+006F

PS C:\humour> $b =py38x sysargwithliteral.py Łuckasz\x20pays\x0ain\x20€uro iso-8859-2
PS C:\humour> $b
Łuckasz pays
in €uro
PS C:\humour> licp2 $b
Ł U+0141
u U+0075
c U+0063
k U+006B
a U+0061
s U+0073
z U+007A
U+0020
p U+0070
a U+0061
y U+0079
s U+0073
U+000D

U+000A
i U+0069
n U+006E
U+0020

€ U+20AC
u U+0075
r U+0072
o U+006F

PS C:\humour> $aa = $a | out-string
PS C:\humour> $bb = $b | out-string
PS C:\humour> $aa -eq $bb
True
PS C:\humour>

-----

In

PS C:\humour> $a = py38 -c "print('a\nb')"

$a is not a string; it is an array:

PS C:\humour> $a.gettype()

IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Object[] System.Array

moi

Dec 12, 2022, 4:38:50 AM

>>> ast.literal_eval("r'\x7a'") == ast.literal_eval("r'z'")
True
>>> ast.literal_eval("r'\xe0'") == ast.literal_eval("r'à'")
True
>>> ast.literal_eval("r'\x9c'") == ast.literal_eval("r'œ'")
False

---------

>>> print(codecs.decode(r'z', 'unicode-escape'))
z
>>> print(codecs.decode(r'g\hz', 'unicode-escape'))
g\hz
>>> print(codecs.decode(r'g\az', 'unicode-escape'))
g\u0007z
>>> print(codecs.decode(r'g\nz', 'unicode-escape'))
g
z
>>> print(codecs.decode(r'abcü', 'unicode-escape'))
abcÃ¼
>>>

Jach Feng

Dec 12, 2022, 6:04:01 AM

moi wrote on Monday, December 12, 2022 at 5:38:50 PM [UTC+8]:

I get a different result :-)

>>> print(codecs.decode(r'g\hz', 'unicode-escape'))
<stdin>:1: DeprecationWarning: invalid escape sequence '\h'
g\hz
>>> print(codecs.decode(r'g\az', 'unicode-escape'))
gz # accompanied by a bell

moi

Dec 12, 2022, 7:26:56 AM

>>> Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:34:34) [MSC v.1928 32 bit (Intel)] on win32
coq runs coqzero.py...
...coqzero has been executed
>>> import unicodedata
>>> import codecs
>>> print('a\u0000b\bcd\x1fend')
a\u0000b\u0008cd\u001fend
>>>
>>> unicodedata.normalize('NFKD', 'aéböc')
'ae\u0301bo\u0308c'
>>>
>>> print(codecs.decode(r'ö', 'unicode-escape'))
Ã¶
>>> codecs.decode(r'ö', 'unicode-escape')
'Ã¶'
>>>

"official py38" :
>>> print(codecs.decode(r'ö', 'unicode-escape'))
Ã¶
>>>

moi

Dec 12, 2022, 7:29:44 AM

A part was missing from my e-mail.

Sorry. I used *my* interactive interpreter. I took the liberty of displaying
"chars" a little bit differently.

moi

Dec 23, 2022, 3:27:46 AM

-------

Deleted.
It works. It is, however, nonsense.