
"convert" string to bytes without changing data (encoding)


Peter Daum

Mar 28, 2012, 4:56:20 AM
Hi,

is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

Regards,
Peter

Chris Angelico

Mar 28, 2012, 5:02:42 AM
to pytho...@python.org
On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <ga...@cs.tu-berlin.de> wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.

ChrisA

Stefan Behnel

Mar 28, 2012, 5:08:24 AM
to pytho...@python.org
Peter Daum, 28.03.2012 10:56:
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

If you can tell us what you actually want to achieve, i.e. why you want to
do this, we may be able to tell you how to do what you want.

Stefan

Peter Daum

Mar 28, 2012, 5:43:52 AM
... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

b=bytes(s,'ascii') # or ('utf-8', 'latin1', ...)

would of course work in this case, but in general, if s holds any
data with bytes > 127, the actual data will be changed according
to the provided encoding.

What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

Regards,
Peter

Heiko Wundram

Mar 28, 2012, 6:42:43 AM
to pytho...@python.org
Am 28.03.2012 11:43, schrieb Peter Daum:
> ... in my example, the variable s points to a "string", i.e. a series
> of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No; a string contains a series of codepoints from the unicode plane,
representing natural language characters (at least in the simplistic
view, I'm not talking about surrogates). These can be encoded to
different binary storage representations, of which ascii is (a common)
one.

> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

There is "logically" no raw data in the string, just a series of
codepoints, as stated above. You'll have to specify the encoding to use
to get at "raw" data, and from what I gather you're interested in the
latin-1 (or iso-8859-15) encoding, as you're specifically referencing
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
speak).
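
To illustrate why the encoding has to be named, here is a minimal sketch. The sample string is made up for the example; the point is only that one code point turns into different bytes under different codecs.

```python
s = '\u00e9'  # a single code point, U+00E9 (e with acute accent)

# The same code point becomes a different byte sequence under each codec.
latin1 = s.encode('latin-1')     # one byte
utf8 = s.encode('utf-8')         # two bytes
utf16 = s.encode('utf-16-le')    # two different bytes

print(latin1, utf8, utf16)
```

None of these is "the raw data" of the string; each is just one possible encoding of it.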

--
--- Heiko.

Stefan Behnel

Mar 28, 2012, 7:25:33 AM
to pytho...@python.org
Peter Daum, 28.03.2012 11:43:
> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

That's why I asked about your use case - where does the data come from and
why is it contained in a character string in the first place? If you could
provide that information, we can help you further.

Stefan

Ross Ridge

Mar 28, 2012, 11:36:10 AM
Chris Angelico <ros...@gmail.com> wrote:
>What is a string? It's not a series of bytes.

Of course it is. Conceptually you're not supposed to think of it that
way, but a string is stored in memory as a series of bytes.

What he's asking for may not be very useful or practical, but if that's
your problem here then that's what you should be addressing, not
pretending that it's fundamentally impossible.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rri...@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //

Chris Angelico

Mar 28, 2012, 12:18:43 PM
to pytho...@python.org
On Thu, Mar 29, 2012 at 2:36 AM, Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Chris Angelico  <ros...@gmail.com> wrote:
>>What is a string? It's not a series of bytes.
>
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Note that distinction. I said that a string "is not" a series of
bytes; you say that it "is stored" as bytes.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

That's equivalent to taking a 64-bit integer and trying to treat it as
a 64-bit floating point number. They're all just bits in memory, and
in C it's quite easy to cast a pointer to a different type and
dereference it. But a Python Unicode string might be stored in several
ways; for all you know, it might actually be stored as a sequence of
apples in a refrigerator, just as long as they can be referenced
correctly. There's no logical Python way to turn that into a series of
bytes.
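
The integer/float comparison can be tried from Python with the struct module. This is only a sketch of the analogy; the helper name reinterpret_int_as_float is made up for the example.

```python
import struct

def reinterpret_int_as_float(n):
    """Pack n as a 64-bit little-endian integer, then read the same 8 bytes as a double."""
    raw = struct.pack('<q', n)
    return struct.unpack('<d', raw)[0]

# Same bits, different type: the result is a tiny subnormal number, not 1.0.
print(reinterpret_int_as_float(1))
# A real conversion keeps the value instead of the bits.
print(float(1))
```

Relabelling the bits gives a value that has nothing to do with the original number, which is the same problem as relabelling a string's storage as bytes.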

ChrisA

Grant Edwards

Mar 28, 2012, 12:33:13 PM
On 2012-03-28, Chris Angelico <ros...@gmail.com> wrote:

> for all you know, it might actually be stored as a sequence of
> apples in a refrigerator

[...]

> There's no logical Python way to turn that into a series of bytes.

There's got to be a joke there somewhere about how to eat an apple...

--
Grant Edwards               grant.b.edwards        Yow! Somewhere in DOWNTOWN
                                  at               BURBANK a prostitute is
                              gmail.com            OVERCOOKING a LAMB CHOP!!

Dave Angel

Mar 28, 2012, 1:16:57 PM
to Peter Daum, pytho...@python.org
You needed to specify that you are using Python 3.x . In python 2.x, a
string is indeed a series of bytes. But in Python 3.x, you have to be
much more specific.

For example, if that string is coming from a literal, then you usually
can convert it back to bytes simply by encoding using the same method as
the one specified for the source file. So look at the encoding line at
the top of the file.



--

DaveA

Peter Daum

Mar 28, 2012, 1:43:36 PM
... I was under the illusion, that python (like e.g. perl) stored
strings internally in utf-8. In this case the "conversion" would simply
mean to re-label the data. Unfortunately, as I meanwhile found out, this
is not the case (nor the "apple encoding" ;-), so it would indeed be
pretty useless.

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x. The biggest problem that I am facing is, that I
am often dealing with data, that is basically text, but it can contain
8-bit bytes. In this case, I can not safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

As it seems, this would be far easier with python 2.x. With python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as innocent-
looking as "s=s+'/'" hidden in a rarely reached branch and a
seemingly correct program will crash with a TypeError 2 years
later ...)
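
For reference, the failure described here looks roughly like this in Python 3. The sample value is invented for the sketch; the fix is to append a bytes literal to bytes.

```python
path = b'caf\xe9'   # bytes of unknown encoding, with one high byte

try:
    path = path + '/'        # str literal: raises TypeError in Python 3
except TypeError as exc:
    print('TypeError:', exc)

path = path + b'/'           # bytes literal: works, high byte untouched
print(path)
```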

Regards,
Peter

Steven D'Aprano

Mar 28, 2012, 1:54:20 PM
On Wed, 28 Mar 2012 11:36:10 -0400, Ross Ridge wrote:

> Chris Angelico <ros...@gmail.com> wrote:
>>What is a string? It's not a series of bytes.
>
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

You don't know that. They might be stored as a tree, or a rope, or some
even more complex data structure. In fact, in Python, they are stored as
an object.

But even if they were stored as a simple series of bytes, you don't know
what bytes they are. That is an implementation detail of the particular
Python build being used, and since Python doesn't give direct access to
memory (at least not in pure Python) there's no way to retrieve those
bytes using Python code.

Saying that strings are stored in memory as bytes is no more sensible
than saying that dicts are stored in memory as bytes. Yes, they are. So
what? Taken out of context in a running Python interpreter, those bytes
are pretty much meaningless.


> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The right way to convert bytes to strings, and vice versa, is via
encoding and decoding operations. What the OP is asking for is as silly
as somebody asking to turn a float 1.3792 into a string without calling
str() or any equivalent float->string conversion. They're both made up of
bytes, right? Yeah, they are. So what?

Even if you do a hex dump of float 1.3792, the result will NOT be the
string "1.3792". And likewise, even if you somehow did a hex dump of the
memory representation of a string, the result will NOT be the equivalent
sequence of bytes except *maybe* for some small subset of possible
strings.



--
Steven

Ross Ridge

Mar 28, 2012, 2:05:11 PM
Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

Chris Angelico <ros...@gmail.com> wrote:
>Note that distinction. I said that a string "is not" a series of
>bytes; you say that it "is stored" as bytes.

The distinction is meaningless. I'm not going to argue with you about what
you or I meant by the word "is".

>But a Python Unicode string might be stored in several
>ways; for all you know, it might actually be stored as a sequence of
>apples in a refrigerator, just as long as they can be referenced
>correctly.

But it is in fact only stored in one particular way, as a series of bytes.

>There's no logical Python way to turn that into a series of bytes.

Nonsense. Play all the semantic games you want, it already is a series
of bytes.

Steven D'Aprano

Mar 28, 2012, 2:12:57 PM
On Wed, 28 Mar 2012 11:43:52 +0200, Peter Daum wrote:

> ... in my example, the variable s points to a "string", i.e. a series of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No. Strings are not sequences of bytes (except in the trivial sense that
everything in computer memory is made of bytes). They are sequences of
CODE POINTS. (Roughly speaking, code points are *almost* but not quite
the same as characters.)

I suggest that you need to reset your understanding of strings and bytes.
I suggest you start by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

Then come back and try to explain what actual problem you are trying to
solve.


--
Steven

Heiko Wundram

Mar 28, 2012, 2:13:11 PM
to pytho...@python.org
Am 28.03.2012 19:43, schrieb Peter Daum:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as innocent-
> looking as "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

It seems that you're mixing things up wrt. the string/bytes
distinction; it's not as "complicated" as it might seem.

1) Strings

s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them
is none of your concern (actually, that's rather complicated anyway, at
least with the upcoming Python 3.3), and processing a string basically
means that you'll work on the natural language characters present in the
string. Python strings can store (pretty much) all characters and
surrogates that unicode allows, and when the python interpreter/compiler
reads strings from input (I'm talking about source files), a default
encoding defines how the bytes in your input file get interpreted as
unicode codepoint encodings (generally, it depends on your system locale
or file header indications) to construct the internal string object
you're using to access the data in the string.

There is no such thing as a type for a single character; single
characters are simply strings of length 1 (and so indexing also returns
a [new] string object).

Single/double quotes work no different.

The internal encoding used by the Python interpreter is of no concern
to you.

2) Bytes

s = b'this is a byte-string'
s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit
integers, only representing a buffer which you can process as an array
of fixed-width integers. Reading from stdin/a file gets you bytes, and
not a string, because Python cannot automagically guess what format the
input is in.

Indexing the bytes type returns an integer (which is the clearest
distinction between string and bytes).
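
A short sketch of the indexing difference just described (variable names are arbitrary):

```python
text = 'abc'
data = b'abc'

print(type(text[0]), text[0])   # indexing a str gives a str of length 1
print(type(data[0]), data[0])   # indexing bytes gives an int (here 97)
print(data[0:1])                # slicing bytes still gives bytes
```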

Being able to input "string-looking" data in source files as bytes is a
debatable "feature" (IMHO; see the first example), simply because it
breaks the semantic difference between the two types in the eye of the
programmer looking at source.

3) Conversions

To get from bytes to string, you have to decode the bytes buffer,
telling Python what kind of character data is contained in the array of
integers. After decoding, you'll get a string object which you can
process using the standard string methods. For decoding to succeed, you
have to tell Python how the natural language characters are encoded in
your array of bytes:

b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural
language character data you've processed to a file), you have to encode
the data in your string buffer, which gets you an array of 8-bit
integers to write to the output:

'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a
standard encoding, and if that happens to be ASCII, you're getting
UnicodeEncodeErrors which tell you that a character in your string
source is unsuited to be transmitted using the encoding you've
specified.
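
As a rough illustration of that error, the following sketch encodes a made-up sample string explicitly instead of relying on an output stream's default:

```python
s = 'Gr\u00fc\u00dfe'   # German "Gruesse" written with u-umlaut and sharp s

try:
    s.encode('ascii')            # ASCII has no byte for these characters
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# An encoding that does contain both characters works fine.
encoded = s.encode('iso-8859-15')
print(encoded)
```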

If the above doesn't make the string/bytes-distinction and usage
clearer, and you have a C#-background, check out the distinction between
byte[] (which the System.IO-streams get you), and how you have to use a
System.Encoding-derived class to get at actual System.String objects to
manipulate character data. Python's type system wrt. character data is
pretty much similar, except for missing the "single character" type
(char).

Anyway, back to what you wrote: how are you getting the input data? Why
are "high bytes" in there which you do not know the encoding for?
Generally, from what I gather, you'll decode data from some source,
process it, and write it back using the same encoding which you used for
decoding, which should do exactly what you want and not get you into any
trouble with encodings.

--
--- Heiko.

Jussi Piitulainen

Mar 28, 2012, 2:13:53 PM
Peter Daum writes:

> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can read as bytes and decode as ASCII but ignoring the troublesome
non-text characters:

>>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia on ASCII, in UTF-8.
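
What the 'ignore' handler does to a single word can be seen in this small sketch. The sample word is chosen for the example; 'replace' is shown for comparison.

```python
raw = 'f\u00fcr'.encode('utf-8')           # the German word "fuer" with u-umlaut, as UTF-8 bytes

ignored = raw.decode('ascii', 'ignore')    # non-ASCII bytes are silently dropped
replaced = raw.decode('ascii', 'replace')  # each bad byte becomes U+FFFD

print(ignored)
print(len(raw), len(ignored), len(replaced))
```

Either way the original bytes cannot be recovered from the result, so this suits display better than round-tripping data.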

Prasad, Ramit

Mar 28, 2012, 2:20:23 PM
to pytho...@python.org
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as innocent-
> looking as "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

Just a small note as you are new to Python, string concatenation can
be expensive (quadratic time). The Python (2.x and 3.x) idiom for
frequent string concatenation is to append to a list and then join
them like the following (linear time).

>>> lst = ['Hi,']
>>> lst.append('how')
>>> lst.append('are')
>>> lst.append('you?')
>>> sentence = ' '.join(lst)  # use a space separating each element
>>> print(sentence)
Hi, how are you?

You can use join on an empty string, but then they will not be
separated by spaces.

>>> sentence = ''.join(lst)  # empty string so no separation
>>> print(sentence)
Hi,howareyou?

You can use any string as a separator, length does not matter.

>>> sentence = '@-Q'.join(lst)
>>> print(sentence)
Hi,@-Qhow@-Qare@-Qyou?


Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423


Ian Kelly

Mar 28, 2012, 2:20:30 PM
to Peter Daum, pytho...@python.org
On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <ga...@cs.tu-berlin.de> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII. And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without
knowing something about the encoding. Say you encounter a byte
greater than 127. Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character? If the next character is less
than 127, is it an ASCII character, or a continuation of the previous
character? For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure. If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
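
A concrete case of this ambiguity, as a sketch: the euro sign is picked only because its UTF-16 encoding happens to contain a byte that looks like an ASCII space.

```python
euro = '\u20ac'   # EURO SIGN

utf8 = euro.encode('utf-8')        # three bytes, all >= 0x80
utf16 = euro.encode('utf-16-le')   # two bytes, the second one is 0x20

# In UTF-8 no byte of a multi-byte character can be mistaken for ASCII.
print(all(b >= 0x80 for b in utf8))
# In UTF-16 the same character contains what looks like an ASCII space.
print(b' ' in utf16)
```

Code that splits such data on spaces without knowing the encoding would cut this character in half.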

Cheers,
Ian

Steven D'Aprano

Mar 28, 2012, 2:26:29 PM
On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I am
> often dealing with data, that is basically text, but it can contain
> 8-bit bytes.

All bytes are 8-bit, at least on modern hardware. I think you have to go
back to the 1950s to find 10-bit or 12-bit machines.

> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Well you can't do that, because *by definition* you are changing a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?

You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:

bytes = string.encode('ascii', 'replace')
bytes = string.encode('ascii', 'ignore')

When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail:

string = bytes.decode('latin1')

although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
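
A minimal sketch of the Latin-1 approach, assuming the data is only edited in its ASCII parts (the sample line is invented):

```python
raw = bytes(range(256))            # every possible byte value

text = raw.decode('latin-1')       # cannot fail: byte N becomes code point N
roundtrip = text.encode('latin-1')
print(roundtrip == raw)

# Append an ASCII character; the high byte passes through untouched.
line = b'caf\xe9 menu'
edited = (line.decode('latin-1') + '/').encode('latin-1')
print(edited)
```

Note that string methods which know about non-ASCII characters, such as upper(), would also change the high bytes, so this only stays lossless for ASCII-only edits.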

But in a nutshell, it is physically impossible to map the millions of
Unicode characters to just 256 possible bytes without either throwing
some characters away, or performing an encoding.



> As it seems, this would be far easier with python 2.x.

It only seems that way until you try.


--
Steven

Terry Reedy

Mar 28, 2012, 2:11:28 PM
to pytho...@python.org


On 3/28/2012 11:36 AM, Ross Ridge wrote:
> Chris Angelico<ros...@gmail.com> wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is. Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

*If* it is stored in byte memory. If you execute a 3.x program mentally
or on paper, then there are no bytes.

If you execute a 3.3 program on a byte-oriented computer, then the 'a'
in the string might be represented by 1, 2, or 4 bytes, depending on the
other characters in the string. The actual logical bit pattern will
depend on the big versus little endianness of the system.
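
On CPython 3.3 or later this can be observed indirectly with sys.getsizeof. The sketch below assumes that interpreter; the sizes are an implementation detail and other implementations may behave differently.

```python
import sys

ascii_only = 'a' * 100
with_euro = 'a' * 99 + '\u20ac'        # one character outside Latin-1
with_emoji = 'a' * 99 + '\U0001F600'   # one character outside the BMP

# All three strings have 100 characters, but CPython stores them with
# 1, 2 and 4 bytes per character respectively.
for s in (ascii_only, with_euro, with_emoji):
    print(len(s), sys.getsizeof(s))
```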

My impression is that if you go down to the physical bit level, then
again there are, possibly, no 'bytes' as a physical construct as the
bits, possibly, are stored in parallel on multiple ram chips.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The python-level way to get the bytes of an object that supports the
buffer interface is memoryview(). 3.x strings intentionally do not
support the buffer interface as there is not any particular
correspondence between characters (codepoints) and bytes.
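
A quick sketch of that difference between bytes and str:

```python
view = memoryview(b'abc')
print(list(view))            # the byte values behind the bytes object

try:
    memoryview('abc')        # str does not expose a buffer
    supported = True
except TypeError:
    supported = False
print(supported)
```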

The OP could get the ordinal for each character and decide how *he*
wants to convert them to bytes.

ba = bytearray()
for c in s:
    i = ord(c)
    # <append bytes to ba corresponding to i>

To get the particular bytes used for a particular string on a particular
system, OP should use the C API, possibly through ctypes.
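
Staying at the Python level, the ord() loop could be filled in as follows. Four big-endian bytes per code point is one arbitrary choice made for this sketch, and the function name is made up.

```python
def to_fixed_width_bytes(s):
    """Store each code point as 4 big-endian bytes (one arbitrary choice)."""
    ba = bytearray()
    for c in s:
        ba += ord(c).to_bytes(4, 'big')
    return bytes(ba)

sample = 'a\u00e9\u20ac'
print(to_fixed_width_bytes(sample))
```

For strings without lone surrogates this gives the same result as s.encode('utf-32-be'), which shows that choosing how to write out the ordinals is itself an encoding.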

--
Terry Jan Reedy

Ross Ridge

Mar 28, 2012, 2:22:50 PM
Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>The right way to convert bytes to strings, and vice versa, is via
>encoding and decoding operations.

If you want to dictate to the original poster the correct way to do
things then you don't need to do anything more than that. You don't need to
pretend like Chris Angelico that there isn't a direct mapping from
his Python 3 implementation's internal representation of strings
to bytes in order to label what he's asking for as being "silly".

Prasad, Ramit

Mar 28, 2012, 2:31:00 PM
to pytho...@python.org
> You can read as bytes and decode as ASCII but ignoring the troublesome
> non-text characters:
>
> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>
> The paragraph is from the German Wikipedia on ASCII, in UTF-8.

I see no non-ASCII characters, not sure if that is because the source
has none or something else. From this example I would not say that
the rest of the text is "unchanged". Decode converts to Unicode,
did you mean encode?

I think "ignore" will remove non-translatable characters and not
leave them in the returned string.

Ethan Furman

Mar 28, 2012, 2:17:56 PM
to Peter Daum, pytho...@python.org
Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Where is the data coming from? Files? In that case, it sounds like you
will want to decode/encode using 'latin-1', as the bulk of your text is
plain ascii and you don't really care about the upper-ascii chars.

~Ethan~

Tim Chase

Mar 28, 2012, 2:49:19 PM
to Ross Ridge, pytho...@python.org
On 03/28/12 13:05, Ross Ridge wrote:
> Ross Ridge<rri...@csclub.uwaterloo.ca> wrote:
>> But a Python Unicode string might be stored in several
>> ways; for all you know, it might actually be stored as a sequence of
>> apples in a refrigerator, just as long as they can be referenced
>> correctly.
>
> But it is in fact only stored in one particular way, as a series of bytes.
>
>> There's no logical Python way to turn that into a series of bytes.
>
> Nonsense. Play all the semantic games you want, it already is a series
> of bytes.

Internally, they're a series of bytes, but they are MEANINGLESS
bytes unless you know how they are encoded internally. Those
bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
possible encodings[1]. If you get the internal byte stream,
there's no way to meaningfully operate on it unless you also know
how it's encoded (or you're willing to sacrifice the ability to
reliably get the string back).
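
The same holds for any byte sequence, as this small sketch shows: two bytes mean different things depending on the encoding assumed.

```python
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')       # one character: U+00E9
as_latin1 = raw.decode('latin-1')   # two characters: U+00C3 and U+00A9

print(len(as_utf8), len(as_latin1))
```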

-tkc

[1]
http://docs.python.org/library/codecs.html#standard-encodings




Ross Ridge

Mar 28, 2012, 3:10:23 PM
Tim Chase <pytho...@tim.thechases.com> wrote:
>Internally, they're a series of bytes, but they are MEANINGLESS
>bytes unless you know how they are encoded internally. Those
>bytes could be UTF-8, UTF-16, UTF-32, or any of a number of other
>possible encodings[1]. If you get the internal byte stream,
>there's no way to meaningfully operate on it unless you also know
>how it's encoded (or you're willing to sacrifice the ability to
>reliably get the string back).

In practice the number of ways that CPython (the only Python 3
implementation) represents strings is much more limited. Pretending
otherwise really isn't helpful.

Still, if Chris Angelico had used your much less misleading explanation,
then this could've been resolved much quicker. The original poster
didn't buy Chris's bullshit for a minute, instead he had to find out on
his own that the internal representation of strings wasn't what he
expected it to be.

Evan Driscoll

Mar 28, 2012, 3:20:50 PM
to Ross Ridge, pytho...@python.org
Ross Ridge wrote:
> Steven D'Aprano<steve+comp....@pearwood.info> wrote:
>> The right way to convert bytes to strings, and vice versa, is via
>> encoding and decoding operations.
>
> If you want to dictate to the original poster the correct way to do
> things then you don't need to do anything more than that. You don't need to
> pretend like Chris Angelico that there isn't a direct mapping from
> his Python 3 implementation's internal representation of strings
> to bytes in order to label what he's asking for as being "silly".

That mapping may as well be:

import random

def get_bytes(some_string):
    # Deliberately meaningless: random bytes of a random length.
    length = random.randint(len(some_string), 5 * len(some_string))
    data = [0] * length
    for i in range(length):
        data[i] = random.randint(0, 255)
    return data

Of course this is hyperbole, but it's essentially about as much
guarantee as to what the result is.

As many others have said, the encoding isn't defined, and I would guess
varies between implementations. (E.g. if Jython and IronPython use their
host platforms' native strings, both have 16-bit chars and thus probably
use UTF-16 encoding. I am not sure what CPython uses, but I bet it's
*not* that.)

It's not even guaranteed that the byte representation won't change! If
something is lazily evaluated or you have a COW string or something, the
bytes backing it will differ.


So yes, you can say that pretending there's not a mapping of strings to
internal representation is silly, because there is. However, there's
nothing you can say about that mapping.

Evan

Albert W. Hopkins

Mar 28, 2012, 3:22:39 PM
to pytho...@python.org
On Wed, 2012-03-28 at 14:05 -0400, Ross Ridge wrote:
> Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> > Of course it is. Conceptually you're not supposed to think of it that
> > way, but a string is stored in memory as a series of bytes.
>
> Chris Angelico <ros...@gmail.com> wrote:
> >Note that distinction. I said that a string "is not" a series of
> >bytes; you say that it "is stored" as bytes.
>
> The distinction is meaningless. I'm not going to argue with you about what
> you or I meant by the word "is".
>

Off topic, but obligatory:

https://www.youtube.com/watch?v=j4XT-l-_3y0


Prasad, Ramit

unread,
Mar 28, 2012, 3:02:41 PM3/28/12
to pytho...@python.org
> >The right way to convert bytes to strings, and vice versa, is via
> >encoding and decoding operations.
>
> If you want to dictate to the original poster the correct way to do
> things then you don't need to do anything more than that. You don't need to
> pretend like Chris Angelico that there isn't a direct mapping from
> his Python 3 implementation's internal representation of strings
> to bytes in order to label what he's asking for as being "silly".

It might be technically possible to recreate internal implementation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:

>You can't generally just "deal with the ascii portions" without
>knowing something about the encoding. Say you encounter a byte
>greater than 127. Is it a single non-ASCII character, or is it the
>leading byte of a multi-byte character? If the next character is less
>than 127, is it an ASCII character, or a continuation of the previous
>character? For UTF-8 you could safely assume ASCII, but without
>knowing the encoding, there is no way to be sure. If you just assume
>it's ASCII and manipulate it as such, you could be messing up
>non-ASCII characters.

Technically, ASCII goes up to 256 but they are not A-z letters.

John Nagle

unread,
Mar 28, 2012, 3:30:49 PM3/28/12
to
On 3/28/2012 10:43 AM, Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

So why let the data get into a "str" type at all? Do everything
end to end with "bytes" or "bytearray" types.
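
A minimal sketch of that all-bytes approach, assuming the data is ASCII-compatible. The sample value and the name `raw` are made up for illustration; the point is only that `bytes` methods act on the ASCII letters and leave every byte above 127 alone.

```python
# Hypothetical sample: ASCII text with one byte (0xE9) in an unknown 8-bit encoding.
raw = b"caf\xe9 latte\n"

# bytes.upper() only changes the ASCII letters a-z; 0xE9 passes through untouched.
shouted = raw.upper()

# Searching and replacing works with bytes literals (note the b prefix).
replaced = raw.replace(b"latte", b"mocha")

print(shouted)
print(replaced)
```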

John Nagle

Ethan Furman

unread,
Mar 28, 2012, 2:49:26 PM3/28/12
to Prasad, Ramit, pytho...@python.org
Prasad, Ramit wrote:
>> You can read as bytes and decode as ASCII but ignoring the troublesome
>> non-text characters:
>>
>>>>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
>> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
>> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
>> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
>> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
>> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
>> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
>> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
>> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
>> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>>
>> The paragraph is from the German Wikipedia on ASCII, in UTF-8.
>
> I see no non-ASCII characters, not sure if that is because the source
> has none or something else.

The 'ignore' argument to .decode() caused all non-ascii characters to be
removed.
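
A small sketch of that effect, using a made-up sample string rather than the Wikipedia paragraph. It suggests why 'ignore' is lossy: the dropped bytes cannot be recovered afterwards.

```python
# Hypothetical sample: the German word "für" followed by ASCII text, encoded as UTF-8.
data = "f\u00fcr ASCII".encode("utf-8")   # two bytes are used for the u-umlaut

# errors='ignore' silently drops every byte that is not valid ASCII.
text = data.decode("ascii", "ignore")

# Encoding again does not bring the dropped bytes back.
round_trip = text.encode("ascii")
print(text)
```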

~Ethan~

Grant Edwards

unread,
Mar 28, 2012, 3:40:57 PM3/28/12
to
On 2012-03-28, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
>
>> The longer story of my question is: I am new to python (obviously), and
>> since I am not familiar with either one, I thought it would be advisory
>> to go for python 3.x. The biggest problem that I am facing is, that I am
>> often dealing with data, that is basically text, but it can contain
>> 8-bit bytes.
>
> All bytes are 8-bit, at least on modern hardware. I think you have to
> go back to the 1950s to find 10-bit or 12-bit machines.

Well, on anything likely to run Python that's true. There are modern
DSP-oriented CPUs where a byte is 16 or 32 bits (and so is an int and
a long, and a float and a double).

>> As it seems, this would be far easier with python 2.x.
>
> It only seems that way until you try.

It's easy as long as you deal with nothing but ASCII and Latin-1. ;)

--
Grant Edwards grant.b.edwards Yow! Somewhere in Tenafly,
at New Jersey, a chiropractor
gmail.com is viewing "Leave it to
Beaver"!

Grant Edwards

unread,
Mar 28, 2012, 3:44:02 PM3/28/12
to
On 2012-03-28, Prasad, Ramit <ramit....@jpmorgan.com> wrote:
>
>>You can't generally just "deal with the ascii portions" without
>>knowing something about the encoding. Say you encounter a byte
>>greater than 127. Is it a single non-ASCII character, or is it the
>>leading byte of a multi-byte character? If the next character is less
>>than 127, is it an ASCII character, or a continuation of the previous
>>character? For UTF-8 you could safely assume ASCII, but without
>>knowing the encoding, there is no way to be sure. If you just assume
>>it's ASCII and manipulate it as such, you could be messing up
>>non-ASCII characters.
>
> Technically, ASCII goes up to 256

No, ASCII only defines 0-127. Values >=128 are not ASCII.

From https://en.wikipedia.org/wiki/ASCII:

ASCII includes definitions for 128 characters: 33 are non-printing
control characters (now mostly obsolete) that affect how text and
space is processed and 95 printable characters, including the space
(which is considered an invisible graphic).
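
As a quick check of that boundary, here is a sketch using Python's strict "ascii" codec; the variable names are only for illustration.

```python
# All 128 ASCII code points (0-127) decode cleanly.
ascii_text = bytes(range(128)).decode("ascii")

# Any byte >= 128 is rejected by the strict ASCII codec.
try:
    bytes([128]).decode("ascii")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(len(ascii_text), rejected)
```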

--
Grant Edwards grant.b.edwards Yow! Used staples are good
at with SOY SAUCE!
gmail.com

MRAB

unread,
Mar 28, 2012, 3:50:01 PM3/28/12
to pytho...@python.org
On 28/03/2012 20:02, Prasad, Ramit wrote:
>> >The right way to convert bytes to strings, and vice versa, is via
>> >encoding and decoding operations.
>>
>> If you want to dictate to the original poster the correct way to do
>> things then you don't need to do anything more than that. You don't need to
>> pretend like Chris Angelico that there isn't a direct mapping from
>> his Python 3 implementation's internal representation of strings
>> to bytes in order to label what he's asking for as being "silly".
>
> It might be technically possible to recreate internal implementation,
> or get the byte data. That does not mean it will make any sense or
> be understood in a meaningful manner. I think Ian summarized it
> very well:
>
>>You can't generally just "deal with the ascii portions" without
>>knowing something about the encoding. Say you encounter a byte
>>greater than 127. Is it a single non-ASCII character, or is it the
>>leading byte of a multi-byte character? If the next character is less
>>than 127, is it an ASCII character, or a continuation of the previous
>>character? For UTF-8 you could safely assume ASCII, but without
>>knowing the encoding, there is no way to be sure. If you just assume
>>it's ASCII and manipulate it as such, you could be messing up
>>non-ASCII characters.
>
> Technically, ASCII goes up to 256 but they are not A-z letters.
>
Technically, ASCII is 7-bit, so it goes up to 127.

Ross Ridge

unread,
Mar 28, 2012, 3:43:31 PM3/28/12
to
Evan Driscoll <dris...@cs.wisc.edu> wrote:
>So yes, you can say that pretending there's not a mapping of strings to
>internal representation is silly, because there is. However, there's
>nothing you can say about that mapping.

I'm not the one labeling anything as being silly. I'm the one labeling
the things as bullshit, and that's what you're doing here. I can in
fact say what the internal byte string representation of strings is any
given build of Python 3. Just because I can't say what it would be in
an imaginary hypothetical implementation doesn't mean I can never say
anything about it.

Mark Lawrence

unread,
Mar 28, 2012, 4:44:14 PM3/28/12
to pytho...@python.org
On 28/03/2012 20:43, Ross Ridge wrote:
> Evan Driscoll<dris...@cs.wisc.edu> wrote:
>> So yes, you can say that pretending there's not a mapping of strings to
>> internal representation is silly, because there is. However, there's
>> nothing you can say about that mapping.
>
> I'm not the one labeling anything as being silly. I'm the one labeling
> the things as bullshit, and that's what you're doing here. I can in
> fact say what the internal byte string representation of strings is any
> given build of Python 3. Just because I can't say what it would be in
> an imaginary hypothetical implementation doesn't mean I can never say
> anything about it.
>
> Ross Ridge
>

Bytes is bytes and strings is strings
And the wrong one I have chose
Let's go where they keep on wearin'
Those frills and flowers and buttons and bows
Rings and things and buttons and bows.

No guessing the tune.

--
Cheers.

Mark Lawrence.

Neil Cerutti

unread,
Mar 28, 2012, 4:56:49 PM3/28/12
to
On 2012-03-28, Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Evan Driscoll <dris...@cs.wisc.edu> wrote:
>> So yes, you can say that pretending there's not a mapping of
>> strings to internal representation is silly, because there is.
>> However, there's nothing you can say about that mapping.
>
> I'm not the one labeling anything as being silly. I'm the one
> labeling the things as bullshit, and that's what you're doing
> here. I can in fact say what the internal byte string
> representation of strings is any given build of Python 3. Just
> because I can't say what it would be in an imaginary
> hypothetical implementation doesn't mean I can never say
> anything about it.

I am in a similar situation vis-à-vis my wife's undergarments.

--
Neil Cerutti

Terry Reedy

unread,
Mar 28, 2012, 5:37:53 PM3/28/12
to pytho...@python.org
On 3/28/2012 1:43 PM, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x.

I strongly agree with that unless you have reason to use 2.7. Python 3.3
(.0a1 is nearly out) has an improved unicode implementation, among other
things.

> The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You are assuming, or must assume, that the text is in an
ascii-compatible encoding, meaning that bytes 0-127 really represent
ascii chars. Otherwise, you cannot reliably interpret anything, let
alone change it.

This problem of knowing that much but not the specific encoding is
unfortunately common. It has been discussed among core developers and
others the last few months. Different people prefer one of the following
approaches.

1. Keep the bytes as bytes and use bytes literals and bytes functions as
needed. The danger, as you noticed, is forgetting the 'b' prefix.

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
chars. When done, encode back to 'latin-1' and the non-ascii chars will
be as they originally were. The danger is forgetting the pretense, and
perhaps passing on the string (as a string, not bytes) to other
modules that will not know the pretense.

3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
(using the surrogate-pair second-half code units). This is probably the
safest in that invalid operations on the non-chars should raise an
exception. Re-encoding with the same setting will reproduce the original
hi-bit chars. The main danger is passing the illegal strings out of your
local sandbox.
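
The three approaches can be sketched side by side as follows. This is an illustration under assumptions, not a recommendation of one over the others: the sample bytes and variable names are made up, and the error handler is spelled `surrogateescape` (no underscore) in the Python codec machinery.

```python
# Hypothetical input: ASCII text mixed with a byte in an unknown 8-bit encoding.
raw = b"Name: M\xfcller\r\n"

# 1. Stay in bytes: use bytes literals and bytes methods throughout.
via_bytes = raw.replace(b"Name:", b"NAME:")

# 2. Pretend the data is latin-1: every byte maps to exactly one code point,
#    so decoding never fails and encoding restores the original bytes.
as_latin1 = raw.decode("latin-1")
via_latin1 = as_latin1.replace("Name:", "NAME:").encode("latin-1")

# 3. Decode as ASCII with the 'surrogateescape' handler: each non-ASCII byte
#    becomes a lone surrogate (0xFC -> U+DCFC) and is restored on encode.
as_escaped = raw.decode("ascii", errors="surrogateescape")
via_escaped = as_escaped.replace("Name:", "NAME:").encode(
    "ascii", errors="surrogateescape"
)

print(via_bytes == via_latin1 == via_escaped)
```

All three produce the same bytes here; they differ in what happens when the intermediate value leaks into code that does not know about the pretense.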

--
Terry Jan Reedy

Steven D'Aprano

unread,
Mar 28, 2012, 8:02:37 PM3/28/12
to
On Wed, 28 Mar 2012 15:43:31 -0400, Ross Ridge wrote:

> I can in
> fact say what the internal byte string representation of strings is any
> given build of Python 3.

Don't keep us in suspense! Given:

Python 3.2.2 (default, Mar 4 2012, 10:50:33)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2

what *is* the internal byte representation of the string "a∫©πz"?

(lowercase a, integral sign, copyright symbol, lowercase Greek pi,
lowercase z)


And more importantly, given that internal byte representation, what could
you do with it?
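
For comparison, a sketch of what explicit encoding gives for that string. It says nothing about what any interpreter stores internally; it only illustrates that the byte form depends on which encoding is chosen.

```python
s = "a\u222b\u00a9\u03c0z"  # a, integral sign, copyright sign, Greek pi, z

# The same five code points produce different byte sequences per encoding.
encodings = ("utf-8", "utf-16-le", "utf-32-le")
for encoding in encodings:
    encoded = s.encode(encoding)
    print(encoding, len(encoded), encoded)

# Byte counts per encoding, collected for easy comparison.
sizes = {encoding: len(s.encode(encoding)) for encoding in encodings}
```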


--
Steven

Evan Driscoll

unread,
Mar 28, 2012, 8:11:56 PM3/28/12
to Ross Ridge, pytho...@python.org
On 3/28/2012 14:43, Ross Ridge wrote:
> Evan Driscoll <dris...@cs.wisc.edu> wrote:
>> So yes, you can say that pretending there's not a mapping of strings to
>> internal representation is silly, because there is. However, there's
>> nothing you can say about that mapping.
>
> I'm not the one labeling anything as being silly. I'm the one labeling
> the things as bullshit, and that's what you're doing here. I can in
> fact say what the internal byte string representation of strings is any
> given build of Python 3. Just because I can't say what it would be in
> an imaginary hypothetical implementation doesn't mean I can never say
> anything about it.

People like you -- who write to assumptions which are not even remotely
guaranteed by the spec -- are part of the reason software sucks.

People like you hold back progress, because system implementers aren't
free to make changes without breaking backwards compatibility. Enormous
amounts of effort are expended to test programs and diagnose problems
which are caused by unwarranted assumptions like "the encoding of a
string is UTF-8". In the worst case, assumptions like that lead to
security fixes that don't go as far as they could, like the recent
discussion about hashing.

Python is definitely closer to the "willing to break backwards
compatibility to improve" end of the spectrum than some other projects
(*cough* Windows *cough*), but that still doesn't mean that you can make
assumptions like that.


This email is a bit harsher than it deserves -- but I feel not by much.

Evan

Ross Ridge

unread,
Mar 28, 2012, 11:04:08 PM3/28/12
to
Evan Driscoll <dris...@cs.wisc.edu> wrote:
>People like you -- who write to assumptions which are not even remotely
>guaranteed by the spec -- are part of the reason software sucks.
...
>This email is a bit harsher than it deserves -- but I feel not by much.

I don't see how you could feel the least bit justified. Well meaning,
if unhelpful, lies about the nature Python strings in order to try to
convince someone to follow what you think are good programming practices
is one thing. Maliciously lying about someone else's code that you've
never seen is another thing entirely.

Chris Angelico

unread,
Mar 28, 2012, 11:31:59 PM3/28/12
to pytho...@python.org
On Thu, Mar 29, 2012 at 2:04 PM, Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Evan Driscoll  <dris...@cs.wisc.edu> wrote:
>>People like you -- who write to assumptions which are not even remotely
>>guaranteed by the spec -- are part of the reason software sucks.
> ...
>>This email is a bit harsher than it deserves -- but I feel not by much.
>
> I don't see how you could feel the least bit justified.  Well meaning,
> if unhelpful, lies about the nature Python strings in order to try to
> convince someone to follow what you think are good programming practices
> is one thing.  Maliciously lying about someone else's code that you've
> never seen is another thing entirely.

Actually, he is justified. It's one thing to work in C or assembly and
write code that depends on certain bit-pattern representations of data
(although even that causes trouble - assuming that
sizeof(int)==sizeof(int*) isn't good for portability), but in a high
level language, you cannot assume any correlation between objects and
bytes. Any code that depends on implementation details is risky.

ChrisA

Ross Ridge

unread,
Mar 28, 2012, 11:58:53 PM3/28/12
to
Chris Angelico <ros...@gmail.com> wrote:
>Actually, he is justified. It's one thing to work in C or assembly and
>write code that depends on certain bit-pattern representations of data
>(although even that causes trouble - assuming that
>sizeof(int)==sizeof(int*) isn't good for portability), but in a high
>level language, you cannot assume any correlation between objects and
>bytes. Any code that depends on implementation details is risky.

How does that in anyway justify Evan Driscoll maliciously lying about
code he's never seen?

Mark Lawrence

unread,
Mar 29, 2012, 2:01:14 AM3/29/12
to pytho...@python.org
On 29/03/2012 04:58, Ross Ridge wrote:
> Chris Angelico<ros...@gmail.com> wrote:
>> Actually, he is justified. It's one thing to work in C or assembly and
>> write code that depends on certain bit-pattern representations of data
>> (although even that causes trouble - assuming that
>> sizeof(int)==sizeof(int*) isn't good for portability), but in a high
>> level language, you cannot assume any correlation between objects and
>> bytes. Any code that depends on implementation details is risky.
>
> How does that in anyway justify Evan Driscoll maliciously lying about
> code he's never seen?
>
> Ross Ridge
>

We appear to have a case of "would you stand up please, your voice is
rather muffled". I can hear all the *plonks* from miles away.

--
Cheers.

Mark Lawrence.

Steven D'Aprano

unread,
Mar 29, 2012, 2:51:51 AM3/29/12
to
On Wed, 28 Mar 2012 23:58:53 -0400, Ross Ridge wrote:

> How does that in anyway justify Evan Driscoll maliciously lying about
> code he's never seen?

You are perfectly justified to complain about Evan making sweeping
generalisations about your code when he has not seen it; you are NOT
justified in making your own sweeping generalisations that he is not just
lying but *maliciously* lying. He might be just confused by the strength
of his emotions and so making an honest mistake. Or he might have guessed
perfectly accurately about your code, and you are the one being
dishonest. Who knows?

Evan's impassioned rant is based on his estimate of your mindset, namely
that you are the sort of developer who writes code making assumptions
about implementation details even when explicitly told not to by the
library authors. I have no idea whether Evan's estimate is right or not,
but I don't think it is justified based on the little amount we've seen
of you.

Your reaction is to make an equally unjustified estimate of Evan's
mindset, namely that he is not just wrong about you, but *deliberately
and maliciously* lying about you in the full knowledge that he is wrong.
If anything, I would say that you have less justification for calling
Evan a malicious liar than he has for calling you the sort of person who
would write to an implementation instead of an interface.


--
Steven

Peter Daum

unread,
Mar 29, 2012, 10:57:19 AM3/29/12
to
On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, in the beginning of my quest, I ran into a decoding
exception trying to read data as "latin1" (which was more or less what
I had expected anyway because byte values between 128 and 160 are not
defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

l=[i for i in range(256)]; b=bytes(l)
s=b.decode('latin1'); b=s.encode('latin1'); s=b.decode('latin1')
for c in s:
    print(hex(ord(c)), end=' ')
    if (ord(c)+1) % 16 ==0: print("")
print()

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems, I ran
into then were pretty real, though ;-)

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.

Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in the
documentation are not really helpful, because the non-decodable bytes will
be lost. With some trying, I got it to work, too (the option is named
"surrogateescape" without the "_" and in python 3.1 it exists, but
not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)
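
A minimal sketch of that call and its round trip, using made-up sample bytes. The handler is passed positionally, as in the call quoted above.

```python
raw = b"abc\xff\xfe"  # hypothetical data; 0xFF and 0xFE are never valid in UTF-8

# Positional form of the error handler.
s = raw.decode("utf-8", "surrogateescape")

# Undecodable bytes show up as lone surrogates in the range U+DC80..U+DCFF.
escaped = [hex(ord(c)) for c in s[3:]]

# Encoding with the same handler restores the original bytes ...
restored = s.encode("utf-8", "surrogateescape")

# ... while a strict encode refuses the lone surrogates.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(escaped, restored == raw, strict_ok)
```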

Thank you very much for your constructive advice!

Regards,
Peter

Ross Ridge

unread,
Mar 29, 2012, 11:30:19 AM3/29/12
to
Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>Your reaction is to make an equally unjustified estimate of Evan's
>mindset, namely that he is not just wrong about you, but *deliberately
>and maliciously* lying about you in the full knowledge that he is wrong.

No, Evan in his own words admitted that his post was meant to be harsh,
"a bit harsher than it deserves", showing his malicious intent. He made
accusations that were neither supported by anything I've said in this
thread nor by the code I actually write. His accusations about me were
completely made up, he was not telling the truth and had no reasonable
basis to believe he was telling the truth. He was maliciously lying and
I'm completely justified in saying so.

Just to make it clear to all you zealots. I've not once advocated writing
any sort of "risky code" in this thread. I have not once advocated writing
any style of code in this thread. Just because I refuse to drink the "it's
impossible to represent strings as a series of bytes" kool-aid doesn't mean
that I'm a heretic that must oppose everything you believe in.

Evan Driscoll

unread,
Mar 29, 2012, 12:31:23 PM3/29/12
to Ross Ridge, pytho...@python.org
Ross Ridge wrote:
> Evan Driscoll<dris...@cs.wisc.edu> wrote:
>> People like you -- who write to assumptions which are not even remotely
>> guaranteed by the spec -- are part of the reason software sucks.
> ...
>> This email is a bit harsher than it deserves -- but I feel not by much.
>
> I don't see how you could feel the least bit justified. Well meaning,
> if unhelpful, lies about the nature Python strings in order to try to
> convince someone to follow what you think are good programming practices
> is one thing. Maliciously lying about someone else's code that you've
> never seen is another thing entirely.

I'm not even talking about code that you or the OP has written. I'm
talking about your suggestion that

I can in fact say what the internal byte string representation
of strings is any given build of Python 3.

Aside from the questionable truth of this assertion (there's no
guarantee that an implementation uses one encoding or data structure
representation consistently), that's of no consequence because
you can't depend on what the representation is. So why even bring it up?

Also irrelevant is:

In practice the number of ways that CPython (the only Python 3
implementation) represents strings is much more limited.
Pretending otherwise really isn't helpful.

If you can't depend on CPython's implementation (and, I would argue,
your code is broken if you do), then it *is* helpful. Saying that "you
can just look at what CPython does" is what is unhelpful.


That said, looking again I did misread your post that I sent that harsh
reply to; I was looking at it perhaps a bit too much through the lens of
the CPython comment I said above, and interpreting it as "I can say what
the internal representation is of CPython, so just give me that" and
launched into my spiel. If that's not what was intended, I retract my
statement. As long as everyone is clear on the fact that Python 3
implementations can use whatever encoding and data structures they want,
perhaps even different encodings or data structures for equal strings,
and that as a consequence saying "what's the internal representation of
this string" is a meaningless question as far as Python itself is
concerned, I'm happy.

Evan

Terry Reedy

unread,
Mar 29, 2012, 12:49:19 PM3/29/12
to pytho...@python.org
On 3/29/2012 11:30 AM, Ross Ridge wrote:

> No, Evan in his own words admitted that his post was meant to be harsh,

I agree that he should have restrained and censored his writing.

> Just because I refuse to drink the
> "it's impossible to represent strings as a series of bytes" kool-aid

I do not believe *anyone* has made that claim. Is this meant to be a
wild exaggeration? As wild as Evan's?

In my first post on this thread, I made three truthful claims.

1. A 3.x text string is logically a sequence of unicode 'characters'
(codepoints).

2. The Python language definition does not require that a string be
bytes or become bytes unless and until it is explicitly encoded.

3. The intentionally hidden byte implementation of strings on byte
machines is version and system dependent. The bytes used for a
particular character are (in 3.3) context dependent.

As it turns out, the OP had mistakenly assumed that the hidden byte
implementation of 3.3 strings was both well-defined and something
(utf-8) that it is not and (almost certainly) never will be. Guido and
most other devs strongly want string indexing (and hence slice endpoint
finding) to be O(1).

So all of the above is moot as far as the OP's problem is concerned. I
already gave him the three standard solutions.

--
Terry Jan Reedy

Prasad, Ramit

unread,
Mar 29, 2012, 1:36:34 PM3/29/12
to pytho...@python.org
> > Technically, ASCII goes up to 256 but they are not A-z letters.
> >
> Technically, ASCII is 7-bit, so it goes up to 127.

> No, ASCII only defines 0-127. Values >=128 are not ASCII.
>
> From https://en.wikipedia.org/wiki/ASCII:
>
> ASCII includes definitions for 128 characters: 33 are non-printing
> control characters (now mostly obsolete) that affect how text and
> space is processed and 95 printable characters, including the space
> (which is considered an invisible graphic).


Doh! I was mistaking extended ASCII for ASCII. Thanks for the
correction.

Ramit

Ross Ridge

unread,
Mar 29, 2012, 2:00:42 PM3/29/12
to
Ross Ridge wrote:
> Just because I refuse to drink the
> "it's impossible to represent strings as a series of bytes" kool-aid

Terry Reedy <tjr...@udel.edu> wrote:
>I do not believe *anyone* has made that claim. Is this meant to be a
>wild exaggeration? As wild as Evan's?

Sorry, it would've been more accurate to label the flavour of kool-aid
Chris Angelico was trying to push as "it's impossible ... without
encoding":

What is a string? It's not a series of bytes. You can't convert
it without encoding those characters into bytes in some way.

>In my first post on this thread, I made three truthful claims.

I'm not objecting to every post made in this thread. If your post had
been made before the original poster had figured it out on his own,
I would've hoped he would have found it much more convincing than what
I quoted above.

Chris Angelico

unread,
Mar 29, 2012, 4:41:31 PM3/29/12
to pytho...@python.org
On Fri, Mar 30, 2012 at 5:00 AM, Ross Ridge <rri...@csclub.uwaterloo.ca> wrote:
> Sorry, it would've been more accurate to label the flavour of kool-aid
> Chris Angelico was trying to push as "it's impossible ... without
> encoding":
>
>        What is a string? It's not a series of bytes. You can't convert
>        it without encoding those characters into bytes in some way.

I still stand by that statement. Do you try to convert a "dictionary
of filename to open file object" into a "series of bytes" inside
Python? It doesn't matter that, on some level, it's *stored as* a
series of bytes; the actual object *is not* a series of bytes. There
is no logical equivalency, ergo it is illogical and nonsensical to
expect to turn one into the other without some form of encoding.
Python does include an encoding that can handle lists and
dictionaries. It's called Pickle, and it returns (in Python 3) a bytes
object - which IS a series of bytes. It doesn't simply return some
internal representation.
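
A short sketch of that point, using a plain dictionary of made-up values (open file objects, as in the example above, cannot be pickled). The names are only illustrative.

```python
import pickle

# A dictionary is not a series of bytes; pickle defines one possible encoding.
original = {"name": "spam", "counts": [1, 2, 3]}
blob = pickle.dumps(original)

# The result is a real bytes object that round-trips back to an equal dict.
restored = pickle.loads(blob)
print(type(blob).__name__, restored == original)
```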

ChrisA

Steven D'Aprano

unread,
Mar 29, 2012, 9:10:35 PM3/29/12
to
On Thu, 29 Mar 2012 17:36:34 +0000, Prasad, Ramit wrote:

>> > Technically, ASCII goes up to 256 but they are not A-z letters.
>> >
>> Technically, ASCII is 7-bit, so it goes up to 127.
>
>> No, ASCII only defines 0-127. Values >=128 are not ASCII.
>>
>> From https://en.wikipedia.org/wiki/ASCII:
>>
>> ASCII includes definitions for 128 characters: 33 are non-printing
>> control characters (now mostly obsolete) that affect how text and
>> space is processed and 95 printable characters, including the space
>> (which is considered an invisible graphic).
>
>
> Doh! I was mistaking extended ASCII for ASCII. Thanks for the
> correction.

There actually is no such thing as "extended ASCII" -- there is a whole
series of many different "extended ASCIIs". If you look at the encodings
available in (for example) Thunderbird, many of the ISO-8859-* and
Windows-* encodings are "extended ASCII" in the sense that they extend
ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a
different way (hence they are different encodings).
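
As one concrete illustration, here is a sketch that decodes a single byte value with two of the ISO-8859 codecs that ship with Python. The choice of byte 0xA4 is just an example.

```python
byte = b"\xa4"

# The same byte value names different characters in different "extended ASCIIs".
as_latin1 = byte.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = byte.decode("iso-8859-15")   # U+20AC EURO SIGN

# The ASCII range itself is shared by both encodings.
shared = b"abc".decode("iso-8859-1") == b"abc".decode("iso-8859-15")

print(hex(ord(as_latin1)), hex(ord(as_latin9)), shared)
```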


--
Steven

Steven D'Aprano

unread,
Mar 29, 2012, 9:16:22 PM3/29/12
to
On Thu, 29 Mar 2012 11:30:19 -0400, Ross Ridge wrote:

> Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>>Your reaction is to make an equally unjustified estimate of Evan's
>>mindset, namely that he is not just wrong about you, but *deliberately
>>and maliciously* lying about you in the full knowledge that he is wrong.
>
> No, Evan in his own words admitted that his post was meant to be harsh,
> "a bit harsher than it deserves", showing his malicious intent.

Being harsher than it deserves is not synonymous with malicious. You are
making assumptions about Evan's mental state that are not supported by
the evidence. Evan may believe that by "punishing" (for some feeble sense
of punishment) you harshly, he is teaching you better behaviour that will
be to your own benefit; or that it will act as a warning to others.
Either way he may believe that he is actually doing good.

And then he entirely undermined his own actions by admitting that he was
over-reacting. This suggests that, in fact, he wasn't really motivated by
either malice or beneficence but mere frustration.

It is quite clear that Evan let his passions about writing maintainable
code get the best of him. His rant was more about "people like you" than
you personally.

Evan, if you're reading this, I think you owe Ross an apology for flying
off the handle. Ross, I think you owe Evan an apology for unjustified
accusations of malice.


> He made
> accusations that were neither supported by anything I've said

Now that is not actually true. Your posts have defended the idea that
copying the raw internal byte representation of strings is a reasonable
thing to do. You even claimed to know how to do so, for any version of
Python (but so far have ignored my request for you to demonstrate).


> in this
> thread nor by the code I actually write. His accusations about me were
> completely made up, he was not telling the truth and had no reasonable
> basis to believe he was telling the truth. He was maliciously lying and
> I'm completely justified in saying so.

No, they were not completely made up. Your posts give many signs of being
somebody who might very well write code to the implementation rather than
the interface. Whether you are or not is a separate question, but your
posts in this thread indicate that you very likely could be.

If this is not the impression you want to give, then you should
reconsider your posting style.

Ross, to be frank, your posting style in this thread has been cowardly
and pedantic, an obnoxious combination. Please take this as constructive
criticism and not an attack -- you have alienated people in this thread,
leading at least one person to publicly kill-file your future posts. I
choose to assume you aren't aware of why that is, rather than that you
are doing so deliberately.

Without actually coming out and making a clear, explicit statement that
you approve or disapprove of the OP's attempt to use implementation
details, you *imply* support without explicitly giving it; you criticise
others for saying it can't be done without demonstrating that it can be
done. If this is a deliberate rhetorical trick, then shame on you for
being a coward without the conviction to stand behind concrete
expressions of your opinion. If not, then you should be aware that you
are using a rhetorical style that will make many people predisposed to
think you are a twat.

You *might* have said

Guys, you're technically wrong about this. This is how you can
retrieve the internal representation of a string as a sequence
of bytes: ...code... but you shouldn't use this in production
code because it is fragile and depends on implementation details
that may break in PyPy and Jython and IronPython.

But you didn't.

You *might* have said

Wrong, you can convert a string into a sequence of bytes without
encoding or decoding: ...code... but don't do this.

But you didn't.

Instead you puffed yourself up as a big shot who was more technically
correct than everyone else, but without *actually* demonstrating that you
can do what you said you can do. You labelled as "bullshit" our attempts
to discourage the OP from his misguided approach.

If your intention was to put people off-side, you succeeded very well. If
not, you should be aware that you have, and consider how you might avoid
this in the future.



--
Steven

Michael Ströder

Mar 30, 2012, 3:04:49 AM
Yupp.

Looking at RFC 1345 some years ago (while having to deal with EBCDIC) made
this all pretty clear to me. I appreciate that someone did this heavy work of
collecting historical encodings.

Ciao, Michael.

Serhiy Storchaka

Mar 30, 2012, 3:06:45 PM
to pytho...@python.org
28.03.12 21:13, Heiko Wundram wrote:
> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.

In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw
for access to the byte stream. And reading from a file opened in text mode
gets you a string too.
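
A minimal sketch of the distinction, using in-memory streams so it runs without touching the real stdin. The payload and the UTF-8 encoding are assumptions made for the example. In CPython, sys.stdin is itself a text wrapper of this kind, layered over the binary stream available as sys.stdin.buffer.

```python
import io

# The bytes as they would arrive from a file or pipe (UTF-8 assumed here).
payload = "na\u00efve\n".encode("utf-8")

# Binary layer: reading yields bytes, nothing is decoded.
binary_stream = io.BytesIO(payload)
raw_line = binary_stream.readline()

# Text layer: the same bytes, wrapped with an explicitly stated encoding.
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")
text_line = text_stream.readline()

print(type(raw_line), raw_line)
print(type(text_line), repr(text_line))
```

The text layer only produces str because it was told which encoding to apply; the binary layer hands the data back untouched.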

Chris Angelico

Mar 30, 2012, 3:10:03 PM
to pytho...@python.org
True. But that's only if it's been told the encoding of stdin (which I
believe is the normal case on Linux). It's still not "automagically
guess(ing)", it's explicitly told.

ChrisA

Piet van Oostrum

Aug 29, 2012, 7:27:10 PM
Ross Ridge <rri...@csclub.uwaterloo.ca> writes:

>
> But it is in fact only stored in one particular way, as a series of bytes.
>
No, it can be stored in different ways. Certainly in Python 3.3 and
beyond. And in 3.2 also, depending on wide/narrow build.
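
A rough way to observe this, assuming CPython 3.3 or later (where the storage width per character depends on the widest code point in the string), is to compare the memory footprint of strings of equal length. The figures returned by sys.getsizeof are an implementation detail and may differ on other interpreters or versions; the sample strings are arbitrary.

```python
import sys

# Three strings of identical length (100 code points each).
samples = {
    "ASCII only": "a" * 100,
    "BMP (euro sign)": "\u20ac" * 100,
    "non-BMP (emoji)": "\U0001F600" * 100,
}

# Memory footprint as reported by the interpreter (implementation-specific).
sizes = {label: sys.getsizeof(text) for label, text in samples.items()}

for label, text in samples.items():
    print(f"{label:16} len={len(text)} bytes_in_memory={sizes[label]}")
```

The lengths are equal while the in-memory sizes are not, so there is no single "raw byte" form of a str to copy.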
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]

Piet van Oostrum

Aug 29, 2012, 7:39:15 PM
Heiko Wundram <mode...@modelnine.org> writes:

> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.
>
Huh?

Python 3.3.0rc1 (v3.3.0rc1:8bb5c7bc46ba, Aug 25 2012, 10:09:29)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = input()
abcd123
>>> x
'abcd123'
>>> type(x)
<class 'str'>

>>> import sys
>>> y = sys.stdin.readline()
abcd123
>>> y
'abcd123\n'
>>> type(y)
<class 'str'>

Nobody

Aug 30, 2012, 1:51:11 AM
On Wed, 29 Aug 2012 19:39:15 -0400, Piet van Oostrum wrote:

>> Reading from stdin/a file gets you bytes, and not a string, because
>> Python cannot automagically guess what format the input is in.
>>
> Huh?

Oh, it can certainly guess (in the absence of any other information, it
uses the current locale). Whether or not that guess is correct is a
different matter.

Realistically, if you want sensible behaviour from Python 3.x, you need
to use an ISO-8859-1 locale. That ensures that conversion between str and
bytes will never fail, and an str-bytes-str or bytes-str-bytes round-trip
will pass data through unmangled.
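
The bytes-str-bytes round trip can be checked directly. This is only a sketch using the codec by name rather than an actual locale setting. It also shows that the str-to-bytes direction succeeds only for code points up to U+00FF; the euro sign is used as an arbitrary character beyond that range.

```python
# Every possible byte value, decoded as ISO-8859-1 and encoded back.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

# ISO-8859-1 maps byte N to code point N, so nothing is lost.
code_points = [ord(char) for char in as_text]

# Characters above U+00FF have no ISO-8859-1 representation.
try:
    "\u20ac".encode("iso-8859-1")
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False

print(round_tripped == all_bytes, euro_encodable)
```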
