Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Giampaolo Rodola'

Jan 16, 2009, 7:47:52 PM

Hi,
I'm sure the message I'm going to write will seem quite dumb to most
people but I really don't understand the str/bytes/unicode
differences introduced in Python 3.0 so be patient.
What I'm trying to do is port pyftpdlib to Python 3.x.
I don't want to support Unicode. I don't want pyftpdlib for py 3k to
do anything new or different.
I just want it to behave exactly the same as in the 2.x version and
I'd like to know if that's possible with Python 3.x.

Now. The basic difference is that socket.recv() returns a bytes object
instead of a string object and that's the thing which confuses me
mainly.
My question is: is there a way to convert that bytes object into
exactly *the same thing* returned by socket.recv() in Python 2.x (a
string)?

I know I can do:

data = socket.recv(1024)
data = data.decode(encoding)

...to convert bytes into a string but that's not exactly the same
thing.
In Python 2.x I didn't have to care about the encoding. What
socket.recv() returned was just a string. That was all.
Now doing something like b''.decode(encoding) puts me in serious
trouble since that can raise an exception in case client and server
use a different encoding.

As far as I understand, the basic difference is that a
Python 2.x based FTP server could handle a 3.x based FTP client using
"latin1" encoding or "utf-8" or anything else, while with Python 3.x
I'm forced to tell my server which encoding to use and I don't know
how to deal with that.
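
A minimal sketch of the concern above, assuming Python 3 and a made-up payload. Decoding with a codec the sender did not use can raise, whereas latin-1 maps every byte value to a character, so a latin-1 decode cannot fail and encodes back to the same bytes. Whether that is the right choice for an FTP server is a separate question; the variable names are illustrative only.

```python
# Hypothetical payload: "café" as a latin-1 client might send it.
data = b"caf\xe9"

# Decoding with the wrong codec can raise.
try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# latin-1 assigns a character to each of the 256 byte values,
# so this decode cannot fail and encoding back is lossless.
text = data.decode("latin-1")
round_trip = text.encode("latin-1")
```

The round trip is the closest text-based analogue of the 2.x behaviour, since no byte is ever rejected or altered.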

--- Giampaolo
http://code.google.com/p/pyftpdlib

Christian Heimes

Jan 16, 2009, 8:10:42 PM

to pytho...@python.org

Giampaolo Rodola' wrote:

> Now. The basic difference is that socket.recv() returns a bytes object
> instead of a string object and that's the thing which confuses me
> mainly.
> My question is: is there a way to convert that bytes object into
> exactly *the same thing* returned by socket.recv() in Python 2.x (a
> string)?

Python 3.0's bytes type is almost the same type as Python 2.x's str
type. During the development of Python 3.0 the old str type was modified
and renamed to bytes. The old unicode type is now known as str.

2.x        ->  3.0
---------------------
str        ->  bytes
unicode    ->  str
""         ->  b""
u""        ->  ""
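
To illustrate "almost the same", here is a short sketch assuming Python 3 and a made-up command line. Most of the familiar string methods are available on bytes, but indexing behaves differently from the 2.x str type.

```python
raw = b"USER anonymous\r\n"

# Many familiar 2.x str methods also exist on bytes in Python 3.
parts = raw.split()                      # splits on ASCII whitespace
is_user = raw.upper().startswith(b"USER")

# One visible difference: indexing gives an int, slicing gives bytes.
first_item = raw[0]
first_slice = raw[:1]
```

Code that indexes single characters is therefore a likely place for porting surprises, while code that only slices, splits and compares tends to carry over.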

HTH

Christian

MRAB

Jan 16, 2009, 8:24:04 PM

to pytho...@python.org

Originally Python had a single string type 'str' with 8 bits per
character. That was a bit limiting for international use. Then a new
string type 'unicode' was introduced.

Now, in Python 3.x, it's time to tidy things up.

The 'str' type has been renamed 'bytes' and the 'unicode' type has been
renamed 'str'. If you're truly working with strings of _characters_ then
'str' is what you need, but if you're working with strings of _bytes_
then 'bytes' is what you need.

socket.send() and socket.recv() are still the same, it's just that it's
now clearer that they work with bytes and not strings.
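
A small sketch of that point, assuming Python 3. It uses socket.socketpair() only to get two connected sockets without a network; the command text is made up.

```python
import socket

# socketpair() returns two already-connected sockets.
left, right = socket.socketpair()
try:
    left.sendall(b"NOOP\r\n")      # bytes go in...
    received = right.recv(1024)    # ...and bytes come out

    # Passing text instead of bytes is rejected in Python 3.
    try:
        left.sendall("NOOP\r\n")
        str_rejected = False
    except TypeError:
        str_rejected = True
finally:
    left.close()
    right.close()
```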

Giampaolo Rodola'

Jan 16, 2009, 8:32:17 PM

On 17 Jan, 02:24, MRAB <goo...@mrabarnett.plus.com> wrote:

> If you're truly working with strings of _characters_ then
> 'str' is what you need, but if you're working with strings of _bytes_
> then 'bytes' is what you need.

I work with string of characters but to convert bytes into string I
need to specify an encoding and that's what confuses me.
Before there was no need to deal with that.

--- Giampaolo
http://code.google.com/p/pyftpdlib

Christian Heimes

Jan 16, 2009, 8:42:11 PM

to pytho...@python.org

Giampaolo Rodola' wrote:

> I work with string of characters but to convert bytes into string I
> need to specify an encoding and that's what confuses me.
> Before there was no need to deal with that.

Why do you have to deal with unicode data? IIRC FTP uses ASCII-only text
so you can stick to bytes everywhere. If you didn't have to worry about
encoding and unicode in Python 2.x then you should use bytes all over
the place, too.
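
As a rough sketch of what "bytes everywhere" could look like on Python 3: the helper name parse_command and the reply text are hypothetical and not taken from pyftpdlib. The argument is never decoded, so bytes outside ASCII pass through untouched.

```python
def parse_command(line):
    """Split a raw FTP command line into (verb, argument), both bytes."""
    verb, _, arg = line.rstrip(b"\r\n").partition(b" ")
    # bytes.upper() only changes ASCII letters, which is all a verb contains.
    return verb.upper(), arg

# The file name contains non-ASCII bytes in an unknown encoding.
cmd, arg = parse_command(b"retr \xe9t\xe9.txt\r\n")
reply = b"150 Opening data connection for " + arg + b"\r\n"
```

The cost is that every literal combined with these values has to be a bytes literal as well.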

Christian

Steven D'Aprano

Jan 16, 2009, 9:09:13 PM

On Fri, 16 Jan 2009 17:32:17 -0800, Giampaolo Rodola' wrote:

> On 17 Jan, 02:24, MRAB <goo...@mrabarnett.plus.com> wrote:
>
>> If you're truly working with strings of _characters_ then 'str' is what
>> you need, but if you're working with strings of _bytes_ then 'bytes' is
>> what you need.
>
> I work with string of characters but to convert bytes into string I need
> to specify an encoding and that's what confuses me. Before there was no
> need to deal with that.

In Python 2.x, str means "string of bytes". This has been renamed "bytes"
in Python 3.

In Python 2.x, unicode means "string of characters". This has been
renamed "str" in Python 3.

If you do this in Python 2.x:

my_string = str(bytes_from_socket)

then you don't need to convert anything, because you are going from a
string of bytes to a string of bytes.

If you do this in Python 3:

my_string = str(bytes_from_socket)

then you *do* have to convert, because you are going from a string of
bytes to a string of characters (unicode). The Python 2.x equivalent code
would be:

my_string = unicode(bytes_from_socket)

and when you convert to unicode, you can get encoding errors. A better
way to do this would be some variation on:

my_str = bytes_from_socket.decode('utf-8')
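
One caveat, sketched for Python 3 with made-up data: calling str() on a bytes object without an encoding does not decode it; it returns the repr, b prefix included. Passing an encoding, or calling .decode(), performs the actual conversion, and the errors argument controls what happens to undecodable bytes.

```python
bytes_from_socket = b"caf\xc3\xa9"   # hypothetical data: "café" in UTF-8

as_repr = str(bytes_from_socket)              # no decoding happens here
decoded = bytes_from_socket.decode("utf-8")   # explicit decoding
same = str(bytes_from_socket, "utf-8")        # equivalent spelling

# errors="replace" substitutes U+FFFD instead of raising.
lenient = b"caf\xe9".decode("utf-8", errors="replace")
```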

You should read this:

http://www.joelonsoftware.com/articles/Unicode.html

--
Steven

Giampaolo Rodola'

Jan 16, 2009, 9:34:25 PM

On 17 Jan, 03:09, Steven D'Aprano <st...@REMOVE-THIS-

Thanks, that clarifies a bit even if I still have a lot of doubts.
I wish I could do:

my_str = bytes_from_socket.decode('utf-8')

That would let me avoid replacing "" with b"" almost everywhere in
my code but I doubt it would actually be a good idea.
RFC-2640 states that UTF-8 is the preferable encoding to use for both
clients and servers but I see that Python 3.x's ftplib uses latin1,
for example (bug?). How is my server supposed to deal with that?
I think that using bytes everywhere, as Christian recommended, would
be the only way to behave exactly like the 2.x version, but that's not
easy at all.
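
If decoding were chosen instead, one possible heuristic is to try UTF-8 first and fall back to a single-byte codec. This is only a sketch: decode_ftp_line is a hypothetical helper, the fallback codec is an assumption, and latin-1 data that happens to be valid UTF-8 would be misread.

```python
def decode_ftp_line(raw, fallback="latin-1"):
    """Hypothetical helper: prefer UTF-8, fall back to a single-byte codec."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback

name = "RETR \u00e9t\u00e9.txt"
utf8_result = decode_ftp_line(name.encode("utf-8"))
latin_result = decode_ftp_line(name.encode("latin-1"))
```

Returning the codec that succeeded lets the caller encode the reply the same way.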

--- Giampaolo
http://code.google.com/p/pyftpdlib

Steve Holden

Jan 16, 2009, 9:40:11 PM

to pytho...@python.org

Giampaolo Rodola' wrote:
> On 17 Jan, 02:24, MRAB <goo...@mrabarnett.plus.com> wrote:
>
>> If you're truly working with strings of _characters_ then
>> 'str' is what you need, but if you're working with strings of _bytes_
>> then 'bytes' is what you need.
>
> I work with string of characters but to convert bytes into string I
> need to specify an encoding and that's what confuses me.
> Before there was no need to deal with that.
>

I don't yet understand why you feel you have to convert what you receive
to a string. In Python 3.0 bytes is the same as a string in 2.6, for
most practical purposes.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Giampaolo Rodola'

Jan 16, 2009, 9:54:51 PM

On 17 Jan, 03:40, Steve Holden <st...@holdenweb.com> wrote:
> Giampaolo Rodola' wrote:
> > On 17 Jan, 02:24, MRAB <goo...@mrabarnett.plus.com> wrote:
>
> >> If you're truly working with strings of _characters_ then
> >> 'str' is what you need, but if you're working with strings of _bytes_
> >> then 'bytes' is what you need.
>
> > I work with string of characters but to convert bytes into string I
> > need to specify an encoding and that's what confuses me.
> > Before there was no need to deal with that.
>
> I don't yet understand why you feel you have to convert what you receive
> to a string. In Python 3.0 bytes is the same as a string in 2.6, for
> most practical purposes.
>
> regards
> Steve

That would help to avoid replacing "" with b"" almost everywhere in my
code.

--- Giampaolo
http://code.google.com/p/pyftpdlib

Terry Reedy

Jan 16, 2009, 10:43:31 PM

to pytho...@python.org

Giampaolo Rodola' wrote:

> That would help to avoid replacing "" with b"" almost everywhere in my
> code.

Won't 2to3 do that for you?

Giampaolo Rodola'

Jan 16, 2009, 10:51:33 PM

I used 2to3 against my code but it didn't cover the "" -> b""
conversion (and I doubt it is able to do so, anyway).

--- Giampaolo
http://code.google.com/p/pyftpdlib

Steve Holden

Jan 16, 2009, 11:08:26 PM

to pytho...@python.org

Giampaolo Rodola' wrote:
> On 17 Jan, 04:43, Terry Reedy <tjre...@udel.edu> wrote:
>> Giampaolo Rodola' wrote:
>>> That would help to avoid replacing "" with b"" almost everywhere in my
>>> code.
>> Won't 2to3 do that for you?
>
> I used 2to3 against my code but it didn't cover the "" -> b""
> conversion (and I doubt it is able to do so, anyway).
>

Note that if you are using 2.6 you should first convert your "" quotes
to b"" - this won't make any practical difference, but then you will be
able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.
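
A small check of the "no practical difference" claim, written so that it should run unchanged on 2.6+ and on 3.x. The names PY2 and marker are illustrative only.

```python
import sys

PY2 = sys.version_info[0] == 2

marker = b"\r\n"   # the b prefix is accepted by 2.6+ and by 3.x

# On 2.6/2.7 the literal is an ordinary str; on 3.x it is a bytes
# object, which is a different type from str.
literal_is_str = isinstance(marker, str)
```

literal_is_str comes out equal to PY2 on either version, which is why marking the literals first costs nothing on 2.6.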

John Machin

Jan 16, 2009, 11:26:31 PM

On Jan 17, 3:08 pm, Steve Holden <st...@holdenweb.com> wrote:
> Giampaolo Rodola' wrote:
> > On 17 Jan, 04:43, Terry Reedy <tjre...@udel.edu> wrote:
> >> Giampaolo Rodola' wrote:
> >>> That would help to avoid replacing "" with b"" almost everywhere in my
> >>> code.
> >> Won't 2to3 do that for you?
>
> > I used 2to3 against my code but it didn't cover the "" -> b""
> > conversion (and I doubt it is able to do so, anyway).
>
> Note that if you are using 2.6 you should first convert your "" quotes
> to b"" - this won't make any practical difference, but then you will be
> able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.

Perhaps before we get too far down the track of telling the OP what he
should do, we should ask him a little about his intentions:

Is he porting to 3.0 and abandoning 2.x support completely?
[presumably unlikely]
So then what is the earliest 2.x that he wants to support at the same
time as 3.x? [presumably at least 2.5]
Does he intend to maintain two separate codebases, one 2.x and the
other 3.x?
Else does he intend to maintain just one codebase written in some 2.x
dialect and using 2to3 plus sys.version-dependent code for the things
that 2to3 can't/doesn't handle?

Cheers,
John

Giampaolo Rodola'

Jan 17, 2009, 8:58:35 AM

On 17 Jan, 05:26, John Machin <sjmac...@lexicon.net> wrote:
> On Jan 17, 3:08 pm, Steve Holden <st...@holdenweb.com> wrote:
>
> > Giampaolo Rodola' wrote:
> > > On 17 Jan, 04:43, Terry Reedy <tjre...@udel.edu> wrote:
> > >> Giampaolo Rodola' wrote:
> > >>> That would help to avoid replacing "" with b"" almost everywhere in my
> > >>> code.
> > >> Won't 2to3 do that for you?
>
> > > I used 2to3 against my code but it didn't cover the "" -> b""
> > > conversion (and I doubt it is able to do so, anyway).
>
> > Note that if you are using 2.6 you should first convert your "" quotes
> > to b"" - this won't make any practical difference, but then you will be
> > able to run 2to3 on your code and (one hopes) convert for 3.0 automatically.
>
> Perhaps before we get too far down the track of telling the OP what he
> should do, we should ask him a little about his intentions:
>
> Is he porting to 3.0 and abandoning 2.x support completely?
> [presumably unlikely]

No.

> So then what is the earliest 2.x that he wants to support at the same
> time as 3.x? [presumably at least 2.5]

I currently support Python versions from 2.3 to 2.6 using a single
codebase.
My idea is to support 3.x starting from the last upcoming release.

> Does he intend to maintain two separate codebases, one 2.x and the
> other 3.x?

I think I have no other choice.
Why? Is it theoretically possible to maintain a single code base for
both 2.x and 3.x?

> Else does he intend to maintain just one codebase written in some 2.x
> dialect and using 2to3 plus sys.version-dependent code for the things
> that 2to3 can't/doesn't handle?

I don't think it would be worth the effort.

> Cheers,
> John

Thanks a lot

--- Giampaolo
http://code.google.com/p/pyftpdlib

"Martin v. Löwis"

Jan 17, 2009, 9:24:11 AM

>> Does he intend to maintain two separate codebases, one 2.x and the
>> other 3.x?
>
> I think I have no other choice.
> Why? Is theoretically possible to maintain an unique code base for
> both 2.x and 3.x?

That is certainly possible! One might have to make tradeoffs wrt.
readability sometimes, but I found that this approach works quite
well for Django. I think Mark Hammond is also working on maintaining
a single code base for both 2.x and 3.x, for PythonWin.

Regards,
Martin

Terry Reedy

Jan 17, 2009, 5:10:53 PM

to pytho...@python.org

Where 'single codebase' means that the code runs as is in 2.x and as
autoconverted by 2to3 (or possibly a custom converter) in 3.x.

One barrier to doing this is when the 2.x code has a mix of string
literals with some being character strings that should not have 'b'
prepended and some being true byte strings that should have 'b'
prepended. (Many programs do not have such a mix.)

One approach to dealing with string constants I have not yet seen
discussed here is to put them all in separate file(s) to be imported.
Group the text and bytes separately. Then marking the bytes with a 'b',
either by hand or program would be easy.

tjr

John Machin

Jan 17, 2009, 8:37:40 PM

(1) How would this work for somebody who wanted/needed to support 2.5
and earlier?

(2) Assuming supporting only 2.6 and 3.x:

Suppose you have this line:
    if binary_data[:4] == "PK\x03\x04": # signature of ZIP file

Plan A:
Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the bytes section of the separate file:
    ZIPFILE_SIG = "PK\x03\x04"
[somewhat later]
Change the above to:
    ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
    from separatefile import *

Plan B:
Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
Add this to the separate file:
    ZIPFILE_SIG = b"PK\x03\x04"
[once per original file]
Add near the top:
    from separatefile import *

Plan C:
Change original to:
    if binary_data[:4] == b"PK\3\4": # signature of ZIP file

Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
variety thereof -- what advantages do you see in this over Plan C?
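
For reference, a sketch of how the comparison behaves on Python 3, using made-up data. It shows why the literal has to be marked one way or another, and that the octal escapes in Plan C denote the same bytes as the hex escapes in the original line.

```python
binary_data = b"PK\x03\x04" + b"\x14\x00"   # made-up start of a ZIP file

plan_c = binary_data[:4] == b"PK\3\4"        # bytes compared with bytes
same_literal = b"PK\3\4" == b"PK\x03\x04"    # octal and hex escapes agree
unported = binary_data[:4] == "PK\x03\x04"   # bytes compared with str on 3.x
```

On 3.x the unported comparison does not raise; it simply evaluates to False, so an unmarked literal fails silently.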

Terry Reedy

Jan 17, 2009, 10:02:53 PM

to pytho...@python.org

John Machin wrote:
> On Jan 18, 9:10 am, Terry Reedy <tjre...@udel.edu> wrote:
>> Martin v. Löwis wrote:
>>>>> Does he intend to maintain two separate codebases, one 2.x and the
>>>>> other 3.x?
>>>> I think I have no other choice.
>>>> Why? Is theoretically possible to maintain an unique code base for
>>>> both 2.x and 3.x?
>>> That is certainly possible! One might have to make tradeoffs wrt.
>>> readability sometimes, but I found that this approach works quite
>>> well for Django. I think Mark Hammond is also working on maintaining
>>> a single code base for both 2.x and 3.x, for PythonWin.
>> Where 'single codebase' means that the code runs as is in 2.x and as
>> autoconverted by 2to3 (or possibly a custom converter) in 3.x.
>>
>> One barrier to doing this is when the 2.x code has a mix of string
>> literals with some being character strings that should not have 'b'
>> prepended and some being true byte strings that should have 'b'
>> prepended. (Many programs do not have such a mix.)
>>
>> One approach to dealing with string constants I have not yet seen
>> discussed here is to put them all in separate file(s) to be imported.
>> Group the text and bytes separately. Then marking the bytes with a 'b',
>> either by hand or program would be easy.
>
> (1) How would this work for somebody who wanted/needed to support 2.5
> and earlier?

See reposts in python wiki, one by Martin.

For 2.6 only (which is much easier than 2.x), do C. Plan A is for 2.x
where C does not work.

tjr

John Machin

Jan 18, 2009, 6:56:39 AM

Most relevant of these is Martin's article on porting Django, using a
single codebase. The """goal is to support all versions that Django
supports, plus 3.0""" -- indicating that it supports at least 2.5,
which won't eat b"blah" syntax. He is using 2to3, and handles bytes
constants by """django.utils.py3.b, which is a function that converts
its argument to an ASCII-encoded byte string. In 2.x, it is another
alias for str; in 3.x, it leaves byte strings alone, and encodes
regular (unicode) strings as ASCII. This function is used in all
places where string literals are meant as bytes, plus all cases where
str() was used to invoke the default conversion of 2.x."""
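
A function along those lines might look like the sketch below. It is written from the description quoted above, not copied from Django's source, so the details are assumptions.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        """Return s as bytes; text is encoded as ASCII."""
        if isinstance(s, bytes):
            return s
        return s.encode("ascii")
else:
    b = str   # on 2.x a plain string literal is already a byte string

# No b"" syntax is needed at the call site, so 2.5 and earlier can parse it.
ZIPFILE_SIG = b("PK\x03\x04")
```

Because the argument is an ordinary string literal, the same source parses on every 2.x version as well as on 3.x.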

Very similar to what I expected. However it doesn't answer my question
about how your "move byte strings to a separate file, prepend 'b', and
import the separate file" strategy would help ... and given that 2.5
and earlier will barf on b"arf", I don't expect it to.

Excuse me? I'm with the OP now, I'm totally confused. Plan C is *not*
what you were proposing; you were proposing something like Plan A
which definitely involved a separate file.

Why won't Plan C work on 2.x (x <= 5)? Because the 2.X will b"arf".
But you say Plan A is for 2.x -- but Plan A involves importing the
separate file which contains and causes b"arf" also!

To my way of thinking, one obvious DISadvantage of a strategy that
actually moves the strings to another file (requiring invention of a
name for each string that doesn't have one already, so that it can be
imported) is the amount of effort and exposure to error required to get
the same functional result as a strategy that keeps the string in the
same file ... and this disadvantage applies irrespective of what one
does to the string: b"arf", Martin's b("arf"), somebody else's
_b("arf") [IIRC] or my you-aint-gonna-miss-noticing-this-in-the-code
BYTES_LITERAL("arf").

Cheers,
John