Now. The basic difference is that socket.recv() returns a bytes object
instead of a string object and that's the thing which confuses me
mainly.
My question is: is there a way to convert that bytes object into
exactly *the same thing* returned by socket.recv() in Python 2.x (a
string)?
I know I can do:
data = socket.recv(1024)
data = data.decode(encoding)
...to convert bytes into a string but that's not exactly the same
thing.
In Python 2.x I didn't have to care about the encoding. What
socket.recv() returned was just a string. That was all.
Now doing something like b''.decode(encoding) puts me in serious
trouble, since it can raise an exception if the client and server
use different encodings.
As far as I understand, the basic difference is that a
Python 2.x based FTP server could handle a 3.x based FTP client using
"latin1", "utf-8" or any other encoding, while with Python 3.x
I'm forced to tell my server which encoding to use, and I don't know
how to deal with that.
--- Giampaolo
http://code.google.com/p/pyftpdlib
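The failure mode described above can be reproduced without a socket. The following is a minimal sketch, assuming Python 3 and made-up sample bytes standing in for what socket.recv() might return; the variable names are illustrative only.

```python
# Sample payloads standing in for data read from a socket (made-up values).
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

# Decoding Latin-1 bytes as UTF-8 fails: 0xE9 announces a multi-byte
# sequence that never arrives.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Latin-1 maps every byte value to a code point, so this never raises,
# but UTF-8 input comes out as two wrong characters instead of one.
mojibake = utf8_bytes.decode("latin-1")

# errors="replace" avoids the exception at the cost of losing data:
# the undecodable byte becomes U+FFFD.
lossy = latin1_bytes.decode("utf-8", errors="replace")
```

None of these options reproduces the 2.x behaviour exactly; they only show the trade-off between raising, guessing wrong silently, and losing information.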
Python 3.0's bytes type is almost the same type as Python 2.x's str
type. During the development of Python 3.0 the old str type was modified
and renamed to bytes. The old unicode type is now known as str.
2.x        ->  3.0
--------------------
str        ->  bytes
unicode    ->  str
""         ->  b""
u""        ->  ""
HTH
Christian
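A short sketch of what "almost the same" means in practice, assuming Python 3. The literals are arbitrary examples; the point is that indexing and cross-type comparison behave differently from 2.x str.

```python
text = "abc"    # 3.x str: a sequence of Unicode code points (2.x unicode)
raw = b"abc"    # 3.x bytes: a sequence of byte values (2.x str)

first_char = text[0]    # a length-1 str
first_byte = raw[0]     # an int in 3.x, where 2.x str gave a length-1 str
first_slice = raw[:1]   # slicing still returns bytes

# bytes and str never compare equal in 3.x, even for pure ASCII content.
mixed_equal = (raw == text)
```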
Now, in Python 3.x, it's time to tidy things up.
The 'str' type has been renamed 'bytes' and the 'unicode' type has been
renamed 'str'. If you're truly working with strings of _characters_ then
'str' is what you need, but if you're working with strings of _bytes_
then 'bytes' is what you need.
socket.send() and socket.recv() are still the same, it's just that it's
now clearer that they work with bytes and not strings.
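As a rough illustration of that point, here is a sketch using socket.socketpair() as a stand-in for a real client/server connection (an assumption made to keep the example self-contained). It assumes Python 3.

```python
import socket

# A connected pair of sockets, standing in for a client and a server.
left, right = socket.socketpair()
try:
    left.sendall(b"NOOP\r\n")   # bytes go in ...
    data = right.recv(1024)     # ... and bytes come out

    # Passing a str is rejected outright instead of being encoded implicitly.
    try:
        left.send("NOOP\r\n")
        str_rejected = False
    except TypeError:
        str_rejected = True
finally:
    left.close()
    right.close()
```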
> If you're truly working with strings of _characters_ then
> 'str' is what you need, but if you're working with strings of _bytes_
> then 'bytes' is what you need.
I work with strings of characters, but to convert bytes into a string I
need to specify an encoding, and that's what confuses me.
Before, there was no need to deal with that.
--- Giampaolo
Why do you have to deal with unicode data? IIRC FTP uses ASCII-only text,
so you can stick to bytes everywhere. If you didn't have to worry about
encoding and unicode in Python 2.x, then you should use bytes all over
the place, too.
Christian
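A minimal sketch of the "bytes everywhere" approach applied to a command line read from the control connection. The function name parse_command and the sample input are hypothetical, not taken from pyftpdlib; Python 3 is assumed.

```python
def parse_command(line):
    """Split a raw FTP command line into (command, argument), all bytes."""
    line = line.rstrip(b"\r\n")
    cmd, _, arg = line.partition(b" ")
    # bytes.upper() only touches ASCII letters, so no encoding is involved.
    return cmd.upper(), arg


# The argument is passed through untouched, whatever encoding the client used.
command, argument = parse_command(b"retr caf\xe9.txt\r\n")
```

The server never decodes the argument, so a client using Latin-1 and one using UTF-8 are handled identically, as they were under 2.x.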
> On 17 Gen, 02:24, MRAB <goo...@mrabarnett.plus.com> wrote:
>
>> If you're truly working with strings of _characters_ then 'str' is what
>> you need, but if you're working with strings of _bytes_ then 'bytes' is
>> what you need.
>
> I work with string of characters but to convert bytes into string I need
> to specify an encoding and that's what confuses me. Before there was no
> need to deal with that.
In Python 2.x, str means "string of bytes". This has been renamed "bytes"
in Python 3.
In Python 2.x, unicode means "string of characters". This has been
renamed "str" in Python 3.
If you do this in Python 2.x:
    my_string = str(bytes_from_socket)
then you don't need to convert anything, because you are going from a
string of bytes to a string of bytes.
If you do this in Python 3:
    my_string = str(bytes_from_socket)
then you *do* have to convert, because you are going from a string of
bytes to a string of characters (unicode). The Python 2.x equivalent code
would be:
    my_string = unicode(bytes_from_socket)
and when you convert to unicode, you can get encoding errors. A better
way to do this would be some variation on:
    my_str = bytes_from_socket.decode('utf-8')
You should read this:
http://www.joelonsoftware.com/articles/Unicode.html
--
Steven
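One detail worth checking by hand: in Python 3, calling str() on a bytes object without an encoding does not decode it but returns its repr. The sketch below assumes Python 3 run without the -bb flag, and uses a made-up command line as sample data.

```python
bytes_from_socket = b"USER anonymous\r\n"   # sample data, not from a real client

# Without an encoding, str() yields the printable representation, b'...'.
as_repr = str(bytes_from_socket)

# With an encoding, str() decodes; this is equivalent to calling .decode().
via_str = str(bytes_from_socket, "utf-8")
via_decode = bytes_from_socket.decode("utf-8")
```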
Thanks, that clarifies a bit even if I still have a lot of doubts.
I wish I could do:
my_str = bytes_from_socket.decode('utf-8')
That would mean not having to replace "" with b"" almost everywhere in
my code, but I doubt it would actually be a good idea.
RFC 2640 states that UTF-8 is the preferred encoding for both
clients and servers, but I see that Python 3.x's ftplib uses latin1,
for example (bug?). How is my server supposed to deal with that?
I think that using bytes everywhere, as Christian recommended, would
be the only way to behave exactly like the 2.x version, but that's not
easy at all.
--- Giampaolo
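One possible middle ground, not mentioned in the thread, is the "surrogateescape" error handler from PEP 383. It is available from Python 3.1 onwards, so it does not apply to 3.0 itself. The sketch below assumes such a version; the file name is a made-up example.

```python
# A Latin-1 encoded file name, which is not valid UTF-8.
raw_path = b"caf\xe9.txt"

# Undecodable bytes are smuggled through as lone surrogates instead of raising.
as_text = raw_path.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back_to_bytes = as_text.encode("utf-8", errors="surrogateescape")

# Well-formed UTF-8 input is decoded normally.
utf8_text = b"caf\xc3\xa9.txt".decode("utf-8", errors="surrogateescape")
```

This keeps str in the code while remaining lossless for clients that send something other than UTF-8, at the cost of strings that cannot be printed or encoded strictly.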
That would help to avoid replacing "" with b"" almost everywhere in my
code.
--- Giampaolo
> That would help to avoid replacing "" with b"" almost everywhere in my
> code.
Won't 2to3 do that for you?
I used 2to3 against my code but it didn't cover the "" -> b""
conversion (and I doubt it is able to do so, anyway).
--- Giampaolo
Perhaps before we get too far down the track of telling the OP what he
should do, we should ask him a little about his intentions:
Is he porting to 3.0 and abandoning 2.x support completely?
[presumably unlikely]
So then what is the earliest 2.x that he wants to support at the same
time as 3.x? [presumably at least 2.5]
Does he intend to maintain two separate codebases, one 2.x and the
other 3.x?
Else does he intend to maintain just one codebase written in some 2.x
dialect and using 2to3 plus sys.version-dependent code for the things
that 2to3 can't/doesn't handle?
Cheers,
John
No.
> So then what is the earliest 2.x that he wants to support at the same
> time as 3.x? [presumably at least 2.5]
I currently support Python versions from 2.3 to 2.6 using a single
codebase.
My idea is to support 3.x starting from the last upcoming release.
> Does he intend to maintain two separate codebases, one 2.x and the
> other 3.x?
I think I have no other choice.
Why? Is it theoretically possible to maintain a single code base for
both 2.x and 3.x?
> Else does he intend to maintain just one codebase written in some 2.x
> dialect and using 2to3 plus sys.version-dependent code for the things
> that 2to3 can't/doesn't handle?
I don't think it would be worth the effort.
> Cheers,
> John
Thanks a lot
--- Giampaolo
That is certainly possible! One might have to make tradeoffs wrt.
readability sometimes, but I found that this approach works quite
well for Django. I think Mark Hammond is also working on maintaining
a single code base for both 2.x and 3.x, for PythonWin.
Regards,
Martin
Where 'single codebase' means that the code runs as is in 2.x and as
autoconverted by 2to3 (or possibly a custom converter) in 3.x.
One barrier to doing this is when the 2.x code has a mix of string
literals with some being character strings that should not have 'b'
prepended and some being true byte strings that should have 'b'
prepended. (Many programs do not have such a mix.)
One approach to dealing with string constants I have not yet seen
discussed here is to put them all in separate file(s) to be imported.
Group the text and bytes separately. Then marking the bytes with a 'b',
either by hand or by program, would be easy.
tjr
(1) How would this work for somebody who wanted/needed to support 2.5
and earlier?
(2) Assuming supporting only 2.6 and 3.x:
Suppose you have this line:
    if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
Plan A:
  Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
  Add this to the bytes section of the separate file:
    ZIPFILE_SIG = "PK\x03\x04"
  [somewhat later]
  Change the above to:
    ZIPFILE_SIG = b"PK\x03\x04"
  [once per original file]
  Add near the top:
    from separatefile import *
Plan B:
  Change original to:
    if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
  Add this to the separate file:
    ZIPFILE_SIG = b"PK\x03\x04"
  [once per original file]
  Add near the top:
    from separatefile import *
Plan C:
  Change original to:
    if binary_data[:4] == b"PK\3\4": # signature of ZIP file
Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
variety thereof -- what advantages do you see in this over Plan C?
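To make the stakes of the plans above concrete, here is a small sketch, assuming Python 3 (the b"" syntax is also accepted by 2.6). The helper name looks_like_zip is hypothetical.

```python
ZIPFILE_SIG = b"PK\x03\x04"   # signature of a ZIP file, as a bytes constant


def looks_like_zip(binary_data):
    """Return True if the data starts with the ZIP signature."""
    return binary_data[:4] == ZIPFILE_SIG


# The octal escapes used in Plan C spell the same four bytes.
same_literal = (b"PK\3\4" == ZIPFILE_SIG)

# The unported line compares bytes against str, which is silently
# False in 3.x rather than an error.
unported_check = (b"PK\x03\x04rest"[:4] == "PK\x03\x04")
```

The last line is why the literals cannot simply be left alone: the unported comparison does not fail loudly, it just stops matching.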
See reposts in python wiki, one by Martin.
For 2.6 only (which is much easier than 2.x), do C. Plan A is for 2.x
where C does not work.
tjr
Most relevant of these is Martin's article on porting Django, using a
single codebase. The """goal is to support all versions that Django
supports, plus 3.0""" -- indicating that it supports at least 2.5,
which won't eat b"blah" syntax. He is using 2to3, and handles bytes
constants by """django.utils.py3.b, which is a function that converts
its argument to an ASCII-encoded byte string. In 2.x, it is another
alias for str; in 3.x, it leaves byte strings alone, and encodes
regular (unicode) strings as ASCII. This function is used in all
places where string literals are meant as bytes, plus all cases where
str() was used to invoke the default conversion of 2.x."""
Very similar to what I expected. However it doesn't answer my question
about how your "move byte strings to a separate file, prepend 'b', and
import the separate file" strategy would help ... and given that 2.5
and earlier will barf on b"arf", I don't expect it to.
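For readers who have not seen it, a helper along the lines quoted above might look roughly like the following. This is a sketch reconstructed from the prose description only, not the actual django.utils.py3 code, and its behaviour on edge cases is an assumption.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        """Return s as a byte string; str input must be pure ASCII."""
        if isinstance(s, bytes):
            return s                # byte strings are left alone
        return s.encode("ascii")    # regular (unicode) strings are encoded
else:
    b = str                         # in 2.x, str already is the byte string


# Works unchanged on 2.5 and earlier, because no b"" literal is needed here.
ZIPFILE_SIG = b("PK\x03\x04")
```

Because the marker is an ordinary function call, the source still parses on interpreters that reject the b"" literal syntax.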
Excuse me? I'm with the OP now, I'm totally confused. Plan C is *not*
what you were proposing; you were proposing something like Plan A
which definitely involved a separate file.
Why won't Plan C work on 2.x (x <= 5)? Because the 2.X will b"arf".
But you say Plan A is for 2.x -- but Plan A involves importing the
separate file which contains and causes b"arf" also!
To my way of thinking, one obvious DISadvantage of a strategy that
actually moves the strings to another file (requiring invention of a
name for each string that doesn't have one already, so that it can be
imported) is the amount of effort and exposure to error required to get
the same functional result as a strategy that keeps the string in the
same file ... and this disadvantage applies irrespective of what one
does to the string: b"arf", Martin's b("arf"), somebody else's
_b("arf") [IIRC] or my you-aint-gonna-miss-noticing-this-in-the-code
BYTES_LITERAL("arf").
Cheers,
John