Universal string conversion to bytes in python2/3


Ian Bell

Dec 17, 2017, 3:04:38 AM
to cython...@googlegroups.com
Is there a more elegant solution to this problem? I need to be able to support generic inputs of str on python 2, bytes, or unicode on python 3, and end up with a bytes object.  I know all string-types are UTF8 (or ascii) encoded.

cpdef bytes get_bytes(object _key):
    cdef bytes key
    try:
        key = _key.encode('utf8')
    except AttributeError:
        key = _key
    return key

Jeroen Demeyer

Dec 17, 2017, 3:49:33 AM
to cython...@googlegroups.com
On 2017-12-17 09:03, Ian Bell wrote:
> Is there a more elegant solution to this problem? I need to be able to
> support generic inputs of str on python 2, bytes, or unicode on python
> 3, and end up with a bytes object. I know all string-types are UTF8 (or
> ascii) encoded.

I would do:

cpdef bytes get_bytes(key):
    if isinstance(key, bytes):
        return <bytes>key
    else:
        return key.encode('utf8')
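
For comparison, here is a plain-Python sketch of the same branch logic without the Cython typing. It is an illustration only, not part of the reply above, and it assumes (as the question states) that any text input should be encoded as UTF-8:

```python
def get_bytes(key):
    """Return key as bytes; text input is encoded as UTF-8."""
    if isinstance(key, bytes):
        # Already bytes (this is also plain str on Python 2): pass through.
        return key
    # Otherwise assume a text object (unicode on Py2, str on Py3).
    return key.encode("utf8")


print(get_bytes(b"abc"))
print(get_bytes(u"abc"))
```

Checking for bytes first means the common already-encoded case never touches the codec machinery, which is likely why it also runs faster than the try/except version.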

Ian Bell

Dec 17, 2017, 3:37:11 PM
to cython...@googlegroups.com
Great, yeah that works perfectly in python 2/3 and with all the different types of string-ish inputs.  And it's faster than my solution as well, especially if the input is already bytes.

All these python 2/3 string types drive me a little bit insane...



--

--- You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Stefan Behnel

Dec 18, 2017, 1:09:16 AM
to cython...@googlegroups.com
On 17 December 2017 21:36:27 CET, Ian Bell wrote:
>All these python 2/3 string types drive me a little bit insane...

Did you read this?

http://docs.cython.org/en/latest/src/tutorial/strings.html

Stefan

Ian Bell

Dec 18, 2017, 1:39:34 AM
to cython...@googlegroups.com
Yes, in detail, many times.  I find the panoply of string types and encodings and encoding and decoding to be completely confusing, especially because the string types are similar but different between python 2, python 3 and cython.  It is maybe the one thing that I still struggle with the most in python, especially because I am always dealing with c++ code that works with std::string, and python has all these string-like things.  As an American, I know I have been able to avoid Unicode mostly because almost everything we do does not use Unicode.  But the rest of the world is not so lucky.

What I settled on was to internally do everything with bytes and convert inputs to bytes in every function.  Then the string type is clear between python 2 and python 3.  What if I didn't do that, though, and went for the default string type in each Python version? Is there a notion of the default string type in the given python version in cython?  Something like cython.default_string?  That would be very useful.  Or maybe I'm thinking about the problem all wrong.



Jeroen Demeyer

Dec 18, 2017, 4:02:33 AM
to cython...@googlegroups.com
On 2017-12-18 07:39, Ian Bell wrote:
> Is there a
> notion of the default string type in the given python version in
> cython?

Do you mean "str"?

Chris Barker

Dec 18, 2017, 1:36:09 PM
to cython-users
On Sun, Dec 17, 2017 at 10:39 PM, Ian Bell <ian.h...@gmail.com> wrote:
> Yes, in detail, many times.  I find the panoply of string types and encodings and encoding and decoding to be completely confusing, especially because the string types are similar but different between python 2, python 3 and cython.  It is maybe the one thing that I still struggle with the most in python, especially because I am always dealing with c++ code that works with std::string, and python has all these string-like things.

Well, std::string is pretty much a bytes object :-) (or maybe a python2 str object....). But it really doesn't help with unicode at all.
 
> As an American, I know I have been able to avoid Unicode mostly because almost everything we do does not use Unicode.  But the rest of the world is not so lucky.
>
> What I settled on was to internally do everything with bytes and convert inputs to bytes in every function.  Then the string type is clear between python 2 and python 3.  What if I didn't do that, though, and went for the default string type in each Python version?

That is probably the worst way to do it :-)

In Python 2, there are two types: "unicode" and "str", and then there is bytes, which is just a synonym for str.

In Python3, there are two types: "str" and "bytes" -- so here is where it gets a bit confusing:

a py3 str is essentially the same as a py2 unicode
a py3 bytes is essentially the same as a py2 str

But maybe you are seeing some symmetry here -- in BOTH, there are essentially two types: Unicode and Bytes. So the trick is to not use the name "str", as that means two different things in the two python versions.

My solution:

In your Cython code, use two and only two types:

unicode
bytes

bytes contain encoded text, and unicode contains, well, "proper" unicode. So all text is in a unicode object.
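
To make the split concrete, here is a small Python 3 sketch (the variable names are illustrative; on Python 3 the text type is spelled str):

```python
text = "Gr\u00fc\u00dfe"       # text: 5 Unicode code points
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text

# The two non-ASCII characters each take two bytes in UTF-8,
# so the byte length is larger than the character count.
print(len(text), len(data))

# Decoding with the same codec restores the original text.
assert data.decode("utf-8") == text
```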

> Is there a notion of the default string type in the given python version in cython?  Something like cython.default_string?  That would be very useful.

That would be the "str" type in Cython - it's a py2 string (bytes) in python2 and a py3 string (unicode) in py3 -- but then you don't know in the rest of your code what you are dealing with.

In terms of exchanging data with C / C++ you always want to use the bytes type -- so you encode it from a unicode type to bytes when going to C++, and decode it from bytes to unicode when coming from C++.

This is all pretty easy IF you can be sure of the encoding of the C++ strings....
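
A minimal sketch of that boundary rule in plain Python 3. The helper names `to_cpp` and `from_cpp` are hypothetical, and UTF-8 is only an assumption about what the C++ side uses:

```python
ENCODING = "utf-8"  # assumed encoding of the C++ strings


def to_cpp(text):
    """Encode text to bytes just before handing it to C/C++."""
    return text.encode(ENCODING)


def from_cpp(data):
    """Decode bytes coming back from C/C++ into text."""
    return data.decode(ENCODING)


assert from_cpp(to_cpp("caf\u00e9")) == "caf\u00e9"

# If the C++ side actually produced Latin-1, decoding as UTF-8 fails.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    from_cpp(latin1_bytes)
except UnicodeDecodeError as exc:
    print("wrong encoding assumed:", exc.reason)
```

Keeping the codec name in one place makes the encoding assumption explicit, which is exactly the part that goes wrong when the C++ strings are not what you expected.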

-CHB

> Or maybe I'm thinking about the problem all wrong.

only a little wrong. :-)

-CHB




--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Justin Israel

Jan 31, 2019, 1:28:34 AM
to cython-users
I know this is an older thread, but after searching around, this one seems the most relevant to my question.  I've also reviewed:


I want to gut-check my current approach to adapting a large-ish C++ binding to add Python 3 support. A long while back I had already added the following directives to all my pyx files to ensure I was dealing with unicode strings:

    # cython: c_string_type=unicode, c_string_encoding=utf8

Now, one of my biggest and most time-consuming challenges in this update is dealing with the *ton* of locations where a python string is assigned to a std::string, passed as an argument, or even part of implicit map or list conversions. I'm finding that I have to discover all these locations and litter the code-base with calls to this helper:

    # cython: c_string_type=unicode, c_string_encoding=utf8

    from libcpp.string cimport string
    from cpython.version cimport PY_MAJOR_VERSION

    cdef unicode _text(s):
        if type(s) is unicode:
            return <unicode>s

        elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
            return (<bytes>s).decode('ascii')
        
        elif isinstance(s, unicode):
            return unicode(s)
        
        else:
            raise TypeError("Could not convert to unicode.")

    cdef string _string(basestring s) except *:
        cdef string c_str = _text(s).encode("utf-8")
        return c_str

    # ...
    self.field = _string(s)

Is this approach expected? 

Justin

Robert Bradshaw

Jan 31, 2019, 6:32:51 AM
to cython...@googlegroups.com
These directives should be enough to automatically convert your Python
strings (Py2 and Py3) into C++ strings, no extra helpers required.
(The only thing I'm not sure of is handling subclasses of unicode, is
that important?) What kinds of errors are you seeing when you just do

    self.field = py_object   # where field is of type std::string

Justin Israel

Jan 31, 2019, 1:35:13 PM
to cython...@googlegroups.com


On Fri, Feb 1, 2019, 12:32 AM Robert Bradshaw <robe...@gmail.com> wrote:
> These directives should be enough to automatically convert your Python
> strings (Py2 and Py3) into C++ strings, no extra helpers required.
> (The only thing I'm not sure of is handling subclasses of unicode, is
> that important?) What kinds of errors are you seeing when you just do
>
>     self.field = py_object   # where field is of type std::string

It was enough to only have the directives when I was only running under python2. When I switched to py3 I would get:

    Expected bytes, got string

It seems Cython does not want to encode Unicode strings to bytes automatically when assigning to a std::string anymore. This is under Cython 0.28.5.
I have no need to support string subclasses. I can pass a string literal into a function and it will raise the exception when it tried to directly assign it. 


Robert Bradshaw

Jan 31, 2019, 2:18:55 PM
to cython...@googlegroups.com
On Thu, Jan 31, 2019 at 7:35 PM Justin Israel <justin...@gmail.com> wrote:
> It was enough to only have the directives when I was only running under python2. When I switched to py3 I would get:
>
>     Expected bytes, got string
>
> It seems Cython does not want to encode Unicode strings to bytes automatically when assigning to a std::string anymore. This is under Cython 0.28.5.
> I have no need to support string subclasses. I can pass a string literal into a function and it will raise the exception when it tried to directly assign it.

Justin Israel

Jan 31, 2019, 3:55:26 PM
to cython-users
On Fri, Feb 1, 2019 at 8:18 AM Robert Bradshaw <robe...@gmail.com> wrote:
> Hmm... that does look like a bug then.


I wish I had seen your reply before I put together this exact repro of my own usage:

I am using utf8 on both sides instead of ascii, if that makes a difference. But I have tried this repro even on the master 3.0a build, with each of the language levels, and the behaviour is the same. So do you have any suggestions to address this, or am I going to have to continue finding all of my std::string conversion sites and wrapping them in a bytes conversion?

Robert Bradshaw

Jan 31, 2019, 4:56:48 PM
to cython...@googlegroups.com
I'm going to try and fix Cython.

Justin Israel

Jan 31, 2019, 5:28:05 PM
to cython-users
On Fri, Feb 1, 2019 at 10:56 AM Robert Bradshaw <robe...@gmail.com> wrote:
> I'm going to try and fix Cython.

Sweet. Is there a specific issue that I can watch to track the progress? Thanks! 

Robert Bradshaw

Jan 31, 2019, 5:34:13 PM
to cython...@googlegroups.com
File one and I'll use it.

Justin Israel

Jan 31, 2019, 5:49:58 PM
to cython-users
On Fri, Feb 1, 2019 at 11:34 AM Robert Bradshaw <robe...@gmail.com> wrote:
> File one and I'll use it.
