Python 3 error: TypeError: expected bytes, str found

Geoffrey Fairchild

Sep 19, 2013, 7:07:00 PM
to cython...@googlegroups.com
I have some code at https://github.com/gfairchild/pyxDamerauLevenshtein/blob/master/pyxdameraulevenshtein/pyxdameraulevenshtein.pyx that is throwing an error I'm not sure how to fix. Essentially, I have this code:

cdef unicode to_unicode(char *s):
    return s.decode('UTF-8', 'strict')

cpdef unsigned int damerau_levenshtein_distance(char *seq1, char *seq2):
    s1 = to_unicode(seq1)
    s2 = to_unicode(seq2)

This works just fine under Python 2 (which is what I've been testing with). But I was alerted that it is broken under Python 3. Essentially, you need to do this under Python 3:

normalized_damerau_levenshtein_distance('smtih'.encode(), 'smith'.encode())

where in Python 2, you just do this:

normalized_damerau_levenshtein_distance('smtih', 'smith')

How can I fix this in the Cython code so that the user doesn't have to call .encode()? I read http://docs.cython.org/src/tutorial/strings.html and tried several things, but I'm still not clear on how to do this.
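To illustrate, here's a rough sketch of what callers currently hit (assuming the extension imports as pyxdameraulevenshtein, which is how the repo is laid out):

# Illustration only: reproduces the reported behavior.
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance

# Python 2: works, because str is a byte string and matches char*.
# Python 3: raises "TypeError: expected bytes, str found", because str
# is unicode and no longer coerces to char*.
normalized_damerau_levenshtein_distance('smtih', 'smith')

# Python 3 workaround I'd like to avoid forcing on users:
normalized_damerau_levenshtein_distance('smtih'.encode(), 'smith'.encode())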

Thanks!

Stefan Behnel

Sep 20, 2013, 1:15:42 AM
to cython...@googlegroups.com
Geoffrey Fairchild, 20.09.2013 01:07:
> I have some code at
> https://github.com/gfairchild/pyxDamerauLevenshtein/blob/master/pyxdameraulevenshtein/pyxdameraulevenshtein.pyx
> that is throwing an error I'm not sure how to fix. Essentially, I have this
> code:
>
> cdef unicode to_unicode(char *s):
>     return s.decode('UTF-8', 'strict')
>
> cpdef unsigned int damerau_levenshtein_distance(char *seq1, char *seq2):
>     s1 = to_unicode(seq1)
>     s2 = to_unicode(seq2)
>
>
> This works just fine under Python 2 (which is what I've been testing with).
> But I was alerted that it is broken under Python 3. Essentially, you need
> to do this under Python 3:
>
> normalized_damerau_levenshtein_distance('smtih'.encode(), 'smith'.encode())

I'd rather write code that is explicit about the encoding it uses.


> where in Python 2, you just do this:
>
> normalized_damerau_levenshtein_distance('smtih', 'smith')
>
>
> How can I fix this in the Cython code so that the user doesn't have to call
> .encode()? I read http://docs.cython.org/src/tutorial/strings.html and
> tried several things, but I'm still not clear on how to do this.

Given that your code above starts by unpacking a byte string and then calls
decode() on it, I'd rather drop the "char*" from the current signature and
rewrite the helper like this:

cdef unicode to_unicode(s):
    if isinstance(s, bytes):
        return (<bytes>s).decode('utf8')
    return s

cpdef unsigned int damerau_levenshtein_distance(seq1, seq2):
    # as before

That avoids an unnecessarily costly round-trip through C. String handling
is much easier and often (as in this case) also more efficient in Python
than in C.
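With the signature relaxed like that, callers should be able to pass either str or bytes without an explicit .encode() - roughly like this (function name taken from your code above):

# Sketch of the intended call pattern after the change: the helper decodes
# bytes to unicode internally, so no manual .encode() is needed on Python 3.
from pyxdameraulevenshtein import damerau_levenshtein_distance

damerau_levenshtein_distance('smtih', 'smith')    # str accepted directly
damerau_levenshtein_distance(b'smtih', b'smith')  # bytes still accepted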

Here's some more information on that:

http://docs.cython.org/src/tutorial/strings.html

Stefan

Stefan Behnel

Sep 20, 2013, 1:19:17 AM
to cython...@googlegroups.com
Stefan Behnel, 20.09.2013 07:15:
Ah, sorry, I only just noticed that you've already seen that. So here's a
more concrete link to the section I meant:

http://docs.cython.org/src/tutorial/strings.html#general-notes-about-c-strings

Stefan

Geoffrey Fairchild

Sep 20, 2013, 1:42:27 AM
to cython...@googlegroups.com, stef...@behnel.de
Ahhhh, thanks very much! I think I see what I need to fix. I'll tackle this tomorrow.