Geoffrey Fairchild, 20.09.2013 01:07:
> I have some code at
>
> https://github.com/gfairchild/pyxDamerauLevenshtein/blob/master/pyxdameraulevenshtein/pyxdameraulevenshtein.pyx
> that is throwing an error I'm not sure how to fix. Essentially, I have this
> code:
>
> cdef unicode to_unicode(char *s):
>     return s.decode('UTF-8', 'strict')
>
> cpdef unsigned int damerau_levenshtein_distance(char *seq1, char *seq2):
>     s1 = to_unicode(seq1)
>     s2 = to_unicode(seq2)
>
>
> This works just fine under Python 2 (which is what I've been testing with).
> But I was alerted that it is broken under Python 3. Essentially, you need
> to do this under Python 3:
>
> normalized_damerau_levenshtein_distance('smtih'.encode(), 'smith'.encode())
I'd rather write code that is explicit about the encoding it uses.
> where in Python 2, you just do this:
>
> normalized_damerau_levenshtein_distance('smtih', 'smith')
>
>
> How can I fix this in the Cython code so that the user doesn't have to call
> .encode()? I read
> http://docs.cython.org/src/tutorial/strings.html and
> tried several things, but I'm still not clear on how to do this.
Given that your code above starts by unpacking a byte string and then calls
decode() on it, I'd rather drop the "char*" from the current signature and
rewrite the helper like this:
cdef unicode to_unicode(s):
    if isinstance(s, bytes):
        return (<bytes>s).decode('utf8')
    return s

cpdef unsigned int damerau_levenshtein_distance(seq1, seq2):
    # as before
That avoids an unnecessarily costly round-trip through C. String handling
is much easier and often (as in this case) also more efficient in Python
than in C.
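With that change, the call sites should work unchanged on both Python
versions. A quick sketch of what that would look like (untested, and
assuming any byte string input is UTF-8 encoded):

    damerau_levenshtein_distance(u'smtih', u'smith')  # unicode, passed through as-is
    damerau_levenshtein_distance(b'smtih', b'smith')  # bytes, decoded by to_unicode()

Both calls return the same distance, so users no longer have to call
.encode() themselves.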
Here's some more information on that:
http://docs.cython.org/src/tutorial/strings.html
Stefan