
Bug#442392: python-unac -- code is of poor quality and can be done as easily in native Python


Joe Wreschnig

Sep 15, 2007, 2:30:16 PM
Package: python-unac
Version: 1.7.0-1
Priority: wishlist

All libunac does is run a decomposition filter on the unicode strings
passed in. This can be done natively in Python without resort to a
third-party C library, in a trivial amount of code.
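
For reference, the whole approach is roughly the following (a minimal
sketch under Python 2, not the attached module; the function name
mirrors the python-unac API, and the encoding/errors parameters
anticipate the differences noted below):

    import unicodedata

    def unac_string(text, encoding='utf-8', errors='strict'):
        # Decode byte strings first; unicode input (including
        # subclasses) passes through untouched.
        if not isinstance(text, unicode):
            text = text.decode(encoding, errors)
        # NFKD splits each accented character into its base character
        # plus combining marks; dropping the combining marks removes
        # the accents.
        decomposed = unicodedata.normalize('NFKD', text)
        return u''.join(c for c in decomposed
                        if not unicodedata.combining(c))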

Additionally, libunac hardcodes the Unicode consortium data into itself
and furthermore changes that data based on non-standard proposals from
its users.

The libunac Python wrapper cannot properly handle subclasses of str or
unicode, since it compares the name of the class to 'str' or 'unicode'
rather than checking the types themselves (which is also unsafe in
other ways, since some unrelated class might happen to be named 'str').
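
For example (hypothetical class names, Python 2 semantics):

    class MyString(unicode):
        pass

    s = MyString(u'caf\xe9')
    s.__class__.__name__ == 'unicode'  # False -- the name check rejects a real subclass
    isinstance(s, unicode)             # True -- a proper type check accepts it

    class str(object):                 # an unrelated class that happens to be named 'str'
        pass

    str().__class__.__name__ == 'str'  # True -- the name check wrongly accepts it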

I've attached a Python module that should be basically compatible with
python-unac, except for the fact that Python's Unicode data does not
include the non-standard decomposition forms present in libunac, and it
works properly with subclasses of str or unicode.

There are two small differences: I made the default encoding utf-8
instead of nothing (which always returned nothing), and I let the user
pass in alternate error handling behavior if they want. I would argue
both of these make it better, but the former is technically an API
change.
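
With the signature sketched above, that amounts to calls like these
(illustrative values):

    unac_string('Kho\xe1\xba\xa3ng')                      # bytes decoded as utf-8 by default
    unac_string('Kho\xe1\xba\xa3ng', 'utf-8', 'replace')  # caller-supplied error handling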
--
Joe Wreschnig <pi...@sacredchao.net>

unac.py
signature.asc

Lukáš Lalinský

Sep 15, 2007, 3:40:07 PM

There is one functional difference: NFKD normalization and the
filtering used in libunac do not produce the same result (the sample is
some random text from http://www.bbc.co.uk/vietnamese/):

>>> print unac.unac_string(u'Khoảng một triệu người châu Phi đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người chết')
Khoang mot trieu nguoi chau Phi dang chiu anh huong cua lu lut do mua lon gay mat mua, vo de va hang chuc nguoi chet
>>> print unac2.unac_string(u'Khoảng một triệu người châu Phi đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người chết')
Khoang mot trieu nguoi chau Phi đang chiu anh huong cua lu lut do mua lon gay mat mua, vo đe va hang chuc nguoi chet

(notice the 'd' in libunac and 'đ' in NFKD)
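
This is because đ (U+0111) has no decomposition mapping in the Unicode
data, so NFKD leaves it untouched; libunac strips it only because of
its hardcoded non-standard additions:

    >>> import unicodedata
    >>> unicodedata.decomposition(u'\u0111')  # empty: no decomposition defined
    ''
    >>> print unicodedata.normalize('NFKD', u'\u0111')
    đ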

But there is another, more important difference -- performance.
Real-time unicode normalization is slow, and Python list filtering is
slow. I use the code for a Lucene index builder with a custom
unaccenting analyzer, and the Python code would increase the running
time significantly:

>>> timeit.Timer("unac.unac_string(u'Khoảng một triệu người châu Phi đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người chết')", "import unac").timeit(100000)
1.7533831596374512
>>> timeit.Timer("unac.unac_string(u'Khoảng một triệu người châu Phi đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người chết')", "import unac2 as unac").timeit(100000)
19.089791059494019

I know that the wrapper code is not nice and could be done much better,
but the functionality and the speed are not comparable with the code
based on Python's unicodedata module.

Lukas

signature.asc