Unicode and Fields.

Sudharshan S

Mar 17, 2012, 1:11:55 PM
to Whoosh
Hi all,
I have been giving whoosh a spin and found that field contents
(except for BOOLEAN and NUMERIC, since they have their own index()
implementation) must be strictly unicode.

Is there some design decision behind this?

When I try to store plain byte strings (str) for fields, I get the
following exception:
ValueError: 'foobar' is not unicode or sequence

Of course, the right solution is to change my code and ensure all
fields are unicode before indexing, but I'd like to know why whoosh
doesn't do something like this:

diff -r d8001e7edb28 src/whoosh/fields.py
--- a/src/whoosh/fields.py Tue Dec 20 05:48:36 2011 -0500
+++ b/src/whoosh/fields.py Sat Mar 17 14:05:15 2012 -0300
@@ -215,6 +215,8 @@
         if not self.format:
             raise Exception("%s field %r cannot index without a format"
                             % (self.__class__.__name__, self))
+        if isinstance(value, str):
+            value = unicode(value)
         if not isinstance(value, (text_type, list, tuple)):
             raise ValueError("%r is not unicode or sequence" % value)
         assert isinstance(self.format, formats.Format)
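
For comparison, the caller-side fix mentioned above (making sure every field value is unicode before indexing) might look roughly like the sketch below. It is written for Python 3, where a byte string is bytes and text is str; in the Python 2 of this thread the same pair is str and unicode. The helper to_text, the raw_doc dictionary and the "utf-8" default are assumptions made for illustration and are not part of whoosh.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes with an explicitly chosen encoding."""
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`,
        # instead of silently guessing.
        return value.decode(encoding)
    return value


# Hypothetical document whose values arrived as a mix of bytes and text.
raw_doc = {"title": b"foobar", "body": "already text"}

# Every value is text after this step; the encoding is named by the caller.
doc = {name: to_text(value) for name, value in raw_doc.items()}
```

The point of the helper is that the encoding is chosen by code that knows where the bytes came from, rather than assumed inside the library.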

Roger Binns

Mar 20, 2012, 2:20:40 PM
to who...@googlegroups.com
On 17/03/12 10:11, Sudharshan S wrote:
> but I'd like to know why whoosh doesn't do something like this.

You can't change from a sequence of bytes (str) to unicode without knowing
the encoding. That is generally let slide when all the bytes in the str are
less than 128, on the assumption that they are ASCII, but that assumption is
not necessarily correct.
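
A small sketch of why the encoding matters, written for Python 3 (bytes and str rather than the str and unicode of this thread). The sample byte strings are made up for illustration.

```python
data = b"caf\xc3\xa9"  # the UTF-8 encoding of "café"

# The same bytes give different text depending on the encoding assumed.
as_utf8 = data.decode("utf-8")      # "café"
as_latin1 = data.decode("latin-1")  # "cafÃ©"

# Input made only of bytes below 128 decodes identically under ASCII,
# UTF-8 and Latin-1, which is why the ASCII assumption often goes unnoticed.
plain = b"foobar"
same = plain.decode("ascii") == plain.decode("utf-8") == plain.decode("latin-1")

# It is still an assumption: read as UTF-16 (little endian), the same six
# bytes decode to three unrelated characters.
as_utf16 = plain.decode("utf-16-le")
```

Nothing in the bytes themselves says which of these readings is the intended one; that information has to come from wherever the bytes originated.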

Pragmatic Unicode (python specific)

http://nedbatchelder.com/text/unipain.html

The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

Roger
