pyparsing.ParseException with german umlaute


Marcus Puchalla

Aug 31, 2009, 9:42:21 AM
to Whoosh
Hi,

I have a problem with German umlauts that I want to search for.
Indexing data from mysql I convert the strings using:
str(row[0]).encode("utf-8"))
which is fine even for ß, ä or ü

But when I use the same method at
parser.parse(..) and search for 'Straße'
I always get the following exception:
File "build/bdist.linux-i686/egg/whoosh/qparser.py", line 209, in parse
File "build/bdist.linux-i686/egg/whoosh/support/pyparsing.py", line 1076, in parseString
whoosh.support.pyparsing.ParseException: Expected end of text (at char 4), (line:1, col:5)

I've tried every possible combination:
query = parser.parse(unicode(str(self.keyword),'ascii'))
query = parser.parse(unicode(str(self.keyword),'utf-8'))
query = parser.parse(unicode(self.keyword.decode("utf-8")))
query = parser.parse(unicode(str(self.keyword).decode("utf-8")))

Does someone have a clue why this happens and how to avoid it?
Thanks
Marcus

Matt Chaput

Aug 31, 2009, 11:51:40 AM
to who...@googlegroups.com
Marcus Puchalla wrote:
> Indexing data from mysql I convert the strings using:
> str(row[0]).encode("utf-8"))

Not sure what you're going for with that... you should be indexing
unicode strings. Also, I thought str().encode("utf-8") was a no-op,
since str() will only handle ASCII.

> But when I use the same method at
> parser.parse(..) and search for 'Straße'

(Shakes fist at sky) CURSE YOU PYPARSING!!!

Sigh.

I thought pyparsing would handle non-ASCII characters. I'll try to
figure out what I need to do to make it work properly.

Matt

Adam Blinkinsop

Aug 31, 2009, 12:04:45 PM
to who...@googlegroups.com
On Mon, Aug 31, 2009 at 8:51 AM, Matt Chaput <ma...@whoosh.ca> wrote:

> Marcus Puchalla wrote:
> > Indexing data from mysql I convert the strings using:
> > str(row[0]).encode("utf-8"))
>
> Not sure what you're going for with that... you should be indexing
> unicode strings. Also, I thought str().encode("utf-8") was a no-op,
> since str() will only handle ASCII.

Ah, Unicode in Python is a crazy thing.  Basically, think of the `str` type as uninterpreted byte arrays, with a special str() or repr() implementation that shows the ASCII characters those bytes represent.  Any bytes will fit, iirc.  Unicode is an array/string of unicode code points that have little to no real numeric interpretation.  You can translate between the two by encoding (unicode -> str) or decoding (str -> unicode).  In Python 3.0 this makes much more sense, as unicode becomes str, and str becomes "bytes".  For more, check out http://evanjones.ca/python-utf8.html
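
To make the encode/decode direction concrete, here is a minimal sketch in Python 3 terms, where the text type is `str` and the byte type is `bytes`. The variable names are only for illustration and are not taken from Whoosh or from the original code.

```python
# Python 3: str holds text (code points), bytes holds raw bytes.
text = "Straße"

# Encoding goes from text to bytes.
encoded = text.encode("utf-8")

# Decoding goes from bytes back to text.
decoded = encoded.decode("utf-8")

# The text has 6 code points, but 'ß' takes two bytes in UTF-8,
# so the encoded form is 7 bytes long.
text_length = len(text)
byte_length = len(encoded)

print(repr(encoded), text_length, byte_length)
```

The point for indexing and querying is to decode once at the boundary (for example, when reading from the database) and pass text, not bytes, to the indexer and the query parser.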
 
> > But when I use the same method at
> > parser.parse(..) and search for 'Straße'
>
> (Shakes fist at sky) CURSE YOU PYPARSING!!!
>
> Sigh.
>
> I thought pyparsing would handle non-ASCII characters. I'll try to
> figure out what I need to do to make it work properly.

If you'd just like a new default query parser, I can implement that.  :)  I'd just like a few tests to ensure compatibility.  (I've been using my custom parser for so long, I'm not sure exactly what qparser does!)
 
> Matt

--
Adam Blinkinsop <bli...@acm.org>

Matt Chaput

Aug 31, 2009, 1:44:15 PM
to who...@googlegroups.com
Marcus Puchalla wrote:
> But when I use the same method at
> parser.parse(..) and search for 'Straße'

I think I've got this figured out in pyparsing. I'll upload a new
release with the fix.

Cheers,

Matt

Paul McGuire

Sep 1, 2009, 7:17:46 PM
to Whoosh
Have you looked at using alphas8bit?:

# vim:fileencoding=utf-8

from pyparsing import *

test = "which is fine even for ß, ä or ü"
punc = oneOf(", .")
print OneOrMore(Word(alphas+alphas8bit) | punc).parseString(test)


prints:

['which', 'is', 'fine', 'even', 'for', '\xdf', ',', '\xe4', 'or', '\xfc']


Or here is a more Unicode-complete way to go:

allunicodealphas = u''.join(unichr(c) for c in xrange(65536) if unichr(c).isalpha())
print len(allunicodealphas)
uniword = Word(allunicodealphas)
print OneOrMore(uniword | punc).parseString(test)

prints:

46618
['which', 'is', 'fine', 'even', 'for', '\xdf', ',', '\xe4', 'or', '\xfc']
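
Both approaches work by widening the alphabet that a word is allowed to draw from. The following is a rough, stdlib-only sketch in Python 3 of why that matters for 'Straße'. It does not use pyparsing. `word_end` is a hypothetical helper written for this illustration, and `latin1_letters` only approximates the idea behind alphas8bit. Whether the Whoosh grammar of the time used an ASCII-only alphabet is an assumption, though the stopping position is consistent with the "at char 4" in the reported exception.

```python
import string

# ASCII letters only, comparable to an ASCII-only word alphabet.
ascii_letters = string.ascii_letters

# Letters in the Latin-1 range 0xC0-0xFF (skips the signs at 0xD7 and 0xF7).
latin1_letters = "".join(
    chr(c) for c in range(0xC0, 0x100) if chr(c).isalpha()
)


def word_end(alphabet, text):
    """Return the index at which a leading run of characters from
    `alphabet` stops in `text` (hypothetical helper)."""
    allowed = set(alphabet)
    i = 0
    while i < len(text) and text[i] in allowed:
        i += 1
    return i


query = "Straße"

# With ASCII letters only, the word stops right before 'ß'.
ascii_stop = word_end(ascii_letters, query)

# With the widened alphabet, the whole word is consumed.
wide_stop = word_end(ascii_letters + latin1_letters, query)

print(ascii_stop, wide_stop)
```

With the narrow alphabet the word ends at index 4, leaving 'ße' unconsumed, which is the kind of leftover input that a parser expecting end of text would complain about.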


-- Paul