unicode regexps

Werner LEMBERG

Jun 20, 2005, 3:25:23 AM
to py...@googlegroups.com

Tim,


it seems to me that pyggy is not ready yet to handle UTF-8 encoded
regexps. Any plan to implement this? I would need it urgently :-)


Werner

Tim Newsham

Jun 20, 2005, 4:02:32 PM
to py...@googlegroups.com
> it seems to me that pyggy is not ready yet to handle UTF-8 encoded
> regexps. Any plan to implement this? I would need it urgently :-)

I really don't know too much about UTF-8 or Unicode. I suppose
the front end could take in characters from a file and use
Python's UTF-8 decoding functions to convert them to
a unicode string, and then run that through the lexer (or have
the lexer do this automatically). However, as it stands the
lexer tables are defined as arrays of 256 entries, one per
8-bit character. Updating the lexer generator to handle
Unicode wouldn't be a trivial undertaking.
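
To illustrate the mismatch described above, here is a minimal sketch in
present-day Python 3 (not the Python of 2005, and not pyggy code). The
sample string, `TABLE_SIZE`, and `fits_table` are hypothetical names made
up for this example; the only thing taken from the message is the
256-entry table size.

```python
# Assumption: the input arrives as raw UTF-8 bytes, as if read from a
# file opened in binary mode.
raw = "ä€λ".encode("utf-8")

# Decoding turns the byte sequence into a string of Unicode code points.
text = raw.decode("utf-8")

TABLE_SIZE = 256  # one table slot per 8-bit character, as in the message


def fits_table(values, size=TABLE_SIZE):
    """Return True if every value can index a table with `size` slots."""
    return all(0 <= v < size for v in values)


byte_values = list(raw)                 # raw bytes, each in range 0..255
code_points = [ord(ch) for ch in text]  # decoded characters

print(fits_table(byte_values))  # the undecoded bytes always fit
print(fits_table(code_points))  # decoded code points may not
```

The undecoded bytes always index a 256-entry table, while a decoded
character such as the euro sign has a code point far outside it, which is
presumably why decoding alone would not be enough.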

I will put it on my ToDo list to look at, but it definitely won't
be done "soon".

> Werner

Tim Newsham
http://www.lava.net/~newsham/

Tim Newsham

Jun 20, 2005, 4:04:53 PM
to py...@googlegroups.com
> it seems to me that pyggy is not ready yet to handle UTF-8 encoded
> regexps. Any plan to implement this? I would need it urgently :-)

It also occurs to me that you could still write a lexer for
UTF-8 characters without having support in the lexer, since
UTF-8 is just a sequence of 8-bit characters. This would probably
be more work than writing it in a lexer generator that supported
UTF-8, but might not be that hard.
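
A rough sketch of the byte-level idea, using Python 3's `re` module on
`bytes` as a stand-in for an 8-bit lexer (this is not pyggy syntax). The
pattern below is a simplified description of one UTF-8 encoded character;
it is an assumption for illustration and does not reject every invalid
sequence (overlong forms, surrogates). `UTF8_CHAR` and `split_utf8` are
hypothetical names.

```python
import re

# One UTF-8 encoded character, written purely in terms of 8-bit values:
# an ASCII byte, or a lead byte followed by 1 to 3 continuation bytes.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3}"
)


def split_utf8(data):
    """Split UTF-8 encoded bytes into one bytes object per character."""
    return UTF8_CHAR.findall(data)


print(split_utf8("aä€".encode("utf-8")))
```

Each rule only ever looks at values in the range 0..255, so in principle
the same rules could be expressed in a lexer whose tables have 256 entries.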

Werner LEMBERG

Jun 20, 2005, 5:50:27 PM
to py...@googlegroups.com, new...@lava.net

> It also occurs to me that you could still write a lexer for UTF-8
> characters without having support in the lexer, since UTF-8 is just
> a sequence of 8-bit characters. This would probably be more work
> than writing it in a lexer generator that supported UTF-8, but might
> not be that hard.

I think so too, but it becomes *very* ugly. For example, instead of
using

[äöü]

I would have to write

\xc3(\xa4|\xb6|\xbc)

or something similar. Maybe you can provide a function which does
this conversion automatically.


Werner
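
The conversion asked for above could look roughly like the following
Python 3 sketch. `char_class_to_bytes_regex` is a hypothetical helper, not
part of pyggy, and it assumes the target lexer accepts `\xNN` escapes,
grouping, and `|` as in the hand-written example. It does not factor out
the shared `\xc3` prefix, so its output is longer than the hand-written
form but matches the same byte sequences.

```python
import re


def char_class_to_bytes_regex(chars):
    """Return a byte-level alternation matching any one of `chars`.

    Each character is encoded as UTF-8 and spelled out as \\xNN escapes.
    """
    alternatives = []
    for ch in chars:
        encoded = ch.encode("utf-8")
        alternatives.append("".join("\\x%02x" % b for b in encoded))
    return "(" + "|".join(alternatives) + ")"


pattern = char_class_to_bytes_regex("äöü")
print(pattern)

# Sanity check with Python's own byte-oriented regex engine.
compiled = re.compile(pattern.encode("ascii"))
print(bool(compiled.fullmatch("ö".encode("utf-8"))))
```

A fuller version would also need to handle ranges such as `[a-ü]` and
negated classes, which this sketch leaves out.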