thanks,
Yaron
[Yacc and its clones parse tokens, not characters, so they shouldn't
be a problem, give or take nits like passing non-ASCII strings through
C action routines correctly. Lex or flex is harder, since all of the
implementations I know of use the character codes as indexes into
tables to implement the lex state machine. But if you do that for
Unicode, you'll have 64K-entry tables rather than 256-entry tables and
severe program bloat. I believe that Plan 9 has a Unicode lex,
presumably with some hackery to keep the table sizes down. -John]
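To put rough numbers on that: a lex-style DFA keeps one next-state
entry per state per input code, so the tables scale with the alphabet
size. A minimal sketch, with an invented state count:

    /* One next-state entry per state per input code; the 500 states
       are a made-up figure, purely for the arithmetic. */
    #define NSTATES 500
    short next_ascii[NSTATES][256];   /* 500 * 256 * 2   =    256,000 bytes */
    short next_ucs2[NSTATES][65536];  /* 500 * 65536 * 2 = 65,536,000 bytes */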
> [... Lex or flex is harder since all of
> the implementations I know of use the character codes as indexes into
> tables to implement the lex state machine. But if you do that for
> Unicode, you'll have 64K entry tables rather than 256 entry tables and
> severe program bloat. I believe that plan 9 has a Unicode lex,
> presumably with some hackery to keep the table sizes down -John]
In LPM's lexical scanner, which can be compiled in Unicode mode, I got
around the huge table problem by using a table for everything less
than UCHAR_MAX, and allowing codes greater than that only in clause
parameters. This is only possible because the clause is the atom in
LPM, rather than the character. This means that the RE:

    lexicographic*[a-z]+

is expressed in LPM as:

    %'lexicographi'$ %*'c'$ %'a-z'#
This works in LPM because, during the match, each clause is matched
against the stream by the engine. The character class [a-z]+ above,
when compiled in standard mode, builds a table that is always
UCHAR_MAX entries big; in Unicode mode it behaves differently, since
table construction for character classes could otherwise result in
huge tables.
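A minimal sketch of that hybrid scheme, under the assumption that a
class stores explicit ranges for the large codes (this is not LPM's
actual code; the names and layout are invented):

    #include <limits.h>

    /* Hypothetical hybrid character class: a bitmap covers codes up
       to UCHAR_MAX; anything larger is checked against explicit
       ranges stored with the clause, so no 64K table is ever built. */
    struct cclass {
        unsigned char bitmap[UCHAR_MAX + 1];       /* small codes: lookup */
        struct range { unsigned lo, hi; } *ranges; /* large codes: scan   */
        int nranges;
    };

    static int cclass_match(const struct cclass *cc, unsigned c)
    {
        int i;

        if (c <= UCHAR_MAX)
            return cc->bitmap[c];
        for (i = 0; i < cc->nranges; i++)
            if (c >= cc->ranges[i].lo && c <= cc->ranges[i].hi)
                return 1;
        return 0;
    }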
--
Quinn Tyler Jackson
http://www.qtj.net/~quinn/
Plan 9 Unicode lex: no, I'm afraid, for a reason related to the one
John mentioned: no one developed the energy to be clever about this
program. The BUGS section (even now) says

    Cannot handle UTF.
The asteroid to kill this dinosaur is still in orbit.
Dennis
Then feed that to Lex. You'll need to convert your input codepage
into UTF-8 before handing it to your lexer, but that's not so tough.
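Presumably the token patterns themselves are also recoded as regular
expressions over UTF-8 bytes; for instance (hand-derived, flex-style
\x escapes, and deliberately loose in that they admit overlong forms
and surrogates):

    U+0000-U+007F    [\x00-\x7F]
    U+0080-U+07FF    [\xC2-\xDF][\x80-\xBF]
    U+0800-U+FFFF    [\xE0-\xEF][\x80-\xBF][\x80-\xBF]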
Now, if you have a Lex that can handle *huuuuge* regular expressions,
it'll just *work*.
If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-) about not trying to build optimal lexing
automata, you can pre-process your regular expression to factor out
lots of common cases.
E.g., in the 'Name' lexical class in the XML spec, there are 35K
different characters. If you apply a few factorization
transformations to the naive 35K-way disjunction, you can get it down
to 107 branches, each of which, admittedly, has a hundred or so
alternatives of its own. But in the process, you remove a lot of
opportunity for state explosion.
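The factorization in question is ordinary common-prefix extraction
over the encoded bytes. A toy sketch of one round of it in C (the
byte values are invented; real code would recurse on longer prefixes):

    #include <stdio.h>

    /* Toy common-prefix factoring: given sorted two-byte sequences,
       emit one alternative per distinct lead byte instead of one
       alternative per sequence. */
    static const unsigned char seqs[][2] = {
        {0xC3,0x80},{0xC3,0x81},{0xC3,0x82},{0xC4,0x80},{0xC4,0x81},
    };

    int main(void)
    {
        int n = (int)(sizeof seqs / sizeof seqs[0]);
        int i = 0;

        while (i < n) {
            int j = i;
            printf("%s\\x%02X(", i ? "|" : "", seqs[i][0]);
            while (j < n && seqs[j][0] == seqs[i][0]) {
                printf("%s\\x%02X", j > i ? "|" : "", seqs[j][1]);
                j++;
            }
            printf(")");
            i = j;
        }
        printf("\n");
        /* prints: \xC3(\x80|\x81|\x82)|\xC4(\x80|\x81)
           instead of the flat five-way disjunction */
        return 0;
    }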
Now, I don't claim that this is pretty, or that it will work every
time. But it certainly got me past all the lexical issues in XML, and
I was able to use (CAML) Lex/Yacc to do it all.
--chet--
[Wow, that's gross. But it's a good idea. -John]
One way to do this lexing work could be the following:
- use a static algorithm that analyzes all your regular expressions to
characterize sets of characters that are obviously equivalent (always
seen together and never apart; don't forget to OR together the REs
that are exactly of length 1)
- compute a correspondence table where table[UNICODECHAR] =
[representative char of its set] (or optimize it with a two-level
lookup, i.e. treat each packet of 256 chars as a row, and when other
rows behave identically, don't duplicate them, just index the shared
one; see the sketch just below)
- then use the representative chars, which fit in an 8-bit charset, in
lex, and do your lexing work as usual (but don't use the text returned
by lex; use the text of the same length from the real stream of
characters)
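A minimal sketch of the two-level table mentioned in the second step
(page construction and sharing are omitted; the names are invented):

    /* Two-level correspondence table: the 64K code space is split
       into 256 pages of 256 chars each; pages that behave identically
       are stored once and shared.  Filling in the pages is omitted. */
    static unsigned char *page[256];

    static unsigned char representative(unsigned c)  /* c < 0x10000 */
    {
        return page[c >> 8][c & 0xFF];
    }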
Example:
if you have:

    [a-fA-F][a-z]

the sets are:

    [a-f]  [g-z]  [A-F]
      a      g      A

and the above expression becomes: [aA][ag]
(the original [a-z] spans two sets, so it keeps both representatives).
If the chars in the top expression are Unicode, this works too.
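A minimal sketch of the set computation for this example: two chars
belong to the same set exactly when they appear in the same classes,
so a class-membership bitmask works as a signature:

    #include <stdio.h>

    #define NCLASSES 2

    /* Membership tests for the example's classes: [a-fA-F] and [a-z]. */
    static int in_class(int k, int c)
    {
        switch (k) {
        case 0: return (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
        case 1: return  c >= 'a' && c <= 'z';
        }
        return 0;
    }

    int main(void)
    {
        int rep_of_sig[1 << NCLASSES];
        int table[128];                  /* table[c] = representative */
        int c, k;

        for (k = 0; k < (1 << NCLASSES); k++)
            rep_of_sig[k] = -1;
        for (c = 0; c < 128; c++) {
            int sig = 0;
            for (k = 0; k < NCLASSES; k++)
                if (in_class(k, c))
                    sig |= 1 << k;
            if (rep_of_sig[sig] < 0)     /* first char with this       */
                rep_of_sig[sig] = c;     /* signature is representative */
            table[c] = rep_of_sig[sig];
        }
        printf("%c %c %c\n", table['b'], table['h'], table['B']); /* a g A */
        return 0;
    }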
I know it's not very easy, because you first have to write a regular
expression parser (but that's quite easy to do) and then rewrite the
expressions, but all this stuff can be done easily.
(I'm personally working on an equivalent of Lex/Yacc, and I'm thinking
of using this technique; if you find it foolish, please tell me. :)
Armel