
Re: Lexing Unicode strings?


Johann 'Myrkraverk' Oskarsson

May 3, 2021, 7:58:22 PM
On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

> Dear c.compilers,

> For context, I have been reading the old book Compiler design in C
> by Allen Holub; available here

> https://holub.com/compiler/

> and it goes into the details of the author's own LeX implementation.


Snip.

> [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
> a sequence of bytes. UTF-8 was designed so that no character representation is a
> prefix or suffix of any other character, so it should work without having to be clever
> -John]

That's not always feasible, nor the right approach. Let's consider the
range of all lowercase Greek letters. In the source file, that range
will look something like [\xCE\xB1-\xCF\x89], and clearly the intent is
not to match the bytes \xCE, \xB1..\xCF, and \x89.
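
To make the intent concrete, here is a minimal C sketch (mine, not
Holub's; the function names are just for illustration): the range is
meant to cover the decoded code points U+03B1 through U+03C9, which at
the byte level is not one character class but the alternation
\xCE[\xB1-\xBF] | \xCF[\x80-\x89] over two-byte sequences.

#include <stdint.h>

/* Intended, code-point-level test: lowercase Greek alpha..omega. */
static int is_greek_lower(uint32_t cp)
{
    return cp >= 0x03B1 && cp <= 0x03C9;
}

/* The same test phrased over raw UTF-8 bytes (two-byte sequences only). */
static int is_greek_lower_bytes(const unsigned char *p)
{
    return (p[0] == 0xCE && p[1] >= 0xB1 && p[1] <= 0xBF) ||
           (p[0] == 0xCF && p[1] >= 0x80 && p[1] <= 0x89);
}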

There is also the question of validating the input. It seems more
natural to put the overlong-sequence validator and the legal-code-point
validator into the lexer rather than to preprocess the source file.

But maybe that's "premature optimization?"
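
For concreteness, such a validating decoder is small; the sketch below
is my own code, not Holub's, and the name utf8_decode is just
illustrative. It rejects truncated sequences, stray continuation
bytes, overlong encodings, surrogate code points, and values above
U+10FFFF.

#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at p, with n bytes available.
   Returns the sequence length and stores the code point in *out,
   or returns 0 on any of the errors listed above. */
static size_t utf8_decode(const unsigned char *p, size_t n, uint32_t *out)
{
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    if (n == 0) return 0;
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return 0;                           /* continuation or invalid lead byte */

    if (n < len) return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0; /* missing continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    if (cp < min_for_len[len]) return 0;         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */

    *out = cp;
    return len;
}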

UTF-8 has more problems too. There are bytes that can only appear after
other bytes, and detecting that early would be nice[0]; there is also the
fact that some glyphs can be constructed from either a single code point
or from two.

Also, Unicode input is not always UTF-8.

When Holub constructs his state machines for the regular
expressions, he uses bitmasks for the ranges[1]. ASCII, at 128
bits per mask, is reasonable; so is even 256 bits for 8-bit characters.

0x110000 bits per mask feels excessive to me, though.
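
The arithmetic behind that feeling, spelled out (my numbers, assuming
one bit per possible input symbol):

/* One Holub-style bitmask set, at one bit per symbol: */
#define ASCII_SET_BYTES    (128 / 8)        /*      16 bytes per class */
#define BYTE_SET_BYTES     (256 / 8)        /*      32 bytes per class */
#define UNICODE_SET_BYTES  (0x110000 / 8)   /* 139,264 bytes, ~136 KiB per class */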

I was maybe naive when I thought Unicode was considered a "solved
problem." I guess everyone is using either libraries like ICU or
run-time environments, and these can have the following "problem":
invalid input may not be signalled to the higher-level code, but is
quietly replaced with the replacement character, U+FFFD. That is, in
my opinion, not good for a compiler.

[0] This can be part of the generated lexer tables, but it is harder to
do manually when bootstrapping the lexer by hand.

[1] His implementation of sets.
--
Johann
[I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
is a prefix or suffix of any other UTF-8 encoding, you can lex them
the same way you'd lex strings of ASCII. In that example above, \xCE,
\xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
a multi-byte sequence, so if they do, you can put a wildcard . at the
end to match bogus bytes and complain about an invalid character. Dunno
what you mean about not always UTF-8; I realize there are mislabeled
files of UTF-16 that you have to sort out by sniffing the BOM at the
front, but you do that and turn whatever you're getting into UTF-8
and then feed it to the lexer.

I agree that lexing Unicode is not a solved problem, and I'm not
aware of any really good ways to limit the table sizes. -John]

gah4

May 4, 2021, 11:33:50 AM
On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson
wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:


Snip.

> > [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]

> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.

Ranges that are contiguous byte ranges in ASCII won't necessarily be
contiguous in Unicode. You need some Boolean logic in your matching,
though that shouldn't be so hard.

> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.

This reminds me of a question, though I don't remember where I saw it,
about applying Boyer-Moore to Unicode. One answer is, as John notes,
to apply it to the UTF-8 bytes. But the whole idea behind Boyer-Moore
is to use the unequal probability distribution of characters to quickly
skip over impossible matches, and the bytes of UTF-8 don't have the
same (im)probability as the individual characters.

It seems to me that you lose some of the speed advantage of Boyer-Moore,
but maybe it is still plenty fast enough.
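
For reference, a byte-oriented Boyer-Moore-Horspool search is only a
handful of lines; the sketch below is mine (bmh_search is an
illustrative name). It shows that the 256-entry skip table is the same
whatever code points the needle contains; only the quality of the
skips changes, not the table size.

#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool over raw UTF-8 bytes. */
static const unsigned char *bmh_search(const unsigned char *hay, size_t hlen,
                                       const unsigned char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i, pos;

    if (nlen == 0 || hlen < nlen) return NULL;

    /* Default skip is the needle length; bytes in the needle skip less. */
    for (i = 0; i < 256; i++) skip[i] = nlen;
    for (i = 0; i + 1 < nlen; i++) skip[needle[i]] = nlen - 1 - i;

    for (pos = 0; pos + nlen <= hlen; pos += skip[hay[pos + nlen - 1]]) {
        if (memcmp(hay + pos, needle, nlen) == 0)
            return hay + pos;
    }
    return NULL;
}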

On the other hand, Java uses a 16 bit char, and one should be able to apply
Boyer-Moore just as well in that case. The tables will be bigger, but then
so are computer memories today.

So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.

Note that in addition to having a 16-bit Unicode char, the Java language
itself is defined in terms of Unicode. Variable names can be any Unicode
letter, followed by Unicode letters and digits. Presumably, then, the
designers of Java compilers have figured this out, I suspect using the
16-bit char.

One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)

Yes, Unicode can be fun!
[Remember that Unicode is a 21-bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16-bit chars known as surrogates that make UTF-8
seem clean and beautiful. -John]
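
For reference, the surrogate arithmetic the note alludes to is small,
even if it is ugly; a sketch (utf16_combine is just an illustrative
name):

#include <stdint.h>

/* Combine a UTF-16 surrogate pair into one code point in U+10000..U+10FFFF.
   hi must be in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF (not checked here). */
static uint32_t utf16_combine(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}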

Christopher F Clark

May 4, 2021, 11:34:39 AM
I don't have much to personally add on this topic.

However, if you are considering how to compress lexer tables indexed
by Unicode code points, I would recommend you look at this paper:
https://dl.acm.org/doi/10.1145/1780.1802
by Dencker, Duerre, and Heuft
on Optimization of Parsing Tables for Portable Compilers.

They investigated the main techniques for compressing said tables,
with particular interest in ways of using "coloring" (assigning
multiple entries to the same location by indexing into a color table).
From my experience with lexing Unicode (which is admittedly quite
limited), most grammars have long sequential runs of Unicode code
points that all fall in the same set, e.g. they are alphabetic
characters, digits, operators, punctuation, etc., or disallowed, and
once grouped into those sets the actual lexing tables are mostly
compact.
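
As a sketch of the kind of "coloring" table that falls out of this
observation (my layout, not the paper's): map code points to a small
number of character classes through a two-level table, and index the
DFA transitions by class. Pages of code points that share the same
class pattern can point at the same block, which is where most of the
compression comes from.

#include <stdint.h>

enum { PAGE = 256, NPAGES = 0x110000 / 256 };   /* 4352 pages of 256 code points */

typedef struct {
    const uint8_t *page[NPAGES];   /* each entry points at 256 class ("color") bytes */
} color_map;

static inline uint8_t color_of(const color_map *m, uint32_t cp)
{
    return m->page[cp >> 8][cp & 0xFF];
}

/* The DFA step is then: next_state = transition[state][color_of(&map, cp)]; */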

Now, that doesn't necessarily help you map your UTF-8 (et al.) into
those sets, although my guess is that it is simpler than it seems, as
there is regularity there.

And, if you look at the techniques the authors presented, you can
combine one or two of them and get a method that is relatively
space- and time-efficient.

If you do some experiments and want review, critique, or suggestions,
you may either post here or email me directly. I would be interested
in a space- and time-efficient solution myself, as I intend to make my
next lexer generator for Yacc++ Unicode-aware and perhaps even
Unicode-centric.
--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris

gah4

May 4, 2021, 6:56:02 PM
On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:

(snip, I wrote)

> Note that in addition to having a 16-bit Unicode char, the Java language
> itself is defined in terms of Unicode. Variable names can be any Unicode
> letter, followed by Unicode letters and digits. Presumably, then, the
> designers of Java compilers have figured this out, I suspect using the 16-bit char.

(snip)

> Yes, Unicode can be fun!
> [Remember that Unicode is a 21-bit code and for characters outside the first 64K,
> Java's UTF-16 uses pairs of 16-bit chars known as surrogates that make UTF-8
> seem clean and beautiful. -John]

I did know that Java used 16 bits, but never tried to figure out what they did
with the rest of the characters. There should be enough in the first 64K for writing
programs.

I did once use π for a variable name, with the obvious value. It seems it is \u03c0.
I even found an editor that allowed entering such characters, and then would write
out the file with \u escapes. As far as I know, that is more usual than UTF-8.

I believe that the Java parser converts from \u escapes fairly
early, such that you can quote strings with \u0022, and then you
should be able to put \uu0022 inside the strings.

[If you're only going to allow the lower 64K, your users will be sad
when they try to use quoted strings with uncommon Chinese characters
or with emoji, or more likely your compiler will barf since they will
be encoded as two surrogate characters and your lexer won't know what
to do with them. If you're going to deal with Unicode, better bite the
bullet and deal with the whole mess. -John]

Hans Aberg

Jul 14, 2021, 3:39:27 PM
On 2021-05-04 01:58, John Levine wrote:
> [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
> is a prefix or suffix of any other UTF-8 encoding, you can lex them
> the same way you'd lex strings of ASCII. In that example above, \xCE,
> \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
> a multi-byte sequence, so if they do, you can put a wildcard . at the
> end to match bogus bytes and complain about an invalid character. Dunno
> what you mean about not always UTF-8; I realize there are mislabeled
> files of UTF-16 that you have to sort out by sniffing the BOM at the
> front, but you do that and turn whatever you're getting into UTF-8
> and then feed it to the lexer.
>
> I agree that lexing Unicode is not a solved problem, and I'm not
> aware of any really good ways to limit the table sizes. -John]

I wrote code, in Haskell and C++, that translates Unicode character
classes into byte classes. From a theoretical standpoint, a Unicode
regular language mapped under UTF-8 is a byte regular language, so it is
possible. So the tables of 2^8 = 256 byte values that Flex uses are
enough. The Flex manual has an example of how to replace its dot '.'
with a regular expression that picks up all legal UTF-8 bytes.
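
Without claiming it is the manual's exact example, a byte-level pattern
for a single well-formed UTF-8 sequence (following the RFC 3629 table)
looks like the sketch below, written here as a C string; the same byte
classes would work as a Flex definition.

/* One alternative per row of the RFC 3629 well-formed byte-sequence table. */
static const char *UTF8_CHAR_PATTERN =
    "[\\x00-\\x7F]"                            /* 1 byte: ASCII                  */
    "|[\\xC2-\\xDF][\\x80-\\xBF]"              /* 2 bytes, no overlongs          */
    "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"         /* 3 bytes, lead E0               */
    "|[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}" /* 3 bytes, general               */
    "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"         /* 3 bytes, excludes surrogates   */
    "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"      /* 4 bytes, lead F0               */
    "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"           /* 4 bytes, general               */
    "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}";     /* 4 bytes, up to U+10FFFF        */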