Lexing Unicode strings?

Johann 'Myrkraverk' Oskarsson

Apr 21, 2021, 12:38:25 PM

Dear c.compilers,

For context, I have been reading the old book Compiler Design in C
by Allen Holub, available here

https://holub.com/compiler/

and it goes into the details of the author's own LeX implementation.

Just like the dragon book [which I admit I haven't read for some number
of years], this uses lookup tables indexed by individual characters, which
is fine for ASCII but seems excessive for the 0x110000 code points
(0 through 0x10FFFF) of Unicode.

I am interested in this, using plain old C, without using external tools
like ICU, for my own reasons[1]. What data structures are appropriate
for this exercise? Are there resources out there I can study, other
than the ICU source code? [Which for other reasons of my own, I'm not
too keen on studying.]

[1] Let's leave aside the question of whether I'll be successful.

Thanks,
--
Johann
[The obvious approach if you're scanning UTF-8 text is to keep treating the
input as a sequence of bytes. UTF-8 was designed so that no character's
encoding is a prefix or suffix of any other character's encoding, so it
should work without having to be clever. -John]
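
As a rough illustration of the byte-oriented approach, here is a minimal sketch in plain C99. The names (scan_ident, byte_class, next_state) and the identifier rule itself are assumptions made up for this example, not taken from Holub's LeX or any other tool: an identifier starts with an ASCII letter, an underscore, or any multi-byte UTF-8 sequence, and continues with those or ASCII digits. The DFA only ever sees bytes, so its table is indexed by a handful of byte classes rather than by code point. It checks lead and continuation byte structure only; it does not reject every ill-formed sequence (for example overlong three-byte forms or encoded surrogates), and it makes no attempt to decide which non-ASCII code points are really letters.

```c
#include <assert.h>
#include <stddef.h>

/* Byte classes: the DFA is indexed by these, never by code point. */
enum { C_OTHER, C_ALPHA, C_DIGIT, C_CONT, C_LEAD2, C_LEAD3, C_LEAD4, NCLASS };

/* DFA states. S_ERR is 0 so that unlisted table entries mean "no transition". */
enum { S_ERR, S_START, S_IDENT, S_NEED1, S_NEED2, S_NEED3, NSTATE };

/* Classify one input byte (assumes an ASCII-compatible execution charset). */
static int byte_class(unsigned char b)
{
    if ((b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') || b == '_')
        return C_ALPHA;
    if (b >= '0' && b <= '9') return C_DIGIT;
    if (b >= 0x80 && b <= 0xBF) return C_CONT;   /* continuation byte   */
    if (b >= 0xC2 && b <= 0xDF) return C_LEAD2;  /* starts 2-byte form  */
    if (b >= 0xE0 && b <= 0xEF) return C_LEAD3;  /* starts 3-byte form  */
    if (b >= 0xF0 && b <= 0xF4) return C_LEAD4;  /* starts 4-byte form  */
    return C_OTHER;
}

/* Transition table: NSTATE x NCLASS entries, regardless of Unicode's size. */
static const unsigned char next_state[NSTATE][NCLASS] = {
    [S_START] = { [C_ALPHA] = S_IDENT,
                  [C_LEAD2] = S_NEED1, [C_LEAD3] = S_NEED2, [C_LEAD4] = S_NEED3 },
    [S_IDENT] = { [C_ALPHA] = S_IDENT, [C_DIGIT] = S_IDENT,
                  [C_LEAD2] = S_NEED1, [C_LEAD3] = S_NEED2, [C_LEAD4] = S_NEED3 },
    [S_NEED1] = { [C_CONT] = S_IDENT },  /* one continuation byte left  */
    [S_NEED2] = { [C_CONT] = S_NEED1 },  /* two continuation bytes left */
    [S_NEED3] = { [C_CONT] = S_NEED2 },  /* three continuation bytes left */
};

/* Return the length in bytes of the longest identifier at the start of s,
 * or 0 if s does not begin with one. */
size_t scan_ident(const char *s)
{
    assert(s != NULL);
    int state = S_START;
    size_t accepted = 0;  /* end of the last complete match seen so far */

    for (size_t i = 0; s[i] != '\0'; i++) {
        state = next_state[state][byte_class((unsigned char)s[i])];
        if (state == S_ERR)
            break;
        if (state == S_IDENT)  /* only S_IDENT is an accepting state */
            accepted = i + 1;
    }
    return accepted;
}
```

Because a match is only recorded in S_IDENT, a multi-byte sequence that is cut off partway is never included in the token: scanning "a\xC3" yields a length of 1. Restricting non-ASCII identifiers to particular Unicode categories would still need some per-code-point data, which this sketch deliberately leaves out.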
