1. Allowing sets of Unicode characters to be referenced by predicate/
subtype. Assume, for example, that IDs in your language look like
idname, that they have to start with (any) Unicode letter character,
and that they can continue with any number of Unicode letter-
characters, Unicode digit-characters, or underscore.
TOKEN: {
// $name
< ID: <UNICODE_ID_LETTER> ( <UNICODE_ID_LETTER>
| <UNICODE_DIGIT>
| "_"
)*
>
......
| < #UNICODE_ID_LETTER: [ "A"-"Z", "a"-"z" , etc, etc, etc ] >
| < #UNICODE_DIGIT: [ "0"-"9", etc, etc, etc ] >
}
One problem is that, even with ranges, listing out all the letters
and digits is tedious and error-prone.
The other problem is that when new versions of Unicode come out, with
new letters and digits, I don't really want to go back into my
tokenizer to add them.
Ideally, I'd like JavaCC/FreeCC to support reference to
UNICODE.letter and UNICODE.digit (or something like that), which
would cover all the characters officially classified as "letters" and
"digits", respectively, in the version of Unicode being used.
2. Allowing subtraction in character sets, e.g. to match all ASCII
letters, except M and m, one might write something like
TOKEN: {
< #ASCII_LETTER_EXCEPT_M: [ "A"-"Z", "a"-"z" ] - ("M" | "m") >
}
Any chance of allowing these tricks in FreeCC?
Thanks,
Ken
******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT
84054 USA
It's another good distinguishing feature, right? "Handles Unicode
characters correctly out-of-the-box". :-)
Attila.
--
home: http://www.szegedi.org
twitter: http://twitter.com/szegedi
weblog: http://constc.blogspot.com
On 18 Nov 2008, at 23:33, Jonathan Revusky wrote:
>
> <snip>
>>
>>
>>>
>>> BTW, I ran across your paper on using unicode in JavaCC, and I think
>>> you may be wrong on a certain point. IIRC, you state that JavaCC will
>>> handle unicode correctly if you instantiate your parser with a Reader,
>>> and that the UNICODE_INPUT option is unnecessary. I don't think that's
>>> true, though maybe it is supposed to be. I am pretty sure that you
>>> need UNICODE_INPUT=true for a parser to handle unicode correctly.
>>
Ken Beesley responded
Many thanks for the clarification. This is finally making sense.
In any non-trivial testing, I had probably set the external option variable
JAVA_UNICODE_ESCAPE = true
which means that the internal "convenience variable" UNICODE_INPUT was
being forced to true automatically.
So I wouldn't have seen any difference setting the external
UNICODE_INPUT option to true vs. false.
>
>
>
>>
>> Tom Copeland wrote (29 Dec 2006) "Anyhow, I can't think of a reason
>> when you would use UNICODE_INPUT rather than using a Reader with the
>> proper character encoding. If any of the real gurus want to weigh in
>> to correct me here that'd be great..."
>
> Copeland is quite confused. He talks of using UNICODE_INPUT=true and
> using a Reader with the proper encoding as if they are two alternative
> solutions. But that's not the case at all. Even if you have
> UNICODE_INPUT=true, you have to be using a Reader with the appropriate
> encoding (Of course!!!). And, as I point out above, if you use a Reader
> with the proper encoding so that your input is a sequence of 16-bit
> unicode characters, your JavaCC-built parser will not do the right thing
> with it unless you have set UNICODE_INPUT=true.
or
JAVA_UNICODE_ESCAPE = true
which is probably what I had specified
Yikes. Not good. I share your opinion that the attempted
optimization-for-ASCII is probably unwise and unnecessary.
And making such an optimization-for-ASCII the default, in a modern
Java environment, is evil. It takes Java's commendable dedication to
handling Unicode and emasculates it.
Full handling of Unicode input should be the default in JavaCC/FreeCC.
If supplied at all, such an optimization-for-ASCII should be
an explicit user option, e.g.
HANDLE_ONLY_ASCII_INPUT = true
(default false)
but even this is potentially confusing and probably unwise.
Best,
Ken
Yes, I see (I think) that the full generalized solution is something
like this, where you instantiate various RCharacterList objects for
lower case letter, etcetera, based on a configurable UnicodeData.txt
file. However, I don't think there is a need to solve the full
generalized problem initially. Getting set difference implemented and
getting the aliases like \Ll or whatever working are two things that can
be attacked separately. And then fully generalizing the solution could
then be attacked in turn.
>
> It's better to look at how native Java regular expressions and Unicode
> regular expressions refer to Unicode character types.
> Inside Java regular expressions, you can match whole Unicode
> categories of characters, e.g.
> \p{L} matches a single Unicode letter character. While \P{L}, with a
> big P, matches any single character that is
> _not_ a letter character. Similarly \p{Ll} matches any lowercase
> letter, and \p{Lu} matches any
> uppercase letter. There are many such codes to refer to official
> Unicode character types and subtypes, documented
> in
>
> http://www.unicode.org/reports/tr18/#Categories
> http://www.regular-expressions.info/unicode.html
>
> It would be useful to have these character-type-matching-codes exposed
> inside JavaCC itself, but
> that would probably require a hard-core Java guru.
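For reference, here is a minimal Java sketch of the java.util.regex property classes described above. The class name UnicodeIdRegex and the sample strings are made up for illustration. The ID pattern mirrors the letter/digit/underscore rule from the first message, and the `&&[^...]` intersection is java.util.regex's own spelling of character-class subtraction, not something JavaCC or FreeCC accepts.

```java
import java.util.regex.Pattern;

class UnicodeIdRegex {
    // One Unicode letter, then any number of letters, decimal digits or underscores.
    static final Pattern ID = Pattern.compile("\\p{L}[\\p{L}\\p{Nd}_]*");

    // ASCII letters minus M and m, written as an intersection with a negated class.
    static final Pattern ASCII_LETTER_EXCEPT_M = Pattern.compile("[A-Za-z&&[^Mm]]");

    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    static boolean isLetterExceptM(String s) {
        return ASCII_LETTER_EXCEPT_M.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("gr\u00F6\u00DFe_1"));   // true: non-ASCII letters are accepted
        System.out.println(isId("1abc"));                // false: must start with a letter
        System.out.println(isLetterExceptM("L"));        // true
        System.out.println(isLetterExceptM("M"));        // false
    }
}
```

The property tables behind \p{L} and \p{Nd} come from the JDK, so they follow whatever Unicode version the running Java release supports.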
Well, Ken, you are interested enough in the whole topic to investigate
these matters. As for the notion that you need to be a hard-core Java
guru to muck with this stuff,... well... I just looked into this (in the
last few days or so which is why my response to this message is so late)
and it doesn't look too hard to beat this stuff into shape. The
necessary changes would be fairly localized AFAICS.
Anyway, just bear with me and I'll explain what I know (or think I know)
about this:
Now, if you look here, this is the actual object that is used to span a
range of characters (or multiple ranges of characters actually.)
Now, if you look over it, you see that it mostly contains a few
statically hard-coded tables that deal with upper/lower case conversions
in Unicode. But the actual object instance is quite simple really. I
mean, the instance data for an RCharacterList instance really is stored
basically in the descriptors List, defined on line 55. Now, in Eclipse,
you have this command to see all the points where a variable or method
is used (particularly handy in something like the JavaCC codebase which
tends to define a lot of public variables). Anyway, a bit of
investigation (more than should be necessary if the code were really
well structured) and you figure out that the descriptors List can only
contain two types of Object, SingleCharacter and CharacterRange. You can
see these here:
http://code.google.com/p/freecc/source/browse/trunk/freecc/src/java/org/visigoths/freecc/lexgen/SingleCharacter.java
http://code.google.com/p/freecc/source/browse/trunk/freecc/src/java/org/visigoths/freecc/lexgen/CharacterRange.java
So, basically, set difference is not really that hard. You simply run
over the descriptors list and make adjustments based on the rhs of
the set subtraction. Now, if the descriptor object is just a
SingleCharacter, then you remove it or leave it depending on whether it
is in the range to be subtracted, right? A CharacterRange object is not
actually that much more difficult. For example, if we have the single
character range ("A"-"Z") and we want to remove "M"-"O" from it, we
would have to break the CharacterRange object into the two subranges
("A"-"L") and ("P"-"Z") and those two CharacterRange objects would
replace the single ("A"-"Z") element in the descriptors List.
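The splitting step described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Range class below is a hypothetical stand-in for FreeCC's SingleCharacter and CharacterRange descriptors (a single character is just a range with equal bounds), and subtract() is an invented name, not an existing FreeCC method. Case-insensitivity and negation are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeSubtraction {
    /** Hypothetical descriptor: an inclusive range of chars (lo == hi for a single character). */
    static final class Range {
        final char lo, hi;
        Range(char lo, char hi) { this.lo = lo; this.hi = hi; }
        @Override public String toString() {
            return lo == hi ? String.valueOf(lo) : lo + "-" + hi;
        }
    }

    /** Returns the descriptors with every character in 'cut' removed. */
    static List<Range> subtract(List<Range> descriptors, Range cut) {
        List<Range> result = new ArrayList<>();
        for (Range r : descriptors) {
            if (cut.hi < r.lo || cut.lo > r.hi) {
                result.add(r);                       // no overlap: keep unchanged
                continue;
            }
            if (r.lo < cut.lo) {                     // piece left of the cut survives
                result.add(new Range(r.lo, (char) (cut.lo - 1)));
            }
            if (r.hi > cut.hi) {                     // piece right of the cut survives
                result.add(new Range((char) (cut.hi + 1), r.hi));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> upper = Arrays.asList(new Range('A', 'Z'));
        System.out.println(subtract(upper, new Range('M', 'O')));   // [A-L, P-Z]

        List<Range> letters = Arrays.asList(new Range('A', 'Z'), new Range('a', 'z'));
        List<Range> noM = subtract(subtract(letters, new Range('M', 'M')), new Range('m', 'm'));
        System.out.println(noM);                                    // [A-L, N-Z, a-l, n-z]
    }
}
```

Subtracting a whole set is then a matter of calling subtract() once per descriptor on the right-hand side.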
Of course, case sensitivity and negation are the extra wrinkles. (And
you could, just to get something working, resolve not to solve those for
now, but to get the absolutely most simple case working...) But anyway,
as long as you can munge a CharacterRange object into a set of
CharacterRange objects that represent the difference, it really seems
that all the existing machinery should just continue to work.
Of course, the other matter is what notation is to be used for set
difference, since I guess the straight minus sign is being used to
specify a range of characters. Well, maybe backslash. Or maybe it can be
reused without ambiguity. I haven't looked into it sufficiently.
>
>
>>>
>>> 2. Allowing subtraction in character sets, e.g. to match all ASCII
>>> letters, except M and m, one might write something like
>>>
>>> TOKEN: {
>>>
>>> < #ASCII_LETTER_EXCEPT_M: [ "A"-"Z", "a"-"z" ] - ("M" | "m") >
>>>
>>> }
>>>
>>> Any chance of allowing these tricks in FreeCC?
Well, as I say, the above does not look particularly hard to implement.
The lhs in the set difference, ["A"-"Z", "a"-"z"], is reified as an
RCharacterList object whose descriptors variable contains two
CharacterRange objects. Changing this into the same internal
representation as ["A"-"L", "N"-"Z", "a"-"l", "n"-"z"] is quite
doable. It is not proverbial rocket science.
>> Well, sure, I think this kind of thing would be quite useful. The only
>> thing is that, if I'm the only person working on this, then I don't
>> know when I'll turn my attention to this. Also, the whole lexer
>> generation piece is where I have the biggest holes in my understanding
>> of how the tool works. I have been filling this in a bit. I have read
>> up somewhat on the NFA->DFA algorithm type stuff.
>
> I fear that I don't have the skills or the time to dig down into the
> lexer-syntax-and-generation code.
> Limited subtraction of character sets inside regular expressions is
> allowed in "Unicode
> Regular Expressions"
>
> http://www.unicode.org/reports/tr18/#Subtraction_and_Intersection
Yes, I see that they use a double-minus operator -- for set subtraction.
That looks quite doable.
>
> Most computer-language implementations of "regular expressions" don't
> support subtraction.
Well, I'll take your word on that. I'm not exactly a regexp virtuoso. I
do use them here and there, but probably I master a pretty pathetic subset.
>
> *****
>
> There's another issue, and that is how to enhance JavaCC to allow the
> lexer to refer to
> Unicode supplementary characters (beyond the Basic Multilingual
> Plane). The basic problem
> may still lie in Java itself. I'm not completely up to date with
> Java's latest handling of Unicode, but
> earlier versions of Java kind of made a mess of supplementary
> characters (in my opinion).
Well, Ken, at first blush, it does seem that you investigated the whole
topic in enough depth to form an opinion. (Or however deeply is
necessary to conclude that whichever gurus or whoever it was bungled
whatever the thing is... :-))
>
>
>>
>> But the best chance for this stuff happening reasonably soon is if
>> somebody shows up who wants to take ownership of that piece. You could
>> be interested. If you ever tried to get into the JavaCC codebase and
>> basically ran away in fright, I think that you'd find that the FreeCC
>> code is much more amenable to getting involved. It's been heavily
>> refactored and cleaned up. With all the actual generation of java code
>> now in external templates, the java codebase is much smaller and
>> cleaner.
>
> I wish I could dig into the lexer and make things better, but I know
> when something
> is beyond my skills.
Yeah, okay.... but...
....the inescapable fact does remain that any specific thing is going to
be beyond your skills .... I mean, as a practical matter, whether it
really is beyond your skills or not .... IF ... you talk yourself out of
it. (I really hope that doesn't come off as condescending or anything.
Maybe it could, it's not the intention. When I say "this is an
inescapable fact etcetera", I don't mean it in any personal way really.
IMO, this really is just a cold, hard fact....)
Now, neither you nor anybody here owes me anything, but here's the pitch:
Basically, I could paraphrase the above as making the case that it
really may not be more work to get some of this stuff actually
implemented in code than it was, for example, to write the essay you
wrote as a result of picking the brains of some self-styled JavaCC
gurus. But wouldn't you get a much better sense of satisfaction and
closure out of actually implementing the feature?
And, BTW, I don't think that implementing the aliases like \l for a
unicode letter and so on is very hard either. AFAICS, all you need is to
have some predefined RCharacterList instances that you can fish out --
(probably clone the canonical instance on demand and then use them and
throw them away...)
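One way such predefined instances might be populated, sketched in Java under loud assumptions: plain {lo, hi} int pairs stand in for RCharacterList and CharacterRange, the names CategoryRanges, rangesFor and describe are invented here, and the category data comes from the JDK's own Character predicates rather than from a UnicodeData.txt file. The tables therefore track whatever Unicode version the running JVM supports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class CategoryRanges {
    /** Collapses the code points in [0, max] accepted by the predicate into inclusive {lo, hi} runs. */
    static List<int[]> rangesFor(IntPredicate member, int max) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;                               // -1 means "not inside a run"
        for (int cp = 0; cp <= max; cp++) {
            boolean in = member.test(cp);
            if (in && start < 0) {
                start = cp;                           // a run begins
            } else if (!in && start >= 0) {
                ranges.add(new int[] {start, cp - 1}); // a run ends
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, max});       // run reaches the upper bound
        }
        return ranges;
    }

    /** Formats ranges as space-separated hex pairs, e.g. "0041-005A 0061-007A". */
    static String describe(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            sb.append(String.format("%04X-%04X ", r[0], r[1]));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe(rangesFor(Character::isLetter, 0x7F)));  // 0041-005A 0061-007A
        System.out.println(describe(rangesFor(Character::isDigit, 0x7F)));   // 0030-0039
        System.out.println(rangesFor(Character::isLetter, 0xFFFF).size() + " letter ranges in the BMP");
    }
}
```

A canonical list per alias could be computed once this way and cloned on demand, as suggested above.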
Regards,
JR
>
> Kenneth Reid Beesley wrote:
>> *****
>>
>> There's another issue, and that is how to enhance JavaCC to allow the
>> lexer to refer to
>> Unicode supplementary characters (beyond the Basic Multilingual
>> Plane). The basic problem
>> may still lie in Java itself. I'm not completely up to date with
>> Java's latest handling of Unicode, but
>> earlier versions of Java kind of made a mess of supplementary
>> characters (in my opinion).
It is easy to get confused here. Saying that the internal
representation of Java strings is "Unicode" is an imprecise
oversimplification. An individual "char" in Java is simply an unsigned
16 bit value. As for java.lang.String, the sequences of those 16-bit
values are specifically to be interpreted as UTF-16. This means that
"only" the characters from the Basic Multilingual Plane (BMP) of
Unicode can be encoded in a single "char" value. In most practical
applications, this ain't much of a limitation, though, as up to
Unicode 3.0, all characters are in the BMP, that's why I put "only" in
quotes :-)
Unicode 3.1 does define some ~45000 new characters in other planes,
though, so if you'd wish to express those, you could, but they would
need to be represented as two "char" values (and would throw off
the indexOf(), length() etc. methods as they'd no longer operate on
Unicode characters, but rather on UTF-16 encoding units). I.e. U+10000
is encoded as U+D800,U+DC00 in UTF-16.
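The arithmetic behind that encoding, and its effect on length(), can be checked with a short Java sketch. The class and method names (SurrogateMath, high, low) are illustrative only; the formulas are the standard UTF-16 ones, and String.codePointCount is available from Java 5 on.

```java
class SurrogateMath {
    /** UTF-16 high (lead) surrogate for a supplementary code point (U+10000 and above). */
    static char high(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    /** UTF-16 low (trail) surrogate for a supplementary code point. */
    static char low(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x10000;
        String s = new String(new char[] {high(cp), low(cp)});
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) high(cp), (int) low(cp)); // U+10000 -> D800 DC00
        System.out.println(s.length());                      // 2: UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1: Unicode characters
    }
}
```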
So, it can be said that java.lang.String uses UTF-16 encoding, and
that nicely maps exactly one "char" value to exactly one Unicode
character for all Unicode 3.0 characters.
Now, for another, completely independent issue:
To make matters even more interesting, both composed and decomposed
variants for characters are allowed, i.e. "ffi" can be equally
expressed as U+0066,U+0066,U+0069 or as the U+FB03 ligature. This is
not a Java problem, it's a trait of the Unicode standard. Sun finally
added java.text.Normalizer for Java 6 that allows you to obtain
various normalized decompositions/compositions of your Unicode
sequences. It is probably a good idea to pass your char sequence
through it before letting it be analyzed by a lexer; it'll definitely
make the lexer rules easier.
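A minimal sketch of that pre-lexing step with java.text.Normalizer (Java 6 and later). The class and method names are invented for illustration. One detail worth knowing: the U+FB03 ligature is only a compatibility equivalent of "ffi", so the canonical forms (NFC/NFD) leave it alone and only NFKC/NFKD fold it into three letters.

```java
import java.text.Normalizer;

class NormalizeBeforeLexing {
    /** Canonical composition: combining sequences become precomposed characters where possible. */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition: additionally folds ligatures and similar presentation forms. */
    static String compat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";                 // 'e' followed by a combining acute accent
        System.out.println(canonical(decomposed).equals("\u00E9")); // true

        String ligature = "\uFB03";                    // LATIN SMALL LIGATURE FFI
        System.out.println(canonical(ligature).equals(ligature));   // true: NFC keeps the ligature
        System.out.println(compat(ligature));                        // ffi
    }
}
```

Which form to apply before lexing is a language-design decision; NFKC changes more of the input than NFC does.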
If you need this functionality prior to Java 6, your best bet is IBM's
"International Components for Unicode" library:
<http://www-01.ibm.com/software/globalization/icu/index.jsp>
(I've been using it for years).
So, to summarize:
- you can run your input char stream through a Unicode normalizer in
order to not have to worry about composed/decomposed variants in
downstream processing
- as long as you're comfortable not having to handle Unicode 3.1
input, you also needn't worry about the fact that java.lang.String is
actually a UTF-16 encoded text representation. (I know, that's like
saying "as long as you're comfortable not having to handle non-ASCII
input, you needn't know what's UTF-8" :-) )
Attila.
> It is easy to get confused here. Saying that the internal
> representation of Java strings is "Unicode" is an imprecise
> oversimplification. An individual "char" in Java is simply an unsigned
> 16 bit value. As for java.lang.String, the sequences of those 16-bit
> values are specifically to be interpreted as UTF-16. This means that
> "only" the characters from the Basic Multilingual Plane (BMP) of
> Unicode can be encoded in a single "char" value. In most practical
> applications, this ain't much of a limitation, though, as up to
> Unicode 3.0, all characters are in the BMP, that's why I put "only" in
> quotes :-)
Attila,
Thanks for the overview, but I understand this stuff pretty well.
From the beginning,
Java was strongly devoted to the original 16-bit vision of Unicode,
but when supplementary characters were
introduced (theoretically with 2.0 in 1996 and de facto with 3.1 in
2001) it stumbled badly (in my opinion).
>
>
> Unicode 3.1 does define some ~45000 new characters in other planes,
> though, so if you'd wish to express those, you could, but they would
> need to be represented as two "char" values (and would throw off
> the indexOf(), length() etc. methods as they'd no longer operate on
> Unicode characters, but rather on UTF-16 encoding units). I.e. U+10000
> is encoded as U+D800,U+DC00 in UTF-16.
Yes, this is what I find disappointing about the Java implementation.
Other languages have,
in my opinion, evolved to handle these supplementary characters much
better. For example:
The internal encoding of Perl unicode strings happens to be UTF-8, but
that's well
hidden from you. Perl Unicode strings can be visualized by the
programmer as sequences of Unicode
Characters, and indexing and length (and looping through the
characters in a string) work intuitively.
In Python, you get similar intuitive indexing/length/looping behavior
with a "ucs4" build, where strings are technically stored in UTF-32.
Java's internal representation of Strings is UTF-16, but it's not
hidden from you;
the programmer has to worry constantly about the Unicode character vs.
char gap. (And you get similar
problems with a 'ucs2' build of Python.)
In Perl code, you can represent any Unicode character (including
supplementary characters) with the \x{H....} notation.
In Python code, you can represent BMP characters with \uHHHH and
supplementary characters with \UHHHHHHHH.
The last time I looked (and I haven't looked closely at Java 6) Java
still had no convenient way to refer to a supplementary character in
code.
You have to calculate the two surrogate values and enter them
literally as \uHHHH\uHHHH. That's pretty lame (in my opinion).
Please update me as necessary.
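For what it is worth, a small Java sketch of what the standard library does offer here. As far as I know there is still no single-escape source literal for a supplementary character, but since Java 5 a code point can be supplied as an int at run time, so the surrogates need not be computed by hand. The class name and the SHAVIAN_* constants are illustrative; the range values are the ones quoted in this thread.

```java
class SupplementaryLiterals {
    // Shavian block boundaries as quoted in this thread.
    static final int SHAVIAN_FIRST = 0x10450;
    static final int SHAVIAN_LAST = 0x1047F;

    /** Builds a String holding one code point, BMP or supplementary. */
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    static boolean isShavian(int cp) {
        return cp >= SHAVIAN_FIRST && cp <= SHAVIAN_LAST;
    }

    public static void main(String[] args) {
        String viaApi = fromCodePoint(SHAVIAN_FIRST);
        String viaEscapes = "\uD801\uDC50";            // the hand-computed surrogate pair
        System.out.println(viaApi.equals(viaEscapes)); // true
        System.out.println(isShavian(viaApi.codePointAt(0))); // true
        System.out.println(new StringBuilder().appendCodePoint(SHAVIAN_LAST).length()); // 2 chars
    }
}
```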
********
And this brings me to the problem cited in my message: how to update
the JavaCC/FreeCC lexer syntax to allow reference to ranges of
supplementary characters. Suppose that I'm implementing a new
programming language wherein the IDs can contain any Unicode "letter"
character. I might define a helper
< #UNICODE_ID_LETTER: [
"\u0041" - "\u005A" , // Latin
"\u0061" - "\u007A" ,
"\u00AA" ,
"\u00B5" ,
"\u00BA" ,
"\u00C0" - "\u00D6" ,
"\u00D8" - "\u00F6" ,
"\u00F8" - "\u02C1" ,
"\u02C6" - "\u02D1" ,
"\u02E0" - "\u02E4" ,
"\u02EC" ,
"\u02EE" ,
"\u0370" - "\u0374" , // Greek
"\u0376" - "\u0377" ,
"\u037A" - "\u037D" ,
"\u0386" ,
"\u0388" - "\u038A" ,
"\u038C" ,
"\u038E" - "\u03A1" ,
"\u03A3" - "\u03F5" ,
"\u03F7" - "\u03FF" ,
"\u0400" - "\u0481" , // Cyrillic
"\u048A" - "\u0523"
etc, etc, etc
] >
which works for the BMP, but what do I do when I want to include
Shavian, Deseret, Gothic, etc. from the supplementary area?
Using a Python-like notation, the Shavian range is
"\U00010450" - "\U0001047F"
but (the last time I looked) Java itself doesn't support such a
notation. If you're better informed than I am, please set me straight.
That's what I meant when I wrote that "the basic problem may still lie
in Java".
>
>
> Now, for another, completely independent issue:
>
> To make matters even more interesting, both composed and decomposed
> variants for characters are allowed, i.e. "ffi" can be equally
> expressed as U+0066,U+0066,U+0069 or as the U+FB03 ligature. This is
> not a Java problem, it's a trait of the Unicode standard. Sun finally
> added java.text.Normalizer for Java 6 that allows you to obtain
> various normalized decompositions/compositions of your Unicode
> sequences. It is probably a good idea to pass your char sequence
> through it before letting it be analyzed by a lexer; it'll definitely
> make the lexer rules easier.
>
> If you need this functionality prior to Java 6, your best bet is IBM's
> "International Components for Unicode" library:
> <http://www-01.ibm.com/software/globalization/icu/index.jsp>
> (I've been using it for years).
Yes, everyone should be aware of the Normalization challenge.
Before java.text.Normalizer I successfully used sun.text.Normalizer
for a while. As you say, ICU is often the best
solution for getting around the awkwardness of the Java implementation
of supplementary characters.
E.g. for iterating through a String character by character (rather
than char by char) you can use
com.ibm.icu.text.UCharacterIterator, but this is much less convenient
than the built-in support for supplementary characters in Perl and
Python.
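For comparison, character-by-character iteration is also possible with the plain JDK (Java 5 and later), without ICU. A minimal sketch, with an invented class name; the caller still has to remember to step by Character.charCount rather than by one.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalk {
    /** Returns the Unicode code points of s, treating each surrogate pair as one character. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // combines a surrogate pair starting at i, if any
            out.add(cp);
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', SHAVIAN LETTER PEEP (U+10450), 'b'
        System.out.println(codePoints("a\uD801\uDC50b")); // [97, 66640, 98]
    }
}
```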
>>
>
> So, to summarize:
> - you can run your input char stream through a Unicode normalizer in
> order to not have to worry about composed/decomposed variants in
> downstream processing
Of course.
>
> - as long as you're comfortable not having to handle Unicode 3.1
> input, you also needn't worry about the fact that java.lang.String is
> actually a UTF-16 encoded text representation. (I know, that's like
> saying "as long as you're comfortable not having to handle non-ASCII
> input, you needn't know what's UTF-8" :-) )
But this was the whole point :) Supplementary characters are
important to me,
and they've been an official and de facto part of the Unicode standard
since 2001. In November 2008 I shouldn't have to ignore
or suppress them. I'd like to define lexers that refer to
supplementary characters and ranges of supplementary characters,
just like they refer to BMP characters. Unicode is now at version
5.1.0, and if we settle for 3.0 we're aiming pretty low.
So the big question is: How might JavaCC/FreeCC be enhanced/expanded
to recognize tokens that include supplementary characters?
And the next obvious question is: If we did implement an expanded
FreeCC lexer syntax, e.g. something like
...
"\U00010330" - "\U0001034F", // Gothic
"\U00010400" - "\U0001044F", // Deseret
"\U00010450" - "\U0001047F", // Shavian
...
how would that translate into Java? Would we have to change the
front-end to rely on UCharacterIterator?
Best,
A-ha!
Okay, I didn't understand that from your previous posting. I thought
that you specifically *don't* want to deal with supplementals
explicitly in the lexer.
> Unicode is now at version
> 5.1.0, and if we settle for 3.0 we're aiming pretty low.
>
> So the big question is: How might JavaCC/FreeCC be enhanced/expanded
> to recognize tokens that include supplementary characters?
> And the next obvious question is: If we did implement an expanded
> FreeCC lexer syntax, e.g. something like
>
> ...
> "\U00010330" - "\U0001034F", // Gothic
> "\U00010400" - "\U0001044F", // Deseret
> "\U00010450" - "\U0001047F", // Shavian
> ...
>
> how would that translate into Java? Would we have to change the
> front-end to rely on UCharacterIterator?
Well, yes, ideally this would work with an equivalent of
java.io.Reader that returns UCS-4 values directly. Funnily enough,
java.io.Reader#read() returns an int, so it is actually already
suitable for the task, unfortunately the read(char[]) and read(char[],
int, int) aren't -- they'd have to be changed to receive int[]
instead. And of course, FreeCC would internally need to deal with int
values instead of char values to represent characters. Not too big a
deal, actually; I don't think the FreeCC internals keep too many
arrays around, so having the processed chunk of text stored at 32 bits
per character is probably acceptable. The author of the parser would need to
make a transcoding decision though if he'd need to transform a
sequence of UCS-4 characters into either a java.lang.String object or
a char[] object...
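A rough Java sketch of such an int-returning reader, layered on an ordinary java.io.Reader. Everything here is hypothetical: CodePointReader and readAll are invented names, nothing like this is claimed to exist in FreeCC, and passing unpaired surrogates through unchanged is just one possible policy.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CodePointReader {
    private final PushbackReader in;

    CodePointReader(Reader in) {
        this.in = new PushbackReader(in, 1);   // one char of pushback is enough
    }

    /** Returns the next Unicode code point, or -1 at end of input. */
    int read() throws IOException {
        int first = in.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first;                      // EOF or an ordinary BMP char
        }
        int second = in.read();
        if (second >= 0 && Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint((char) first, (char) second);
        }
        if (second >= 0) {
            in.unread(second);                 // not a pair: leave it for the next call
        }
        return first;                          // unpaired surrogate passed through as-is
    }

    /** Convenience helper: all code points of a string, read through the reader. */
    static List<Integer> readAll(String s) {
        List<Integer> out = new ArrayList<>();
        CodePointReader r = new CodePointReader(new StringReader(s));
        try {
            for (int cp = r.read(); cp >= 0; cp = r.read()) {
                out.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int cp : readAll("a\uD801\uDC50b")) {
            System.out.printf("U+%04X%n", cp);  // U+0061, U+10450, U+0062
        }
    }
}
```

A lexer built on int code points could then compare against a range such as 0x10450 to 0x1047F directly, and convert back with Character.toChars only when a String or char[] is needed.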