Enhancements to tokenization?


Kenneth Reid Beesley

Nov 14, 2008, 7:33:22 PM
to freecc

My memory grows dim, but years ago there was a discussion about
enhancing/facilitating Unicode-character-specification in the JavaCC
tokenizer.

1. Allowing sets of Unicode characters to be referenced by predicate/
subtype. Assume, for example, that IDs in your language look like
idname, that they have to start with (any) Unicode letter character,
and that they can continue with any number of Unicode letter-
characters, Unicode digit-characters, or underscore.

TOKEN: {
    // $name
    < ID: <UNICODE_ID_LETTER> ( <UNICODE_ID_LETTER>
                              | <UNICODE_DIGIT>
                              | "_"
                              )*
    >
    ......
  | < #UNICODE_ID_LETTER: [ "A"-"Z", "a"-"z" , etc, etc, etc ] >
  | < #UNICODE_DIGIT: [ "0"-"9", etc, etc, etc ] >
}

One problem is that, even with ranges, listing out all the letters,
and digits, is tedious and error-prone.
The other problem is that when new versions of Unicode come out, with
new letters and digits, I don't really want to go back into my
tokenizer to add them.

Ideally, I'd like JavaCC/FreeCC to support reference to
UNICODE.letter and UNICODE.digit (or something like that), which
would cover all the characters officially classified as "letters" and
"digits", respectively, in the version of Unicode being used.

2. Allowing subtraction in character sets, e.g. to match all ASCII
letters, except M and m, one might write something like

TOKEN: {

    < #ASCII_LETTER_EXCEPT_M: [ "A"-"Z", "a"-"z" ] - ("M" | "m") >

}

Any chance of allowing these tricks in FreeCC?

Thanks,

Ken


******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT
84054 USA

Jonathan Revusky

Nov 16, 2008, 7:06:45 AM
to freecc...@googlegroups.com
On Sat, Nov 15, 2008 at 1:33 AM, Kenneth Reid Beesley
<krbe...@gmail.com> wrote:
>
>
> My memory grows dim, but years ago there was a discussion about
> enhancing/facilitating Unicode-character-specification in the JavaCC
> tokenizer.

Well, I wouldn't have been involved in that, since I only
participated in the JavaCC user list for the first time in April of
this year. I was just googling around for it and the first search
string I wrote was 'unicode javacc list users'. Interestingly enough
the first hit on the list returned was:

http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf

which is by you. The second hit is Theo Norvell's JavaCC FAQ and the
third hit is possibly the conversation you are referring to, which
would have been mid-2006. I can't see in that thread where they ever
resolved to do anything. Though, regardless, I am quite sure that they
never did anything.

BTW, I ran across your paper on using unicode in JavaCC, and I think
you may be wrong on a certain point. IIRC, you state that JavaCC will
handle unicode correctly if you instantiate your parser with a Reader,
and that the UNICODE_INPUT option is unnecessary. I don't think that's
true, though maybe it is supposed to be. I am pretty sure that you
need UNICODE_INPUT=true for a parser to handle unicode correctly.

This brings me to a fairly big question I have had in my mind about
this whole issue. What I have been wondering about and had been
meaning to bring up here was the topic of whether there is any strong
reason to support anything besides unicode input. AFAICS, supporting
non-unicode input really only exists as an optimization option.
Surely, any lexer for ASCII input can be handled using a charstream
that allows full unicode. I think that specifically handling the
non-unicode case allows certain code to be table-based, and thus,
somewhat faster. What I don't know is how much faster. Intuitively, it
is hard to believe there is much to gain from specially handling
non-unicode input. After all, Java itself uses 16-bit characters in
strings and so on whether you need them or not, so you've already got
a lot of the unicode conversion overhead assumed regardless. Well, I
guess it's really just a question of constructing a benchmark and
seeing whether there is much speed difference. Earlier, I did check
into the whole STATIC option and I can certainly tell you that
STATIC=true buys basically nothing. There might be something
like a 3% speed difference, depending on the exact case. But nothing
really. So I pretty much immediately removed all support for static
parsers. I wouldn't be surprised if the whole UNICODE_INPUT=false was
for nought as well. If we just said that all input is unicode, then I
think there is a lot of lexer-related code that could just be whacked,
and getting code size down always makes things more manageable.

> 1. Allowing sets of Unicode characters to be referenced by predicate/
> subtype. Assume, for example, that IDs in your language look like
> idname, that they have to start with (any) Unicode letter character,
> and that they can continue with any number of Unicode letter-
> characters, Unicode digit-characters, or underscore.
>
> TOKEN: {
> // $name
> < ID: <UNICODE_ID_LETTER> ( <UNICODE_ID_LETTER>
> | <UNICODE_DIGIT>
> | "_"
> )*
> >
> ......
> | < #UNICODE_ID_LETTER: [ "A"-"Z", "a"-"z" , etc, etc, etc ] >
> | < #UNICODE_DIGIT: [ "0"-"9", etc, etc, etc ] >
> }
>
> One problem is that, even with ranges, listing out all the letters,
> and digits, is tedious and error-prone.

The INCLUDE statement introduced in FreeCC could go some way to
alleviating this. Certain standard definitions could be contained
externally, and you could just reuse them with an include. In fact, if
some generally useful ones were worked up, maybe the included file
containing them could just be fished out of the freecc.jar file. I
don't have a disposition for including something from a jar file, but
that, of course, is trivial.

> The other problem is that when new versions of Unicode come out, with
> new letters and digits, I don't really want to go back into my
> tokenizer to add them.

Yeah, well, the whole idea of code reuse via copy-paste-modify is
pretty problematic. There seems to be no awareness of this at all in
JavaCC. For example, the JavaCC distro comes with a couple of
examples, where they just take the Java grammar, and then copy it
somewhere else and modify it -- you know, they tweak one or two
productions. And since these are provided examples, I have to assume
that they are being offered as good examples of using the tool. I
would think it's obvious that if you wanted to do something based on
the Java grammar, you would want some way to INCLUDE the canonical
java grammar and then just specify the tweaks you need, but not
copy-pasting the whole thing.

>
> Ideally, I'd like JavaCC/FreeCC to support reference to
> UNICODE.letter and UNICODE.digit (or something like that), which
> would cover all the characters officially classified as "letters" and
> "digits', respectively, in the version of Unicode being used.
>
> 2. Allowing subtraction in character sets, e.g. to match all ASCII
> letters, except M and m, one might write something like
>
> TOKEN: {
>
> < #ASCII_LETTER_EXCEPT_M: [ "A-Z", "a"-"z" ] - ("M" | "m") >
>
> }
>
> Any chance of allowing these tricks in FreeCC?

Well, sure, I think this kind of thing would be quite useful. The only
thing is that, if I'm the only person working on this, then I don't
know when I'll turn my attention to this. Also, the whole lexer
generation piece is where I have the biggest holes in my understanding
of how the tool works. I have been filling this in a bit. I have read
up somewhat on the NFA->DFA algorithm type stuff.

But the best chance for this stuff happening reasonably soon is if
somebody shows up who wants to take ownership of that piece. You could
be interested. If you ever tried to get into the JavaCC codebase and
basically ran away in fright, I think that you'd find that the FreeCC
code is much more amenable to getting involved. It's been heavily
refactored and cleaned up. With all the actual generation of java code
now in external templates, the java codebase is much smaller and
cleaner.

I hope that answers your questions.

Regards,

JR

Kenneth Reid Beesley

Nov 18, 2008, 1:39:43 PM
to freecc...@googlegroups.com, me beesley

On 16 Nov 2008, at 05:06, Jonathan Revusky wrote:

>
> On Sat, Nov 15, 2008 at 1:33 AM, Kenneth Reid Beesley
> <krbe...@gmail.com> wrote:
>>
>>
>> My memory grows dim, but years ago there was a discussion about
>> enhancing/facilitating Unicode-character-specification in the JavaCC
>> tokenizer.
>
> Well, I wouldn't have been involved in that, since I only
> participated in the JavaCC user list for the first time in April of
> this year. I was just googling around for it and the first search
> string I wrote was 'unicode javacc list users'. Interestingly enough
> the first hit on the list returned was:
>
> http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf
>
> which is by you.

Jonathan,

Yes, I always thought the JavaCC documentation was in an abysmal
state, and I tried to do something about it.
This paper is about how to instantiate JavaCC parsers that read source
files in various encodings, using available
JavaCC parser constructors.

It didn't get into the old debate on how to enhance the way that
JavaCC lexers can be specified, to allow easier reference
to Unicode character types.

> The second hit is Theo Norvell's JavaCC FAQ and the
> third hit is possibly the conversation you are referring to, which
> would have been mid-2006. I can't see in that thread where they ever
> resolved to do anything. Though, regardless, I am quite sure that they
> never did anything.

I'm pretty sure that nothing was done. I've had a job change since
then and can't seem to find my old notes.

>
>
> BTW, I ran across your paper on using unicode in JavaCC, and I think
> you may be wrong on a certain point. IIRC, you state that JavaCC will
> handle unicode correctly if you instantiate your parser with a Reader,
> and that the UNICODE_INPUT option is unnecessary. I don't think that's
> true, though maybe it is supposed to be. I am pretty sure that you
> need UNICODE_INPUT=true for a parser to handle unicode correctly.

I'd love to get this point settled. I was informed by a JavaCC guru
at one point, years ago, that the
UNICODE_INPUT option was "obsolete" since the time that Java
introduced the Unicode-savvy Reader
classes (and thus did at the Java level what JavaCC originally had to
do itself). Of course,
the official documentation on options was/is notoriously out of date.

"Obsolete" can of course be ambiguous: 1) having no effect in any
case, or 2) superseded by a new, better way to do it.
Perhaps Readers made UNICODE_INPUT obsolete in this second sense.

Tom Copeland wrote (29 Dec 2006) "Anyhow, I can't think of a reason
when you would use UNICODE_INPUT
rather than using a Reader with the proper character encoding. If any
of the real gurus want to weigh in to correct me here that'd be
great..."


>
> This brings me to a fairly big question I have had in my mind about
> this whole issue. What I have been wondering about and had been
> meaning to bring up here was the topic of whether there is any strong
> reason to support anything besides unicode input.

The handling of various source-file encodings is _already_ provided in
Java itself. And JavaCC
already provides a simple parser constructor that allows you, where
necessary, to manually specify the
encoding of the input file.

> AFAICS, supporting
> non-unicode input really only exists as an optimization option.
> Surely, any lexer for ASCII input can be handled using a charstream
> that allows full unicode.

JavaCC lexers are written in terms of Unicode characters. All input
text gets converted,
one way or another, to Unicode before it ever gets to your JavaCC lexer.

> I think that specifically handling the
> non-unicode case allows certain code to be table-based, and thus,
> somewhat faster. What I don't know is how much faster. Intuitively, it
> is hard to believe there is much to gain from specially handling
> non-unicode input. After all, Java itself uses 16-bit characters in
> strings and so on whether you need them or not, so you've already got
> a lot of the unicode conversion overhead assumed regardless.

Java programs (including JavaCC lexers and parsers) always handle text
internally as String
(or StringBuffer, etc) objects that are Unicode (encoded as UTF-16).
Any text read in from file (including
your source programs read by a JavaCC parser) is converted into
Unicode one
way or another before it gets to the lexer. This is the main subject
of the javacc_unicode.pdf
paper that you found.

By Default:
Java somehow interrogates your operating system to find the
default encoding of the operating system. And if your Java program
reads text input
from a file, using a simple FileReader (which extends
InputStreamReader), the text will automatically
be converted from your default operating-system encoding into Unicode
Strings. By default,
Java also prints out its internal Unicode Strings by converting them
to the operating system's default
encoding. (This default behavior can be overridden by using Writer
objects, wherein you manually
specify the encoding of the output file.) Many users never notice
the silent default conversions, which tend to make Java programs more
portable.
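
To make the silent conversion visible, here is a minimal sketch, assuming Java 8 or later. The class name DecodeDemo and its decode helper are made up for illustration and have nothing to do with JavaCC itself. It decodes the same two bytes (the UTF-8 form of U+00E9) once as UTF-8 and once as Latin-1:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JavaCC or FreeCC.
class DecodeDemo {
    // Decode raw bytes into a Java (UTF-16) String through a Reader.
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) occupies two bytes in UTF-8.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length());      // one char
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length()); // two chars
    }
}
```

Either way the result is an ordinary Java String; only the choice of charset decides whether the two bytes become one character or two.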

Explicit Conversion:
If your source file is stored in an encoding that is not the default
encoding of your operating
system (e.g. if it is in Unicode UTF-8, and your default operating
system encoding is Latin-1),
then you can use Reader objects and overtly specify the encoding.
JavaCC even provides an
abbreviated parser constructor that allows you to specify overtly the
encoding of the source file:

XXX parser = null ;
try {
    parser = new XXX(new FileInputStream("input.utf8"), "UTF-8") ;
}
catch (FileNotFoundException e) {
    System.out.println("File not found. Exiting.") ;
    System.exit(0) ;
}

So I think the handling of non-Unicode encodings is already done, and
mostly in Java itself,
before the text arrives at your lexer.

I looked into Unicode character types again:
There's a text file UnicodeData.txt (for each release of Unicode) that
classifies each character
as letter, punctuation, control, etc., using several subtypes. I
fooled around with it a bit last night,
writing a little Perl script to extract "letters".
"Lu" is uppercase letter. "Ll" is lowercase letter. And there are
perhaps more "L" categories
that I don't understand yet. If you extract letters this way, from
UnicodeData.txt, to list them in your
lexer descriptions, then you'd have to redo it for each new release of
Unicode.
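
As an alternative to parsing UnicodeData.txt by hand, the ranges could be derived from the Unicode tables bundled with the running JVM. This is only a sketch under that assumption: it swaps in Character.isLetter for the UnicodeData.txt lookup, the class and method names are invented for illustration, and the output follows whatever Unicode version the JVM ships, so it would still need regenerating after a JVM upgrade.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator, not part of JavaCC or FreeCC.
class LetterRanges {
    // Collect maximal runs of letters in [from, to] as {first, last} pairs.
    static List<int[]> letterRanges(int from, int to) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;
        for (int c = from; c <= to; c++) {
            if (Character.isLetter(c)) {
                if (start < 0) start = c;
            } else if (start >= 0) {
                ranges.add(new int[] { start, c - 1 });
                start = -1;
            }
        }
        if (start >= 0) ranges.add(new int[] { start, to });
        return ranges;
    }

    // Render the ranges in JavaCC character-list style.
    static String toCharList(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder("[ ");
        for (int i = 0; i < ranges.size(); i++) {
            int[] r = ranges.get(i);
            if (i > 0) sb.append(", ");
            sb.append(String.format("\"\\u%04x\"", r[0]));
            if (r[1] != r[0]) sb.append(String.format("-\"\\u%04x\"", r[1]));
        }
        return sb.append(" ]").toString();
    }

    public static void main(String[] args) {
        // Small span for readability; pass 0 and 0xFFFF for the whole BMP.
        System.out.println(toCharList(letterRanges('A', 'z')));
    }
}
```

The printed list could then be pasted into a private token definition such as #UNICODE_ID_LETTER.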

It's better to look at how native Java regular expressions and Unicode
regular expressions refer to Unicode character types.
Inside Java regular expressions, you can match whole Unicode
categories of characters, e.g.
\p{L} matches a single Unicode letter character. While \P{L}, with a
big P, matches any single character that is
_not_ a letter character. Similarly \p{Ll} matches any lowercase
letter, and \p{Lu} matches any
uppercase letter. There are many such codes to refer to official
Unicode character types and subtypes, documented
in

http://www.unicode.org/reports/tr18/#Categories
http://www.regular-expressions.info/unicode.html

It would be useful to have these character-type-matching-codes exposed
inside JavaCC itself, but
that would probably require a hard-core Java guru.
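
For comparison, this is roughly what the ID token from my first message looks like with java.util.regex categories. It is only an illustration of the \p{...} notation in plain Java, not JavaCC syntax, and the class name is made up:

```java
import java.util.regex.Pattern;

// Hypothetical demo class: java.util.regex, not JavaCC token syntax.
class UnicodeIdPattern {
    // A letter (category L) followed by letters, decimal digits (Nd) or underscores.
    static final Pattern ID = Pattern.compile("\\p{L}[\\p{L}\\p{Nd}_]*");

    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("name_1"));               // ASCII identifier
        System.out.println(isId("\u03b1\u03b2\u0663"));   // Greek letters plus an Arabic-Indic digit
        System.out.println(isId("1abc"));                 // starts with a digit
    }
}
```

The pattern never lists individual characters, so it tracks whatever Unicode version the JVM supports.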


>>
>>
>> 2. Allowing subtraction in character sets, e.g. to match all ASCII
>> letters, except M and m, one might write something like
>>
>> TOKEN: {
>>
>> < #ASCII_LETTER_EXCEPT_M: [ "A-Z", "a"-"z" ] - ("M" | "m") >
>>
>> }
>>
>> Any chance of allowing these tricks in FreeCC?
>
> Well, sure, I think this kind of thing would be quite useful. The only
> thing is that, if I'm the only person working on this, then I don't
> know when I'll turn my attention to this. Also, the whole lexer
> generation piece is where I have the biggest holes in my understanding
> of how the tool works. I have been filling this in a bit. I have read
> up somewhat on the NFA->DFA algorithm type stuff.

I fear that I don't have the skills or the time to dig down into the
lexer-syntax-and-generation code.
Limited subtraction of character sets inside regular expressions is
allowed in "Unicode
Regular Expressions"

http://www.unicode.org/reports/tr18/#Subtraction_and_Intersection

Most computer-language implementations of "regular expressions" don't
support subtraction.
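
Java's own regex engine is one that can express it, though indirectly: java.util.regex has no minus operator, but intersecting a class with a negated class has the same effect. A minimal sketch (the class name is invented for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical demo class: set subtraction via intersection in java.util.regex.
class LetterExceptM {
    // ASCII letters intersected with "anything except M or m".
    static final Pattern P = Pattern.compile("[A-Za-z&&[^Mm]]");

    static boolean matches(char c) {
        return P.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches('A'));
        System.out.println(matches('m'));
        System.out.println(matches('1'));
    }
}
```

Something equivalent in the lexer specification language would cover the ASCII_LETTER_EXCEPT_M case.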

*****

There's another issue, and that is how to enhance JavaCC to allow the
lexer to refer to
Unicode supplementary characters (beyond the Basic Multilingual
Plane). The basic problem
may still lie in Java itself. I'm not completely up to date with
Java's latest handling of Unicode, but
earlier versions of Java kind of made a mess of supplementary
characters (in my opinion).
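
For reference, since J2SE 5.0 the String and Character classes have code-point methods alongside the 16-bit char ones. The sketch below (class name invented) shows the difference for U+10400, a Deseret capital letter that needs a surrogate pair in UTF-16. A lexer that works char by char sees two surrogates, neither of which is a letter:

```java
// Hypothetical demo class: char view vs. code-point view of a supplementary character.
class SupplementaryDemo {
    // U+10400 lies outside the Basic Multilingual Plane.
    static final String DESERET = new String(Character.toChars(0x10400));

    public static void main(String[] args) {
        System.out.println(DESERET.length());                               // UTF-16 code units
        System.out.println(DESERET.codePointCount(0, DESERET.length()));    // code points
        System.out.println(Character.isLetter(DESERET.charAt(0)));          // high surrogate only
        System.out.println(Character.isLetter(DESERET.codePointAt(0)));     // whole character
    }
}
```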


>
>
> But the best chance for this stuff happening reasonably soon is if
> somebody shows up who wants to take ownership of that piece. You could
> be interested. If you ever tried to get into the JavaCC codebase and
> basically ran away in fright, I think that you'd find that the FreeCC
> code is much more amenable to getting involved. It's been heavily
> refactored and cleaned up. With all the actual generation of java code
> now in external templates, the java codebase is much smaller and
> cleaner.

I wish I could dig into the lexer and make things better, but I know
when something
is beyond my skills.

Thanks for your response,

Ken

Jonathan Revusky

Nov 19, 2008, 1:33:19 AM
to freecc...@googlegroups.com
Kenneth Reid Beesley wrote:
>
> On 16 Nov 2008, at 05:06, Jonathan Revusky wrote:
>
>> On Sat, Nov 15, 2008 at 1:33 AM, Kenneth Reid Beesley
>> <krbe...@gmail.com> wrote:
>>>
>>> My memory grows dim, but years ago there was a discussion about
>>> enhancing/facilitating Unicode-character-specification in the JavaCC
>>> tokenizer.
>> Well, I wouldn't have been involved in that, since I only
>> participated in the JavaCC user list for the first time in April of
>> this year. I was just googling around for it and the first search
>> string I wrote was 'unicode javacc list users'. Interestingly enough
>> the first hit on the list returned was:
>>
>> http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf
>>
>> which is by you.
>
> Jonathan,
>
> Yes, I always thought the JavaCC documentation was in an abysmal
> state, and I tried to do something about it.

Well, I can tell you it's not just the documentation that is in bad
shape, it's the whole thing. The code is really horrendous, just
hopelessly entangled. JavaCC was written by people with very little real
understanding of basic notions like encapsulation or modularity. So, for
example, when you look through the code, you see there are practically
no private variables. I think everything is package visible for the
most part, except all the real machinery is in a single package
(org.javacc.parser) so it's basically as if all the variables are public.

Now, to give credit where some credit is due, the tool works pretty
well. It works okay, but it is pretty much impossible to extend JavaCC
in any non-trivial way without a massive code cleanup beforehand. I sort
of embarked on it at some point as a challenge. I started looking at the
JavaCC code because I was interested in the question of how hard it
would be to get it to generate parsers for other languages. But anyway,
I think the truth is that, if I had known just how badly entangled the
code was and how much work it would be to clean the whole thing up, I
don't think I would have embarked on what I did. Also, when I started, I
didn't know I'd have to fork off a new project, that my attempts to
re-animate things there would be met with total indifference ranging all
the way to pathological hostility.

Well, I hedged above by saying "you may be wrong..." because I hadn't
looked into this for a while, and was thinking that my memory might be
playing tricks on me, but I've looked into this again, and it is quite
clear that you were misinformed by whichever "JavaCC guru".


> Of course,
> the official documentation on options was/is notoriously out of date.

Well, the fact that you had to embark on a mini research project to get
an answer to such a simple question really is a pretty damning testimony
to the state of things there.

> "Obsolete" can of course be ambiguous: 1) having no effect in any
> case,

Well, it's fairly easy to demonstrate that the UNICODE_INPUT option does
have an effect. For example, if I rebuild FreeMarker with
UNICODE_INPUT=false it then fails one of our unit tests, which
specifically checks whether it can handle a template with non-ASCII
characters. This is despite the fact that the FreeMarker code uses
Readers exclusively.

> or 2) superseded by a new, better way to do it.
> Perhaps Readers made UNICODE_INPUT obsolete in this second sense.

Well,... they didn't. I think I can just point you to the places in the
FreeCC code where the UNICODE_INPUT option is used. Now, this is the
FreeCC codebase, but this is a result of refactoring JavaCC. I am quite
certain that JavaCC and FreeCC behave the same on this.

Now, as you're probably aware, in FreeCC, all the ostr.println(...)
statements that were used to generate java code have been replaced by
using freemarker page templates. Anyway, the following template
LexGen.java.ftl is the part that generates the lexer (a.k.a.
TokenManager) code.

http://code.google.com/p/freecc/source/browse/trunk/freecc/src/templates/java/LexGen.java.ftl

I direct your attention to line 38 of this file, where a convenience
variable called USING_UNICODE is defined. Basically, if either of the
options JAVA_UNICODE_ESCAPE or UNICODE_INPUT is true, then this is set
to true. (JAVA_UNICODE_ESCAPE basically just specifies that in addition
to handling unicode, you also handle \uXXXX as well.) Anyway, you can
search forward for where this USING_UNICODE variable is actually used.
It is actually used in 6 different spots in that template. (And this,
BTW, is basically the only place that the UNICODE_INPUT option is used
in the code generation.)

I think a good example is on line 1169. You can see the if-else
conditioned on whether unicode input is enabled. The code that is
generated with UNICODE_INPUT turned off uses a bit vector variable
jbitVecXXXX, where the code that is generated with UNICODE_INPUT turned
on invokes a method called jjCanMove_XXXXX that takes various parameters.

Anyway, the point is that with UNICODE_INPUT turned off, the code
generated uses a peephole optimization basically that assumes that all
the input is ASCII.

So, anyway, the real straight dope on this is that, even though the
underlying Java platform is handling all the unicode/charset conversion
logic, and even if you are using Readers throughout your code, your
JavaCC-built parser will not work correctly for non-ASCII input if you
have not specified UNICODE_INPUT=true (or alternatively
JAVA_UNICODE_ESCAPE=true). This is because the code generation logic is
generating optimized code (probably only slightly optimized, and
probably the optimization is not worth a bucket of warm spit) but it is
generating the optimized code if UNICODE_INPUT is not specified. Said
optimization is based on the assumption that all the input is ASCII.
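
To illustrate the kind of shortcut I mean, here is a deliberately simplified sketch. It is NOT the code that JavaCC or FreeCC generates; the class and method names are made up, and Character.isLetter merely stands in for the general lookup. The point is only that a 128-entry bit table answers "no" for every character at or above 128, however the input was decoded:

```java
// Hypothetical sketch of an ASCII-only fast path vs. a general check.
class AsciiFastPathSketch {
    // 128-bit membership table for ASCII letters, packed into two longs.
    static final long[] ASCII_LETTERS = new long[2];
    static {
        for (char c = 'A'; c <= 'Z'; c++) ASCII_LETTERS[c >> 6] |= 1L << (c & 63);
        for (char c = 'a'; c <= 'z'; c++) ASCII_LETTERS[c >> 6] |= 1L << (c & 63);
    }

    // ASCII-only path: anything at or above 128 is treated as "no match".
    static boolean isLetterAsciiOnly(char c) {
        return c < 128 && (ASCII_LETTERS[c >> 6] & (1L << (c & 63))) != 0;
    }

    // General path: consult the full Unicode classification.
    static boolean isLetterUnicode(char c) {
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        char eAcute = '\u00e9';
        System.out.println(isLetterAsciiOnly('q'));
        System.out.println(isLetterAsciiOnly(eAcute)); // rejected by the table
        System.out.println(isLetterUnicode(eAcute));   // accepted by the general check
    }
}
```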


>
> Tom Copeland wrote (29 Dec 2006) "Anyhow, I can't think of a reason
> when you would use UNICODE_INPUT
> rather than using a Reader with the proper character encoding. If any
> of the real gurus want to weigh in to correct me here that'd be
> great..."

Copeland is quite confused. He talks of using UNICODE_INPUT=true and
using a Reader with the proper encoding as if they are two alternative
solutions. But that's not the case at all. Even if you have
UNICODE_INPUT=true, you have to be using a Reader with the appropriate
encoding (Of course!!!). And, as I point out above, if you use a Reader
with the proper encoding so that your input is a sequence of 16-bit
unicode characters, your JavaCC-built parser will not do the right thing
with it unless you have set UNICODE_INPUT=true. That is because of these
evil optimization blocks (that, now that I look at this, I am pretty
much absolutely certain that they do not affect performance noticeably,
and are thus, basically masturbatory) that are based on the assumption
that all the characters coming in are ASCII.

>
>
>> This brings me to a fairly big question I have had in my mind about
>> this whole issue. What I have been wondering about and had been
>> meaning to bring up here was the topic of whether there is any strong
>> reason to support anything besides unicode input.
>
> The handling of various source-file encodings is _already_ provided in
> Java itself. And JavaCC
> already provides a simple parser constructor that allows you, where
> necessary, to manually specify the
> encoding of the input file.

Yes, what you are saying is right, or actually, it SHOULD be right. I
mean, this whole thing here is because this is code that was written by
somebody who lacks (or lacked, because this stuff was written over 10
years ago) a certain maturity of judgment. I mean, there is a certain
kind of coder, typically inexperienced, who cannot accept the idea of
any efficiency loss. For them, the art of programming is implementing
the algorithm in the absolutely most efficient way possible. The
subsequent maintainability of the code is not a factor for them, because
-- they lack experience, right? So, you know, they'll jump through the
most incredible hoops to gain some extra execution efficiency. That is
what the support for static parsers vs. non-static parsers is basically
about. I mean, if you don't need your parser to be thread-safe, then all
the methods and variables in it can be static, right? And you save
overhead in theory. The problem is that it might be 3% or something.
Nobody with much of a grain of sense, understanding the tradeoffs would
bother with static parsers. Of course, the worst aspect of that is that
this was the default. You had to specifically say STATIC=false to do
what should have been the default.

So, this is another travesty, where UNICODE_INPUT is false, by default,
and it creates a terrible gotcha, because a reasonable person would
think that if they defined their grammar in terms of unicode, and they
fed the parser a unicode stream, that this would work, but no, it
doesn't (and it really doesn't, I've tested it...) because basically
they have this kludgy optimization on by default which assumes that all
the characters are ASCII.

>
>> AFAICS, supporting
>> non-unicode input really only exists as an optimization option.
>> Surely, any lexer for ASCII input can be handled using a charstream
>> that allows full unicode.
>
> JavaCC lexers are written in terms of Unicode characters. All input
> text gets converted,
> one way or another, to Unicode before it ever gets to your JavaCC lexer.

Yes, except for the little detail that if you fail to specify
UNICODE_INPUT=true, then your scanner code will contain a little
peephole optimization here and there that assumes that all the input is
ASCII.

>
>> I think that specifically handling the
>> non-unicode case allows certain code to be table-based, and thus,
>> somewhat faster. What I don't know is how much faster. Intuitively, it
>> is hard to believe there is much to gain from specially handling
>> non-unicode input. After all, Java itself uses 16-bit characters in
>> strings and so on whether you need them or not, so you've already got
>> a lot of the unicode conversion overhead assumed regardless.
>
> Java programs (including JavaCC lexers and parsers) always handle text
> internally as String
> (or StringBuffer, etc) objects that are Unicode (encoded as UTF-16).
> Any text read in from file (including
> your source programs read by a JavaCC parser) is converted into
> Unicode one
> way or another before it gets to the lexer. This is the main subject
> of the javacc_unicode.pdf
> paper that you found.

Yes, yes, I understand all that. My point is that any parser generation
logic that can handle a 16-bit character set can also handle an 8-bit or
7-bit character set. The only reason that you would have a disposition
to state that all the input is ASCII, say, is because, in principle, you
can generate somewhat more efficient code (slightly more efficient I
would guess) by being able to assume that all the input is just ASCII.

Of course, the optimization in question is bound to be quite trivial
because simply by virtue of the fact that you are using the Java
platform, all kinds of byte->character conversions are happening all
over the place anyway and there's nothing you can do about it because
that's down at the JVM level.

Yes, agreed. We understand this. This all means that JavaCC should be
able to parse unicode by default without any extra options specified.
However, that is not the case, because you need to specify
UNICODE_INPUT=true, or the lexer will be generated with some extra
peephole optimizations here and there that assume your input is all
ASCII characters.

That is the way JavaCC works unfortunately and there is no way I can
change this.

However, it will be fixed in FreeCC.

JR

Attila Szegedi

Nov 19, 2008, 4:10:25 AM
to freecc...@googlegroups.com
Yeah, it's a very pointless option to have in 2008. It was probably
very pointless to have ever, FWIW. It's basically a "do you want to
get screwed down the line" option. It's just not worth keeping your
codebase more complex with alternative code for two values of this
option. I'd suggest you remove this option altogether, and just
default to correct handling of Unicode. For a while you might keep it
recognized and logging a warning that it's no longer required.

It's another good distinguishing feature, right? "Handles Unicode
characters correctly out-of-the-box". :-)

Attila.

--
home: http://www.szegedi.org
twitter: http://twitter.com/szegedi
weblog: http://constc.blogspot.com

Kenneth Reid Beesley

Nov 20, 2008, 1:34:47 PM
to freecc...@googlegroups.com, me beesley
Some brief comments below


On 18 Nov 2008, at 23:33, Jonathan Revusky wrote:

>
> <snip>


>>
>>
>>>
>>> BTW, I ran across your paper on using unicode in JavaCC, and I think
>>> you may be wrong on a certain point. IIRC, you state that JavaCC
>>> will
>>> handle unicode correctly if you instantiate your parser with a
>>> Reader,
>>> and that the UNICODE_INPUT option is unnecessary. I don't think
>>> that's
>>> true, though maybe it is supposed to be. I am pretty sure that you
>>> need UNICODE_INPUT=true for a parser to handle unicode correctly.
>>

Ken Beesley responded

Many thanks for the clarification. This is finally making sense.
In any non-trivial testing, I had probably set the external option
variable

JAVA_UNICODE_ESCAPE = true

which means that the internal "convenience variable" USING_UNICODE was
being forced to true automatically.
So I wouldn't have seen any difference setting the external
UNICODE_INPUT option to true vs. false.


>
>
>
>>
>> Tom Copeland wrote (29 Dec 2006) "Anyhow, I can't think of a reason
>> when you would use UNICODE_INPUT
>> rather than using a Reader with the proper character encoding. If
>> any
>> of the real gurus want to weigh in to correct me here that'd be
>> great..."
>
> Copeland is quite confused. He talks of using UNICODE_INPUT=true and
> using a Reader with the proper encoding as if they are two alternative
> solutions. But that's not the case at all. Even if you have
> UNICODE_INPUT=true, you have to be using a Reader with the appropriate
> encoding (Of course!!!). And, as I point out above, if you use a
> Reader
> with the proper encoding so that your input is a sequence of 16-bit
> unicode characters, your JavaCC-built parser will not do the right
> thing
> with it unless you have set UNICODE_INPUT=true.

or
JAVA_UNICODE_ESCAPE = true

which is probably what I had specified

Yikes. Not good. I share your opinion that the attempted
optimization-for-ASCII is probably unwise and unnecessary.
And making such an optimization-for-ASCII the default, in a modern
Java environment, is evil. It takes
Java's commendable dedication to handling Unicode and emasculates it.

Full handling of Unicode input should be the default in JavaCC/
FreeCC. If supplied at all, such an optimization-for-ASCII should be
an explicit user option, e.g.

HANDLE_ONLY_ASCII_INPUT = true
(default false)

but even this is potentially confusing and probably unwise.

Best,

Ken

Jonathan Revusky

Nov 20, 2008, 6:16:01 PM
to freecc...@googlegroups.com
On Wed, Nov 19, 2008 at 10:10 AM, Attila Szegedi <szeg...@gmail.com> wrote:
>
> Yeah, it's a very pointless option to have in 2008. It was probably
> very pointless to have ever, FWIW.

I would guess the chronology on this was that the original
implementation of this scanner generation stuff in JavaCC owes heavily
to code from one of the old UNIX dinosaurs, like lex or flex or
whatever. That code surely doesn't handle unicode input, so the extra
code to handle unicode was added later, but then, the person who did
it realized that it was less efficient (obviously) than the code that
handled 8-bit characters, so he kept the old code around, leaving it
as the default, unless you specifically set UNICODE_INPUT=true.

But the whole thing is outrageously sloppy, because even if you
specifically put unicode characters in your lexical specification, the
thing still requires you to specify UNICODE_INPUT. I mean, if I
specifically put character ranges that are outside of ASCII in my
grammar file, shouldn't the tool then be clever enough to realize that
UNICODE_INPUT should be on?

> It's basically a "do you want to
> get screwed down the line" option. It's just not worth keeping your
> codebase more complex with alternative code for two values of this
> option. I'd suggest you remove this option altogether, and just
> default to correct handling of Unicode. For a while you might keep it
> recognized and logging a warning that it's no longer required.

Today, I cleaned this up. I deleted all the code in SVN that relates
to non-unicode streams. For good measure, I also removed the
generation of all constructors that take an InputStream as an
argument. Just left all the ones that use Readers. Those constructors
were just convenience constructors. I think their presence is mostly
just confusing.
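
With only Reader-based constructors left, the caller is responsible for choosing the encoding when wrapping a byte stream. A minimal Java sketch of that step, assuming UTF-8 input; the class and method names are made up for illustration and no generated parser class is involved:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JavaCC or FreeCC.
class ReaderDecodingSketch {

    // Decodes UTF-8 bytes through a Reader, the way input would reach a Reader-based parser.
    static String decode(byte[] utf8Bytes) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);   // each value read is one 16-bit char
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Greek alpha and beta, two bytes each in UTF-8.
        byte[] bytes = { (byte) 0xCE, (byte) 0xB1, (byte) 0xCE, (byte) 0xB2 };
        String text = decode(bytes);
        System.out.println(text.length());                 // 2
        System.out.println(text.equals("\u03B1\u03B2"));   // true
    }
}
```

The Reader handed to a parser constructor would be built the same way, with whatever charset the input actually uses.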

And, for now, the UNICODE_INPUT option is listed on the wiki page of
obsolete/deprecated options from JavaCC:
http://code.google.com/p/freecc/wiki/DeprecatedConstructs

>
> It's another good distinguishing feature, right? "Handles Unicode
> characters correctly out-of-the-box". :-)

Yes, though that would also be true of JavaCC if they simply had the
UNICODE_INPUT default (sensibly) to true.

JR

Jonathan Revusky

Nov 24, 2008, 5:31:45 PM
to freecc...@googlegroups.com
Kenneth Reid Beesley wrote:
>
>
> I looked into Unicode character types again:
> There's a text file UnicodeData.txt (for each release of Unicode) that
> classifies each character
> as letter, punctuation, control, etc., using several subtypes. I
> fooled around with it a bit last night,
> writing a little Perl script to extract "letters".
> "Lu" is uppercase letter. "Ll" is lowercase letter. And there are
> perhaps more "L" encodings
> that I don't understand yet. If you extract letters this way, from
> UnicodeData.txt, to list them in your
> lexer descriptions, then you'd have to redo it for each new release of
> Unicode.

Yes, I see (I think) that the full generalized solution is something
like this, where you instantiate various RCharacterList objects for
lower case letter, etcetera, based on a configurable UnicodeData.txt
file. However, I don't think there is a need to solve the full
generalized problem initially. Getting set difference implemented and
getting the aliases like \Ll or whatever working are two things that can
be attacked separately. Fully generalizing the solution could
then be attacked in turn.

>
> It's better to look at how native Java regular expressions and Unicode
> regular expressions refer to Unicode character types.
> Inside Java regular expressions, you can match whole Unicode
> categories of characters, e.g.
> \p{L} matches a single Unicode letter character. While \P{L}, with a
> big P, matches any single character that is
> _not_ a letter character. Similarly \p{Ll} matches any lowercase
> letter, and \p{Lu} matches any
> uppercase letter. There are many such codes to refer to official
> Unicode character types and subtypes, documented
> in
>
> http://www.unicode.org/reports/tr18/#Categories
> http://www.regular-expressions.info/unicode.html
>
> It would be useful to have these character-type-matching-codes exposed
> inside JavaCC itself, but
> that would probably require a hard-core Java guru.

Well, Ken, you are interested enough in the whole topic to investigate
these matters. As for the notion that you need to be a hard-core Java
guru to muck with this stuff,... well... I just looked into this (in the
last few days or so which is why my response to this message is so late)
and it doesn't look too hard to beat this stuff into shape. The
necessary changes would be fairly localized AFAICS.

Anyway, just bear with me and I'll explain what I know (or think I know)
about this:

Now, if you look here, this is the actual object that is used to span a
range of characters (or multiple ranges of characters actually.)

http://code.google.com/p/freecc/source/browse/trunk/freecc/src/java/org/visigoths/freecc/lexgen/RCharacterList.java

Now, if you look over it, you see that it mostly contains a few
statically hard-coded tables that deal with upper/lower case conversions
in Unicode. But the actual object instance is quite simple. The
instance data for an RCharacterList is stored essentially in the
descriptors List, defined on line 55. Now, in Eclipse,
you have this command to see all the points where a variable or method
is used (particularly handy in something like the JavaCC codebase which
tends to define a lot of public variables). Anyway, a bit of
investigation (more than should be necessary if the code were really
well structured) and you figure out that the descriptors List can only
contain two types of Object, SingleCharacter and CharacterRange. You can
see these here:

http://code.google.com/p/freecc/source/browse/trunk/freecc/src/java/org/visigoths/freecc/lexgen/SingleCharacter.java
http://code.google.com/p/freecc/source/browse/trunk/freecc/src/java/org/visigoths/freecc/lexgen/CharacterRange.java

So, basically, set difference is not really that hard. You simply run
over the descriptors list and make adjustments based on the rhs of
the set subtraction. Now, if the descriptor object is just a
SingleCharacter, then you remove it or leave it depending on whether it
is in the range to be subtracted, right? A CharacterRange object is not
actually that much more difficult. For example, if we have the single
range ("A"-"Z") and we want to remove "M"-"O" from it, we
would have to break the CharacterRange object into the two subranges
("A"-"L") and ("P"-"Z") and those two CharacterRange objects would
replace the single ("A"-"Z") element in the descriptors List.
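
The range-splitting step described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the `Range` class is a hypothetical stand-in, not FreeCC's actual CharacterRange, a single character is modeled as a range whose two ends are equal, and case-insensitivity and negation are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the FreeCC codebase.
class RangeDifferenceSketch {

    static final class Range {
        final char left;   // inclusive lower bound
        final char right;  // inclusive upper bound

        Range(char left, char right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return left == right
                    ? "\"" + left + "\""
                    : "\"" + left + "\"-\"" + right + "\"";
        }
    }

    // Returns the parts of each range in 'descriptors' that lie outside 'cut'.
    static List<Range> subtract(List<Range> descriptors, Range cut) {
        List<Range> result = new ArrayList<Range>();
        for (Range r : descriptors) {
            if (cut.right < r.left || cut.left > r.right) {
                result.add(r);   // no overlap: keep the range unchanged
                continue;
            }
            if (r.left < cut.left) {
                // piece below the cut survives
                result.add(new Range(r.left, (char) (cut.left - 1)));
            }
            if (r.right > cut.right) {
                // piece above the cut survives
                result.add(new Range((char) (cut.right + 1), r.right));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> letters = new ArrayList<Range>();
        letters.add(new Range('A', 'Z'));
        // prints ["A"-"L", "P"-"Z"]
        System.out.println(subtract(letters, new Range('M', 'O')));
    }
}
```

Subtracting several ranges is then a matter of calling `subtract` once per range on the rhs, feeding each result into the next call.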

Of course, case sensitivity and negation are the extra wrinkles. (And
you could, just to get something working, resolve not to solve those for
now, but to get the absolutely most simple case working...) But anyway,
as long as you can munge a CharacterRange object into a set of
CharacterRange objects that represent the difference, it really seems
that all the existing machinery should just continue to work.

Of course, the other matter is what notation is to be used for set
difference, since I guess the straight minus sign is being used to
specify a range of characters. Well, maybe backslash. Or maybe it can be
reused without ambiguity. I haven't looked into it sufficiently.

>
>
>>>
>>> 2. Allowing subtraction in character sets, e.g. to match all ASCII
>>> letters, except M and m, one might write something like
>>>
>>> TOKEN: {
>>>
>>> < #ASCII_LETTER_EXCEPT_M: [ "A"-"Z", "a"-"z" ] - ("M" | "m") >
>>>
>>> }
>>>
>>> Any chance of allowing these tricks in FreeCC?


Well, as I say, the above does not look particularly hard to implement.
The lhs in the set difference, ["A"-"Z", "a"-"z"], is reified as
an RCharacterList object whose descriptors variable contains two
CharacterRange objects. Changing this into the same
internal representation as ["A"-"L", "N"-"Z", "a"-"l", "n"-"z"] is quite
doable. It is not proverbial rocket science.


>> Well, sure, I think this kind of thing would be quite useful. The only
>> thing is that, if I'm the only person working on this, then I don't
>> know when I'll turn my attention to this. Also, the whole lexer
>> generation piece is where I have the biggest holes in my understanding
>> of how the tool works. I have been filling this in a bit. I have read
>> up somewhat on the NFA->DFA algorithm type stuff.
>
> I fear that I don't have the skills or the time to dig down into the
> lexer-syntax-and-generation code.
> Limited subtraction of character sets inside regular expressions is
> allowed in "Unicode
> Regular Expressions"
>
> http://www.unicode.org/reports/tr18/#Subtraction_and_Intersection

Yes, I see that they use a double-minus operator -- for set subtraction.
That looks quite doable.

>
> Most computer-language implementations of "regular expressions" don't
> support subtraction.


Well, I'll take your word on that. I'm not exactly a regexp virtuoso. I
do use them here and there, but I have probably mastered only a pretty
pathetic subset.

>
> *****
>
> There's another issue, and that is how to enhance JavaCC to allow the
> lexer to refer to
> Unicode supplementary characters (beyond the Basic Multilingual
> Plane). The basic problem
> may still lie in Java itself. I'm not completely up to date with
> Java's latest handling of Unicode, but
> earlier versions of Java kind of made a mess of supplementary
> characters (in my opinion).

Well, Ken, at first blush, it does seem that you investigated the whole
topic in enough depth to form an opinion. (Or however deeply is
necessary to conclude that whichever gurus or whoever it was bungled
whatever the thing is... :-))

>
>
>>
>> But the best chance for this stuff happening reasonably soon is if
>> somebody shows up who wants to take ownership of that piece. You could
>> be interested. If you ever tried to get into the JavaCC codebase and
>> basically ran away in fright, I think that you'd find that the FreeCC
>> code is much more amenable to getting involved. It's been heavily
>> refactored and cleaned up. With all the actual generation of java code
>> now in external templates, the java codebase is much smaller and
>> cleaner.
>
> I wish I could dig into the lexer and make things better, but I know
> when something
> is beyond my skills.

Yeah, okay.... but...

....the inescapable fact does remain that any specific thing is going to
be beyond your skills .... I mean, as a practical matter, whether it
really is beyond your skills or not .... IF ... you talk yourself out of
it. (I really hope that doesn't come off as condescending or anything.
Maybe it could, it's not the intention. When I say "this is an
inescapable fact etcetera", I don't mean it in any personal way really.
IMO, this really is just a cold, hard fact....)

Now, neither you nor anybody here owes me anything, but here's the pitch:

Basically, I could paraphrase the above as making the case that it
really may not be more work to get some of this stuff actually
implemented in code than it was, for example, to write the essay you
wrote as a result of picking the brains of some self-styled JavaCC
gurus. But wouldn't you get a much better sense of satisfaction and
closure out of actually implementing the feature?

And, BTW, I don't think that implementing the aliases like \l for a
unicode letter and so on is very hard either. AFAICS, all you need is to
have some predefined RCharacterList instances that you can fish out --
(probably clone the canonical instance on demand and then use them and
throw them away...)
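
One possible way to build such predefined lists, without shipping a copy of UnicodeData.txt, is to ask the JVM for each character's general category. The sketch below is an assumption about how that could look, not existing FreeCC code: `CategoryRangesSketch` and `rangesFor` are hypothetical names, only the BMP is scanned, and the result reflects whatever Unicode version the running JVM supports.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of FreeCC.
class CategoryRangesSketch {

    // Collects {first, last} pairs of consecutive BMP chars whose
    // Unicode general category equals the given Character constant.
    static List<int[]> rangesFor(byte category) {
        List<int[]> ranges = new ArrayList<int[]>();
        int start = -1;   // -1 means "not currently inside a matching run"
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean match = Character.getType((char) c) == category;
            if (match && start < 0) {
                start = c;                                // a run begins
            } else if (!match && start >= 0) {
                ranges.add(new int[] { start, c - 1 });   // a run ends
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] { start, 0xFFFF });      // run reaches the end
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<int[]> upper = rangesFor(Character.UPPERCASE_LETTER);   // category Lu
        int[] first = upper.get(0);
        // prints: first Lu range: U+0041-U+005A
        System.out.printf("first Lu range: U+%04X-U+%04X%n", first[0], first[1]);
    }
}
```

Each pair could then be turned into one CharacterRange-like descriptor, so an alias for a category would expand to an ordinary character list.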

Regards,

JR

Attila Szegedi

Nov 25, 2008, 6:03:00 AM
to freecc...@googlegroups.com

On 2008.11.24., at 23:31, Jonathan Revusky wrote:

>
> Kenneth Reid Beesley wrote:
>> *****
>>
>> There's another issue, and that is how to enhance JavaCC to allow the
>> lexer to refer to
>> Unicode supplementary characters (beyond the Basic Multilingual
>> Plane). The basic problem
>> may still lie in Java itself. I'm not completely up to date with
>> Java's latest handling of Unicode, but
>> earlier versions of Java kind of made a mess of supplementary
>> characters (in my opinion).

It is easy to get confused here. Saying that the internal
representation of Java strings is "Unicode" is an imprecise
oversimplification. An individual "char" in Java is simply an unsigned
16 bit value. As for java.lang.String, the sequences of those 16-bit
values are specifically to be interpreted as UTF-16. This means that
"only" the characters from the Basic Multilingual Plane (BMP) of
Unicode can be encoded in a single "char" value. In most practical
applications, this ain't much of a limitation, though, as up to
Unicode 3.0, all characters are in the BMP, that's why I put "only" in
quotes :-)

Unicode 3.1 does define some ~45000 new characters in other planes,
though, so if you wish to express those, you can, but they
need to be represented as two "char" values (and would throw off
the indexOf(), length() etc. methods as they'd no longer operate on
Unicode characters, but rather on UTF-16 code units). E.g. U+10000
is encoded as U+D800,U+DC00 in UTF-16.
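
A short Java check of the U+10000 example, using only java.lang methods available since Java 5; the class and method names are made up for illustration:

```java
// Illustrative only; the names here are hypothetical.
class SurrogatePairSketch {

    // Formats the UTF-16 code units of one code point as hex.
    static String utf16Units(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x10000));               // D800 DC00
        String s = new String(Character.toChars(0x10000));
        System.out.println(s.length());                        // 2 (chars)
        System.out.println(s.codePointCount(0, s.length()));   // 1 (character)
    }
}
```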

So, it can be said that java.lang.String uses UTF-16 encoding, and
that nicely maps exactly one "char" value to exactly one Unicode
character for all Unicode 3.0 characters.

Now, for another, completely independent issue:

To make matters even more interesting, both composed and decomposed
variants for characters are allowed, e.g. "ffi" can be equally
expressed as U+0066,U+0066,U+0069 or as the U+FB03 ligature. This is
not a Java problem, it's a trait of the Unicode standard. Sun finally
added java.text.Normalizer for Java 6 that allows you to obtain
various normalized decompositions/compositions of your Unicode
sequences. It is probably a good idea to pass your char sequence
through it before letting it be analyzed by a lexer; it'll definitely
make the lexer rules easier.
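
A minimal sketch of such a pre-lexing pass with java.text.Normalizer. One assumption worth flagging: the ffi ligature is a compatibility character, so folding it needs one of the compatibility forms (NFKC here); the purely canonical form NFC would leave U+FB03 as it is. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Illustrative only; not part of FreeCC.
class NormalizerSketch {

    // Compatibility decomposition followed by canonical composition.
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The U+FB03 ligature becomes the three letters f, f, i.
        System.out.println(normalize("\uFB03"));                      // ffi
        // "e" plus a combining acute accent composes to U+00E9.
        System.out.println(normalize("e\u0301").equals("\u00E9"));   // true
    }
}
```

Whether NFKC or a canonical form is the right choice depends on the language being lexed, since compatibility folding discards distinctions that some grammars may want to keep.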

If you need this functionality prior to Java 6, your best bet is IBM's
"International Components for Unicode" library:
<http://www-01.ibm.com/software/globalization/icu/index.jsp>
(I've been using it for years).

So, to summarize:
- you can run your input char stream through a Unicode normalizer in
order to not have to worry about composed/decomposed variants in
downstream processing
- as long as you're comfortable not having to handle Unicode 3.1
input, you also needn't worry about the fact that java.lang.String is
actually a UTF-16 encoded text representation. (I know, that's like
saying "as long as you're comfortable not having to handle non-ASCII
input, you needn't know what's UTF-8" :-) )

Attila.

Kenneth Reid Beesley

Nov 26, 2008, 4:48:45 AM
to freecc...@googlegroups.com

On 25 Nov 2008, at 04:03, Attila Szegedi wrote:

>
>
>>
>> Kenneth Reid Beesley wrote:
>>> *****
>>>
>>> There's another issue, and that is how to enhance JavaCC to allow
>>> the
>>> lexer to refer to
>>> Unicode supplementary characters (beyond the Basic Multilingual
>>> Plane). The basic problem
>>> may still lie in Java itself. I'm not completely up to date with
>>> Java's latest handling of Unicode, but
>>> earlier versions of Java kind of made a mess of supplementary
>>> characters (in my opinion).
>
> It is easy to get confused here. Saying that the internal
> representation of Java strings is "Unicode" is an imprecise
> oversimplification. An individual "char" in Java is simply an unsigned
> 16 bit value. As for java.lang.String, the sequences of those 16-bit
> values are specifically to be interpreted as UTF-16. This means that
> "only" the characters from the Basic Multilingual Plane (BMP) of
> Unicode can be encoded in a single "char" value. In most practical
> applications, this ain't much of a limitation, though, as up to
> Unicode 3.0, all characters are in the BMP, that's why I put "only" in
> quotes :-)

Attila,

Thanks for the overview, but I understand this stuff pretty well.
From the beginning,
Java was strongly devoted to the original 16-bit vision of Unicode,
but when supplementary characters were
introduced (theoretically with 2.0 in 1996 and de facto with 3.1 in
2001) it stumbled badly (in my opinion).


>
>
> Unicode 3.1 does define some ~45000 new characters in other planes,
> though, so if you wish to express those, you can, but they
> need to be represented as two "char" values (and would throw off
> the indexOf(), length() etc. methods as they'd no longer operate on
> Unicode characters, but rather on UTF-16 code units). E.g. U+10000
> is encoded as U+D800,U+DC00 in UTF-16.

Yes, this is what I find disappointing about the Java implementation.
Other languages have,
in my opinion, evolved to handle these supplementary characters much
better. For example:

The internal encoding of Perl unicode strings happens to be UTF-8, but
that's well
hidden from you. Perl Unicode strings can be visualized by the
programmer as sequences of Unicode
Characters, and indexing and length (and looping through the
characters in a string) work intuitively.

In Python, you get similar intuitive indexing/length/looping behavior
with a "ucs4" build, where strings are technically stored in UTF-32.

Java's internal representation of Strings is UTF-16, but it's not
hidden from you;
the programmer has to worry constantly about the Unicode character vs.
char gap. (And you get similar
problems with a 'ucs2' build of Python.)

In Perl code, you can represent any Unicode character (including
supplementary characters) with the \x{H....} notation.
In Python code, you can represent BMP characters with \uHHHH and
supplementary characters with \UHHHHHHHH.

The last time I looked (and I haven't looked closely at Java 6) Java
still had no convenient way to refer to a supplementary character in
code.
You have to calculate the two surrogate values and enter them
literally as \uHHHH\uHHHH. That's pretty lame (in my opinion).
Please update me as necessary.

********

And this brings me to the problem cited in my message: how to update
the JavaCC/FreeCC lexer syntax to allow reference to ranges of
supplementary characters. Suppose that I'm implementing a new
programming language wherein the IDs can contain any Unicode "letter"
character. I might
define a helper


< #UNICODE_ID_LETTER: [
"\u0041" - "\u005A" , // Latin
"\u0061" - "\u007A" ,
"\u00AA" ,
"\u00B5" ,
"\u00BA" ,
"\u00C0" - "\u00D6" ,
"\u00D8" - "\u00F6" ,
"\u00F8" - "\u02C1" ,
"\u02C6" - "\u02D1" ,
"\u02E0" - "\u02E4" ,
"\u02EC" ,
"\u02EE" ,
"\u0370" - "\u0374" , // Greek
"\u0376" - "\u0377" ,
"\u037A" - "\u037D" ,
"\u0386" ,
"\u0388" - "\u038A" ,
"\u038C" ,
"\u038E" - "\u03A1" ,
"\u03A3" - "\u03F5" ,
"\u03F7" - "\u03FF" ,
"\u0400" - "\u0481" , // Cyrillic
"\u048A" - "\u0523"

etc, etc, etc

] >

which works for the BMP, but what do I do when I want to include
Shavian, Deseret, Gothic, etc. from the supplementary area?
Using a Python-like notation, the Shavian range is

"\U00010450" - "\U0001047F"

but (the last time I looked) Java itself doesn't support such a
notation. If you're better informed than I am, please set me straight.
That's what I meant when I wrote that "the basic problem may still lie
in Java".
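
For what it is worth, the java.lang API (since Java 5) does let code work with supplementary characters as plain int code points, even though string literals still need surrogate escapes. A minimal sketch, assuming the Shavian bounds quoted above; the class, constants, and methods are hypothetical names, and this says nothing about how a lexer generator would handle such ranges:

```java
// Illustrative only; the names here are hypothetical.
class SupplementaryRangeSketch {

    // Shavian block bounds as int code points (values taken from the message above).
    static final int SHAVIAN_FIRST = 0x10450;
    static final int SHAVIAN_LAST = 0x1047F;

    static boolean isShavian(int codePoint) {
        return codePoint >= SHAVIAN_FIRST && codePoint <= SHAVIAN_LAST;
    }

    // Walks the string code point by code point rather than char by char.
    static int countShavian(String text) {
        int count = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isShavian(cp)) {
                count++;
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return count;
    }

    public static void main(String[] args) {
        // Build the string from code points instead of hand-computed surrogates.
        String text = new StringBuilder("ab")
                .appendCodePoint(0x10450)
                .appendCodePoint(0x10451)
                .toString();
        System.out.println(text.length());        // 6 chars
        System.out.println(countShavian(text));   // 2 Shavian characters
    }
}
```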

>
>
> Now, for another, completely independent issue:
>
> To make matters even more interesting, both composed and decomposed
> variants for characters are allowed, e.g. "ffi" can be equally
> expressed as U+0066,U+0066,U+0069 or as the U+FB03 ligature. This is
> not a Java problem, it's a trait of the Unicode standard. Sun finally
> added java.text.Normalizer for Java 6 that allows you to obtain
> various normalized decompositions/compositions of your Unicode
> sequences. It is probably a good idea to pass your char sequence
> through it before letting it be analyzed by a lexer; it'll definitely
> make the lexer rules easier.
>
> If you need this functionality prior to Java 6, your best bet is IBM's
> "International Components for Unicode" library:
> <http://www-01.ibm.com/software/globalization/icu/index.jsp>
> (I've been using it for years).

Yes, everyone should be aware of the Normalization challenge.

Before java.text.Normalizer I successfully used sun.text.Normalizer
for a while. As you say, ICU is often the best
solution for getting around the awkwardness of the Java implementation
of supplementary characters.
E.g. for iterating through a String character by character (rather
than char by char) you can use
com.ibm.icu.text.UCharacterIterator, but this is much less convenient
than the built-in support for supplementary
characters in Perl and Python.

>>
>
> So, to summarize:
> - you can run your input char stream through a Unicode normalizer in
> order to not have to worry about composed/decomposed variants in
> downstream processing

Of course.

>
> - as long as you're comfortable not having to handle Unicode 3.1
> input, you also needn't worry about the fact that java.lang.String is
> actually a UTF-16 encoded text representation. (I know, that's like
> saying "as long as you're comfortable not having to handle non-ASCII
> input, you needn't know what's UTF-8" :-) )

But this was the whole point :) Supplementary characters are
important to me,
and they've been an official and de facto part of the Unicode standard
since 2001. In November 2008 I shouldn't have to ignore
or suppress them. I'd like to define lexers that refer to
supplementary characters and ranges of supplementary characters,
just like they refer to BMP characters. Unicode is now at version
5.1.0, and if we settle for 3.0 we're aiming pretty low.

So the big question is: How might JavaCC/FreeCC be enhanced/expanded
to recognize tokens that include supplementary characters?
And the next obvious question is: If we did implement an expanded
FreeCC lexer syntax, e.g. something like

...
"\U00010330" - "\U0001034F", // Gothic
"\U00010400" - "\U0001044F", // Deseret
"\U00010450" - "\U0001047F", // Shavian
...

how would that translate into Java? Would we have to change the front-
end to rely on UCharacterIterator?

Best,

Attila Szegedi

Nov 27, 2008, 4:53:55 AM
to freecc...@googlegroups.com

A-ha!

Okay, I didn't understand that from your previous posting. I thought
that you specifically *don't* want to deal with supplementals
explicitly in the lexer.

> Unicode is now at version
> 5.1.0, and if we settle for 3.0 we're aiming pretty low.
>
> So the big question is: How might JavaCC/FreeCC be enhanced/expanded
> to recognize tokens that include supplementary characters?
> And the next obvious question is: If we did implement an expanded
> FreeCC lexer syntax, e.g. something like
>
> ...
> "\U00010330" - "\U0001034F", // Gothic
> "\U00010400" - "\U0001044F", // Deseret
> "\U00010450" - "\U0001047F", // Shavian
> ...
>
> how would that translate into Java? Would we have to change the
> front-
> end to rely on UCharacterIterator?

Well, yes, ideally this would work with an equivalent of
java.io.Reader that returns UCS-4 values directly. Funnily enough,
java.io.Reader#read() returns an int, so its signature is already
suitable for the task. Unfortunately, read(char[]) and read(char[],
int, int) aren't; they'd have to be changed to receive int[]
instead. And of course, FreeCC would internally need to deal with int
values instead of char values to represent characters. Not too big a
deal, actually; I don't think the FreeCC internals keep too many
arrays around, so having the processed chunk of text stored at 32 bits
per character is probably acceptable. The author of the parser would
need to make a transcoding decision, though, if they needed to
transform a sequence of UCS-4 characters into either a
java.lang.String object or a char[] object...
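
A rough sketch of the "Reader that returns UCS-4 values" idea, built on top of an ordinary Reader rather than replacing it. Everything here is an assumption for illustration: `CodePointReaderSketch`, `readCodePoint`, and `decodeAll` are hypothetical names, and an unpaired surrogate is simply passed through unchanged.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of FreeCC.
class CodePointReaderSketch {

    // Returns the next full code point as an int, or -1 at end of input.
    static int readCodePoint(PushbackReader in) throws IOException {
        int high = in.read();
        if (high < 0 || !Character.isHighSurrogate((char) high)) {
            return high;   // end of input, or an ordinary BMP char
        }
        int low = in.read();
        if (low >= 0 && Character.isLowSurrogate((char) low)) {
            return Character.toCodePoint((char) high, (char) low);
        }
        if (low >= 0) {
            in.unread(low);   // not a pair: put the second char back
        }
        return high;          // unpaired surrogate, returned as-is
    }

    // Convenience wrapper that decodes a whole string.
    static List<Integer> decodeAll(String text) {
        List<Integer> codePoints = new ArrayList<Integer>();
        try {
            PushbackReader in = new PushbackReader(new StringReader(text));
            int cp;
            while ((cp = readCodePoint(in)) != -1) {
                codePoints.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // "A" followed by U+10450, written as its surrogate pair.
        for (int cp : decodeAll("A\uD801\uDC50")) {
            System.out.printf("U+%04X%n", cp);   // U+0041, then U+10450
        }
    }
}
```

A lexer working on these int values could compare them directly against ranges such as 0x10450 to 0x1047F, leaving the decision about String versus int[] storage to the parser author.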
