
Is this Regular Expression for UTF-8 Correct??


Peter Olcott

May 13, 2010, 4:06:36 PM
Is this Regular Expression for UTF-8 Correct??

The solution is based on the GREEN portions of the first
chart shown
on this link:
http://www.w3.org/2005/03/23-lex-U

A semantically identical regular expression is also found on
the above link under "Validating lex Template".

1 ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Here is my version; the syntax is different, but the UTF8
portion should be semantically identical.

UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]

ASCII [\x0-\x7F]

U1 [a-zA-Z_]
U2 [\xC2-\xDF][\x80-\xBF]
U3 [\xE0][\xA0-\xBF][\x80-\xBF]
U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5 [\xED][\x80-\x9F][\x80-\xBF]
U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

// This identifies the "Letter" portion of an Identifier.
L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
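
As a cross-check only, here is a minimal C++ sketch that hard-codes
the same byte classes as U2-U9 above and reports how many bytes a
well-formed sequence occupies; the function name and layout are
illustrative, not part of the lex specification.

#include <cstddef>

static bool InRange(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

// Returns the length (1..4) of the UTF-8 sequence starting at s[0],
// or 0 if the bytes do not match any of the classes ASCII, U2..U9.
// 'avail' is the number of bytes available in the buffer.
static std::size_t Utf8SequenceLength(const unsigned char* s, std::size_t avail)
{
    if (avail >= 1 && s[0] <= 0x7F) return 1;                                  // ASCII
    if (avail >= 2 && InRange(s[0], 0xC2, 0xDF) && InRange(s[1], 0x80, 0xBF))
        return 2;                                                              // U2
    if (avail >= 3 && InRange(s[2], 0x80, 0xBF)) {
        if (s[0] == 0xE0 && InRange(s[1], 0xA0, 0xBF)) return 3;               // U3
        if (InRange(s[0], 0xE1, 0xEC) && InRange(s[1], 0x80, 0xBF)) return 3;  // U4
        if (s[0] == 0xED && InRange(s[1], 0x80, 0x9F)) return 3;               // U5
        if (InRange(s[0], 0xEE, 0xEF) && InRange(s[1], 0x80, 0xBF)) return 3;  // U6
    }
    if (avail >= 4 && InRange(s[2], 0x80, 0xBF) && InRange(s[3], 0x80, 0xBF)) {
        if (s[0] == 0xF0 && InRange(s[1], 0x90, 0xBF)) return 4;               // U7
        if (InRange(s[0], 0xF1, 0xF3) && InRange(s[1], 0x80, 0xBF)) return 4;  // U8
        if (s[0] == 0xF4 && InRange(s[1], 0x80, 0x8F)) return 4;               // U9
    }
    return 0;                                                                  // ill-formed
}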

I guess that most of the analysis may simply boil down to
whether or not the original source from the link is
considered reliable. I had forgotten this original source
when I first asked this question, which is why I am reposting
this same question again.


Leigh Johnston

May 13, 2010, 4:27:39 PM
"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:3sudnRDN849QxnHW...@giganews.com...

What has this got to do with C++? What is your C++ language question?

/Leigh

Peter Olcott

May 13, 2010, 4:36:24 PM

"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:GsGdnbYz-OIj_XHW...@giganews.com...

I will be implementing a utf8string to supplement
std::string and will be using a regular expression to
quickly divide up UTF-8 bytes into Unicode CodePoints.

Since there are no UTF-8 groups, or even Unicode groups, I
must post these questions to groups that are at most
indirectly related to this subject matter.


Leigh Johnston

May 13, 2010, 4:41:16 PM

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message

news:xMOdnahxZJNX_3HW...@giganews.com...


>>
>> What has this got to do with C++? What is your C++ language question?
>>
>> /Leigh
>
> I will be implementing a utf8string to supplement std::string and will be
> using a regular expression to quickly divide up UTF-8 bytes into Unicode
> CodePoints.
>
> Since there are no UTF-8 groups, or even Unicode groups I must post these
> questions to groups that are at most indirectly related to this subject
> matter.

Wrong: off-topic is off-topic. If I chose to write a Tetris game in C++ it
would be inappropriate to ask about the rules of Tetris in this newsgroup
even if there was not a more appropriate newsgroup.

/Leigh

Peter Olcott

May 13, 2010, 4:54:08 PM

"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:v7CdnY8dPrNy_nHW...@giganews.com...

I think that posting to the next most relevant group(s)
where a directly relevant group does not exist is right, and
thus you are simply wrong.


Leigh Johnston

May 13, 2010, 5:11:40 PM
"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:L76dnYQBH-5s-3HW...@giganews.com...

From this newsgroup's FAQ:

"Only post to comp.lang.c++ if your question is about the C++ language
itself."

Thus I am simply correct?

/Leigh

Ian Collins

May 13, 2010, 5:33:29 PM
On 05/14/10 08:06 AM, Peter Olcott wrote:
> Is this Regular Expression for UTF-8 Correct??

It's a fair bet you are off-topic in all the groups you have
cross-posted to. Why don't you pick a group for a language with built-in UTF8
and regexp support (PHP?) and badger them?

--
Ian Collins

Peter Olcott

May 13, 2010, 5:59:12 PM

"Leigh Johnston" <le...@i42.co.uk> wrote in message
news:TNydnVD6sciN9nHW...@giganews.com...

I can not accept that the "correct" answer is that some
questions can not be asked.


Peter Olcott

May 13, 2010, 6:04:52 PM

"Ian Collins" <ian-...@hotmail.com> wrote in message
news:8539h9...@mid.individual.net...

What does this question have to do with the C++ language?

At least my question is indirectly related to C++ by making
a utf8string for the C++ language from the regular
expression.

Your question is not even indirectly related to the C++
language.


Kai-Uwe Bux

May 13, 2010, 6:12:38 PM
Peter Olcott wrote:

>
> "Leigh Johnston" <le...@i42.co.uk> wrote in message
> news:TNydnVD6sciN9nHW...@giganews.com...
>> "Peter Olcott" <NoS...@OCR4Screen.com> wrote in message

[... on topicality of regular expressions ...]


>> From this newsgroup's FAQ:
>>
>> "Only post to comp.lang.c++ if your question is about the
>> C++ language itself."
>>
>> Thus I am simply correct?
>>
>> /Leigh
>
> I can not accept that the "correct" answer is that some
> questions can not be asked.

a) Whether you can accept that answer or not, it could still be correct.

b) However, it isn't: the question can be asked. It should just be asked
_elsewhere_. May I suggest comp.programming, where it seems to be on-topic.


Best

Kai-Uwe Bux

Joseph M. Newcomer

May 13, 2010, 6:31:56 PM
See below...

On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Leigh Johnston" <le...@i42.co.uk> wrote in message
>news:GsGdnbYz-OIj_XHW...@giganews.com...
>> "Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
>> news:3sudnRDN849QxnHW...@giganews.com...
>>> Is this Regular Expression for UTF-8 Correct??
>>>
>>> The solution is based on the GREEN portions of the first
>>> chart shown
>>> on this link:
>>> http://www.w3.org/2005/03/23-lex-U

****
Note that in the "green" areas, we find

U0482 Cyrillic thousands sign
U055A Armenian apostrophe
U055C Armenian exclamation mark
U05C3 Hebrew punctuation SOF Pasuq
U060C Arabic comma
U066B Arabic decimal separator
U0700-U0709 Assorted Syriac punctuation marks
U0966-U096F Devanagari digits 0..9
U09E6-U09EF Bengali digits 0..9
U09F2-U09F3 Bengali rupee marks
U0A66-U0A6F Gurmukhi digits 0..9
U0AE6-U0AEF Gujarati digits 0..9
U0B66-U0B6F Oriya digits 0..9
U0BE6-U0BEF Tamil digits 0..9
U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
U0BF3-U0BFA Tamil punctuation marks
U0C66-U0C6F Telugu digits 0..9
U0CE6-U0CEF Kannada digits 0..9
U0D66-U0D6F Malayalam digits 0..9
U0E50-U0E59 Thai digits 0..9
U0ED0-U0ED9 Lao digits 0..9
U0F20-U0F29 Tibetan digits 0..9
U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
U1040-U1049 Myanmar digits 0..9
U1360-U1368 Ethiopic punctuation marks
U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
U17E0-U17E9 Khmer digits 0..9
U1800-U180E Mongolian punctuation marks
U1810-U1819 Mongolian digits 0..9
U1946-U194F Limbu digits 0..9
U19D0-U19D9 New Tai Lue digits 0..9
...at which point I realized I was wasting my time, because I was attempting to disprove
what is a Really Dumb Idea, which is to write applications that actually work on UTF-8
encoded text.

You are free to convert these to UTF-8, but in addition, if I've read some of the
encodings correctly, the non-green areas preclude what are clearly "letters" in other
languages.

Forget UTF-8. It is a transport mechanism used at input and output edges. Use Unicode
internally.
****

***
For someone who had an unholy fixation on "performance", why would you choose such a slow
mechanism for doing recognition?

I can imagine a lot of alternative approaches, including having a table of 65,536
"character masks" for Unicode characters, including on-the-fly updating of the table, and
extensions to support surrogates, which would outperform any regular expression based
approach.

What is your criterion for what constitutes a "letter"? Frankly, I have no interest in
decoding something as bizarre as UTF-8 encodings to see if you covered the foreign
delimiters, numbers, punctuation marks, etc. properly, and it makes no sense to do so. So
there is no way I would waste my time trying to understand an example that should not
exist at all.

Why do you seem to choose the worst possible choice when there is more than one way to do
something? The choices are (a) work in 8-bit ANSI (b) work in UTF-8 (c) work in Unicode.
Of these, the worst possible choice is (b), followed by (a). (c) is clearly the winner.

So why are you using something as bizarre as UTF-8 internally? UTF-8 has ONE role, which
is to write Unicode out in an 8-bit encoding, and read Unicode in an 8-bit encoding. You
do NOT want to write the program in terms of UTF-8!
joe
****


>
>Since there are no UTF-8 groups, or even Unicode groups I
>must post these questions to groups that are at most
>indirectly related to this subject matter.
>

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Joseph M. Newcomer

May 13, 2010, 6:33:37 PM
Actually, what it does is give us another opportunity to point out how really bad this
design choice is, and thus Peter can tell us all we are fools for not answering a question
that should never have been asked, not because it is inappropriate for the group, but
because it represents the worst-possible-design decision that could be made.
joe

Victor Bazarov

May 13, 2010, 6:41:32 PM
On 5/13/2010 6:04 PM, Peter Olcott wrote:
> "Ian Collins"<ian-...@hotmail.com> wrote in message
> news:8539h9...@mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?

It does not have to have anything to do with C++. A post on the
topicality of another post is *always on topic*.

> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

<sarcasm>
I am about to hold a party where I expect my colleagues to show up.
They are all C++ programmers. Would the question on what to feed them,
or whether 1970s pop music is going to be appropriate, be on topic in
comp.lang.c++? It's *indirectly related* to C++, isn't it?
</sarcasm>

> Your question is not even indirectly related to the C++
> language.

See above.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Sam

May 13, 2010, 7:12:48 PM
Victor Bazarov writes:

This guy is a tool. He re-posted this question a second time because when he
first posted that snippet nobody cared either. But after watching the
struggle in the original thread, the ugly carnage appealed to the
infinitesimally small humanitarian aspect of my psyche sufficiently enough
to motivate myself into actually looking at the regexp monstrosity. But
after I explained why that spaghetti of a regexp does not jibe with RFC
2279, he got all huffy about it. He was confident that I was wrong, and that
the regular expression was right. But I was able to explain my reasoning, by
referencing directly to the contents of RFC 2279, and he was unable to
explain why he thought I was wrong, instead sprinkling more URLs to some
apparently orphaned web pages that said something else.

Which raised an obvious question: if he was so sure that his regular
expressions were correct, why was he asking? What exactly is the part of RFC
2279 that he didn't understand?

It seems to be his personality trait: when he asks a question, he thinks he
knows what the answer is, and every other answer is wrong. I can't figure
out what the real reason for asking the question must be, but I think I
really don't want to know the answer.

It remains to be seen how long it will take him to figure out that the
difficulty he has in getting someone to answer this might be, just might be,
due to the simple fact that this is one of these things that can be answered
simply by RTFMing. Really, UTF-8 is not some patented trade secret. Its
specifications are openly available, to anyone who wants to read them. And
anyone who reads them should be able to figure out the correct regexp for
themselves. It's not rocket science.

Amusingly, he's been trying to find the answer to this question longer than
it took myself, originally, to read RFC 2279, and implement encoding and
decoding of Unicode using UTF-8. In C++. Well, in C actually, but it's still
technically valid C++. Which, I guess, makes this on-topic, under the new
rules that just came down, by fiat.


Peter Olcott

May 13, 2010, 7:14:47 PM
Ah, so in other words, an extremely verbose "I don't know".
Let me take a different approach. Can postings on www.w3.org
generally be relied upon?

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:gprou55bvl3rgp2qm...@4ax.com...

Peter Olcott

May 13, 2010, 7:22:01 PM

This time I found the original source of a semantically identical
regular expression that you berated so rudely.
http://www.w3.org/2005/03/23-lex-U

Who knows, maybe www.w3.org is wrong and you are right?

Sam

May 13, 2010, 7:40:06 PM
Peter Olcott writes:

And as I wrote in the first thread, I suspected that the regular expression
mish-mash's actual purpose was to validate some defined subset of the
entire Unicode range, as encoded in UTF-8.

See message <cone.1273539693...@commodore.email-scan.com>,
where I wrote:

> I think what that regexp really does is match a subset of all valid
> UTF-8 sequences that corresponds with a subset of Unicodes that the
> author was interested in. It doesn't match all valid UTF-8 sequences,
> which the non-regexp version does.

And reading the "www.w3.org" link, it's clear that's exactly what it does,
and what the criterion is. Still, you replied as follows, in
<cvydnfcDPJR5MHXW...@giganews.com>:

> I think that your understanding might be less than complete. If you read
> the commentary you will see that your view is not supported.

Obviously, it's your thoughts that turned out to be "less than complete". That
regular expression does not validate whether an arbitrary octet stream is a
UTF-8-encoded unicode value sequence. That regular expression checks whether
an arbitrary octet stream is a UTF-8-encoded unicode value sequence
and all unicode values belong to a specific, defined subset of the entire
unicode value range.

Peter Olcott

May 13, 2010, 7:58:23 PM
On 5/13/2010 6:40 PM, Sam wrote:
>> This time I found the original source of a semantically identical
>> regular expression that you berated so rudely.
>> http://www.w3.org/2005/03/23-lex-U
>>
>> Who knows, maybe www.w3.org is wrong and you are right?
>
> And as I wrote in the first thread, I suspected that the regular
> expression mish-mash's actual purpose was to validate some defined a
> subset of the entire Unicode range, as encoded in UTF-8.

And this view is clearly incorrect. It validates the entire set of
UTF-8 encodings. Here is a quote:

"This pattern does not restrict to the set of
defined UCS characters, instead to the set that
is permitted by UTF-8 encoding."

The difference is the missing D800-DFFF High and Low surrogates that are
not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
represented.
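
A quick arithmetic check of that claim, using the standard 3-byte
decoding formula (illustrative code only, not from the w3.org page):

#include <cassert>

// Decode a 3-byte UTF-8 sequence: 1110xxxx 10yyyyyy 10zzzzzz.
static unsigned Decode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

int main()
{
    // The highest sequence the ED row accepts is ED 9F BF, i.e. U+D7FF.
    assert(Decode3(0xED, 0x9F, 0xBF) == 0xD7FF);
    // ED A0 80 would decode to U+D800, the first surrogate, which the
    // [\x80-\x9F] restriction on the second byte excludes.
    assert(Decode3(0xED, 0xA0, 0x80) == 0xD800);
    return 0;
}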

Sam

May 13, 2010, 9:01:36 PM
Peter Olcott writes:

Since you claim to know so much about UTF-8 encoding and decoding -- even
more than RFC 2279 -- it's a wonder you had to ask your question at all. It
seems that you already knew the answer to the question.

Good luck UTF-8 encoding and decoding.

Peter Olcott

May 13, 2010, 9:24:23 PM

http://tools.ietf.org/html/rfc3629
This memo obsoletes and replaces RFC 2279.

>
> Good luck UTF-8 encoding and decoding.
>

Thanks.

Joseph M. Newcomer

May 13, 2010, 11:17:38 PM
Yes, you have to accept that there are questions that should not be asked.

There are two right now: (a) how to write a parser that works on UTF-8 input (b) how to
disable mouse clicks during a lengthy computation. The correct answers to both questions
are "don't even try to do it that way! Redesign it so these are no longer problems".
Alternatively, think of it as "do not ask questions of how to solve problems which are the
direct result of incorrect design choices; change the design so the problem no longer
exists, then the question does not need to be asked"

Using UTF-8 is a particularly poor design choice. Doing long computations in the main GUI
thread is a particularly poor design choice. The questions would not arise if the poor
design choices had not been made. This is reality. The questions should not be asked,
because they indicate that poor choices have been made which make the questions necessary.

Answering the questions by giving an answer that solves what the questioner asked is not a
service to the person asking the question; what it allows is that a bad design decision is
allowed to stand, which will in turn lead to more problems, which produce more questions.
By doing the redesign and getting rid of the bad decisions, the problem goes away and
cannot return in the foreseeable future. The problem of using an MBCS like UTF-8 is not
limited to something like an r.e.; the problems of handling the character set are
pervasive and very complex, and will continue to plague the implementation. The problems
of doing long computations in the main GUI thread merely introduce more and more failure
modes that will have to be kludged around, so the correct answer is "redesign it". Poor
design does not go away by making one patch. The patches eventually grow scar tissue, as
one kludge leads to another which leads to another, and so on.

So get rid of the UTF-8 internally, write it for Unicode, and then the problem goes away.
joe

Joseph M. Newcomer

May 13, 2010, 11:18:58 PM
No, an extremely verbose "You are going about this completely wrong".
joe

Mihai N.

May 14, 2010, 4:30:00 AM

> I can imagine a lot of alternative approaches, including having a table of
> 65,536 "character masks" for Unicode characters

As we know, 65,536 (FFFF) is not enough, Unicode codepoints go to 10FFFF :-)

> What is your crtiterion for what constitutes a "letter"?

The best way to attack the identification is by using Unicode properties.
Each code point has attributes indicating whether it is a letter
(its General Category).

A good starting point is this:
http://unicode.org/reports/tr31/tr31-1.html

But this only shows that basing that on some UTF-8 kind of thing is not
the way. And how are you going to deal with combining characters?
Normalization?

There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
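
For example, assuming ICU4C is available, the per-code-point property
lookup is roughly this (a sketch only, not a drop-in solution; check
the ICU documentation for the exact headers and linkage):

#include <unicode/uchar.h>   // ICU4C

// Letter test by General Category (Lu, Ll, Lt, Lm, Lo).
static bool IsLetter(UChar32 c)
{
    return u_isalpha(c) != 0;
}

// UAX #31 recommends XID_Start/XID_Continue for identifiers rather
// than a plain "letter" test.
static bool IsIdentifierStart(UChar32 c)
{
    return u_hasBinaryProperty(c, UCHAR_XID_START) != 0;
}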


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Mihai N.

May 14, 2010, 4:40:39 AM
> Can postings on www.w3.org generally be relied upon?

For official documents, in general yes.
Unless it is some private post that says something like:
"It is not endorsed by the W3C members, team, or any working group."
(see http://www.w3.org/2005/03/23-lex-U)

And it also does not mean that a solution that is good enough to do some
basic UTF-8 validation for HTML is the right tool for writing a compiler.

Jasen Betts

May 14, 2010, 5:01:40 AM
On 2010-05-13, Peter Olcott <NoS...@OCR4Screen.com> wrote:
>
> "Ian Collins" <ian-...@hotmail.com> wrote in message
> news:8539h9...@mid.individual.net...
>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>> Is this Regular Expression for UTF-8 Correct??
>>
>> It's a fair bet you are off-topic in all the groups you
>> have cross posted to. Why don't you pick a group for a
>> language with built in UTF8 and regexp support (PHP?) and
>> badger them?
>>
>> --
>> Ian Collins
>
> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.

Just use iconv.

And don't cross-post off-topic.



Peter Olcott

May 14, 2010, 9:27:45 AM

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:68gpu599cjcsm3rjh...@4ax.com...

> No, an extremely verbose "You are going about this
> completely wrong".
> joe

Which still avoids rather than answers my question. This was
at one time a very effective ruse to hide the fact that you
don't know the answer. I can see through this ruse now, so
there is no sense in my attempting to justify my design
decision to you. That would simply be a waste of time.

Peter Olcott

May 14, 2010, 9:36:21 AM

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns9D78F42...@207.46.248.16...

>
>> I can imagine a lot of alternative approaches, including
>> having a table of
>> 65,536 "character masks" for Unicode characters
>
> As we know, 65,536 (FFFF) is not enough, Unicode
> codepoints go to 10FFFF :-)
>
>
>
>> What is your crtiterion for what constitutes a "letter"?
>
> The best way to attack the identification is by using
> Unicode properties
> Each code point has attributes indicating if it is a
> letter
> (General Category)
>
> A good starting point is this:
> http://unicode.org/reports/tr31/tr31-1.html
>
> But this only shows that basing that on some UTF-8 kind of
> thing is no
> the way. And how are you going to deal with combining
> characters?
> Normalization?

I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
character.

Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but I never could find anything as simple
as a table of the mapping of Unicode code points to their
category.

Peter Olcott

May 14, 2010, 9:41:37 AM

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns9D781110...@207.46.248.16...

>> Can postings on www.w3.org generally be relied upon?
>
> For official documents, in general yes.
> Unless it is some private post that says something like:
> "It is not endorsed by the W3C members, team, or any
> working group."
> (see http://www.w3.org/2005/03/23-lex-U)
>
> And also does not mean that a solution that is enough to
> do some basic
> utf-8 validation for html is the right tool for writing a
> compiler.

I am internationalizing the language that I am creating
within the timeframe that I have.

UTF-8 is the standard encoding for internet applications. It
works across every platform equally well without adaptation.
It does not care about Little or Big Endian; it simply works
everywhere correctly.

Pete Delgado

May 14, 2010, 12:38:57 PM

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:jfvou5ll41i9818ub...@4ax.com...

> Actually, what it does is give us another opportunity to point how how
> really bad this
> design choice is, and thus Peter can tell us all we are fools for not
> answering a question
> that should never have been asked, not because it is inappropriate for the
> group, but
> because it represents the worst-possible-design decision that could be
> made.
> joe

Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up
an even worse design as he did with his OCR project once he is given (and
ignores) input from the professionals whose input he claims to seek. ;)


-Pete


Peter Olcott

May 14, 2010, 12:53:30 PM

"Pete Delgado" <Peter....@NoSpam.com> wrote in message
news:uU4O0P48...@TK2MSFTNGP05.phx.gbl...

Most often I am not looking for "input from professionals";
I am looking for answers to specific questions.

I now realize that every non-answer response tends to be a
mask for the true answer of "I don't know".


Pete Delgado

May 14, 2010, 2:12:38 PM

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:v5OdnTkHxKqRHXDW...@giganews.com...

>
> "Pete Delgado" <Peter....@NoSpam.com> wrote in message
> news:uU4O0P48...@TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newc...@flounder.com> wrote in message
>> news:jfvou5ll41i9818ub...@4ax.com...
>>> Actually, what it does is give us another opportunity to point how how
>>> really bad this
>>> design choice is, and thus Peter can tell us all we are fools for not
>>> answering a question
>>> that should never have been asked, not because it is inappropriate for
>>> the group, but
>>> because it represents the worst-possible-design decision that could be
>>> made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up
>> an even worse design as he did with his OCR project once he is given (and
>> ignores) input from the professionals whos input he claims to seek. ;)
>>
>>
>> -Pete
>>
>>
>
> Most often I am not looking for "input from professionals", I am looking
> for answers to specific questions.

Which is one reason why your projects consistently fail. If you have a few
days, take a look at the book "Programming Pearls" by Jon
Bentley -specifically the first chapter. Sometimes making sure you are
asking the *right* question is more important than getting an answer to a
question. You seem to have a problem with that particular concept.

>
> I now realize that every non-answer response tends to be a mask for the
> true answer of "I don't know".

In my case, you should change "I don't know" in your sentence above to: "I
don't care"...

To clarify:

* I don't care to answer off-topic questions
* I don't care to answer questions where the answer will be ignored
* I don't care to have to justify a correct answer against an incorrect
answer
* I don't care to answer questions where the resident SME (Mihai) has
already guided you
* I don't care to feed the trolls

HTH

-Pete


Peter Olcott

May 14, 2010, 2:44:56 PM

"Pete Delgado" <Peter....@NoSpam.com> wrote in message
news:O8vhKE5...@TK2MSFTNGP04.phx.gbl...

>
>> Most often I am not looking for "input from
>> professionals", I am looking for answers to specific
>> questions.
>
> Which is one reason why your projects consistantly fail.
> If you have a few

None of my projects have ever failed. Some of my projects
inherently take an enormous amount of time to complete.

> days, take a look at the book "Programming Pearls" by Jon
> Bentley -specifically the first chapter. Sometimes making
> sure you are asking the *right* question is more important
> than getting an answer to a question. You seem to have a
> problem with that particular concept.

Yes, especially in those cases where I have already thought
the problem through completely using categorically
exhaustively complete reasoning.

In those rare instances anything at all besides a direct
answer to a direct question can only be a waste of time for
me.

Pete Delgado

May 14, 2010, 5:24:27 PM

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:f-6dnTCV1ce3B3DW...@giganews.com...

>
> "Pete Delgado" <Peter....@NoSpam.com> wrote in message
> news:O8vhKE5...@TK2MSFTNGP04.phx.gbl...
>>
>>> Most often I am not looking for "input from professionals", I am looking
>>> for answers to specific questions.
>>
>> Which is one reason why your projects consistantly fail. If you have a
>> few
>
> None of my projects have ever failed. Some of my projects inherently take
> an enormous amount of time to complete.

ROTFL

OK Peter... If you say so... :-) I suppose that is the benefit of doing
development solely for your own amusement. You can take inordinate amounts of
time and not have to care if the market passes you by or if the relevancy of
the software is diminished.


>
>> days, take a look at the book "Programming Pearls" by Jon
>> Bentley -specifically the first chapter. Sometimes making sure you are
>> asking the *right* question is more important than getting an answer to a
>> question. You seem to have a problem with that particular concept.
>
> Yes especially on those cases where I have already thought the problem
> through completely using categorically exhaustively complete reasoning.

That *sounds* nice, but if one considers your recent questions here as a
gauge of your success at reasoning out the problem and coming up with a
realistic, workable solution, it seems that your words and deeds do not
match.

>
> In those rare instances anything at all besides a direct answer to a
> direct question can only be a waste of time for me.

...which is why, long ago, I suggested that you simply hire a consultant.

-Pete


Bill Snyder

May 14, 2010, 6:59:12 PM
On Fri, 14 May 2010 17:24:27 -0400, "Pete Delgado"
<Peter....@NoSpam.com> wrote:

>
>"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
>news:f-6dnTCV1ce3B3DW...@giganews.com...

>> None of my projects have ever failed. Some of my projects inherently take

>> an enormous amount of time to complete.
>
>ROTFL
>
>OK Peter.. If you say so... :-) I suppose that is the benefit of doing
>development soley for your own amusement. You can take inordinate amounts of
>time and not have to care if the market passes you by or if the relevancy of
>the software is diminished.

A project with an infinitely-extensible deadline can never fail;
it can only require more work.


--
Bill Snyder [This space unintentionally left blank]

Peter Olcott

May 14, 2010, 11:42:16 PM

"Joseph M. Newcomer" <newc...@flounder.com> wrote in
message news:68gpu599cjcsm3rjh...@4ax.com...

Do you know anywhere where I can get a table that maps all
of the code points to their category?

>>> ...at which point I realized I was wasting my time,
>>> because I was attempting to disprovde
>>> what is a Really Dumb Idea, which is to write
>>> applications
>>> that actually work on UTF-8
>>> encoded text.
>>>
>>> You are free to convert these to UTF-8, but in addition,
>>> if I've read some of the
>>> encodings correctly, the non-green areas preclude what
>>> are
>>> clearly "letters" in other
>>> languages.
>>>
>>> Forget UTF-8. It is a transport mechanism used at input
>>> and output edges. Use Unicode
>>> internally.

That is how I intend to use it. To internationalize my GUI
scripting language the interpreter will accept UTF-8 input
as its source code files. It is substantially implemented
using Lex and Yacc specifications for "C" that have been
adapted to implement a subset of C++.

It was far easier (and far less error prone) to add the C++
that I needed to the "C" specification than it would have
been to remove what I do not need from the C++
specification.

The actual language itself will store its strings as 32-bit
codepoints. The SymbolTable will not bother to convert its
strings from UTF-8. It turns out that UTF-8 byte sort order
is identical to Unicode code point sort order.
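
A tiny illustration of that ordering property (standard library only;
the two sample code points are arbitrary):

#include <cassert>
#include <string>

int main()
{
    std::string a("\xC3\xA9");        // U+00E9 in UTF-8
    std::string b("\xE4\xB8\xAD");    // U+4E2D in UTF-8
    // For well-formed UTF-8, lexicographic comparison of the bytes
    // agrees with comparison of the code points (0x00E9 < 0x4E2D).
    assert(a < b);
    return 0;
}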

I am implementing a utf8string that will provide the most
useful subset of the std::string interface. I need the
regular expression for Lex, and it can also be easily
converted into a DFA to very quickly and correctly break up
a UTF-8 string into its constituent code points.
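
For what it's worth, a minimal sketch of that code-point split in
plain C++, assuming the input has already passed the UTF8 pattern
above (names are illustrative):

#include <string>
#include <vector>
#include <stdint.h>

// Convert a validated UTF-8 string into 32-bit code points.
// No error handling is shown here.
static std::vector<uint32_t> Utf8ToCodePoints(const std::string& s)
{
    std::vector<uint32_t> out;
    for (std::string::size_type i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        uint32_t cp;
        std::string::size_type len;
        if      (b < 0x80) { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        for (std::string::size_type k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}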

Do you know anywhere where I can get a table that maps all
of the code points to their category?

It is a shame that Microsoft will be killing this group next
month; where will we go?

Mihai N.

May 15, 2010, 6:21:47 AM

> Do you know anywhere where I can get a table that maps all
> of the code points to their category?

ftp://ftp.unicode.org/Public/5.2.0/ucd

UnicodeData.txt
The main guide for that is ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
(if you don't want to go through the standard, which is the advisable thing)

And when you bump your head, remember that joe and I warned you about UTF-8.
It was not designed for this kind of usage.

Peter Olcott

May 15, 2010, 10:12:09 AM

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns9D792235...@207.46.248.16...

>
>> Do you know anywhere where I can get a table that maps
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>
> UnicodeData.txt
> The main guide for that is
> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
> (if you don't want to go thru the standard, which is the
> adviseable thing)
>
> And when you bump your head, remeber that joe and I warned
> you about utf-8.
> It was not designed for this kind of usage.
>
>
Joe also said that UTF-8 was designed for data interchange,
which is how I will be using it. Joe also falsely assumed
that I would be using UTF-8 for my internal representation.
I will be using UTF-32 for my internal representation.

I will be using UTF-8 as the source code for my language
interpreter, which has the advantage of simply being ASCII
for the English language, and working across every platform
without requiring adaptations such as Little Endian and Big
Endian. UTF-8 will also be the output of my OCR4Screen DFA
recognizer.

Peter Olcott

May 15, 2010, 11:08:25 AM

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns9D792235...@207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd
>

What I am looking for is a mapping from Unicode code points
(compressed into code point ranges when possible) to General
Category values as two-character abbreviations. I will look
through this first link to see if
I can find this. Initially I saw a lot of things that were
not this.

Peter Olcott

May 15, 2010, 11:48:25 AM

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns9D792235...@207.46.248.16...
>
>> Do you know anywhere where I can get a table that maps
>> all
>> of the code points to their category?
>
> ftp://ftp.unicode.org/Public/5.2.0/ucd

I found the table that I was looking for here:
ftp://ftp.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
Thanks for all your help.
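
For reference, UnicodeData.txt is semicolon-separated: field 0 is the
code point in hex and field 2 is the two-letter General Category. A
rough C++ sketch of loading that mapping (illustrative only; note it
ignores the "First>"/"Last>" range entries, which have to be expanded
separately):

#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include <cstdlib>
#include <stdint.h>

// Maps a code point to its two-letter General Category (e.g. "Lu", "Nd").
static std::map<uint32_t, std::string> LoadGeneralCategories(const char* path)
{
    std::map<uint32_t, std::string> table;
    std::ifstream in(path);                   // e.g. "UnicodeData.txt"
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string cp, name, category;
        std::getline(fields, cp, ';');        // field 0: code point (hex)
        std::getline(fields, name, ';');      // field 1: character name
        std::getline(fields, category, ';');  // field 2: General Category
        if (!cp.empty())
            table[std::strtoul(cp.c_str(), 0, 16)] = category;
    }
    return table;
}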

Joseph M. Newcomer

May 17, 2010, 12:03:39 AM
How about a non-answer is a substitute for "this is the most incredibly stupid idea I have
seen in decades, and I'm not going to waste my time pointing out the obvious silliness of
it"?

You are again spending massive effort to solve an artificial problem of your own creation,
caused by making poor initial design choices, and supported by nonsensical
rationalizations. A professional programmer knows certain patterns (that is our
strength!) and among these are the recognition that if you have to implement complex
solutions to simple problems, you have made a bad design choice and are best served by
re-examining the design choices and making design choices that eliminate the need for
complex solutions, particularly when the complexity simply goes away if a different set of
solutions is postulated.

Personally, if I had to do a complex parser design, I'd want to eliminate the need to deal
with UTF-16 surrogates, and I'd write my code in terms of UTF-32. Much simpler, and
isolates the complexity at the input and output edges, not making it uniformly
distributed throughout the code. And I'd know not to make childish decisions such as "it
costs too much to do the conversion" because I outgrew those kinds of arguments certainly
by 1980 (that's thirty years ago). My first instance of this was a typesetting program I
did around 1970 where I stored the text as 9-bit rather than 7-bit bytes because I could
encode font information more readily in the upper two bits. And I didn't even CONSIDER the
size and performance issues of 9-bit vs. 7-bit bytes because I knew they didn't matter in
the slightest. So I guess I learned this lesson 40 years ago. It greatly simplified the
internal coding.

But you are sounding like a first-semester programmer who was taught by some old PDP-11
programmer, and I don't buy either the size or the conversion performance arguments. You
don't even have NUMBERS to argue your position! Optimization decisions that are argued
without quantitative supporting measurements are almost always wrong. But we've had this
discussion before, and your view is "My mind is made up, don't require me to get FACTS to
support my decision!" In the Real World, before we can justify wasting lots of programmer
time to implement bad decisions, we require justification. But maybe that's just my
project management experience talking. Horrible, this dependence on reality that I have.

If someone came to me with such a design, and was as insistent as you will be, my first
requirement would be "Write a program that reads UTF-8 files of the expected size, then
writes them back out. Measure its performance reading several dozen different files, and
run each experiment 100 times, measuring the time-to-completion". Then "modify the
program to convert the data to UTF-16, convert it back to UTF-8, and run the same
experiment set. Demonstrate that the change in the mean time is statistically
significant". Hell, the variation of LOADING the PROGRAM is going to differ from
experiment to experiment by a variance several orders of magnitude greater than the
conversion cost! So don't try to make the case that the conversion cost matters; the
truth, based on actual performance measurements end-to-end, is that it does not. But, not
having actually done performance measurement, you don't understand that. Those of us who
devoted nontrivial parts of our lives to optimizing program performance KNOW what the
problems are, and know that the conversion cannot possibly matter.
joe
****

Joseph M. Newcomer

May 17, 2010, 12:28:29 AM
See below...

On Fri, 14 May 2010 13:44:56 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Pete Delgado" <Peter....@NoSpam.com> wrote in message
>news:O8vhKE5...@TK2MSFTNGP04.phx.gbl...
>>
>>> Most often I am not looking for "input from
>>> professionals", I am looking for answers to specific
>>> questions.
>>
>> Which is one reason why your projects consistantly fail.
>> If you have a few
>
>None of my projects have ever failed. Some of my projects
>inherently take an enormous amount of time to complete.

****
Not something you should brag about. Going back to my original comments, you are creating
an artificially complex solution to what should be a simple problem, by making bad design
choices and then warping reality to support them, when the correct answer is "Don't do it
that way". If you simplify the problem, you get do make decisions which can be
implemented more readily, those decreasing the amount of time required to complete them.
****


>
>> days, take a look at the book "Programming Pearls" by Jon
>> Bentley -specifically the first chapter. Sometimes making
>> sure you are asking the *right* question is more important
>> than getting an answer to a question. You seem to have a
>> problem with that particular concept.
>
>Yes especially on those cases where I have already thought
>the problem through completely using categorically
>exhaustively complete reasoning.

*****
There is no such thing in the world we live in. You have made a number of false
assumptions (for example, that conversion time is statistically significant relative to
other performance issues) and used that set of false assumptions to drive a set of design
decisions which make no sense if you take reality into consideration. For example, there
is no possible way the UTF-8-to-UTF-16 conversion could possibly take longer to handle than a
single page fault, but you are optimizing it out of existence without realizing that
simply loading the program will have orders of magnitude greater variance than this cost.
This is because you are working with the assumptions that (a) loading a program takes
either zero time or a fixed time each time it is loaded and (b) opening the file you are
reading takes either zero time or a fixed time each time it is opened. Sadly, neither of
these assumptions are valid, and consequently if you run 100 experiments of loading and
executing the program, these two parameters will dominate the total performance by orders
of magnitude more than the cost of the conversion! So you are trying to optimize
something that is statistically insignificant!
****


>
>In those rare instances anything at all besides a direct
>answer to a direct question can only be a waste of time for
>me.

*****
You want a direct answer: the design to use UTF-8 internally is a Really Stupid Idea!
DON'T WASTE YOUR TIME TRYING TO DO IT! That's the DIRECT answer. Everything else is
wasting our time trying to tell you in simple words that even you might understand just
WHY it is a Really Stupid Idea.

There is no point in trying to analyze the regexp because I can not believe why any
intelligent programmer would WANT to use such a bad design! Therefore, it was a bad
question and does not deserve getting an answer; the correct answer is to do the job
right. You have this fixation that if you pose what is clearly a bad design, we experts
are supposed to sit back and encourage bad design decisions? That is not what we do.

We feel a little bit like Calvin's dad from the old "Calvin and Hobbes" cartoons. Calvin
comes over to his father and says "Dad, can I have a chain saw" and his father says "no".
Calvin goes away feeling unhappy, and in the last of the four panels says "but now how am
I going to learn how to juggle?"

If you want to juggle chain saws, we aren't going to answer your questions on how to do
it. We will try to advise you that juggling running chain saws is probably a Really
Stupid Idea. If you were an experienced knife juggler, and could juggle flaming torches,
we might suggest that there are approaches to this, but your idea that you can apply
categorical reasoning to the problem of chain-saw juggling when you have clearly
demonstrated by your question that you have never once juggled anything, makes us leery of
encouraging you to continue this practice.

Note that "categorical reasoning" does not turn into a deep understanding of fundamentally
stochastic processes. Las Vegas casinos would love you, because you would try to apply
this technique to, say, roulette wheels and dice, and guess who wins?

Prove, by exhaustive categorical reasoning, that loading a program takes a fixed amount of
time. Then I'll credit its power.
joe
****

Joseph M. Newcomer

May 17, 2010, 12:33:31 AM
See below...

On Fri, 14 May 2010 08:27:45 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newc...@flounder.com> wrote in
>message news:68gpu599cjcsm3rjh...@4ax.com...
>> No, an extremely verbose "You are going about this
>> completely wrong".
>> joe
>
>Which still avoids rather than answers my question. This was
>at one time a very effective ruse to hide the fact that you
>don't know the answer. I can see through this ruse now, so
>there is no sense in my attempting to justify my design
>decision to you. That would simply be a waste of time.

****
I think I answered part of it. The part that matters. The part that says "this is
wrong". I did this by pointing out some counterexamples.

I know the answer: Don't Do It That Way. You are asking for a specific answer that will
allow you to pursue a Really Bad Design Decision. I'm not going to answer a bad question;
I'm going to tell you what the correct solution is. I'm avoiding the question because it
is a really bad question, because you should be able to answer it yourself, and because
giving an answer simply justifies a poor design. I don't justify poor designs, I try to
kill them.

Only you could make a bad design decision and feel you have to justify it. Particularly
when the experts have already all told you it is a bad design decision, and you should not
go that way.
joe

Joseph M. Newcomer

May 17, 2010, 12:39:09 AM

See below...

****
You don't need to. There's an API that does that. Go read the Unicode support. You can
also read the code for my Locale Explorer, or you can download the table from
www.unicode.org.
joe
****


>
>>>> ...at which point I realized I was wasting my time,
>>>> because I was attempting to disprovde
>>>> what is a Really Dumb Idea, which is to write
>>>> applications
>>>> that actually work on UTF-8
>>>> encoded text.
>>>>
>>>> You are free to convert these to UTF-8, but in addition,
>>>> if I've read some of the
>>>> encodings correctly, the non-green areas preclude what
>>>> are
>>>> clearly "letters" in other
>>>> languages.
>>>>
>>>> Forget UTF-8. It is a transport mechanism used at input
>>>> and output edges. Use Unicode
>>>> internally.
>
>That is how I intend to use it. To internationalize my GUI
>scripting language the interpreter will accept UTF-8 input
>as its source code files. It is substantially implemented
>using Lex and Yacc specifications for "C" that have been
>adapted to implement a subset of C++.

*****
So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
thing you should do with it is convert it to UTF-16, or better still UTF-32.
****


>
>It was far easier (and far less error prone) to add the C++
>that I needed to the "C" specification than it would have
>been to remove what I do not need from the C++
>specification.

***
Huh? What's this got to do with the encoding?
***


>
>The actual language itself will store its strings as 32-bit
>codepoints. The SymbolTable will not bother to convert its
>strings from UTF-8. It turns out that UTF-8 byte sort order
>is identical to Unicode code point sort order.

****
Strange. I thought sort order was locale-specific and independent of code points. But
then, maybe I just understand what is going on.
*****


>
>I am implementing a utf8string that will provide the most
>useful subset of the std::string interface. I need the
>regular expression for Lex, and it also can be easily
>converted into a DFA to very quickly and completely
>correctly break of a UTF-8 string into its code point
>constituent parts.
>
>Do you know anywhere where I can get a table that maps all
>of the code points to their category?

*****
www.unicode.org

Also, there is an API call that does this, and you can check the source of my Locale
Explorer to find it.
joe
****

Joseph M. Newcomer

May 17, 2010, 12:42:45 AM
See below...

On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

>
>"Mihai N." <nmihai_y...@yahoo.com> wrote in message
>news:Xns9D792235...@207.46.248.16...
>>
>>> Do you know anywhere where I can get a table that maps
>>> all
>>> of the code points to their category?
>>
>> ftp://ftp.unicode.org/Public/5.2.0/ucd
>>
>> UnicodeData.txt
>> The main guide for that is
>> ftp://ftp.unicode.org/Public/5.1.0/ucd/UCD.html
>> (if you don't want to go thru the standard, which is the
>> adviseable thing)
>>
>> And when you bump your head, remeber that joe and I warned
>> you about utf-8.
>> It was not designed for this kind of usage.
>>
>>
>Joe also said that UTF-8 was designed for data interchange
>which is how I will be using it. Joe also falsely assumed
>that I would be using UTF-8 for my internal representation.
>I will be using UTF-32 for my internal representation.

****
But then, you would not need the UTF-8 regexps! You would only need those if you were
storing the data in UTF-8. To give an external grammar to your language, you should give
the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
start with UTF-8. The lex input would need to be in terms of UTF-32, so you would not be
using UTF-8 there, either.
****


>
>I will be using UTF-8 as the source code for my language
>interpreter, which has the advantage of simply being ASCII
>for the English language, and working across every platform
>without requiring adaptations such as Little Endian and Big
>Endian. UTF-8 will also be the output of my OCR4Screen DFA
>recognizer.
>
>>
>> --
>> Mihai Nita [Microsoft MVP, Visual C++]
>> http://www.mihai-nita.net
>> ------------------------------------------
>> Replace _year_ with _ to get the real email
>>
>

Joseph M. Newcomer

May 17, 2010, 12:54:07 AM
See below..

On Fri, 14 May 2010 01:30:00 -0700, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>> I can imagine a lot of alternative approaches, including having a table of
>> 65,536 "character masks" for Unicode characters
>
>As we know, 65,536 (FFFF) is not enough, Unicode codepoints go to 10FFFF :-)

****
Yes, but this leads to questions of how to build sparse encodings or handle surrogates
with secondary tables, and I did not want to confuse the issue. First-cut performance
would be to use a 64K table, and for values above FFFF decode to a secondary table.

But this would be too much reality to absorb at once.
****
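
The shape of such a two-level table, purely as a sketch of the idea
(nothing here comes from an existing library):

#include <map>
#include <stdint.h>

// Two-level classification table: a dense 64K array for the BMP and a
// sparse map for supplementary-plane code points (U+10000..U+10FFFF).
struct CharMaskTable
{
    uint32_t bmp[0x10000];                 // one mask word per BMP code point
    std::map<uint32_t, uint32_t> high;     // sparse storage above U+FFFF

    uint32_t Lookup(uint32_t cp) const
    {
        if (cp < 0x10000) return bmp[cp];
        std::map<uint32_t, uint32_t>::const_iterator it = high.find(cp);
        return it == high.end() ? 0 : it->second;
    }
};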


>
>
>
>> What is your crtiterion for what constitutes a "letter"?
>
>The best way to attack the identification is by using Unicode properties
>Each code point has attributes indicating if it is a letter
>(General Category)
>
>A good starting point is this:
> http://unicode.org/reports/tr31/tr31-1.html
>
>But this only shows that basing that on some UTF-8 kind of thing is no
>the way. And how are you going to deal with combining characters?
>Normalization?

****
Ahh, that old concept, "reality" again. This is what I meant by the question about what
constitutes a letter; for example, the trivial case of a byte sequence that encodes a
nonspacing accent mark with a letter that follows requires a separate lexical rule because
lex only works on actual input characters (even if modified to support UTF-32), and
therefore the overly-simplistic regexp shown is clearly untenable. But again, I did not
want to point out the subtleties because I would not have been using exhaustive
categorical reasoning to derive why the question was a stupid question. So I pointed out
just the most trivial of failure modes, and asked a fundamental question, for which, alas,
you gave the answer (thus cheating me out of further annoying Peter by forcing him to
actually think the problem through). Vowel marks in some languages (e.g., Hebrew) are
another counterexample.

Even UTF-32 doesn't solve the "what is a letter" question! Which is why the regexp rules
are clearly bad!
joe
****


>
>There are very good reasons why the rule of thumb is:
> - UTF-16 or UTF-32 for processing
> - UTF-8 for storage/exchange

David Schwartz

May 17, 2010, 5:13:29 AM
On May 13, 2:59 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> I can not accept that the "correct" answer is that some
> questions can not be asked.

Nobody said that. That's your straw man.

If you go to an Alcoholics Anonymous meeting and want to talk about
your sex addiction, they will politely explain to you that this particular
group is for people who have problems with alcohol. It matters not
whether there is or isn't some group for sex addicts around. Simply
put, AA folks deal with alcohol addiction and they shouldn't even have
to think about whether there is or isn't some other group for sex
addicts that you may or may not find suitable. That's not their
problem.

DS

David Schwartz

May 17, 2010, 5:17:03 AM
On May 13, 3:04 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> What does this question have to do with the C++ language?
>
> At least my question is indirectly related to C++ by making
> a utf8string for the C++ language from the regular
> expression.
>

> Your question is not even indirectly related to the C++
> language.

Unfortunately, no better way is known to keep conversations on topic.
If you know a better way, we'd all love to hear it. If you don't
respond immediately in the forum and point out that something is off
topic, other people browsing the forum will think the question was on
topic. Other ways have been tried in the past (such as private mails
where possible, monthly posts about topicality rather than replying to
each off-topic post, and so on). None have been shown to be effective.

Painful experience has shown that the most effective technique is to
verbally berate and ridicule people who post off topic. Thus others
will see the negative response by the group and not want their posts
to be met with a similar response.

Again, this wasn't anyone's first choice, and if you know a better
way, please tell us. (In the appropriate forum, of course!)

DS

Peter Olcott

May 17, 2010, 9:34:25 AM

You probably have a point here. My "devil's advocate" counter-argument
is showing up all of the nuances of the alternative design decisions.

Where am I going to be able to talk to you when Microsoft shuts down the
microsoft.public.* hierarchy?

Peter Olcott

May 17, 2010, 9:49:14 AM

Ultimately all UTF-8 validators must be regular expressions implemented
as finite state machines. I can't imagine a better way.

Peter Olcott

May 17, 2010, 9:57:58 AM

If a decision is truly bad, then there must be dysfunctional results
that make the decision a bad one. If dysfunctional results can not be
provided, then the statement that it is a bad decision lacks sufficient
support. My original intention was to use UTF-32 as my internal
representation. I have not yet decided to alter this original decision.

The fact that someone provided an example where UTF-8 strings would
often substantially vary in length provides the best counterexample
showing that your view is likely correct about internal representation.

In fact I will simply state that I am now convinced that UTF-32 is the
best way to go.

I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
completed using Lex and Yacc. Besides this I need a good way to parse
UTF-8 to convert it to UTF-32.

Peter Olcott

May 17, 2010, 10:04:07 AM
On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>
>> That is how I intend to use it. To internationalize my GUI
>> scripting language the interpreter will accept UTF-8 input
>> as its source code files. It is substantially implemented
>> using Lex and Yacc specifications for "C" that have been
>> adapted to implement a subset of C++.
> *****
> So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
> thing you should do with it is convert it to UTF-16, or better still UTF-32.
> ****
>>
>> It was far easier (and far less error prone) to add the C++
>> that I needed to the "C" specification than it would have
>> been to remove what I do not need from the C++
>> specification.
> ***
> Huh? What's this got to do with the encoding?
(1) Lex requires a RegEx

(2) I still must convert from UTF-8 to UTF-32, and I don't think that a
faster or simpler way to do this than a regular expression
implemented as a finite state machine can possibly exist.


>> The actual language itself will store its strings as 32-bit
>> codepoints. The SymbolTable will not bother to convert its
>> strings from UTF-8. It turns out that UTF-8 byte sort order
>> is identical to Unicode code point sort order.
> ****
> Strange. I though sort order was locale-specific and independent of code points. But
> then, maybe I just understand what is going on.

The SymbolTable only needs to be able to find its symbols in a std::map.
Accounting for locale specific sort order is a waste of time in this case.

Peter Olcott

May 17, 2010, 10:07:20 AM
On 5/16/2010 11:42 PM, Joseph M. Newcomer wrote:
> See below...
> On Sat, 15 May 2010 09:12:09 -0500, "Peter Olcott"<NoS...@OCR4Screen.com> wrote:
>
>> Joe also said that UTF-8 was designed for data interchange
>> which is how I will be using it. Joe also falsely assumed
>> that I would be using UTF-8 for my internal representation.
>> I will be using UTF-32 for my internal representation.
> ****
> But then, you would not need the UTF-8 regexps! You would only need those if you were
> storing the data in UTF-8. To give an external grammar to your language, you should give
> the UTF-32 regexps, and if necessary, you can TRANSLATE those to UTF-8, but you don't
> start with UTF-8. The lex input would need to be in terms of UTF-32, so you would not be
> using UTF-8 there, either.

My source code encoding will be UTF-8. My interpreter is written in Yacc
and Lex, and 75% complete. This makes a UTF-8 regular expression mandatory.

Joseph M. Newcomer

unread,
May 17, 2010, 12:24:43 PM5/17/10
to
See below...

On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/16/2010 11:39 PM, Joseph M. Newcomer wrote:
>>>
>>> That is how I intend to use it. To internationalize my GUI
>>> scripting language the interpreter will accept UTF-8 input
>>> as its source code files. It is substantially implemented
>>> using Lex and Yacc specifications for "C" that have been
>>> adapted to implement a subset of C++.
>> *****
>> So why does the question matter? Accepting UTF-8 input makes perfect sense, but the first
>> thing you should do with it is convert it to UTF-16, or better still UTF-32.
>> ****
>>>
>>> It was far easier (and far less error prone) to add the C++
>>> that I needed to the "C" specification than it would have
>>> been to remove what I do not need from the C++
>>> specification.
>> ***
>> Huh? What's this got to do with the encoding?
>(1) Lex requires a RegEx

***
Your regexp should be in terms of UTF32, not UTF8.
****


>
>(2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>faster or simpler way to do this besides a regular expression
>implemented as a finite state machine can possibly exist.

****
Actually, there is; you obviously know nothing about UTF-8, or you would know that the
high-order bits of the first byte tell you the length of the encoding, and the FSM is
written entirely in terms of the actual encoding, and is never written as a regexp.

RTFM.

You are expected to have spent a LITTLE time reading about a subject before asking a
question.
****


>
>
>>> The actual language itself will store its strings as 32-bit
>>> codepoints. The SymbolTable will not bother to convert its
>>> strings from UTF-8. It turns out that UTF-8 byte sort order
>>> is identical to Unicode code point sort order.
>> ****
>> Strange. I thought sort order was locale-specific and independent of code points. But
>> then, maybe I just understand what is going on.
>
>The SymbolTable only needs to be able to find its symbols in a std::map.
>Accounting for locale specific sort order is a waste of time in this case.

****
OK, then it is not sort order, and the fact that the byte-encoded sort is in the same
order is irrelevant, so why did you mention it as if it had meaning? std::map doesn't
care about what YOU mean by "sort order", it only requires byte sequences for keys, where
the interpretation of the byte sequence is a function of the data type.

But the text should already be in UTF-32! Why are you wasting time worrying about UTF-8?
joe
****

Joseph M. Newcomer

unread,
May 17, 2010, 12:31:26 PM5/17/10
to
See below...

***
Perhaps. But since you obviously never read the documentation on UTF-8, you did not
realize that it is based entirely on the number of high-order 1-bits in the first byte of
the sequence. This does not require a regexp written in terms of byte ranges to handle
correctly, and nobody writing a UTF-8 validator would consider this approach. The set of
rules you are citing are essentially "if my only tool is a regexp recognizer, how can I
solve a trivial problem using that tool?". It doesn't say the solution is a *good*
solution, only that it is *a* solution in the artificially constrained solution space. If
you remove the constraint (as most practical programmers would) then the complexity
evaporates.

So not only is it *possible* to imagine a better way, the better way is *actually
documented* in the Unicode spec! I'm sorry you have such a limited imagination, but it is
one of your weaknesses. You start plunging down a garden path and never once ask "is this
the right or best path to my goal?" It may not even lead to the goal, but you love to
fasten on worst-possible-paths and then berate the rest of us for telling you that your
choice is wrong.
joe
****

Joseph M. Newcomer

unread,
May 17, 2010, 12:51:07 PM5/17/10
to
See below...

****
Dysfunctional results:
Horrible costs to do regexp manipulation when none is needed
Added complexity distributed uniformly over the entire implementation
Actually not correct because it ignores
-localized punctuation
-localized numbers
-bidirectional text
Other than it is needlessly complex, horribly inefficient, and wrong, what more do you
need to know?
***


>
>The fact that someone provided an example where UTF-8 strings would
>often substantially vary in length provides the best counter example
>showing that your view is likely correct about internal representation.

****
But is that not obvious at the beginning? You should have realized that!
****


>
>In fact I will simply state that I am now convinced that UTF-32 is the
>best way to go.
>
>I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>completed using Lex and Yacc. Besides this I need a good way to parse
>UTF-8 to convert it to UTF-32.

****
No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter
just based on the table in the Unicode 5.0 manual!

I realized that I have this information on a slide in my course, which is on my laptop, so
with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM
recognition is required to do the conversion, and it is even questionable as to whether an
FSM is required at all!

All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
either 0 or 1

UTF-32 00000000 00000000 00000000 0xxxxxxx
UTF-16 00000000 0xxxxxxx
UTF-8 0xxxxxxx

UTF-32 00000000 00000000 00000yyy yyxxxxxx
UTF-16 00000yyy yyxxxxxx
UTF-8 110yyyyy 10xxxxxx

UTF-32 00000000 00000000 zzzzyyyy yyxxxxxx
UTF-16 zzzzyyyy yyxxxxxx
UTF-8 1110zzzz 10yyyyyy 10xxxxxx

UTF-32 00000000 000uuuuu zzzzyyyy yyxxxxxx
UTF-16 110110ww wwzzzzyy 110111yy yyxxxxxx*
UTF-8 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

uuuuu = wwww + 1
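
To make the table concrete, here is a minimal sketch of a converter driven
purely by the lead-byte bit patterns (illustrative only; it does no checking
for over-long forms, surrogates, or truncated input, and the function name is
just for the example):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Decode one UTF-8 sequence starting at s; store its byte length in *len.
std::uint32_t Utf8ToUtf32(const unsigned char* s, std::size_t* len) {
    if (s[0] < 0x80) {                                  // 0xxxxxxx
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {                        // 110yyyyy 10xxxxxx
        *len = 2;
        return ((std::uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {                        // 1110zzzz 10yyyyyy 10xxxxxx
        *len = 3;
        return ((std::uint32_t)(s[0] & 0x0F) << 12) |
               ((std::uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    *len = 4;                                           // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return ((std::uint32_t)(s[0] & 0x07) << 18) | ((std::uint32_t)(s[1] & 0x3F) << 12) |
           ((std::uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

int main() {
    const unsigned char utf8[] = { 0xC3, 0xA9 };        // U+00E9
    std::size_t len;
    std::printf("U+%04X (%zu bytes)\n", (unsigned)Utf8ToUtf32(utf8, &len), len);
}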

Pete Delgado

unread,
May 17, 2010, 1:42:19 PM5/17/10
to

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:lI2dnTWeE-MF0GzW...@giganews.com...

> My source code encoding will be UTF-8. My interpreter is written in Yacc
> and Lex, and 75% complete. This makes a UTF-8 regular expression
> mandatory.

Peter,
You keep mentioning that your interpreter is 75% complete. Forgive my morbid
curiosity but what exactly does that mean?

-Pete


Peter Olcott

unread,
May 17, 2010, 3:05:54 PM5/17/10
to
On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>
>>> Huh? What's this got to do with the encoding?
>> (1) Lex requires a RegEx
> ***
> Your regexp should be in terms of UTF32, not UTF8.

Wrong, Lex can not handle data larger than bytes.

> ****
>>
>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>> faster or simpler way to do this besides a regular expression
>> implemented as a finite state machine can possibly exist.
> ****
> Actually, there is; you obviously know nothing about UTF-8, or you would know that the
> high-order bits of the first byte tell you the length of the encoding, and the FSM is
> written entirely in terms of the actual encoding, and is never written as a regexp.

Ah I see, so if I don't know everything then I must know nothing; I think
that this logic is flawed. None of the docs that I read mentioned this
nuance. It may prove to be useful. It looks like this will be most
helpful when translating from UTF-32 to UTF-8, and not the other way
around.

It would still seem to be slower and more complex than a DFA based
finite state machine for validating a UTF-8 byte sequence. It also looks
like it would be slower for translating from UTF-8 to UTF-32.

Peter Olcott

unread,
May 17, 2010, 3:15:26 PM5/17/10
to
On 5/17/2010 11:31 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 08:49:14 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines. I can't imagine a better way.
> ***
> Perhaps. But since you obviously never read the documentation on UTF-8, you did not
> realize that it is based entirely on the number of high-order 1-bits in the first byte of
> the sequence. THis does not require a regexp written in terms of byte ranges to handle
> correctly, and nobody writing a UTF-8 validator would consider this approach. The set of
> rules you are citing are essentially "if my only tool is a regexp recongizer, how can I
> solve a trivial problem using that tool?". It doesn't say the solution is a *good*
> solution, only that it is *a* solution in the artificially constrained solution space. If
> you remove the constraint (as most practical programmers would) then the complexity
> evaporates.
>
> So not only is it *possible* to imagine a better way, the better way is *actually
> documented* in the Unicode spec!

My GUI scripting language is 75% completed and written in Lex and Yacc.

Lex can not handle data larger than bytes.

> I'm sorry you have such a limited imagination, but it is

Peter Olcott

unread,
May 17, 2010, 3:31:06 PM5/17/10
to

The Yacc and Lex are done and working and correctly translate all input
into a corresponding abstract syntax tree. The control flow portion of
the code generator is done and correctly translates control flow
statements into corresponding jump code with minimum branches. The
detailed design for everything else is complete. That remaining work
mostly involves how to handle all of the data types, including objects,
and the elemental operations upon these data types and objects.

Pete Delgado

unread,
May 17, 2010, 3:47:21 PM5/17/10
to

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:a_WdndCS8YLmBGzW...@giganews.com...

Peter,
Correct me if I'm wrong, but it sounds to me as if you are throwing
something at Yacc and Lex and are getting something you think is
"reasonable" out of them but you haven't been able to test the validity of
the output yet since the remainder of the coding to your spec is yet to be
done. Is that a true statement?


-Pete


Peter Olcott

unread,
May 17, 2010, 3:53:39 PM5/17/10
to
On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote:
> See below...
> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>>
>>
>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>> completed using Lex and Yacc. Besides this I need a good way to parse
>> UTF-8 to convert it to UTF-32.
> ****
> No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter
> just based on the table in the Unicode 5.0 manual!

Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I
posted. I am basically defining every UTF-8 byte sequence above the
ASCII range as a valid Letter that can be used in an Identifier.
[A-Za-z_] can also be used as a Letter.

>
> I realized that I have this information on a slide in my course, which is on my laptop, so
> with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM
> recognition is required to do the conversion, and it is even questionable as to whether an
> FSM is required at all!
>
> All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be
> either 0 or 1
>
> UTF-32 00000000 00000000 00000000 0xxxxxxx
> UTF-16 00000000 0xxxxxxx
> UTF-8 0xxxxxxx
>
> UTF-32 00000000 00000000 00000yyy yyxxxxxx
> UTF-16 00000yyy yyxxxxxx
> UTF-8 110yyyyy 10xxxxxx
>
> UTF-32 00000000 00000000 zzzzyyyy yyxxxxxx
> UTF-16 zzzzyyyy yyxxxxxx
> UTF-8 1110zzzz 10yyyyyy 10xxxxxx
>
> UTF-32 00000000 000uuuuu zzzzyyyy yyxxxxxx
> UTF-16 110110ww wwzzzzyy 110111yy yyxxxxxx*
> UTF-8 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>
> uuuuu = wwww + 1
> Joseph M. Newcomer [MVP]
> email: newc...@flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

I was aware of the encodings between UTF-8 and UTF-32; the encoding to
UTF-16 looks a little clumsy when we get to four UTF-8 bytes.
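
For reference, that four-byte case amounts to one subtraction plus two shifts
when going from UTF-32 to UTF-16. A minimal sketch (the function name and the
absence of error handling are just for the illustration):

#include <cstdint>
#include <cstdio>

// Encode one code point as UTF-16; returns the number of 16-bit units written.
int Utf32ToUtf16(std::uint32_t cp, std::uint16_t out[2]) {
    if (cp <= 0xFFFF) {               // fits in a single 16-bit unit
        out[0] = (std::uint16_t)cp;
        out[1] = 0;
        return 1;
    }
    std::uint32_t v = cp - 0x10000;   // this subtraction is the "uuuuu = wwww + 1" step
    out[0] = (std::uint16_t)(0xD800 | (v >> 10));    // 110110ww wwzzzzyy
    out[1] = (std::uint16_t)(0xDC00 | (v & 0x3FF));  // 110111yy yyxxxxxx
    return 2;
}

int main() {
    std::uint16_t units[2];
    int n = Utf32ToUtf16(0x10348, units);            // a code point needing four UTF-8 bytes
    std::printf("%d units: %04X %04X\n", n, units[0], units[1]);
}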

Peter Olcott

unread,
May 17, 2010, 4:24:03 PM5/17/10
to

That is just as true as the statement that no one is more Jewish than
the pope.

I started with a known good Yacc and Lex specification for "C"
http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

I carefully constructed code to build a very clean abstract syntax tree.
I made absolute minimal transformations to the original Yacc and Lex
spec to attain the subset of C++ that I will be implementing.
(Basically the "C" language plus classes and minus pointers).

I used this abstract syntax tree to generate jump code with absolute
minimum number of branches for the control flow statements that this
C++ subset will be providing. Testing has indicated that these results
have been correctly achieved.

Then I designed the whole infrastructure to handle every type of
elemental operation upon the fundamental data types, as well the
aggregate data types.

Joseph M. Newcomer

unread,
May 17, 2010, 5:38:07 PM5/17/10
to
See below...

On Mon, 17 May 2010 14:05:54 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>>
>>>> Huh? What's this got to do with the encoding?
>>> (1) Lex requires a RegEx
>> ***
>> Your regexp should be in terms of UTF32, not UTF8.
>
>Wrong, Lex can not handle data larger than bytes.
***

The wrong choice of tooling. Just because the tooling is stupid is no reason to cater to
its failures.

It is too bad, but lex was done in a different era when everyone used 7-bit ASCII. Using
antiquated tools generates problems that in a sane world would not have to exist. I'd
think that converting lex to 32-bit would be a simpler solution.
****


>
>> ****
>>>
>>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a
>>> faster or simpler way to do this besides a regular expression
>>> implemented as a finite state machine can possibly exist.
>> ****
>> Actually, there is; you obviously know nothing about UTF-8, or you would know that the
>> high-order bits of the first byte tell you the length of the encoding, and the FSM is
>> written entirely in terms of the actual encoding, and is never written as a regexp.
>
>Ah I see, if I don't know everything that I must know nothing, I think
>that this logic is flawed. None of the docs that I read mentioned this
>nuance. It may prove to be useful. It looks like this will be most
>helpful when translating from UTF-32 to UTF-8, and not the other way
>around.

****
You are asking questions based on such elementary properties of the encoding that there is
little excuse for defending your position. You did not start out by saying "I know
nothing about UTF-8 at all..." in which case the simple answer is
(a) read about UTF-8 encoding (I've given you the citation, including
the page number)
(b) realize that your base approach is almost certainly wrong.
****


>
>It would still seem to be slower and more complex than a DFA based
>finite state machine for validating a UTF-8 byte sequence. It also look
>like it would be slower for translating from UTF-8 to UTF-32.

****
If I didn't have to leave in 10 minutes, I'd write the UTF-8 to UTF-32 conversion. But I
have to put together some items for my presentation (find the extension cord, among other
tasks) which doesn't quite leave enough time to write the code, which will take almost the
entire 10 minutes to write. But when I get back tonight, I'll write it and time myself.

But speed is not an issue here, and lex was never intended to support UTF-8. It sounds
like you are complaining that punched cards don't support full Unicode. Wrong tool for
the job.
joe

Peter Olcott

unread,
May 17, 2010, 11:29:15 PM5/17/10
to

I have made Lex handle UTF-8 in an optimal way, thus you are wrong yet
again. I do now agree with you that UTF-32 is generally much better than
UTF-8 for internal representation.

I want my GUI scripting language to be like the original Turbo Pascal
2.0, at least 100-fold faster compiles than the next best alternative.

Joseph M. Newcomer

unread,
May 18, 2010, 12:12:55 AM5/18/10
to
See below...

****
I presume these are Yacc and Lex specifications, but why do you claim they are "known
good" specifications? And what do they have to do with UTF-8 encodings?

You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec so how
can you extend it to support UTF-8?
****


>
>I carefully constructed code to build a very clean abstract syntax tree.
>I made absolute minimal transformations to the original Yacc and Lex
>spec to attain the subset of C++ that I will be implementing.
>(Basically the "C" language plus classes and minus pointers).
>
>I used this abstract syntax tree to generate jump code with absolute
>minimum number of of branches for the control flow statements that this
>C++ subset will be providing. Testing has indicated that these results
>have been correctly achieved.

****
What is "jump code"? I've been in the compiler business since 1970 and I've never heard
of this code style. We assign tasks of generating code to first-semester compiler course
students; it isn't all that challenging. I did a google of "jump code" and got a lot of
useless stuff about advertising links in Web pages. Somehow, I don't believe you are
generating these from your compiler.
****


>
>Then I designed the whole infrastucture to handle every type of
>elemental operation upon the fundamental data types, as well the
>aggregate data types.

****
How do you know how much code is required, so that you know you are 75% done? I was once
consulting, and the company said "Yes, you've identified a serious bug, but we can't fix
it now during the coding phase, because we are 37.247% done with the coding" (they were
coding something the like of which had never existed before, but they could tell to three
decimal places how far they were into the coding) "...So we'll let it go to the testing
phase, and fix it there" (never mind it would cost several times as much to fix it; it was
somebody elses's schedule). So I am always deeply suspect of any "n% done" statements
under any conditions (let alone something like "75%", which means "more than 74% and less
than 76%"). An estimate of "3/4 done" (which means you know to within one part in 4 how
much might be left) makes sense. But as a physicist I was taught to never add a decimal
point or digit of precision beyond what is actually known. "7/10" is not the same as
"70%" except in abstract math. "70%" means "not 69%, not 71%, I know to one part in 100
how accurate I am" but "7/10" means "more than 6/10 and less than 8/10" which would
translate to "70%�10%". So you have to not just quote a value, but your confidence in the
value.

In Pennsylvania, it was announced we have "75% compliance" with the 2010 census. This
leads to the question, if we are attempting to find out how many people we have, how do we
know that 75% of them have responded? We don't actually know how many people we have,
because we have a partial response, but how do we know it is 75%? 75% of how many? Well,
we don't know how many, that's what we're trying to figure out...
joe
****

Joseph M. Newcomer

unread,
May 18, 2010, 12:23:41 AM5/18/10
to
See below...

On Mon, 17 May 2010 14:53:39 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>>>
>>>
>>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75%
>>> completed using Lex and Yacc. Besides this I need a good way to parse
>>> UTF-8 to convert it to UTF-32.
>> ****
>> No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter
>> just based on the table in the Unicode 5.0 manual!
>
>Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I
>posted. I am basically defining every UTF-8 byte sequence above the
>ASCII range as a valid Letter that can be used in an Identifier.
>[A-Za-z_] can also be used as a Letter.

****
No, you are saying "because I chose to use an obsolete piece of software, I am constrained
to think in obsolete terms".

Note that it makes no sense to declare every UTF-8 codepoint that translates to UTF-32 > 0xFF
as a letter. Only in some fantasy world could this possibly make sense, since you are
declaring that all localized digits, localized punctuation marks, and math symbols must be
letters! I find it hard to imagine how you can say this with your stated goal of letting
people code in their native language. Imagine if I allowed ASCII-7 identifiers like
"A+B"
"$%^&"
but that's what you are doing if you recognize international punctuation marks and digits
as "letters"

So I don't believe you have a clue about how to handle UTF-8.

I'd first look into fixing lex to handle modern coding practice. Otherwise, you have a
tool that doesn't solve your problem. You have to write lexical rules that recognize ONLY
letters about U00FF, and for digits, you have to allow digit sequences that allow
localized digits. And if I use the symbols ABCDEFGHIJ to represent some sequence of
localized digits, then you should write rules that allow numbers to be written like "123"
and "ABC" (try to remember that by "A" I mean the localized value of the digit "1"), but
then how do you handle a sequence like A1A? Does it translate to the number we would
write as "111" or is it syntactically illegal? Essentially, doing this in UTF-8 is pretty
miserable, but if you don't do it, you are parsing nonsense phrases.
joe
****

Mihai N.

unread,
May 18, 2010, 3:58:16 AM5/18/10
to

> Ultimately all UTF-8 validators must be regular expressions implemented
> as finite state machines.

> I can't imagine a better way.

That, right here, is your problem. This is where you are wrong.

Oliver Regenfelder

unread,
May 18, 2010, 10:06:43 AM5/18/10
to
Hello,

Peter Olcott wrote:
> I have made Lex handle UTF-8 in an optimal way
^^^^^^^^^^^

I just picked one out of many. But reading all your posts I
always have the feeling that management is talking to me.

They are always filled with adjectives used to state that
things are already perfect, in their best ways, the fastest
solution, the best accomplishable...

Best regards,

Oliver

Joseph M. Newcomer

unread,
May 18, 2010, 11:23:42 AM5/18/10
to
See below...

****
And, like most managers, the Reality Distortion Field is running at full strength. For
example, any character > U00FF is by definition a "letter", which is so far out of contact
with any reality we could recognize that it is mind-boggling that anyone could assert this
with a straight face!

But it is the result of being a superb designer. The rest of us have to struggle merely
to get things right, but if you define "right" to be "whatever I have now", you are always
right.
joe
****
>
>Best regards,
>
>Oliver

Peter Olcott

unread,
May 18, 2010, 11:39:09 AM5/18/10
to
>>
>> That is just as true as the statement that no one is more Jewish than
>> the pope.
>>
>> I started with a known good Yacc and Lex specification for "C"
>> http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
>> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
> ****
> I presume these are Yacc and Lex specifications, but why do you claim they are "known
> good" specifications? And what do they have to do with UTF-8 encodings?
>
> You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec so how
> can you extend it to support UTF-8?
> ****

I already explained this several times, I will not repeat myself again.

>>
>> I carefully constructed code to build a very clean abstract syntax tree.
>> I made absolute minimal transformations to the original Yacc and Lex
>> spec to attain the subset of C++ that I will be implementing.
>> (Basically the "C" language plus classes and minus pointers).
>>
>> I used this abstract syntax tree to generate jump code with absolute
>> minimum number of of branches for the control flow statements that this
>> C++ subset will be providing. Testing has indicated that these results
>> have been correctly achieved.
> ****
> What is "jump code"? I've been in the compiler business since 1970 and I've never heard
> of this code style. We assign tasks of generating code to first-semester compiler course
> students; it isn't all that challenging. I did a google of "jump code" and got a lot of
> useless stuff about advertising links in Web pages. Somehow, I don't believe you are
> generating these from your compiler.
> ****

Take a look at the Dragon book under intermediate code generation. It is
pretty easy to generate control flow code once you have derived an
abstract syntax tree. Trying to do this without an abstract syntax tree
is a little too difficult. The next level challenge is to generate this
control flow with an absolute minimum number of jumps (gotos).
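
For example, here is a minimal sketch of the idea for an if/else node (the
node layout and the emit helpers are made up just to show the shape of the
generated jump code; this is not my actual code generator):

#include <cstdio>
#include <string>

static int label_count = 0;
static int NewLabel() { return ++label_count; }

static void Emit(const std::string& op, const std::string& arg) {
    std::printf("    %s %s\n", op.c_str(), arg.c_str());
}
static void EmitLabel(int label) { std::printf("L%d:\n", label); }

// Hypothetical AST node: strings stand in for real subtrees to keep this short.
struct IfNode { std::string cond, then_part, else_part; };

// One conditional branch plus one unconditional jump per if/else.
static void GenIfElse(const IfNode& n) {
    int else_label = NewLabel();
    int end_label  = NewLabel();
    Emit("jfalse", n.cond + ", L" + std::to_string(else_label)); // taken only when cond is false
    Emit("eval", n.then_part);
    Emit("jmp", "L" + std::to_string(end_label));
    EmitLabel(else_label);
    Emit("eval", n.else_part);
    EmitLabel(end_label);
}

int main() {
    GenIfElse({"x < 10", "y = 1", "y = 2"});
}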

>>
>> Then I designed the whole infrastucture to handle every type of
>> elemental operation upon the fundamental data types, as well the
>> aggregate data types.
> ****
> How do you know how much code is required, so that you know you are 75% done? I was once

This can be reasonably accurately estimated once detailed design is
completed. I can know that I am about 75% complete because most of the
first draft of all of the code is complete, and the crucial
infrastructure components are working and tested.

I break down the design of any system into a hierarchy of increasing
specificity. With my method the last step of detailed design is coding,
and I do this in Microsoft Word.

The primary measure of code quality is the degree to which non-essential
complexity has been eliminated. I spent a lot of time making the code as
simple and easy to read as possible. Because of this my desk checking is
very effective at getting the code nearly completely correct the first
time, without any help from the compiler.

After I have reviewed this code many times, I drop it into the compiler
to correct the syntax errors. Then a little more time is required to
test and debug it. Since it was designed to make testing easy both
testing and debugging combined typically take about 5% of the total
development time. There are never any serious logic errors ever detected
in debugging.

Joseph M. Newcomer

unread,
May 18, 2010, 12:01:39 PM5/18/10
to
See below...

****
Prove it.
****


>
>I want my GUI scripting language to be like the original Turbo Pascal
>2.0, at least 100-fold faster compiles than the next best alternative.

****
Look into the work done by Nigel Horspool (University of Victoria, BC) who is possibly the
world's expert on creating fast parsers from lex and yacc output. He uses NO tables at
all, ever, nothing but fast code generated from the tables. The tables are too slow. So
if you are doing anything table-driven, you are already far behind the optimal performance
point.
joe
****

Joseph M. Newcomer

unread,
May 18, 2010, 12:07:50 PM5/18/10
to
See below...

On Tue, 18 May 2010 00:58:16 -0700, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines.
>
>> I can't imagine a better way.
>That, right here, is your problem. This is where you are wrong.

****
Not if you live in the deep center of a Reality Distortion Field.

Nigel Horspool proved, many years ago (close to 20), that table-driven DFAs and parsers were
far too slow compared to custom C code, but what he did was generate the C code from the
lex and yacc tables. His comment was "this code embodies some of the worst coding
practices you can imagine, but because it is generated automatically, no one ever has to
actually look at it or try to maintain it. So we can use a goto into the middle of
another case of a switch and not hesitate"

He could parse a million lines a minute of C code and pretty much every other language.
The critique was "but you have lousy error recovery on the parse" and his response was
"but at a million lines a minute, I can parse any reasonable-sized program completely on
every keystroke, and you won't even notice the delay, because the parse is finished before
your finger comes up from the key". So he only needed to flag the first error and not
have to restart the parse at a "clean" point, the nightmare of all parsing technologies.

He is a professor at the University of Victoria, British Columbia.
joe

Joseph M. Newcomer

unread,
May 18, 2010, 12:28:35 PM5/18/10
to
And, of course, there is the style that says:

"A must be true"
"In my implementation, A is false"
"My implementation is correct"

This is a logical form known as a sillygism.
joe

On Tue, 18 May 2010 16:06:43 +0200, Oliver Regenfelder <oliver.re...@gmx.at> wrote:

Peter Olcott

unread,
May 18, 2010, 2:38:01 PM5/18/10
to

I really don't care what you believe.

Peter Olcott

unread,
May 18, 2010, 3:01:09 PM5/18/10
to
On 5/18/2010 2:58 AM, Mihai N. wrote:
>
>> Ultimately all UTF-8 validators must be regular expressions implemented
>> as finite state machines.
>
>> I can't imagine a better way.
> That, right here, is your problem. This is where you are wrong.
>
>
My method can completely validate any UTF-8 sequence of bytes and decode
it into its corresponding code point values in fewer machine clock
cycles than any possible alternative.** I am not wrong and will post the
code within about a week.

** Not counting hand tweaking the corresponding assembly language
equivalent of my method. Also not counting slight improvements to the
basic method that may be suggested.

What I am claiming is that my basic method is the fastest possible
method, I am not claiming that the code perfectly implements this method.

Peter Olcott

unread,
May 18, 2010, 3:23:34 PM5/18/10
to
On 5/18/2010 10:23 AM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 18 May 2010 16:06:43 +0200, Oliver Regenfelder<oliver.re...@gmx.at> wrote:
>
>> Hello Peter Olcott wrote:
>>> I have made Lex handle UTF-8 in an optimal way
>> ^^^^^^^^^^^
>>
>> I just picked one out of many. But reading all your posts I
>> always have the feeling that management is talking to me.
>>
>> They are always filled with adjectives used to state that
>> things are already perfect, in their best ways, the fastest
>> solution, the best acomplishable...
> ****
> And, like most managers, the Reality Distortion Field is running at full strength. For
> example, any character > U00FF is by definition a "letter",

If you can't even quote me correctly how can you possibly get anything
else correctly?

Peter Olcott

unread,
May 18, 2010, 3:30:39 PM5/18/10
to

My UTF-8 DFA is hand tweaked. Lex may not produce the fastest possible
DFA, but a DFA is still the fastest possible approach. I doubt that you
can find any lexical analyzer that is not DFA based that can beat the
fastest DFA based lexical analyzer.

If you can provide a valid counterexample, I would be interested in it;
it will not take me very long to confirm your findings.

Joseph M. Newcomer

unread,
May 18, 2010, 5:49:56 PM5/18/10
to
See below...

On Tue, 18 May 2010 14:23:34 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/18/2010 10:23 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Tue, 18 May 2010 16:06:43 +0200, Oliver Regenfelder<oliver.re...@gmx.at> wrote:
>>
>>> Hello Peter Olcott wrote:
>>>> I have made Lex handle UTF-8 in an optimal way
>>> ^^^^^^^^^^^
>>>
>>> I just picked one out of many. But reading all your posts I
>>> always have the feeling that management is talking to me.
>>>
>>> They are always filled with adjectives used to state that
>>> things are already perfect, in their best ways, the fastest
>>> solution, the best acomplishable...
>> ****
>> And, like most managers, the Reality Distortion Field is running at full strength. For
>> example, any character > U00FF is by definition a "letter",
>
>If you can't even quote me correctly how can you possibly get anything
>else correctly?

****
Yes, I re-read, and realized that you said anything outside the ASCII set, by which I
assume you mean anything > U007E, which makes your mistake even more egregious!

I was being generous and giving you credit for a better design than you actually had. My
apologies for misinterpreting your bad design in a more generous fashion than it deserved.
joe

****

Joseph M. Newcomer

unread,
May 18, 2010, 6:26:04 PM5/18/10
to
It is important to distinguish between the *concept* of DFA and the implementation of "a
table-driven DFA" and the special case of "a table-drive DFA generated by the lex tool"

All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
work at this point, of generating tight C code, and no table appears anywhere in his
implementation.
joe

Peter Olcott

unread,
May 18, 2010, 6:27:38 PM5/18/10
to
On 5/18/2010 4:49 PM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 18 May 2010 14:23:34 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>
>>> And, like most managers, the Reality Distortion Field is running at full strength. For
>>> example, any character > U00FF is by definition a "letter",
>>
>> If you can't even quote me correctly how can you possibly get anything
>> else correctly?
> ****
> Yes, I re-read, and realized that you said anything outside the ASCII set, by which I
> assume you mean anything > U007E, which makes your mistake even more egregious!
>
> I was being generous and giving you credit for a better design than you actually had. MY
> apologies for misinterpreting your bad design in a more generous fashion than it deserved.
> joe
>

So what specific dysfunctional result is derived by providing greater
than typical flexibility in the sequence of characters that can be used in
an Identifier?

It seems that all of your criticism of this point is entirely based on
mindless conformity with preexisting conventions.

Peter Olcott

unread,
May 18, 2010, 6:36:48 PM5/18/10
to
On 5/18/2010 5:26 PM, Joseph M. Newcomer wrote:
> It is important to distinguish between the *concept* of DFA and the implementation of "a
> table-driven DFA" and the special case of "a table-drive DFA generated by the lex tool"
>
> All lexical analysis is done using the concept of DFAs, but the table-driven DFA output by
> lex is not very fast. Again, I point to the work of Nigel Horspool, which is very old
> work at this point, of generating tight C code, and no table appears anywhere in his
> implementation.
> joe
>
I really, really doubt that no table is used in a lexical analyzer that
is very very fast. Some little include of an include that you missed is
more likely. I can't accept this without direct proof.

I can accept that he may have derived a parser that is very very fast
without the need for a table. This is more plausible.

Joseph M. Newcomer

unread,
May 18, 2010, 8:36:34 PM5/18/10
to
See below...

On Tue, 18 May 2010 17:27:38 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/18/2010 4:49 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Tue, 18 May 2010 14:23:34 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>>
>>>> And, like most managers, the Reality Distortion Field is running at full strength. For
>>>> example, any character > U00FF is by definition a "letter",
>>>
>>> If you can't even quote me correctly how can you possibly get anything
>>> else correctly?
>> ****
>> Yes, I re-read, and realized that you said anything outside the ASCII set, by which I
>> assume you mean anything > U007E, which makes your mistake even more egregious!
>>
>> I was being generous and giving you credit for a better design than you actually had. MY
>> apologies for misinterpreting your bad design in a more generous fashion than it deserved.
>> joe
>>
>
>So what specific dysfunctional result is derived by providing greater
>than typical flexibility the sequence of characters that can be used in
>an Identifier?

****
OK, consider if the following were a specification for a C identifier:
[A-Za-z_#!~`$%^&*()-=+{};:.,<>]

It gives me a lot more "flexibility", but it generally doesn't make sense. So if it
doesn't make sense to an English language user to see comma as a "letter", how does it
make sense for an Armenian user if character U055D (Armenian comma) is a letter? Or U0589
(Armenian full stop, that is, period) or U058A (Armenian hyphen). Sure, if the language
is COBOL, hyphen is a letter, but you were talking about a C-like language.

Then there's U060C, Arabic comma, U060D, Arabic date separator, U061B, Arabic semicolon,
U0700 Syriac End Of Paragraph, U0701, Syriac Supralinear Full Stop, U0703, Syriac
Supralinear Colon, U0704, Syriac Sublinear Colon, U0953, Devanagari Grave Accent (our
equivalent of `), U09F2, Bengali Rupee Mark [although you can argue that this might be the
same as allowing $ in an identifier, which many languages do], U0BF3, Tamil day sign,
U0DF4 Sinhala Punctuation Kunddaliya, U10FB, Georgian Paragraph Separator (this might be a
whitespace), U1363, Ethiopic comma, U1364, Ethiopic Semicolon, just to name a few of the
interesting characters I found within the first 1/3 of the Unicode characters. Then I got
bored, because I had proved beyond any question that your "programming in their
native language" claim is complete nonsense.

Maybe years of working with students and fellow faculty from all over the world has made
me somewhat sensitive to the issues of localization of programming languages. There's a
rather famous letter from a Norwegian to SIGPLAN Notices (around 1979) where he points out
that if Americans were forced to drop the letter "C" because its pronunciation was
ambiguous, and so on; that we could not use the vowels "a" or "o" as letters, etc., it
would be considered a serious imposition. So we might have a specification like
[BD-NP-Zbd-np-z_] and he points out that for someone from another country, not being able
to use accented characters is a similar burden. But if you allow accented characters, you
have to allow localized digits in numbers, and localized punctuation (friends pointed out
that certain words written with accent marks are valid identifiers, but without the accent
marks are actually quite rude words; for example, "count" is a valid identifier, but if
"o" were not allowed, we get something quite different. Apparently this is particularly
true in languages like French with lots of accented characters; but this was over 40 years
ago, and I never learned French, so I can't reproduce the examples).

Or were you just kidding about that "program in their native language" part?
joe
****


>
>It seems that all of your criticism of this point is entirely based on
>mindless conformity with preexisting conventions.

Peter Olcott

unread,
May 18, 2010, 8:54:17 PM5/18/10
to
On 5/18/2010 7:36 PM, Joseph M. Newcomer wrote:
> See below...
>> So what specific dysfunctional result is derived by providing greater
>> than typical flexibility the sequence of characters that can be used in
>> an Identifier?
> ****
> OK, consider if the following were a specifcation for a C identifier:
> [A-Za-z_#!~`$%^&*()-=+{};:.,<>]
>
> It gives me a lot more "flexibility", but it generally doesn't make sense.

It is not whether it makes sense within the context of
natural human language that restricts these punctuation
characters from being used in Identifiers. The real issue is that the
parser would choke.

It is not the job of the compiler to make sure that the program makes
sense from the perspective of natural language. This is out-of-scope. It
is like requiring the compiler to validate the grammar of the user
specified comments, and check the spelling of the words within these
comments.

Since your reasoning began from an incorrect premise the reasoning
following this false premise is necessarily unsound.

Joseph M. Newcomer

unread,
May 18, 2010, 9:04:26 PM5/18/10
to
See below...

On Tue, 18 May 2010 10:39:09 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>>>
>>> That is just as true as the statement that no one is more Jewish than
>>> the pope.
>>>
>>> I started with a known good Yacc and Lex specification for "C"
>>> http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
>>> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
>> ****
>> I presume these are Yacc and Lex specifications, but why do you claim they are "known
>> good" specifications? And what do they have to do with UTF-8 encodings?
>>
>> You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec so how
>> can you extend it to support UTF-8?
>> ****
>
>I already explained this several times, I will not repeat myself again.

****
But you always give the same silly explanation, that "it is easier". It doesn't justify
why you have decided that punctuation marks and digits are "letters". So if you gave the
same meaningless explanation, the question still remains.
****


>
>>>
>>> I carefully constructed code to build a very clean abstract syntax tree.
>>> I made absolute minimal transformations to the original Yacc and Lex
>>> spec to attain the subset of C++ that I will be implementing.
>>> (Basically the "C" language plus classes and minus pointers).
>>>
>>> I used this abstract syntax tree to generate jump code with absolute
>>> minimum number of of branches for the control flow statements that this
>>> C++ subset will be providing. Testing has indicated that these results
>>> have been correctly achieved.
>> ****
>> What is "jump code"? I've been in the compiler business since 1970 and I've never heard
>> of this code style. We assign tasks of generating code to first-semester compiler course
>> students; it isn't all that challenging. I did a google of "jump code" and got a lot of
>> useless stuff about advertising links in Web pages. Somehow, I don't believe you are
>> generating these from your compiler.
>> ****
>
>Take a look at the Dragon book under intermediate code generation. It is
>pretty east to generate control flow code once you have derived an
>abstract syntax tree. Trying to do this without an abstract syntax tree
>is a little too difficult. The next level challenge is to generate this
>control flow with with an absolute minimum number of jumps (gotos).

****
This is really trivial, but I never heard it called "jump code". We have referred to it
as "byte codes", "quads", "threaded interpretive code", and other designations. The fact
that google did not uncover any compiler references for the phrase "jump code" puts it
definitely off the mainstream.

The minimum-jump solution is often best handled by peephole optimization on the quads that
represent the symbolic code; it is a lot more interesting when you have architectures that
have short-jump and long-jump instructions, particularly when the conditional jumps are
all short-jumps. For example, the optimizations the Bliss-11 compiler was doing in 1972.
The basic blocks were re-sorted by the peephole optimizer to minimize the number of jump
instructions, given the nature of the short-conditional-jumps. Ho hum. Did you think this
was a *new* problem? It is a very, very old problem, and was first addressed in the
FORTRAN I compiler. The problem there was the host machine (IBM 704/709/709x) didn't have
enough memory to do jump-sorting of basic blocks. By the time we did Bliss-11 we had
approximately a megabyte of memory to hold each function's intermediate representation.
Well, half a megabyte, given the code size. Steve Hobbs was the expert on jump
optimization in that compiler.
****


>
>>>
>>> Then I designed the whole infrastucture to handle every type of
>>> elemental operation upon the fundamental data types, as well the
>>> aggregate data types.
>> ****
>> How do you know how much code is required, so that you know you are 75% done? I was once
>
>This can be reasonably accurately estimated once detailed design is
>completed. I can know that I am about 75% complete because most of the
>first draft of all of the code is complete, and the crucial
>infrastructure components are working and tested.
>
>I break down the design of any system into a hierarchy of increasing
>specificity. With my method the last step of detailed design is coding,
>and I do this in Microsoft Word.

****
You code in Word? Word is essentially user-hostile for writing code!
***


>
>The primary measure of code quality is the degree that non essential
>complexity has been eliminated. I spent a lot of time making the code as
>simple and easy to read as possible. Because of this my desk checking is
>very effective at getting the code nearly completely correct the first
>time, without any help from the compiler.

****
There are many measures of code quality. Mine is that the code is robust under
maintenance, that is, it can be easily modified by someone else because there are no
orthogonal dependencies. "Simple" is not always the best metric for such cases.

I still can't figure out how you can estimate the amount of code required in the future.
There is rarely a linear correlation between specification and lines of code.

But my experience in managing projects really soured me on the ability of a programmer to
estimate how much work is left to do.
joe
****

Joseph M. Newcomer

unread,
May 18, 2010, 9:07:15 PM5/18/10
to
The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
analysis all the time using DFAs and never once do I have a table. Years of experience
taught me that they are faster than anything lex or analogous lexer generators can do. But
I'll reserve judgment.
joe

Peter Olcott

unread,
May 18, 2010, 9:42:55 PM5/18/10
to
On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
> analysis all the time using DFAs and never once do I have a table. Years of experience
> taught me that they are faster than anything lex or analogous lexer generators can do. But
> I'll reserve judgment.
> joe

How can you scan for dozens of keywords efficiently without a state
transition matrix? Without a state transition matrix it's gotta be
something like if, else if, else if, et cetera.

With a state transition matrix you can simultaneously look for an
unlimited number of different character strings with zero comparisons
per byte.
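
A minimal sketch of what I mean (the states and the table contents are filled
in by hand for just two keywords, purely as an illustration of the technique):

#include <cstdio>
#include <cstring>

enum State { START, S_I, S_IF, S_IN, S_INT, DEAD, NUM_STATES };

int main() {
    // state_transition_matrix[current_state][current_input_byte]
    static unsigned char next[NUM_STATES][256];
    std::memset(next, DEAD, sizeof next);   // every unexpected byte leads to DEAD
    next[START]['i'] = S_I;
    next[S_I]['f']   = S_IF;                // accepts "if"
    next[S_I]['n']   = S_IN;
    next[S_IN]['t']  = S_INT;               // accepts "int"

    const char* input = "int";
    int state = START;
    for (const char* p = input; *p; ++p)
        state = next[state][(unsigned char)*p];   // one table lookup per input byte
    std::printf("%s\n", state == S_IF ? "keyword if"
                       : state == S_INT ? "keyword int" : "no keyword");
}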

Joseph M. Newcomer

unread,
May 18, 2010, 10:59:12 PM5/18/10
to
See below...

On Tue, 18 May 2010 19:54:17 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/18/2010 7:36 PM, Joseph M. Newcomer wrote:
>> See below...
>>> So what specific dysfunctional result is derived by providing greater
>>> than typical flexibility the sequence of characters that can be used in
>>> an Identifier?
>> ****
>> OK, consider if the following were a specifcation for a C identifier:
>> [A-Za-z_#!~`$%^&*()-=+{};:.,<>]
>>
>> It gives me a lot more "flexibility", but it generally doesn't make sense.
>
>It is not whether or not that it makes sense within the context of
>natural human language that restricts the use of these punctuation
>characters from being used for Identifiers. The real issue is that the
>parser would choke.
>
>It is not the job of the compiler to make sure that the program makes
>sense from the perspective of natural language. This is out-of-scope. It
>is like requiring the compiler to validate the grammar of the user
>specified comments, and check the spelling of the words within these
>comments.

****
No, it is the job of the person writing the lexical specification, and writing the syntax
specification. And it is NOT like asking for spell-checking in comments! I'm not even
sure you are applying any form of logic to that leap of reasoning.

Instead, a well-formed syntactic entity such as a number in a language that is pretending
to be localizable must include localized digits. These digits and punctuation marks cannot
possibly make sense as part of an identifier.

So if I'm going to be able to write in my native language, I expect to use my native
digits to define number. I certainly do, and why should I tell a Thai programmer he must
use MY culture's digits? That seems arrogant in the extreme.

Given that it is not all that hard to actually write the production rules to do this, I
fail to see why you keep insisting your failed design is correct (which doesn't match your
own claim of localization). This is not rocket science. It is just a few new lexical
rules!
joe
****

Peter Olcott

unread,
May 18, 2010, 11:36:13 PM5/18/10
to
On 5/18/2010 9:59 PM, Joseph M. Newcomer wrote:
> See below...

If internationalizing my GUI scripting language is not trivial then it
is far too expensive to implement at this stage of development. I don't
really even have the time to learn about these things much less
implement them. I have too many man years worth of work to fit into too
few hours in the day.

It is easy enough to use localized digits now that I have the table that
shows the categories of code points. Yet at this point I have already
expended more on this aspect than I could afford. Users of my language
will have to deal with unrestrained UTF-8 Identifiers and no further
localization until I have a self-sustaining positive cash flow.

Mihai N.

unread,
May 19, 2010, 1:34:59 AM5/19/10
to

> I can't accept this without direct proof.

Yet, we are all supposed to take everything *you*
say without any proof.

Mihai N.

unread,
May 19, 2010, 1:42:01 AM5/19/10
to
> You code in Word? Word is essentially user-hostile for writing code!

Actually, it is very nice.

You set the spelling language to "C" and get nice squiggles for
all syntax errors.
;-)

Also, it has the great benefit that it does not compile and run your code, which
allows you to write perfect code and create super fast algorithms.
Compiling the code and running it destroy the perfection :-)
It's called "reality" and it is a bad thing.

Joseph M. Newcomer

unread,
May 19, 2010, 2:18:53 AM5/19/10
to
See below...

On Tue, 18 May 2010 20:42:55 -0500, Peter Olcott <NoS...@OCR4Screen.com> wrote:

>On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
>> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
>> analysis all the time using DFAs and never once do I have a table. Years of experience
>> taught me that they are faster than anything lex or analogous lexer generators can do. But
>> I'll reserve judgment.
>> joe
>
>How can you scan for dozens of keywords efficiently without a state
>transition matrix? Without a state transition matrix its gotta be
>something like if, else if, else if, et cetera.

****
Your assumption that a "state transition matrix" is required shows that you have never
actually written a hard-coded lexer. I've written so many that I've lost count. I'll
knock one off as fast as I can type; in fact, typing speed is the limit to how long it
takes to create one.

How did you develop this fixation that a "state transition matrix" is required? That is
only one way to implement a DFA. A switch statement works well, too, and if-statements
are considered a viable technique. While a table-driven method is known to be one of the
implementations, it is far and away not the ONLY known implementation. Nigel's experience
was that a hard-coded recognizer without a table is blindingly fast, MUCH faster than a
lexer that uses a transition matrix to drive the recognition. As these things go, using
transition matrices is a rather "modern" invention in lexing. The compiler for the CDC
Star computer used pipelined floating point operations to create a character class vector
(because the Star was LOUSY at character manipulation but blindingly fast at pipelined
floating point). So don't try to explain to me that table-driven lexers are the only
possible implementation; I was writing hard-coded lexers in 1966, and using table-driven
lexers in 1967 that were created from a lexer generator written by Art Evans for his PhD
dissertation in 1961. We wrote a hand-coded lexer for the Bliss compiler. I've been
writing compilers of one sort or another for 45 years, with table-driven components,
hard-coded components, and everything in between. I've written sparse matrix packers
using "comb vectors" to compress parse tables (an invention of people at Kahrsruhe
University in Germany). And I can write programs that can convert NDFA machines to DFA
machines, and then optimize the DFA to minimum states, and used to give this as a freshman
programming exercise. So please stop trying to tell me how much I know or don't know
about lexers, parsers, and compiler implementations in general. I also worked closely
with one of the world's best language designers, and I learned a lot about language design
from him. Not enough to be as good as he is, but enough to know when I see a bad design.
Do you know what a "perfect hash" is? I've built perfect hash algorithms. Do you know
where they are used, and why?

I'm not some newbie at this business. I worked with optimizing compiler technology from
1970 through 1983, and my Ph.D. was in optimal code generation from table-driven code
generators. I've built debuggers, byte code interpreters, tree interpreters, threaded
code interpreters, and machine code generators. I wrote my first compiler in 1967, which
took an Algol-like language and generated hard code. I was one of the key developers of
an interactive language from 1967-1969. My first optimizing compiler was a student
project in 1967 (I can date that exactly, because the machine I wrote it for was
decomissioned in 1968; our grade was inversely proportional to code size, and I got an
A+). I've forgotten more about compilers than most people have ever learned. The last
compiler I wrote had not a trace of tables anywhere in it, and I wrote it three years ago
(I just checked the file dates). My first Windows project was a compiler which had not a
single table for either its lexer or parser, back in 1990. So I have no idea why you
think these representations are essential.

And my skills pale beside those of deep experts like Nigel Horspool; see, for example,

http://webhome.cs.uvic.ca/~nigelh/pubs.html

and I recommend the paper

http://webhome.cs.uvic.ca/~nigelh/Publications/fastparse.pdf

A 1990 paper wherein he shows why performance of table-driven parsers can be improved by
very large factors (like 8x) by hard-coding the transitions in pure C code without a table
anywhere in the resulting compiler. Don't try to explain why table-driven must be faster
than any alternative when it was proven 20 years ago that this is not true. I suggest
looking at Table 2 on page 19. You asked for proof; here it is.
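
To give the flavor of the table-free style, here is a fragment of the general
shape such generators produce (hand-written here purely as an illustration,
not Horspool's generated code, and covering only two keywords):

#include <cstdio>

// Distinguish the keywords "if" and "int" from other identifiers with
// nested switches instead of a transition matrix.
const char* Classify(const char* p) {
    switch (*p++) {
    case 'i':
        switch (*p++) {
        case 'f':
            return *p == '\0' ? "keyword if" : "identifier";
        case 'n':
            return (*p == 't' && *(p + 1) == '\0') ? "keyword int" : "identifier";
        default:
            return "identifier";
        }
    default:
        return "identifier";
    }
}

int main() {
    std::printf("%s %s %s\n", Classify("if"), Classify("int"), Classify("index"));
}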
****


>
>With a state transition matrix you can simultaneously look for an
>unlimited number of different character strings with zero comparisons
>per byte.

****
So what's your complaint about adding lexer rules for localized digit sequences? You just
said it won't add any time...
joe

Peter Olcott

unread,
May 19, 2010, 10:10:52 AM5/19/10
to
On 5/19/2010 12:34 AM, Mihai N. wrote:
>
>> I can't accept this without direct proof.
>
> Yet, we are all supposed to take everything *you*
> say without any proof.
>
>
Please cite examples of what I need to prove. So far no one has asked for
any proof.

Peter Olcott

unread,
May 19, 2010, 10:20:37 AM5/19/10
to
On 5/19/2010 12:42 AM, Mihai N. wrote:
>> You code in Word? Word is essentially user-hostile for writing code!
>
> Actually, it is very nice.
>
> You set the spelling language to "C" and get nice squiggles for
> all syntax errors.
> ;-)
>
> Also, has the great benefit that it does not compile and run your code, which
> allows to write perfect code and create super fast algorithms.
> Compiling the code and running it destroy the perfection :-)
> It's called "reality" and it is a bad thing.
>
>

It has the benefit of making code easier to read for extensive desk
checking. I can enlarge the font, add bolding, and color highlighting.
Also this code can be directly embedded in my design notes. This process
works very well for me.

How much time is typically spent on debugging and testing? From what I
have heard many badly written systems spend almost all of their time on
debugging. I only spent about 5% of my time on debugging and testing
combined. The code is designed to make automated testing easy.

Peter Olcott

unread,
May 19, 2010, 10:52:18 AM5/19/10
to
On 5/19/2010 1:18 AM, Joseph M. Newcomer wrote:
> See below...
> On Tue, 18 May 2010 20:42:55 -0500, Peter Olcott<NoS...@OCR4Screen.com> wrote:
>
>> On 5/18/2010 8:07 PM, Joseph M. Newcomer wrote:
>>> The counterexample is that he DID it. You'll have to ask him for proof. But I do lexical
>>> analysis all the time using DFAs and never once do I have a table. Years of experience
>>> taught me that they are faster than anything lex or analogous lexer generators can do. But
>>> I'll reserve judgment.
>>> joe
>>
>> How can you scan for dozens of keywords efficiently without a state
>> transition matrix? Without a state transition matrix it's gotta be
>> something like if, else if, else if, et cetera.
> ****
> Your assumption that a "state transition matrix" is required shows that you have never
> actually written a hard-coded lexer. I've written so many that I've lost count. I'll
> knock one off as fast as I can type; in fact, typing speed is the limit to how long it
> takes to create one.
>
> How did you develop this fixation that a "state transition matrix" is required?

Because it is intuitively obvious that there can be no other way that
requires fewer machine cycles.

> That is
> only one way to implement a DFA. A switch statement works well,

This is a fundamental part of my DFA recognizer.

> too, and if-statements
> are considered a viable technique.

Several comparisons per byte must take longer than looking up the
ActionCode for the switch statement in a state_transition_matrix that is
indexed by current_state and current_input_byte.

This is especially true when the state_transition_matrix is tiny enough
to always fit in cache. (Cache locality of reference is the most
significant thing that I learned from you).

With all of the experience that you cited below, the above sentences
should completely explain every relevant detail of my entire DFA design.

I see no possible way that any lexer that can recognize 50 different
keywords can be made any faster than the above design. This certainly
cannot be done with sequences of if / else if statements.
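
A minimal sketch of the scheme just described (the state count, table
contents, and action names below are placeholders, not the actual design;
with a handful of states the whole matrix fits comfortably in L1 cache):

#include <cstdint>
#include <string>

enum ActionCode : std::uint8_t { Continue = 0, EmitToken, Error };

constexpr int kStates = 4;               // placeholder state count

struct Cell {
    std::uint8_t next_state;
    std::uint8_t action;
};

// One row per state, one column per possible input byte (0..255).
// A real lexer generator would fill this in; it is zero-initialized here.
static Cell state_transition_matrix[kStates][256] = {};

void Scan(const std::string& input)
{
    std::uint8_t current_state = 0;
    for (unsigned char current_input_byte : input) {
        const Cell& cell =
            state_transition_matrix[current_state][current_input_byte];
        current_state = cell.next_state;
        switch (cell.action) {           // the ActionCode switch
        case Continue:  break;           // keep scanning
        case EmitToken: /* hand the finished lexeme to the caller */ break;
        case Error:     /* report and resynchronize */ return;
        }
    }
}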

> While a table-driven method is known to be one of the
> implementations, it is far and away not the ONLY known implementation. Nigel's experience
> was that a hard-coded recognizer without a table is blindingly fast, MUCH faster than a
> lexer that uses a transition matrix to drive the recognition. As these things go, using
> transition matrices is a rather "modern" invention in lexing. The compiler for the CDC
> Star computer used pipelined floating point operations to create a character class vector
> (because the Star was LOUSY at character manipulation but blindingly fast at pipelined
> floating point). So don't try to explain to me that table-driven lexers are the only
> possible implementation; I was writing hard-coded lexers in 1966, and using table-driven
> lexers in 1967 that were created from a lexer generator written by Art Evans for his PhD
> dissertation in 1961. We wrote a hand-coded lexer for the Bliss compiler. I've been
> writing compilers of one sort or another for 45 years, with table-driven components,
> hard-coded components, and everything in between. I've written sparse matrix packers
> using "comb vectors" to compress parse tables (an invention of people at Kahrsruhe
> University in Germany). And I can write programs that can convert NDFA machines to DFA
> machines, and then optimize the DFA to minimum states, and used to give this as a freshman
> programming exercise. So please stop trying to tell me how much I know or don't know
> about lexers, parsers, and compiler implementations in general. I also worked closely
> with one of the world's best language designers, and I learned a lot about language design
> from him. Not enough to be as good as he is, but enough to know when I see a bad design.
> Do you know what a "perfect hash" is?

Yes

> I've built perfect hash algorithms. Do you know
> where they are used, and why?

No.
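
(For context: the classic place a perfect hash shows up in a compiler is
keyword recognition. Once the lexer has scanned an identifier, a
collision-free hash maps it to its keyword slot with one probe and one
comparison; tools such as gperf generate these functions automatically. A
toy hand-made sketch, with the hash chosen only for this four-word set:)

#include <cstring>

enum Keyword { KW_NONE, KW_IF, KW_ELSE, KW_WHILE, KW_RETURN };

// Toy example: for this particular set the token length alone is already a
// collision-free (perfect) hash, so a single memcmp settles the lookup.
Keyword LookupKeyword(const char* text, std::size_t len)
{
    switch (len) {
    case 2: return std::memcmp(text, "if", 2)     == 0 ? KW_IF     : KW_NONE;
    case 4: return std::memcmp(text, "else", 4)   == 0 ? KW_ELSE   : KW_NONE;
    case 5: return std::memcmp(text, "while", 5)  == 0 ? KW_WHILE  : KW_NONE;
    case 6: return std::memcmp(text, "return", 6) == 0 ? KW_RETURN : KW_NONE;
    default: return KW_NONE;
    }
}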

>
> I'm not some newbie at this business. I worked with optimizing compiler technology from
> 1970 through 1983, and my Ph.D. was in optimal code generation from table-driven code
> generators. I've built debuggers, byte code interpreters, tree interpreters, threaded
> code interpreters, and machine code generators. I wrote my first compiler in 1967, which
> took an Algol-like language and generated hard code. I was one of the key developers of
> an interactive language from 1967-1969. My first optimizing compiler was a student
> project in 1967 (I can date that exactly, because the machine I wrote it for was
> decommissioned in 1968; our grade was inversely proportional to code size, and I got an
> A+). I've forgotten more about compilers than most people have ever learned. The last
> compiler I wrote had not a trace of tables anywhere in it, and I wrote it three years ago
> (I just checked the file dates). My first Windows project was a compiler which had not a
> single table for either its lexer or parser, back in 1990. So I have no idea why you
> think these representations are essential.
>
> And my skills pale beside those of deep experts like Nigel Horspool; see, for example,
>
> http://webhome.cs.uvic.ca/~nigelh/pubs.html
>
> and I recommend the paper
>
> http://webhome.cs.uvic.ca/~nigelh/Publications/fastparse.pdf
>
> A 1990 paper wherein he shows why performance of table-driven parsers can be improved by
> very large factors (like 8x) by hard-coding the transitions in pure C code without a table
> anywhere in the resulting compiler. Don't try to explain why table-driven must be faster
> than any alternative when it was proven 20 years ago that this is not true. I suggest
> looking at Table 2 on page 19. You asked for proof; here it is.
> ****

The paper is named "fast parse," and I already know that parsing might be
done faster without a table. What about lexing? I did a keyword search
for "lex" (which would also match "lexical") and it appears nowhere in the
paper. I suggest that you might be incorrect in your assertion.

>>
>> With a state transition matrix you can simultaneously look for an
>> unlimited number of different character strings with zero comparisons
>> per byte.
> ****
> So what's your complaint about adding lexer rules for localized digit sequences? You just
> said it won't add any time...

It adds development time that I can not afford to spend right now.

Joseph M. Newcomer

unread,
May 19, 2010, 11:29:49 AM5/19/10
to
Actually, I state *opinions* that nobody has to accept.
joe

On Tue, 18 May 2010 22:34:59 -0700, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>> I can't accept this without direct proof.
>
>Yet, we are all supposed to take everything *you*
>say without any proof.

Peter Olcott

unread,
May 19, 2010, 11:47:50 AM5/19/10
to

The evidence that you provided to show that a faster alternative exists
to a state-transition-matrix-based lexer addressed only parsing, not
lexing. The title of the paper pertained to parsing, not lexing.

Pete Delgado

unread,
May 19, 2010, 11:52:09 AM5/19/10
to

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:GOadnXHFWbfxbG7W...@giganews.com...

You provide no proof of your numerous performance claims. When pressed, you
do cite your own flawed reasoning, but you never provide numerical,
verifiable proof of working code - something Joe has asked you to do many
times in the hopes that it would help you learn where your assumptions are
flawed.

Would you like me to refer you to the specific posts or can you take a few
moments from your hectic development and newsgroup posting cycle to do the
search on your own?

-Pete


Peter Olcott

unread,
May 19, 2010, 12:05:02 PM5/19/10
to
On 5/19/2010 10:52 AM, Pete Delgado wrote:
> "Peter Olcott"<NoS...@OCR4Screen.com> wrote in message
> news:GOadnXHFWbfxbG7W...@giganews.com...
>> On 5/19/2010 12:34 AM, Mihai N. wrote:
>>>
>>>> I can't accept this without direct proof.
>>>
>>> Yet, we are all supposed to take everything *you*
>>> say without any proof.
>>>
>>>
>> Please cite examples of what I need to prove. So far no one asked for any
>> proof.
>
> You provide no proof of your numerous performance claims. When pressed, you
> do cite your own flawed reasoning, but you never provide numerical,
> verifiable proof of working code - something Joe has asked you to do many
> times in the hopes that it would help you learn where your assumptions are
> flawed.

I just posted the complete detailed design of the UTF-8 recognizer. The
code will follow within a week. The only thing that needs to be added to
this design is the actual code that does the translation from UTF-8 to
UTF-32.
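
As a generic sketch only (not the design referred to above, and assuming
the recognizer has already validated the input, so no overlong or
out-of-range checks are repeated), the UTF-8 to UTF-32 step amounts to
masking the lead byte and folding in six bits per continuation byte:

#include <cstdint>
#include <string>
#include <vector>

// Generic sketch, not the design referred to above: decode validated UTF-8
// into 32-bit code points.
std::vector<std::uint32_t> Utf8ToUtf32(const std::string& in)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if      (lead < 0x80)           { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { ++i; continue; }         // invalid lead byte; a real decoder
                                        // would report an error here
        if (i + len > in.size())
            break;                      // truncated sequence at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char trail = static_cast<unsigned char>(in[i + k]);
            cp = (cp << 6) | (trail & 0x3F);   // six payload bits per byte
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}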

Oliver Regenfelder

unread,
May 19, 2010, 12:27:33 PM5/19/10
to
Hello,

Peter Olcott wrote:
> On 5/19/2010 12:42 AM, Mihai N. wrote:
>>> You code in Word? Word is essentially user-hostile for writing code!
>>
>> Actually, it is very nice.
>>
>> You set the spelling language to "C" and get nice squiggles for
>> all syntax errors.
>> ;-)
>>
>> Also, it has the great benefit that it does not compile and run your
>> code, which
>> allows you to write perfect code and create super fast algorithms.
>> Compiling the code and running it destroy the perfection :-)
>> It's called "reality" and it is a bad thing.
>>
>>
>
> It has the benefit of making code easier to read for extensive desk
> checking. I can enlarge the font, add bolding and color highlighting.
> Also, this code can be directly embedded in my design notes. This process
> works very well for me.

This is it. He is either a troll or has leaked in from some funky
parallel universe.

Best regards,

Oliver

Pete Delgado

unread,
May 19, 2010, 1:09:51 PM5/19/10
to

"Peter Olcott" <NoS...@OCR4Screen.com> wrote in message
news:vOGdnfGVDviykWnW...@giganews.com...

> On 5/19/2010 10:52 AM, Pete Delgado wrote:
>> "Peter Olcott"<NoS...@OCR4Screen.com> wrote in message
>> news:GOadnXHFWbfxbG7W...@giganews.com...
>>> On 5/19/2010 12:34 AM, Mihai N. wrote:
>>>>
>>>>> I can't accept this without direct proof.
>>>>
>>>> Yet, we are all supposed to take everything *you*
>>>> say without any proof.
>>>>
>>>>
>>> Please cite examples of what I need to prove. So far no one asked for
>>> any
>>> proof.
>>
>> You provide no proof of your numerous performance claims. When pressed,
>> you
>> do cite your own flawed reasoning, but you never provide numerical,
>> verifiable proof of working code - something Joe has asked you to do many
>> times in the hopes that it would help you learn where your assumptions
>> are
>> flawed.
>
> I just posted the complete detailed design of the UTF-8 recognizer. The
> code will follow within a week. The only thing that needs to be added to
> this design is the actual code the does the translation from UTF-8 to
> UTF-32.

I'll be happy to look for the code on the 27th!

-Pete

