Why does the lexer convert text integer lexemes to binary integers? I thought that lexers should be simple?

Roger L Costello

unread,

Jul 14, 2022, 11:10:56 AM7/14/22

to

Hi Folks,

A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
expressions. When the lexer encounters an integer lexeme, it casts the lexeme
to a binary integer and returns the value to the parser. The lexer contains a
rule that looks something like this:

{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

But, but, but, ...

Countless times on this list I have been told: Keep the lexer simple!

By converting the lexeme to an integer, the lexer has assumed that the parser
needs/wants a binary integer, not a text number. How does the lexer know what
the parser needs/wants? That seems like knowledge the lexer shouldn't have if
the lexer is to be simple. Further, even if one parser needs/wants a binary
integer value, that parser might be swapped out at a later date and replaced
with a different parser that wants the text number.

It seems to me that the lexer should return to the parser the text number and
it is the responsibility of the parser to convert the value to an integer data
type if it desires.

What do you think?

/Roger
[I think the lexer should provide the tokens that the parser needs. If
integers are always handled as numbers, convert them, if not, don't.
If the parser does one and later changes to do the other, you can
change the lexer, too. -John]

gah4

unread,

Jul 14, 2022, 2:13:07 PM7/14/22

to

On Thursday, July 14, 2022 at 8:10:56 AM UTC-7, Roger L Costello wrote:

> A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
> expressions. When the lexer encounters an integer lexeme, it casts the lexeme
> to a binary integer and returns the value to the parser. The lexer contains a
> rule that looks something like this:

> {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }

A common example is an RPN or algebraic calculator.
(Which evaluates numerical expressions.)

It is a nice simple example, compared to a complete programming language.

For algebraic, you get into precedence and such, so real parsing problems,
but in a very simple, and easy to follow, case.

The other things, is that it isn't so easy to return character strings in C.
You would malloc() it and copy the value over, and then hope that
somewhere later it is free()d. (I suspect that is done more in Java.)

In the STEP processor that I previously wrote about, the only return
type is character string, and the processor keeps track of allocation.
Macros that do arithmetic convert to numerical type, evaluate an
expression, and convert back to characters.

The code for the processor itself, as usual for compilers, is written
using itself, has a macro to evaluate a constant integer
expression. That is useful, as Fortran 66 (and Fortran 77)
don't do that.

George Neuner

unread,

Jul 14, 2022, 10:13:42 PM7/14/22

to

On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
<cost...@mitre.org> wrote:

>Hi Folks,
>
>A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
>expressions. When the lexer encounters an integer lexeme, it casts the lexeme
>to a binary integer and returns the value to the parser. The lexer contains a
>rule that looks something like this:
>
>{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
>
>But, but, but, ...
>
>Countless times on this list I have been told: Keep the lexer simple!
>
>By converting the lexeme to an integer, the lexer has assumed that the parser
>needs/wants a binary integer, not a text number. How does the lexer know what
>the parser needs/wants?

Presumably because the same developer that wrote the parser also wrote
the rules for the lexer.

In many (actually most) cases, the binary representation of an integer
can be stored in less space than the text representation. When lex
and yacc were created, memory was small (and extremely expensive) and
storing things /small/ was very important.

This is also why compilers "intern" symbols and most strings so as to
store only one copy of any given text regardless of how many times it
is seen.

>That seems like knowledge the lexer shouldn't have if
>the lexer is to be simple. Further, even if one parser needs/wants a binary
>integer value, that parser might be swapped out at a later date and replaced
>with a different parser that wants the text number.

That's a valid point, but swapping "components" in that way is rarely,
if ever, done. IME developers treat the lexer and parser combination
as a single input subsystem.

>It seems to me that the lexer should return to the parser the text number and
>it is the responsibility of the parser to convert the value to an integer data
>type if it desires.
>
>What do you think?

The general problem is performance. The place to decide what class a
token falls into is where you contact the text.

In principle the lexer could be as simple as strtok() - just
separating the tokens, returning a pointer directly to its input
buffer, and punting all further processing to the parser.

But ...

The lexer has already scanned the text just to locate the end of it.
If it then simply gives the text to the parser, the parser must scan
the text AGAIN to verify that it represents a number (or whatever the
parser happens to be looking for at that point).

Then the text gets scanned YET AGAIN to convert it to a number.

That is 3 scans of the text if the parser does the work versus 2 scans
if the lexer does the work.

Now you may think numeric constants are short and occur rarely enough
that this shouldn't matter ... and you'd be correct.

But ...

Remember that other things have to be recognized and classified as
well: in particular language reserved words and symbols, and all kinds
of user chosen names. So, a (possibly case sensitive) dictionary of
already seen unique text must be maintained.

Searching and maintaining the dictionary will involve yet more passes
over the text: for indexing/hashing, for comparison, and for copying
if the text is new and needs to be preserved.
[These passes are unavoidable and must be done regardless of who does
the work ... but they need to be mentioned nonetheless.]

Moving (at least gross) input classification into the lexer avoids at
least one pass over the entirety of the input. When considered
against the length of the average input, that can add up to a LOT of
avoided work.

>/Roger
>[I think the lexer should provide the tokens that the parser needs. If
>integers are always handled as numbers, convert them, if not, don't.
>If the parser does one and later changes to do the other, you can
>change the lexer, too. -John]

+1

George

Spiros Bousbouras

unread,

Jul 15, 2022, 12:07:08 PM7/15/22

to

On Thu, 14 Jul 2022 16:38:32 -0400
George Neuner <gneu...@comcast.net> wrote:
> On Thu, 14 Jul 2022 10:25:24 +0000, Roger L Costello
> <cost...@mitre.org> wrote:
>
> >Hi Folks,
> >
> >A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
> >expressions. When the lexer encounters an integer lexeme, it casts the lexeme
> >to a binary integer and returns the value to the parser. The lexer contains a
> >rule that looks something like this:
> >
> >{INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
> >
> >But, but, but, ...
> >
> >Countless times on this list I have been told: Keep the lexer simple!
> >
> >By converting the lexeme to an integer, the lexer has assumed that the parser
> >needs/wants a binary integer, not a text number. How does the lexer know what
> >the parser needs/wants?
>
> Presumably because the same developer that wrote the parser also wrote
> the rules for the lexer.

[...]

> >It seems to me that the lexer should return to the parser the text number and
> >it is the responsibility of the parser to convert the value to an integer data
> >type if it desires.
> >
> >What do you think?
>
> The general problem is performance. The place to decide what class a
> token falls into is where you contact the text.
>
> In principle the lexer could be as simple as strtok() - just
> separating the tokens, returning a pointer directly to its input
> buffer, and punting all further processing to the parser.
>
> But ...
>
> The lexer has already scanned the text just to locate the end of it.
> If it then simply gives the text to the parser, the parser must scan
> the text AGAIN to verify that it represents a number (or whatever the
> parser happens to be looking for at that point).
>
> Then the text gets scanned YET AGAIN to convert it to a number.
>
> That is 3 scans of the text if the parser does the work versus 2 scans
> if the lexer does the work.

I don't see why it has to be this way. The lexer can return a structure
with components START , END , KIND which says that a token of kind , for
example , "integer number" was found in the input stream starting at START
and ending at END .The parser wouldn't have to verify again that it is
a number and would simply go over the characters from START to END a 2nd
time if it needs to convert the number to a different format. But if the
parser knows that a number at this point is a syntax error then it can
simply signal that and save the extra processing required for conversion
i.e. call atoi() or whatever.

[...]

> Moving (at least gross) input classification into the lexer avoids at
> least one pass over the entirety of the input. When considered
> against the length of the average input, that can add up to a LOT of
> avoided work.

I don't think Roger was suggesting that the lexer should not do gross input
classification (like recognise that a token is a number) but rather that
perhaps it should not convert the number.
[In flex lexers you can't count on text of the token being available
after the action routine is done, so you have to make a copy and keep
track of the copy. That is usually way more work than whatever else
you might do with it. -John]

Jan Ziak

unread,

Jul 15, 2022, 12:09:05 PM7/15/22

to

On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
> In many (actually most) cases, the binary representation of an integer
> can be stored in less space than the text representation.

The output of a command such as (cd /usr/src/linux; grep --only-matching
--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.

Binary, fixed-width, representation of integers is statistically more
space-efficient compared to implicit-width textual representation only if the
text representation of the integers includes (for example): a plain 32/64-bit
pointer to the start of the text, a plain 16/32/64-bit relative/absolute
offset to the start of the number in a character array, the [length of the
textual form of the number] in an explicit form.

Binary, fixed-width, representation of integers is more likely to be more
space-efficient than [their textual representation with an implicit length] if
the source of the integers doesn't originate from a human hand typing digits
on a keyboard. For example, when the source of the integers is a
high-precision measurement device, is something with a physical counterpart
(such as: the GPS coordinates of objects spread across Earth's surface), etc.

-atom
[It's still hard to imagine that the size difference would matter in a compiler.
If you're logging a million values a second and saving it to an archive, well,
that's different. -John]

Kaz Kylheku

unread,

Jul 15, 2022, 12:28:27 PM7/15/22

to

On 2022-07-14, Roger L Costello <cost...@mitre.org> wrote:
> Hi Folks,
>
> A common example in books on Lex/Flex and Yacc/Bison is evaluating arithmetic
> expressions. When the lexer encounters an integer lexeme, it casts the lexeme
> to a binary integer and returns the value to the parser. The lexer contains a
> rule that looks something like this:
>
> {INTEGER} { yylval.intval = atoi(yytext); return NUMBER; }
>
> But, but, but, ...
>
> Countless times on this list I have been told: Keep the lexer simple!
>
> By converting the lexeme to an integer, the lexer has assumed that the parser
> needs/wants a binary integer, not a text number.

By the facts alone that you have something called "yylval" with a
"yylval.intval" member, you're encoding a tight integration with a
parser, from which these things are coming.

There is no "yylval" global variable in a lex-generated program.

> How does the lexer know what
> the parser needs/wants? That seems like knowledge the lexer shouldn't have if
> the lexer is to be simple.

That's like asking, how does the ethernet driver in Linux know that
the protocol stack wants a "struct sk_buff *"?

That seems like knowledge the driver shouldn't have if it is to be
simple, and usable in any operating system.

Sometimes you just make integration decisions: you make the pieces agree
on some convenient data structure for data interchange and move on,
tossing aside concerns like whether a given piece can be reused in any
conceivable software system with the greatest possible ease.

> Further, even if one parser needs/wants a binary
> integer value, that parser might be swapped out at a later date and replaced
> with a different parser that wants the text number.

Chances are that will never happen; write for today.

A lexer which preserves the original lexemes could be useful for
software such as code reformatters.

If you strongly suspect that you're working on some language that will
eventually have tooling that includes a code reformatter, then maybe
plan for that in the scanning/parsing.

(Right? I mean you clearly don't want a source-to-source beautifier to
be rewriting 0xFF as 255, except as carefully controlled option.)

A lexer could prepare a token structure which has all the information;
a converted semantic item, and the lexeme.

> It seems to me that the lexer should return to the parser the text number and
> it is the responsibility of the parser to convert the value to an integer data
> type if it desires.

But how does the parser know what the parser's client wants? What if
the parser's client wants a syntax tree with the textual lexemes, and
not objects like binary integers?

The same argument can be repeated here that the parser should just
produce a parse tree with literal tokens, punting the conversion problem
higher up.

> What do you think?

How about this: a given token like INTEGER is processed in one place in
the lexer, right? One lexing rule recognizes the integer and can convert
the textual integer to a semantic, computational integer.

Whereas in the parser's grammar, an INTEGER symbol could appear in
multiple places.

Would you rather add code to multiple places to convert a textual
integer to a value, or do it in one place, and just have the value
readily available in those dozens of places?

That's the main driver of why we convert semantic values at the lexical
level in the lex/yacc stack: so that if we have a rule like

whatever : foo INTEGER ':' INTEGER

we can just use $2 and $4 to refer to integer values, and not have
to do:

{
int x = get_integer($2);
int y = get_integer($4);

$$ = function_of(x, y);

free_lexeme($2); /* textual lexemes need dynamic alloc! */
free_lexeme($4);
}

versus:

{ $$ = function_of($2, $4); }

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Jan Ziak

unread,

Jul 15, 2022, 2:22:47 PM7/15/22

to

On Friday, July 15, 2022 at 6:09:05 PM UTC+2, John wrote:
> It's still hard to imagine that the size difference would matter in a
compiler.
> If you're logging a million values a second and saving it to an archive, well,
> that's different. -John

Size matters if the compiler is built around data caches as its core principle
(which means that not only the lexical phase is utilizing caches, but also
means that optimizations are being cached as well). Retrieving cached data
from a storage device (such as: DDR4 DRAM, NVMe SSD) is limited by current
hardware constraints to approximately 10 GB/s in mainstream PCs, while the
speed of computing a SHA-256 hash is limited to approximately 2 GB/s per CPU
core. It is easy to imagine sketches of a compiler based on such a core
principle - it is hard to create a complete "fully fledged" compiler based on
such a core principle.

Btw: Starting a sentence with "It's still hard to imagine that ..." pretty
much ensures that a counter-example will be sent in response.

-atom
[Surely you're aware that the best way to get an answer to a question on
the Internet is to post a wrong answer. -John]

matt.ti...@gmail.com

unread,

Jul 16, 2022, 1:09:14 PM7/16/22

to

On Thursday, 14 July 2022 at 11:10:56 UTC-4, Roger L Costello wrote:
> [...]

> It seems to me that the lexer should return to the parser the text number
and
> it is the responsibility of the parser to convert the value to an integer
data
> type if it desires.
>
> What do you think?

The division of the job into lexing and parsing is *not* an important
separation of concerns. Both of these are written at the same time, by the
same person, and the requirements of the parser feed into the lexer in many
detailed ways. Their specifications are highly coupled and, I expect, almost
always written by the same person pretty much simultaneously.

Instead, this division -- a great big line that divides one part of a
context-free grammar from the other -- is an annoying practical concession
that we make to improve performance in time and size (smaller tables, and
optimization for text), and to get around the limitations in our tools (LR(1)
is a lot less useful when the (1) is a character).

This is specifically why all the answers here are giving you less than
compelling practical justifications for parsing numbers in the lexer and
nobody seems to mind. There really is no "should" and "should not" w.r.t. the
division between lexing and parsing except what is practical.

Christopher F Clark

unread,

Jul 17, 2022, 1:29:53 PM7/17/22

to

You are asking the wrong question. You are optimizing at the wrong level.
Stop.

Not that long ago, I wrote an article on Quora about this exact phenomenon.

https://www.quora.com/What-do-most-programmers-do-when-optimizing-code-that-is-essentially-wrong/answer/Christopher-F-Clark-1?ch=10&oid=337787257&share=2807f9fb&srid=20qA&target_type=answer

You are focusing on the trivial, the irrelevant. It is unlikely that
having the lexer convert integers (or floats or quaternions) into a binary
representation is a sufficiently expensive operation to make sense fretting
about it. And I mean that both in terms of runtime and time spent thinking
about it. Just do whatever lexer you are following from as an example does
and assume they made the right choice. The only time you should pay
attention to it, is once you have something working correctly and you have
measured its performance and determined that the code in question is
actually a bottleneck that is significantly impacting performance. Then,
you have a reason to revisit that choice.

And, when you do, you are as likely to find that the call to atoi is as
much a problem as having the lexer do it. atoi is going to rescan those
digits causing twice as much work as you would have done by building the
numeric representation while you lexed it. Of course, that alternative has
its own issues. It's clearly more complex code in the lexer.

But, again, you shouldn't be worrying about any of that, until you have
isolated it as an actual problem. Otherwise, figure out something simple
that works. Having the lexer call atoi seems pretty simple. It is also
likely to work over a rather large range of inputs. So, you haven't likely
bought yourself any problems by doing that.

Now, if you are doing SQL (or PL/I) float decimal numbers or handling of
BIGINTs (arbitrarily large integers) or reduced rationals, you might want
something more complex. But that's isn't the problem in this case.

Don't spend your time worrying about details until you know they are
important details.

Kind regards,
Chris

--
******************************************************************************

Chris Clark email:
christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

George Neuner

unread,

Jul 17, 2022, 5:23:18 PM7/17/22

to

On Fri, 15 Jul 2022 03:02:07 -0700 (PDT), Jan Ziak
<0xe2.0x...@gmail.com> wrote:

>On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
>> In many (actually most) cases, the binary representation of an integer
>> can be stored in less space than the text representation.
>
>The output of a command such as (cd /usr/src/linux; grep --only-matching
>--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.
>
>Binary, fixed-width, representation of integers is statistically more
>space-efficient compared to implicit-width textual representation only if the
>text representation of the integers includes (for example): a plain 32/64-bit
>pointer to the start of the text, a plain 16/32/64-bit relative/absolute
>offset to the start of the number in a character array, the [length of the
>textual form of the number] in an explicit form.

???

Decimal 100 is 3 characters in text, but 1 byte in binary.
65535 is 5 characters, but 2 bytes.
2147483648 is 11 characters, but 4 bytes.

0x101010 is 8 characters, 1 byte.
052 is 2 characters, 1 byte.

This dichotomy holds for almost all numbers: regardless of what number
base is used for the text form, the binary form will be shorter.

>Binary, fixed-width, representation of integers is more likely to be more
>space-efficient than [their textual representation with an implicit length] if
>the source of the integers doesn't originate from a human hand typing digits
>on a keyboard. For example, when the source of the integers is a
>high-precision measurement device, is something with a physical counterpart
>(such as: the GPS coordinates of objects spread across Earth's surface), etc.

The source of the value is not relevant - only the form.

>-atom

You completely misunderstood/misinterpreted what I wrote.

I said the binary form CAN BE stored in less space than the text ... I
said NOTHING about being restricted to fixed sized binary integers. It
should be obvious to anyone that storing 42 into a 64-bit integer is
not saving space.

YMMV,
George
[I think the point was that variable length strings are more common than
variable length binary integers, but as a message earlier today pointed out,
it's not likely to make any difference in practical compilers. -John]

George Neuner

unread,

Jul 17, 2022, 9:07:23 PM7/17/22

to

On Sun, 17 Jul 2022 16:52:20 -0400, George Neuner
<gneu...@comcast.net> wrote:

>0x101010 is 8 characters, 1 byte.

Duh! Of course that should be b101010, 7 characters and not
representable in C. Oh well.

George
[It's a perfectly good integer but I think we're done beating the dead horse. -John]

gah4

unread,

Jul 18, 2022, 12:30:51 PM7/18/22

to

On Sunday, July 17, 2022 at 10:29:53 AM UTC-7, christoph...@compiler-resources.com wrote:
> You are asking the wrong question. You are optimizing at the wrong level.
> Stop.

(snip)

> You are focusing on the trivial, the irrelevant. It is unlikely that
> having the lexer convert integers (or floats or quaternions) into a binary
> representation is a sufficiently expensive operation to make sense fretting
> about it.

Not so long ago, I wrote about the whole idea of separating lexing and parsing
was overoptimizing. It was needed many years ago, when computer memories
were in kilobytes, not yet megabytes or gigabytes. But got almost no comment.

Converting to integer has lots of problems, especially in the case of overflow.

In some cases, and even for some people, separating lexing and parsing might
make it easier to write and/or understand. Those are important.

But maybe not.

The STEP processor, which I have written about before, mostly doesn't
separate them. There are some built-in rules related to where macro
matches can start and end, and those are similar to the results of
some lexical analysis. Some of those rules speed up processing, which
was more important 45 years ago than now.

I suspect that I believe now that they could be separate, but implemented
by the same program. That would allow them to be separated where
convenient for the user, and also not separated where it is easier.
[In my experience separating the lexer from the parser makes it a lot easier
to deal with common lexical situations like skipping white space and comments.
You could certainly do that in a combined scheme but I'm not sure it would end
up any simpler. -John]

Thomas Koenig

unread,

Jul 18, 2022, 12:31:16 PM7/18/22

to

George Neuner <gneu...@comcast.net> schrieb:

> On Fri, 15 Jul 2022 03:02:07 -0700 (PDT), Jan Ziak
><0xe2.0x...@gmail.com> wrote:
>
>>On Friday, July 15, 2022 at 4:13:42 AM UTC+2, George Neuner wrote:
>>> In many (actually most) cases, the binary representation of an integer
>>> can be stored in less space than the text representation.
>>
>>The output of a command such as (cd /usr/src/linux; grep --only-matching
>>--recursive "\b[0-9][0-9]*\b") proves the falsity of the above claim.
>>
>>Binary, fixed-width, representation of integers is statistically more
>>space-efficient compared to implicit-width textual representation only if the
>>text representation of the integers includes (for example): a plain 32/64-bit
>>pointer to the start of the text, a plain 16/32/64-bit relative/absolute
>>offset to the start of the number in a character array, the [length of the
>>textual form of the number] in an explicit form.
>
> ???
>
> Decimal 100 is 3 characters in text, but 1 byte in binary.

... if it is known that the number is in the range that can be
represented by a single byte. If it is stored into a four-byte
integer, 100 takes four bytes (obviously). If there is a length
specification, it might even be longer.

But of course our moderator was right when he wrote

[...]

gah4

unread,

Jul 20, 2022, 8:45:07 PM7/20/22

to

On Monday, July 18, 2022 at 9:30:51 AM UTC-7, gah4 wrote:

(snip, or moderator wrote)

> [In my experience separating the lexer from the parser makes it a lot easier
> to deal with common lexical situations like skipping white space and comments.
> You could certainly do that in a combined scheme but I'm not sure it would end
> up any simpler. -John]

Interesting. As I previously noted, STEP mostly doesn't do a separate lexical analysis.

It does, however, do three things before the macros see the input: convert multiple
blanks to a single blank, pass apostrophed strings through whole, and remove
comments delimited by double quotes.

Apostrophed strings are slightly more interesting. Internal double apostrophes
are converted to single apostrophes, and the delimiting apostrophes are
converted to a special character that isn't an input character.

One of my projects 45 years ago, was to write macros to recognize the
syntax of IBM OS/360 Fortran IV. Direct access I/O statements use
a single apostrophe to delimit the record number:

READ(1'N) X,Y,Z

There is no way to write macros for that syntax after the previous processing.

Much fun figuring out all the strange things done in programming language
syntax over the years.

Ev. Drikos

unread,

Jul 21, 2022, 12:16:55 PM7/21/22

to

On 18/07/2022 06:39, gah4 wrote:
> [In my experience separating the lexer from the parser makes it a lot easier
> to deal with common lexical situations like skipping white space and comments.
> You could certainly do that in a combined scheme but I'm not sure it would end
> up any simpler. -John]

Maybe not simpler, but it won't be necessarily more complex. I've just
transferred an example for SQL from my old desktop. The FSA/GLR parser
built can parse ie this command, without a scanner, unambiguously in the
simulator:

SELECT ALL FIRST,LAST FROM USERS;

Below, there are four BNF rules. The first has in the right part the
required lookahead operator '-:' as the grammar allows consecutive IDs.

The second rule may be seen as a meta-rule that will be expanded. That
is, the Builder will make the 'dirty' job for the programmer and attach
the separator to each element listed in the 4th BNF rule below, which in
turn must have only one symbol in each alternative (in the right side).

IMHO, this grammar doesn't look very complex, but others may see it so.

Regards,
Ev. Drikos

------------------------------------------------------------------------

<identifier body> ::=
<identifier start> [ <identifier part> ... ] -: <identifier part>

(<can be followed by separator>) ::=
( <can be followed by separator> ) <separator>
| ( <can be followed by separator> )

<separator> ::=
{ <comment> | <white spaces> }...

<can be followed by separator> ::=
<regular identifier>
| <unsigned numeric literal>
..
...
| <left brace>
| <right brace>

[I don't see how this grammar will allow a comment before the statement. -John]

luser droog

unread,

Jul 22, 2022, 1:01:18 PM7/22/22

to

This approach appears to offer a very nice simplification for most Algol-style
languages. But removing the white space entirely makes it harder (or impossible)
to parse languages like Python and Haskell which use the "offside rule" to
interpret the white space as delimiting multi-line constructs.

I haven't solved the above completely, but I've been building my parser combinators with
an eye towards supporting significant white space in the syntax analysis, while
mostly ignoring it.

It's parsers all the way down, but the parsers are designed to operate over lists,
so the infrastructure is agnostic as to the actual type of the elements of the list.
So, you can build the lexical analysis layer as a graph of parsers that work on
lists of integers (characters) and produce Symbol objects. The syntax analysis
layer can then be built as a graph of parsers that work on lists of Symbols.

The symbol object has an extra slot for stashing extra data, so the white space
can be captured and then hidden from the syntax analysis (unless some handler
or predicate function wants to peek in there).

Ev. Drikos

unread,

Jul 22, 2022, 1:01:50 PM7/22/22

to

On 21/07/2022 13:41, Ev. Drikos wrote:
> ...

>
> [I don't see how this grammar will allow a comment before the statement.
> -John]

The where-used list of the separator contains four regular rules. Which
means I've added them explicitly, as shown at the end of the message.

What I suspect may be restrictive is that the lookahead operator accepts
only the next token, simply one letter in a scannerless grammar. Yet, in
SQL the (Unicode) delimited identifier doesn't end ie with a repeated
symbol as the identifier body of a regular identifier.

Here is the grammar for further inspection and error report:
https://gist.github.com/drikosev/3104c3cb79b0b4d8f41da6162f4c65d3

Regards,
Ev. Drikos

-----------------------------------------------------------------------
<direct SQL statement> ::=
[ <separator> ] <directly executable statement> <semicolon>

<binary string literal>::= X<quote>[{<hexit><hexit>}...]<quote>
[{<separator><quote>[{<hexit><hexit>}...]<quote>}...]

<national character string literal> ::= (too long) ...

<Unicode character string literal> ::= (too long) ...