Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Bison determinis​tic LALR(1) parser for Java/C++ (kind of complex langauge) without 'lexar hack' support

212 views
Skip to first unread message

hsa...@gmail.com

unread,
Aug 17, 2012, 2:22:38 PM8/17/12
to
I need to write a parser for a programming langauge which is as
complex as C++/Java, and to even complicate the matter, there are
constructs in this langauge that doesn't allow me to use
type/identifier dis-ambiguating lexer hack. In other words, I will
have to return just one lexical token (say IDENTIFIER) from the lexer
for both type references as well as non-type variable references.

Given these restrictions, I was wondering if it would be a good idea
to pick yacc/bison for my parser...? Or, should I consider a hand
written recursive descent parser.

Regards.
[Get it working in bison, then in the unlikely event that's not fast
enough, profile your compiler to see where it's spending its time and
fix what needs to be fixed. Although in theory GLR can be very slow,
in practice the ambiguities are generally resolved within a few tokens
and the performance is fine. compilers always spend way more time in
the lexer than the parser anyway. Writing RD parsers by hand can be
fun, but you never know what language it actually parses. -John]

Hans-Peter Diettrich

unread,
Aug 18, 2012, 5:13:46 AM8/18/12
to
hsa...@gmail.com schrieb:
> I need to write a parser for a programming langauge which is as
> complex as C++/Java, and to even complicate the matter, there are
> constructs in this langauge that doesn't allow me to use
> type/identifier dis-ambiguating lexer hack.

Why don't you fix your language, and remove such ambiguities? Look at
Pascal or other Wirthian languages...

> In other words, I will
> have to return just one lexical token (say IDENTIFIER) from the lexer
> for both type references as well as non-type variable references.

This shouldn't be a big problem, as long as the parser does not rely
on such a distinction. Once a symbol has been defined, it can contain
some indication about its nature.

> Given these restrictions, I was wondering if it would be a good idea
> to pick yacc/bison for my parser...? Or, should I consider a hand
> written recursive descent parser.

I don't see how this decision is related to above problem.

> Regards.
> [Get it working in bison, then in the unlikely event that's not fast
> enough, profile your compiler to see where it's spending its time and
> fix what needs to be fixed. Although in theory GLR can be very slow,
> in practice the ambiguities are generally resolved within a few tokens
> and the performance is fine. compilers always spend way more time in
> the lexer than the parser anyway. Writing RD parsers by hand can be
> fun, but you never know what language it actually parses. -John]

There exist parser generators for several models. I also doubt that -
except in misdesigned C-ish languages - a compiler spends significant
time in the lexer. This may be true for dummy parsers, which do
nothing but syntax checks, but not for compilers with code generation,
optimization and more.

DoDi
[Compilers spend a lot of time in the lexer, because that's the only
phase that has to look at the input one character at a time. -John]

hsa...@gmail.com

unread,
Aug 18, 2012, 5:09:39 AM8/18/12
to
One of my concern is if I will get anywhere with Bison/yacc based
approach, given that I have the restriction that I will have to use
just one lexer token for both type as well as non-type identifier
refenreces.
[This is a frequent issue, not hard to deal with particularly if
you use GLR for the trickier cases. -John]

Hans-Peter Diettrich

unread,
Aug 19, 2012, 8:01:53 PM8/19/12
to
> [Compilers spend a lot of time in the lexer, because that's the only
> phase that has to look at the input one character at a time. -John]

When the source code resides in a memory buffer, the time for reading
e.g. the characters of an identifier (in the lexer) is neglectable vs.
the time spent in lookup and entering the identifier into a symbol table
(in the parser).

Even if a lexer reads single characters from a file, most OSs maintain
their own file buffer, so that little overhead is added over the
program-buffered solution.

I really would like to see some current benchmarks about the behaviour
of current compilers and systems.

DoDi
[The benchmarks I did were a while ago, but they showed a large
fraction of time in the lexer. I wouldn't disagree that building the
symbol table is slow, but figure out some estimate of the ratio of
the number of characters in a source file to the number of tokens,
and that is a rough estimate of how much slower the lexer will be
than the parser. I agree that some current analyses would be useful.
-John]

Anton Ertl

unread,
Aug 20, 2012, 9:35:57 AM8/20/12
to
Hans-Peter Diettrich <DrDiet...@aol.com> writes:
>> [Get it working in bison, then in the unlikely event that's not fast
>> enough, profile your compiler to see where it's spending its time and
>> fix what needs to be fixed. Although in theory GLR can be very slow,
>> in practice the ambiguities are generally resolved within a few tokens
>> and the performance is fine. compilers always spend way more time in
>> the lexer than the parser anyway. Writing RD parsers by hand can be
>> fun, but you never know what language it actually parses. -John]
>
>There exist parser generators for several models. I also doubt that -
>except in misdesigned C-ish languages - a compiler spends significant
>time in the lexer. This may be true for dummy parsers, which do
>nothing but syntax checks, but not for compilers with code generation,
>optimization and more.

Code generation and optimization do not change the relation between
the time spent in scanning and in parsing. Moreover, if the compiler
spends most of the time in code generation, optimization and/or
"more", there is even less reason to worry about parsing speed.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

Hans-Peter Diettrich

unread,
Aug 20, 2012, 11:14:38 AM8/20/12
to
Hans-Peter Diettrich schrieb:
Your estimation suggests that the benchmarked parser does not much
more than syntax checking the tokens. In this case I agree that a
parser automaton does less work (state transitions) than a lexer
automaton (or equivalents), and that detailed benchmarking is not
required to proof this fact.

I still disagree that "Compilers spend a lot of time in the lexer".
My point is that a real *compiler* does not normally spend significant
time in neither the lexer nor the parser[1]. In so far it's okay to
"profile your compiler to see where it's spending its time and fix
what needs to be fixed", as you said :-)


[1] The timing depends heavily on what tasks are assigned to the parser
itself. Is code generation part of the parser, or part of a subsequent
(optimization...) stage? What about symbol table (scopes) management and
lookup, and an eventual preprocessor which IMO in an C compiler consumes
more time than the lexer and parser altogether? But further discussion
of these academic issues doesn't help in spotting the real bottlenecks
of any concrete compiler ;-)

DoDi
[Perhaps you could do some experiments and tell us whether it confirms
your intuition. Like I said, when I profiled some compilers, I was
surprised to see how much time they spent grinding through the
characters in the input. -John]

BGB

unread,
Aug 20, 2012, 3:14:31 PM8/20/12
to

On 8/19/2012 7:01 PM, Hans-Peter Diettrich wrote:
>> [Compilers spend a lot of time in the lexer, because that's the only
>> phase that has to look at the input one character at a time. -John]
>
> When the source code resides in a memory buffer, the time for reading
> e.g. the characters of an identifier (in the lexer) is neglectable vs.
> the time spent in lookup and entering the identifier into a symbol table
> (in the parser).
>
> Even if a lexer reads single characters from a file, most OSs maintain
> their own file buffer, so that little overhead is added over the
> program-buffered solution.
>
> I really would like to see some current benchmarks about the behaviour
> of current compilers and systems.

My experiences are similar to those John is describing.

Basically, given the lexer/tokenizer processes things a character at a
time, it itself can end up being the bulk of the parsing time.

Usually this is only really noticable though in cases where the
tokenizer has to deal with a fairly large amount of text, such as what
happens when writing a C parser (and suddenly one may find itself with
10+ MB of preprocessor output to churn through).

My assembler had a similar issue, although it is generally fast enough
(and ASM code is typically small enough) that this isn't really a major
issue.


granted, for a C style parser, how this compares with having to check
whether or not an identifier is a type-name, is a secondary issue.

things like lookups can be potentially fairly expensive, but this can be
largely resolved via the use of hash tables, say, using a hash table to
identify keywords (mapping the name to a keyword ID-number or similar)
and lookup type-names (1).

my scripting language partly avoids this issue in the parser by making
most statements start with a keyword (pretty much anything that doesn't
start with a keyword is an expression), and identifying type-names by
the use of the language syntax (making them functionally no different
than normal identifiers). (OTOH... the script-language parser is a bit
less efficient, and works primarily via identifying statement types via
if/else logic and using "!strcmp()" to match keywords).


1: actually, my C parser uses a pretty big hash for type-name lookup,
namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry
is plenty sufficient (for a chain-hash). the main reason for this is the
large number of typedefs in headers like "windows.h", which can easily
saturate a 1024 or 4096 entry hash.

I have used bigger hash tables though, namely for things like interning
strings (64k) and dynamic+interface method-dispatch (256k, but this one
is an open-address hash).


> DoDi
> [The benchmarks I did were a while ago, but they showed a large
> fraction of time in the lexer. I wouldn't disagree that building the
> symbol table is slow, but figure out some estimate of the ratio of
> the number of characters in a source file to the number of tokens,
> and that is a rough estimate of how much slower the lexer will be
> than the parser. I agree that some current analyses would be useful.
> -John]
>

yep.

can't compare exactly, as my parsers tend to be recursive-descent and
build ASTs directly.

Hans-Peter Diettrich

unread,
Aug 21, 2012, 2:40:25 AM8/21/12
to
BGB schrieb:

> 1: actually, my C parser uses a pretty big hash for type-name lookup,
> namely 16k entry IIRC, whereas for most other things 1024 or 4096 entry
> is plenty sufficient (for a chain-hash).

The global (top level) symbol table can become very big, in detail in C
compilers. A separate table for type names is not required.

> the main reason for this is the
> large number of typedefs in headers like "windows.h", which can easily
> saturate a 1024 or 4096 entry hash.

Right. AFAIR I added capabilities to skip already processed header
files, at least when a guard is detected. Such (not fully conforming)
behaviour can influence the time spent with loading the same files
moreoften.

> I have used bigger hash tables though, namely for things like interning
> strings (64k) and dynamic+interface method-dispatch (256k, but this one
> is an open-address hash).
>
>
>> DoDi
>> [The benchmarks I did were a while ago, but they showed a large
>> fraction of time in the lexer. I wouldn't disagree that building the
>> symbol table is slow, but figure out some estimate of the ratio of
>> the number of characters in a source file to the number of tokens,
>> and that is a rough estimate of how much slower the lexer will be
>> than the parser. I agree that some current analyses would be useful.
>> -John]
>>
>
> yep.
>
> can't compare exactly, as my parsers tend to be recursive-descent and
> build ASTs directly.

The same for my compilers. My C to Pascal converter uses an LL(1)
parser, with only one exception where LL(2) lookahead is required.
Expressions are parsed differently, using an table-driven parser for
operator precedence and associativity. The preprocessor is implemented
in a couple of filter stages between the lexer and parser.


P.S.: When the time for loading source files can be separated from other
lexer tasks, I'd expect that this operation takes most of the time.

DoDi

BartC

unread,
Aug 21, 2012, 12:39:38 PM8/21/12
to
"Hans-Peter Diettrich" <DrDiet...@aol.com> wrote in message
>> [Compilers spend a lot of time in the lexer, because that's the only
>> phase that has to look at the input one character at a time. -John]
>
> When the source code resides in a memory buffer, the time for reading
> e.g. the characters of an identifier (in the lexer) is neglectable vs.
> the time spent in lookup and entering the identifier into a symbol table
> (in the parser).
>
> Even if a lexer reads single characters from a file, most OSs maintain
> their own file buffer, so that little overhead is added over the
> program-buffered solution.
>
> I really would like to see some current benchmarks about the behaviour
> of current compilers and systems.

I was recently testing a new language with a bare lexer for a near-identical
syntax.

This test, which excluded file input and any identifer lookups (just
tokenising), ran at 17M chars/second, 3M tokens/second, and some 800K
lines/second, on a low-end desktop PC. At that rate, tokenising the entire
compiler might take 25msec.

If that represented the bulk of the compilation time, then I'd be quite
happy! But I suspect it will be just a fraction.

While lexers do have to deal with a character at a time, the processing
might be quite simple (eg. merely six instructions for each character of an
identifier in my version).

Oh, I should mention that this lexer is written in an interpreted,
dynamically typed language. That means a native code version might process
some 20M tokens per second (3 or 4 msec to scan the entire compiler). That
doesn't really suggest it's going to be a bottleneck!

> [The benchmarks I did were a while ago, but they showed a large
> fraction of time in the lexer.
...
-John]

Testing an older compiler (which doesn't lend itself to isolating just the
bare tokeniser), showed it spent about 50% of compilation time in
tokenising, lookups and parsing. This generates simple byte-code, so with a
more sophisticated one with more passes, optimisation and so on, it would
likely be even less.

(But it's also possible my compilers are highly inefficient apart from the
tokenising parts..)

--
Bartc

BGB

unread,
Aug 21, 2012, 3:45:33 PM8/21/12
to
(sorry in advance for sort of wandering off on a tangent...).


a major thing that I think skews things so much in the case of C is just
how much stuff can get pulled in by the preprocessor, the bulk of which
ends up quickly being discarded by subsequent stages (and much of the
rest is only minimally processed by later stages).

actually, a fairly large portion of the time may actually be spent in
the preprocessor, which may basically sit around, for each source line:
splitting it into tokens;
checking if any of the identifiers in the source-line are names for
macros (and expanding as necessary);
repeating the above until no more macro expansions are seen;
recomposing the line and spitting it into the output.

then, for example, there are typically large numbers of typedefs and
structure definitions, very few of which may end up actually being used
by the code in question (most just get put into a table and then forgotten).

then, after all this, the compiler is typically left looking at a fairly
small chunk of (actual) code.

if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms
compiling it, where has most of the time gone?...


OTOH, in my scripting language (which doesn't use headers), I have
generally been looking at millisecond-range compile times.

although admittedly most of my compilers have tended to be pretty dumb,
typically working like (for the upper end):
split into tokens and parse the code directly into an AST;
perform basic simplifications on the AST (evaluating constant expressions);
perform basic type-analysis / inference (generally forward propagation
driven by declarations and assignment operations);
flatten this out into a bytecode-style format (binary serialization is
optional).

this is then currently followed by a backend which translates the
bytecode into threaded-code and runs this (generally also pretty fast,
and generally functions/methods are translated on call).

my code-generator backend was a little more involved, basically doing
its own type-analysis, mapping the stack to registers and temporaries,
and using peephole optimization. this backend fell into disuse mostly
because I was having a hard time really debugging it, and it was eating
up a lot of my time trying to do so (threaded code generally being a
simpler option).


unlike many other compilers though, my stuff is generally based on a
stack-machine model, so pretty much everything (from intermediate
values, to handling local variables and environment scopes) is handled
using a stack-machine model.

for example, in my bytecode format, local variables (and function
arguments, and parent scopes) are identified by their position on the
variable stack (with declarations pushing new variables onto the stack,
and other operations popping them back off).

note that my bytecode is lexically scoped, but tends to see the entire
lexical environment as a big stack (which may actually be split across
multiple call frames, or have parts be allocated on the heap as well).

there are also the dynamic, object ("this" scope), and toplevel
environments, but lookups into these are given by name or by
name+signature, rather than by an explicit stack index as in the case of
the lexical environment. note that packages/import build on delegation,
which is built on the object scope (packages are actually implemented as
objects, and imports are actually lexical variables used to delegate to
these objects).


but, in general, I just seem to find the stack model generally a lot
easier to understand and work with.



my code generator also tied register allocation and lifetime to these
stacks as well, although later on it had added temporaries which were
identified by handle and had an independent lifetime, rather than being
tied to the stack-machine model (and were, ironically, much closer to
how the CPU stack was actually being utilized by the code generator, as
the evaluation stack and local/environment stack were not mapped 1:1
with the CPU stack, whereas the temp-vars were keyed 1:1 with their
physical storage).

however, it was introducing temporaries and registers which existed
independently of the stack-machine model which is where the pain began.
if I ever get back to working on or writing a new native code generator,
I will probably just put the temporaries on their own stack, thus any
temporary entering its lifetime is pushed onto the temp-var stack, and
popped off when its lifetime ends, mostly in the name of being less of a
PITA to work with.

note that, in any case, the actual stack-frame (at the machine-level) is
basically a fixed-size region which is divided up and allocated (as
appropriate) for things like holding variables and evaluation-stack
entries (and treated as an array of machine-words allocated using a bitmap).

note that some minor ABI hacks were made to allow for things like
lexical scoping, but otherwise it was using the same ABI as C (C style
block functions being callable directly as C function pointers).


currently, the threaded-code uses a more narrowly defined subset of the
C ABI, basically where all calls have exactly 2 arguments: the VM
context and the current threaded-opcode context. this, combined with
most of the logic being interpreter-style and written in C, does a bit
of harm to the performance though (but, it is still "fast enough" for
general application scripting tasks).

but, hopefully with hope and luck, I will be able to justify the effort
to be able to return to using a full-featured JIT.

granted, I could always just be like, "hell with it, I am just going to
do it..." (whether or not it is worthwhile), which sort of ends up being
invoked occasionally (for things where I can't really justify doing, but
want to do them anyways...).

BartC

unread,
Aug 22, 2012, 9:04:20 AM8/22/12
to
"BGB" <cr8...@hotmail.com> wrote in message
> On 8/20/2012 8:35 AM, Anton Ertl wrote:

>> Code generation and optimization do not change the relation between
>> the time spent in scanning and in parsing. Moreover, if the compiler
>> spends most of the time in code generation, optimization and/or
>> "more", there is even less reason to worry about parsing speed.
>
>
> (sorry in advance for sort of wandering off on a tangent...).
>
>
> a major thing that I think skews things so much in the case of C is just
> how much stuff can get pulled in by the preprocessor, the bulk of which
> ends up quickly being discarded by subsequent stages (and much of the
> rest is only minimally processed by later stages).

I think much of that is the fault of the language and over-reliance on the
pre-processor. Instead of relying on neat language features to minimise the
sizes of headers, everything has to be spelled out in a long-winded way.
Lack of proper namespaces (in lists of enumerations for example) means
identifiers are long and unwieldy, which doesn't help.

> if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms
> compiling it, where has most of the time gone?...

But even given all that, there are ways of dealing with huge header files so
that it is not necessary to repeatedly tokenise and parse the same headers
over and over again (for recompiling the same module, or compiling many
modules all sharing the same headers).

I've no idea whether many C compilers actually bother though; perhaps it's
easier to just recommend a faster computer..

> OTOH, in my scripting language (which doesn't use headers), I have
> generally been looking at millisecond-range compile times.

Well, that's more like it. Compilation time simply shouldn't be an issue,
not for compiling a single module anyway.

And has never been a problem for me, ever, no matter what hardware I was
using, partly thanks to using my own tools.

I only get a delay when there are interdependencies and I just recompile
everything for simplicity. Then I might have to twiddle my thumbs for 5-10
seconds. But then, I now have to rely on some external tools..

> although admittedly most of my compilers have tended to be pretty dumb,
> typically working like (for the upper end):
> split into tokens and parse the code directly into an AST;
> perform basic simplifications on the AST (evaluating constant
> expressions);
> perform basic type-analysis / inference (generally forward propagation
> driven by declarations and assignment operations);
> flatten this out into a bytecode-style format (binary serialization is
> optional).
>
> this is then currently followed by a backend which translates the
> bytecode into threaded-code and runs this (generally also pretty fast,
> and generally functions/methods are translated on call).

I have two current compilers, one native code, and this one for bytecode for
a dynamically typed, non-type-inferred language:

source -> lexer/parser -> types -> code generator -> optim -> binary
bytecode file

The type pass does very little here, mainly checking l-values and reducing
constant expressions. The optim pass does very little too, just reducing
some common bytecode combinations into one. Nevertheless, the resulting
bytecode, even with a straightforward, non JIT-ing interpreter, can make a
basic lexer run at some 3M tokens/second as I mentioned in another post.

The native code compiler is more involved (I hate these ones! But I need one
to implement the interpreter):

source -> lexer/parser -> names -> types -> intermediate code generator ->
target code generator -> optim -> asm source file

The optim stage does some peephole stuff, but haven't gone as far as having
some variables allocated to registers. Last time I checked, it was
perhaps 40-50% slower than gcc-O1, averaged over ~20
integer/floating-point-intensive benchmarks. That might do me.

--
Bartc
[I've seen C compilers that keep preparsed versions of headers. Dunno
what they do with #if. Also see Microsoft's C# and other .NET languages,
that put all of the type info in the objects, so you can use the object
as a compiled include file. -John]

BGB

unread,
Aug 26, 2012, 8:37:05 PM8/26/12
to
(this post had basically been sitting around several days unfinished and
unsent, mostly as I was more busy with other stuff...).


On 8/22/2012 8:04 AM, BartC wrote:
> "BGB" <cr8...@hotmail.com> wrote in message
>> On 8/20/2012 8:35 AM, Anton Ertl wrote:
>
>>> Code generation and optimization do not change the relation between
>>> the time spent in scanning and in parsing. Moreover, if the compiler
>>> spends most of the time in code generation, optimization and/or
>>> "more", there is even less reason to worry about parsing speed.
>>
>>
>> (sorry in advance for sort of wandering off on a tangent...).
>>
>>
>> a major thing that I think skews things so much in the case of C is just
>> how much stuff can get pulled in by the preprocessor, the bulk of which
>> ends up quickly being discarded by subsequent stages (and much of the
>> rest is only minimally processed by later stages).
>
> I think much of that is the fault of the language and over-reliance on the
> pre-processor. Instead of relying on neat language features to minimise the
> sizes of headers, everything has to be spelled out in a long-winded way.
> Lack of proper namespaces (in lists of enumerations for example) means
> identifiers are long and unwieldy, which doesn't help.
>

well, C is C.
other languages also have things like import, but C doesn't, so alas.


>> if we spend 1000ms preprocessing and parsing the code, and 10ms or 100ms
>> compiling it, where has most of the time gone?...
>
> But even given all that, there are ways of dealing with huge header
> files so
> that it is not necessary to repeatedly tokenise and parse the same headers
> over and over again (for recompiling the same module, or compiling many
> modules all sharing the same headers).
>
> I've no idea whether many C compilers actually bother though; perhaps it's
> easier to just recommend a faster computer..

the problem here is that, although it isn't too hard to figure out
possible optimizations, it is much harder to make them work in ways
which don't violate the C standard.

another issue is that things like precompiled headers are non-standard,
and there is no real agreed-on convention for "hey, compiler, feel free
to use precompiled headers here".

like, say, "#pragma precompiled_header" or similar (say, put before the
include guard to indicate to the compiler that this header may be used
as a precompiled header).

even if supported, it wouldn't generally be found in OS headers, meaning
it would be necessary to wrap the headers to use it.


a partial solution is to allow an option of just lumping together a
bunch of code and compiling it all at once, which can often be faster,
but isn't really applicable in many cases.


>> OTOH, in my scripting language (which doesn't use headers), I have
>> generally been looking at millisecond-range compile times.
>
> Well, that's more like it. Compilation time simply shouldn't be an issue,
> not for compiling a single module anyway.
>
> And has never been a problem for me, ever, no matter what hardware I was
> using, partly thanks to using my own tools.
>
> I only get a delay when there are interdependencies and I just recompile
> everything for simplicity. Then I might have to twiddle my thumbs for 5-10
> seconds. But then, I now have to rely on some external tools..
>

this is part of the issue.

my language instead just uses importing.
so, it is possible to "import" a module, and see whatever is declared in it.


currently, the language uses a mix of a Java and C# style model, where
importing gives the qualified name for the module, which is interpreted
as giving its path (relative to the VM's search path).

within each module, a "package" may be declared, which basically puts
the contents of the module into a given package. functionally, this is
more similar to the use of "namespace" in a language like C#. otherwise,
whatever is declared in the module is placed in the toplevel (it is also
possible to put contents in other packages as well).

there are edge cases here, so modifiers end up being needed both for
package and import directives, mostly to "fine-tune" the behavior.

the current default behavior though is "load the module if-needed, using
its qname, and then try to import the package named by the qname".

currently, there is no direct analogue of Java's "import everything in a
directory" semantics, but I have considered adding something like this,
only I have not yet gotten around to it.


but, in any case, since loading modules doesn't require processing a
small mountain of text in the process, it tends to go a lot faster (and
puts a lot less strain on the compiler/VM as well).
for my C compiler, I had a bytecode intermediate stage, but no bytecode
interpreter (apart from a minor oddity that the optimizer logic could be
used itself as a limited interpreter).

this bytecode was then translated into native code.

however, this was not well maintained, and the bytecode format in
question is no longer in use (though the bytecode used by my scripting
language is a "close sibling", it has gone in a different direction in
several ways, and currently only ever goes into being threaded-code).


then I am basically stuck with things as-is, as I am sadly too busy and
raw performance isn't a big enough priority to justify either compiling
my current live bytecode fully into native code, or for that matter to
revive my C compiler.


so, for the C compiler it was basically like:

( preprocessor -> tokenizer+parser -> AST reducer -> bytecode-compiler
-> (RPNIL) )
->
( (RPNIL) -> codegen passes -> (ASM) -> assembler -> (COFF) ->
runtime-linker )

the codegen basically was a multi-pass stack machine.
any values created were pushed to the stack, and operations would pop
values and push results (internally mapped to register or memory
operations). in the compiler, the compiler would "contain" any registers
or variables, rather than representing a physical memory region. entries
on the main value-stack were regarded as immutable. type-analysis was
also performed on the stack.

if values turned out to be constant here, then they would be evaluated
directly (and the literal value being placed on the stack instead).


for the scripting VM, the model is a little different, and type analysis
has been moved largely before emitting bytecode, so in this case it is a
little more like with the JVM or similar.

so, for my scripting VM:
( tokenizer+parser -> AST reducer -> type-inference -> bytecode-compiler
-> (BSIL) )
->
( (BSIL) -> threaded-converter -> (linked-list indirect threaded-code) )


what differences do exist make interfacing my C frontend to my scripting
backend a bit problematic though. a big issue at the moment is that
naive translation of C ASTs to the bytecode would produce giant modules
(with all the header contents as well) and there is no "good" way to
eliminate this after-the-fact.

as-is, in the script VM bytecode, any structures/typedefs/prototypes/...
end up directly in the bytecode, and are "constructed" by executing the
toplevel, which is a bad scenario for C.

otherwise, my VM currently handles C metadata by storing it in a
database-format, which would likely need to be used along with the
bytecode in this case (and omitting all of these declarations from the
toplevel). normally, these databases are used by the VM to annotate
natively-compiled code, and are currently inserted into the DLL
post-compile via a tool (technically, my C compiler front-end is used to
perform analysis and build these databases, from headers, but a native
compiler produces the actual machine-code).

this would likely mean producing object modules containing both bytecode
and a database, and likely having a "link" stage which merges the
database contents.


not that any of this can't be done though, but it is harder to justify
doing so...


> --
> Bartc
> [I've seen C compilers that keep preparsed versions of headers. Dunno
> what they do with #if. Also see Microsoft's C# and other .NET languages,
> that put all of the type info in the objects, so you can use the object
> as a compiled include file. -John]
>

AFAIK, the preparsed/precompiled headers for C generally handle #if and
#ifdef and similar during the preprocessor as usual. this seems to be a
large part of why there are many restrictions on the use of precompiled
headers in those compilers which support them.


AFAICT, languages like C# delay commands like #if or #ifdef until later
(and impose restrictions on how they may be used). IIRC, they are
generally handled at linking or at JIT.


although it "could" be possible to move #if and friends into a later
compilation stage, it is not really possible to do this without
violating the C standard in the process.

one possible compromise could be to introduce another compiler
extension, such as:
#define_target /name/

which would basically say to the compiler "this particular define will
be handled on a per-target basis" and thus delay certain defines to a
later compilation stage.

say:
#define_target FOO
...
#ifdef FOO
...
#endif

could have the preprocessor spit out something like:
__target_ifdef(FOO) {
...
};

but would retain the normal behavior for use of conventional #define's
(and thus generally avoid breaking C standard conformance, or breaking
code which uses macros in "clever" ways to alter the syntax, 1).

1: consider examples like this:
#ifdef SOMEFEATURE
#define SOMEFEATURE_BEGIN
#define SOMEFEATURE_END
#else
#define SOMEFEATURE_BEGIN /##*
#define SOMEFEATURE_END *##/
#endif
or something as mundane as:
#ifndef __cplusplus
#define EXTERN_C extern "C"
#define BEGIN_EXTERN_C EXTERN_C {
#define END_EXTERN_C }
#else
#define EXTERN_C
#define BEGIN_EXTERN_C
#define END_EXTERN_C
#endif


my script-VM actually does something vaguely along these lines, and
folds the contents of any ifdef or ifndef statements into their own
bytecode-blocks ("ifdef" and "ifndef" also have their own special
bytecode instructions). these are handled when code is converted into
threaded code (the instruction either becomes a special-call, or a no-op).

BartC

unread,
Aug 29, 2012, 5:03:31 PM8/29/12
to
"BGB" <cr8...@hotmail.com> wrote in message
news:12-0...@comp.compilers...

> On 8/22/2012 8:04 AM, BartC wrote:

>> But even given all that, there are ways of dealing with huge header files so
>> that it is not necessary to repeatedly tokenise and parse the same headers
>> over and over again (for recompiling the same module, or compiling many
>> modules all sharing the same headers).
>>
>> I've no idea whether many C compilers actually bother though; perhaps
>> it's easier to just recommend a faster computer..
>
> the problem here is that, although it isn't too hard to figure out
> possible optimizations, it is much harder to make them work in ways
> which don't violate the C standard.
>
> another issue is that things like precompiled headers are non-standard,
> and there is no real agreed-on convention for "hey, compiler, feel free
> to use precompiled headers here".

Why should how a compiler optimises its work violate the standard?
Provided the end results are the same, the compiler can do what it
likes.

However the C language and the way the C compilers are typically
invoked (for example just one-time to compile one module, so it's
forgotten it's compiled the same sets of headers a moment before)
doesn't make things easy. And it's possible that things such as
__TIME__ have been used in such a way that you are obliged to
recompile a header each time.

So it's easy to see why compilers may not bother!

Nevertheless, I think there is plenty that can be done, although I'm
not sure that creating intermediate files such as precompiled headers
is the way to go. It's better when a compiler is properly integrated
into an IDE, then the symbol tables built by a set of headers can be
made persistent much more easily.

Alternatively, it might be possible to just have a very faster parser!
And perhaps integrate the preprocessor so that it is not a separate
pass (I haven't attempted a C compiler so not sure if that's feasible;
my own source-level directives are handled by the lexer itself, or
sometimes by the parser).

>> [I've seen C compilers that keep preparsed versions of headers. Dunno
>> what they do with #if. Also see Microsoft's C# and other .NET languages,
>> that put all of the type info in the objects, so you can use the object
>> as a compiled include file. -John]
>
> AFAIK, the preparsed/precompiled headers for C generally handle #if and
> #ifdef and similar during the preprocessor as usual. this seems to be a
> large part of why there are many restrictions on the use of precompiled
> headers in those compilers which support them.
>
> AFAICT, languages like C# delay commands like #if or #ifdef until later
> (and impose restrictions on how they may be used). IIRC, they are
> generally handled at linking or at JIT.

With a new language then it's much easier to arrange matters so that
it's faster and simpler to compile. It might not even have conditional
directives, or any preprocessor at all; (C needs them because it is a
cruder, older language; I used to have conditional code, but no longer
and in any case it seemed an unsatisfactory approach).

--
Bartc

BGB

unread,
Sep 4, 2012, 2:45:51 PM9/4/12
to
On 8/29/2012 4:03 PM, BartC wrote:
> "BGB" <cr8...@hotmail.com> wrote in message

>> On 8/22/2012 8:04 AM, BartC wrote:
>
>>> But even given all that, there are ways of dealing with huge header
>>> files so
>>> that it is not necessary to repeatedly tokenise and parse the same
>>> headers
>>> over and over again (for recompiling the same module, or compiling many
>>> modules all sharing the same headers).
>>>
>>> I've no idea whether many C compilers actually bother though; perhaps
>>> it's easier to just recommend a faster computer..
>>
>> the problem here is that, although it isn't too hard to figure out
>> possible optimizations, it is much harder to make them work in ways
>> which don't violate the C standard.
>>
>> another issue is that things like precompiled headers are non-standard,
>> and there is no real agreed-on convention for "hey, compiler, feel free
>> to use precompiled headers here".
>
> Why should how a compiler optimises its work violate the standard?
> Provided the end results are the same, the compiler can do what it
> likes.

basically, because the C compilation process is difficult to optimize
while not changing behavior in various ("minor") ways.

a major one is the preprocessor, where:
defines in one header may subsequently cause different expansions in
another;
pretty much anything that can be done via textual substitution is
allowed by the preprocessor;
there are macros and built-in defines which may have time-dependent or
invocation-dependent behaviors;
...

and, the standard requires that all this sort of stuff works as-expected.


> However the C language and the way the C compilers are typically
> invoked (for example just one-time to compile one module, so it's
> forgotten it's compiled the same sets of headers a moment before)
> doesn't make things easy. And it's possible that things such as
> __TIME__ have been used in such a way that you are obliged to
> recompile a header each time.

yeah.

or things as simple as:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

where basically the presence of this define does itself alter the
behavior and contents of the header.


a partial optimization though is to optimize for the case where all the
modules in a library are compiled in batches, and where all of them use
the same header-lines. then it is possible to preprocess/parse the
headers once, and then reuse them for subsequent modules.

I think I remember at least one compiler doing this (apparently Borland?).

MSVC basically uses the whole "stdafx.h" thing, where they precompile
this and have (sometimes "force") everything to include it.

GCC apparently only allows it for the case of including a single header.


my considered option was basically having a way (in the headers) to tell
the compiler that it is allowed to violate the C standard in certain
ways for sake of optimizing the compilation process.


> So it's easy to see why compilers may not bother!

yeah.

the lack of a good way to approach the problem, and the relative
simplicity of the old/slow route, leaves the old/slow strategy as the
main way most code is compiled.


> Nevertheless, I think there is plenty that can be done, although I'm
> not sure that creating intermediate files such as precompiled headers
> is the way to go. It's better when a compiler is properly integrated
> into an IDE, then the symbol tables built by a set of headers can be
> made persistent much more easily.
>

actually, caching all of the definitions/defines/... resulting from
processing a set of headers can be done (I have done so in the past,
partly as an extension intended to allow faster "eval" of C fragments,
and this is also partly how the FFI for my BGBScript language works).

however, compiling C code using these would itself likely violate the
standard in subtle ways, so likely couldn't be used as a general-purpose
optimization.

the most obvious way to use this in a "moderately standard" way would,
ironically, likely assume a form resembling that of the "stdafx.h" system.


> Alternatively, it might be possible to just have a very faster parser!
> And perhaps integrate the preprocessor so that it is not a separate
> pass (I haven't attempted a C compiler so not sure if that's feasible;
> my own source-level directives are handled by the lexer itself, or
> sometimes by the parser).

I am not entirely sure that this would buy much, but would likely be
more complicated (and using a single pass for preprocess+parse+lexing
could actually make the overall process slower).

a partial issue is that the preprocessor works line-by-line, but a C
parser does not.

combining them would likely lead to a case where there were mismatches
between what the preprocessor has currently expanded, and what the
parser needs to parse a complete statement.

so, it is generally simpler and more effective to expand everything out
prior to passing it off to the parser, but the preprocessor need not
necessarily serialize back to text:
it is possible that the preprocessor spits out basically a large array
of tokens, and the parser works off this (as a separate pass).

note that my parsers have not usually used a separate lexing and parsing
pass, but have usually broken the source-text into tokens "mid-flight",
often using a cache (covering a range of 256 bytes) to avoid reading the
same tokens multiple times.


>>> [I've seen C compilers that keep preparsed versions of headers. Dunno
>>> what they do with #if. Also see Microsoft's C# and other .NET
>>> languages,
>>> that put all of the type info in the objects, so you can use the object
>>> as a compiled include file. -John]
>>
>> AFAIK, the preparsed/precompiled headers for C generally handle #if and
>> #ifdef and similar during the preprocessor as usual. this seems to be a
>> large part of why there are many restrictions on the use of precompiled
>> headers in those compilers which support them.
>>
>> AFAICT, languages like C# delay commands like #if or #ifdef until later
>> (and impose restrictions on how they may be used). IIRC, they are
>> generally handled at linking or at JIT.
>
> With a new language then it's much easier to arrange matters so that
> it's faster and simpler to compile. It might not even have conditional
> directives, or any preprocessor at all; (C needs them because it is a
> cruder, older language; I used to have conditional code, but no longer
> and in any case it seemed an unsatisfactory approach).


well, there is probably a reason why most newer languages have not been
designed this way, but this does not eliminate the problem of most
existing code being written in C or C++ (or the general ineffectiveness
of newer languages to really do well the things that C and C++ do well).


for example, my BGBScript language does not currently use a preprocessor
(I would not entirely oppose the idea, as there are possible uses, but
currently I don't use one, and although the language has an
"ifdef"/"ifndef" construction, it is handled in the threaded-code
translation stage).

however, the language is not itself likely a good C substitute (it
wasn't really designed to fill the same roles as C, and has much more in
common with ECMAScript, although it does borrow some more aspects of C
syntax and semantics).


but, it does have a merit:
even for "relatively large" modules (10-20 kB), it seems to have compile
times in the low-millisecond range, and with a little optimization could
(probably) be pushed into the microsecond range.

(the compiler has generally been fast enough that thus far I have mostly
always been loading modules directly from source).


I have considered ideas for C-like languages could compile much faster
(and potentially also allow running in a VM), but even if they looked
like C and were 99% code-compatible, if they don't follow the C
standards, they aren't really C.

a major idea here being for a language which would look-like and be
"mostly" code-compatible with C99, but would largely change those parts
of the language which impede fast compilation and generating
portable/verifiable bytecode (actually, a lot more modest/subtle than it
may seem, it would be more like a "fake" version of C).

basic ideas: traditional includes are replaced ("#include" would
actually function more like an "import" directive), the preprocessor is
merged into the syntax and no longer works via textual substitution,
declaration parsing would work more like in C#, ...

semantics would involve "reinterpreting" the meaning of a lot of things
(for example, pointer syntax and operations would be retained, but what
they "mean" would actually be replaced with something else, ...). in
some cases, it would likely incorporate many semantics and compilation
strategies used in ECMAScript-like languages as well.

(something like this would likely be an ugly beast from a design POV
though, and one can debate the value of a language which would look like
C but be functionally and semantically somewhat different).


sadly, I don't have a justification for the time/effort such a project
would require, as my main effort is trying to get a marketable 3D game
made, such that I can hopefully have *some* form of personal income...

Marco van de Voort

unread,
Sep 5, 2012, 4:40:43 AM9/5/12
to
On 2012-08-29, BartC <b...@freeuk.com> wrote:
>> another issue is that things like precompiled headers are non-standard,
>> and there is no real agreed-on convention for "hey, compiler, feel free
>> to use precompiled headers here".
>
> Why should how a compiler optimises its work violate the standard?
> Provided the end results are the same, the compiler can do what it
> likes.

True, but that is a complex problem, since processing a header starts
with a preprocessor state (at the point of including), then processing
it changes compiler and preprocessor state.

I'm not a C expert, but it probably depends on the state of the
compiler on inclusion too (e.g. when you can check if a type is
already defined or not)

Both the output states are dependent on (both) the input state which
in turn depend on the point of inclusion.

So precompiled headers should probably reduce the input state to the
bare essentials, and reduce the changes to the output change
accordingly.

Then one should be able to check if the reduced input state matches the one
in the precompiled header before it is loaded.

>> AFAIK, the preparsed/precompiled headers for C generally handle #if and
>> #ifdef and similar during the preprocessor as usual. this seems to be a
>> large part of why there are many restrictions on the use of precompiled
>> headers in those compilers which support them.
>>
>> AFAICT, languages like C# delay commands like #if or #ifdef until later
>> (and impose restrictions on how they may be used). IIRC, they are
>> generally handled at linking or at JIT.

Or they use a different way of importing that zaps
preprocessor/compiler state to some known global state, and doesn't
allow code before #include, removing the dependency on compiler state.
[The preprocessor conceptually happens before the parse, so
preprocessor state doesn't depend on compiler state. But #if means
that the sequence of includes affects both the preprocessor state and
the compiler state. -John]

0 new messages