well, recently I came up with an idea, but it has turned into a bit of a mess on me...
I have an existing C compiler I wrote before, and it is organized about like this: Preprocessor; Parser; Upper Compiler, converts syntax trees to RPNIL; Lower Compiler, converts RPNIL to assembler; Assembler; Linker.
so, the upper-compiler ends up being fairly sparse, mostly being a fairly dumb converter.
the output, I call RPNIL, is a vaguely Forth or PostScript like language representing the intermediate version of the language.
some of the basic types-related processing, constant folding, some branch elimination, ... has been done before any of this RPNIL code is generated (likewise, the various high level constructions have been decomposed into labels and conditional jumps).
meanwhile, the lower-compiler does much of the "heavy lifting", implementing many things I had not really originally intended (nearly the entire types tower, operator overloading, inlining, ...), and goes all the way down to machine code (when this compiler project was started, it was originally assumed that this lower end would be simple and stupid, but it ended up being one of the largest and complicated pieces of the project).
so, an idea was considered: I can split this compiler into 2 major stages: an upper-lower compiler, which would do most of the high-level processing (the it would then compile from RPNIL to some other representation); a lower-lower compiler, which would take this representation and convert it into machine code.
starting out, I had made the (incorrect) assumption that this would be an easy process or reorganizing the code, but looking at it, it is unlikely to be anywhere near this simple (the code is both inconsistently structured and heavily interconnected).
if I were to follow things how I began to split them, the lower-lower compiler (I had started calling it VRM), I would end up with an abstracted x86-ASM like language.
basically, it would differ from x86 ASM largely by glossing over many of the differences between x86 and x86-64, implementing more types, and essentially merging/orthogonalizing the SSE registers and GPRs, for example, in allowing 64 and 128 bit integers as built-in types, ... but, it would be a lot lower-level than RPNIL in that it would not automatically handle the calling conventions, abstract types (such as typed pointers, structures, arrays, ...).
as this had been imagined, it would follow the x86-style "operator dest, source" instruction format. the exact register handling would be handled by a register allocator, and register names would consist of an id and some letters giving the type.
for example (this stage of the idea): push x0f4 ;push float vector to stack div x5o, x3o ;divide 128 bit integers mul x1f4, x7f4 ;SIMD multiply ('mulps') sub x2g, x4g ;SSE subtract ('subsd') pop x3g ;pop double into x3
push r1q ;push qword in r0 mov r2q, x6q ;mov qword in x6 to r2 ...
a subsequent idea was then that I could further generalize things, and switch over to a TAC+SSA based representation: mov.i r16 3 mov.i r17 4 mov.i r18 2 mul r20 r16 r17 add r23 r18 r20
which would allow some more "interesting" features, and can be fairly easily converted into the two-register form internally by using dependency analysis (we can take note that, if a value is never used again, we will not see it again, thus a "true" three-address form is rare, and will involve duplicating the register).
in the existing RPN-compiler, much of this dependency tracking is an implicit part of the stack machinery, since when we pop two items from the stack, we can safely assume we will not see them again (thus, dup or similar is needed if we want to keep the value around). internally, a two-oprand register-based approach took hold, primarily because this was much easier to work with (the stack serving, in large part, to push and pop abstract literals and registers).
but, it seems a little silly to have an RPN-based language that compiles into SSA based language.
may as well directly compile from syntax trees into SSA, as this would seem to make more sense. however, doing this would end up with me rewriting damn near the entire compiler (as opposed to just a decent chunk of it...).
technically, it would be a lot easier to stick with a two-operand form (similar to x86), and sticking with RPNIL as the main IL (even simpler, is to implement this lower portion more as a mass of library code that is directly driven by the RPNIL compiler, which is a lot closer to the original plan, and what I had originally started doing before all this came up).
however, TAC and SSA may make sense if they may later allow a lot more features and optimizations (for example, constructions being, for whatever reason, either not possible or extremely awkward with RPN).
or, I can do the basic library (serving as a codegen), and possibly later write an SSA style frontend (and rework the upper compiler to target it directly).
> well, recently I came up with an idea, but it has turned into a bit of a > mess on me...
> I have an existing C compiler I wrote before, and it is organized about like > this: > Preprocessor; > Parser; > Upper Compiler, converts syntax trees to RPNIL; > Lower Compiler, converts RPNIL to assembler; > Assembler; > Linker.
[wants/needs restructuring...., but very far along in project]
> anyone have any thoughts?...
Obviously, only you really know your code and the complexities of changing it... But, at this point, I'd think you'd want to preserve RPNIL and perhaps even concentrate it. Then, add SSA to allow you to select between the two or gradually migrate away from RPNIL.
First, I think you should rework the compiler so that the upper compiler _only_ "converts syntax trees to RPNIL." Second, migrate the upper compiler's "easy-lifting" i.e., "basic types-related processing, constant folding, some branch elimination", into the "heavy-lifting" area of the lower compiler as RPNIL. Third, now that all processing is done in RPNIL, insert an SSA processing layer between the "heavy-lifting" area in RPNIL and the RPNIL to assembly.
I.e.,
(purified) Upper Compiler, *only* converts syntax trees to RPNIL (modified, split) Lower Compiler part A, "heavy-lifting and merged upper compiler easy-lifting" in RPNIL (new) convert RPNIL to SSA (new) perform SSA only transforms (new) convert SSA to RPNIL (kept, split) Lower Compiler part B, converts RPNIL to assembler
Now that you have both layers, if you want to gradually eliminate RPNIL, you can. Or, you have your choice of which is better.
> "cr88192" <cr88...@hotmail.com> wrote in message > news:c05b0$47bfeebb$ca83b482$9782@saipan.com... >> well, recently I came up with an idea, but it has turned into a bit of a >> mess on me...
>> I have an existing C compiler I wrote before, and it is organized about > like >> this: >> Preprocessor; >> Parser; >> Upper Compiler, converts syntax trees to RPNIL; >> Lower Compiler, converts RPNIL to assembler; >> Assembler; >> Linker.
> [wants/needs restructuring...., but very far along in project] >> anyone have any thoughts?...
> Obviously, only you really know your code and the complexities of changing > it... But, at this point, I'd think you'd want to preserve RPNIL and > perhaps even concentrate it. Then, add SSA to allow you to select between > the two or gradually migrate away from RPNIL.
ok.
> First, I think you should rework the compiler so that the upper compiler > _only_ "converts syntax trees to RPNIL."
this is what it mostly does (none of the logic from the lower compiler exists in the upper compiler, and the only real communication between them is spewing everything into a buffer and passing it along).
originally, I had intended it to do more, making the compiler more retargetable by easily swapping out the lower end of the compiler, but the lower end ended up being large and complicated enough that it is no longer reasonable...
> Second, migrate the upper compiler's "easy-lifting" i.e., "basic > types-related processing, constant folding, some branch elimination", into > the "heavy-lifting" area of the lower compiler as RPNIL.
not terribly reasonable, as a certain amount has to be done in the upper compiler just to make the language hold together (C and similar are annoying in this way).
some of the constructions (such as destructuring 'if' statements and loops, can't really be sent much lower, since these will not map nicely to RPN anyways...).
if/when I implement C++ or similar, likely a bit more will need to happen in said upper compiler...
> Third, now that all processing is done in RPNIL, insert an SSA processing > layer between the "heavy-lifting" area in RPNIL and the RPNIL to assembly.
> I.e.,
> (purified) Upper Compiler, *only* converts syntax trees to RPNIL > (modified, split) Lower Compiler part A, "heavy-lifting and merged upper > compiler easy-lifting" in RPNIL > (new) convert RPNIL to SSA > (new) perform SSA only transforms > (new) convert SSA to RPNIL > (kept, split) Lower Compiler part B, converts RPNIL to assembler
I am not sure what point there would be in going twice through RPNIL, since it is high level enough that going back to it would probably eliminate most of what could be done by going through SSA (RPNIL is at a similar level of abstraction to the likes of MSIL/CIL, only that the compiler recieves textual input rather than bytecode).
RPNIL was originally intended as a "general purpose" output for compilers anyways, me reasoning that RPN was a good solid model, and easy to target (I figured SSA would be much harder to target since one needs to keep track of names and similar).
so, yeah, RPN has a clear advantage in this respect...
> Now that you have both layers, if you want to gradually eliminate RPNIL, > you > can. Or, you have your choice of which is better.
probably, it will go like this: Syntax Trees (currently implemented as XML DOM trees) to RPNIL; RPNIL to SSA; further SSA based processing; SSA to ASM.
as such, I could probably loosen up the SSA handling, making SSA be the intermediate representation between these stages. then the question would be one of using a textual or binary representation between stages (I am more inclined towards textual...)
probable syntax: likely line-oriented, and based on the "<opr> <args*>" layout (very easy to parse); types will be sticky, so registers will remember what they are (vs endlessly having to restate them); any conversions will need to be explicit, and operator types will need to agree; ...
most forms with a target will look like: <opr> <dest> <args*>
"cr88192" <cr88...@hotmail.com> wrote: > well, recently I came up with an idea, but it has turned into a bit of a > mess on me...
> I have an existing C compiler I wrote before, and it is organized about like > this: > Preprocessor; > Parser; > Upper Compiler, converts syntax trees to RPNIL; > Lower Compiler, converts RPNIL to assembler; > Assembler; > Linker.
> so, the upper-compiler ends up being fairly sparse, mostly being a fairly > dumb converter.
> the output, I call RPNIL, is a vaguely Forth or PostScript like language > representing the intermediate version of the language.
> [lots of info removed]
> or, I can do the basic library (serving as a codegen), and possibly later > write an SSA style frontend (and rework the upper compiler to target it > directly).
> anyone have any thoughts?...
Many people have thoughts, but I am not sure what you are looking for. What I miss in your description is what exactly you want to achieve. Possible answers could be:
- getting to learn how to write a 'real' compiler chain - getting a better C compiler - getting a faster C compiler - fixing known code-generation problems in your existing code
- ideas about organizing your code - ideas about using LLVM as your intermediate SSA language - ideas about ditching your code, and start contributing to LLVM
> In article <c05b0$47bfeebb$ca83b482$9...@saipan.com>, > "cr88192" <cr88...@hotmail.com> wrote:
>> well, recently I came up with an idea, but it has turned into a bit of a >> mess on me...
>> I have an existing C compiler I wrote before, and it is organized about >> like >> this: >> Preprocessor; >> Parser; >> Upper Compiler, converts syntax trees to RPNIL; >> Lower Compiler, converts RPNIL to assembler; >> Assembler; >> Linker.
>> so, the upper-compiler ends up being fairly sparse, mostly being a fairly >> dumb converter.
>> the output, I call RPNIL, is a vaguely Forth or PostScript like language >> representing the intermediate version of the language.
>> [lots of info removed]
>> or, I can do the basic library (serving as a codegen), and possibly later >> write an SSA style frontend (and rework the upper compiler to target it >> directly).
>> anyone have any thoughts?...
> Many people have thoughts, but I am not sure what you are looking for. > What I miss in your description is what exactly you want to achieve. > Possible answers could be:
> - getting to learn how to write a 'real' compiler chain > - getting a better C compiler > - getting a faster C compiler > - fixing known code-generation problems in your existing code
actually, the goal is for dynamic compilation and blurring the line between compile time and runtime (basically, using C like we would many script languages such as Python or Scheme, but still retaining the power and performance of C).
for example, if I had lexical closures and eval in C (both eventual planned features), then that would be worthwhile.
another recent goal/feature was the addition of machinery to aid with Persistent Storage, DSM (Distributed Shared Memory), and code possible mobility (AKA: watch as lexically-scoped code objects migrate over the network).
> - ideas about organizing your code > - ideas about using LLVM as your intermediate SSA language > - ideas about ditching your code, and start contributing to LLVM
I already know about LLVM, and though an interesting project, I don't entirely agree with it in some ways, nor do I really believe everyone needs to settle on the same implementation.
I also might have acted differently had I known about clang to begin with, but once I learned about it, I figured, "well, I already have my own implementation"...
also, it is written in C++, and follows an uber-centralized class-heirarchy based design (aka: 'OOP'), which I personally find disagreeable as well (I prefer a modular aproach, and not an inheritence-based approach).
after all, it is good if I can rip out and replace parts of a project as needed, rather than having to view it as some integrated whole...
having a single integrated lower compiler hinders this, in making it to where I can't really "swap out" things for other things...
to what are my ends, I also feel I better achieve my goals than LLVM does for me (after all, I already have a decently working framework in < 1 year).
actually, I had also vaguely considered using the LLVM IR language as the SSA syntax, but for the time being figured it would be too much of a hassle to parse (allowing the low-level code generator to be swapped out could be a useful feature though, esp since LLVM supports more targets than just x86 and x86-64, ...).
of course, right now I have some doubts as such that adding SSA will make the code any simpler though (one of the original considerations for considering splitting the upper and lower compilers). rather, it has merits in its own right, but these merits are not necessarily simplicity...
my lower compiler is as such a 30 kloc mass of highly-interconnected code, and a 10/20 or 15/15 split would be nice, but more likely, such a split (and addition of SSA) would turn out as 20/25 or similar...
it may also make sense to have multiple possible backends around as well (for example, I had once considered the possibility of a CPS-based RPNIL backend, ...).
"cr88192" <cr88...@hotmail.com> writes: > actually, the goal is for dynamic compilation and blurring the line > between compile time and runtime (basically, using C like we would > many script languages such as Python or Scheme, but still retaining > the power and performance of C).
Didn't Francois Bellard write tcc for that purpose?
Phil -- Dear aunt, let's set so double the killer delete select all. -- Microsoft voice recognition live demonstration
> "cr88192" <cr88...@hotmail.com> writes: >> actually, the goal is for dynamic compilation and blurring the line >> between compile time and runtime (basically, using C like we would >> many script languages such as Python or Scheme, but still retaining >> the power and performance of C).
> Didn't Francois Bellard write tcc for that purpose?
I looked at tcc as well...
well, ok, I use a good deal of NIH practices (apparently almost my official practice...).
one can learn things, have a good time, doing it oneself...
after all, one may not have much else of value in their lives...
> well, recently I came up with an idea, but it has turned into a bit of a > mess on me...
> I have an existing C compiler I wrote before, and it is organized about like > this: > Preprocessor; > Parser; > Upper Compiler, converts syntax trees to RPNIL; > Lower Compiler, converts RPNIL to assembler; > Assembler; > Linker.
> so, the upper-compiler ends up being fairly sparse, mostly being a fairly > dumb converter.
> the output, I call RPNIL, is a vaguely Forth or PostScript like language > representing the intermediate version of the language.
> some of the basic types-related processing, constant folding, some branch > elimination, ... has been done before any of this RPNIL code is generated > (likewise, the various high level constructions have been decomposed into > labels and conditional jumps).
> meanwhile, the lower-compiler does much of the "heavy lifting", implementing > many things I had not really originally intended (nearly the entire types > tower, operator overloading, inlining, ...), and goes all the way down to > machine code (when this compiler project was started, it was originally > assumed that this lower end would be simple and stupid, but it ended up > being one of the largest and complicated pieces of the project).
The front-end/back-end split is usually such that there are no lexical, syntactic or semantic issues left for the back end to deal with. Almost universally it's considered a good idea to begin code generation with no ambiguities left in the source (obviously excluding anything that's defined to be ambiguous until runtime). If you've got a significant amount of that sort of stuff left in your back end, you may want to consider getting that moved up first.
> The front-end/back-end split is usually such that there are no > lexical, syntactic or semantic issues left for the back end to deal > with. Almost universally it's considered a good idea to begin code > generation with no ambiguities left in the source (obviously excluding > anything that's defined to be ambiguous until runtime). If you've got > a significant amount of that sort of stuff left in your back end, you > may want to consider getting that moved up first.
there are no lexical or syntactic issues. but, however, it does handle most of the semantics.
the reason is that I had started out with an abstract stack machine (basically, vaguely PostScript style, where we have a stack and operations that work on pretty much anything they are given), and during development, ever more things were done as the codegen became more developed.
now, the thing is very complicated...
the idea is then to split this level up, so that the lower level deals almost purely with architectural issues (register allocation and operations, ...), and the RPNIL compiler becomes more the "middle end" rather than the "lower end".
so, the upper compiler: does preprocessing and parsing; does a some amount of AST level operations (folding constant expressions, eliminating dead branches, ...); converts ASTs into a more or less stack-based representation.
the 'RPNIL' compiler: deals with said abstract stack-based computational model; decomposes complex types and operations into basic architectural types (raw pointers, memory, and register operations); also handles things like type conversions, function calls, the type tower, operator overloading, and inlining; ...
the 'VRM' layer: does things like register allocation, figures out good ways to perform register/memory and memory/memory operations; emits assembler.
assembler and so on: ...
partly all this is because, originally, my compiler mutated out of an interpreter for a dynamically typed scripting language of mine.
as such, everything RPNIL or lower, was originally handled in what was originally the interpreter or JIT compiler (the major change was that of going over to using a textual representation rather than raw bytecode, originally due to issues that would have been involved trying to hack the two layers together at the bytecode level due to representational differences, it was much easier just to dump the output into a textual buffer and re-parse in terms of the RPNIL compiler's bytecode...).
the upper compiler was originally a severely hacked-over version of the script language's upper compiler (and, thus, operates at a similar level of abstraction, which may not be perfectly well suited to a language like C, but oh well...).
later on, this upper compiler was rewritten to use DOM trees instead of S-Expressions (and be, in general, a little less hackish), but otherwise does not do much differently than before.