When I was trying to build it before, I had tried some customized
configuration strings, but these didn't build.
When I looked around (partly via grep), I found lists of basically
every possible target; RISC-V deals with configuration by defining a
separate target for nearly every combination of features (well, except
the combinations I wanted in this case).
I gave in at the time and configured it for RV64I, but now could
probably go for RV64IM.
Kind of annoying that they build copies of GCC one-off for each target,
vs say as a collection of optional modules for different types of
targets (say, if GCC were configured more like how one configures the
Linux kernel).
Then, say, one edits a runtime config file or uses command-line options
to configure the backend for the specific target (with premade config
files for more well-known target machines).
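As a sketch of what such a runtime config might look like (a
hypothetical format, loosely in the style of a kernel ".config"; all
option names here are invented for illustration):

```ini
# hypothetical per-target config; not an actual GCC or BGBCC format
TARGET=riscv64
TARGET_PROFILE=RV64IM
EXT_M=y        # integer multiply/divide
EXT_A=n        # atomics
EXT_FD=n       # float/double
ABI=lp64
```

The point being that one binary could then serve many feature
combinations, rather than one build per combination.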
An alternate approach would be, say, the compiler is configured with
FOURCC codes for major frontends and backends (source languages and
target machines; with secondary FOURCCs for language variants, target
baseline profiles, and an array of "f options", ...), with the compiler
being able to load in DLL's or SO's for major targets and components
(say, using a system similar to how A/V codecs work on Windows).
One then building the combination of front-end and backend modules which
are useful to them, or with 3rd party modules usable as plug-ins, ...
Some other shared components could potentially also be split off into
sub-libraries (say, ELF and PE/COFF handling, which likely need only
minor customization on a per-target basis).
In this case, the core part of the compiler would mostly deal with file
management, typesystem management, and IR stages.
Well, it would also be better if one were not launching and terminating
the compiler for every translation unit, ...
Though, I guess one could argue that responsibilities are not as cleanly
divided with a compiler as they are with A/V codecs, say:
Backend choice may affect preprocessor defines in the frontend;
Backend type will affect the behavior of the typesystem;
...
Some of this can still be handled, say:
Backend has an interface to register defines with the frontend;
Various context parameters for configuring the typesystem behavior
(sizeof(long), sizeof(void*), ...).
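Concretely, those two escape hatches might look something like the
following sketch in plain C (all names are invented for illustration;
this is not BGBCC's actual interface):

```c
/* Context parameters the backend hands to the typesystem. */
typedef struct {
    int sizeof_long;   /* e.g. 8 for an LP64 target */
    int sizeof_ptr;    /* pointer width seen by the typesystem */
} TypeSysCtx;

/* Minimal preprocessor define table the backend can register into. */
#define PP_MAX_DEFS 64
typedef struct {
    const char *names[PP_MAX_DEFS];
    const char *values[PP_MAX_DEFS];
    int count;
} PreProcCtx;

/* Interface the backend uses to register defines with the frontend. */
static void PP_RegisterDefine(PreProcCtx *pp, const char *name, const char *value) {
    if (pp->count < PP_MAX_DEFS) {
        pp->names[pp->count] = name;
        pp->values[pp->count] = value;
        pp->count++;
    }
}

/* A hypothetical RV64IM backend configuring both contexts. */
static void RV64IM_Configure(TypeSysCtx *ts, PreProcCtx *pp) {
    ts->sizeof_long = 8;
    ts->sizeof_ptr = 8;
    PP_RegisterDefine(pp, "__riscv", "1");
    PP_RegisterDefine(pp, "__riscv_xlen", "64");
    PP_RegisterDefine(pp, "__riscv_mul", "1");  /* the 'M' extension */
}
```

The frontend and typesystem then consult these contexts rather than
hard-coding per-target knowledge.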
Backend interface basically consists of a way for querying for support
for a given target, and then creating an instance of that target
(configuring the primary and secondary contexts).
Frontend is similarly about creating a translation instance that can
parse a given source-language and turn it into the AST format, and
another semi-shared stage which can translate ASTs into the IR format.
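As a sketch of that query/create pattern in plain C (all names invented;
a design illustration rather than any actual BGBCC interface):

```c
#include <stdint.h>
#include <stddef.h>

/* Little-endian FOURCC: first character in the low byte. */
#define FOURCC(a,b,c,d) \
    ((uint32_t)(uint8_t)(a) | ((uint32_t)(uint8_t)(b)<<8) | \
     ((uint32_t)(uint8_t)(c)<<16) | ((uint32_t)(uint8_t)(d)<<24))

/* Per-target code generator context (fields invented). */
typedef struct {
    uint32_t target;
    int sizeof_ptr;
} CodeGenCtx;

/* The interface each backend module exposes to the driver. */
typedef struct {
    int (*QueryTarget)(uint32_t target_fcc);          /* "can you do this target?" */
    CodeGenCtx *(*CreateTarget)(uint32_t target_fcc); /* configured instance, or NULL */
} BackendModule;

/* Example module claiming the 'RV64' target. */
static int RV64_QueryTarget(uint32_t fcc) {
    return fcc == FOURCC('R','V','6','4');
}

static CodeGenCtx rv64_ctx;
static CodeGenCtx *RV64_CreateTarget(uint32_t fcc) {
    if (!RV64_QueryTarget(fcc)) return NULL;
    rv64_ctx.target = fcc;
    rv64_ctx.sizeof_ptr = 8;
    return &rv64_ctx;
}

static const BackendModule rv64_module = { RV64_QueryTarget, RV64_CreateTarget };

/* Driver side: first module that claims the target wins. */
static CodeGenCtx *FindBackend(const BackendModule *const *mods, int n, uint32_t fcc) {
    for (int i = 0; i < n; i++)
        if (mods[i]->QueryTarget(fcc))
            return mods[i]->CreateTarget(fcc);
    return NULL;
}
```

With dynamically loaded modules, the `BackendModule` table would be
whatever each DLL/SO exports at registration time.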
In BGBCC, the language frontend turns the source code into an XML
parse-tree, which is then converted into the stack-machine IR. The
middle section mostly communicates with the backend in terms of a 3AC
IR, so the stack bytecode is mostly limited to the middle stage.
It is possible one could make a case for splitting the backend into
separate code-generation and assembler phases, but the current backends
for BGBCC don't do this (my early x86 backend did have a separate
assembler, but this backend was dropped long ago).
Say, stages, if things were more cleanly modularized:
Language Frontend (Source -> AST)
Preprocessor (Semi Shared Module)
Parser
Upper-Middle (Shared, AST->IL)
Lower-Middle (Shared, IL->3AC)
Backend (Codegen)
3AC -> ASM
Assembler (Configured by Backend)
ASM -> ObjectModule
Linker (Configured by Backend)
ObjectModule -> Binary
In the current BGBCC, there is only a single middle stage, and the
backend also handles the Assembler and Linker parts.
Though, it could make sense if these could potentially be separate
modules configured by the backend (with an API interface), which may or
may not be part of the same DLL as the backend.
Idea here is that ObjectModule is not a serialized object (COFF or ELF
format), but instead an in-memory structure representing the contents of
such an object (section buffers, symbol and reloc tables, etc), with the
option of being flattened to or parsed from an object file (Eg, COFF).
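A sketch of what such an in-memory ObjectModule might look like in C
(all field and function names invented for illustration; BGBCC's actual
structures will differ):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* In-memory object module: roughly the information a COFF or ELF
   object would carry, kept unserialized. */

typedef struct {
    const char *name;      /* ".text", ".data", ... */
    uint8_t    *data;      /* section contents */
    size_t      size;
    uint32_t    align;
} ObjSection;

typedef struct {
    const char *name;
    int         section;   /* index into sections[], -1 if undefined */
    uint64_t    offset;
} ObjSymbol;

typedef struct {
    int      section;      /* section being patched */
    uint64_t offset;       /* where in that section */
    int      symbol;       /* index into symbols[] */
    int      kind;         /* target-specific reloc type */
} ObjReloc;

typedef struct {
    ObjSection *sections;  int n_sections;
    ObjSymbol  *symbols;   int n_symbols;
    ObjReloc   *relocs;    int n_relocs;
} ObjectModule;

/* Look up a symbol by name; -1 if absent. */
static int ObjModule_FindSym(const ObjectModule *m, const char *name) {
    for (int i = 0; i < m->n_symbols; i++)
        if (strcmp(m->symbols[i].name, name) == 0)
            return i;
    return -1;
}
```

The flatten/parse step would then be a pair of functions converting
between this structure and a COFF or ELF byte image.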
Note, for example, passing ASM to the compiler might go something like:
Frontend invokes preprocessor;
Gets turned into something like:
<module ...>
<body>
<asm>
<![CDATA[ big blob of ASM code ]]>
</asm>
</body>
</module>
Whether or not using XML for ASTs makes sense, it is basically "what I
have". Some of my other stuff had used CONS-lists and JSON-style
dictionary-objects instead, but BGBCC was forked off of an early VM of
mine which had used XML here (even if arguably a JSON style approach
better fits the AST use-case than using XML here; CONS lists have a
different problem in that by the time they become "usefully flexible",
one is basically using them to awkwardly mimic the behavior of
dictionary objects).
Then, in the stack IL, it essentially gets passed along as a big text blob.
In the 3AC IR, it turns back into an 'ASM' object, with the ASM still
represented as a big string literal.
(Then again, can also note that in some stages of my compiler, large
numeric types are also passed as string literals).
At present, the backend sees the ASM object and then passes the string
literal to an ASM parser. A backend with a separate ASM stage might
instead emit the ASM blob to the output (then leave it for the assembler
module to deal with).
Configuring modules would generally use an interface similar to the
"DriverProc" style interface used in codec drivers, but with configured
modules typically accessed via VTable pointers.
Eg:
(*CGenCtx)->DoSomething(CGenCtx, ...);
...
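For reference, the pointer-to-pointer-to-vtable convention above can be
sketched like so (a minimal illustration; names invented):

```c
/* The context handle is a pointer to a pointer to a vtable (COM style),
   so the caller needs no knowledge of the module's internal layout. */
typedef struct CGenVt_s CGenVt;
typedef CGenVt **CGenCtx_t;

struct CGenVt_s {
    int (*DoSomething)(CGenCtx_t self, int arg);
};

/* A module implementing the interface. */
static int My_DoSomething(CGenCtx_t self, int arg) {
    (void)self;   /* a real module would reach its state via 'self' */
    return arg * 2;
}

static CGenVt my_vt = { My_DoSomething };
static CGenVt *my_instance = &my_vt;  /* "object": points at its vtable */

/* Caller side, matching the (*ctx)->Method(ctx, ...) idiom. */
static int CallDemo(void) {
    CGenCtx_t ctx = &my_instance;
    return (*ctx)->DoSomething(ctx, 21);
}
```

A real instance would put the vtable pointer as the first member of a
larger state struct, which `self` can then be cast back to.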
It might make sense to keep the assembler and linker stages managed by
the backend rather than managed by the main compiler stage.
As-is, the main stage just sort of expects the backend to produce output
in the requested format (itself also given as FOURCC values).
However, in these "modern times", it might make sense to consider
expanding the use of FOURCCs to EIGHTCCs (allowing a FOURCC to be cast
to an EIGHTCC and still be interpreted in a sensible way).
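One way to make that cast behave sensibly is to pack both codes with
the first character in the low byte and NUL padding above; then
widening a FOURCC is just a zero-extension. A minimal sketch (macro
names invented):

```c
#include <stdint.h>

/* FOURCC: four characters packed into 32 bits, first character in the
   low-order byte. */
#define FOURCC(a,b,c,d) \
    ((uint32_t)(uint8_t)(a) | ((uint32_t)(uint8_t)(b)<<8) | \
     ((uint32_t)(uint8_t)(c)<<16) | ((uint32_t)(uint8_t)(d)<<24))

/* EIGHTCC: the same scheme widened to 64 bits; shorter names leave the
   high bytes as NUL padding. */
#define EIGHTCC(a,b,c,d,e,f,g,h) \
    ((uint64_t)FOURCC(a,b,c,d) | ((uint64_t)FOURCC(e,f,g,h)<<32))

/* With this layout, casting a FOURCC to an EIGHTCC (a zero-extension)
   yields the same name padded with NULs, so old codes stay valid. */
static uint64_t fourcc_to_eightcc(uint32_t fcc) {
    return (uint64_t)fcc;
}
```

Comparisons and switch dispatch then work the same on both widths.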
...
Though, one other property I think is relevant to a compiler:
Not needing absurd amounts of RAM or hour+ build times;
This is basically a big "fail" for LLVM/Clang IMO.
This, and also my disfavor of "Java style" C++, is part of why I went
with reusing my old/crufty C compiler (BGBCC) as a base, rather than
jump over to Clang.
Better IMO if the whole compiler is written in "Ye Olde C" (nevermind
occasional cruft possibly following COM/OLE style conventions or
similar; but this stuff is at least "sort of usable").
...