I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility of adding support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.
Why I don't like mixing grammar and code: the approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, so it's intuitive to keep them in separate places.
What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and anytime you have a few hundred names that's inconvenient by itself and it is hard to make them mnemonic.
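To make the "named rules, actions off to the side" idea concrete, here is a minimal Python sketch. Everything in it (the rule names, the `GRAMMAR` and `ACTIONS` tables, the tiny backtracking parser) is made up for illustration and is not the notation of any real parser generator. The point is only that the grammar table contains no code, and that a second action table can reuse the same grammar.

```python
import re

# Grammar only: (rule name, left-hand side, right-hand side).
# Lowercase symbols are nonterminals, uppercase symbols are token kinds.
GRAMMAR = [
    ("sum_add",  "sum",  ["term", "PLUS", "sum"]),
    ("sum_term", "sum",  ["term"]),
    ("term_num", "term", ["NUM"]),
]

# Implementation details, kept off to the side and indexed by rule name.
ACTIONS = {
    "sum_add":  lambda left, _plus, right: left + right,
    "sum_term": lambda value: value,
    "term_num": lambda text: int(text),
}

# A second, independent set of actions for the same grammar: build a tree.
TREE_ACTIONS = {
    "sum_add":  lambda left, _plus, right: ("add", left, right),
    "sum_term": lambda value: value,
    "term_num": lambda text: ("num", text),
}

TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<PLUS>\+))")

def tokenize(text):
    """Split the input into (kind, text) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens

def parse(symbol, tokens, pos, actions):
    """Try each rule for `symbol` in order; return (value, next_pos) or None.

    Actions run eagerly, even on alternatives that are later abandoned,
    so in this sketch they must be free of side effects.
    """
    for name, lhs, rhs in GRAMMAR:
        if lhs != symbol:
            continue
        values, cur = [], pos
        for part in rhs:
            if part.isupper():  # terminal: must match the next token kind
                if cur < len(tokens) and tokens[cur][0] == part:
                    values.append(tokens[cur][1])
                    cur += 1
                else:
                    break
            else:  # nonterminal: recurse
                result = parse(part, tokens, cur, actions)
                if result is None:
                    break
                value, cur = result
                values.append(value)
        else:
            # Whole right-hand side matched: look the action up by rule name.
            return actions[name](*values), cur
    return None

def evaluate(text, actions=ACTIONS):
    tokens = tokenize(text)
    result = parse("sum", tokens, 0, actions)
    if result is None or result[1] != len(tokens):
        raise SyntaxError("parse failed")
    return result[0]
```

With these tables, `evaluate("1 + 2 + 3")` adds the numbers, while `evaluate("1 + 2", TREE_ACTIONS)` builds a nested tuple from the very same grammar. Even in this three-rule toy, the naming burden is visible.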
A third way is to make the tree a function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents].
One objection is that such trees are concrete, bloated and confusing; if done right, that's not true. My SO answer to the question "What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?" discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
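As a rough illustration of a tree that is purely a function of the grammar, here is a small Python sketch. It is not how DMS works internally; the two compression steps it applies (dropping tokens that carry no value, and passing the child of a single-symbol rule straight through) are my assumption about what "compressed" can mean, and the names `GRAMMAR`, `VALUE_TOKENS` and `parse` are invented for the example. Note that there are no actions anywhere.

```python
# Grammar as plain data: nonterminal -> list of alternatives.
# Lowercase symbols are nonterminals, uppercase symbols are token kinds.
GRAMMAR = {
    "sum":  [["term", "PLUS", "sum"], ["term"]],
    "term": [["NUM"]],
}

# Token kinds whose text carries information and must survive as leaves.
# Everything else (here: PLUS) is implied by the rule and is dropped.
VALUE_TOKENS = {"NUM"}

def parse(symbol, tokens, pos=0):
    """Return (tree, next_pos) for the first matching alternative, else None.

    Tree nodes are tuples: (rule label, child, child, ...).
    Leaves are the (kind, text) token pairs themselves.
    """
    for index, rhs in enumerate(GRAMMAR[symbol]):
        children, cur = [], pos
        for part in rhs:
            if part in GRAMMAR:  # nonterminal: recurse
                result = parse(part, tokens, cur)
                if result is None:
                    break
                child, cur = result
                children.append(child)
            elif cur < len(tokens) and tokens[cur][0] == part:
                if part in VALUE_TOKENS:
                    children.append(tokens[cur])
                cur += 1
            else:
                break
        else:
            # Single-symbol rule: pass the only child through, no new node.
            if len(rhs) == 1 and len(children) == 1:
                return children[0], cur
            # The node label is derived from the rule, not written by hand.
            return (f"{symbol}#{index}", *children), cur
    return None
```

For the token list `[("NUM", "1"), ("PLUS", "+"), ("NUM", "2")]`, `parse("sum", tokens)` yields one `sum#0` node with two `NUM` leaves; the `term` wrappers and the `+` token do not appear in the tree.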
The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:
By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.
DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that each (rule, associated-computation) pair has a valid rule in the base grammar, so it is relatively easy to track down what needs fixing when the base grammar changes.
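The "rule as its own name" scheme, together with the consistency check, can be sketched in a few lines of Python. This is not DMS notation: the rule spelling, the attribute strings, and the helper names `normalize` and `stale_rules` are all hypothetical, chosen only to show the shape of the check.

```python
# Base grammar: each rule is written out in full and acts as its own name.
BASE_GRAMMAR = [
    "sum = term PLUS sum",
    "sum = term",
    "term = NUM",
]

# Kept in a separate place: the rule is repeated and paired with a computation.
# The computations are opaque strings here; only the keys matter for the check.
ATTRIBUTE_RULES = {
    "sum = term PLUS sum": "sum[0].value = term.value + sum[1].value",
    "sum  =  term":        "sum.value = term.value",
    "term = NUM":          "term.value = int(NUM.text)",
}

def normalize(rule):
    """Collapse whitespace so spacing differences do not matter."""
    return " ".join(rule.split())

def stale_rules(base_grammar, associated):
    """Return associated rules that no longer exist in the base grammar."""
    known = {normalize(rule) for rule in base_grammar}
    return sorted(
        normalize(rule) for rule in associated if normalize(rule) not in known
    )
```

If the base grammar later renames `NUM` to `INT`, `stale_rules` reports `term = NUM` as the association that needs fixing, which is the kind of feedback the check is meant to give.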
First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
Secondly, I don't think you really need the parser itself to be "extensible", do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser generator can do that.
Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.
The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...
Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.
Internationalization support in software development tooling is vital for enabling efficient globalization. If there is any possibility of future collaboration with others across locales, consider internationalization from project inception. Internationalization can prevent rework or having to develop a new model design. The relevant requirement concerns locale settings.
On a computer, a locale setting defines the language (character set encoding) for the user interface and the display formats for information such as time, date, and currency. The encoding dictates the number of characters that a locale can render. For example, the US-ASCII coded character set (codeset) defines 128 characters. A Unicode codeset, such as UTF-8, defines more than 1,100,000 characters.
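The difference in repertoire size can be checked with a short Python sketch. It uses only the Python standard library and says nothing about MATLAB itself; the helper name `representable` is made up for the example, and the Unicode figure below counts code points, not assigned characters.

```python
import sys

# US-ASCII is a 7-bit code: 2**7 = 128 code positions.
ASCII_SIZE = 2 ** 7

# Unicode code space: U+0000 through U+10FFFF.
UNICODE_CODE_POINTS = sys.maxunicode + 1

def representable(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

For instance, `representable("Größe", "ascii")` is false while `representable("Größe", "utf-8")` is true, which is the situation where text ends up garbled or has to be escaped.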
For code generation, the locale setting determines the character set encoding of generated file content. To avoid garbled text or incorrectly displayed characters, the locale setting for your MATLAB session must be compatible with the setting for your compiler and operating system. For information on finding and changing the operating system setting, see Internationalization or see the operating system documentation.
The code generator replaces characters that are not represented in the character set encoding of a model with XML escape sequences. Escape sequence replacements occur for block, signal, and Stateflow object names that appear in:
2. Navigate to the Code Generation > Template pane. The model is configured to use the code generation template file MixedLanguagesAndLocales.cgt. That file adds comments to the top of generated code files. For the code generator to apply escape sequence replacements for the .cgt file, enable replacements by specifying:
You can specify customizations to generated code files by using TLC code. TLC files support user default encoding only. To produce international custom generated code that is portable, use the 7-bit ASCII character set.
2. Navigate to the Code Generation > Template pane. The model is configured to use the code customization file example_file_process.tlc. That file customizes the generated code just before the code generator writes the code files. For example, the file adds a C source file, corresponding include file, and #define and #include statements.
Use the code generation report to review the generated code. For characters that are not in the current MATLAB character set encoding, the code generator uses escape sequence replacements to render characters correctly in the code generation report.
3. Open the Traceability Report. The report maintains traceability information, even when the name contains characters that are not represented in the current encoding. Names of model elements appear in the report as replacement names in the local language.
Internationalization support in software development tooling is vital to enabling efficient globalization. If there is a remote possibility that you could collaborate in the future with others across locales, consider internationalization from project inception. Internationalization can prevent rework or having to develop a new model design. The relevant requirement concerns locale settings.
The code generator replaces characters that are not represented in the character set encoding for a model with XML escape sequences. Escape sequence replacements occur for block, signal, and Stateflow object names that appear in Comments in code generation template (CGT) files.
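The general idea of escape sequence replacement can be illustrated in Python, whose built-in `xmlcharrefreplace` error handler substitutes XML numeric character references for characters that the target encoding cannot represent. This is only an analogy: the exact escape format that the code generator emits is not stated here, and the function name `escape_for_encoding` is hypothetical.

```python
def escape_for_encoding(name, encoding="ascii"):
    """Replace characters missing from `encoding` with XML character references.

    Characters that the encoding can represent are left untouched.
    """
    encoded = name.encode(encoding, errors="xmlcharrefreplace")
    return encoded.decode(encoding)
```

With an ASCII target, `escape_for_encoding("Größe")` gives `Gr&#246;&#223;e`; with `encoding="latin-1"` the same name passes through unchanged because both characters exist in that character set.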
By default, code generation template files do not contain character set encoding information. The operating system reads the files, using its current encoding, regardless of the encoding that you use to write the file. You can enable escape sequence replacements by adding the following token at the top of the template file:
3. If you are using Embedded Coder, open the Traceability Report. The report maintains traceability information, even when the name contains characters that are not represented in the current encoding. Names of model elements appear in the report as replacement names in the local language.
My thinking was that injecting generated code into existing source code files would enhance maintainability, because it makes it obvious what is going on instead of performing some operations behind the scenes.
I have done a mid-sized Java project (ca. 300 classes) with multiple Maven modules, with my domain classes auto-generated (ca. 100 classes) from a DSL built with the superb Eclipse Xtext project, and I would do it again :)
Since my project was built with Maven, I used the Maven way to handle auto-generated code: even my Xtext DSL and generator project is handled by Maven, and every full build regenerates all of the generated code (as it should). See: building Xtext projects with Maven Tycho. So in my project I only have to do a "maven package".