This Engineering Notebook post discusses improving Leo's importers for difficult-to-parse languages such as c++ and javascript. Issue #3327 has become urgent now that I have begun to study codon!
tl;dr: Aha: use helper lines to guide analysis.
Background
Leo's importers have a long history. We are on something like the fifth iteration of their design. Each iteration has been a step forward, but Leo's c++ and javascript importers need more work.
Definitions of c++ functions or methods may be arbitrarily complex. For example, processSource in codon/codon/app/main.cpp starts this way:
std::unique_ptr<codon::Compiler> processSource(
    const std::vector<const char *> &args, bool standalone,
    std::function<bool()> pyExtension = [] { return false; }) {
Note how the body of the default lambda, { return false; }, appears inside the parameter list!
Aside: I wonder whether codon generated this file! It's certainly difficult to read: everything is over-qualified.
The problem
The importer must split lines into nodes. Every line must appear in exactly one generated node. The bodies of the resulting nodes must tile the original file.
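The tiling requirement can be stated as a small check. This is only a sketch, not Leo's code: the function name tiles is hypothetical, and it assumes the generated nodes are available as a flat list of bodies in file order, each body being a list of lines.

```python
def tiles(original_lines, node_bodies):
    """Return True if the node bodies, taken in order, reproduce the file exactly.

    original_lines: the lines of the imported file.
    node_bodies: a list of bodies, each body a list of lines (assumed flat, in file order).
    """
    # Concatenate all bodies; every original line must appear exactly once, in order.
    joined = [line for body in node_bodies for line in body]
    return joined == original_lines


lines = ["int a;\n", "void f() {\n", "}\n"]
print(tiles(lines, [["int a;\n"], ["void f() {\n", "}\n"]]))  # all lines covered
print(tiles(lines, [["int a;\n"], ["}\n"]]))  # a line is missing
```

A check like this is cheap to run after every import, whatever strategy the importer uses to choose the split points.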
Handling the file line-by-line ensures that the generated nodes tile the file. However, a line-oriented approach complicates analysis. I'll omit most of the details.
Leo's importers tokenize the file so that strings and comments do not confuse the analysis. Alas, handling tokens creates other complications. What are we to do?
Aha! Let's use helper lines to simplify the analysis. We'll create the helper lines as follows:
- Start with the lines from the original file.
- Remove comments and strings.
- Remove curly brackets associated with 'if', 'for', and 'while' statements.
- Check the result to ensure that parens and brackets are properly nested.
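Here is a minimal sketch of the second and fourth steps, in Python. It is an illustration under stated assumptions, not Leo's actual importer code: make_helper_lines and check_nesting are hypothetical names, the input lines have no trailing newlines, and c++ raw string literals and digit separators such as 1'000 are not handled. The third step (removing the curly brackets of 'if', 'for', and 'while' statements) is omitted. Comments and strings are replaced by blanks rather than deleted, so helper line i always corresponds to original line i.

```python
def make_helper_lines(lines):
    """Return a copy of lines with comments and string/char literals blanked out."""
    helper = []
    in_block_comment = False  # True while inside /* ... */, possibly across lines.
    for line in lines:
        out = []
        i = 0
        while i < len(line):
            if in_block_comment:
                if line.startswith('*/', i):
                    in_block_comment = False
                    out.append('  ')
                    i += 2
                else:
                    out.append(' ')
                    i += 1
            elif line.startswith('//', i):
                # Line comment: blank the rest of the line.
                out.append(' ' * (len(line) - i))
                break
            elif line.startswith('/*', i):
                in_block_comment = True
                out.append('  ')
                i += 2
            elif line[i] in '"\'':
                # String or char literal: skip to the closing quote, honoring backslash escapes.
                quote = line[i]
                j = i + 1
                while j < len(line) and line[j] != quote:
                    j += 2 if line[j] == '\\' else 1
                j = min(j + 1, len(line))
                out.append(' ' * (j - i))
                i = j
            else:
                out.append(line[i])
                i += 1
        helper.append(''.join(out))
    return helper


def check_nesting(helper_lines):
    """Return True if parens, square brackets and curly brackets are properly nested."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for line in helper_lines:
        for ch in line:
            if ch in '([{':
                stack.append(ch)
            elif ch in pairs:
                if not stack or stack.pop() != pairs[ch]:
                    return False
    return not stack


source = [
    'std::unique_ptr<codon::Compiler> processSource(',
    '    const std::vector<const char *> &args, bool standalone,',
    '    std::function<bool()> pyExtension = [] { return false; }) {',
    '  auto s = "}"; // an unbalanced } in a string and in a comment',
    '}',
]
helper = make_helper_lines(source)
print(check_nesting(source), check_nesting(helper))
```

The fourth line of the example is invented to show why the stripping matters: the raw lines fail the nesting check because of the curly brackets in the string and comment, while the helper lines pass it.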
The resulting lines will be much easier to analyze. The importer can assume that any remaining top-level curly brackets start the body of a class, function, or struct. The tiling problem remains challenging but tractable.
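As a rough sketch of that last assumption, the following Python function reports which helper lines contain a curly bracket at nesting depth zero. The name top_level_brace_lines is hypothetical, "top-level" is interpreted here as depth zero over parens, square brackets and curly brackets combined, and the input is assumed to be helper lines that have already passed the nesting check.

```python
def top_level_brace_lines(helper_lines):
    """Return the indices of helper lines that contain a '{' at nesting depth zero."""
    found = []
    depth = 0  # Combined nesting depth of (), [] and {}.
    for i, line in enumerate(helper_lines):
        for ch in line:
            if ch in '([{':
                if ch == '{' and depth == 0:
                    found.append(i)
                depth += 1
            elif ch in ')]}':
                depth -= 1
    return found


helper = [
    'std::unique_ptr<codon::Compiler> processSource(',
    '    const std::vector<const char *> &args, bool standalone,',
    '    std::function<bool()> pyExtension = [] { return false; }) {',
    '  return nullptr;',
    '}',
]
print(top_level_brace_lines(helper))
```

In this example the curly brackets of the lambda are ignored because they sit inside the parameter list's parens. Only the final curly bracket on the third line is reported as the start of a body. Finding methods inside a class body would require applying the same scan one level down.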
Summary
I plan to rewrite the c++ importer as suggested above. Helper lines will likely eliminate the need for the usual tokenizer and state stack.
Edward