Seeking Ideas to Improve Performance of ANTLR Grammar for our Open Source Language, ONE

140 views
Skip to first unread message

Norbert Leitl

unread,
Jun 13, 2024, 8:20:21 AMJun 13
to antlr-discussion

Hello everyone,

We are excited to announce that we are developing our own programming language named ONE, based on ANTLR and similar syntax to C#. In the near future, we plan to make the language open-source, providing the community with the opportunity to contribute and collaborate.

ONE introduces some deviations from C#, such as:

  • Reversed type declaration or type suffix notation, similar to Go.
  • Replacing brackets with indentation, similar to Python.
  • Adding sections for modifiers.
  • Additional new language elements.
To illustrate, here is a simple ONE code example:
namespace DemoNamespace:

    import System
    import System.Collections.Generic

    class SimpleClass:

        private readonly:

            _identifier string = "ONE"
            _numbers List<int> = []

        public:

            SimpleClass(identifier string, numbers List<int>):

                _identifier = identifier
                _numbers = numbers

            Print():

                Console.WriteLine($"Identifier: {_identifier}")

                index int = 0
                foreach number var in _numbers:

                    index++
                    Console.WriteLine($"{index}. number: {number}")


And here the transpiled C# code:
namespace DemoNamespace
{
    using System;
    using System.Collections.Generic;

    class SimpleClass
    {
        private readonly string _identifier = "ONE";
        private readonly List<int> _numbers = [];
        public SimpleClass(string identifier, List<int> numbers)
        {
            _identifier = identifier;
            _numbers = numbers;
        }

        public void Print()
        {
            Console.WriteLine($"Identifier: {_identifier}");
            int index = 0;
            foreach (var number in _numbers)
            {
                index++;
                Console.WriteLine($"{index}. number: {number}");
            }
        }
    }
}

The purpose of ONE in a nutshell:

ONE is designed to be the programming language for a browser-based flow editor, aimed at creating data flows with hardware devices and modules represented by graphical elements, similar to Node-RED. With ONE, you can write data flows also using a text-based programming language, complementing the graphical programming concept. ONE includes the complete C# language specification and extends it with syntax constructs for state machines, truth tables, and other elements. The engine of the data flow editor is written in C#, so the ONE code is transpiled to C# and then executed by the engine.

Current state of the project:

  • A working ANTLR grammar that represents the C# 12 language specification.
  • A C# transpiler written in C# that converts the ANTLR tree to a Roslyn CompilationUnitSyntax, representing the C# syntax tree.
  • A language server written in C# for ONE and a working Visual Studio Code extension.

We picked an ANTLR C# grammar from the Roslyn page: https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Generated/CSharp.Generated.g4. We are aware that this ANTLR grammar is not designed for immediate use, but the major advantage is that it fully represents the complete Roslyn syntax structure. So we adapted it to generate a working ANTLR Lexer and Parser. As the basis for our lexer, we chose: https://github.com/antlr/grammars-v4/blob/master/csharp/CSharpLexer.g4. The logic for Python-like indentation was mainly taken from: https://github.com/antlr/grammars-v4/tree/master/python/python3.

Seeking Solutions for ANTLR Performance Issues:

The ANTLR parser works well and the grammar covers all C# syntax elements. However, with large files, the performance of the ANTLR parser is not particularly good. We believe improving performance will require restructuring the grammar.

We have read Gabriele Tomassetti's great and very helpful article  https://tomassetti.me/improving-the-performance-of-an-antlr-parser/ and tried implementing some optimizations. However, after profiling the ANTLR parser, it seems the most time-consuming rules are still the left-recursive ones like type or expression. Rewriting the expression rule into a cascade of ever more precise expressions had the opposite effect and the performance actually got even worse.

Does anyone have ideas or experience in making ANTLR grammars more performant, or analyse them in a practical way? Any insights or suggestions would be greatly appreciated.

Here are the necessary files to generate the ANTLR Lexer and Parser, as well as the base classes. Also included are the simple code example and a more complex one (TestPattern.on) where the poor performance  is noticeable.

Thank you

ONELexerBase.java
ONELexer.g4
SimpleClass.on
ONEParserBase.java
PatternTest.on
ONEParser.g4

Ken Domino

unread,
Jun 14, 2024, 8:12:21 AMJun 14
to antlr-discussion
I ported the code to C# to help with performance analysis: https://github.com/kaby76/one-parser/tree/main.

* There is a large max-k of 666 tokens for "member_declaration" for PatternTest.on, for the declaration of TestArray. https://github.com/kaby76/one-parser/blob/65b0bf18060e288c471e82d45cf44818563bd6a9/examples/PatternTest.on#L321-L368 . That may be ok as long as it doesn't need to do this repeatably.

* Most of the time is spent in "expression", followed by "type" and "basic_statement".

* Most rule invocations are with "expression".

* The profiling AdaptivePredict() code (https://github.com/antlr/antlr4/blob/7d4cea92bc3f7d709f09c3f1ac77c5bbc71a6749/runtime/CSharp/src/Atn/ProfilingATNSimulator.cs#L54-L96) does not count the actual times a token looked at. I wrote a ProfilingCommonTokenStream to keep count of the number of times LT() hits a token. It says the "yBoxed" decl is hit 53 times. https://github.com/kaby76/one-parser/blob/65b0bf18060e288c471e82d45cf44818563bd6a9/examples/PatternTest.on#L29C17-L29C23 . Why would the parse need to visit this declaration over 53 times?

* Your lexer and parser base classes should have Reset() defined, so the input can be rewound to the beginning. I didn't adjust the code to do that yet.

 --Ken

Norbert Leitl

unread,
Jun 17, 2024, 4:46:35 AMJun 17
to antlr-discussion
Hi Ken, thank you very much for your efforts!

Yes, it's quite strange that the yBoxed declaration stands out so much when called. There is nothing unusual compared to the other declarations, it's just a variable_declaration rule. Do you think this is mainly a parser issue  or could there be something wrong with the grammar?

I played around with Reset() implementation for the base classes.  For the lexer base class, reset seems never be called, whereas for the parser base class, it is.
Reply all
Reply to author
Forward
0 new messages