Question regarding the code base

19 views
Skip to first unread message

KFRF

unread,
Mar 25, 2017, 9:58:33 PM3/25/17
to esprima
Hi

First off. I got tired of reporting issues to the issue tracker, so I spent the last 2 days reading through the Esprima source code to understand what's going on under the hood, and I discovered a lot of things I need to understand... E.g. some source code are really outdated ano 2017.

As I understand things, most of the source code has been inherited from the very first public version 1.0. And everthing after that have been just "modifying the source". I'm right? Example the tokens.

And there also seem to exist support for legacy browsers / NodejS versions (0.8 - 0.12) that is no longer needed. And not to mention that the codebase isn't V8 optimized.

After 2 days I got a good understanding of the source code, so I felt I was ready to fix all bugs.

I started out with the scanner class. The class itself seems to have a performance loss at about 33% percent and is not optimized for V8. And I also discovered bugs with octal literals, line terminators, long binary digits on some computers, escaped chars etc...

Even the way the punctuators was handled looked like a mess to me, and it was then I noticed the curlyStack code and started to wonder what is the point? That is related to my first question raised on the Github repo. I found a improved solution for that one, but that took me into more complex questions, and I also discovered that the scanTemplate() function would need an rewrite. This again took me to the tokenizer class - where I solved similar template issue. But I here discovered that this class was a mess as well, As emphasized with some issue tickets.

Too make this short. I have problems too understand what is the problems with fixing reported bugs. The main problem as I see it, is that the codebase now is too fragmented - old fashion - so it's hard to fix it in a performant way? I ran into this issue as well.

Therefor I have a few questions to ask

1. What is the thoughts about this? Too have it's own Scope tracker, or inline arrays and push and pull to them ala Acorn, or simply a solution ala Shift? See the Ecma specs regarding this. Located here: It says

3. If body is a List of errors, then return body.

Under all circumstances it has to return an list of errors if we are following the specs. And Esprima does not do that, I'm right? There are more issues into this as well. Mainly performance. Scanning the whole AST tree could be expensive, and also use unnecessary CPU cycles.
As an experiment I have made a "Scope tracker" that catches all early errors for my own experimental code, and I gained performance on it having a inlined solution.

2. Then we have the grammar issues. Is it updated to ES2017? Doesn't seem like it, and there are still some binging issues I'm aware of. (not reported on the issue tracker)  The way the grammar are handlet now it's expensive, uses too much CPU cycles and memory. There are other options. See how TypeScript does it as one example. 

3. Precedence climbing. Comparing to Shift first. They have actually a similiar solution, but an improved one as I noticed. But what is the point with precedence climbing in the first place? In general with recursive descent parsers is to do the LHS stuff first before entering the parse stuff. and it would be much easier to do the right derivation that way. Wich lead me to the fact that precedence climbing could be avoided if done that way.  I have read issue tickets regarding this matter,and tried to figure it out. I tend too agree that the recursive solution used in Esprima 1.0 was no good, but there are room for improvements. 2017 now! See TypeScript solution This is one improved way to handle it.

So my question is. What is the plans here?

For now I'm focusing on refactoring the scanner class to improve it, and reduce memory and CPU usage and solve known bugs.

KFRF

unread,
Mar 25, 2017, 10:48:43 PM3/25/17
to esprima

A few more questions. Now when I go through the scanner class I discovered that 'Unexpected token ILLEGAL' is the error message used mostly all over. What is the idea behind this? E.g. if a number contains too few digits, or a binary digit has a unexpected char behind it. Wouldn't it be more natural to use a error message covering this issues?

parseInt() and parseFloat(). Why use them? Too expensive. And they are actually not needed to archieve the same goal. What is the purpose behind this?

Let me illistrate with "not needed"

Original Esprima code

private scanBinaryLiteral(startnumber): RawToken {
    let num = '';
    let ch;

    while (!this.eof()) {
        ch = this.source[this.index];
        if (ch !== '0' && ch !== '1') {
            break;
        }
        num += this.source[this.index++];
    }

    if (num.length === 0) {
        // only 0b or 0B
        this.throwUnexpectedToken();
    }

    if (!this.eof()) {
        ch = this.source.charCodeAt(this.index);
        /* istanbul ignore else */
        if (Character.isIdentifierStart(ch) || Character.isDecimalDigit(ch)) {
            this.throwUnexpectedToken();
        }
    }

    return {
        type: Token.NumericLiteral,
        value: parseInt(num2),
        lineNumber: this.lineNumber,
        lineStart: this.lineStart,
        start: start,
        end: this.index
    };
}

    Exact same output, but improved code:

private scanBinaryDigits(startnumber): RawToken {
        let valuenumber = 0;
        let chnumber;
        let numberOfDigitsnumber = 0;
        const pos = this.index;

        while (!this.eof()) {
            const ch = this.source.charCodeAt(this.index);

            if (isBinaryDigit(ch)) {
                value = value * 2 + toBinaryValue(ch);
                this.index++;
                numberOfDigits++;
            } else {
                break;
            }
        }

        // we must have at least one binary digit after 'b'/'B'
        if (numberOfDigits === 0) {
           this.throwUnexpectedToken(); // ?
        }
        if (!this.eof()) {
            ch = this.source.charCodeAt(this.index);
            // istanbul ignore else
            if (Character.isIdentifierStart(ch) || Character.isDecimalDigit(ch)) {
                this.throwUnexpectedToken(); // ?
            }
        }
        
        return {
        type: Token.NumericLiteral,
        value: value,
        lineNumber: this.lineNumber,
        lineStart: this.lineStart,
        start: start,
        end: this.index
    };
}

With the improved  code - wich is faster - I can extend it to also accept octals. Then I would have one function doing two things and save code lines and space.

KFRF

unread,
Mar 25, 2017, 11:57:02 PM3/25/17
to esprima
Ignore the tread. The Esprima build system was not best friends with my computer system. So I stopped dealing with Esprima.



On Sunday, March 26, 2017 at 9:58:33 AM UTC+8, KFRF wrote:

Ariya Hidayat

unread,
Mar 26, 2017, 1:03:12 PM3/26/17
to esp...@googlegroups.com

I'm not sure what you mean by grammar not updated to ES2017? In the current master branch, the parser already supports trailing commas in parameters (#1550) and async/await (#1079). Those are the new syntax features as part of ES2017. The implementation of these two features might not fully capture all scenarios (early errors, etc). But that's the case with every single JavaScript parser/interpreter out there anyway: we implement features and we continue to squash the bugs as we go.

Where is the support for legacy browsers or old Node.js versions?

For performance-related issues, feel free to point out how to make things better with the concrete evidence (e.g. the speed of the parser can be tested by running npm run benchmark-parser). While there are rooms to improve, many parts of the code have been quite optimized already. I spent an enormous amount of time looking at the IR generated by V8 Crankshaft when comparing different approach. Note that the situation can change in the very near future due to the new TurboFan compiler (and of course, combined with Ignition).

My recommendation is simple. Rather than discussing various subjective claims about the source code (it's a mess, too fragmented, isn't V8 optimized, old fashion, uses too much CPU cycles, etc), please construct a patch or a pull request that demonstrates the proposed change, without regressing any tests and code coverage, and with a conclusive improvement in e.g. the speed.

Thank you!

--
Ariya Hidayat, https://ariya.io

Ariya Hidayat

unread,
Mar 26, 2017, 9:31:33 PM3/26/17
to esp...@googlegroups.com

Esprima error message is designed to follow V8. Unexpected token is one of the most common ones. There is a suggestion to introduce a more detailed set of error messages (see issue #1214), but doing that requires a lot of time and needs some serious thought.

As for your claim, With the improved  code - wich is faster, it would be better if there is a benchmark proving that. Otherwise, we would engage in an endless discussion on micro-optimizations without making other important progress that matters. Remember that optimization is always a trade-off, speeding up things which do not show up as the major bottlenecks are not the best use of everyone's time.

Regards,

Ariya Hidayat

unread,
Mar 26, 2017, 9:31:55 PM3/26/17
to esp...@googlegroups.com
 
The Esprima build system was not best friends with my computer system. So I stopped dealing with Esprima.

Again, another subjective claim without much evidence.

I have been successfully using Windows, Linux, and macOS to develop Esprima. Unless you have a very exotic system, I doubt that you will encounter a significant development barrier.

On Windows, currently tsfmt does not support glob and thus causes a failure (tracked in issue #1730). But this is not a showstopper and until tsfmt fixes it, you can just ignore it (e.g. our AppVeyor CI run, for Windows build, just dances around it).

KFRF

unread,
Mar 27, 2017, 2:50:26 AM3/27/17
to esprima
Negative all over is not a good start! My background is a lawyer and I am used to be in the court, so I can handle that!  

On the issue ticket on Esprima repo, I posted some evidence regarding the issues I faced with the build system. Allmost all NPM commands failed due to path issues. Ony "npm run test" I got working after while and it ran around 1500 tests successfully.

I know Esprima is fast. Around 12% faster then Acorn. But there are room for improvements. Before I went back to Esprima source code and tried too look into various issues, I created my own parser. Just a playground but it's around 90% faster. That is not the point here, but I use my own parser to validate every bug I report on Esprima because my own parser is 100%.ECMA compat. It's a job related project, but I could I posted it here. But t again reverse engineering exist, so I can't,

 That said. Grammar. I was thinking about both lexical grammar, as well as destructing and binding. I noticed there are issues regarding it, Also noticed the issue ticket from september 2016 on Esprima repo. Specially how parenthesis are handled causes a lot of the early errors to fail. Too evidence it. Look at the Shift parser - early errors.

Then the performance issue claims. Simple knowledge is that V8 loves switch statements with numbers. That is more performant then a if / else statement. Just ask on the V8 repo if doubt. The only place I noticed where it was optimized for V8 is for keywords. Calculating the length of the word, and use a switch statement to return boolean values. That is an fast approach.

Complex functions at least 3 level deep is and can be a performance DEOPT as well. The trick to avoid it is too split into different functions. The way you did the export and import is correctly opmitized when it comes to performance, but what about "parseForStatement". This is not optimized.

Now you got 3 evidences for that :)

forEach, and instanceOf and other native Javascript code are normally horrible slow. In top of the scanner class you are using "indexOf" to check hex values etc. Not optimal, I also noticed this code is from the early versions of Esprima, and from the old days I remember I did similar things when I handled legacy browsers such as IE 6 and 7. So this is one of the places where I was thinking of legacy browsers.  And  this can be optimized.  

So now when I am done with backing up my own writings. I have to ask. What is the plan for early errors? Performant wise I would say I scope tracker is the best solution. That is what I did in my own parser. The Scope tracker will spawn itself, and call either itself or the parent ( parser itself). Depends on how many level deep you are.  This is also somehow similiar too how V8 and Spidermonkey does it with scoping. There are no overhead doing  this, and it can be quickly optimized. 
I tried to adopt my own Scope tracker to Esprima source right before I replied to this, and I works. However. You world need to "declare" names as you go on the places where it's needed. Not so familiar with the Esprima code ATM:

A solution like this is also similar to what Acorn did, but a scope tracker will also handle the hidden classes properly and there will be no DEOPT. A inlined solution aka Acorn is also a slow down.

For this I suggest study V8 solution.

Current grammar solution. E.g "IslolateCoverGrammar" I got the idea, but why call "this"? You loose performance on this, and how is the cover grammar reported?  How are you dealing with syntatic and semantic errors correctly?

Let's look at this error [([a])] = 12; Where does the error occur? Esprima parses this, but it's an error. In this case, are Esprima counting backward or reporting it as an error with line and column pointing at the closing right parenthesis? 

My point is - if I understand it correctly. As Esprima is, you have to count backward to prev token to get correct position?

This things are not covered by current binding and destructing solution.. This was my point.

K.F


KFRF

unread,
Mar 27, 2017, 7:29:45 AM3/27/17
to esprima
Quick follow up. I used one day now to figure out the Esprima codebase, and why things are so hard to fix. And I also try to optimize the code base. I will come back to it when and IF I succeed, but so far yes Esprima codebase isn't fully optimized.

Here is a horrible bench result that give you an idea. This is after I was done refactoring the scanner class. Working on the parser class now and it got around 17% improved performance.

Current Esprima:

0.1            - 156,150 ops/sec ±0.99% (88 runs sampled)
1               - 187,765 ops/sec ±2.50% (82 runs sampled)
x >>>= 42 - 48,597 ops/sec ±1.25% (89 runs sampled)

Optimized Esprima:

0.1           - 371,950 ops/sec ±1.36% (91 runs sampled)
1               - 601,854 ops/sec ±0.25% (91 runs sampled)
x >>>= 42 - 347,005 ops/sec ±0.21% (88 runs sampled)




On Sunday, March 26, 2017 at 9:58:33 AM UTC+8, KFRF wrote:

Ariya Hidayat

unread,
Mar 27, 2017, 9:54:55 PM3/27/17
to esp...@googlegroups.com

Allmost all NPM commands failed...? This should not be the case. Only npm run code-style should fail (due to the issue I already explained). Other commands might fail because the dependency to this one, but characterizing that almost all... is rather unfair, I would say.

As with your explanation about other claims, let's just say that we agree to disagree here. I don't know whether it's the language barrier or now (I assume you're not a native speaker). For us, the evidence of performance improvement involves running npm run benchmark-parser showing that there is a conclusive speed improvement. As a free software/open-source project, we rely on volunteers and it's not our objective to optimize every single function to death. It's all about trade-off, we're not interested in micro-optimizations and micro-benchmarks as we want our volunteers' time to be spent on things which can make a significant difference (whether it's related to performance, correctness, reliability, etc), not just a millisecond win on parsing a synthetic test case.

I can see that you are passionate about performance. If you happen to create a patch or a pull request, without regressing the test suite, and with a demonstrable speed-up (as described above, and also in my previous emails), then some of us will be very happy to review it. Until then, I don't feel that this style of conversation is terribly productive. At some point, I might pick up this thread again, but in the mean time, I'll step aside to work on my own (long) TODO.


Thanks!

KFRF

unread,
Mar 27, 2017, 11:24:38 PM3/27/17
to esprima
I understand you, and I see you got a long TODO. Performance is not important to me ATM. I'm focused on fixing all open issues in Esprima, if that is accepted. Due to my daily work I can't  my computer to push PR. But I give you solutions. If accepted.

After gone through the source code and solved all issues, I can look into performance again. Probably in may when my work are done and I can return to my home country.
Reply all
Reply to author
Forward
0 new messages