Parsing specific phrases (instead of always a `Program`)


bucephalus org

Sep 24, 2015, 4:22:24 PM
to esprima
Hello!


First of all: THANK YOU for Esprima! It is a wonderful tool and something I was desperately missing for a long time.
Suddenly, a whole new universe of possibilities opens up for the projects I am working on.

I also didn't know until now that there is a standard for abstract syntax trees, namely [The ESTree Spec](https://github.com/estree/estree).
This standardization is also a great thing. In my opinion, it is in the long run an even bigger milestone in the development of ECMAScript/JavaScript than an entire new ECMAScript version. Until now, we have only had code (strings) on one side and values (ES data structures) on the other, with `x.toString()` to convert in one direction and `eval(code)` to convert in the other. But with ESTrees, we have an intermediate "ontology" between code and values, which is very useful. And the Esprima parser is the bridge between these worlds, similar to the introduction of X-rays and CT scans in medicine. One day, we will/should have something like `x.toESTree()` in the ECMAScript standard, along with `x.toString()`.
THANK YOU ALL!


Anyway, I also have a question or maybe a request.
As far as I understand, `esprima` (when loaded in say Node.js with `var esprima = require('esprima');`) is a plain object with four properties:

* `version` with a version number
* `Syntax`, a plain object with the tag names for the non-terminals
* `tokenize`, a method that can be called as `tokenize(code)` or `tokenize(code,options)`
* `parse`, a method that can be called as `parse(code)` or `parse(code,options)`, where `options` is a boolean-valued plain object.

It seems that `parse` always parses a `Program`, and that makes sense, of course.
But I would like to use the parser for other syntax categories as well, for example `Expression` or `Function` (both are mentioned in The ESTree Spec, but they are not part of `esprima.Syntax`). Concretely, I would like `esprima.parse()` to take an optional third argument `cat` (for syntax category), so that e.g. `esprima.parse(code,options,'Expression')` returns the ESTree if `code` parses as an `Expression`, and throws an error otherwise.
In this scheme, `esprima.parse(code,options)` is short for `esprima.parse(code,options,'Program')`, of course.

For example, 

  > esprima.parse ('foo(bar)', {}, 'Expression')

should return

  { type: 'CallExpression',
    callee: { type: 'Identifier', name: 'foo' },
    arguments: [ { type: 'Identifier', name: 'bar' } ] }

whereas now I have to peel this `'CallExpression'` out of the `'Program'` tree that results from the

  > esprima.parse ('foo(bar)')

call, which returns

  { type: 'Program',
    body: [ { type: 'ExpressionStatement', 
              expression: { type: 'CallExpression',
                            callee: { type: 'Identifier', name: 'foo' },
                            arguments: [ { type: 'Identifier', name: 'bar' } ] } } ],
    sourceType: 'script' }
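For the time being, the peeling can be wrapped in a small helper. The following is only a minimal sketch: `parseExpression` is a hypothetical name and not part of Esprima, and the parser is passed in as the `parse` argument (for example `esprima.parse`) so that the helper does not depend on any particular version.

```javascript
// Hypothetical helper, not part of Esprima: parse `code` as a Program with
// the supplied `parse` function and unwrap the single expression inside it.
function parseExpression(parse, code, options) {
  const program = parse(code, options);
  const body = program.body;
  // Accept only a Program that consists of exactly one expression statement.
  if (body.length !== 1 || body[0].type !== 'ExpressionStatement') {
    throw new SyntaxError('Not a single expression: ' + code);
  }
  return body[0].expression;
}
```

With Esprima loaded, `parseExpression(esprima.parse, 'foo(bar)')` should yield the `CallExpression` node shown above. This is a workaround rather than a true `Expression` goal: a trailing semicolon is still accepted, `{a: 1}` is read as a block statement, and an anonymous `function () {}` is rejected because it is not a valid statement.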


What do you think? Or am I overlooking something?


Cheers, 

Richard Gibson

Sep 24, 2015, 5:06:08 PM
to esp...@googlegroups.com, bucepha...@gmail.com
The options argument to esprima.parse already supports sourceType: "module", in correspondence with ESTree Program#sourceType. I could imagine using that property or a similar one to specify other goal productions, although making that happen might take a while.


Michael Ficarra

Sep 24, 2015, 10:58:14 PM
to esp...@googlegroups.com, bucepha...@gmail.com
Because not all non-terminals neatly align with a parsing function in esprima (and because many parsing functions assume some invariant about the program leading up to their invocation), this would be very hard in general. But if you have a specific goal symbol in mind, you wouldn't have a very hard time adding support for that. The interface Richard Gibson suggests sounds pretty good to me, as well.

Michael Ficarra

bucephalus org

Sep 25, 2015, 9:34:06 AM
to esprima, bucepha...@gmail.com, repl...@michael.ficarra.me, fr...@michael.ficarra.me
Thank you Richard, and thank you Michael for your replies.

I am afraid I don't understand your answer, Richard. I probably didn't put my question very well. Please allow me to rephrase it and to point out why the solution may be of general interest.


Let me repeat some classic terminology (e.g. from Introduction to Automata Theory, Languages, and Computation, by Hopcroft & Ullman).

I suppose ECMAScript is a context-free language, i.e. we can define its entire syntax by means of production rules or syntax rules of the form

   NonTerminalSymbol -> Some_term_made_of_terminal_and_nonterminal_symbols

For example,

   IfStatement -> 'if' + '(' + Expression + ')' + Statement
                | 'if' + '(' + Expression + ')' + Statement + 'else' + Statement

When Esprima has parsed such an `IfStatement`, it abstracts away the concrete notation and returns the actual content like so:

   ["IfStatement", Expression, Statement]                     // without `else` clause
   ["IfStatement", Expression, Statement_1, Statement_2]      // with `else` clause

except that the result is not an array, but a plain object:

   {
     type: "IfStatement",
     test: Expression,
     consequent: Statement_1,
     alternate:  Statement_2 | null 
   }


More generally, a **context-free grammar** is a quadruple G = (V,T,P,S), where

* `V` is the set of nonterminal symbols (variables); these are given as interfaces in The ESTree Spec and keys in `require('esprima').Syntax` ('AssignmentExpression', ..., 'YieldExpression').
* `T` is the set of terminal symbols; these are lexically analysed as tokens by Esprima
* `P` is the set of production rules; this is essentially the ECMAScript standard
* `S` is the start symbol, one of the nonterminals; in case of Esprima, this is 'Program'.

The **language** of `G` is the set of all strings that can be generated from `S` by applying the production rules until all nonterminal symbols are gone.

A **parser** for that language goes the other way, i.e. it takes a string and produces the tree that describes how that string is generated from `S`. Or it issues an error message, in case the string is not in the given language.

Now, with Esprima, we do have such a parser for ECMAScript. And that is great.


But in this general setup, the parser always parses an `S` phrase. Esprima always tries to parse a `Program`.
What I would like it to do is parse any kind of phrase defined by a nonterminal symbol, say `IfStatement`.
I would like to be able to call the parser like so

    esprima.parse (code, {}, 'IfStatement')

so that it does not try to parse a `Program` phrase, but an `IfStatement`, and issues an error when `code` is not a well-formed if-statement.


Of course, this option is not offered by Esprima, at least not in the API as presented in the [documentation](http://esprima.org/doc/index.html).
And Michael, I understand from your reply that you expect this to be very hard to solve in general.
But I am not so sure. I seem to remember from my studies that most parsing algorithms and strategies are in fact implemented that way, i.e. they have implicit parsers for every nonterminal symbol. This is just not made explicit in the theory, because the usual question is about the language (whole programs), not its sub-phrases. I have not studied the source code of Esprima. But if I am right, this additional functionality might not be so hard to provide after all.
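To illustrate what I mean by implicit parsers for every nonterminal, here is a minimal sketch of a recursive-descent parser. It is not Esprima's code, and the toy grammar, `tokenize`, `makeParser` and the `goal` argument are all assumptions made up for this example. Each nonterminal gets its own parsing function, so choosing a different start symbol is just a table lookup.

```javascript
// Toy grammar (an assumption for illustration, far smaller than ECMAScript):
//   Program     -> Statement*
//   Statement   -> IfStatement | Expression ';'
//   IfStatement -> 'if' '(' Expression ')' Statement [ 'else' Statement ]
//   Expression  -> Identifier [ '(' Expression ')' ]
function tokenize(code) {
  const tokens = [];
  const pattern = /\s*([A-Za-z_$][\w$]*|[();])/y; // sticky: match at lastIndex only
  let match;
  let end = 0;
  while ((match = pattern.exec(code)) !== null) {
    tokens.push(match[1]);
    end = pattern.lastIndex;
  }
  if (code.slice(end).trim() !== '') {
    throw new SyntaxError('Unexpected character near offset ' + end);
  }
  return tokens;
}

function makeParser(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const expect = (tok) => {
    if (tokens[pos] !== tok) {
      throw new SyntaxError('Expected ' + tok + ' but found ' + tokens[pos]);
    }
    pos += 1;
  };
  // One parsing function per nonterminal of the toy grammar.
  const rules = {
    Program() {
      const body = [];
      while (peek() !== undefined) body.push(rules.Statement());
      return { type: 'Program', body };
    },
    Statement() {
      if (peek() === 'if') return rules.IfStatement();
      const expression = rules.Expression();
      expect(';');
      return { type: 'ExpressionStatement', expression };
    },
    IfStatement() {
      expect('if');
      expect('(');
      const test = rules.Expression();
      expect(')');
      const consequent = rules.Statement();
      let alternate = null;
      if (peek() === 'else') {
        expect('else');
        alternate = rules.Statement();
      }
      return { type: 'IfStatement', test, consequent, alternate };
    },
    Expression() {
      const callee = rules.Identifier();
      if (peek() !== '(') return callee;
      expect('(');
      const argument = rules.Expression();
      expect(')');
      return { type: 'CallExpression', callee, arguments: [argument] };
    },
    Identifier() {
      const name = peek();
      const isWord = name !== undefined && /^[A-Za-z_$]/.test(name);
      if (!isWord || name === 'if' || name === 'else') {
        throw new SyntaxError('Expected identifier but found ' + name);
      }
      pos += 1;
      return { type: 'Identifier', name };
    },
  };
  return { rules, atEnd: () => pos === tokens.length };
}

// `goal` selects the start symbol; it defaults to 'Program'.
function parse(code, goal = 'Program') {
  const parser = makeParser(tokenize(code));
  if (!Object.prototype.hasOwnProperty.call(parser.rules, goal)) {
    throw new Error('Unknown goal symbol: ' + goal);
  }
  const node = parser.rules[goal]();
  if (!parser.atEnd()) throw new SyntaxError('Unexpected trailing input');
  return node;
}
```

In this sketch, `parse('foo(bar)', 'Expression')` returns the bare `CallExpression` node, while `parse('foo(bar)', 'IfStatement')` throws a `SyntaxError`. A real parser is less tidy, of course: as Michael points out, its parsing functions may rely on state set up by their callers, which this toy avoids.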


But why should this be interesting, at all? Why could it be valuable to add this option to Esprima?

Traditionally in JavaScript, we deal with text input from users and we often need to verify that input. For example, we need to make sure that a certain string is a valid email address.
But if JavaScript is really trying to evolve and catch up with the possibilities of other higher-order languages, JavaScript/ECMAScript code itself becomes data. ECMAScript is already a functional language in the sense that functions are "first-class" values. But ECMAScript is also a scripting language, and there is a huge potential to embrace that fact and allow code to be data as well!
So far, we only have `eval(code)` to convert code into values. But "eval is evil": it is too dangerous to apply to arbitrary input code. Up to now, there is no built-in way to analyze the code so that only certain kinds of expressions are accepted. Well, there is `JSON.parse`, which is a big step in that direction.
But I would like to filter and safely convert code of any kind: `String`, `Literal`, `Function`, `Expression`, etc. If Esprima had an option to accept only certain kinds of code, it would immediately provide a verification for such data and input. And that would be really cool. ;-)
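To make the verification idea concrete, here is a minimal sketch of such a filter working on an already parsed ESTree node. `isDataExpression` is a hypothetical name, not an Esprima function, and the set of accepted node types is just one possible choice.

```javascript
// Hypothetical validator, not part of Esprima: accept only ESTree nodes that
// describe plain data (literals, arrays, and objects with non-computed keys).
function isDataExpression(node) {
  switch (node.type) {
    case 'Literal':
      // Regular-expression literals carry a `regex` property in ESTree; reject them.
      return !('regex' in node);
    case 'ArrayExpression':
      // Holes in sparse arrays show up as null elements; reject those too.
      return node.elements.every((el) => el !== null && isDataExpression(el));
    case 'ObjectExpression':
      // Only ordinary `key: value` properties; no getters, setters or computed keys.
      return node.properties.every((prop) =>
        prop.type === 'Property' && prop.kind === 'init' && !prop.computed &&
        isDataExpression(prop.value));
    default:
      // Calls, identifiers, functions, operators, etc. are all rejected.
      return false;
  }
}
```

Anything that could run code, such as the `CallExpression` for `foo(bar)`, falls through to the default branch and is rejected. The whitelist is deliberately strict: even `-1` is refused, because it parses as a `UnaryExpression` rather than a `Literal`.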

Cheers,
Thomas

Richard Gibson

Sep 25, 2015, 12:17:09 PM
to esp...@googlegroups.com, bucepha...@gmail.com, repl...@michael.ficarra.me, fr...@michael.ficarra.me
I understand your request, and its differences from the current interface. I was pointing out an extension of the current interface that would satisfy your request, and alluded to something upon which Michael elaborated: that the current structure of esprima is not conducive to this change. It isn't impossible, but it isn't exactly a high priority either. I encourage you to look at the source code, mock up a proof of concept, and/or file an issue at https://github.com/jquery/esprima.