[Neo4j] Cypher language grammar

dmi...@vrublevsky.me

Jan 6, 2016, 12:23:29 PM
to neo4j-e...@googlegroups.com, openc...@googlegroups.com
Hi!

I am the author of the Cypher IntelliJ plugin. Recently I spent some time porting the Cypher language from its Parboiled-based implementation (from cypher-compiler) to BNF, so that I can then use it in the plugin to generate a parser.

The tricky part is that I need a lexer. Parboiled-based parsers are lexer-less, which is great when you need an AST and nothing more.

In my case I need a lexer that is capable of tokenizing a Cypher query, even an invalid one.
Fortunately, the IntelliJ toolkit makes it possible to generate a lexer from the BNF using JFlex.

So, I want to discuss several issues that I encountered while doing all this.

Keywords, function names & identifiers

In Cypher, any keyword or function name may be used as an identifier.
For example, it’s possible to write a query like this:
```
MATCH nodes=(return)
RETURN nodes(nodes)
```

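Currently the lexer generated by the IntelliJ toolkit can’t deal with such cases.
From the lexer’s perspective there are 6 keywords in the query: every word is tokenized as a keyword or function name, and none as an identifier.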

However, the generated lexer is pretty simple. I can probably fix this issue by writing my own lexer that keeps track of state and determines whether a given token should be an identifier or a keyword.
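To make the idea of a state-aware lexer concrete, here is a minimal sketch in Python (chosen for brevity; the plugin itself would use JFlex/Java). Everything in it is an assumption for illustration: the `KEYWORDS` and `FUNCTIONS` sets are tiny hypothetical subsets, and the two rules ("a function name must be followed by `(`" and "a keyword only counts outside brackets") are a heuristic, not how cypher-compiler decides.

```python
import re

# Hypothetical, illustrative subsets; not the full Cypher vocabulary.
KEYWORDS = {"match", "return", "where", "create"}
FUNCTIONS = {"nodes", "toint", "e"}

# Either a word or any single non-space symbol, with leading whitespace skipped.
TOKEN_RE = re.compile(r"\s*(?:(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<sym>\S))")


def classify(query):
    """Tokenize `query`, using bracket depth and one-symbol lookahead as state."""
    tokens = []
    depth = 0  # how many (, [ or { are currently open
    pos = 0
    while True:
        m = TOKEN_RE.match(query, pos)
        if not m:
            break
        pos = m.end()
        word = m.group("word")
        if word is None:
            sym = m.group("sym")
            if sym in "([{":
                depth += 1
            elif sym in ")]}":
                depth = max(0, depth - 1)
            tokens.append(("SYMBOL", sym))
            continue
        rest = query[pos:].lstrip()  # lookahead past whitespace
        lower = word.lower()
        if lower in FUNCTIONS and rest.startswith("("):
            kind = "FUNCTION"      # only a call if "(" follows
        elif lower in KEYWORDS and depth == 0:
            kind = "KEYWORD"       # heuristic: keywords live outside brackets
        else:
            kind = "IDENTIFIER"
        tokens.append((kind, word))
    return tokens


query = "MATCH nodes=(return)\nRETURN nodes(nodes)"
for kind, text in classify(query):
    if kind != "SYMBOL":
        print(kind, text)
```

On the query above this reports only `MATCH` and `RETURN` as keywords and only the second `nodes` as a function. The heuristic is deliberately crude: it does not handle string literals, and it would misclassify keywords that legitimately appear inside a parenthesized expression, so a real lexer would need more states than a single depth counter.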

The same applies to CypherPrettifier. If you execute the query above in http://console.neo4j.org/, you can see that it is formatted incorrectly.

Question: Are there any plans to make the language stricter, so that keywords can’t be used as identifiers (perhaps as part of the openCypher initiative)?

Function names & case insensitivity
Cypher has some built-in functions.
For example: `toInt()`
In reality, from the Cypher compiler’s perspective, those function names are case-insensitive.
So all of these are valid: `toint`, `ToInT`, `toINT` and others.

While this isn’t bad in itself, it can sometimes give rise to interesting effects.

Again, the problem is with the lexer.
Code: ` (E:Flavour{name:'E', description:'Light, Medium-Sweet, Low Peat, with Floral, Malty Notes and Fruity, Spicy, Honey Hints.'}),`

In this code sample, `E` is tokenized by the lexer as a function name, because there is an `e()` function and function names are case-insensitive.
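The collision can be reproduced with a few lines of Python. This is only a sketch of a context-free, case-folded lookup; `BUILTIN_FUNCTIONS` is a hypothetical subset and `naive_kind` is a made-up helper, not the generated lexer's actual logic.

```python
# Hypothetical subset of built-in function names, stored lower-cased.
BUILTIN_FUNCTIONS = {"toint", "e", "nodes"}


def is_function_name(word):
    """Case-insensitive lookup: any casing of a built-in name matches."""
    return word.lower() in BUILTIN_FUNCTIONS


def naive_kind(word):
    """Classify a word with no context at all, as a simple lexer would."""
    return "FUNCTION" if is_function_name(word) else "IDENTIFIER"


# All casings of toInt resolve to the same function name.
print([is_function_name(n) for n in ("toint", "ToInT", "toINT")])

# The node identifier E collides with the e() function; Flavour does not.
print(naive_kind("E"), naive_kind("Flavour"))
```

Because the lookup ignores both case and context, the shorter a function name is, the more likely it is to clash with an ordinary identifier.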

Question: Same as above. Are there plans to forbid using function names as identifiers?

Questionable rules
I encountered several questionable rules.

1) RelationshipPatternSyntax - this one specifies the syntax for creating constraints.
```
RelationshipPatternSyntax ::= ("()-[" Identifier RelType "]-()")
      | ("()-[" Identifier RelType "]->()")
      | ("()<-[" Identifier RelType "]-()")
```
The pattern starts and endings are hardcoded. However, everywhere else a pattern is described, `Dash`, `LeftArrowHead` and `RightArrowHead` are used, and these rules support additional styles of dashes and arrows.
Basically, this means that I can’t create a constraint using the additional supported dash & arrow head variants.
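The difference between the hardcoded rule and one built from the shared rules can be sketched with two regular expressions in Python. The character sets below are assumptions for illustration only (a hyphen-minus plus two Unicode variants, and two angle brackets per direction); the real sets are whatever the grammar's `Dash`, `LeftArrowHead` and `RightArrowHead` rules list, and `\w+:\w+` is a simplified stand-in for `Identifier RelType`.

```python
import re

# Assumed, illustrative subsets of the variants the shared rules accept.
DASH = "-\u2013\u2212"          # hyphen-minus, en dash, minus sign
LEFT_ARROW_HEAD = "<\u27e8"     # less-than, mathematical left angle bracket
RIGHT_ARROW_HEAD = ">\u27e9"    # greater-than, mathematical right angle bracket

BODY = r"\[\w+:\w+\]"           # simplified stand-in for [Identifier RelType]

# Mirrors the three hardcoded alternatives of RelationshipPatternSyntax.
HARDCODED = re.compile(rf"\(\)(?:-{BODY}-|-{BODY}->|<-{BODY}-)\(\)")

# Same three shapes, but built from character classes instead of literals.
D = f"[{re.escape(DASH)}]"
L = f"[{re.escape(LEFT_ARROW_HEAD)}]"
R = f"[{re.escape(RIGHT_ARROW_HEAD)}]"
FLEXIBLE = re.compile(rf"\(\)(?:{L}{D}{BODY}{D}|{D}{BODY}{D}{R}?)\(\)")

ascii_form = "()-[r:KNOWS]->()"
en_dash_form = "()\u2013[r:KNOWS]\u2013>()"

for pattern in (ascii_form, en_dash_form):
    print(bool(HARDCODED.fullmatch(pattern)), bool(FLEXIBLE.fullmatch(pattern)))
```

The ASCII form is accepted by both expressions, while the en dash form is accepted only by the one built from the shared character classes, which is the inconsistency described above.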

2) Expression1 - Functions that are not functions.
There are several branches that look like functions but are not really functions from the grammar’s perspective.
I am curious why it was designed that way.
They are called “Predicates” in the documentation (http://neo4j.com/docs/stable/query-predicates.html).
>>> Predicates are boolean functions

3) _PRAGMA - I actually can’t find any information about what this is.
There are some clues on Google, but nothing in the Neo4j documentation.