Parsing in dremel

Camuel Gilyadov

Feb 5, 2011, 7:00:51 PM

to opend...@googlegroups.com

I would like to shed some light on the parsing of BQL in the project:

1. We use the excellent open-source parsing library ANTLR to parse BQL.

2. ANTLR has a great graphical IDE named AntlrWorks for authoring and debugging grammar files and, last but definitely not least, for charting nice parse trees and abstract syntax trees. I went so far as to include AntlrWorks in the Mercurial repository under the 'tools' directory. So use it, and see the example at the end of this post.

3. One existing issue with the parsing is that when a query is grammatically invalid and apparently unparsable, ANTLR (or my misuse of it) silently does best-effort parsing. I am currently working on resolving this issue and adding proper error reporting.
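To make the difference between best-effort and strict parsing concrete, here is a minimal sketch in Python. It is not ANTLR and not the project's parser: the toy grammar (NUMBER ('+' NUMBER)*) and the names ParseError, tokenize and parse_sum are all made up for illustration. The point is only that a strict parser stops at the first token that does not fit and also rejects trailing input.

```python
import re


class ParseError(Exception):
    """Raised at the first token that does not fit the grammar."""

    def __init__(self, message, position):
        super().__init__(f"{message} at position {position}")
        self.position = position


def tokenize(text):
    """Split text into (kind, value, position) tuples, ending with an 'end' token."""
    tokens = []
    for match in re.finditer(r"\d+|\S", text):
        value = match.group()
        kind = "number" if value.isdecimal() else value
        tokens.append((kind, value, match.start()))
    tokens.append(("end", "", len(text)))
    return tokens


def parse_sum(text):
    """Strictly parse NUMBER ('+' NUMBER)* and return the list of numbers."""
    tokens = tokenize(text)
    index = 0

    def expect(kind):
        # Consume one token of the given kind or fail immediately.
        nonlocal index
        found_kind, value, position = tokens[index]
        if found_kind != kind:
            raise ParseError(f"expected {kind} and got {found_kind!r}", position)
        index += 1
        return value

    numbers = [int(expect("number"))]
    while tokens[index][0] == "+":
        expect("+")
        numbers.append(int(expect("number")))
    expect("end")  # reject trailing input instead of silently ignoring it
    return numbers
```

With this sketch, parse_sum("1 + 2 + 3") returns the three numbers, while parse_sum("1 + + 2") raises a ParseError that points at position 4 instead of guessing what was meant.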

4. Another pseudo-issue is the level at which the parser approaches the BQL language. When I implemented the parser, I could choose among many levels of "understanding" at which it would try to comprehend and parse the query. For example, I could have put all built-in functions such as sin/cos directly into the grammar and made them reserved keywords in the language. I could also have differentiated between various types of identifiers, and even checked for some cases of division by zero in the grammar. The question is whether it is a good idea for the grammar, and therefore the parser, to be so intelligent. I think not, because if a query is unparsable the user won't get any intelligent error message, only something like "expected x or y and got z at position n". Another problem is that debugging grammar rules is not for the faint of heart, even with the great AntlrWorks IDE. Yet another problem is that the parser does not have access to the symbol table, so it cannot deeply understand the query in context. So I decided to make the grammar as simple as possible and have it understand the query only in terms of very basic components. However, I do capture many details, particularly operator precedence in expressions and the parsing of complex names, IDs, and so on. But currently I don't differentiate between column IDs and, for example, function names, or between built-in functions and user-defined functions. This is left to the semantic analyzer.
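The split described in point 4 can be sketched roughly as follows, again in Python and again purely as an illustration. The tuple-shaped nodes, the resolve function, the SemanticError class and the BUILTIN_FUNCTIONS table are assumptions of this sketch, not the project's actual data structures. The parser is assumed to emit only generic "name" and "call" nodes; a later pass with access to a symbol table decides what each identifier means and can produce a more helpful message than the grammar could.

```python
class SemanticError(Exception):
    """Raised when an identifier cannot be resolved against the symbol table."""


# Hypothetical table of built-ins; the grammar itself knows nothing about these.
BUILTIN_FUNCTIONS = {"sin", "cos"}


def resolve(node, columns, user_functions):
    """Turn generic parser nodes into nodes that say what each identifier is."""
    kind = node[0]
    if kind == "number":
        return node
    if kind == "name":
        _, identifier = node
        if identifier not in columns:
            raise SemanticError(
                f"unknown column {identifier!r}; known columns: {sorted(columns)}"
            )
        return ("column", identifier)
    if kind == "call":
        _, identifier, arguments = node
        resolved = [resolve(arg, columns, user_functions) for arg in arguments]
        if identifier in BUILTIN_FUNCTIONS:
            return ("builtin_call", identifier, resolved)
        if identifier in user_functions:
            return ("udf_call", identifier, resolved)
        raise SemanticError(f"unknown function {identifier!r}")
    if kind == "binary":
        _, operator, left, right = node
        return (
            "binary",
            operator,
            resolve(left, columns, user_functions),
            resolve(right, columns, user_functions),
        )
    raise SemanticError(f"unexpected node kind {kind!r}")
```

For example, the generic node ("call", "sin", [("name", "x")]) resolves to a built-in call on column x when x is a known column, and the same node with an unknown column fails with a message that lists the columns that do exist.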

5. Therefore, when a query is submitted, an error and its message can come from several places in the dremel stack: first, from the parser (once it is fixed); second, from the semantic analyzer (in the compiler module, according to the new structure); third, from the planner, for example when a table exists in the schema but cannot be found on disk; and finally, from the executor, when something goes wrong mid-query.
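One possible way to keep these error sources distinguishable is to tag every failure with the stage that produced it. The following Python sketch assumes a pipeline given as a list of (stage name, function) pairs; QueryError, run_query and the stage names are hypothetical and do not describe how the dremel stack is actually wired.

```python
class QueryError(Exception):
    """An error annotated with the stage of the stack that raised it."""

    def __init__(self, stage, message):
        super().__init__(f"[{stage}] {message}")
        self.stage = stage


def run_query(query, stages):
    """Feed the query through each stage in order, tagging any failure."""
    value = query
    for stage_name, stage in stages:
        try:
            value = stage(value)
        except QueryError:
            raise  # already tagged by an earlier wrapper
        except Exception as error:
            raise QueryError(stage_name, str(error)) from error
    return value
```

A caller would pass something like [("parser", parse), ("semantic analyzer", analyze), ("planner", plan), ("executor", execute)], where the four functions are placeholders for the real modules, and could then tell from error.stage whether the query text, the schema, the storage or the execution was at fault.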
