Can someone help me understand why my tokenizer (made with Brag) doesn't work?


Cistian Alejandro Alvarado Vázquez

Sep 24, 2019, 12:10:18 AM
to Racket Users
I'm trying to parse a simple subset of SQL and failing massively. I made a grammar which works, but have no idea how to get my tokenizer to work and I find the documentation very difficult to parse (ho ho ho). My code along with explanations can be found on this StackOverflow question:


I don't think it's really a perfect question for StackOverflow (judging by the lack of responses at least) so I'm hoping someone here might be able to guide me a bit. 

Can anyone help me understand how my tokenizer should work? I'm truly truly lost and reading the documentation has only added to my confusion! Thank you!

Matthew Butterick

Sep 24, 2019, 1:57:17 AM
to Cistian Alejandro Alvarado Vázquez, Racket Users

On 23 Sep 19, at 9:10 PM, Cistian Alejandro Alvarado Vázquez <alex.a...@resuelve.mx> wrote:

Can anyone help me understand how my tokenizer should work? I'm truly truly lost and reading the documentation has only added to my confusion! 

What your parser parses is a sequence of tokens. If you don't pass the parser all the tokens that the grammar expects, then the parse can never succeed. 

For instance, the problem with this lexer:

(lexer
 ["select" lexeme]
 [whitespace (token lexeme #:skip? #t)]
 [any-char (next-token)])

is that it only produces one token, "select". Since your parser uses more than just the token "select":

#lang brag
select    : /"select" fields /"from" source joins* filters*
fields    : @field (/"," @field)*
field     : WORD
source    : WORD
joins     : join* 
join      : "join" source "on" "(" condition ")"
filters   : "where" condition ("and" | "or" condition)*
condition : field | INTEGER "=" field | INTEGER

The parse can never succeed.

Likewise, your revised lexer:

(lexer
 [whitespace (token lexeme #:skip? #t)]
 ["select" lexeme]
 [(:seq alphabetic) (token 'WORD lexeme)])

will only emit two kinds of tokens: "select" and a WORD token containing a single letter as its lexeme. (Do you see why?) Also not what you want.

I can't write your whole tokenizer. But as a start, you probably want to match each of your reserved keywords as a whole token, e.g.:

[(:or "select" "from" "join" "on" "where" "and") lexeme]

If you want other sequences of characters to be captured as WORD tokens, your pattern needs to be a quantified pattern:

[(:+ alphabetic) (token 'WORD lexeme)]
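
Putting those two rules together, a lexer for your grammar might take a shape like this. This is only a sketch, not your actual code: the punctuation and INTEGER rules are my assumptions based on the grammar you posted, and the make-tokenizer wrapper name is illustrative.

(require brag/support) ; provides lexer, token, and the :+ / :or operators

(define (make-tokenizer port)
  (define (next-token)
    (define sql-lexer
      (lexer
       ;; skip whitespace entirely
       [whitespace (token lexeme #:skip? #t)]
       ;; reserved words and punctuation pass through as literal tokens
       [(:or "select" "from" "join" "on" "where" "and" "or"
             "," "(" ")" "=") lexeme]
       ;; one or more digits becomes an INTEGER token
       [(:+ numeric) (token 'INTEGER (string->number lexeme))]
       ;; one or more letters becomes a WORD token
       [(:+ alphabetic) (token 'WORD lexeme)]))
    (sql-lexer port))
  next-token)

The ordering matters: the keyword rule comes before the WORD rule so that "select" is matched as a keyword rather than swallowed as a WORD.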



Cistian Alejandro Alvarado Vázquez

Sep 24, 2019, 1:30:20 PM
to Racket Users
I do understand why any-char wouldn't work. Thanks! I don't really understand why :seq didn't work, maybe because I want to require one or more chars (thus :+)? Anyway, it works now! Thanks a lot! You're a legend.