instaparse questions

eliass...@yahoo.com

Nov 18, 2013, 4:20:55 AM

to clo...@googlegroups.com

Hi,

I'm trying to use instaparse to differentiate between identifiers and keywords.

The following code is from the tutorial.

(require '[instaparse.core :as insta])

(def unambiguous-tokenizer
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = keyword | !keyword identifier
     whitespace = #'\\s+'
     identifier = #'[a-zA-Z]+'
     keyword = 'cond' | 'defn'"))

The above parser works fine for:

(insta/parse unambiguous-tokenizer "cond id defn")

It recognizes cond and defn as keywords and id as an identifier.

But if an identifier starts with a keyword, such as condid:

(insta/parse unambiguous-tokenizer "condid id defn")

The parse fails. (I want it to recognize condid as an identifier, not as a misspelled keyword.)

Does anybody know how to make that work?

Thanks

--anders

Mark Engelberg

Nov 18, 2013, 5:55:46 AM

to clojure

The simplest way is to turn the keywords into regular expressions that require a word boundary (\b) after the keyword, so that 'cond' no longer matches the start of 'condid':

(def unambiguous-tokenizer-improved
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = keyword | !keyword identifier
     whitespace = #'\\s+'
     identifier = #'[a-zA-Z]+'
     keyword = #'cond\\b' | #'defn\\b'"))
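
A minimal sketch for checking this at the REPL, assuming instaparse is on the classpath and aliased as insta. The name boundary-tokenizer is only for illustration; the grammar is the same as the one above.

```clojure
(require '[instaparse.core :as insta])

;; Same grammar as unambiguous-tokenizer-improved: each keyword regex ends
;; in \b, so it only matches when the keyword is a whole word.
(def boundary-tokenizer
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = keyword | !keyword identifier
     whitespace = #'\\s+'
     identifier = #'[a-zA-Z]+'
     keyword = #'cond\\b' | #'defn\\b'"))

;; 'cond\b' does not match at the start of "condid", so !keyword succeeds
;; there and the whole word is read as an identifier.
(insta/parse boundary-tokenizer "condid id defn")
;; => [:sentence [:identifier "condid"] [:identifier "id"] [:keyword "defn"]]

;; Whole-word keywords are still recognized as keywords.
(insta/parse boundary-tokenizer "cond id defn")
;; => [:sentence [:keyword "cond"] [:identifier "id"] [:keyword "defn"]]
```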


Mark Engelberg

Nov 18, 2013, 6:51:34 PM

to clojure

Also, the version in the tutorial called "preferential-tokenizer" behaves the way you would like. This is actually a really good illustration of the difference between the two approaches of negative lookahead versus ordered choice.

The unambiguous-tokenizer, by saying "<token> = keyword | !keyword identifier", rigidly specifies that it's not a valid identifier if it starts with a keyword. The preferential-tokenizer simply says: "<token> = keyword / identifier", i.e., keyword interpretation is preferred over identifier. The preference approach is more flexible, allowing the parser to begin by interpreting the "cond" in "condid" as a keyword, but when this doesn't lead to a valid parse (because there's no whitespace after "cond"), it backtracks and tries interpreting it as an identifier.

As I pointed out in the last post, you can "fix" the unambiguous-tokenizer by clearly specifying with regexes that the tokens must end at word boundaries, but the preferential-tokenizer example is another way to get the behavior you're looking for.
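
For comparison, here is a sketch of the ordered-choice version. The grammar is reconstructed from the description above (the token rule uses "/" instead of "| !keyword"), so it is worth checking against the tutorial's own preferential-tokenizer. It assumes instaparse is aliased as insta.

```clojure
(require '[instaparse.core :as insta])

;; Ordered choice: try keyword first, fall back to identifier if the
;; keyword reading does not lead to a complete parse.
(def preferential-tokenizer
  (insta/parser
    "sentence = token (<whitespace> token)*
     <token> = keyword / identifier
     whitespace = #'\\s+'
     identifier = #'[a-zA-Z]+'
     keyword = 'cond' | 'defn'"))

;; Reading "cond" as a keyword leaves "id" with no whitespace before it,
;; so the parser backtracks and reads "condid" as an identifier.
(insta/parse preferential-tokenizer "condid id defn")
;; => [:sentence [:identifier "condid"] [:identifier "id"] [:keyword "defn"]]
```

The trade-off is that this grammar stays ambiguous in principle ("cond" is also a valid identifier), and the preference only decides which parse is returned first.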
