Stripping spaces in a SQL parser

Scott Taylor

Sep 20, 2008, 1:27:54 AM
to treet...@googlegroups.com

In a SQL grammar (as in most), spaces can generally be disregarded, so
that the following two strings parse the same:

"SELECT * FROM users", and
"SELECT   *   FROM   users"

I've found that when parsing, I end up with nodes like the following:

grammar SQL
  rule select
    select_clause " "+
    from_clause ...
  end
end

instead of a clearer definition:

grammar SQL
  rule select
    select_clause " " from_clause ...
  end
end

The only time such spaces can't be stripped is when a string occurs.
So the following two *shouldn't* be parsed with the same node:

"SELECT * FROM foo WHERE name = \"bar baz\""
"SELECT * FROM foo WHERE name = \"bar    baz\""

Is there an obvious way to strip these spaces without running the
grammar over the file twice (once to strip spaces, the other to do the
actual parsing)?

Scott

P.S. If anyone is interested, I've already undertaken SQL parsing in
a ruby project named guillotine:

http://github.com/smtlaissezfaire/guillotine/tree/master

Clifford Heath

Sep 20, 2008, 3:25:12 AM
to treet...@googlegroups.com
On 20/09/2008, at 3:27 PM, Scott Taylor wrote:
> In a SQL grammar (as in most), spaces can generally be disregarded...

> rule select
> select_clause " "+
> from_clause ...
> end

Don't do that. Tab characters, newlines/CR, and comments are
all equivalently handled as white-space. Instead, define a rule
that skips white space. I use "s" for optional white-space, "S"
for mandatory white-space. See
<http://activefacts.rubyforge.org/svn/lib/activefacts/cql/LexicalRules.treetop>
for a sample set of lexical rules like this.

I also recommend that you define a rule for each keyword, with
each word terminated by !alphanumeric. This avoids the need to
use mandatory whitespace in most cases, which helps a lot when
you aren't sure whether the rule you called has skipped whitespace
already.

Adopt a policy of always skipping whitespace after rules, or before
rules, because that avoids doing it twice.
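
To make the advice above concrete, here is a minimal sketch in plain
Ruby, using the standard library's StringScanner rather than Treetop
itself. The names skip_s, keyword, identifier and parse_select are
hypothetical, the "--" comment syntax is an assumption, and the sketch
only recognizes "SELECT * FROM <table>". It tries to show three things:
white space (tabs, newlines and comments included) lives in one helper,
each keyword is matched with a negative lookahead for an alphanumeric,
and every token skips white space after itself.

```ruby
require 'strscan'

# Spaces, tabs, CR/LF and "--" line comments all count as white space.
WHITE = /(?:[ \t\r\n]+|--[^\n]*)+/

# Optional white space (the "s" rule): always succeeds.
def skip_s(scanner)
  scanner.skip(WHITE)
  true
end

# Match a keyword only when it is NOT followed by an alphanumeric,
# then skip trailing white space (the "skip after" policy).
def keyword(scanner, word)
  pattern = /#{Regexp.escape(word)}(?![A-Za-z0-9_])/i
  return false unless scanner.scan(pattern)
  skip_s(scanner)
end

# Match an identifier, then skip trailing white space.
def identifier(scanner)
  name = scanner.scan(/[A-Za-z_][A-Za-z0-9_]*/)
  skip_s(scanner) if name
  name
end

# Recognize "SELECT * FROM <table>"; return the table name or nil.
def parse_select(sql)
  scanner = StringScanner.new(sql)
  skip_s(scanner) # leading white space is skipped exactly once
  return nil unless keyword(scanner, 'SELECT')
  return nil unless scanner.scan(/\*/)
  skip_s(scanner)
  return nil unless keyword(scanner, 'FROM')
  table = identifier(scanner)
  table if table && scanner.eos?
end
```

With this shape, parse_select("SELECT*FROM users") succeeds without any
mandatory white space, while parse_select("SELECTED * FROM users") is
rejected because the keyword is followed by an alphanumeric.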

> "SELECT * FROM foo WHERE name = \"bar baz\""
> "SELECT * FROM foo WHERE name = \"bar    baz\""
>
> Is there an obvious way to strip these spaces without running the
> grammar over the file twice (once to strip spaces, the other to do the
> actual parsing)?

Not a good idea.

> P.S. If anyone is interested, I've already undertaken SQL parsing in
> a ruby project named guillotine:
>
> http://github.com/smtlaissezfaire/guillotine/tree/master

Cool!

Clifford Heath.

Scott Taylor

Sep 20, 2008, 3:43:26 AM
to treet...@googlegroups.com

On Sep 20, 2008, at 3:25 AM, Clifford Heath wrote:

>
> On 20/09/2008, at 3:27 PM, Scott Taylor wrote:
>> In a SQL grammar (as in most), spaces can generally be disregarded...
>> rule select
>> select_clause " "+
>> from_clause ...
>> end
>
> Don't do that. Tab characters, newlines/CR, and comments are
> all equivalently handled as white-space. Instead, define a rule
> that skips white space. I use "s" for optional white-space, "S"
> for mandatory white-space. See
> <http://activefacts.rubyforge.org/svn/lib/activefacts/cql/LexicalRules.treetop>

Sure. I've got a "space" rule, which could be easily generalized to
any sort of whitespace char, as you've outlined.

> for a sample set of lexical rules like this.
>
> I also recommend that you define a rule for each keyword, with
> each word terminated by !alphanumeric. This avoids the need to
> use mandatory whitespace in most cases, which helps a lot when
> you aren't sure whether the rule you called has skipped whitespace
> already.
>
> Adopt a policy of always skipping whitespace after rules, or before
> rules, because that avoids doing it twice.
>
>> "SELECT * FROM foo WHERE name = \"bar baz\""
>> "SELECT * FROM foo WHERE name = \"bar    baz\""
>>
>> Is there an obvious way to strip these spaces without running the
>> grammar over the file twice (once to strip spaces, the other to do
>> the actual parsing)?
>
> Not a good idea.

Well, maybe I should elaborate on the idea. As it stands, I have a
preparser class which strips \n's and \r's. It seems like it would
be much easier to have that preparser also replace runs of two or
more spaces with a single space. Of course, the problem is that this
would also collapse runs of spaces inside a string.

The idea I was suggesting is not to run the whole SQL grammar just to
find strings the first time around (that would be insane), but to use
a much simpler "string" parser: something which can tell the parts of
the input which *aren't* inside a string from those that *are*.
This seems to follow your mantra of "Adopt a policy of always skipping
whitespace after rules..." It also seems like it would be a more
efficient way of doing things.
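
A minimal sketch of that kind of string-aware pre-pass, in plain Ruby
with the standard library's StringScanner. The method name
squeeze_outside_strings is hypothetical, and the sketch assumes string
literals are single- or double-quoted with backslash escapes, as in the
examples above. It is one way to do it, not a tested part of guillotine.

```ruby
require 'strscan'

# Collapse each run of white space to a single space, except inside
# quoted string literals, which are copied through untouched.
# Assumption: literals use backslash escapes (e.g. \" inside "...").
STRING_LITERAL = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m

def squeeze_outside_strings(sql)
  scanner = StringScanner.new(sql)
  out = +''
  until scanner.eos?
    if (literal = scanner.scan(STRING_LITERAL))
      out << literal          # inside a string: keep every character
    elsif scanner.skip(/\s+/)
      out << ' '              # outside a string: one space per run
    else
      out << scanner.getch    # any other character is copied as-is
    end
  end
  out
end
```

This is a single linear pass over the input, so the full grammar still
runs only once afterwards. An unterminated quote is simply copied as an
ordinary character, which leaves error reporting to the real parser.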

>> P.S. If anyone is interested, I've already undertaken SQL parsing in
>> a ruby project named guillotine:
>>
>> http://github.com/smtlaissezfaire/guillotine/tree/master
>
> Cool!

I'd appreciate any feedback or help on this project, if you're
interested.

Best,

Scott Taylor
