
Script Parsing Part 1: The Way to NOT do it.


Alex K. Angelopoulos (MVP)

Dec 13, 2002, 6:50:00 PM
The VBScript parsing project I've been working on off and on for a few weeks
has finally led me to try what I _think_ is a state machine analysis method. In
the interest of documenting what else I tried, here's a commented expansion of my
notes on trying to work with code from a macroscopic viewpoint.

It's very ugly, and serves as a good explanation of why to NOT use this kind of
general technique. There are some odd discoveries and functions I built along
the way, but trying anything except for linear character parsing is probably a
dead end. There will be a follow-up on the topic of "state machine" analysis of
code.


1. INITIAL PURPOSE

Parsing VBScript code into logical elements automatically, so you can use code
to separate functions, comments, logical structures such as Case statements and
Do loops. This can also be used to:
+ extract strings and identifiers (variables and literal numeric values used
within the code).
+ do code substitution
+ look for coding problems automatically, such as undimensioned variables or
large blocks under error control without error check statements.

Translating this into something tangible for coders out there, this means the
following, besides just being able to pull out sequences of code blocks:
+ being able to correctly identify all keywords in the code - you have to be
able to tell what is and is not a string, a comment, an identifier, or an
operator.

Note that for uses such as "mining code", you WILL want to hang onto commented
text; this is an extra requirement which most lexers don't handle.

2. THE PROBLEM

There are hundreds if not thousands of tools for automatic code parsing.
Unfortunately, most of the tools are for C-family languages, and focus on
compiler processing for C code to _produce_ C code. Even the ones written in
BASIC variants are primarily for older non-Microsoft BASIC flavors, and are
always written in C anyway (all the ones I have found, at least). This isn't a
direct problem if your goal is to write a compiler or you at least have a good
background with C; unfortunately, neither currently applies to me.

There _are_ generic parsers completely wrapped up, some exposed to COM
automation, which can take a formal grammar developed for a language and then
use that to produce parsing code. This has a drawback in that there are no
formal public grammars available for VBScript or VB5/6. There is a semi-formal
one which is being improved for VB.NET available from Microsoft, but other
issues arise when attempting to use this grammar as a starting point:

- different grammar processors use different forms of BNF (Backus-Naur Form)
notation, which means a lot of processing is required to make a formal grammar
work with an arbitrary grammar compiler.
- VB.NET is significantly different from VBScript, to the point where I think I
would find it easier to write a grammar from scratch than to use even the VB.NET
version as a starting point.
- The syncretistic history of the BASIC languages has produced a modern syntax
which is hard to reduce to a formal grammar.

The end result is that it is probably easiest to start from scratch, which is
what I did.

3. MACROSCOPIC APPROACHES - WORK WITH LINES AND LINE CHUNKS

I'm going to cover several ad-hoc approaches I tried. As you will see, I had to
do a lot of backtracking. My initial reasons for choosing this general approach
were that I wanted to try to handle the code at as high a level as possible - in
other words, somewhere above the level of individual characters - and I wanted
to end up with simple, easy-to-understand rules for parsing.

3.1. General Regex Searches

One technique is to attempt to match constructs using regular expression
searches; I've had some success with abstracting portions of a script that way.
Nice things about this approach include:

+ you can jump right in and try it without doing pre-processing or tokenizing of
the code.
+ It is fairly fast, since processing is handed off to the regex search in
native code.

Bad things:
- there always seem to be exceptions to the rules in VBScript (a fact of life
due to historic compatibility). This means that you may have it working fine
for dozens of scripts, then you suddenly find that it chokes on an obscure usage
someone has.
- the code produced is very spaghetti-like, depends a lot on entry/exit values,
and is dependent on obscure and complex regex patterns (I had at least 3
where I found myself looking at the $9 replacement parameter). Also, in the
limiting case of large code and more complex patterns, search time will usually
grow exponentially.

3.2. Preprocessing Via Regex

This is not a solution itself, but a trick to make life simpler; it simply
involves using regular expression search-and-replace on the raw code to
initially remove material which is irrelevant to any parsing or execution and
get the remainder into a "normal" form. For example, VBScript:

+ ignores empty lines, and will accept either vbCr (also known as 0x0D, Chr(13))
or vbLf (also known as 0x0A, Chr(10)) as a line terminator *(see note 1 below).
A regex which searches for vbLf or vbCr or both with any intermixed whitespace
(vbTab, space, vbFormFeed, vbVerticalTab) and then replaces it with a single
instance of vbCr can make code much more regular and also significantly reduce
the size*(see note 2 below).
+ ignores whitespace at the beginning or end of lines.
+ treats all internal whitespace sequences as single occurrences of whitespace,
and does not require whitespace between operators.

A routine like the one outlined in the notes at bottom can indeed be used to
reduce the code significantly and create what I am calling a "line normal" form
of the code output. It looks like a "block'o'code", but is much cleaner in a
machine-parsing sense.
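As a sketch of this cleanup pass (Python here, purely for illustration; the VBScript routine itself is in note 2 at the bottom):

```python
import re

# Illustrative "line normal form" pass: collapse every run of line endings,
# along with any whitespace around them, into a single carriage return, then
# trim whitespace from both ends of the whole block.
def line_normal_form(code):
    code = re.sub(r"\s*[\r\n]+\s*", "\r", code)
    return code.strip()
```

After this pass, every remaining line starts and ends with a non-whitespace character, and empty lines are gone.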

- one problem is that as soon as you attempt to apply regular expression
substitutions without anchoring them to a line start or end, you run the risk
of grabbing whitespace within quoted text. You CAN use some other tricks with
this, such as processing code line-by-line and only modifying lines which have
no literal quotes within them, but if you are going to go to that level of
granularity you will need to check all lines anyway, so this won't do anything
but give you bulkier code; you will have to do a "complete" solution somewhere
else which takes the lines with quotes into account.

So basically, this gives you a way of trimming the code into a set of only lines
which have actual code or comments, and ensure that each line begins and ends
with a non-whitespace character. The granularity of the cleanup is limited to
the level of physical lines (as distinct from logical statements, which may be
concatenated onto a single line with ":" or split across lines with "_"). There
are some other tricks which can be applied at this point, though, allowing
cleanup within lines.

3.3. Intra-Line Preprocessing: String Functions and Regexes

At this point, we probably have to deal with code lines to try to normalize
code. In order to start extracting tokens/lexemes (logical "words" if you will,
such as "function", "Loop", "MyVar") we have to identify whether or not
something is inside quotes or comes after a comment marker.

Our first step is to split the code into physical lines. This is relatively
simple; to get an array of lines from our code block CodeString we use:

lines = Split(CodeString, vbCr)

Now we need to worry about comments, line joins, and strings within each line.
How do we do that?

There are tricks for cleanup this way. As you will see, though, they wind up
requiring the same steps repeated continually to produce order from the chaos of
real code.

One high-level trick for simplifying this is to do another split within a loop
that handles each line: split on the quote marks, like this:

q = Chr(34)
for i = 0 to UBound(lines)
segments = Split(lines(i), q)
' do stuff
next

There are two interesting points about this.

The first is that if you look at the way strings are escaped in VBScript, you
will quickly see that all even-index segments (segments(0), segments(2), and so
on) are _outside_ strings, and all odd segments (segments(1), segments(3), etc)
are _inside_ strings. This means that you can safely assume your even segments
are raw text.

The second point, if you wish to exploit it, is that since you MUST have an even
number of literal quotes in well-formed code lines, you will have an odd number
of total segments (and an even number for UBound(segments) - 0, 2, 4...). That
only applies if you don't have any comment characters in the line, of course;
the following is a perfectly legal line, since the only " occurs after the
comment starts:

Const q = Chr(34) ' q is a single " character
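Both points can be checked with a quick sketch (Python here, purely to illustrate the splitting trick):

```python
# Splitting a well-formed, comment-free code line on the quote character:
# even-index segments are outside string literals, odd-index segments are
# inside them, and the total segment count is always odd.
line = 'MsgBox "hello" & x & "world"'
segments = line.split('"')

outside = segments[0::2]   # raw code text, safe to process
inside = segments[1::2]    # string-literal contents, hands off
print(len(segments) % 2)   # 1 -- an odd number of segments
```

As noted above, this only holds when the line has no ' outside the quotes; a comment can legally contain any number of unmatched quote characters.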

In any case, you can now iterate over the even-indexed segments without fear of
affecting literal strings. Unfortunately, there are still some problems with
this approach: if you don't want to mess up comment formatting, you need to
isolate comments; and if you want to physically join logical statements split
across physical lines with '_', or physically split logical statements joined
onto single physical lines with ':', you otherwise need some extensive
preprocessing. That gets messy, though. The simplest technique is
probably the one I used, which is to try a split on each segment using ' as the
delimiter. If you get more than one subsegment back, you have encountered a
comment. This doesn't account for "REM" either, but since REM has to be the
first statement on a logical line, we have a simple fix. To make the model
simple to follow, though, we need an entire layer of processing to refactor the
code into physical lines so that they are equivalent to logical lines. Since we
obviously don't want to split lines at occurrences of ":", "'", or "REM" inside
literal strings, the pre-processing was a necessity.

What we do is either perform an InStr or a character-by-character search for :
and '. Character processing is easiest at this point. When we encounter :, we
replace it with vbCr to give ourselves logical lines. When we encounter ', we
replace it with vbCr & "'", and stop processing.

This neglects one thing: REM statements. To handle these correctly, the
simplest thing to do is this. At the beginning of the first segment, check not
only for : and ', but also check for "REM" in any case, followed by a whitespace
character. We also need to recheck for REM anytime we do a vbCr substitution for
:. The easiest way is to have a case-insensitive regex running, and check for
"^\s*(REM)\s+" (translates into string start followed by zero or more whitespace
characters, followed by REM, followed by at least one whitespace character).
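That character pass can be sketched like this (Python for illustration; the function name is my own, not from any real implementation):

```python
import re

# Within a segment known to be outside any string literal: ":" becomes a
# line break, "'" starts a comment (break the line, keep the rest, stop
# scanning), and a leading REM on any resulting line is rewritten as '.
REM_RE = re.compile(r"^\s*rem\s", re.IGNORECASE)

def normalize_segment(seg):
    out = []
    for i, ch in enumerate(seg):
        if ch == "'":
            # push the comment onto its own line and stop processing
            out.append("\r'" + seg[i + 1:])
            break
        out.append("\r" if ch == ":" else ch)
    text = "".join(out)
    # recheck for REM on every logical line created by the substitutions
    return "\r".join(REM_RE.sub("' ", ln) for ln in text.split("\r"))
```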

Once finished with the segments, we rejoin them with the quote character, and
any vbCr occurrences we inserted are rendered inline to the code automatically.
If we then rejoin all the lines with vbCr, we no longer have any cases of
multiple logical statements embedded within a single line of code, and have all
remarks split out onto their own lines. We do have two dirty details left,
though...

First, we may have excess whitespace at the end or beginning of lines; second,
we have done nothing about logical statements split across physical lines.

The fix for this is to take another pass through the data with a regex; for ease
of use, just running it through the LineNormalForm function in Note 2 will
suffice.

Now we can split the data into lines with vbCr (again!) and this time iterate
through them from the last line to the second. When we are on line i, if the
last character of line i-1 is "_", and line i-1 neither begins with ' nor
matches the "^REM\s" regex, then we chop the trailing "_" from line i-1, append
line i to it, and empty line i. Like this:

set rx = new regexp
rx.ignorecase = true
rx.pattern = "^rem\s"
lines = Split(CodeString, vbCr)
For i = UBound(lines) To 1 Step -1
If Right(lines(i-1), 1) = "_" Then
If Left(lines(i-1), 1) <> "'" Then
If Not rx.Test(lines(i-1)) Then
' chop the trailing "_" from line i-1 and pull line i up onto it
lines(i-1) = Left(lines(i-1), Len(lines(i-1)) - 1) & lines(i)
lines(i) = vbNullString
Else
' I HATE rem... change it out
lines(i-1) = "' " & Mid(lines(i-1), 4)
End If
End If
End If
Next

And, of course, we have one remaining quirk. There may be empty lines, so again
we call the LineNormalForm function. Note that I kill off the "rem" keyword
here. Why? Because it's a mess to track.

We've also cheated here; some code which is not legal VBScript will be accepted
by this process. For example, if a line continuation were followed inline by a
colon, or if a line continuation were followed by two blank lines and then the
remainder of the statement, the VBScript compiler would choke on it; this
process will take such code in and clean it. Since we are technically
preprocessing code, that could be seen as an advantage, although it tells us
that our method of doing things is not like the process the VBScript compiler
uses.

Now we can try cleaning up and extracting tokens. We resplit the lines again.
If the leftmost character of a line is not ', then we know we don't have any
comments, so we can split each line into segments at the quotes (again).

It is now child's play to normalize inline text. We can process the even
segments once more, only this time we can do a regex search for "\s+" and
replace it with " ". This gets rid of any bizarre whitespace and allows us to
do splits with " " to find tokens - although they may have operators stuck to
them. An easy fix for that is to do a regex replace beforehand for any possible
operator character, replacing it with itself surrounded by spaces. In other
words, look for:
"(=|,|\^|/|&|\(|\)|\\|\*|\+|>|<)"
and replace it with
" $1 "
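A quick sketch of that operator-padding pass (Python here, for illustration only):

```python
import re

# Pad every operator character with spaces, collapse runs of whitespace,
# and a plain split(" ") then yields one token per element.
OP_RE = re.compile(r"(=|,|\^|/|&|\(|\)|\\|\*|\+|>|<)")

code = "x=(a+b)*c"
padded = OP_RE.sub(r" \1 ", code)
tokens = re.sub(r"\s+", " ", padded).strip().split(" ")
print(tokens)   # ['x', '=', '(', 'a', '+', 'b', ')', '*', 'c']
```

Note that this, too, may only be applied safely to the even-index segments; inside a string literal these characters are just text.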

At this point, we can implement functions to do things such as look for keyword
occurrences and counts, extract strings, find embedded literals, and so on...
but we're not going to. We're not going to because this is a godawfully ugly,
unstable hack.

Ugly comes in several flavors, and this solution provides a smorgasbord version
of ugly.

To be fair, it does indeed give us some control over code and allows us to do
basic cleanup; it is also an incrementally constructed approach, and some of the
sub-processes are useful for other things as well or are interesting in their
own right. Unfortunately, that's where the good part stops. Problems include:

- lots of highly specialized steps repeated with just enough variations to make
them hard to abstract into general processes.
- Ad-hoc attempts to use flat regex and string-function search and replace to
handle fairly complex state-dependent code introduce reliability questions.
- The final code we have, although capable of being reformatted for display, has
not been processed into a form making it easier to deal with as an abstract
entity. All we have done is reformat the raw text, and we can expect it to
break under certain circumstances. Those circumstances include:
- ignoring comment lines - we need code to do so in every attempt to abstract
anything.
- if we have multiple adjacent quotes, we have a mess when it comes to
abstracting strings; for example,
path = """C:\Program Files\MyApp\app.exe"""
is rendered as the following if we do a split at the quotes and look at the
segments:

segment(0) = "path = "
segment(1) = ""
segment(2) = ""
segment(3) = "C:\Program Files\MyApp\app.exe"
segment(4) = ""
segment(5) = ""
segment(6) = ""

Sure, we can implement some logic checking for that, but it all adds up to
advanced low-tech adhockery.

What ultimately nails this down as a bad way to go is to look at the motivations
I had for trying this approach. I didn't want to touch characters, and I wanted
simple relationships.

Obviously, this DID need to drop down to character analysis, as shown in
techniques I had to implement in the core loops; and it is very erratic
analysis, as well, with all sorts of iffy assumptions. Furthermore, the parsing
techniques I had to use in attempting to avoid character-handling look ludicrous
in retrospect; they are incredibly confusing, involving multi-layered arrays and
split-join-regex loops repeated ad nauseam.

The alternate approach is to assume we have to deal with characters from the
beginning, and work from there, which is what I will be covering in the next
note on this.

Although it requires starting from scratch, using that technique has some
benefits. First, a character array can be easily iterated and manipulated;
second, from the viewpoint of the scripting engines, a character is the
equivalent of an atom in chemistry: it is a fundamental element which has
specific roles in specific situations. Those roles may appear complex, but the
unit is never "broken" and the complexities can indeed be reduced to specific
relationships which can then be integrated to produce complex behavior from a
simple basic structure.

==============================
NOTES
(1) People usually don't worry about precise definitions of "whitespace" in
VBScript. The reason is that it works very simply if you are on a pure Windows
system; you may have problems with code which has been saved in off-the-wall
editing environments, though. Here's a quick summary.
+ VBScript interprets either a carriage return or a line feed as a _physical_
line termination. The vbNewLine constant is identical to the two-character
vbCrLf, which is Chr(13) and Chr(10) stuck together: carriage return followed by
a line feed, just like the end of a line on an old typewriter. Windows editing
tools use this as the default line ending.
+ All other whitespace characters are interpreted as a space. This includes tab
and some symbols which are not usable on Windows systems: formfeed and
vertical tab specifically.

(2) Although the WSH documentation claims otherwise, vbLf is NOT recognized as a
line termination by VBScript.Regexp; this appears to have been a minor confusion
by the tech writers. To effectively exploit the use of anchors in multiline
mode use of a regex, you want to have vbCr characters as line terminations.
Below is a "line normalization" routine I wrote; although "named" as though it
were for parsing VBScript code in general, the technique is applicable to any
text file. It strips out empty lines, changes line-endings to single-character
carriage returns, and then removes all whitespace at the beginning and the end
of lines.

Note that (as documented) the regex special-character escape for a linefeed is
\n; for a carriage return, it is \r.

Function LineNormalForm(VBScriptCode)
' takes arbitrary VBScript code as an argument string
' returns with lead/tail whitespace removed from all lines
' empty lines removed
' all line endings changed to carriage returns
Dim rx, sData
set rx = new regexp
rx.multiline=true
rx.global=true
rx.pattern = "(\s)*(\r|\n)+(\s)*"
sData = rx.Replace(VBScriptCode, vbCr)
rx.global = false
rx.multiline = false
rx.pattern = "^\s+"
sData = rx.Replace(sData, "")
rx.pattern = "\s+$"
LineNormalForm = rx.Replace(sData, "")
end Function


Paul Randall

Dec 13, 2002, 8:56:02 PM

"Alex K. Angelopoulos (MVP)" <a...@mvps.org> wrote in message news:uKadWJwoCHA.2276@TK2MSFTNGP09...

> The VBScript parsing project I've been working with off-and-on for a few weeks

I once looked at the source code for a FORTRAN Tidy program which had been modified to work with the many handy extensions this
FORTRAN implementation had, and tried to figure out what it was doing. Way too complex for me. But I found its output very useful,
especially when trying to use other people's code.

I admire what you have done. It would take me two years to do that much.

-Paul Randall


Alex K. Angelopoulos (MVP)

Dec 13, 2002, 9:14:45 PM
You would be surprised how easy it can be. With a little bit of facility with
the language, if you sit down and just TRY to make a parser, you learn a lot.
Most of it by doing things that don't work.

--
Please respond in the newsgroup so everyone may benefit.
http://dev.remotenetworktechnology.com
----------
Subscribe to Microsoft's Security Bulletins:
http://www.microsoft.com/technet/security/bulletin/notify.asp


"Paul Randall" <paul...@cableone.net> wrote in message
news:e$797OxoCHA.2412@TK2MSFTNGP12...

Curtis Anderson

Dec 13, 2002, 11:48:22 PM
Alex K. Angelopoulos (MVP) wrote:
<SNIP very interesting article>

Wow.
Can anyone here tell that Alex part-times as a tech-content writer? <G>
Seriously, nice work. And some rather interesting observations. Same as
Paul noted, it would take me aeons to do something like this!

--
Curtis Anderson
The poster formerly known as Ned Flanders
Who has (finally) changed his GECOS fields.


Steve Fulton

Dec 14, 2002, 9:21:53 AM
Alex K. Angelopoulos (MVP) wrote:

[ Interesting article snipped. ]

> (2) Although the WSH documentation claims otherwise, vbLf is NOT recognized
> as a line termination by VBScript.Regexp; this appears to have been a minor
> confusion by the tech writers. To effectively exploit the use of anchors in
> multiline mode use of a regex, you want to have vbCr characters as line
> terminations.

vbLF and vbCR are /both/ recognized as a line termination by VBScript.Regexp.
The difference between them is how they match the "." wildcard; vbLF doesn't
match it and vbCR does.

The first condition means that you have to be wary of vbCRLF if your RegExp is
in multiline mode--every actual line ends in an "empty" line since the pattern
"^$" will use the CR as the terminator of one line and the LF as the terminator
of the next.

The second condition gives you a way to emulate Perl regex's /s modifier.
Replace vbCRLF with vbCR and ".*" will span line breaks; replace it with vbLF
and ".*" won't span line breaks.

sCRLF = "This is an example." & vbCRLF & "This is another example."
sLF = Replace(sCRLF, vbCRLF, vbLF)
sCR = Replace(sCRLF, vbCRLF, vbCR)

Test sCRLF, "vbCRLF" : Test sLF, "vbLF" : Test sCR, "vbCR"

Sub Test(str, sep)
Dim out(4)

With New RegExp
.Pattern = "^$"
.Multiline = True
.Global = True
out(0) = .Execute(str).Count & " empty line(s) found in first test."

.Pattern = "example\.$"
.Multiline = True
.Global = True
out(2) = .Execute(str).Count & " match(es) found in second test."

.Pattern = "an.*example"
.Multiline = False
.Global = False
out(4) = """" & .Execute(str)(0) & """ found in third test."
End With

MsgBox Join(out, vbNewLine), vbInformation, sep
End Sub

--
Insanity: doing the same thing over and over again and expecting different
results. -Albert Einstein

=-=-=
Steve
-=-=-


Alex K. Angelopoulos (MVP)

Dec 14, 2002, 1:43:51 PM
"Steve Fulton" <cerbe...@hotmail.com> wrote in message
news:#tZ23x3oCHA.1624@TK2MSFTNGP12...

> Alex K. Angelopoulos (MVP) wrote:
>
> [ Interesting article snipped. ]
>
> > (2) Although the WSH documentation claims otherwise, vbLf is NOT recognized
> > as a line termination by VBScript.Regexp; this appears to have been a minor
> > confusion by the tech writers. To effectively exploit the use of anchors in
> > multiline mode use of a regex, you want to have vbCr characters as line
> > terminations.
>
> vbLF and vbCR are /both/ recognized as a line termination by VBScript.Regexp.
> The difference between them is how they match the "." wildcard; vbLF doesn't
> match it and vbCR does.

That explains one set of inconsistent results I got early on.... I think I
missed this by having done all my explicit substitutions to vbCr, and I
incorrectly recalled the post you made a few weeks ago mentioning the
documentation bug about ".".

> The first condition means that you have to be wary of vbCRLF if your RegExp is
> in multiline mode--every actual line ends in an "empty" line since the pattern
> "^$" will use the CR as the terminator of one line and the LF as the terminator
> of the next.
>
> The second condition gives you a way to emulate Perl regex's /s modifier.
> Replace vbCRLF with vbCR and ".*" will span line breaks; replace it with vbLF
> and ".*" won't span line breaks.

So it's purely a documentation bug - this is intentional, it appears!


Alex K. Angelopoulos (MVP)

Dec 14, 2002, 2:18:01 PM
Trying a State-Based Approach

Note - this is heavily influenced by my lack of any formal background with
computational state machines, as well as some vague memories of engineering
control systems.

1.1 What's a state machine?

Conceptually, it is something that you hand input to for processing. It decides
what to do with the input based not only on what the input is, but also on the
current state of the machine. The state machine then has the task of deciding
how to handle the input; presumably, you define some things that it provides as
output, based not on what you give it now, but on what it sees as a completed
task. The correct term, I believe, is to say that the state machine "emits"
output - it doesn't return it; it puts it somewhere and you can pick it all up
at some point.

1.2 Motivation for a State Machine

State machines are actually pretty easy to write once you get the concept in
action, so let me motivate it by showing an example which also will show you how
a state machine will work. As you will see, this is best done within a class or
component.

Let's start by asking this: what does the character ' signify in a line of
VBScript code?

The correct answer is, "it depends." If you encounter it in a string of
characters like this:

x = 1 ' x is a constant

then the ' signifies that you are entering a comment (or going into a commented
STATE). If it is instead encountered like this

x = "\|/|'|&"

it is nothing but another character in a string - when you are in a string
STATE, the ' does nothing.

Suppose we could identify a set of "states" that occur within VBScript code, and
knew exactly what would happen for any character encountered in any of those
states - whether it would change the state or not, and whether it would be added
to the "product" for that state. If we implemented script logic for all of
those states correctly, then all we need to do is feed characters to the code,
and tell it when we're done. The code could automatically break everything down
for us! Based on what it was getting, it could drop things that didn't matter,
hang on to things it needed later, add important characters to the current token
it is building, and stack the tokens into an output pile.

Moreover, when we need to deal with complicated problems such as required
look-ahead values, we can let the machine handle it without worrying about
excessively tangled logic.

2 WHAT ARE VBSCRIPT STATES?

We can figure this out almost instinctively. Let's say you're a very careful,
character-by-character reader, and you are reading a script.

When the script is opened, you have no preconception about what anything means;
you are in your initial, innocent state. Ultimately you are going to wind up
with lots of logical statements each composed of logical tokens, but for now,
nothing.

I call this the Ground State or State 0 - or s0 for short. Using the Ground
State as our starting point, let's look at all of the possible things you can
encounter. Some will have to be processed, some will be ignored, and some will
prompt you to change states.

Once we list all of the states which s0 can be transitioned into, we can start
looking at how each of them handles input, and consequently what states they can
transition into. When we are done listing all of the states and state
transitions, we have completely outlined the logic required to implement a
VBScript state machine.

Depending on your objectives, there may be different states you define for
VBScript, but when you get done you will have a set of states that appears
fairly similar to names that occur in formal grammars, normally described with
all sorts of BNF grammar notation. BNF grammars may look complex (for good
reason - they are) but the underlying concepts are not. Here are the basic
states I have (simplified slightly):

ground/state 0 - start this way, and is the state of any whitespace except CR or
LF
string - inside a string literal
date - inside a date-time literal
identifier - a catchall for words and numbers encountered
operators - essentially any non-alphanumeric symbols encountered other than
these .:'"_#
remark - obviously, anything inside a comment is in the remark state
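The transitions among a few of those states can be sketched as a function (Python here, purely for illustration; the names and the simplification to three states are mine):

```python
# Minimal transition sketch: given the current state and one input
# character, return the next state.
def next_state(state, ch):
    if state == "remark":
        return "ground" if ch == "\r" else "remark"   # comments end at end of line
    if state == "string":
        return "ground" if ch == '"' else "string"    # a " closes the string
    # ground state: ' opens a comment, " opens a string
    if ch == "'":
        return "remark"
    if ch == '"':
        return "string"
    return "ground"

# the ' inside the quotes does nothing; the later ' starts a comment
state = "ground"
for ch in "x = \"a'b\" ' comment":
    state = next_state(state, ch)
print(state)   # remark
```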

3 HOW DOES THIS HELP?

The most obvious answer is that looking at the code stream you are processing in
terms of its _state_ and its _input_ makes for very simple logic flow in code.

If your state is being inside a remark, for example, you eat every character you
run into unless it is a newline; at that point, you transition to ground state.
Every character input event has associated output that is as simple as:
+ what happens to my state?
+ what do I do with my input character?

In simplest form, you can implement a state machine as a set of Select Case
statements. A very crude one would involve nested cases; the top-level case
statement would control flow to a state selection, and inside each
state-specific Case clause you would have another set of Select Case statements
defining what happens with each possible type of character.

That may sound messy, but it isn't too bad. Consider an InsideString case. If
you run into a ", that means the string ends. If you run into a newline, you
have bad syntax, so you should probably exit on error. Anything else at all is
a valid string character.

There are some details which keep this from being a no-brainer, of course. For
example, encountering a " while inside a string isn't sufficient to tell you the
string ends - if the next subsequent character is a " as well, you have actually
encountered an escaped ". That will be part of the next discussion I post
(hopefully with decent code accompanying it).
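The escaped-quote rule itself can be sketched ahead of that discussion (Python for illustration; scan_string is my own name, not code from the parser):

```python
# Inside a string literal, "" is one literal quote character and a lone "
# closes the string.  s is the full line; i indexes the first character
# after the opening quote.  Returns the string's text and the index just
# past its closing quote.
def scan_string(s, i):
    out = []
    while i < len(s):
        if s[i] == '"':
            if i + 1 < len(s) and s[i + 1] == '"':
                out.append('"')            # "" -> escaped quote, keep going
                i += 2
                continue
            return "".join(out), i + 1     # lone " ends the string
        out.append(s[i])
        i += 1
    raise ValueError("unterminated string literal")
```

This is exactly the one-character look-ahead case a state machine handles cleanly: the decision at a " depends on the next input.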


Alex K. Angelopoulos (MVP)

Dec 17, 2002, 2:39:22 PM
This is probably the last comment I will post about this for a while; this part
documents some simple ways to run a state machine and includes a script which
can be used as a beginning parser for VBScript. The script is still somewhat
buggy, but if given a file containing VBScript as an argument, it will print out
the tokens it finds along with their "type", where type is an identifier,
operator, or string literal.

1. HOW IT WORKS


1.1 Simple Example - The "This is Not A" State Machine

A really simple process would be one you could use for processing data from a
line of text that you want to break at the spaces. You read from the start of
the string, holding on to all the characters; when you hit a space, you turn all
the prior characters into a single token and go on. In the example following
here, this would take the string "This is a test string" and parse it into the
following tokens:
This,is,a,test,string

This is essentially a degenerate state machine: it has only one state. This
makes it easier to see what things are in a state machine definition, though.
Here's the script:

'DegenerateStateMachine.vbs
s = "This is a test string"
Dim words(): Redim words(-1)
tmpString = ""
For i = 1 to Len(s)
currentChar = Mid(s, i, 1)
If currentChar <> " " Then
tmpString = tmpString & currentChar
Else
Redim Preserve words(UBound(words)+1)
words(UBound(words)) = tmpString
tmpString = ""
End If
Next
' don't forget the final word left in the buffer
If Len(tmpString) > 0 Then
Redim Preserve words(UBound(words)+1)
words(UBound(words)) = tmpString
End If
For i = 0 to UBound(words)
WScript.Echo words(i)
Next


Our inputs are the stream of characters in the string named s. We then take
actions based on the input.

If the input is not a space, the action we take is to add the input character to
a temporary string.

If the input IS a space, we take a whole series of actions.
A space signifies that what we have in tmpString is a word, so we want to save
the tmpString value. We make the "words" array larger, put the temp string into
its new element, and then reset the temp string.

This is pretty boring, but it gets the idea across; what I have described is
the set of actions a state takes based on its input. The machine just never
changes state, so the way it handles input is always the same.
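For comparison, the same one-state tokenizer is easy to sketch in Python. This
is an illustrative translation of the VBScript above, not part of the original
script; the function name split_words is mine.

```python
def split_words(s):
    words = []
    tmp = ""
    for ch in s:
        if ch != " ":
            tmp += ch          # accumulate characters of the current word
        else:
            words.append(tmp)  # a space ends the word in progress
            tmp = ""
    words.append(tmp)          # emit the final word (no trailing space)
    return words

print(split_words("This is a test string"))
```

As in the VBScript version, the final word has to be emitted after the loop,
since no trailing space arrives to trigger it.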

1.2 Slightly Complex: Two States

Now let's look at the same example, but as a two-state machine. Those states
are InGround and InComment.

When InGround is our state, if we hit a regular letter, we add it to our temp
string. If we hit a space, we say all the characters so far are a word, and go
on. If we hit a ', however, we call that the start of a comment. We turn
everything we had so far into a word, and from then on ignore any input.

Why is this a state machine? Because the same input is handled differently
depending on our state. If we hit a space in the commented area, we don't
assume it's a new word.

Here's the example from 1.1 adapted as a simple two-state machine:

'TwoStateMachine.vbs

s = "This is a test string'with what looks like a comment"
Public words()
Redim words(-1)
tmpString = ""
InGround = 0
InComment = 1
state = InGround

For i = 1 To Len(s)
    currentChar = Mid(s, i, 1)
    Select Case True
        Case state = InGround
            Select Case True
                Case currentChar = "'"
                    EmitWord
                    state = InComment
                Case currentChar = " "
                    EmitWord
                Case Else
                    tmpString = tmpString & currentChar
            End Select
        Case state = InComment
            ' we do nothing at all - throw this away
    End Select
Next

For i = 0 To UBound(words)
    WScript.Echo words(i)
Next

Sub EmitWord
    Redim Preserve words(UBound(words) + 1)
    words(UBound(words)) = tmpString
    tmpString = ""
End Sub
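Here is a minimal Python sketch of the same two-state machine, again as an
illustrative translation rather than the original code; the state names and
the function name tokenize_line are mine.

```python
# Two states: IN_GROUND collects word characters; IN_COMMENT
# silently discards everything after a single quote.
IN_GROUND, IN_COMMENT = 0, 1

def tokenize_line(s):
    words = []
    tmp = ""
    state = IN_GROUND
    for ch in s:
        if state == IN_GROUND:
            if ch == "'":
                words.append(tmp)    # emit the word in progress...
                tmp = ""
                state = IN_COMMENT   # ...and start ignoring input
            elif ch == " ":
                words.append(tmp)    # a space ends the current word
                tmp = ""
            else:
                tmp += ch
        else:
            pass  # IN_COMMENT: same inputs, different (non-)action
    return words

print(tokenize_line("This is a test string'with what looks like a comment"))
```

The point of the exercise survives translation: the space and quote characters
are handled one way in IN_GROUND and ignored entirely in IN_COMMENT.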

'==============================

2. A DETERMINISTIC FINITE STATE MACHINE VBSCRIPT PARSER

With minimal commenting, here is the last working version of my simplistic
VBScript parser. It is crude, but it works. Run it with a script path
supplied as an argument (and use CScript as the host - this will echo MANY
tokens to the screen). It will parse the file into tokens, and when done will
echo each one along with what the script identified it as.

The token types I list are oversimplified, and certain elements (such as REM and
raw date strings) are not handled yet; nevertheless, it works pretty well.


' FsmVBScriptParser.vbs

' ----- SECTION 1 - INITIALIZATION -----
' Just initialization info; not relevant to a direct understanding
' of state machines. Conceptually, what is important is:
' + Variables for the token list, input, WorkingToken are made public
' so all the procedures can see them
' + The script's command-line argument is a file containing VBScript code;
' the file is read and turned into an array of characters

Const ground = 0
Const InComment = 1
Const InToken = 2
Const InString = 3
Const t_string = "string"
Const t_operator = "operator"
Const t_identifier = "identifier"
Public q : q = Chr(34)
Public state, atoms, pointer, output(), tokens(), WorkingToken, tokentypes()
Public WhiteSpace, Operators, EOL, WS
Operators = "+-!^*()-=\/<>," ' not precisely true, but it works
redim tokens(-1)
redim tokentypes(-1)
state = ground
pointer = 0
WhiteSpace = vbTab & vbFormFeed & vbVerticalTab & " "
EOL = vbCrLf
WS = WhiteSpace & EOL
script = WScript.Arguments(0)
atoms = StringToArray(LineNormalForm(ReadFile(script)))

' ----- END SECTION 1 -----


' ----- SECTION 2 - OUR STATE MACHINE LOOP -----
Do While pointer <= UBound(atoms)
    ' this is our really crude processing loop for the
    ' input or "commands" (the characters coming in);
    ' it directs execution to the appropriate state module
    ' while there is input to handle
    ' WScript.Echo pointer, Escape(atoms(pointer))
    Select Case True 'Select current state
        Case state = ground
            DoGroundState
        Case state = InComment
            DoInCommentState
        Case state = InToken
            DoInIdentifierState
        Case state = InString
            DoInStringState
        Case Else
    End Select
    ' note that this main loop advances to the next atom
    ' automatically so the state procedures don't need to handle it.
    pointer = pointer + 1
Loop
' ----- END SECTION 2 -----

' Output
For i = 0 to Ubound(tokens)
WScript.Echo tokens(i), tokentypes(i)
Next

' ==============================================
' State machine routines
' ==============================================

Sub DoGroundState
    Select Case True ' select based on the input value
        Case CBool(InStr(WS, atoms(pointer)))
            ' whitespace is irrelevant; we do nothing
        Case CBool(InStr(Operators, atoms(pointer)))
            ' this is a pseudo-state: we process the operator as a
            ' one-character token, but remain in the ground state
            WorkingToken = atoms(pointer)
            EmitToken t_operator
        Case atoms(pointer) = "'"
            ' this is the start of a comment,
            ' so we drop the input and change state to InComment
            state = InComment
        Case atoms(pointer) = q
            WorkingToken = atoms(pointer)
            state = InString
        Case Else
            WorkingToken = atoms(pointer)
            state = InToken
    End Select
End Sub

Sub DoInCommentState
    ' simple "drop the data" approach to comments: discard input
    ' until an end-of-line character returns us to the ground state
    Select Case True
        Case CBool(InStr(EOL, atoms(pointer)))
            state = ground
        Case Else
            ' do nothing since this version doesn't preserve comments
    End Select
End Sub

Sub DoInIdentifierState
    Select Case True
        Case CBool(InStr(WS, atoms(pointer)))
            EmitToken t_identifier
            state = ground
        Case CBool(InStr(Operators, atoms(pointer)))
            ' the operator ends the identifier and is a token itself
            EmitToken t_identifier
            WorkingToken = atoms(pointer)
            EmitToken t_operator
            state = ground
        Case atoms(pointer) = "'"
            EmitToken t_identifier
            state = InComment
        Case Else
            ' add the current character to the working token
            WorkingToken = WorkingToken & atoms(pointer)
    End Select
End Sub

Sub DoInStringState
    WorkingToken = WorkingToken & atoms(pointer)
    ' Since quotes are embedded by doubling them, we only end the
    ' string state on a quote if the next atom is not also a quote
    If atoms(pointer) = q Then
        Select Case True
            Case pointer = UBound(atoms)
                ' end of input; close the string
                EmitToken t_string
                state = ground
            Case atoms(pointer + 1) = q
                ' a doubled quote is an embedded quote character;
                ' add the second quote to the token and skip past it
                ' so it cannot end the string
                WorkingToken = WorkingToken & atoms(pointer + 1)
                pointer = pointer + 1
            Case Else
                ' for anything else we emit and drop to ground
                EmitToken t_string
                state = ground
        End Select
    End If
End Sub

Sub EmitToken(tokentype)
    ' resize the token and tokentype arrays in step,
    ' keeping the types indexed against the tokens
    Redim Preserve tokens(UBound(tokens) + 1)
    Redim Preserve tokentypes(UBound(tokentypes) + 1)
    ' add current token and type to the lists
    tokens(UBound(tokens)) = WorkingToken
    tokentypes(UBound(tokentypes)) = tokentype
    WorkingToken = ""
End Sub

' ==============================================
' Generic support routines
' ==============================================

Function StringToArray(sData)
    ' takes a single string of arbitrary length and
    ' returns a 0-based array, 1 character per element
    Dim aTmp()
    Redim aTmp(Len(sData) - 1)
    For i = 0 To UBound(aTmp)
        aTmp(i) = Mid(sData, 1 + i, 1)
    Next
    StringToArray = aTmp
End Function

Function LineNormalForm(ByVal sData)
    ' takes an arbitrary multiline string as an argument and
    ' returns it with lead/tail whitespace removed from all lines,
    ' empty lines removed, and all line endings changed to
    ' carriage returns
    Dim rx
    Set rx = New RegExp
    rx.MultiLine = True
    rx.Global = True
    rx.Pattern = "(\s)*(\r|\n)+(\s)*"
    sData = rx.Replace(sData, vbCr)
    rx.Global = False
    rx.MultiLine = False
    rx.Pattern = "^\s+"
    sData = rx.Replace(sData, "")
    rx.Pattern = "\s+$"
    LineNormalForm = rx.Replace(sData, "")
End Function

Function ReadFile(FilePath)
    ' Given the path to a file, returns the entire contents;
    ' works with either ANSI or Unicode
    Dim FSO, CurrentFile
    Const ForReading = 1, TristateUseDefault = -2
    Set FSO = CreateObject("Scripting.FileSystemObject")
    If FSO.FileExists(FilePath) Then
        If FSO.GetFile(FilePath).Size > 0 Then
            Set CurrentFile = FSO.OpenTextFile(FilePath, ForReading, _
                False, TristateUseDefault)
            If CurrentFile.AtEndOfStream <> True Then
                ReadFile = CurrentFile.ReadAll: CurrentFile.Close
            End If
        End If
    End If
End Function
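As a cross-check on the logic above, the same four-state tokenizer condenses
nicely into Python. This is an illustrative re-implementation, not the
original code; the names are mine, and it assumes the end-of-line rule for
comments and the doubled-quote convention for embedding quotes in strings.

```python
GROUND, COMMENT, TOKEN, STRING = range(4)
OPERATORS = set("+-!^*()=\\/<>,")   # same rough operator list as above
WHITESPACE = set(" \t\r\n")

def tokenize(source):
    tokens = []          # list of (text, type) pairs
    state = GROUND
    work = ""            # the token being accumulated
    i = 0
    while i < len(source):
        ch = source[i]
        if state == GROUND:
            if ch in OPERATORS:
                tokens.append((ch, "operator"))
            elif ch == "'":
                state = COMMENT
            elif ch == '"':
                work, state = ch, STRING
            elif ch not in WHITESPACE:
                work, state = ch, TOKEN
        elif state == COMMENT:
            if ch == "\n":           # a comment runs to end of line
                state = GROUND
        elif state == TOKEN:
            if ch in WHITESPACE or ch in OPERATORS or ch == "'":
                tokens.append((work, "identifier"))
                work = ""
                if ch in OPERATORS:  # the terminator may be a token too
                    tokens.append((ch, "operator"))
                state = COMMENT if ch == "'" else GROUND
            else:
                work += ch
        elif state == STRING:
            work += ch
            if ch == '"':
                if i + 1 < len(source) and source[i + 1] == '"':
                    work += '"'      # doubled quote: embedded quote
                    i += 1           # consume the second quote as well
                else:
                    tokens.append((work, "string"))
                    work, state = "", GROUND
        i += 1
    if state == TOKEN and work:      # flush a trailing identifier
        tokens.append((work, "identifier"))
    return tokens

for text, kind in tokenize("count = count + 1 'bump it\n"):
    print(text, kind)
```

The dispatch on state, with a nested dispatch on the input character, mirrors
the Select Case structure of the VBScript main loop exactly.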

' -------------------------------------------------------------
RELATED TOOLS AND CONCEPTS

The following may be of interest to anyone looking at parsing technologies.

Gold Parser - http://www.devincook.com/goldparser
+ Free grammar compiler which can be used with a constructed grammar to
generate a language parser. The Gold Parser Engine itself has an ActiveX
version.

Libero - http://www.imatix.com/html/libero/
+ Helps you write clean Finite State Machine code from a state table.

Foldoc - http://foldoc.doc.ic.ac.uk/foldoc/index.html
Mirrors include http://www.instantweb.com/foldoc and
http://ftp.sunet.se/foldoc/index.html
+ Contains a good set of advanced computing topics; you can look up some of the
jargon used in parsing and FSM work, such as state machine, lexer, parser,
yacc...

Miscellaneous tools which play a role in the concept of lexing and parsing
include lex, yacc, flex, bison, awk, and sed.
