
"Stripping" VBScript prior to tokenizing


Alex K. Angelopoulos (MVP)

Nov 14, 2002, 9:04:19 PM
I'm working on being able to tokenize and parse VBScript using VBScript. At
this point I have a pre-cleaning routine that simply wipes out blank lines and
leading/trailing whitespace, and joins split command lines. It is set to
self-clean, and the stripped code that comes out of it still runs correctly, so
I think it's a good start. Comments welcome; I don't know if anyone will have a
use for this as-is, though.

'T2.vbs
aData = TidyArray( _
    PreCleanVBScript(FileToLines( , _
    WScript.ScriptFullName)))

For i = 0 To UBound(aData)
    WScript.Echo aData(i)
Next

Function TidyArray(aData)
    Dim aTmp(), i
    ReDim aTmp(-1)
    For i = 0 To UBound(aData)
        If Len(aData(i)) > 0 Then
            ReDim Preserve aTmp(UBound(aTmp) + 1)
            aTmp(UBound(aTmp)) = aData(i)
        End If
    Next
    TidyArray = aTmp
End Function

Function PreCleanVBScript(aData)
    ' takes an array of VBScript code lines as an argument
    ' chops out non-printing characters at start/end
    ' removes blank lines
    Dim aTmp(), i, sTmp
    ReDim aTmp(-1)
    For i = 0 To UBound(aData)
        sTmp = LChomp(Chomp(aData(i)))
        If Len(sTmp) > 0 Then
            ReDim Preserve aTmp(UBound(aTmp) + 1)
            aTmp(UBound(aTmp)) = sTmp
        End If
    Next
    ' Now let's get REALLY ugly... combine split lines.
    ' We need to step through this backwards.
    For i = UBound(aTmp) To 1 Step -1
        If Right(aTmp(i - 1), 1) = "_" Then
            aTmp(i - 1) = Left(aTmp(i - 1), Len(aTmp(i - 1)) - 1) & aTmp(i)
            aTmp(i) = ""
        End If
    Next
    PreCleanVBScript = aTmp
End Function

Function FileToLines(delim, FilePath)
    ' Reads a file and splits it into lines.
    ' If the delimiter is not specified, uses vbCrLf.
    Dim lDelim ' use a local variable
    If VarType(delim) = vbError Then
        lDelim = vbCrLf
    Else
        lDelim = delim
    End If
    FileToLines = Split(ReadFile(FilePath), lDelim)
End Function

Function ReadFile(FilePath)
    ' Given the path to a file, returns the entire contents.
    ' Works with either ANSI or Unicode.
    Dim FSO, CurrentFile
    Const ForReading = 1, _
          TristateUseDefault = -2
    Set FSO = CreateObject("Scripting.FileSystemObject")
    If FSO.FileExists(FilePath) Then
        If FSO.GetFile(FilePath).Size > 0 Then
            Set CurrentFile = FSO.OpenTextFile(FilePath, ForReading, _
                False, TristateUseDefault)
            ReadFile = CurrentFile.ReadAll
            CurrentFile.Close
        End If
    End If
End Function

Sub fWrite(FilePath, sData)
    ' writes sData to FilePath
    With CreateObject("Scripting.FileSystemObject"). _
            OpenTextFile(FilePath, 2, True)
        .Write sData
        .Close
    End With
End Sub

Function LChomp(sData)
    ' Trims leading nonprinting characters from a string
    ' (the Len check in the loop guards against running off
    ' the end of an all-whitespace string)
    Dim sNoPrint, sTmp
    sTmp = sData
    sNoPrint = vbCr & vbLf & vbFormFeed _
        & vbVerticalTab & vbNullChar & vbTab & " "
    Do While Len(sTmp) > 0 And InStr(sNoPrint, Left(sTmp, 1)) > 0
        sTmp = Right(sTmp, Len(sTmp) - 1)
    Loop
    LChomp = sTmp
End Function

Function Chomp(sData)
    ' Trims terminal nonprinting characters from a string
    ' (the Len check in the loop guards against running off
    ' the end of an all-whitespace string)
    Dim sNoPrint, sTmp
    sTmp = sData
    sNoPrint = vbCr & vbLf & vbFormFeed _
        & vbVerticalTab & vbNullChar & vbTab & " "
    Do While Len(sTmp) > 0 And InStr(sNoPrint, Right(sTmp, 1)) > 0
        sTmp = Left(sTmp, Len(sTmp) - 1)
    Loop
    Chomp = sTmp
End Function


Joe Earnest

Nov 15, 2002, 8:54:51 AM
Hi Alex,

Nice code.

One thing that I played around with briefly as an
exercise, which could use a "cleaner" like this, was
full compiled-code-style error trapping: running the
primary script from a secondary script and using
Execute to execute the primary script's lines. You
then have the ability to display lines, jump to
different lines, and even jump in and out of informal
routines (not procedures) using "returns" (true calls,
as in assembly), etc. It would be useful for a
script-based scripter, or a word-processor
macro-based scripter.

That use, however, also requires breaking lines
concatenated with colons into sublines and
rearticulating them. The big problem there, of course,
was how to handle subprocedures without essentially
duplicating the entire interpreter. I had tentatively
broken them out into separate files to handle them.
In the end, though, I didn't think that there was
enough use for that either.

Regards,
Joe Earnest

"Alex K. Angelopoulos (MVP)" <a...@mvps.org> wrote in message
news:eLIxasEjCHA.1300@tkmsftngp08...


| I'm working on being able to tokenize and parse VBScript using VBScript. At
| this point I have a pre-cleaning routine that simply wipes out blank lines and
| terminal/starting white space, and joins split command lines. It is set to
| self-clean, and the stripped code that comes out of it still works correctly, so
| I think it's a good start. Comments welcome; I don't know if anyone will have a
| use for this as-is, though.



Alex K. Angelopoulos (MVP)

Nov 15, 2002, 3:13:56 PM
Looks like you've been trying some of the same things I have... it's a lot
easier to do if you use an IE TextArea for input. <g>

--
Please respond in the newsgroup so everyone may benefit.
http://dev.remotenetworktechnology.com
----------
Subscribe to Microsoft's Security Bulletins:
http://www.microsoft.com/technet/security/bulletin/notify.asp


"Joe Earnest" <joeea...@qwest.net> wrote in message
news:ONh6y0KjCHA.1616@tkmsftngp10...

Alex K. Angelopoulos (MVP)

Nov 17, 2002, 12:57:05 PM
Joe,

Do you still have your code around?<g>

It turns out that we've probably done about a quarter of the steps in the code
for a rudimentary tokenizing system for VBScript. This isn't the hard part -
the mentally tough section is appropriately tokenizing everything - but initial
cleanup is one large subtask.

I've just been taking a look at some Python tokenizing/parsing code, and it
looks like it might be easy to steal - er, extend - someone else's material in
another scripting language and adapt it to VBScript.



"Joe Earnest" <joeea...@qwest.net> wrote in message
news:ONh6y0KjCHA.1616@tkmsftngp10...

Alex K. Angelopoulos (MVP)

Nov 17, 2002, 1:57:49 PM
"Alex K. Angelopoulos (MVP)" <a...@mvps.org> wrote in message
news:eHkrBLmjCHA.1552@tkmsftngp08...

> Joe,
>
> Do you still have your code around?<g>
>
> It turns out that we've probably done about a quarter of the steps in the code
> for a rudimentary tokenizing system for VBScript. This isn't the hard part -
> the mentally tough section is appropriately tokenizing everything - but
> initial cleanup is one large subtask.
>

It turns out that tokenizing IS the tough part of developing the VBScript
grammar.

One of the greater annoyances appears to be what I am calling "embedded
whitespace tokens" - apparently what tends to choke up parsers quite readily
is being passed references to things like "end if" as though they were two
separate tokens.

It appears that the simplest way to handle this is to do substitutions in a
scanning phase, substituting single-token references before a parser ever sees
them. Two things ease this somewhat. First, some of the more finicky
constructs available in earlier forms of VB, which seemed to prevent effective
scanners/parsers from being used, are not in VBScript; second, Microsoft has
actually published a grammar for VB.NET:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbls7/html/vblrfvbspec12.asp

Although this differs significantly from that of VB6/VBScript, it still provides
a useful starting point.
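To make the idea concrete, here's a rough sketch of that scanning-phase
substitution in Python (since I've been looking at Python tokenizing code
anyway). The phrase list and the underscore-joined token names below are
illustrative placeholders only, not VBScript's actual token set:

```python
# Multi-word VBScript keywords that should reach the parser as single
# tokens. All entries here are two words; a real list would be longer.
MULTIWORD = ["end if", "end sub", "end function", "end select",
             "end with", "exit do", "exit for",
             "do while", "do until", "loop while", "loop until"]

def merge_multiword_tokens(tokens):
    """Collapse adjacent tokens like ['End', 'If'] into one 'End_If' token,
    so the parser never sees them as two separate tokens."""
    out = []
    i = 0
    while i < len(tokens):
        merged = False
        for phrase in MULTIWORD:
            words = phrase.split()
            candidate = [t.lower() for t in tokens[i:i + len(words)]]
            if candidate == words:
                out.append("_".join(tokens[i:i + len(words)]))
                i += len(words)
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out
```

Running the scanner output through this before parsing means a bare `End`
and an `End If` arrive at the parser as distinct single tokens, which is
exactly the distinction the parser otherwise needs lookahead for.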

Here are a couple of posts I dug up from comp.compilers postings:

=================================================

From: Scott Nicol
Newsgroups: comp.compilers
Date: 22 May 2000 23:02:39 -0400
Organization: APK Net
References: 00-05-076
(http://compilers.iecc.com/comparch/article/00-05-076)
Keywords: Basic, parse, practice

Anton Belov -- Customer Engineering wrote:
> Does anyone have or know where to get the grammar definition (in any
> form) for VBScript ? I know it has been asked before - but i didn't
> see any answers ...

I've never seen a grammar published. You can't trust the reference
manual either, because it lies. The only way to write a grammar is to
play with an implementation of the language.

I wrote a parser for VB (sorry, not available, I don't own the code),
and since VBScript is based on VB, I'll assume you'll run into similar
problems. Here are a few:

- End requires an extra lookahead, because End is different from
End If, End While, End Select, etc. Handle it in the scanner
so that End <something> is returned to the parser as one token,
not two.
- Line labels are a pain. I think I put some magic in the
scanner for this.
- There are a zillion reserved words, but some depend on context.
This can make tokenizing very difficult. For example, "In" is
a reserved word but can also be a variable name. I think I
fixed this by having YACC pass state info back to Lex, but
you have to be careful about lookahead. Another possible fix
is to allow a syntax error and try to recover in yyerror().
Not easy...
- Scoping rules are quite complex, and are subtly different for
variables and types.
- Parentheses are significant at runtime, so don't throw them
away during the parse if you ever want to build a runtime.
In case you're wondering, parentheses override references,
so even if a function/subroutine is written to accept a
parameter by reference, you can pass it by value by
enclosing the argument in parentheses.

Writing a runtime is even more exciting...

=================================================


Re: Visual Basic Grammar Definition wanted
From comp.compilers

From: Scott Stanchfield <>
Newsgroups: comp.compilers
Date: 19 Jul 1996 00:00:36 -0400
Organization: McCabe & Associates
References: 96-07-111
http://compilers.iecc.com/comparch/article/96-07-111
Keywords: Basic, parse

I hate to say this, but I've got a simple one and can't give it to
you...

As you read the following, you may realize that yacc is not a good way to
go for Visual Basic. It needs (at least) 3 tokens of lookahead. (I've
found cases where three tokens are needed, and I hope that's the limit.)
I would suggest something like PCCTS (ANTLR/DLG), as it makes it easy to
watch for multi-lookahead sequences.

What I can do is offer a bit of advice based on my experience writing
it:

-- VB is _not_ an LALR(1) language! You need 3 tokens of lookahead.
For example,

IF expr THEN <end-of-line>
100 END
100 END IF

as you can see, you can't determine whether the "100 END" is a labelled
END statement or the start of a labelled "END IF." You need to
look after the END to see if it's followed by "IF" or not.
We implemented this by setting up a token buffer between the
scanner and parser. Basically, we renamed the yylex() routine
generated by lex to "real_yylex" and created a new yylex()
routine that grabs a token using real_yylex, if it's a number
(and the first token on a line) get another, if it's END, get
another, if it's IF return LABELED_END_IF. I'm sure there
are other ways to implement this, but this seemed to be the
most maintainable, and we needed the yylex wrapper for the
"next" problem below... While I'm here, this also makes it
easy to change END IF to a single END_IF token, END SUB to a
single END_SUB token etc. This makes the grammar much easier,
and you don't have to worry about the conflict between the
END statement and the "END x" that ends an "x" structure.
Just implement a few routines to create a token buffer, and
have your new yylex "peek" into that buffer... Keep it
simple, though.

-- NEXT I, J
This is a fairly evil construct. Nice for the users, but
can make for an evil grammar. We used the yylex trick above to
watch for NEXT IDENT COMMA, and if we found it, we replace the
COMMA token with COLON NEXT, so the parser would really see
NEXT I : NEXT J
which is easy to parse...

-- The VB docs don't cover the language very well. Watch for stuff from
"normal" basic like
IF expr THEN 30
which is shorthand for
IF expr THEN GOTO 30

There are a few other undocumented things like this that slip
my mind right now.

-- The examples distributed with VB3 and VB4 seem to hit some wacky
situations, and are pretty good for early testing and to help uncover
some of the things that aren't documented.

-- when analyzing VB source, you need to look at all files in a project
as a set -- they have global refs between files.

Hope this helps. If you have any specific questions about parsing VB
(other than "may I have the grammar"), feel free to email me.

Good luck!
Scott

Damon Groenveld wrote:
> If anyone out there has a lex/yacc (or similar) grammar definition for VB I
> would like to know about it. (VB 3.0 preferably).

--
Scott Stanchfield McCabe & Associates -- Columbia, Maryland

=================================================

From another Scott Stanchfield post, the next year:

=================================================


... I think I can say that it was a fairly nasty beast in some areas:
Think about

IF x = 1 THEN
END
100: END IF


which would complicate a grammar significantly (yes, VB allows a label
on things like "END IF" -- yuck!).


A trick I've used for several projects (at home and in various jobs) is
to create a wrapper for yylex that acts like a token buffer. (You need
to rename the lex-generated yylex, and name the buffer yylex, calling
the renamed yylex.) This buffer watches for a few combinations of
tokens and modifies the set of tokens that the parser will see.
Sometimes it will merge several real tokens into one "metatoken;" other
times it will insert dummy tokens in the token stream. Both tricks can
make the grammar much cleaner and simpler.


In Visual Basic, things like "END IF" become a single ENDIF token, "NEXT
X,Y" becomes "NEXT X : NEXT Y" and so on. (In the above example, you
could change "LABEL END IF" to LABELLED_ENDIF so it can't conflict with
the "LABEL END" in the body of the if statement.)
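Putting Stanchfield's tricks together, here's a minimal Python sketch of
that token-buffer wrapper (Python standing in for the renamed-yylex wrapper;
the token spellings and the LABELED_END_IF name are illustrative assumptions,
not anyone's real implementation):

```python
class TokenBuffer:
    """Sits between scanner and parser: peeks a few tokens ahead and
    rewrites awkward sequences before the parser ever sees them."""

    def __init__(self, raw_tokens):
        self.buf = list(raw_tokens)  # stand-in for pulling from real_yylex()
        self.pos = 0

    def _peek(self, n):
        return self.buf[self.pos:self.pos + n]

    def next_token(self):
        # NEXT IDENT COMMA  ->  NEXT IDENT : NEXT ...
        look = self._peek(3)
        if len(look) == 3 and look[0] == "NEXT" and look[2] == ",":
            # replace the comma so the parser sees two simple NEXTs
            self.buf[self.pos + 2:self.pos + 3] = [":", "NEXT"]
        # labelled END IF -> single LABELED_END_IF metatoken
        look = self._peek(3)
        if (len(look) == 3 and look[0].isdigit()
                and look[1] == "END" and look[2] == "IF"):
            self.buf[self.pos:self.pos + 3] = ["LABELED_END_IF"]
        tok = self.buf[self.pos]
        self.pos += 1
        return tok

def drain(tokens):
    """Run a whole token list through the buffer, for demonstration."""
    tb = TokenBuffer(tokens)
    out = []
    while tb.pos < len(tb.buf):
        out.append(tb.next_token())
    return out
```

A labelled plain `END` passes through untouched, so the grammar never has
to disambiguate it from a labelled `END IF`.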

Joe Earnest

Nov 18, 2002, 5:00:40 PM
"Alex K. Angelopoulos (MVP)" <a...@mvps.org> wrote in message
news:uvVp9smjCHA.1688@tkmsftngp08...

> "Alex K. Angelopoulos (MVP)" <a...@mvps.org> wrote in message
> news:eHkrBLmjCHA.1552@tkmsftngp08...
> > Joe,

Hi Alex,

Sorry for the delay; I've been out of pocket for a while and
will be for a bit longer. And I fear I'm really no help.

> > Do you still have your code around?<g>

The experimentation that I mentioned was in reading
and executing entire lines, so I didn't really do much
stripping, and really no tokenizing. Pretty much boilerplate
Execute experimentation right after Execute was added.

Just to show my age, I did use a modest tokenizing BASIC-file
program back with QuickBASIC. There used to be some
tokenizing BASIC files posted on the QuickBASIC forums,
but I doubt they're still around. If they were, the underlying
language is almost identical - it's the object orientation that's
different. I assume that MS hasn't rewritten interpreter code
where it hasn't needed to. Tokenizing was a bigger issue with
QuickBASIC in DOS: QuickBASIC produced such large (by
DOS standards) compiled executables (even with MASM
object-file links), and DOS was so size-critical, that one could
kick out much smaller direct-code executables by tokenizing
and associating less "safety-oriented" code.

> It turns out thet tokenizing IS the tough part of developing the VBScript
> grammar.
>
> One of the greater annoyances appears to be what I am calling "embedded
> whitespace tokens" - apparently what tends to choke up parsers quite
> readily is getting incorrectly passed references to things like "end if"
> as though they were dual tokens.

Yeah. There are books out there on building
compilers. Unfortunately, I don't have one. As best I recall
the QuickBASIC techniques: after whitespace reduction,
quote identification, and non-conforming (potential variable
and literal) referencing, a line was parsed from the largest fixed
phrases to the smallest, so that "end if", "do while", etc. were
identified differently from their components, and then down to
the individual keywords, operators, modifiers, etc. *Really*
tedious stuff.
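For what it's worth, a toy Python sketch of that largest-to-smallest pass,
with quote identification done first (the phrase and keyword lists here are
illustrative only, nowhere near complete):

```python
import re

# Illustrative, far-from-complete lists of fixed phrases and keywords.
PHRASES = ["end if", "end sub", "do while", "do until"]
KEYWORDS = {"end", "if", "do", "while", "dim", "set", "then"}

def scan_line(line):
    # 1. Quote identification first, so literal contents can never be
    #    mistaken for keywords: lift out "..." literals, leave placeholders.
    literals = re.findall(r'"[^"]*"', line)
    masked = re.sub(r'"[^"]*"', ' "STR" ', line)
    words = masked.lower().split()
    tokens, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in PHRASES:               # largest fixed phrases first...
            tokens.append(pair)
            i += 2
        elif words[i] == '"str"':         # restore the original literal
            tokens.append(literals.pop(0))
            i += 1
        elif words[i] in KEYWORDS:        # ...then individual keywords
            tokens.append(words[i])
            i += 1
        else:                             # potential variable or literal
            tokens.append(words[i])
            i += 1
    return tokens
```

Because literals are masked before phrase matching, a line like
`If s = "End If" Then` keeps its string intact instead of the scanner
spotting a phantom "end if" inside it.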

Regards,
Joe Earnest
