VBScript String Replace - Remove / Replace Characters in String

dsoutter

Mar 2, 2010, 12:11:01 AM

VBScript String Replace

http://www.code-tips.com/2009/04/vbscript-string-clean-function-remove.html

Remove or replace specific characters from a string. The article below
provides a function in VBScript to remove or replace characters in a
string.

remove Illegal Characters from a string: VBScript String Replace

VBScript replace characters in string.

Al Dunbar

Mar 2, 2010, 11:16:25 PM

"dsoutter" <webmaste...@gmail.com> wrote in message
news:a49b3803-8e79-46eb...@a16g2000pre.googlegroups.com...
> http://groups.google.com/group/web-programming-seo/browse_thread/thread/9fcc0e6307ccbce0
>
> On Mar 2, 4:11 pm, dsoutter <webmasterhub....@gmail.com> wrote:
>> VBScript String Replace
>>
>> http://www.code-tips.com/2009/04/vbscript-string-clean-function-remov...
>>
>> Remove or replace specific characters from a string. The article below
>> provides a function in VBScript to remove or replace characters in a
>> string.
>>
>> VBScript String Replace
>>
>> http://www.code-tips.com/2009/04/vbscript-string-clean-function-remov...
>>
>> remove Illegal Characters from a string: VBScript String Replace
>>
>> http://www.code-tips.com/2009/04/vbscript-string-clean-function-remov...
>>
>> VBScript replace characters in string.
>

Here is how I would code this function if I ever needed such a thing:

MsgBox Clean("C:\<test>&<done>")

Function Clean(strToClean)
    Dim strTemp, badChars, badChar, goodChar
    strTemp = strToClean
    ' The empty-string entry after ">" is a no-op for Replace.
    badChars = Array("?","/","\",":","*","""","<",">","","&","#","~","%","{","}","+","_",".")
    For Each badChar In badChars
        ' Pick the replacement text for this character.
        Select Case badChar
            Case "&": goodChar = " and "
            Case ":": goodChar = "-"
            Case Else: goodChar = " "
        End Select
        strTemp = Replace(strTemp, badChar, goodChar)
    Next
    Clean = strTemp
End Function

IMHO, this has the same result but the logic is somewhat simpler. What
benefit would I get from switching from my version to yours?

/Al

WebmasterHub.net

Mar 4, 2010, 1:05:48 AM

On Mar 3, 3:16 pm, "Al Dunbar" <aland...@hotmail.com> wrote:
> "dsoutter" <webmasterhub....@gmail.com> wrote in message
> news:a49b3803-8e79-46eb...@a16g2000pre.googlegroups.com...
>
> >http://groups.google.com/group/web-programming-seo/browse_thread/thre...
>
> /Al

Hi Al, the logic is simpler because you are using the Replace() function
to perform the string replacement, whereas the function provided takes
the left and right parts of the string on either side of an illegal
character. In many cases your method would be more suitable, mainly
because of the simpler logic, especially when all instances of each
character are to be processed in the same way.

Because the method provided parses the string character by character,
you should have greater control over the output when more complex
operations need to be performed, such as removing or replacing a
character only if it is within a specific context:

Eg. replace "&" with " and " if padded with spaces or other specific
character, or with a "+" if not:
"something & something else" would become "something and something else"
"somethin&something else" would become "somethin+something else".

Eg. replace ":" only if NOT part of a url:

"the website is http://code-tips.com " would remain
"the website is http://code-tips.com "
"See Here: http://code-tips.com " would become
"See Here http://code-tips.com "

This would be achieved either by checking the previous 3-5 characters
when a ":" is found, to see whether it is in the context of a url
(http, https, ftp), or by checking whether the characters following the
current ":" are "//", which would indicate that the colon is part of
a url.
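
The two rules above can be sketched roughly as follows. This is only an
illustration, not the function from the linked article: the names
ReplaceAmpersand and StripNonUrlColons are hypothetical, "padded" is
assumed to mean a space on both sides, and a ":" is treated as part of a
url only when it is immediately followed by "//". The padded "&" is
replaced by "and" alone because the surrounding spaces are already in
the string.

```vbscript
' Hypothetical helper: "&" becomes "and" when it has a space on both
' sides, and "+" otherwise.
Function ReplaceAmpersand(sIn)
    Dim i, ch, prevCh, nextCh, sOut
    sOut = ""
    For i = 1 To Len(sIn)
        ch = Mid(sIn, i, 1)
        If ch = "&" Then
            prevCh = ""
            If i > 1 Then prevCh = Mid(sIn, i - 1, 1)
            nextCh = Mid(sIn, i + 1, 1)   ' "" when past the end of the string
            If prevCh = " " And nextCh = " " Then
                sOut = sOut & "and"
            Else
                sOut = sOut & "+"
            End If
        Else
            sOut = sOut & ch
        End If
    Next
    ReplaceAmpersand = sOut
End Function

' Hypothetical helper: drops every ":" that is not followed by "//".
Function StripNonUrlColons(sIn)
    Dim i, ch, sOut
    sOut = ""
    For i = 1 To Len(sIn)
        ch = Mid(sIn, i, 1)
        If ch <> ":" Or Mid(sIn, i + 1, 2) = "//" Then
            sOut = sOut & ch
        End If
    Next
    StripNonUrlColons = sOut
End Function
```

Called on the examples given above, ReplaceAmpersand("somethin&something else")
should give "somethin+something else", and
StripNonUrlColons("See Here: http://code-tips.com ") should keep the url
intact while dropping the first colon.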

This functionality has not been included in the function provided, but
would be easy to implement, as the string is incrementally parsed and
manipulated using a numeric string position value relative to the
current position/character in the string.

There may also be differences in performance between the two methods,
as the function provided includes the code required to remove or
replace each of the specified characters without calling the Replace()
function. I suspect that the Replace function uses a similar approach
to replace the specified characters, so any difference in performance
would be minimal unless parsing a large string value. I haven't yet
tested this for performance differences.

mayayana

Mar 4, 2010, 9:59:39 AM

This looks like some kind of advertisement
for a blog, but it's an interesting question.
In compiled VB both of the foregoing methods
would be extremely slow on large strings.
The webpage sample is allocating a vast
number of strings to do its job. As the strings
get bigger it would slow to a crawl. The Replace
function looks much better to me, but it's also
fairly slow. (Replace itself is slow.)

Probably none of that matters if the function
is only being used for filename strings of 20+-
characters. And it's not easy to optimize for
speed in VBS anyway. But personally I'd still much
prefer your Replace loop. I don't see the sense of
writing a highly inefficient Replace method in
VBS when the scripting runtime can do it internally.

But in general, why not tokenize? In compiled
code that should be by far the fastest, with much
greater speed achieved if the characters can be
treated as numbers in an array so that the operation
is not allocating new strings or deciphering the Chr
value of each stored numeric value of the string.
In VBS, I don't know whether treating characters as
numbers will help, since it's still a variant that has
to be "parsed". I haven't tested the possibilities.
But I'm using numeric conversion below. I figured that
it should be a little faster than having the function
need to do a string comparison. (In a Select Case
where the character is not an "illegal" there would be
20-30 string comparisons happening if one uses the
string version.)

Another advantage of tokenizing is flexibility.
There can be dozens of Case declares with very
little cost.

' Note: I just wrote this as an "air code" sample.
' I didn't bother to get all of the ascii values since
' it's just a demo.

Function Clean(sIn)
    Dim i2, iChar, A1()

    ' One array slot per input character; Join rebuilds the string once.
    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        ' The codes below are: ? / \ : * < > , . + ~
        Select Case iChar
            Case 63, 47, 92, 58, 42, 60, 62, 44, 46, 43, 126
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next
    Clean = Join(A1, "")
End Function

James

Mar 4, 2010, 6:41:52 PM

Hi Mayayana,

As the "air code" sample of your method parses the string character by
character, I suspect that a combination of your method and the
function provided should allow characters to be replaced, taking into
account the context of each illegal character.

I am using the method to clean a plain text string that may or may not
contain URLs. If there are URLs present in the string, they are later
replaced with an internal url with parameters pointing to a logging
script that logs and forwards the request to the original url. The
cleaned string is also used to generate a set of keywords and
keyphrases from the text supplied.

I have based the code below on the "air code" demo, and it has also
not been tested. I have incorporated the contextual tests to only
remove/replace some characters if they are not in a specific context
(using a URL as an example).

The method below must certainly be a better approach than the function
linked from this thread, or the one suggested by Al. What do you think?
Also, is there a better way to incorporate the contextual tests for each
illegal character in the string?

Thanks

James

-------------------------

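Function Clean(sIn)
    Dim i2, iChar, A1(), rChars, rChar, lChar

    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        Select Case iChar

            Case 58 ' ":" is kept only when followed by "//", otherwise dropped
                rChars = Mid(sIn, i2 + 1, 2)
                If rChars = "//" Then
                    A1(i2 - 1) = Chr(iChar)
                End If

            Case 47 ' "/" is kept only when next to another "/"
                ' Compared as strings: Asc("") fails at the end of the
                ' string, and Mid(sIn, 0, 1) fails at the start.
                rChar = Mid(sIn, i2 + 1, 1)
                lChar = ""
                If i2 > 1 Then lChar = Mid(sIn, i2 - 1, 1)
                If rChar = "/" Or lChar = "/" Then
                    A1(i2 - 1) = Chr(iChar)
                Else
                    A1(i2 - 1) = "-"
                End If

            Case 63, 92, 42, 60, 62
                A1(i2 - 1) = "-"

            Case 44, 46, 43, 126
                A1(i2 - 1) = ""

            Case Else
                ' The posted listing stops at "Case Else"; the lines from
                ' here on are assumed to match the air code sample.
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next
    Clean = Join(A1, "")
End Function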

mayayana

Mar 4, 2010, 9:55:29 PM

> The method below must certainly be a better approach to the function
> linked from this thread, or suggested by Al. What do you think? Also,
> is there a better way to incorporate the contextual tests for each
> illegal character the string?

I think that's pretty much what I meant in saying
it's flexible. There's no limit, really. One could even
call separate functions from within the Select Case.

Parsing URLs sounds tricky, but it can be done. For instance, you
could check each ":" to see if it's part of "http://",
then get the whole URL and write your edited
URL to the array. You'd just have to find the end
of the URL, calculate the offset of the start and end
characters, and keep track of how many characters
you've actually written to the array. With edits involved
you might need to use a bigger array and then ReDim
Preserve it at the end before the Join call.
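
A rough sketch of that bookkeeping, offered as an illustration only. The
function name CleanKeepUrls is hypothetical, only "http://" is
recognised, and a URL is assumed to end at the next space (or at the end
of the string). Each URL is written to the array as a single element, so
fewer slots are used than there are input characters, and the array is
trimmed with ReDim Preserve before Join.

```vbscript
' Hypothetical sketch: copy URLs through untouched, replace other
' listed characters with "-".
Function CleanKeepUrls(sIn)
    Dim i, n, ch, endPos, A1()
    If Len(sIn) = 0 Then
        CleanKeepUrls = ""
        Exit Function
    End If

    ReDim A1(Len(sIn) - 1)   ' upper bound on the slots needed
    i = 1                    ' read position in sIn
    n = 0                    ' number of slots written so far
    Do While i <= Len(sIn)
        If LCase(Mid(sIn, i, 7)) = "http://" Then
            ' Assumption: the URL runs up to the next space.
            endPos = InStr(i, sIn, " ")
            If endPos = 0 Then endPos = Len(sIn) + 1
            A1(n) = Mid(sIn, i, endPos - i)
            i = endPos
        Else
            ch = Mid(sIn, i, 1)
            Select Case ch
                Case "?", "/", "\", ":", "*", "<", ">"
                    A1(n) = "-"
                Case Else
                    A1(n) = ch
            End Select
            i = i + 1
        End If
        n = n + 1
    Loop

    ReDim Preserve A1(n - 1) ' drop the unused slots
    CleanKeepUrls = Join(A1, "")
End Function
```

A Do loop is used rather than For/Next so that the read position can
jump past a whole URL in one step.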

James

Mar 4, 2010, 11:23:45 PM

Thanks Mayayana,

The illegal characters are being removed or replaced as expected. I
am using a regular expression with the replace function to remove all
html tags except for "a" tags (hyperlinks). I am then removing all "a"
tags so that only the href value is left, which is placed after the
anchor text in brackets.
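
The first of those steps might look something like the sketch below. It
is not the actual code in use: the function name StripTagsExceptAnchors
and the pattern are assumptions, and a regular expression like this will
not cope with every kind of malformed HTML. It relies on the negative
lookahead supported by the VBScript RegExp object.

```vbscript
' Hypothetical sketch: remove every tag except <a ...> and </a>.
Function StripTagsExceptAnchors(sHtml)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    ' Match "<" only when it does not start an "a" or "/a" tag,
    ' then everything up to the closing ">".
    re.Pattern = "<(?!/?a[\s>])[^>]*>"
    StripTagsExceptAnchors = re.Replace(sHtml, "")
End Function
```

The "[\s>]" after the "a" keeps tags such as "abbr" from being mistaken
for anchors.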

The next step I am using the string clean function from the linked
article (now modified to include suggestions in this thread) to remove
all special characters from the string except when a component of a
URL.

The final step, which I am currently working on is to parse the
cleaned string to replace urls with the internal redirect. It is
working as expected, but there are some cases where URLs are not
followed by a space depending on the context in the original string.
The problem being that there isn't currently a consistent method to
find the end of each URL. I am working toward adjusting the function
so that all URLs are contained in square brackets [] once processed
using the string clean function so that they can be found easily when
parsing to update the URLs.

I am replacing all special characters with a space, then re-parsing
the string to remove double (or more) spaces between words / URLs.
This works most of the time, but as I am not removing "." chars (ASCII
# 46), a url may end up with an additional "." at the end
(http://address.com.). To prevent this, I am replacing all "." with " ."
before parsing URLs, to allow URLs to be recognised consistently.
After parsing and converting URLs, I then replace any occurrences of
" ." with the original "."

This seems to work, but I am not sure that it is the best way to do
this as the same string is parsed a number of times before the desired
results are achieved.
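
For what it's worth, the pass that removes repeated spaces can be done
in a single call rather than by re-parsing. The sketch below is only one
way to do it; the function name CollapseSpaces is hypothetical.

```vbscript
' Hypothetical sketch: replace each run of two or more spaces
' with a single space.
Function CollapseSpaces(sIn)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = " {2,}"
    CollapseSpaces = re.Replace(sIn, " ")
End Function
```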

The string clean function works well using the tokenizing method.
Thanks again for your suggestion.

James

mayayana

Mar 5, 2010, 12:27:59 AM

> This seems to work, but I am not sure that it is the best way to do
> this as the same string is parsed a number of times before the desired
> results are achieved.

I think if it were me I'd put it *all* in the tokenizer.
For instance, for "<" you could do something like:

Case 60
    If UCase(Mid(sIn, i2 + 1, 1)) = "A" Then
        'This is an anchor tag, so parse it.
    Else 'drop out all other tags.
        Do
            i2 = i2 + 1
            If i2 > Len(sIn) Then Exit Do 'no closing ">" was found
            If Mid(sIn, i2, 1) = ">" Then Exit Do
        Loop
    End If

One note with that: You'd want to use Do/Loop
for the main loop so that you can change the
value of i2. The code above would go back to the
start of the main loop and begin processing the next
character after the end of the tag. My original code
used: For i2 = ..... Next
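
To show the Do/Loop form of the main loop, here is a cut-down sketch
that only drops tags and copies everything else. It is air code in the
same spirit: the name StripTags is hypothetical, anchor handling is left
out, and an unterminated tag is assumed to swallow the rest of the
string.

```vbscript
' Hypothetical sketch: a tokenizer whose main loop is Do/Loop, so the
' position can be moved past a whole tag in one step.
Function StripTags(sIn)
    Dim i, n, ch, closePos, A1()
    If Len(sIn) = 0 Then
        StripTags = ""
        Exit Function
    End If

    ReDim A1(Len(sIn) - 1)
    i = 1   ' read position
    n = 0   ' slots written
    Do While i <= Len(sIn)
        ch = Mid(sIn, i, 1)
        If ch = "<" Then
            closePos = InStr(i, sIn, ">")
            If closePos = 0 Then Exit Do   ' unterminated tag: drop the rest
            i = closePos + 1               ' resume after the tag
        Else
            A1(n) = ch
            n = n + 1
            i = i + 1
        End If
    Loop

    If n = 0 Then
        StripTags = ""
    Else
        ReDim Preserve A1(n - 1)
        StripTags = Join(A1, "")
    End If
End Function
```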

I guess it all gets down to a matter of personal
preference at some point, though. You're the one
who's going to have to maintain your script. :)

Al Dunbar

Mar 5, 2010, 12:31:03 AM

"dsoutter" <webmaste...@gmail.com> wrote in message

news:2456c1b9-460d-46dd...@l12g2000prg.googlegroups.com...

> On Mar 3, 3:16 pm, "Al Dunbar" <aland...@hotmail.com> wrote:
>> "dsoutter" <webmasterhub....@gmail.com> wrote in message

<snip>

>> Here is how I would code this function if I ever needed such a thing:

<snip>

>> IMHO, this has the same result but the logic is somewhat simpler. What
>> benefit would I get from switching from my version to yours?
>>

>> /Al
>
> Hi Al, the logic is simpler as you are using the replace() function to
> perform the string replace, where the function provided takes the left
> and right parts of a string, either side of an illegal character.

A nice analysis, and exactly my point. Thanks for making it for me.

> In
> many cases, your method would be more suitable mainly due to the
> simpler logic, especially when all instances of each character are to
> be processed in the same way.

True. But, as written, your function will also only process all instances of
each character in the same way. My method might therefore appear to be
better in all cases in which the functions, as written, could be used. If
you want to compare our methods when applied to a different problem space,
such as you describe here:

> As the method provided parses the string character by character, you
> should have greater control over the output when more complex
> operations need to be performed, such as removing or replacing a
> character only if it within a specific context:

You cannot compare my function as written with your function as modified to
solve some new problem. A better comparison would be to compare your
modified function with a different function I might write to solve that
problem.

> Eg. replace "&" with " and " if padded with spaces or other specific
> character, or with a "+" if not
> "something & something else" would become "something and something
> else"
> "somethin&something else" would become "somethin+something else".
>
> Eg. replace ":" only if NOT part of a url:
>
> "the website is http://code-tips.com " would remain "the website is
> http://code-tips.com "
> "See Here: http://code-tips.com " would become "See Here
> http://code-tips.com
> "
>
> This would be achieved by either checking the previous 3-5 characters
> when a ":" is found to see if it is in the context of a url or not
> (http, https, ftp), or by checking the characters following the
> current ":" is "//" which would indicate that the semicolon is part of
> a url.

There might even be other ways to perform this kind of parsing...

> This functionality has not been included in the function provided, but
> would be easy to implement, as the string is incrementally parsed and
> manipulated using a numeric string position value relative to the
> current position/character in the string.

You seem to be proposing that simple functions be written in such a way that
they are more directly adaptable into more complex ones capable of more
complex operations. I disagree with this approach, UNLESS a function is
coded in such a way that it can be made to perform the more complex work
without first having to be modified to do so by calling it in a different
manner.

I'm not saying that you are wrong to do it your way, just that it may not be
the best approach for others to emulate.

> There may also be differences in performance between the two methods,
> as the function provided includes the code required to remove or
> replace each of the specified characters without calling the replace()
> function.

Yes, you avoid calling replace. But you do that by calling instr for each
possible bad character, plus left, mid, len, and two string
concatenations for each bad character actually present. If you are concerned
with the overhead of calling a built-in function, my method does that fewer
times.

> I suspect

suspect, but do not know...

> that the replace function uses a similar approach
> to replace the specified characters so any difference in performance
> would be minimal, unless parsing a large string value. I haven't yet
> tested this for performance differences.

I haven't tested either, however, the actual logic used by a built-in
function, while possibly logically identical to that of a function written
in vbscript, is more likely to be faster and more efficient. This is mainly
because the built-in functions are coded in a lower level language.

Regardless, no argument over ultimate relative efficiency can really be
resolved without rigorous testing. Since neither of us feels it important
enough to do that, we probably both are willing to accept some
inefficiencies, given that our functions each perform their intended tasks
perfectly! ;-)

Or do they? I haven't tested your code, but my reading of it suggests to me
that it makes unstated assumptions about the nature of the string it is
processing (does it, for example, presume that the string represents a valid
NTFS, UNC or URL path of some sort?).

If you wouldn't mind, try running your function against a string such as
"C::\". I suspect the result might be "C :\", a string containing an illegal
character. If so, you would have to either include an internal recursive
call, or call your function in a loop until the result no longer changed. Or
you would have to qualify your documentation to explain that it is intended
only to process valid path strings (or whatever the case actually is).
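
The "call it in a loop until the result no longer changes" option could
be sketched like this. CleanOnce is a hypothetical stand-in that
deliberately replaces only the first ":" it finds; it is not the
function from the linked article, whose actual behaviour on "C::\" I
have not verified.

```vbscript
' Hypothetical stand-in for a cleaner that misses repeated characters:
' it replaces only the first ":" in the string.
Function CleanOnce(s)
    CleanOnce = Replace(s, ":", " ", 1, 1)
End Function

' Re-apply the cleaner until a pass leaves the string unchanged.
Function CleanUntilStable(s)
    Dim prev, cur
    cur = s
    Do
        prev = cur
        cur = CleanOnce(prev)
    Loop Until cur = prev
    CleanUntilStable = cur
End Function
```

The loop ends because every pass either removes one ":" or changes
nothing.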

Regardless, another knock against your function as posted, if you are
interested in objective criticism, is that it does not fully document
itself. The nature of an "illegal character" is somewhat inferred, but not
fully explained. If the goal is to convert a valid path to a string that
could be used as a filename, here are a few quirks you appear not to have
addressed:

non-uniqueness: Run your function (or mine, for that matter) on these two
different paths: "C:\documents and settings" and
"C:\documents\and\settings", and you get the same result: "C documents and
settings".

other filename invalidities: run it on one of those huge URL strings and you
might wind up with a filename that was actually too long for the file system
to handle.

the concept of adapting the function to do more comprehensive processing. If
that actually was the reason for your less simple approach, your audience is
not getting the benefit if you do not explain that.

the vagueness of the name of the function itself: clean? there's nothing
dirty here. Calling it Path2Filename might be a more accurate representation
of its purpose (or it might not - I could not tell the purpose from the code
itself without your additional explanation).

/Al

Al Dunbar

Mar 5, 2010, 12:33:05 AM

"WebmasterHub.net" <webmaste...@gmail.com> wrote in message
news:923bab94-9163-4786...@l24g2000prh.googlegroups.com...

> On Mar 3, 3:16 pm, "Al Dunbar" <aland...@hotmail.com> wrote:
>> "dsoutter" <webmasterhub....@gmail.com> wrote in message

<snip>

I already replied to your identical post from your alter ego ;-)

/Al

Al Dunbar

Mar 5, 2010, 12:45:00 AM

"mayayana" <maya...@nospam.invalid> wrote in message
news:eut$Rs6uK...@TK2MSFTNGP06.phx.gbl...

> This looks like some kind of advertisement
> for a blog,

or of a web site purporting to demonstrate some level of expertise and
authority that some of us have yet to recognize as such...

> but it's an interesting question.
> In compiled VB both of the foregoing methods
> would be extremely slow on large strings.

Granted. But if limited to URL's, for example, they might not be extremely
huge.

> The webpage sample is allocating a vast
> number of strings to do its job. As the strings
> get bigger it would slow to a crawl. The Replace
> function looks much better to me, but it's also
> fairly slow. (Replace itself is slow.)

I do not dispute that, although I do not know the actual metrics. But for a
site dedicated to providing example vbscripts and a newsgroup dedicated to
the same language, a completely different approach (i.e. re-write in C, for
example) would generally be of no interest to those looking for vbscript
solutions.

> Probably none of that matters if the function
> is only being used for filename strings of 20+-
> characters. And it's not easy to optimize for
> speed in VBS anyway.

Exactly.

> But personally I'd still much
> prefer your Replace loop. I don't see the sense of
> writing a highly inefficient Replace method in
> VBS when the scripting runtime can do it internally.

Agreed. But the other issue with less simple code that cannot be discounted
is the greater effort required to develop it, debug it, and test it to
ensure it works in all cases.

> But in general, why not tokenize? In compiled
> code that should be by far the fastest, with much
> greater speed achieved if the characters can be
> treated as numbers in an array so that the operation
> is not allocating new strings or deciphering the Chr
> value of each stored numeric value of the string.
> In VBS, I don't know whether treating characters as
> numbers will help, since it's still a variant that has
> to be "parsed". I haven't tested the possibilities.

I strongly suspect that the variant thing will make most vbscript code less
efficient than a compiled language, and that it might cause the tokenized
approach to be less efficient than it might be expected to be.

<snip>

/Al

Al Dunbar

Mar 5, 2010, 12:54:08 AM

"James" <webmaste...@gmail.com> wrote in message
news:f4d5de01-3c8f-430a...@t9g2000prh.googlegroups.com...

> On Mar 5, 1:59 am, "mayayana" <mayay...@nospam.invalid> wrote:
>> This looks like some kind of advertisement
>> for a blog, but it's an interesting question.

<snip>

> Hi Mayayana,
>
> As the "air code" sample of your method parses the string character by
> character, I suspect theat a combination of your method and the
> function provided should allow characters to be replaced, taking into
> account the context of each illegal character.
>
> I am using the method to clean a plain text string that may or may not
> contain URLs. If there are URLs present in the string, they are later
> replaced with an internal url with paramaters pointing to a logging
> script that loggs and forwards the request to the original url. The
> cleaned string is also used to generate a set of keywords and
> keyphrases from the text supplied.

You see, that whole description is not inherent in the listing you have
posted of your clean function.

> I have based the code below from the "air code" demo, which has also
> not been tested. I have incorporated the contextual tests to only
> remove/replace some characters if they are not in a scpecific context
> (using a URL as an example).
>
> The method below must certainly be a better approach to the function
> linked from this thread, or suggested by Al.

It might indeed be better, but I don't see where this must certainly be so.
Your original function and my "simpler" version never even tried to do the
contextual bit, so saying code that was designed to do so is better is a bit
like saying a hammer is a better tool than a nailfile for nailing things
together.

> What do you think? Also,
> is there a better way to incorporate the contextual tests for each
> illegal character the string?

My guess: yes, probably there is. I just find your code below even harder to
follow than the original clean function. But as implied previously, it seems
odd to have two functions doing two different things but having the same
name.

/Al

Al Dunbar

Mar 5, 2010, 12:56:49 AM

"mayayana" <maya...@nospam.invalid> wrote in message

news:OCgsS8Av...@TK2MSFTNGP05.phx.gbl...

>>
> The method below must certainly be a better approach to the function
> linked from this thread, or suggested by Al. What do you think? Also,
> is there a better way to incorporate the contextual tests for each
> illegal character the string?
>>
>
> I think that's pretty much what I meant in saying
> it's flexible. There's no limit, really. One could even
> call separate functions from within the Select Case.
>
> Parsing URLs
> sounds tricky, but it can be done. For instance, you
> could check each ":" to see if it's part of "http://",
> then get the whole URL and write your edited
> URL to the array. You'd just have to find the end
> of the URL, calculate the offset of the start and end
> characters, and keep track of how many characters
> you've actually written to the array. With edits involved
> you might need to use a bigger array and then Redim
> Preserve it at the end before the Join call.

In my opinion, the use of regular expressions seems likely to be more
efficient than coding all the ifs, ands and buts in vbscript. But sorry, I'm
not a regular expression kind of guy.

/Al

Al Dunbar

Mar 5, 2010, 1:01:39 AM

"James" <webmaste...@gmail.com> wrote in message
news:cc81aa22-8549-43a8...@l12g2000prg.googlegroups.com...

> On Mar 5, 1:55 pm, "mayayana" <mayay...@nospam.invalid> wrote:
>> The method below must certainly be a better approach to the function
>> linked from this thread, or suggested by Al. What do you think? Also,
>> is there a better way to incorporate the contextual tests for each
>> illegal character the string?

<snip>

>
> Thanks Mayayana,
>
> The illegal characters are being removed or replaced as expected. I
> am using a regular expression with the replace function to remove all
> html tags exept for "a" tags (hyperlinks). I am then removing all "a"
> tags so that only the href value is left, which is placed after the
> anchor text in brackets.
>
> The next step I am using the string clean function from the linked
> article (now modified to include suggestions in this thread) to remove
> all special characters from the string except when a component of a
> URL.
>
> The final step, which I am currently working on is to parse the
> cleaned string to replace urls with the internal redirect. It is
> working as expected, but there are some cases where URLs are not
> followed by a space depending on the context in the original string.
> The problem being that there isn't currently a consistent method to
> find the end of each URL. I am working toward adjusting the function
> so that all URLs are contained in square brackets [] once processed
> using the string clean function so that they can be found easily when
> parsing to update the URLs.

So I am curious. What was the purpose of your initial post? To get some
feedback on a script you are trying to develop? Or to advertise a site
containing expertly developed code? Or to get feedback on a site purportedly
containing expertly developed code?

/Al

James

Mar 5, 2010, 1:29:06 AM

Hi Al, Thanks for your wise words. The reason for using the function
in this case is not for filenames, although it was written for this
purpose. Your method using the replace function will not work at all
for what I am trying to achieve. If you read the response to your
question, you will actually see that I agreed with you that the
replace method would be more suitable if all illegal characters
are being processed in the same way (remove all / replace all
occurrences with the same char). As I am removing characters from the
text that are not a component of a url, the replace method in your
function would not be suitable, as it doesn't allow me to test
characters surrounding an illegal character.

> You cannot compare my function as written with your function as modified to
> solve some new problem.

There was no comparison with "some new problem" and your function. I
acknowledged that in the context of the linked article and in response
to your intelligent rhetorical question that your method would be
better. BUT, in the context of the solution I am working towards yours
would not be suitable, which is why I needed to explain the scenario
in more detail.

> Regardless, another knock against your function as posted, if you are
> interested in objective criticism, is that it does not fully document
> itself. The nature of an "illegal character" is somewhat inferred, but not
> fully explained. If the goal is to convert a valid path to a string that
> could be used as a filename, here are a few quirks you appear not to have
> addressed:

The term "illegal characters" is used because that is what the article
and function were originally written for: removing characters that are
illegal in filenames. This doesn't mean that the function can only
ever be used to remove characters in filenames. I am not using it for
filenames at all in this case, which makes most of what you have said
irrelevant. Thanks for pointing out this highly important fact.

Sorry that you seem to have gotten your knickers in a knot. If you
are just looking for an argument, then you should find another community
to abuse.

James

mayayana

Mar 5, 2010, 10:12:54 AM

>> I haven't tested the possibilities.
>
> I strongly suspect that the variant thing will
> make most vbscript code less
> efficient than a compiled language, and that
> it might cause the tokenized
> approach to be less efficient than it might be expected to be.
>

There's not much sense in talking about it
if we're all just going to speculate, so I tried
it out. I think you're clearly right. Replace bogs
down in compiled code, but the reverse is the
case with VBS. And a different-length replacement
string doesn't seem to affect the results to
speak of.

While tokenizing provides a very nice way to do a very
complex operation on a string, it doesn't come
close to Replace for speed.

I tried your function, my numeric tokenizer, and
a tokenizer that left each character as a string.
Testing a few large HTML files I found that the
numeric tokenizer was slightly faster than the
string tokenizer, but the Replace method was
about 10 times as fast.

Dim Arg, FSO, TS, s1, i1, i2, s2
Arg = WScript.arguments(0)

Set FSO = CreateObject("Scripting.FileSystemObject")
Set TS = FSO.OpenTextFile(Arg, 1)
s1 = TS.ReadAll
TS.Close
Set TS = Nothing

i1 = timer
s2 = CleanTok(s1)
i2 = timer
MsgBox "Time for tokenize: " & (i2 - i1) * 1000 & " ms"

i1 = timer
s2 = CleanTokS(s1)
i2 = timer
MsgBox "Time for tokenizeS: " & (i2 - i1) * 1000 & " ms"

i1 = timer
s2 = CleanRep(s1)
i2 = timer
MsgBox "Time for replace: " & (i2 - i1) * 1000 & " ms"

Set FSO = nothing

Function CleanRep(strtoclean)
    Dim strtemp, badchars, badchar, goodchar
    strtemp = strtoclean
    ' The list is split over two lines with the " _" continuation marker.
    badchars = Array("?", "/", "\", ":", "*", """", "<", ">", ",", "&", _
                     "#", "~", "%", "{", "}", "+", "_", ".")

    For Each badchar In badchars
        Select Case badchar
            Case "&": goodchar = " and "
            Case ":": goodchar = "-"
            Case Else: goodchar = " "
        End Select

        strtemp = Replace(strtemp, badchar, goodchar)
    Next
    CleanRep = strtemp
End Function

Function CleanTokS(sIn)
    Dim i2, Char, A1()

    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        Char = Mid(sIn, i2, 1)
        Select Case Char
            Case "?", "/", "\", ":", "*", """", "<", ">", ",", "&", "#", "~", _
                 "%", "{", "}", "+", "_", "."
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Char
        End Select
    Next
    CleanTokS = Join(A1, "")
End Function

Function CleanTok(sIn)
    Dim i2, iChar, A1()

    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        ' Character codes to replace; the repeated 43 and 46 from the
        ' original list are dropped since they change nothing.
        Select Case iChar
            Case 63, 47, 92, 58, 42, 60, 62, 44, 46, 43, 126, 37, 123, 125, 95
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next

    CleanTok = Join(A1, "")
End Function

Al Dunbar

Mar 5, 2010, 10:26:08 PM

"James" <webmaste...@gmail.com> wrote in message
news:1a1afd8b-2ac6-459a...@g8g2000pri.googlegroups.com...

> Hi Al, Thanks for your wise words. The reason for using the function
> in this case is not for filenames, although it was written for this
> purpose. Your method using the replace function will not work at all
> for what I am trying to achieve. If you read the response to your
> question, you will actually see that I agreed with you that the
> replace method would be more suitable if all illegal characters
> are being processed in the same way (remove all / replace all
> occurrences with the same char). As I am removing characters from the
> text that are not a component of a url, the replace method in your
> function would not be suitable, as it doesn't allow me to test
> characters surrounding an illegal character.

I think we are talking at cross-purposes here. I have been comparing my
replace-based version of your "clean" function with your version. I have not
been saying that one should use replace or that it can be used in every
situation. All I have been saying is that if you have two functions that
produce identical results, the better choice is usually the simpler of the
two.

I misread you as representing your "clean" function as one that you were
making available for others to use, as-is, as an example of a well-written
function. I did not anticipate that this thread would evolve into a
discussion of an application for which neither version of the function would
suffice, but one that would need to be adapted.

>> You cannot compare my function as written with your function as modified
>> to
>> solve some new problem.
>
> There was no comparison with "some new problem" and your function.

Thanks for putting me straight on that. This goes to my upthread comment
about talking at cross-purposes.

> I
> acknowledged that in the context of the linked article and in response
> to your intelligent rhetorical question that your method would be
> better. BUT, in the context of the solution I am working towards yours
> would not be suitable, which is why I needed to explain the scenario
> in more detail.

I never suggested that my version of your function would do anything
different than it does. But at least I think I am starting to understand
where you are coming from...

>> Regardless, another knock against your function as posted, if you are
>> interested in objective criticism, is that it does not fully document
>> itself. The nature of an "illegal character" is somewhat inferred, but
>> not
>> fully explained. If the goal is to convert a valid path to a string that
>> could be used as a filename, here are a few quirks you appear not to have
>> addressed:
>
> The term "illegal characters" is used because that is what the article
> and function was originally written for removing characters that are
> illegal in filenames. This doesn't mean that the function can only
> ever be used to remove characters in filenames. I am not using it for
> filenames at all in this case, which makes most of what you have said
> irrelevant. Thanks for pointing out this highly important fact.

Not so important a fact, just a comment made with constructive intent on the
assumption that you were, indeed, looking for comment.

> Sorry that you seem to have gotten your knickers in a knot. If you are
> just looking for an argument, then you should find another community
> to abuse.

If my knickers were in a knot over this teapot tempest (which they aren't)
that would be my fault, not yours. I apologize for seeming to be taking an
abusive approach here, as that was truly not my intent.

/Al

James

Mar 6, 2010, 11:04:45 PM

Thanks mayayana, that has cleared things up a lot.

I have been trying to achieve the same thing using regular expressions
which seem to have similar speeds to the Replace function, but are not
always consistent. I think this could be quite a good method, as it
avoids using the loop for each of the characters being removed or
replaced, and I should be able to incorporate the conditions required
to only remove/replace characters if not a component of a url.

Function CleanRepReg (strtoclean)
strtemp = strtoclean

Dim objRegExp, strOutput
Set objRegExp = New Regexp

objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.Pattern = "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"
strOutput = objRegExp.Replace(strtemp, "-")

objRegExp.Pattern = "-+"
strOutput = objRegExp.Replace(strOutput, "-")

CleanRepReg = strOutput

End Function

Note that this also uses a second reg replace to remove duplicate "-"
characters once the initial replace method has been called.

I am having some trouble trying to derive a regular expression that
will remove any ":" or "/" characters from the text that are not part
of a url. I am able to remove them if they are part of a URL ( "http:
(.)*" ) or similar, but nothing happens if I try to do the reverse
using the "not" operator ( ! ) in the regular expression.

Is this possible using the replace method of a regular expression
object in VBScript?

Thanks

James

mayayana

Mar 6, 2010, 11:29:27 PM

> Is this possible using the replace method of a regular expression
> object in VBScript?

Hopefully one of the RegExp fans will show
up to answer that. I don't have the patience
for RegExp and never use them.

Evertjan.

Mar 7, 2010, 2:14:19 AM

mayayana wrote on 07 mrt 2010 in microsoft.public.scripting.vbscript:

>>
> Is this possible using the replace method of a regular expression
> object in VBScript?

"The" regular expression object.

Yes, it is possible to use that.

> Hopefully one of the RegExp fans will show
> up to answer that.

Given above.

> I don't have the patience
> for RegExp and never use them.

So why do you ask and hope?

Seems you are wasting our time, impatient one!

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

mayayana

Mar 7, 2010, 9:06:50 AM

>
> So why do you ask and hope?
>
> Seems you are wasting our time, impatient one!
>

That wasn't James. I said that. See his post
above for the question.

Evertjan.

Mar 7, 2010, 5:43:43 PM

mayayana wrote on 07 mrt 2010 in microsoft.public.scripting.vbscript:

There is no post above in usenet.

mayayana

Mar 7, 2010, 10:36:18 PM

> > That wasn't James. I said that. See his post
> > above for the question.
>
> There is no post above in usenet.
>

You don't see a post above? That doesn't mean
it's not there. :)
You seem to be using XNews but also
have a reference to Google Groups in your header.
If you're actually reading via Google you might want
to get a real newsreader. If that's not the problem
then it may be the MS server. It goes wacky periodically.
I sometimes see an Re post that appears to be the
original myself. I don't know why that happens. In
any case, there are about 8 posts going back in this
sub-thread and about 20 in the entire thread.

Evertjan.

Mar 8, 2010, 11:57:29 AM

mayayana wrote on 08 mrt 2010 in microsoft.public.scripting.vbscript:

>
>> > That wasn't James. I said that. See his post
>> > above for the question.
>>
>> There is no post above in usenet.
>>
>
> You don't see a post above? That doesn't mean
> it's not there. :)

It is not there per definition.

See below [and I do not mean in a follow up posting!!]

> You seem to be using XNews but also
> have a reference to Google Groups in your header.

Impossible, not in MY header.
Possibly your reader adds it?

> If you're actually reading via Google you might want
> to get a real newsreader.

Oh come off it! I have been using Luu Tran's Xnews for years and would
not be seen using G-groups, except sporadically as a bad Dejavu substitute.

> If that's not the problem
> then it may be the MS server. It goes wacky periodically.

Earlier postings in usenet are not "above", quoting is "above".

> I sometimes see an Re post that appears to be the
> original myself.

Eh?

So you are the original post yourself? ;-)
And such repost is being a mirror of yourself?

> I don't know why that happens. In
> any case, there are about 8 posts going back in this
> sub-thread and about 20 in the entire thread.

However calling them "above" indicates something else,
and at least stipulates all those postings are [still] available on the
news server of all recipients, a stipulation that is false by usenet
standards.

James

Mar 8, 2010, 6:19:14 PM

Hi Evertjan,

In my post referred to by mayayana, I was asking if it is possible to
perform the same operation as the replace function to remove special
characters from a string. I have written the function below which
uses the RegExp object to replace characters from the input string.
The function works, but the reason for implementing using regular
expressions is to incorporate conditions into the expression so that
some special characters remain after calling the replace method of the
Regexp object.

I am trying to remove all special characters detailed in the pattern,
if they are not a component of a url. For example, I wish to remove
all colons ( : ), but not when used in a url.

I have tried a few things, including replacing ":" that are part of a
url, but when I include the not operator ( ! ), the expression doesn't
remove any characters at all (no part of the expression equates to
true??)

The following pattern matches and removes any instances of the special
characters:

"(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"

I have tried the following and similar without success:
"(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\|)|((http)!:)"

"(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||(http)!:)"

Function CleanRepReg (strtoclean)
strtemp = strtoclean

Dim objRegExp, strOutput
Set objRegExp = New Regexp

objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.Pattern = "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"
strOutput = objRegExp.Replace(strtemp, "-")

objRegExp.Pattern = "-+"
strOutput = objRegExp.Replace(strOutput, "-")

CleanRepReg = strOutput

End Function

Thanks

James

Evertjan.

Mar 9, 2010, 6:25:49 AM

James wrote on 09 mrt 2010 in microsoft.public.scripting.vbscript:

> In my post referred to by mayayana, I was asking if it is possible to
> perform the same operation as the replace function to remove special
> characters from a string. I have written the function below which
> uses the RegExp object to replace characters from the input string.
> The function works, but the reason for implementing using regular
> expressions is to incorporate conditions into the expression so that
> some special characters remain after calling the replace method of the
> Regexp object.
>
> I am trying to remove all special characters detailed in the pattern,
> if they are not a component of a url. For example, I wish to remove
> all colons ( : ), but not when used in a url.
>
> I have tried a few things, including replacing ":" that are part of a
> url, but when I include the not operator ( ! ), the expression doesn't
> remove any characters at all (no part of the expression equates to
> true??)
>
> The following pattern matches and removes any instances of the special
> characters:
> "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"

> objRegExp.Pattern =

> "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"

Why the outer ()?
No need, unless you refer to the match in the replace string.
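
To illustrate the remark about the outer parentheses: a capturing group only earns its keep when the replacement string refers back to it as $1. A minimal sketch, where the function name WrapSpecials and the three-character pattern are made up for the example and are not part of James's function:

```vbscript
' Hypothetical helper: wraps each matched character in square brackets.
' The parentheses are needed only because the replacement string uses $1.
Function WrapSpecials(sIn)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "([?*:])"                 ' group 1 captures the matched character
    WrapSpecials = re.Replace(sIn, "[$1]") ' $1 re-inserts what group 1 matched
End Function

WScript.Echo WrapSpecials("a?b:c")         ' a[?]b[:]c
```

If the replacement string is a fixed "-" as in CleanRepReg, nothing refers to the group, so the parentheses can go.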

> objRegExp.IgnoreCase = True

Why this? There are no literal a-z characters in your search regex.

Let's parse your string, I see 4 errors:

\?|
\*|
\""| --> do you mean \"\"

,|
\\|
<|
>|
&|
#|
~|
%|

{| --> \{|
}| --> \}|
\+|
_|
\.|
@|
\||
:|
/ --> \/

This probably would be just as effective

objRegExp.Pattern = "[(?*"",\\<>&#~%{}+_.@|:\/]+"

> objRegExp.Pattern = "-+"
> strOutput = objRegExp.Replace(strOutput, "-")

Do you mean any number of - should be replaced by only one?

Then probably try:
objRegExp.Pattern = "\-+"

> i have tried the following and similar without success:
> "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\|)|((http)!:)"

What does the !: do in "((http)!:)" ?

Do you want to remove "http:" ?
Or do you mean some noncapturing match gone wrong?
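
For reference, the "not" being reached for here is written as a negative lookahead, (?!...), in the VBScript RegExp syntax; a bare ! is just a literal exclamation mark. A small sketch, assuming the goal is only to drop colons that do not start "://" (the function name StripLooseColons is invented for the example):

```vbscript
' Hypothetical helper: removes every ":" that is NOT followed by "//".
Function StripLooseColons(sIn)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = ":(?!//)"     ' colon, then assert that "//" does not follow
    StripLooseColons = re.Replace(sIn, "")
End Function

WScript.Echo StripLooseColons("note: see http://example.com")
' note see http://example.com
```

This only looks forward. The VBScript RegExp object has no lookbehind, so the "/" characters inside a URL cannot be protected the same way and need a different approach.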

try (now you probably need IgnoreCase = True) :

"[(?*"",\\<>&#~%{}+_.@|:\/]|http"

================================

I prefer testing regex in javascript on Google Chrome,
sorry VBS-fans,
and did:

James

Mar 9, 2010, 6:08:30 PM

Thanks Evertjan,

What i am working towards, is to be able to parse text which may
contain html tags and urls explicitly in the text to remove special
characters and reformat hyperlinks. Any web addresses found in the
text should remain, even though the url contains some of the
characters being removed. This is what I was trying to achieve using
the regular expression (remove all : and / chars except from urls).

A tags are stripped, leaving the href value in brackets with padded
spaces after the anchor text of the original link. At this point, I
am trying to have all hyperlinks padded with spaces so they can be
easily identified later in the script.

I should also note that i have adjusted the expression to remove the
characters instead of replacing with a "-". It is currently removing
all characters specified in the expression even when a component of a
url. If possible, the expression needs to match any of the characters
specified, but not when part of a url.

The following demonstrates what I am trying to achieve:

"text. text, tex*t text: http://www.google.com text text" should
become "text text text text http://www.google.com text text"
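
One way this requirement could be met with a single Replace call is to let the pattern match either a whole URL (captured) or a single special character, and replace every match with "$1". A URL match is put back unchanged, while a special character is replaced by an empty group. This is only a sketch: the function name CleanKeepUrls is invented, and it assumes a URL starts with http:// or https:// and runs to the next whitespace.

```vbscript
' Hypothetical variant of CleanRepReg that leaves URLs untouched.
Function CleanKeepUrls(sIn)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    ' Alternative 1 (group 1): a URL, up to the next whitespace.
    ' Alternative 2: any one of the special characters.
    re.Pattern = "(https?://\S+)|[?*"",\\<>&#~%{}+_.@|:/]"
    ' $1 is the URL when alternative 1 matched, and empty otherwise.
    CleanKeepUrls = re.Replace(sIn, "$1")
End Function

WScript.Echo CleanKeepUrls("text. text, tex*t text: http://www.google.com text text")
```

Punctuation glued to the end of a URL (for example a full stop closing a sentence) would be kept as part of the URL under this assumption.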

Thanks

James

Evertjan.

Mar 10, 2010, 6:25:45 PM

James wrote on 10 mrt 2010 in microsoft.public.scripting.vbscript:

> What i am working towards, is to be able to parse text which may
> contain html tags and urls explicitly in the text to remove special
> characters and reformat hyperlinks.

I was just showing how to use regex, not how to make your specification work,
as it is much more fun and educative to try it yourself.

mayayana

Mar 10, 2010, 10:49:31 PM

Oh well. Maybe someone else will help. :)

If you end up tokenizing you should be able
to just trap ":". Then check for "http://" around
that. Then get up to the next space. Insert
that string into your filtered version and carry
on from there.
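
A rough sketch of that tokenizing idea follows. It is a slight variation: instead of trapping ":" and looking around it, it checks for "http://" at the current position, copies everything up to the next space untouched, and filters the rest one character at a time. The function name CleanTokUrl and the BAD constant are made up for the example.

```vbscript
' Hypothetical tokenizer: drops special characters but copies URLs verbatim.
Function CleanTokUrl(sIn)
    Const BAD = "?/\:*""<>,&#~%{}+_."    ' characters to remove outside URLs
    Dim i, iEnd, ch, sOut
    i = 1
    sOut = ""
    Do While i <= Len(sIn)
        If LCase(Mid(sIn, i, 7)) = "http://" Then
            ' A URL starts here: copy it up to the next space (or end of text).
            iEnd = InStr(i, sIn, " ")
            If iEnd = 0 Then iEnd = Len(sIn) + 1
            sOut = sOut & Mid(sIn, i, iEnd - i)
            i = iEnd
        Else
            ' Ordinary text: keep the character unless it is in the BAD list.
            ch = Mid(sIn, i, 1)
            If InStr(BAD, ch) = 0 Then sOut = sOut & ch
            i = i + 1
        End If
    Loop
    CleanTokUrl = sOut
End Function

WScript.Echo CleanTokUrl("text. text, tex*t text: http://www.google.com text text")
```

For big files the repeated string concatenation will be slow; filling an array and calling Join, as in the CleanTok functions earlier in the thread, would likely be faster.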

If you use RegExp you could still run a tokenizer
for the URLs. The only other method that comes
to mind is a simple Instr loop.

The people who like to use RegExp tend to be
a small but fervent crowd. Since nobody has shown
up with an answer I'm guessing that the particular
job of dealing with the URLs doesn't lend itself to
RegExp.

--------------

"text. text, tex*t text: http://www.google.com text text" should
become "text text text text http://www.google.com text text"

--------------

James

Mar 11, 2010, 5:31:53 PM

On Mar 11, 2:49 pm, "mayayana" <mayay...@nospam.invalid> wrote:
> Oh well. Maybe someone else will help. :)
>
> If you end up tokenizing you should be able
> to just trap ":". Then check for "http://" around
> that. Then get up to the next space. Insert
> that string into your filtered version and carry
> on from there.
>
> If you use RegExp you could still run a tokenizer
> for the URLs. The only other method that comes
> to mind is a simple Instr loop.
>
> The people who like to use RegExp tend to be
> a small but fervent crowd. Since nobody has shown
> up with an answer I'm guessing that the particular
> job of dealing with the URLs doesn't lend itself to
> RegExp.
>
> --------------

> "text. text, tex*t text: http://www.google.com text text" should
> become "text text text text http://www.google.com text text"
> --------------

Hi mayayana,

I will have another look at this over the next few days and will post
my solution once found. I am breaking the process down into smaller
steps which should make it a little easier to get it working, but
won't be the most efficient solution. Once I have a working method, I
will then attempt to consolidate the steps.

Thanks

Al Dunbar

Mar 11, 2010, 9:37:28 PM

"mayayana" <maya...@nospam.invalid> wrote in message

news:#tH9c2Mw...@TK2MSFTNGP04.phx.gbl...

> Oh well. Maybe someone else will help. :)
>
> If you end up tokenizing you should be able
> to just trap ":". Then check for "http://" around
> that. Then get up to the next space. Insert
> that string into your filtered version and carry
> on from there.
>
> If you use RegExp you could still run a tokenizer
> for the URLs. The only other method that comes
> to mind is a simple Instr loop.
>
> The people who like to use RegExp tend to be
> a small but fervent crowd. Since nobody has shown
> up with an answer I'm guessing that the particular
> job of dealing with the URLs doesn't lend itself to
> RegExp.

Possibly. Or, perhaps more likely, they are just not attracted to a thread
whose subject line appears to be about the REPLACE function.

/Al

Evertjan.

Mar 12, 2010, 3:43:14 AM

Al Dunbar wrote on 12 mrt 2010 in microsoft.public.scripting.vbscript:

>> The people who like to use RegExp tend to be
>> a small but fervent crowd. Since nobody has shown
>> up with an answer I'm guessing that the particular
>> job of dealing with the URLs doesn't lend itself to
>> RegExp.
>
> Possibly. Or, perhaps more likely, they are just not attracted to a
> thread whose subject line appears to be about the REPLACE function.

Since a NG is not a helpdesk, many of us expect the OP to do most of the
work, like learning some Regex himself.

Helping out is not the same as taking over the whole job.
The latter case should perhaps be left to paid professionals.

Al Dunbar

Mar 12, 2010, 7:33:13 PM

"Evertjan." <exjxw.ha...@interxnl.net> wrote in message
news:Xns9D3962E1...@194.109.133.242...

> Al Dunbar wrote on 12 mrt 2010 in microsoft.public.scripting.vbscript:
>
>>> The people who like to use RegExp tend to be
>>> a small but fervent crowd. Since nobody has shown
>>> up with an answer I'm guessing that the particular
>>> job of dealing with the URLs doesn't lend itself to
>>> RegExp.
>>
>> Possibly. Or, perhaps more likely, they are just not attracted to a
>> thread whose subject line appears to be about the REPLACE function.
>
> Since a NG is not a helpdesk, many of us expect the OP to do most of the
> work, like learning some Regex himself.

Yeah, it occurred to me that that might be another reason.

Too bad he came to the conclusion that dealing with URL's is out of RegExp's
league, though...

/Al

Evertjan.

Mar 14, 2010, 4:04:34 PM

Perhaps we can help him by helping him conclude
that an url is just an alphanumeric string?

And that regex is perfect for testing,
part matching or manipulating such strings?
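
As a small illustration of those three uses with the VBScript RegExp object, here is a sketch that tests for, lists, and replaces URLs in a string. The helper name UrlRegExp, the sample text, and the simple pattern (http:// or https:// up to the next whitespace) are assumptions for the example, not a full URL grammar.

```vbscript
' Hypothetical helper: returns a RegExp set up to find simple URLs.
Function UrlRegExp()
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = "https?://\S+"
    Set UrlRegExp = re
End Function

Dim re, m, s
s = "see http://www.google.com and http://example.org/page"
Set re = UrlRegExp()

' Testing: is there any URL at all?
If re.Test(s) Then WScript.Echo "contains a URL"

' Part matching: list each URL with its zero-based position.
For Each m In re.Execute(s)
    WScript.Echo m.FirstIndex & ": " & m.Value
Next

' Manipulating: swap every URL for a placeholder.
WScript.Echo re.Replace(s, "[link]")     ' see [link] and [link]
```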

James

Mar 14, 2010, 6:42:41 PM

On Mar 15, 7:04 am, "Evertjan." <exjxw.hannivo...@interxnl.net> wrote:

Thanks Evertjan, and Al, I wouldn't have continued to do this using
RegEx if it wasn't possible. It is my lack of experience using
Regular Expressions that is the problem here, not RegEx. Thankfully,
this can easily be altered with a little research and practice. Once
I become more familiar with using Regular Expressions, this method to
perform the String replace that I require will be quite suitable.

James

Mar 15, 2010, 1:29:11 AM

Ok, it's working now. Thanks for the advice.

James