VBScript, OpenTextFile(), Strings and Unicode

Axel Dahmen

Oct 16, 2007, 6:45:33 PM

Hi,

I've written a VBScript to manually update the version number in an
AssemblyInfo.cs file. But all I get from the ReadAll() function is rubbish.
The string I assign the function result to contains nothing but empty
squares when viewed in the Visual Studio debugger.

I can't even write the string back without modifications into the
AssemblyInfo.cs file. Visual Studio shows only a bunch of Chinese characters
when opening the saved file. (Perhaps the result of a new espionage attempt
by the Axis of Evil? :) )

Moreover, I cannot modify the string. To VBScript the string doesn't seem to
contain anything close to a valid character.

Can anybody help me on how to access and save the Unicode file's content
from within VBScript?

TIA,
Axel Dahmen

--------
Here's my test source code:

' VBScript source code
Const ForReading = 1, ForWriting = 2, TristateTrue = -1
Dim fso,stream,strg,re

Set fso = CreateObject("Scripting.FileSystemObject")
Set stream = fso.OpenTextFile("C:\Temp\AssemblyInfo.cs",ForReading,False,TristateTrue)
strg = stream.ReadAll()
stream.Close
Set stream = Nothing

Set re = New RegExp
re.Global = True
re.IgnoreCase = False
re.Multiline = True
re.Pattern = "\d+\.\d+\.\d+\.\d+"
strg = re.Replace(strg,"9.9.9.9")

Set stream = fso.OpenTextFile("C:\Temp\AssemblyInfo2.cs",ForWriting,True,TristateTrue)
stream.Write(strg)
stream.Close
Set stream = Nothing

Paul Randall

Oct 16, 2007, 7:22:39 PM

"Axel Dahmen" <KeenT...@newsgroups.nospam> wrote in message
news:OexKEYEE...@TK2MSFTNGP03.phx.gbl...

One person's rubbish is another person's treasure :-)

I suspect that the file in question is encoded in some form of Unicode or
some other encoding. Various encodings of HTML files can be displayed
successfully because the name of the encoding is included in the file. I
suppose your AssemblyInfo2.cs file might include encoding information; I
don't know what program reads that file's format.

If you think that you would recognise the name of the encoding, you can try
setting VBScript's locale to various settings and see how the string looks
in a message box. Search the scripting help file for the two-word phrase
locale id, for a list of locale IDs and descriptions.

Microsoft programs that write Unicode files, such as Notepad in WXP, write a
Byte Order Mark (BOM), the two bytes FF FE, at the beginning of the file,
followed by two bytes per character for the text in Notepad. You can get
some clues to what you have by looking at the file with a hexadecimal file
editor.
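
If no hex editor is at hand, the first bytes can also be inspected from script. The following is a minimal sketch, assuming ADODB.Stream is available on the machine; the function name DetectBom is made up for illustration, and the path is the one from the original post.

```vbscript
' Sketch: report which byte order mark (BOM), if any, starts a file.
' DetectBom is a hypothetical helper name, not part of any library.
Option Explicit
Const adTypeBinary = 1

Function DetectBom(path)
    Dim stm, bytes, b1, b2, b3
    b1 = -1 : b2 = -1 : b3 = -1          ' -1 means "byte not present"

    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeBinary              ' read raw bytes, no decoding
    stm.Open
    stm.LoadFromFile path
    If stm.Size > 0 Then
        bytes = stm.Read(3)              ' at most the first three bytes
        If LenB(bytes) >= 1 Then b1 = AscB(MidB(bytes, 1, 1))
        If LenB(bytes) >= 2 Then b2 = AscB(MidB(bytes, 2, 1))
        If LenB(bytes) >= 3 Then b3 = AscB(MidB(bytes, 3, 1))
    End If
    stm.Close

    If b1 = &HEF And b2 = &HBB And b3 = &HBF Then
        DetectBom = "UTF-8 BOM (EF BB BF)"
    ElseIf b1 = &HFF And b2 = &HFE Then
        DetectBom = "UTF-16 little-endian BOM (FF FE)"
    ElseIf b1 = &HFE And b2 = &HFF Then
        DetectBom = "UTF-16 big-endian BOM (FE FF)"
    Else
        DetectBom = "no BOM found"
    End If
End Function

WScript.Echo DetectBom("C:\Temp\AssemblyInfo.cs")
```

A result of "no BOM found" does not identify the encoding by itself; it only rules out the marked variants.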

-Paul Randall

Axel Dahmen

Oct 16, 2007, 7:51:04 PM

Hi Paul,

thanks for your response.

AssemblyInfo.cs files are automatically generated by MS Visual Studio for
.NET projects. They are simple text files containing some project
information.

AssemblyInfo.cs files are plain Unicode files without Unicode signature.
Every Unicode program can read these by guessing the byte order from the
leading zeros at every second byte.

Visual Studio can read loads of encodings, including Unicode files having a
Unicode signature. So saving the file in that format wouldn't cause any
trouble.

The problem is - or seems to be - VBScript. It doesn't seem to handle
Unicode files correctly. In fact, it must be VBScript since I can't even
read and edit the Unicode file's contents.

Any further clues? Your help is much appreciated.

Regards,
Axel Dahmen

----------------------
"Paul Randall" <paul...@cableone.net> wrote in message
news:u9MxetEE...@TK2MSFTNGP03.phx.gbl...

Paul Randall

Oct 17, 2007, 1:12:50 AM

Hi, Axel
You can open a text file as ASCII, as Unicode, or using the system default.
The format argument can have any of the following settings:

Constant            Value  Description
TristateUseDefault  -2     Opens the file using the system default.
TristateTrue        -1     Opens the file as Unicode.
TristateFalse        0     Opens the file as ASCII.

If you lie about a Unicode file with a BOM so that it is read as ASCII, the
first bytes read fine, but at some point what you get is garbage.

I have not tried telling the system to read a Unicode file without a BOM as
ASCII, nor have I tried reading one as Unicode. Reading Unicode files with
a BOM, specifying open as Unicode, has always worked well for me.

Have you tried specifying open as Unicode? Let me know if this works for
you. Also, WXP's Notepad can read and write Unicode. Does Notepad display
your file properly?

-Paul Randall

"Axel Dahmen" <KeenT...@newsgroups.nospam> wrote in message

news:uD0lt8EE...@TK2MSFTNGP06.phx.gbl...

Anthony Jones

Oct 17, 2007, 8:39:43 AM

"Axel Dahmen" <KeenT...@newsgroups.nospam> wrote in message

news:uD0lt8EE...@TK2MSFTNGP06.phx.gbl...

> Hi Paul,
>
> thanks for your response.
>
> AssemblyInfo.cs files are automatically generated by MS Visual Studio for
> .NET projects. They are simple text files containing some project
> information.
>
> AssemblyInfo.cs files are plain Unicode files without Unicode signature.
> Every Unicode program can read these by guessing the byte order from the
> leading zeros at every second byte.
>

Have you used a Hex editor to see for yourself? My AssemblyInfo.cs files
are in UTF-8 encoding.

--
Anthony Jones - MVP ASP/ASP.NET

Paul Randall

Oct 17, 2007, 11:29:52 AM

"Anthony Jones" <A...@yadayadayada.com> wrote in message
news:eNHHMrLE...@TK2MSFTNGP05.phx.gbl...

Now I see the problem. Unicode has more than one meaning. "Unicode
characters" is different from "the encoding named Unicode".

Common Unicode text characters have 65536 possible values. Unicode encoding
maps each of the 65536 text characters as a 16-bit value, which is easy to
understand, but inefficient in many cases from a storage and transmission
point of view.

For many languages, the vast majority of Unicode characters typically used
are mapped to the first 256 values, so it is wasteful to use 16 bits to
store or transmit these 8-bit values. UTF-8 and UTF-7 are two encodings that
represent the first 128 characters as a single byte and the rest of the
Unicode values as multi-byte characters (or something like that - can't
remember the exact details). So UTF-8 encoding can be used instead of
Unicode encoding to represent Unicode text more efficiently.

I think the file system object can only read files as ANSI 8-bit characters,
commonly known as ASCII, or as Unicode 16-bit characters. I think the XML DOM
could be used to read the UTF-8-encoded file and convert it to Unicode
encoding; the script could then make the desired changes using string
functions, and the XML DOM could convert the result back to UTF-8 encoding
and save it back to the file. Perhaps the ADODB stream object could be used
instead of the XML DOM.

-Paul Randall

Axel Dahmen

Oct 19, 2007, 6:04:09 AM

You're right. I've now done some research on UTF-8, UTF-16 and Unicode/ISO
10646 in general.

After some testing in VBScript I have learned that VBScript only supports
UTF-16 and not UTF-8, which on the other hand is widely used by
Visual Studio 2005.

So here's where the problem lies buried:

When opening a file using OpenTextFile() and having the IsUnicode flag set,
VBScript expects the file to be 16-Bit encoded. This yields these Chinese
characters by misinterpreting the originally UTF-8 encoded 8-Bit characters.
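
The arithmetic behind those Chinese characters can be sketched in a few lines. This is only an illustration, assuming the UTF-16 reader takes the bytes in little-endian order (low byte first); the function name MisreadAsUtf16 is made up for this example.

```vbscript
' Sketch: show which UTF-16 code units result when single-byte ASCII text
' is read two bytes at a time, low byte first (little-endian assumption).
Option Explicit

Function MisreadAsUtf16(s)
    Dim i, lo, hi, result
    result = ""
    ' Walk the string in byte pairs; an odd trailing byte is ignored.
    For i = 1 To Len(s) - 1 Step 2
        lo = Asc(Mid(s, i, 1))       ' first byte becomes the low half
        hi = Asc(Mid(s, i + 1, 1))   ' second byte becomes the high half
        result = result & "U+" & Right("0000" & Hex(hi * 256 + lo), 4) & " "
    Next
    MisreadAsUtf16 = Trim(result)
End Function

WScript.Echo MisreadAsUtf16("using")   ' prints: U+7375 U+6E69
```

Both values fall in the CJK Unified Ideographs range (U+4E00 to U+9FFF), which is why ordinary source text such as "using" shows up as Chinese characters.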

When saving a file as Unicode, VBScript simply prepends the file content
with a BOM, without doing any further re-encoding to the content. Since the
file now has a BOM, other programs interpret the file content to be UTF-16
as well, though in fact it isn't. Thus they are displaying these Chinese
characters erroneously.

My solution: I open the files as ASCII, change what I need to change, and
save them back as ASCII. Since I'm not searching for any non-ASCII
characters, this works for me. If I were searching for, e.g., umlaut
characters instead, I'd have to do the UTF-8 math myself to find the
corresponding characters in the VBS string.
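
A sketch of that workaround, reusing the paths and the pattern from the first post. It assumes a single-byte system code page, so that every byte survives the round trip unchanged, and it only touches ASCII text (digits and dots).

```vbscript
' Sketch: edit a UTF-8 file as "ASCII" (really the current code page).
' Safe only because the text searched for and inserted is pure ASCII;
' assumes a single-byte code page so other bytes pass through unchanged.
Option Explicit
Const ForReading = 1, ForWriting = 2, TristateFalse = 0
Dim fso, stream, text, re

Set fso = CreateObject("Scripting.FileSystemObject")

' Read the whole file byte for byte
Set stream = fso.OpenTextFile("C:\Temp\AssemblyInfo.cs", ForReading, False, TristateFalse)
text = stream.ReadAll()
stream.Close

' Replace every four-part version number
Set re = New RegExp
re.Global = True
re.Pattern = "\d+\.\d+\.\d+\.\d+"
text = re.Replace(text, "9.9.9.9")

' Write the result back the same way it was read
Set stream = fso.OpenTextFile("C:\Temp\AssemblyInfo2.cs", ForWriting, True, TristateFalse)
stream.Write text
stream.Close
```

If the file starts with a UTF-8 BOM, its three bytes are read as three ordinary characters and written back untouched, so the output keeps the same signature as the input.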

Sad, but true: FileSystemObject currently is rather limited on Unicode/ISO
10646 files.

MS should drop the ASCII option in favour of UTF-8. There won't be any loss,
only additional usability. Codepage files (e.g. LATIN-1) have been
obsolete for quite a few years now.

Regards,
www.axeldahmen.com
Axel Dahmen

Anthony Jones

Oct 19, 2007, 3:51:53 PM

"Axel Dahmen" <KeenT...@newsgroups.nospam> wrote in message

news:OJW9icjE...@TK2MSFTNGP04.phx.gbl...

The term ASCII as used by the FileSystemObject is a misnomer. It really
means the current codepage. Hence 'ASCII' would still be needed.

I doubt MS will add UTF-8 support to FileSystemObject; it's pretty much coming
to the end of its lifecycle.

As Paul touched on, you can use the ADODB.Stream object to read and write
UTF-8 files.
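
A minimal sketch of the ADODB.Stream approach, applied to the version replacement from the first post. The helper names ReadUtf8 and WriteUtf8 are made up for illustration; the paths and the pattern come from the original script.

```vbscript
' Sketch: read a UTF-8 file into a VBScript string, edit it, and save it
' as UTF-8 again. ReadUtf8 and WriteUtf8 are hypothetical helper names.
Option Explicit
Const adTypeText = 2
Const adSaveCreateOverWrite = 2

Function ReadUtf8(path)
    Dim stm
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeText
    stm.Charset = "utf-8"        ' decode the file's bytes as UTF-8
    stm.Open
    stm.LoadFromFile path
    ReadUtf8 = stm.ReadText      ' reads to the end of the stream
    stm.Close
End Function

Sub WriteUtf8(path, text)
    Dim stm
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeText
    stm.Charset = "utf-8"        ' encode the string as UTF-8 on save
    stm.Open
    stm.WriteText text
    stm.SaveToFile path, adSaveCreateOverWrite
    stm.Close
End Sub

Dim re, text
text = ReadUtf8("C:\Temp\AssemblyInfo.cs")

Set re = New RegExp
re.Global = True
re.Pattern = "\d+\.\d+\.\d+\.\d+"

WriteUtf8 "C:\Temp\AssemblyInfo2.cs", re.Replace(text, "9.9.9.9")
```

Unlike the ASCII workaround, the string here holds real characters, so searching for umlauts or other non-ASCII text works as expected. As far as I know, ADODB.Stream writes a UTF-8 BOM at the start of the saved file, which Visual Studio reads without trouble.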

Alternatively, you could write a .NET console app instead of a VBScript.
