Non English string ?

Phil Hunt

Feb 12, 2010, 1:20:15 PM

What is the best way to determine whether a string contains "non-English"
characters?
TIA

Jeff Johnson

Feb 12, 2010, 1:30:05 PM

"Phil Hunt" <a...@aaa.com> wrote in message
news:unthJABr...@TK2MSFTNGP02.phx.gbl...

> What is the best way to determine whether a string contains "non-English"
> characters?

That's not an easy question to answer. Consider the word "résumé." It's an
English word (taken from French) but it contains an accented character that
is not "native" to English. If your code encountered that word, would you
want it to judge that it contains a "non-English character"?

Phil Hunt

Feb 12, 2010, 1:45:11 PM

OK. Forget French for a moment. How can I tell if the string contains
"East Asian" characters?

"Jeff Johnson" <i....@enough.spam> wrote in message
news:eO%236mFBr...@TK2MSFTNGP02.phx.gbl...

> "Phil Hunt" <a...@aaa.com> wrote in message
> news:unthJABr...@TK2MSFTNGP02.phx.gbl...
>
>> What is the best way to determine whether a string contains "non-English"
>> characters?
>

> That's not an easy question to answer. Consider the word "résumé." It's an

Helmut Meukel

Feb 12, 2010, 2:33:11 PM

"Phil Hunt" <a...@aaa.com> schrieb im Newsbeitrag
news:OthUFOBr...@TK2MSFTNGP05.phx.gbl...

Let's start with the code table.
Characters in strings are just byte or integer values.
In the old DOS days, ASCII (IIRC: American Standard Code for Information
Interchange) was used: 7 data bits + 1 parity bit.
IBM created Extended ASCII (8 data bits, no parity bit) and used
the doubled capacity to encode some European characters and graphic
characters (card symbols, lines...).
This extended ASCII finally became Code Page 437. Other code
pages like 850 (multilingual) and 865 (Scandinavian) used the same
code values for different characters. My first Vectra PC used the
Roman8 character set, also used by HP's 250, 1000 and 3000
systems.
With Windows, Microsoft switched to ANSI, still 8 bits, and
finally to Unicode (16 bits).

So first you have to know how your text is encoded, to determine
which codes are used for East Asian characters.
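
For text that is already held in a VB String (UTF-16 internally), such a
check might be sketched as below. HasEastAsian is a hypothetical helper
name, and the ranges are an assumption about what counts as "East Asian":
only the Hiragana, Katakana, CJK Unified Ideographs and Hangul Syllables
blocks are tested, so the list is not exhaustive.

```vb
Option Explicit

' Hypothetical helper: True if s contains at least one character from
' a few common East Asian Unicode blocks (assumed, non-exhaustive list).
Public Function HasEastAsian(ByVal s As String) As Boolean
    Dim i As Long
    Dim cp As Long

    For i = 1 To Len(s)
        ' AscW returns a signed Integer; the mask widens it to 0..65535
        cp = AscW(Mid$(s, i, 1)) And &HFFFF&
        Select Case cp
            Case &H3040& To &H30FF&, _
                 &H4E00& To &H9FFF&, _
                 &HAC00& To &HD7AF&
                ' Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables
                HasEastAsian = True
                Exit Function
        End Select
    Next
End Function
```

The trailing & on the hex literals matters: without it, a literal such as
&H9FFF is read as a negative Integer and the range comparison goes wrong.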

HTH.

Helmut.

Phil Hunt

Feb 12, 2010, 3:24:10 PM

Thanks. So basically I have to examine the bit patterns to decide.
I understand ASCII; it is Unicode I have some trouble with. I know
it is 16 bits instead of 8. But in the VB debug window, I have never been able
to see a 16-bit character; maybe it does not display on the screen. Do you
know what I am talking about?
For the character 'A', how can I see the full 16-bit pattern in VB?

"Helmut Meukel" <NoS...@NoProvider.de> wrote in message
news:uW6t4oBr...@TK2MSFTNGP06.phx.gbl...

Jeff Johnson

Feb 12, 2010, 3:55:27 PM

"Phil Hunt" <a...@aaa.com> wrote in message

news:Oc$RZFCrK...@TK2MSFTNGP02.phx.gbl...

> Thanks. So basically I have to examine the bit patterns to decide.
> I understand ASCII; it is Unicode I have some trouble with. I know
> it is 16 bits instead of 8. But in the VB debug window, I have never been
> able to see a 16-bit character; maybe it does not display on the screen.
> Do you know what I am talking about?
> For the character 'A', how can I see the full 16-bit pattern in VB?

I believe you can use the AscW() function to find this. If you get a value
back > 255, I'd say you can safely assume it's a non-English character.
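
A rough sketch of how the 16-bit value of each character could be
inspected with AscW(). CodeUnit and DumpCodeUnits are hypothetical names
used only for illustration. One thing to watch is that AscW() returns a
signed Integer, so values from &H8000 upward come back negative unless
they are masked.

```vb
Option Explicit

' Hypothetical helper: the UTF-16 code unit of a one-character string
' as an unsigned value in the range 0..65535.
Public Function CodeUnit(ByVal ch As String) As Long
    ' The mask converts the signed Integer from AscW into a positive Long
    CodeUnit = AscW(ch) And &HFFFF&
End Function

' Hypothetical helper: print position and 4-digit hex code of every
' character of s to the Immediate window.
Public Sub DumpCodeUnits(ByVal s As String)
    Dim i As Long

    For i = 1 To Len(s)
        Debug.Print i, Right$("000" & Hex$(CodeUnit(Mid$(s, i, 1))), 4)
    Next
End Sub
```

Calling DumpCodeUnits "A" from the Immediate window should show 0041 for
the letter A, i.e. a high byte of 00 and a low byte of 41.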

Phil Hunt

Feb 12, 2010, 4:09:45 PM

Thanks. It is good enough for me. I always wondered what AscW was used for.

"Jeff Johnson" <i....@enough.spam> wrote in message

news:%23mT%231WCrK...@TK2MSFTNGP06.phx.gbl...

Jeff Johnson

Feb 12, 2010, 5:15:47 PM

"Phil Hunt" <a...@aaa.com> wrote in message

news:%23pHG3eC...@TK2MSFTNGP02.phx.gbl...

> Thanks. It is good enough for me. I always wondered what AscW was used for.

The W is for "wide," which is an old term for characters stored in 2 bytes
instead of one. (They're twice as wide!)

Jim Mack

Feb 12, 2010, 5:28:57 PM

Not even close. To prove that, in the Immediate Window:

For Idx = 128 to 160: ? Idx, AscW(Chr(Idx)) :Next

--
Jim Mack
Twisted tees at http://www.cafepress.com/2050inc
"We sew confusion"

Phil Hunt

Feb 12, 2010, 5:52:41 PM

OK, I'll make it > 128

"Jim Mack" <jm...@mdxi.nospam.com> wrote in message
news:uYM5FLD...@TK2MSFTNGP04.phx.gbl...

Rick Rothstein

Feb 12, 2010, 5:59:47 PM

Assuming any character above ASCII 255 in a text string makes the text
non-English, does something like this work (note that there is a space
character after the exclamation point)?

If StringValue Like "*[! -" & Chr$(255) & "]*" Then
    ' Non-English characters present
Else
    ' All characters are English
End If

--
Rick (MVP - Excel)

"Jim Mack" <jm...@mdxi.nospam.com> wrote in message
news:uYM5FLD...@TK2MSFTNGP04.phx.gbl...

Jim Mack

Feb 12, 2010, 6:04:41 PM

Phil Hunt wrote:
> OK, I'll make it > 128

Maybe you missed the point. There are quite a few "English Characters"
for which AscW() will return results > 128, or 255, etc.

The same is true of any Ansi character set under Windows, both SBCS
and MBCS.

Phil Hunt

Feb 12, 2010, 6:24:13 PM

Rick,
I think your code would work. I just have to look up 'Like'; I've never used it,
but it seems handy.

Jim,
I think I missed your point. I used to memorize EBCDIC codes. Since I moved
over to the PC, bit patterns have been a thing of the past for me, until now.

Thanks, I'll take a closer look Monday.

"Rick Rothstein" <rick.new...@NO.SPAMverizon.net> wrote in message
news:uDC2NcD...@TK2MSFTNGP06.phx.gbl...

Helmut Meukel

Feb 12, 2010, 7:04:42 PM

Phil,

just run charmap.exe
It shows the hex code of the selected character.
I looked at the Win2000 and the Vista version, in the Vista version
you can select more DOS code pages (extended ASCII).
Both show you Unicode and a variety of Windows code pages (ANSI).

AFAIK, when looking at a non-Unicode text file, you have to _know_
what code page was used to create it. IBM and Micro$oft introduced
code pages with DOS 4.0 but forgot to define anything to distinguish
between text coded with different code pages. Same is true for ANSI.

In Unicode texts with western characters the high byte is usually
Hex00. Hex00FE is the lowercase Icelandic character thorn.
The Dutch ij is usually written as 2 characters, but in Unicode
you can use a single character, Hex0133.
The trademark sign TM is Hex2122, the per mille sign (%o) is Hex2030, the
Peseta sign Pts is Hex20A7, c/o is Hex2105, the Danish/Norwegian
A/S (Aktieselskab) is Hex214D.

HTH

Helmut.

"Phil Hunt" <a...@aaa.com> schrieb im Newsbeitrag

news:Oc$RZFCrK...@TK2MSFTNGP02.phx.gbl...

Jim Mack

Feb 12, 2010, 8:42:37 PM

Rick Rothstein wrote:
> Assuming any characters above ASCII 255 in a text string makes the
> text non-English, then does something like this work (note that is
> a space character after the exclamation point)?

You have to distinguish AscW() results from Asc() results. If you use
AscW(), you will see 'English' characters with codes > 255. Using
Asc() you won't, but you may then qualify some non-English characters
as English (which may be OK depending on the circumstance).

I don't know if Like examines the Unicode characters... if so, then it
will act the way AscW() does and fail some valid characters.

--
Jim

>
> If StringValue Like "*[! -" & chr$(255) & "]*" Then
> ' Non-English characters present
> Else
> ' All text are English characters
> End If
>
>

Rick Rothstein

Feb 13, 2010, 12:11:50 AM

>> Assuming any characters above ASCII 255 in a text string makes the
>> text non-English, then does something like this work (note that is
>> a space character after the exclamation point)?
>
> You have to distinguish AscW() results from Asc() results. If you use
> AscW(), you will see 'English' characters with codes > 255. Using
> Asc() you won't, but you may then qualify some non-English characters
> as English (which may be OK depending on the circumstance).
>
> I don't know if Like examines the Unicode characters... if so, then it
> will act the way AscW() does and fail some valid characters.

I have no idea about this stuff at all. Having only ever worked with US
regional settings, I know next to nothing about the international world of
VB... I only threw the idea out there in case it might work. I was hoping
someone knowledgeable about such things would test it out and see if it
could be used or not.

CY

Feb 13, 2010, 2:01:35 PM

Good ideas, but I've got a concern: a file using only ASCII 32-128 may be
English, but some countries remapped characters, for example [ ] and |, as we do/
did... then it gets a bit cumbersome again ;)

That is, if the file is interpreted by the PC's code page... 850 or 437
(Or was this just in the good old days?)

//CY

Nobody

Feb 13, 2010, 6:29:04 PM

"Phil Hunt" <a...@aaa.com> wrote in message

news:unthJABr...@TK2MSFTNGP02.phx.gbl...

> What is the best way to determine whether a string contains "non-English"
> characters?

I have not developed international applications, but I know more than those
who use one language only.

First, you need to treat a sequence of bytes as an encoded stream of characters
that must be decoded first. You can't assume that every byte is a character
or that every two bytes are one character, because of various encoding schemes
such as Multi-Byte Character Sets (MBCS) and surrogates in UTF-16 (in which
case 4 bytes represent one character). You also can't assume that byte
values in the range 0 to 127 are English only, although in most cases they
are. You have to know how the characters were encoded. For example, some
MBCS encodings use byte values in the range 33 to 126 to encode other characters.

In UTF-32, however, each character is always 4 bytes with a fixed
meaning.

In ANSI and Unicode, 0-127 have a fixed meaning and are one and the same.
In Unicode, 128-255 have a fixed meaning; they follow ISO/IEC 8859-1.
In ANSI, 128-255 have a meaning that depends on the code page (CP) in use. In the
US/Western Europe, Windows uses the "Windows-1252" code page (CP1252).
Characters in the range 160 to 255 are identical to Unicode, but most of the
range 128 to 159 is not. So it's not safe to assume that, in English,
characters in the range 128 to 255 are identical to Unicode.

In VB, strings are stored internally as UTF-16. However, the controls
are ANSI, and when you call API functions an ANSI version of the string is
created (when using ByVal/ByRef As String) and copied back if you used ByRef
As String. To pass Unicode strings to API functions, you must use "ByVal
StrPtr(s)" and in most cases you have to use the W version of the function.

The main API functions used for converting between Unicode and non-Unicode
are WideCharToMultiByte/MultiByteToWideChar, typically with the CP_ACP code
page value, which means use the current ANSI code page.

Also, the Chr() function in VB treats the number you provide as a character code
based on the current system code page, and returns a Unicode character,
while ChrW() doesn't do any transformation and is therefore faster. The same
applies to Asc/AscW. Asc() uses the current system code page, and returns 63
"?" if the character cannot be represented.

Some links:

http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/Multi-byte_character_set

The links above are basically derived from the first link. To answer your
question, visit "Latin characters in Unicode" link above, and check the
ranges that start with Latin and compare it with AscW() value.
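
A minimal sketch of that comparison, assuming the block boundaries below
(Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B and
Latin Extended Additional). IsLatinBlock is a hypothetical name. Note that
these blocks also hold digits, punctuation and control codes, so this
tests "is in a Latin block", not "is a letter".

```vb
Option Explicit

' Hypothetical helper: True if the one-character string ch lies in one
' of the Unicode blocks whose names start with "Latin" (assumed list).
Public Function IsLatinBlock(ByVal ch As String) As Boolean
    ' Mask so that code units from &H8000 upward are not negative
    Select Case AscW(ch) And &HFFFF&
        Case &H0& To &H24F&       ' Basic Latin .. Latin Extended-B
            IsLatinBlock = True
        Case &H1E00& To &H1EFF&   ' Latin Extended Additional
            IsLatinBlock = True
    End Select
End Function
```

Typographic quotes, dashes and the euro sign (codes such as &H2018 or
&H20AC) sit outside these blocks, so a test like this would reject them
even in otherwise plain English text.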

Sample code to show how VB6+SP5 deals with characters in the range 128 to
159 in an English-US based OS(XP+SP2):

Option Explicit

Private Sub Form_Load()
    Dim i As Long
    Dim s As String

    s = ChrW(&H8765&)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))
    s = Chr(&H80)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))

    For i = 0 To 255
        s = Chr(i)
        ' Compare Chr() with ChrW(), and print where they differ
        If s <> ChrW(i) Then
            Debug.Print i, Hex(i), Asc(s), AscB(s), Hex(AscB(s)), _
                        Hex(AscW(s))
        End If
    Next

End Sub

Output:

63 101 -30875 8765
128 172 8364 20AC
128 80 128 172 AC 20AC
130 82 130 26 1A 201A
131 83 131 146 92 192
132 84 132 30 1E 201E
133 85 133 38 26 2026
134 86 134 32 20 2020
135 87 135 33 21 2021
136 88 136 198 C6 2C6
137 89 137 48 30 2030
138 8A 138 96 60 160
139 8B 139 57 39 2039
140 8C 140 82 52 152
142 8E 142 125 7D 17D
145 91 145 24 18 2018
146 92 146 25 19 2019
147 93 147 28 1C 201C
148 94 148 29 1D 201D
149 95 149 34 22 2022
150 96 150 19 13 2013
151 97 151 20 14 2014
152 98 152 220 DC 2DC
153 99 153 34 22 2122
154 9A 154 97 61 161
155 9B 155 58 3A 203A
156 9C 156 83 53 153
158 9E 158 126 7E 17E
159 9F 159 120 78 178

As you notice, when you provide Chr() with characters in the range 128-159
in an English based system, the Unicode characters as shown by AscW do not
necessarily have the same value.

Helmut Meukel

Feb 13, 2010, 6:45:24 PM

"CY" <chri...@gmail.com> schrieb im Newsbeitrag
news:8769612d-7558-468a...@o3g2000yqb.googlegroups.com...

That was before extended ASCII, in the days of 7-bit ASCII, or
when communicating with other computers where you needed the
parity bit to check for transmission errors.
In 7-bit ASCII some codes were used for national characters:
the US characters | [ { ] } and some others could be replaced
with specific characters of the national language.
Mind, the same code values were used for German umlauts
(ä, ö, ü, ...), Scandinavian characters (æ, ø, å, ...), French
accents and so on. So you had to know the language
of the text to get it right.

IBM's extended ASCII contained some of those national
characters above 127 but not enough, they used most code
values for graphical characters. This was the character set
later known as Codepage 437. Codepage 850 contains fewer
graphical characters and more national characters.

Startup CharMap.exe, and you can see the differences.
DOS: USA is Codepage 437 and DOS: Western Europe
is Codepage 850.

Helmut.

Jeff Johnson

Feb 15, 2010, 9:16:39 AM

"Jim Mack" <jm...@mdxi.nospam.com> wrote in message
news:uYM5FLD...@TK2MSFTNGP04.phx.gbl...

>>> Thanks. I basically have to examine the bit patterns to determine.

>>> I understand ASCII; it is Unicode I have some trouble
>>> with. I know it is 16 bits instead of 8. But in the VB debug window,
>>> I have never been able to see a 16-bit character; maybe it does
>>> not display on the screen. Do you know what I am talking about?
>>> For the character 'A', how can I see the full 16-bit pattern in
>>> VB?
>>
>> I believe you can use the AscW() function to find this. If you get
>> a value back > 255, I'd say you can safely assume it's a
>> non-English character.
>
> Not even close.

I never said if it's 255 or less it's guaranteed to be an English character,
I said if it's ABOVE 255 it's pretty much guaranteed to NOT be an English
character. There is a difference.

Jim Mack

Feb 15, 2010, 11:09:59 AM

And yet the first assertion is mostly correct and the second isn't.

Which the code snippet clearly shows: AscW() produces more than a
dozen results >255 for characters that would be considered English
since they're in the '1033' character set.

Jeff Johnson

Feb 15, 2010, 3:21:51 PM

"Jim Mack" <jm...@mdxi.nospam.com> wrote in message

news:%23fj4Tll...@TK2MSFTNGP06.phx.gbl...

>>>> I believe you can use the AscW() function to find this. If you get
>>>> a value back > 255, I'd say you can safely assume it's a
>>>> non-English character.
>>>
>>> Not even close.
>>
>> I never said if it's 255 or less it's guaranteed to be an English
>> character, I said if it's ABOVE 255 it's pretty much guaranteed to
>> NOT be an English character. There is a difference.
>
> And yet the first assertion is mostly correct and the second isn't.
>
> Which the code snippet clearly shows: AscW() produces more than a
> dozen results >255 for characters that would be considered English
> since they're in the '1033' character set.

Ahhhh, I see where you're going with this. And it makes me realize I was
unclear (or rather, I was making an assumption based on what I thought the
poster wanted). I interpreted "non-English CHARACTER" to mean "non-English
LETTER." Almost everything in the range of your example code was not a
letter but rather some form of punctuation, and I don't consider punctuation
to be language-specific.

Jeff Johnson

Feb 15, 2010, 3:23:44 PM

"Jim Mack" <jm...@mdxi.nospam.com> wrote in message

news:%23fj4Tll...@TK2MSFTNGP06.phx.gbl...

> Which the code snippet clearly shows: AscW() produces more than a
> dozen results >255 for characters that would be considered English
> since they're in the '1033' character set.

[Sent the previous reply too soon.]

Your comment about how they would be considered English since they're in the
1033 code page (or whatever that is) is actually the crux of my first reply.
DOES the poster actually consider all of those to be English?

Phil Hunt

Feb 15, 2010, 4:52:13 PM

Thanks to all who replied. After reading all the posts, I think I should
stick with the "0 - 127 is English" assumption. It is safer in the context
of the issue I have.
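
For reference, a sketch of what a strict 0 - 127 test could look like.
IsPlainAscii is a hypothetical name, and the rule itself is only the
working assumption stated above, not a definition of "English".

```vb
Option Explicit

' Hypothetical helper: True if every character of s has a code in the
' range 0..127. An empty string counts as plain ASCII.
Public Function IsPlainAscii(ByVal s As String) As Boolean
    Dim i As Long

    For i = 1 To Len(s)
        ' Mask so that code units from &H8000 upward are not negative
        If (AscW(Mid$(s, i, 1)) And &HFFFF&) > 127 Then
            Exit Function   ' result stays False
        End If
    Next
    IsPlainAscii = True
End Function
```

AscW() is used rather than Asc() so that a character the system code page
cannot represent is not silently reported as 63 ("?"), which would pass
the test.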

"Nobody" <nob...@nobody.com> wrote in message
news:%2360qfRQ...@TK2MSFTNGP05.phx.gbl...

Jim Mack

Feb 15, 2010, 4:54:20 PM

Jeff Johnson wrote:
> "Jim Mack" wrote...

>
>> Which the code snippet clearly shows: AscW() produces more than a
>> dozen results >255 for characters that would be considered English
>> since they're in the '1033' character set.
>
> [Sent the previous reply too soon.]
>
> Your comment about how they would be considered English since
> they're in the 1033 code page (or whatever that is) is actually the
> crux of my first reply. DOES the poster actually consider all of
> those to be English?

Maybe he'll see and respond. It's a question about what he wants to
classify and why, but in fact if you examine the normal output of many
modern text editing / word-processing programs, you will very likely
find characters that fall in the range I called out.

If you passed such text through the AscW() test, it would fail. That's
what I was pointing out.

Mike Williams

Feb 15, 2010, 5:39:18 PM

"Phil Hunt" <a...@aaa.com> wrote in message

news:Og$QkkorK...@TK2MSFTNGP06.phx.gbl...

> Thanks to all who replied. After reading all the posts, I think
> I should stick with "0 - 127 is english assumption". It is safer
> in the context of this issue I have.

Actually that's not a valid assumption. It is possible to have non-English
text that contains only characters in that range, and it is actually quite
common to have English text that contains characters outside it, such as the
pound sign (£) for example.

Mike

Phil Hunt

Feb 15, 2010, 5:48:39 PM

Actually my problem is quite the opposite. I have a printer that prints
English well in its native font. Printing mixed languages is possible with a
TrueType font but very slow. My objective is not to mix them up.

"Mike Williams" <Mi...@WhiskyAndCoke.com> wrote in message
news:%23NJz2%23orKH...@TK2MSFTNGP02.phx.gbl...

Helmut Meukel

Feb 16, 2010, 4:16:26 AM

"Phil Hunt" <a...@aaa.com> schrieb im Newsbeitrag

news:uV%23kGEpr...@TK2MSFTNGP02.phx.gbl...

> Actually my problem is quite the opposite. I have a printer that prints English
> well in its native font. Printing mixed languages is possible with a TrueType
> font but very slow. My objective is not to mix them up.
>

Hmm, so you want to use the built-in printer font for speed.
Then you have the problem of codes representing different
characters in the fonts used to create the texts and in your built-in
printer font, don't you?

How about printing a table with the codes and the corresponding
Character of this font (probably there is one already in the printer
docs).

Then use CharMap.exe to check the TrueType fonts and codepages
probably used in creating the texts against your printed table. Depending
on your situation this may be only a few.
Cross out all which are different. You'll finally have a list of codes with the
same character representation in all fonts and codepages you checked.

Store this list in an array, check the text to print against this array of
codes and select the printer font accordingly.
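
A rough sketch of that lookup, assuming the checked codes end up in a
Boolean table indexed by character code. SafeCode, InitSafeCodes and
CanUsePrinterFont are hypothetical names, and the table contents here are
only a placeholder (printable 7-bit ASCII); the real entries would come
from the comparison described above.

```vb
Option Explicit

' True = code shows the same character in the printer font and in the
' fonts/code pages that were checked (placeholder data, see below).
Private SafeCode(0 To 255) As Boolean

' Fill the table. Assumption: only codes 32..126 are marked safe here.
Public Sub InitSafeCodes()
    Dim i As Long

    For i = 32 To 126
        SafeCode(i) = True
    Next
End Sub

' True if every character of s is marked safe, so the fast built-in
' printer font can be selected for it.
Public Function CanUsePrinterFont(ByVal s As String) As Boolean
    Dim i As Long
    Dim cp As Long

    For i = 1 To Len(s)
        cp = AscW(Mid$(s, i, 1)) And &HFFFF&
        If cp > 255 Then Exit Function          ' outside the table
        If Not SafeCode(cp) Then Exit Function  ' crossed-out code
    Next
    CanUsePrinterFont = True
End Function
```

InitSafeCodes has to be called once before the first check. With the
placeholder data, tabs and line breaks count as unsafe, so text would
need to be tested line by line or the table extended.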

HTH.

Helmut.

Nobody

Feb 16, 2010, 7:36:39 AM

"Phil Hunt" <a...@aaa.com> wrote in message

news:uV%23kGEpr...@TK2MSFTNGP02.phx.gbl...

> Actually my problem is quite the opposite. I have a printer that prints
> English well in its native font. Printing mixed languages is possible with a
> TrueType font but very slow. My objective is not to mix them up.

To tell the range of Unicode characters that a specific font supports, you
can call GetFontUnicodeRanges(). However, this works with the fonts
installed on the system, not the printer. Some fonts have Unicode in the
name, but they don't implement all Unicode characters. Search the news
groups for "vb GetFontUnicodeRanges" for samples.

Karl E. Peterson

Feb 16, 2010, 8:58:25 PM

Phil Hunt wrote:
> What is the best way to determine whether a string contains "non-English"
> characters?

Define "English", test for your definition.

--
.NET: It's About Trust!
http://vfred.mvps.org