Problems with CompareStringW

Warren Menzer

Oct 22, 2003, 2:31:52 PM

I've found that with certain Unicode strings, CompareStringW seems to
be acting very strangely - you get behavior like this:

A < B
B < C
C < A

or even:

A < B
B < A

These strings are randomly generated Unicode strings, so it may be
that the problematic strings contain characters that are either unused
or in certain parts of the Unicode space that are reserved (something
similar to the private use space, maybe). So it may be that
CompareStringW works fine for all real-world strings that we'd ever
encounter. Still, it's a bit unsettling to see CompareStringW return
values that are so obviously wrong.

A specific example - all three calls to CompareStringW return
CSTR_LESS_THAN:

A = 1B37 1D96 4516
B = 30FE 4113 67BE
C = 0747 4443 40E6

CompareStringW(LOCALE_INVARIANT,
               NORM_IGNORECASE | NORM_IGNOREWIDTH,
               A, -1,
               B, -1);

CompareStringW(LOCALE_INVARIANT,
               NORM_IGNORECASE | NORM_IGNOREWIDTH,
               B, -1,
               C, -1);

CompareStringW(LOCALE_INVARIANT,
               NORM_IGNORECASE | NORM_IGNOREWIDTH,
               C, -1,
               A, -1);

Are there any errors here? From what I can tell, the three strings
are all legal (null-terminated) UTF-16 strings - they're not
ill-formed.

Another example:

A = 0D42 65F9
B = 1111 1B4F

CompareStringW(LOCALE_INVARIANT,
               NORM_IGNORECASE,
               A, -1,
               B, -1);

CompareStringW(LOCALE_INVARIANT,
               NORM_IGNORECASE,
               B, -1,
               A, -1);

CompareStringW returns that A<B and B<A if I pass in -1 as the lengths
(the documentation states that "if this parameter is any negative
value, the string is assumed to be null terminated and the length is
calculated automatically"). But if I calculate the lengths of the
strings myself and pass those in, then it works properly (A>B and
B<A). Passing in the string lengths does not help the case above,
however.

Any ideas would be much appreciated - thanks!

Michael (michka) Kaplan [MS]

Oct 26, 2003, 4:09:19 PM

"Warren Menzer" <wme...@yahoo.com> wrote...

Sorry I did not reply to this message sooner, I have been on vacation. ;-)

> I've found that with certain Unicode strings, CompareStringW seems to
> be acting very strangely -

Strangely is a relative term, especially in a case where you are randomly
generating strings....

> you get behavior like this:
>
> A < B
> B < C
> C < A
>
> or even:
>
> A < B
> B < A

Yes, both are not so great. But you have to understand how the collation
data is created and what it represents.

The goal is to give a way to sort every part of the Unicode BMP (Basic
Multilingual Plane), according to some particular selected locale. Any time
a code point is not usefully defined in the table (e.g. it is not defined in
Unicode, it is not a language/script that Windows has useful data for, or it
is intentionally not given weight), it will not give useful linguistic
information.

In other words, comparing random crap can give random crap results. :-)

> These strings are randomly generated Unicode strings, so it may be
> that the problematic strings contain characters that are either unused
> or in certain parts of the Unicode space that are reserved (something
> similar to the private use space, maybe). So it may be that
> CompareStringW works fine for all real-world strings that we'd ever
> encounter. Still, it's a bit unsettling to see CompareStringW return
> values that are so obviously wrong.

See above. But I will plow through your examples too, below.

> A specific example - all three calls to CompareStringW return
> CSTR_LESS_THAN:
>
> A = 1B37 1D96 4516

The first two are not defined in the collation tables. U+4516 is a part of
Extension A, added to the table in WinXP/Server 2003. It has a default
weighting that is basically code point order at the end of the table. For
prior versions it is undefined.

SUMMARY: This is a nonsense string and nothing useful can come from testing
it.

> B = 30FE 4113 67BE

U+30fe is a Katakana iteration mark that has some special properties in
regard to collation that are going to give dumb results when mixed with
non-Kana strings. U+4113 is another Extension A character, and U+67be is a
part of the main CJK Ideographs table.

SUMMARY: Again, this is a nonsense string and nothing useful can come from
testing it.

> C = 0747 4443 40E6

U+0747 is a Syriac character, and U+4443 and U+40e6 are again Extension A
characters.

SUMMARY: Once again, this is a nonsense string and nothing useful can come
from testing it.

> Are there any errors here? From what I can tell, the three strings
> are all legal (null-terminated) UTF-16 strings - they're not
> ill-formed.

Well, they are ill-formed in a linguistic sense. The code as shipped does
not even handle invalid Jamo sequences all that intuitively, since in the
real world people use real strings. It is important to design reasonable
tests....

> Another example:
>
> A = 0D42 65F9

A Malayalam character and a CJK ideograph -- two characters one would never
really expect to be together.

> B = 1111 1B4F

A Hangul character and an undefined codepoint -- again not a valid test.

> CompareStringW returns that A<B and B<A if I pass in -1 as the lengths
> (the documentation states that "if this parameter is any negative
> value, the string is assumed to be null terminated and the length is
> calculated automatically"). But if I calculate the lengths of the
> strings myself and pass those in, then it works properly (A>B and
> B<A). Passing in the string lengths does not help the case above,
> however.

Well, this is a type of situation that really is a bug, something that I
have been working to correct for future versions -- there simply are many
cases where if you pass invalid data we handle it oddly, specifically
between the -1 and cch cases (which are basically two different code paths).

The -1 case is designed to not require a string walk on the part of the
caller (it literally plows through the string one sort element at a time and
stops when it knows the answer), and any time the two calls give different
results, it is technically a bug (one that I am charged with trying to fix!
<grin>). The mitigation for the time being is that invalid input is required
to give invalid results....

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

FL

Oct 27, 2003, 1:04:26 AM

Michael (michka) Kaplan [MS] wrote:
> The -1 case is designed to not require a string walk on the part of the
> caller (it literally plows through the string one sort element at a time and
> stops when it knows the answer), and any time the two calls give different
> results, it is technically a bug (one that I am charged with trying to fix!
> <grin>). The mitigation for the time being is that invalid input is required
> to give invalid results....

It's better not to think of this as a "mitigation for the time being", IMHO,
because string-walking routines of this type that are not properly
implemented usually lead to security bugs. Invalid inputs may lead
to the hacker's "desired" results.

Francisco

Michael (michka) Kaplan [MS]

Oct 27, 2003, 8:14:21 AM

I would tend to agree with this point, in theory.

In practice, I have to recognize that the current tables simply do not
handle aberrant data well enough, and that the "garbage in, garbage out"
rules apply.

From a security standpoint, the use of undefined code points or random
invalid Hangul/Jamo/Kana is not a concern the way that Latin vs. Cyrillic "o"
being used for a string would be (for spoofing), and that case is already
well handled.

But as I said, fixing many of these problems is one of my main
preoccupations, these days. :-)

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"FL" <frleong@NO_SPAM_hotmail.com> wrote in message
news:%23NJfZAF...@TK2MSFTNGP12.phx.gbl...

Warren Menzer

Nov 24, 2003, 5:25:33 PM

Thanks for the comments - here's a followup:

I generated a text file of valid Unicode characters so that our
automated script could randomly generate valid Unicode filenames.
Even limiting our testing to these characters, we found some
inconsistent behavior with CompareStringW. I wrote a test application
that creates two 2-character Unicode strings and compares them with
CompareStringW (I can provide the source if you're interested). These
characters are taken from the text file of valid characters mentioned
above. I was able to find some instances of strings A and B such that
A<B and B<A (or A>B and B>A). I ran a test of 100 million comparisons
with two different options - supplying -1 as the string lengths, and
passing in lengths generated from calling wcslen. Here are the
results:

Results (using -1): 186421 comparisons FAILED out of 100000000 total
Results (using wcslen): 558 comparisons FAILED out of 100000000 total
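
For reference, a consistency check of the kind described above might look like the following minimal C sketch. It is not the poster's test application, whose source was not included in the thread; the `wcmp_fn` type and all function names are hypothetical. It assumes the comparator returns a negative, zero, or positive value the way `wcscmp` does (on Windows, a wrapper returning the `CompareStringW` result minus `CSTR_EQUAL` would fit that shape for successful calls). `always_less` is a deliberately broken stand-in that reproduces the "A<B and B<A" symptom.

```c
#include <stddef.h>
#include <wchar.h>

/* Assumed comparator contract: negative, zero, or positive, like wcscmp. */
typedef int (*wcmp_fn)(const wchar_t *a, const wchar_t *b);

/* Deliberately broken stand-in: claims every string is less than every other. */
int always_less(const wchar_t *a, const wchar_t *b)
{
    (void)a;
    (void)b;
    return -1;
}

static int sign_of(int v) { return (v > 0) - (v < 0); }

/* Counts unordered pairs where cmp(a,b) and cmp(b,a) do not mirror each other. */
size_t count_antisymmetry_failures(wcmp_fn cmp, const wchar_t *const *s, size_t n)
{
    size_t failures = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (sign_of(cmp(s[i], s[j])) != -sign_of(cmp(s[j], s[i])))
                failures++;
    return failures;
}

/* Counts ordered triples with a < b and b < c where a < c does not hold. */
size_t count_transitivity_failures(wcmp_fn cmp, const wchar_t *const *s, size_t n)
{
    size_t failures = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                if (cmp(s[i], s[j]) < 0 && cmp(s[j], s[k]) < 0 &&
                    cmp(s[i], s[k]) >= 0)
                    failures++;
    return failures;
}
```

Passing `wcscmp` as the comparator should report zero failures for any input, which gives a baseline to compare a linguistic comparator against. The triple loop is cubic in the number of strings, so it is only practical for small samples.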

Using wcslen provided better behavior, but using either method, all of
the problems seem to involve comparisons between strings that contain
Hangul characters, e.g.:

A = 0ABD 1112 [GUJARATI SIGN AVAGRAHA, HANGUL CHOSEONG HIEUH]
MONGOLIAN LETTER MANCHU ALI GALI TTA
HANGUL CHOSEONG PIEUP-NIEUN
B = 1146 22BE [HANGUL CHOSEONG IEUNG-PANSIOS, RIGHT ANGLE WITH ARC]
COMBINING DOUBLE GRAVE ACCENT

Is there something about Hangul characters that causes them to be
different than other types of characters? We'll probably just remove
those characters from our list of valid characters for our testing,
but I wanted to make sure that there wasn't something else about them
that we need to be aware of.

Thanks again for your time!

Michael (michka) Kaplan [MS]

Nov 24, 2003, 8:46:38 PM

Hangul does indeed have special handling. I would have to wonder what the
definition of "valid" strings is here, since the script mix alone makes them
pretty unlikely strings in any case?

In terms of testing the APIs, it might be important to take a step back and
do a test breakout to decide WHAT you are testing. Are you truly looking to
test the API, something that would belong to members of the Windows test
folks at Microsoft? Or are you trying to verify that when you pass the types
of strings that your app would be expected to have that they are handled
appropriately? It seems like the test is not really designed with a real
world scenario in mind....

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...

Mihai N.

Nov 25, 2003, 3:24:33 AM

1. Which locales did you use? It is important. An example close to home is
ä (a-umlaut): in German it sorts after a, in Swedish after z.

2. What do you mean by "valid Unicode characters"?
When you compare Hangul (Korean) with Gujarati (Indian), what would you
expect?

--
Mihai
-------------------------
Replace _year_ with _ to get the real email

Michael (michka) Kaplan [MS]

Nov 25, 2003, 10:21:19 AM

For #2, it is worse than that.

Warren is comparing a string composed of a Gujarati character, a Hangul
syllable and a Mongolian letter with a string composed of a Hangul syllable,
a non-letter, and one of those double combining characters.

It is not a test of "valid strings" by any stretch of the imagination. I
have real trouble understanding what he is even trying to test here.

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns943E436...@216.148.227.77...

Warren Menzer

Nov 25, 2003, 12:19:55 PM

I guess there's a fine line between "unlikely" and "impossible". I
agree that it's not very likely that our application will see
file/folder names that contain (say) both Hangul and Gujarati
characters, but it's certainly a possibility. Our application needs
to work with all possible Windows files and folders, and these weird
names are definitely ones that are possible.

More specifically, our application keeps a sorted list of files in a
vector for easy retrieval. It's not very important what the sort
order is, as long as it's consistent [to respond to Mihai N.'s
question, it doesn't really matter how a Hangul string A compares to a
Gujarati string B, as long as the fact that A<B implies that B>A -
it's only consistency that's important to us]. Since keeping a sorted
list requires the use of a comparison function, we needed to verify
that CompareStringW produces results that are consistent. This came
about because we ran into instances where we could not find items in
our sorted list because CompareStringW wasn't returning consistent
results (that is, if we're doing a binary search, knowing which
"direction" within the vector to go towards is dependent on what our
comparison function returns).

As I said, this point is probably moot for 99.9% of our customers, but
it only takes a couple of "strange" files by one customer for this
problem to manifest itself.

Thanks again for your time and input.

Stephen Bye

Nov 26, 2003, 6:51:19 AM

Then don't use CompareStringW, use something like wcscmp, which just
compares the numerical character codes.
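
A minimal sketch of that suggestion in C, assuming the file names are held as an array of `const wchar_t *` and that only a consistent order is needed, not a linguistic one. The function names `by_code_unit`, `sort_names`, and `find_name` are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* qsort and bsearch pass pointers to the array elements, and each
   element is itself a const wchar_t *. */
static int by_code_unit(const void *pa, const void *pb)
{
    const wchar_t *a = *(const wchar_t *const *)pa;
    const wchar_t *b = *(const wchar_t *const *)pb;
    return wcscmp(a, b);
}

/* Sorts the table in plain wchar_t value order. */
void sort_names(const wchar_t **names, size_t n)
{
    qsort(names, n, sizeof *names, by_code_unit);
}

/* Returns the index of key in the sorted table, or -1 if it is absent. */
long find_name(const wchar_t *const *names, size_t n, const wchar_t *key)
{
    const wchar_t *const *hit =
        bsearch(&key, names, n, sizeof *names, by_code_unit);
    return hit ? (long)(hit - names) : -1;
}
```

Because `wcscmp` is a total order over the raw `wchar_t` values, the sort and the search always agree with each other. The cost is that case-insensitive matching and precomposed/decomposed equivalence are lost.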

"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...

Alex

Nov 26, 2003, 10:56:48 AM

wme...@yahoo.com (Warren Menzer) wrote in
news:d07984db.03112...@posting.google.com:

> I guess there's a fine line between "unlikely" and "impossible".
> I agree that it's not very likely that our application will see
> file/folder names that contain (say) both Hangul and Gujarati
> characters, but it's certainly a possibility. Our application
> needs to work with all possible Windows files and folders, and
> these weird names are definitely ones that are possible.

If I understand your problem correctly, you need to arrange items
in alphabetical order (of the selected locale). But when you have
unusual combinations of characters alphabetical order may not really
make sense, so it would be enough to just have some stable comparison.
An ugly solution to this problem is to compare the 2 strings as wchar vectors
first and then always call CompareStringW with the smaller one as the first
parameter and the greater one as the second. Of course, this relies on
CompareStringW(A,B) always returning the same value as long as A and B
are passed in the same order.
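
The wrapper described above could be sketched in C roughly as follows. The names `wcmp_fn`, `compare_canonical`, and `always_less` are hypothetical, and the linguistic comparator is passed in as a function pointer that is assumed to return a negative, zero, or positive value (a `CompareStringW` wrapper would be one candidate). `always_less` is a deliberately broken stand-in that mimics the "A<B and B<A" behavior.

```c
#include <wchar.h>

/* Assumed comparator contract: negative, zero, or positive, like wcscmp. */
typedef int (*wcmp_fn)(const wchar_t *a, const wchar_t *b);

/* Deliberately broken stand-in: claims every string is less than every other. */
int always_less(const wchar_t *a, const wchar_t *b)
{
    (void)a;
    (void)b;
    return -1;
}

/* Always hands the code-unit-smaller string to `linguistic` first,
   then mirrors the result if the caller's arguments were swapped. */
int compare_canonical(wcmp_fn linguistic, const wchar_t *a, const wchar_t *b)
{
    int raw = wcscmp(a, b);
    if (raw == 0)
        return 0;                    /* identical strings */
    if (raw < 0)
        return linguistic(a, b);     /* already in canonical order */
    int r = linguistic(b, a);        /* swapped: ask in canonical order */
    return (r > 0) ? -1 : (r < 0) ? 1 : 0;
}
```

This only forces compare(A,B) and compare(B,A) to mirror each other. It does nothing about cycles such as A<B, B<C, C<A, so a sorted container built on it could still misbehave if the underlying comparator is not transitive.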

> As I said, this point is probably moot for 99.9% of our customers,
> but it only takes a couple of "strange" files by one customer for
> this problem to manifest itself.

... and that 0.1% always happens on some important presentation :-)

Alex.

Michael (michka) Kaplan [MS]

Nov 26, 2003, 6:26:11 PM

I do not think the line is as fine as you are claiming it to be, fwiw.

The data was crafted by a linguist with an eye to accuracy of actual
linguistic data, a woman who is so gifted at what she does that she has a
sign on her door that says "Right Honourable Data Lady" and people will
often call her the RHDL without even a trace of sarcasm. The code has been
written by several developers across the last decade and a half that use
that linguistic data to the fullest extent possible.

Now I am one of those developers, and currently the owner of the code in
question. There are indeed bugs there, especially in the area of
linguistically incorrect data, and I am working to try to fix these bugs.
But the fact that the current data is not quite as smart with stupid inputs
is hardly a blocking issue.

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...

Warren Menzer

Dec 1, 2003, 9:12:38 AM

The reason we were interested in using CompareStringW rather than
wcscmp is that it can also detect precomposed vs. decomposed Unicode,
and can declare that a precomposed and a decomposed string are
"equal", even if they aren't the same byte-wise.

"Stephen Bye" <.> wrote in message news:<OyLxFOBt...@TK2MSFTNGP11.phx.gbl>...

Michael (michka) Kaplan [MS]

Dec 1, 2003, 9:19:19 AM

So if you have strings that are appropriately constructed, this will work
very well.

Do you have a scenario where users are creating invalid, strange strings
such as your testing is using? I think if you do a realistic refactoring of
your test cases and document that people should not do dumb things, then you
should be all set!

Look at it another way: do you truly postulate users who are sophisticated
enough to understand languages (some of which are not even supported via
built-in input methods or collation in Windows!) but not smart enough to use
them appropriately? You seem to be trying to build a product that gives your
users the right to be very, very wrong.

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Warren Menzer" <wme...@yahoo.com> wrote in message

news:d07984db.03120...@posting.google.com...

Stephen Bye

Dec 2, 2003, 10:56:11 AM

So use CompareStringW first.
If it returns "equal" then treat the strings as equal.
If it returns "unequal" then use wcscmp to order the strings.

"Warren Menzer" <wme...@yahoo.com> wrote in message

news:d07984db.03120...@posting.google.com...

Michael (michka) Kaplan [MS]

Dec 2, 2003, 5:55:46 PM

Yikes, I would *not* suggest that at all, unless you wanted a binary
ordering. The results will not be linguistically valid.

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Stephen Bye" <.> wrote in message

news:%23J7Y5yO...@TK2MSFTNGP11.phx.gbl...

Stephen Bye

Dec 3, 2003, 6:06:08 AM

The OP said "It's not very important what the sort order is, as long as it's
consistent". All he wants to do is a binary search on a list of items. Their
relative positions do not need to be "linguistically valid".

But I can see now that my suggestion is flawed. If you were using a binary
search with a string that was in the table, a binary comparison with an
unequal string may send you off in the wrong direction.
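
The flaw can be reproduced with a small C sketch. Everything here is an assumption made for illustration: `toy_equal` is a hypothetical stand-in for "CompareStringW reports equal" that only knows one equivalence (precomposed U+00E9 versus "e" followed by combining U+0301), and `hybrid_compare` and `hybrid_search` are hypothetical names for the "equal first, otherwise wcscmp" rule and a plain binary search built on it.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a linguistic equality test: it treats
   precomposed U+00E9 and 'e' + U+0301 as equal, and nothing else. */
static int toy_equal(const wchar_t *a, const wchar_t *b)
{
    static const wchar_t pre[] = L"\x00E9";
    static const wchar_t dec[] = L"e\x0301";
    if (wcscmp(a, b) == 0)
        return 1;
    return (wcscmp(a, pre) == 0 && wcscmp(b, dec) == 0) ||
           (wcscmp(a, dec) == 0 && wcscmp(b, pre) == 0);
}

/* The hybrid rule: equal if the linguistic test says so,
   otherwise fall back to wchar_t value order. */
int hybrid_compare(const wchar_t *a, const wchar_t *b)
{
    return toy_equal(a, b) ? 0 : wcscmp(a, b);
}

/* Plain binary search over a table sorted in wcscmp order.
   Returns the index of a match, or -1 if none is found. */
long hybrid_search(const wchar_t *const *table, size_t n, const wchar_t *key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = hybrid_compare(key, table[mid]);
        if (c == 0)
            return (long)mid;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With a table holding the decomposed form followed by "f" and "g", a search for the precomposed form first lands on "f", falls back to `wcscmp`, and moves right, away from the entry that the equality test would have matched. The two equivalent strings are not adjacent in binary order, which is what breaks the search.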

"Michael (michka) Kaplan [MS]" <mic...@online.microsoft.com> wrote in
message news:u31rxdSu...@TK2MSFTNGP12.phx.gbl...

Stephen Bye

Dec 9, 2003, 4:54:57 AM

It looks like there are just 2 options left:
1) Pre-process your strings so that they are all precomposed (or
decomposed),
or
2) Abandon the idea of using a binary search to find a string in the table.

"Stephen Bye" <.> wrote in message

news:eZ2Mf1Yu...@TK2MSFTNGP11.phx.gbl...

Michael (michka) Kaplan [MS]

Dec 9, 2003, 9:39:34 AM

Sigh...

Once again, if you have strings that are even remotely valid, then
CompareString will work just fine. The cases that have been raised here are
what we call "TESTER BUGS" because a tester can find them with an automated
tool. They are not found in the wild because in the real world users simply
do not misuse language as extensively as these strings require.

If you are trying to test a tool, then a realistic test breakout would not
cause one to abandon usage of an API for such an unrealistic scenario.

--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.

"Stephen Bye" <.> wrote in message

news:eQrYrpjv...@tk2msftngp13.phx.gbl...