A < B
B < C
C < A
or even:
A < B
B < A
These strings are randomly generated Unicode strings, so it may be
that the problematic strings contain characters that are either unused
or in certain parts of the Unicode space that are reserved (something
similar to the private use space, maybe). So it may be that
CompareStringW works fine for all real-world strings that we'd ever
encounter. Still, it's a bit unsettling to see CompareStringW return
values that are so obviously wrong.
A specific example - all three calls to CompareStringW return
CSTR_LESS_THAN:
A = 1B37 1D96 4516
B = 30FE 4113 67BE
C = 0747 4443 40E6
CompareStringW(LOCALE_INVARIANT,
NORM_IGNORECASE + NORM_IGNOREWIDTH,
A,
-1,
B,
-1);
CompareStringW(LOCALE_INVARIANT,
NORM_IGNORECASE + NORM_IGNOREWIDTH,
B,
-1,
C,
-1);
CompareStringW(LOCALE_INVARIANT,
NORM_IGNORECASE + NORM_IGNOREWIDTH,
C,
-1,
A,
-1);
Are there any errors here? From what I can tell, the three strings
are all legal (null-terminated) UTF-16 strings - they're not
ill-formed.
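For anyone who wants to reproduce this, here is a minimal sketch of the check in portable C. Everything named here (cmp_fn, ordinal_cmp, always_less, is_cycle) is hypothetical and made up for illustration; always_less is only a stand-in that imitates the reported "all three calls return CSTR_LESS_THAN" behavior. On Windows the comparator would presumably be a small wrapper around CompareStringW.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparator type: negative, zero, or positive, like strcmp.
   On Windows a wrapper could return a nonzero CompareStringW result
   minus CSTR_EQUAL. */
typedef int (*cmp_fn)(const uint16_t *a, const uint16_t *b);

static int sign(int v) { return (v > 0) - (v < 0); }

/* Well-behaved reference: compares null-terminated UTF-16 code units. */
int ordinal_cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (*a > *b) - (*a < *b);
}

/* Stand-in for the reported behavior: every call says "less than". */
int always_less(const uint16_t *a, const uint16_t *b)
{
    (void)a; (void)b;
    return -1;
}

/* Returns 1 when a, b, c form a cycle (a<b, b<c, c<a, or the mirror
   image with all three greater-than), which no valid ordering allows. */
int is_cycle(cmp_fn cmp, const uint16_t *a, const uint16_t *b,
             const uint16_t *c)
{
    assert(cmp != NULL);
    int ab = sign(cmp(a, b));
    int bc = sign(cmp(b, c));
    int ca = sign(cmp(c, a));
    return ab != 0 && ab == bc && bc == ca;
}
```

Calling is_cycle with the three strings above and a CompareStringW wrapper should flag the problem; with ordinal_cmp it cannot, because code-unit order is a total order.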
Another example:
A = 0D42 65F9
B = 1111 1B4F
CompareStringW(LOCALE_INVARIANT,
NORM_IGNORECASE,
A,
-1,
B,
-1);
CompareStringW(LOCALE_INVARIANT,
NORM_IGNORECASE,
B,
-1,
A,
-1);
CompareStringW returns that A<B and B<A if I pass in -1 as the lengths
(the documentation states that "if this parameter is any negative
value, the string is assumed to be null terminated and the length is
calculated automatically"). But if I calculate the lengths of the
strings myself and pass those in, then it works properly (A>B and
B<A). Passing in the string lengths does not help the case above,
however.
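A rough sketch of how the two calling conventions can be checked side by side. The comparator type mirrors CompareStringW's (string, length) parameter pairs, with -1 meaning "null terminated", but all names here (cmp_len_fn, utf16_len, ordinal_cmp_len, always_less_len, is_antisymmetric) are hypothetical, and the two comparators are stand-ins rather than the real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparator: a length of -1 means "null terminated". */
typedef int (*cmp_len_fn)(const uint16_t *a, int cch_a,
                          const uint16_t *b, int cch_b);

static int sign(int v) { return (v > 0) - (v < 0); }

/* Counts UTF-16 code units up to the terminating null. */
int utf16_len(const uint16_t *s)
{
    int n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Well-behaved reference comparator (code-unit order). */
int ordinal_cmp_len(const uint16_t *a, int cch_a,
                    const uint16_t *b, int cch_b)
{
    if (cch_a < 0) cch_a = utf16_len(a);
    if (cch_b < 0) cch_b = utf16_len(b);
    int n = cch_a < cch_b ? cch_a : cch_b;
    for (int i = 0; i < n; i++) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    return (cch_a > cch_b) - (cch_a < cch_b);
}

/* Stand-in for the reported behavior: always "less than". */
int always_less_len(const uint16_t *a, int cch_a,
                    const uint16_t *b, int cch_b)
{
    (void)a; (void)cch_a; (void)b; (void)cch_b;
    return -1;
}

/* Returns 1 if cmp(a,b) and cmp(b,a) have opposite signs (or are both
   zero). explicit_len selects computed lengths instead of -1. */
int is_antisymmetric(cmp_len_fn cmp, const uint16_t *a,
                     const uint16_t *b, int explicit_len)
{
    assert(cmp != NULL);
    int la = explicit_len ? utf16_len(a) : -1;
    int lb = explicit_len ? utf16_len(b) : -1;
    return sign(cmp(a, la, b, lb)) == -sign(cmp(b, lb, a, la));
}
```

Running is_antisymmetric twice per pair, once with explicit_len set to 0 and once with 1, shows whether the -1 path and the explicit-length path disagree.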
Any ideas would be much appreciated - thanks!
Sorry I did not reply to this message sooner, I have been on vacation. ;-)
> I've found that with certain Unicode strings, CompareStringW seems to
> be acting very strangely -
Strangely is a relative term, especially in a case where you are randomly
generating strings....
> you get behavior like this:
>
> A < B
> B < C
> C < A
>
> or even:
>
> A < B
> B < A
Yes, both are not so great. But you have to understand how the collation
data is created and what it represents.
The goal is to give a way to sort every part of the Unicode BMP (Basic
Multilingual Plane), according to some particular selected locale. Any time
a code point is not usefully defined in the table (e.g. it is not defined in
Unicode, it is not a language/script that Windows has useful data for, or it
is intentionally not given weight), it will not give useful linguistic
information.
In other words, comparing random crap can give random crap results. :-)
> These strings are randomly generated Unicode strings, so it may be
> that the problematic strings contain characters that are either unused
> or in certain parts of the Unicode space that are reserved (something
> similar to the private use space, maybe). So it may be that
> CompareStringW works fine for all real-world strings that we'd ever
> encounter. Still, it's a bit unsettling to see CompareStringW return
> values that are so obviously wrong.
See above. But I will plow through your examples too, below.
> A specific example - all three calls to CompareStringW return
> CSTR_LESS_THAN:
>
> A = 1B37 1D96 4516
The first two are not defined in the collation tables. U+4516 is a part of
Extension A, added to the table in WinXP/Server 2003. It has a default
weighting that is basically code point order at the end of the table. For
prior versions it is undefined.
SUMMARY: This is a nonsense string and nothing useful can come from testing
it.
> B = 30FE 4113 67BE
U+30fe is a Katakana iteration mark that has some special properties in
regard to collation that are going to give dumb results when mixed with
non-Kana strings. U+4113 is another Extension A character, and U+67be is a
part of the main CJK Ideographs table.
SUMMARY: Again, this is a nonsense string and nothing useful can come from
testing it.
> C = 0747 4443 40E6
U+0747 is a Syriac character, and U+4443 and U+40e6 are again Extension A
characters.
SUMMARY: Once again, this is a nonsense string and nothing useful can come
from testing it.
> Are there any errors here? From what I can tell, the three strings
> are all legal (null-terminated) UTF-16 strings - they're not
> ill-formed.
Well, they are ill-formed in a linguistic sense. The code as shipped does
not even handle invalid Jamo sequences all that intuitively, since in the
real world people use real strings. It is important to design reasonable
tests....
> Another example:
>
> A = 0D42 65F9
A Malayalam character and a CJK ideograph -- two characters one would never
really expect to be together.
> B = 1111 1B4F
A Hangul character and an undefined codepoint -- again not a valid test.
> CompareStringW returns that A<B and B<A if I pass in -1 as the lengths
> (the documentation states that "if this parameter is any negative
> value, the string is assumed to be null terminated and the length is
> calculated automatically"). But if I calculate the lengths of the
> strings myself and pass those in, then it works properly (A>B and
> B<A). Passing in the string lengths does not help the case above,
> however.
Well, this is a type of situation that really is a bug, something that I
have been working to correct for future versions -- there simply are many
cases where if you pass invalid data we handle it oddly, specifically
between the -1 and cch cases (which are basically two different code paths).
The -1 case is designed to not require a string walk on the part of the
caller (it literally plows through the string one sort element at a time and
stops when it knows the answer), and any time the two calls give different
results, it is technically a bug (one that I am charged with trying to fix!
<grin>).
The mitigation for the time being is that invalid input is required to give
invalid results....
--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies
This posting is provided "AS IS" with
no warranties, and confers no rights.
Francisco
In practice, I have to recognize that the current tables simply do not
handle aberrant data well enough, and that the "garbage in, garbage out"
rules apply.
From a security standpoint, the use of undefined code points, random invalid
Hangul/Jamo/Kana is not a concern the way that Latin vs. Cyrillic "o" being
used for a string would be (for spoofing), and that case is already well
handled.
But as I said, fixing many of these problems is one of my main
preoccupations, these days. :-)
--
MichKa [MS]
"FL" <frleong@NO_SPAM_hotmail.com> wrote in message
news:%23NJfZAF...@TK2MSFTNGP12.phx.gbl...
I generated a text file of valid Unicode characters so that our
automated script could randomly generate valid Unicode filenames.
Even limiting our testing to these characters, we found some
inconsistent behavior with CompareStringW. I wrote a test application
that creates two 2-character Unicode strings and compares them with
CompareStringW (I can provide the source if you're interested). These
characters are taken from the text file of valid characters mentioned
above. I was able to find some instances of strings A and B such that
A<B and B<A (or A>B and B>A). I ran a test of 100 million comparisons
with two different options - supplying -1 as the strings lengths, and
passing in lengths generated from calling wcslen. Here are the
results:
Results (using -1): 186421 comparisons FAILED out of 100000000 total
Results (using wcslen): 558 comparisons FAILED out of 100000000 total
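The test application's source is not shown, so the following is only a guess at its general shape: draw two random 2-character strings from a pool of allowed code units and count the pairs where cmp(A,B) and cmp(B,A) do not have opposite signs. The names (cmp_fn, next_rand, count_antisymmetry_failures, and the two sample comparators) are hypothetical, and the xorshift generator is just a convenient reproducible source of randomness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparator type: negative, zero, or positive. */
typedef int (*cmp_fn)(const uint16_t *a, const uint16_t *b);

static int sign(int v) { return (v > 0) - (v < 0); }

/* xorshift32: small deterministic generator; the seed must be nonzero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Reference comparator: code-unit order of null-terminated strings. */
int ordinal_cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (*a > *b) - (*a < *b);
}

/* Stand-in for a misbehaving comparator: always "less than". */
int always_less(const uint16_t *a, const uint16_t *b)
{
    (void)a; (void)b;
    return -1;
}

/* Builds `trials` random pairs of 2-unit strings from `pool` (which must
   not contain 0) and counts pairs that violate antisymmetry. */
long count_antisymmetry_failures(cmp_fn cmp, const uint16_t *pool,
                                 size_t pool_len, long trials,
                                 uint32_t seed)
{
    assert(cmp != NULL && pool != NULL && pool_len > 0 && seed != 0);
    uint32_t state = seed;
    long failures = 0;
    for (long t = 0; t < trials; t++) {
        uint16_t a[3] = {0}, b[3] = {0};   /* 2 units + terminator */
        for (int i = 0; i < 2; i++) {
            a[i] = pool[next_rand(&state) % pool_len];
            b[i] = pool[next_rand(&state) % pool_len];
        }
        if (sign(cmp(a, b)) != -sign(cmp(b, a))) failures++;
    }
    return failures;
}
```

Keeping the seed fixed makes a failing run repeatable, which helps when a specific pair needs to be reported.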
Using wcslen provided better behavior, but using either method, all of
the problems seem to involve comparisons between strings that contain
Hangul characters, e.g.:
A = 0ABD 1112 [GUJARATI SIGN AVAGRAHA, HANGUL CHOSEONG HIEUH]
MONGOLIAN LETTER MANCHU ALI GALI TTA
HANGUL CHOSEONG PIEUP-NIEUN
B = 1146 22BE [HANGUL CHOSEONG IEUNG-PANSIOS, RIGHT ANGLE WITH ARC]
COMBINING DOUBLE GRAVE ACCENT
Is there something about Hangul characters that causes them to be
different than other types of characters? We'll probably just remove
those characters from our list of valid characters for our testing,
but I wanted to make sure that there wasn't something else about them
that we need to be aware of.
Thanks again for your time!
In terms of testing the APIs, it might be important to take a step back and
do a test breakout to decide WHAT you are testing. Are you truly looking to
test the API, something that would belong to members of the Windows test
folks at Microsoft? Or are you trying to verify that when you pass the types
of strings that your app would be expected to have that they are handled
appropriately? It seems like the test is not really designed with a real
world scenario in mind....
--
MichKa [MS]
"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...
2. What do you mean "valid Unicode characters"?
When you compare Hangul (Korean) with Gujarati (Indian), what would you
expect?
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Warren is comparing a string composed of a Gujarati character, a Hangul
syllable and a Mongolian letter with a string composed of a Hangul syllable,
a non-letter, and one of those double combining characters.
It is not a test of "valid strings" by any stretch of the imagination. I
have real trouble understanding what he is even trying to test here.
--
MichKa [MS]
"Mihai N." <nmihai_y...@yahoo.com> wrote in message
news:Xns943E436...@216.148.227.77...
More specifically, our application keeps a sorted list of files in a
vector for easy retrieval. It's not very important what the sort
order is, as long as it's consistent [to respond to Mihai N.'s
question, it doesn't really matter how a Hangul string A compares to a
Gujarati string B, as long as the fact that A<B implies that B>A -
it's only consistency that's important to us]. Since keeping a sorted
list requires the use of a comparison function, we needed to verify
that CompareStringW produces results that are consistent. This came
about because we ran into instances where we could not find items in
our sorted list because CompareStringW wasn't returning consistent
results (that is, if we're doing a binary search, knowing which
"direction" within the vector to go towards is dependent on what our
comparison function returns).
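To illustrate why the search depends on a consistent comparator, here is a small sketch of a sorted list with binary search. It deliberately swaps in a plain code-unit (ordinal) comparison instead of CompareStringW, which is an assumption for illustration and not what the application described above does; the point is only that the same total order has to be used for both sorting and searching. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Code-unit comparison of null-terminated UTF-16 strings.
   This is a total order, so sorting and searching stay consistent. */
static int ordinal_cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (*a > *b) - (*a < *b);
}

/* Adapter for qsort/bsearch: each element is a pointer to a string. */
static int elem_cmp(const void *pa, const void *pb)
{
    const uint16_t *a = *(const uint16_t *const *)pa;
    const uint16_t *b = *(const uint16_t *const *)pb;
    return ordinal_cmp(a, b);
}

/* Sorts the array of names in place. */
void sort_names(const uint16_t **names, size_t n)
{
    assert(names != NULL);
    qsort(names, n, sizeof *names, elem_cmp);
}

/* Returns the index of key in the sorted array, or -1 if absent.
   Must use the same comparator that produced the sort order. */
int find_name(const uint16_t *const *names, size_t n, const uint16_t *key)
{
    assert(names != NULL && key != NULL);
    const void *hit = bsearch(&key, names, n, sizeof *names, elem_cmp);
    return hit ? (int)((const uint16_t *const *)hit - names) : -1;
}
```

The trade-off is that the resulting order is not linguistic, so it suits an internal index better than anything shown to users.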
As I said, this point is probably moot for 99.9% of our customers, but
it only takes a couple of "strange" files by one customer for this
problem to manifest itself.
Thanks again for your time and input.
"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...
> I guess there's a fine line between "unlikely" and "impossible".
> I agree that it's not very likely that our application will see
> file/folder names that contain (say) both Hangul and Gujarati
> characters, but it's certainly a possibility. Our application
> needs to work with all possible Windows files and folders, and
> these weird names are definitely ones that are possible.
If I understand your problem correctly, you need to arrange items
in alphabetical order (of the selected locale). But when you have
unusual combinations of characters alphabetical order may not really
make sense, so it would be enough to just have some stable comparison.
An ugly solution to this problem is to compare the 2 strings as wchar
vectors first and then always call CompareStringW with the smaller one as
the first parameter and the greater one as the second. Of course, this
relies on CompareStringW(A,B) always returning the same value as long as
A and B are passed in the same order.
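Roughly, the idea above could look like the sketch below. The names (cmp_fn, stable_compare, always_less) are made up for illustration, and the linguistic comparator is passed in as a function pointer so that a CompareStringW wrapper can be plugged in on Windows. Note that this only forces cmp(A,B) and cmp(B,A) to agree with each other; it does nothing about A<B, B<C, C<A cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparator type: negative, zero, or positive. */
typedef int (*cmp_fn)(const uint16_t *a, const uint16_t *b);

static int sign(int v) { return (v > 0) - (v < 0); }

/* Code-unit comparison, used only to pick a canonical argument order. */
int ordinal_cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (*a > *b) - (*a < *b);
}

/* Stand-in for a misbehaving linguistic comparator. */
int always_less(const uint16_t *a, const uint16_t *b)
{
    (void)a; (void)b;
    return -1;
}

/* Always calls `linguistic` with the ordinally smaller string first and
   flips the sign when the caller's arguments were in the other order. */
int stable_compare(cmp_fn linguistic, const uint16_t *a, const uint16_t *b)
{
    assert(linguistic != NULL);
    int ord = ordinal_cmp(a, b);
    if (ord == 0) return 0;                     /* identical code units */
    if (ord < 0) return sign(linguistic(a, b)); /* already canonical    */
    return -sign(linguistic(b, a));             /* swapped: negate      */
}
```

Even with the always_less stand-in, stable_compare(x, A, B) and stable_compare(x, B, A) come out as opposites, which is the property the approach is after.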
> As I said, this point is probably moot for 99.9% of our customers,
> but it only takes a couple of "strange" files by one customer for
> this problem to manifest itself.
... and that 0.1% always happens on some important presentation :-)
Alex.
The data was crafted by a linguist with an eye to accuracy of actual
linguistic data, a woman who is so gifted at what she does that she has a
sign on her door that says "Right Honourable Data Lady" and people will
often call her the RHDL without even a trace of sarcasm. The code has been
written by several developers across the last decade and a half that use
that linguistic data to the fullest extent possible.
Now I am one of those developers, and currently the owner of the code in
question. There are indeed bugs there, especially in the area of
linguistically incorrect data, and I am working to try to fix these bugs.
But the fact that the current data is not quite as smart with stupid inputs
is hardly a blocking issue.
--
MichKa [MS]
"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03112...@posting.google.com...
"Stephen Bye" <.> wrote in message news:<OyLxFOBt...@TK2MSFTNGP11.phx.gbl>...
Do you have a scenario where users are creating invalid, strange strings
such as your testing is using? I think if you do a realistic refactoring of
your test cases and document that people should not do dumb things, then you
should be all set!
Look at it another way, do you truly postulate users who are sophisticated
enough to understand languages (some of which are not even supported via
built-in input methods or collation in Windows!) but not smart enough to use
them appropriately? You seem to be trying to build a product that gives your
users the right to be very, very wrong.
--
MichKa [MS]
"Warren Menzer" <wme...@yahoo.com> wrote in message
news:d07984db.03120...@posting.google.com...
"Stephen Bye" <.> wrote in message
news:%23J7Y5yO...@TK2MSFTNGP11.phx.gbl...
But I can see now that my suggestion is flawed. If you were using a binary
search with a string that was in the table, a binary comparison with an
unequal string may send you off in the wrong direction.
"Michael (michka) Kaplan [MS]" <mic...@online.microsoft.com> wrote in
message news:u31rxdSu...@TK2MSFTNGP12.phx.gbl...
"Stephen Bye" <.> wrote in message
news:eZ2Mf1Yu...@TK2MSFTNGP11.phx.gbl...
Once again, if you have strings that are even remotely valid, then
CompareString will work just fine. The cases that have been raised here are
what we call "TESTER BUGS" because a tester can find them with an automated
tool. They are not found in the wild because in the real world users simply
do not misuse language as extensively as these strings require.
If you are trying to test a tool, then a realistic test breakout would not
cause one to abandon usage of an API for such an unrealistic scenario.
--
MichKa [MS]
"Stephen Bye" <.> wrote in message
news:eQrYrpjv...@tk2msftngp13.phx.gbl...