VAST 2022 Unicode Support Info and Tips

Seth Berman

May 2, 2022, 8:38:32 AM

To: VAST Community Forum

The original post (https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/nMRmlXnSAgAJ) asks about UnicodeString / String bridging, in this case equality and comparison.

I have inlined it below and will respond to it afterwards.

----------------------------------------------------------------------------------------------------------------------------------------

I was just playing a bit with (Unicode) string comparisons using the new 2022 release.

The documentation reads:

Important Note

• Compatibility relationship is uni-directional. A <String> or <Character> does not have direct knowledge of <UnicodeString>.

This is illustrated in the following statements.

'Smalltalk' asUnicodeString = 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString = 'Smalltalk' --> true
'Smalltalk' = 'Smalltalk' asUnicodeString --> false

'Smalltalk' asUnicodeString sameAs: 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString sameAs: 'Smalltalk' --> true
'Smalltalk' sameAs: 'Smalltalk' asUnicodeString --> true (!)

'Smalltalk' asUnicodeString first = 'Smalltalk' asUnicodeString first --> true
'Smalltalk' asUnicodeString first = 'Smalltalk' first --> true
'Smalltalk' first = 'Smalltalk' asUnicodeString first --> false

'Smalltalk' asUnicodeString first sameAs: 'Smalltalk' asUnicodeString first --> true
'Smalltalk' asUnicodeString first sameAs: 'Smalltalk' first --> true
'Smalltalk' first sameAs: 'Smalltalk' asUnicodeString first --> false

'Smalltalk' asUnicodeString includesSubstring: 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString includesSubstring: 'Smalltalk' --> true
'Smalltalk' includesSubstring: 'Smalltalk' asUnicodeString --> false

This is just playing, but I was wondering how one in 'real life' would go about this (in an efficient way) in cases where you don't know what kind of strings you're dealing with?

Cheers,

Adriaan.

Seth Berman

May 2, 2022, 9:22:17 AM

To: VAST Community Forum

Hi Adriaan,

Thanks for the question. There are two answers to this question.

Fully transparent bridging between a legacy String/Character and Unicode String/Grapheme/Unicode Scalar was simply too much to accomplish in a single release cycle. This is a multi-release endeavor, and the focus of VAST 2022 was implementing Unicode support and safely bringing it into the core of the platform. As we move to VAST 2023, we will focus more on integration with specific parts of the product, and changing code in String/Character would be considered part of that integration. We didn't want to rush changes to other parts of the system, especially in highly optimized areas where even a small increase in code size (primarily VM code) can have a detrimental impact on performance.

"How one in 'real life' would go about this (in an efficient way) in cases where you don't know what kind of strings you're dealing with"

I'm not exactly sure how to answer this question. Some of it depends on your architecture. Some of it depends on what you reasonably expect the product to do on your behalf versus what you will be expected to do yourself if you opt in to Unicode support. I imagine that as we make other libraries in VAST Unicode-enabled, we will be better positioned to answer these questions, but that will require examples and use cases.

One approach is the "Unicode Sandwich" model, in which String is just a special case of decoding "on the edge" into a UnicodeString, except that we're decoding code pages instead of UTF data. That will apply to many scenarios "in real life", but maybe not to your situation. Perhaps you could elaborate with an example?

Maybe I already answered your question with "we're looking at it for VAST 2023"? As I've done in a few places, you can always just call #asUnicodeString on the string-like object in question. If the string has any characters outside 7-bit ASCII, you're going to need to convert it anyway.
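A minimal workspace sketch of that suggestion, assuming (as stated above) that #asUnicodeString can be sent to any string-like object, including one that is already a UnicodeString. The variable names are only illustrative.

```smalltalk
"Sketch only: normalize both operands to UnicodeString before comparing,
 so the result no longer depends on which side is a legacy String.
 Assumes #asUnicodeString is understood by every string-like object."
| legacy unicode |
legacy := 'Smalltalk'.                    "code-page String"
unicode := 'Smalltalk' asUnicodeString.   "UnicodeString"

"Mixed comparison: per the documentation quoted in the question,
 String = UnicodeString answers false."
legacy = unicode.

"Normalized comparison: both sides are UnicodeStrings,
 which is the case the documentation lists as true."
legacy asUnicodeString = unicode asUnicodeString.
```

The cost is one conversion per comparison, so in a hot path it is probably better to convert once where the data enters the system, in the spirit of the "Unicode Sandwich" mentioned earlier.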

Again, depending on your needs, another set of APIs to look at is "Views". We actually wrote bridging code so that all the "Views" can be used on the String class as well. So if you need to convert a code-page <String> to graphemes, Unicode scalars, UTF-8, UTF-16, or UTF-32, you can do so, and this allows your code to stay polymorphic in those areas. All of these are provided as extensions on String defined in the UnicodeSupport library.

Thanks for the question.

- Seth

Hans-Martin Mosner

November 22, 2023, 3:45:12 AM

To: VAST Community Forum

May I bring up an observation we already made in 2017 about string collation/comparison?

Since 8.6.3 (I think), EsString comparison (which uses CurrentLCCollate) yields different results compared to String comparison (which uses primitive VMprStringLessEqual).

With UnicodeString, there's another class using another comparison method. Sadly, UnicodeString is placed in the class hierarchy separately from EsString and its subclasses, which will likely bite some people.

Some results from VA 12.0.0 with the strings 'a' and 'ö' in different representations show how messed up this is:

'a' < 'A' false
'a' asDBString < 'A' asDBString true
'a' asUnicodeString < 'A' asUnicodeString false

'ö' < 'Ö' false
'ö' asDBString < 'Ö' asDBString false
'ö' asUnicodeString < 'Ö' asUnicodeString true

This just can't be right: three String implementations disagreeing with each other on simple cases. Our patched 9.1 does this (in a German locale):

'a' < 'A' true
'a' asDBString < 'A' asDBString true

'ö' < 'Ö' true
'ö' asDBString < 'Ö' asDBString true

This also demonstrates (via the DBString comparison) that LCCollate>>compareString:with: differs between 9.1 and 12.0.

In our software, we've overwritten the String (and Character) comparison to use the superclass method (collation) although this means system class changes whenever we migrate to a new VA version.

Since string comparison is something that most people just take for granted, it should work consistently and according to reasonable expectations out of the box. I don't know whether assuming that #<= collates according to the current locale is a reasonable expectation, but getting consistent results with the different String implementation classes should be one IMO.

Cheers,
Hans-Martin

Adriaan van Os

November 22, 2023, 12:49:51 PM

To: VAST Community Forum

Hi Hans-Martin, I don't know about DBString, but for me UnicodeString agrees with String for

(Character value: 246) asUnicodeString < (Character value: 214) asUnicodeString.

Cheers,

Adriaan.

Esteban A. Maringolo

December 1, 2023, 12:07:31 PM

To: va-sma...@googlegroups.com

Hello Hans-Martin,

First of all, thanks for such a detailed review of the collation of all string classes in VAST.

One thing I'd like to note is that you might be running your assertions in a UTF-8 enabled workspace. To perform your tests properly, set ANSI as the encoding of the workspace and then paste or type the following:

'a' < 'A' false
'a' asDBString < 'A' asDBString true
'a' asUnicodeString < 'A' asUnicodeString false

'ö' < 'Ö' false
'ö' asDBString < 'Ö' asDBString true
'ö' asUnicodeString < 'Ö' asUnicodeString false

You'll see that the difference between String/DBString is still there, but not so for String/UnicodeString. The other alternative is to instantiate a UnicodeString as Adriaan pointed out in another email:

(Character value: 246) asUnicodeString < (Character value: 214) asUnicodeString.
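For reference, a small workspace sketch that builds both operands from code points, so the result does not depend on the workspace encoding. It assumes the active code page maps 246 and 214 to 'ö' and 'Ö' (as Windows-1252 does) and uses the standard String class>>with: constructor; the variable names are illustrative only.

```smalltalk
"Sketch only: construct the test strings from code points instead of
 typing non-ASCII literals, so the workspace encoding cannot alter them.
 Assumption: code points 246 and 214 are 'ö' and 'Ö' in the active code page."
| lowerChar upperChar |
lowerChar := Character value: 246.
upperChar := Character value: 214.

"Legacy String comparison"
(String with: lowerChar) < (String with: upperChar).

"UnicodeString comparison, as in Adriaan's expression"
lowerChar asUnicodeString < upperChar asUnicodeString.
```

If String and UnicodeString agree, as reported above, both expressions answer the same Boolean.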

But what you pointed out about the differences between DBString and String in how collation is performed is correct. That is due to what each of them uses at the lowest level, and it has historical reasons.

As for locale-sensitive collation of Unicode strings [1], this is something we deliberately left out of our initial Unicode support implementation, as it is a large and complex issue in its own right and somewhat orthogonal to the features we already have.

Spurred on by your reports, we've been discussing a viable solution and have since defined String collation as a major feature to be included in VAST 2025 (14.0).

As a general outline of how this would be implemented, we'll probably have a "String Collator" hierarchy to which we'll delegate the actual comparison. Within the hierarchy you'll have different collation implementations (e.g. VMprStringLessEqual, CurrentLCCollate), platform-dependent collation (e.g. CompareStringEx() on Windows), etc.
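To make the delegation idea concrete, here is a purely hypothetical workspace sketch; it is not the planned VAST design. The collator names (#default, #unicode) and the registry are invented for illustration, and blocks stand in for what would presumably be classes in a real "String Collator" hierarchy.

```smalltalk
"Hypothetical sketch: comparison is delegated to a pluggable collator.
 Blocks stand in for collator classes; all names here are illustrative."
| collators lessThan |
collators := Dictionary new.

"Whatever #< does for the receiver's own class"
collators at: #default put: [:a :b | a < b].

"Convert both sides first, then compare as UnicodeStrings"
collators at: #unicode put: [:a :b | a asUnicodeString < b asUnicodeString].

"Callers pick a collator by name instead of hard-wiring a comparison"
lessThan := [:collatorName :a :b |
    (collators at: collatorName) value: a value: b].

lessThan value: #unicode value: 'a' value: 'A'.
```

A locale-aware collator would then be one more entry in the registry, and callers would not need to change.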

But more importantly, we will be evaluating the integration of ICU4X [2] into VAST as another collator alternative, which would give us a superior implementation of Unicode collation, enable locale-sensitive collation of UnicodeStrings, and maintain the performance level we already have for everything Unicode-related.

I hope you find this an interesting way to get better collation support in VAST.

Best regards,

Esteban Maringolo

Senior Software Developer

emari...@instantiations.com

@emaringolo

/emaringolo

instantiations.com
