VAST 2022 Unicode Support Info and Tips


Seth Berman

May 2, 2022, 8:38:32 AM
to VAST Community Forum
The following is a link to the original post (https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/nMRmlXnSAgAJ) with a question about UnicodeString / String bridging, in this case equality and comparison.

I have inlined it below and then I will respond to it.

----------------------------------------------------------------------------------------------------------------------------------------
I was just playing a bit with (Unicode) string comparisons using the new 2022 release.

The documentation reads:
Important Note
• Compatibility relationship is uni-directional. A <String> or <Character> does not have direct knowledge of <UnicodeString>.

This is illustrated in the following statements.

'Smalltalk' asUnicodeString = 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString = 'Smalltalk' --> true
'Smalltalk' = 'Smalltalk' asUnicodeString --> false

'Smalltalk' asUnicodeString sameAs: 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString sameAs: 'Smalltalk' --> true
'Smalltalk' sameAs: 'Smalltalk' asUnicodeString --> true (!)

'Smalltalk' asUnicodeString first = 'Smalltalk' asUnicodeString first --> true
'Smalltalk' asUnicodeString first = 'Smalltalk' first --> true
'Smalltalk' first = 'Smalltalk' asUnicodeString first --> false

'Smalltalk' asUnicodeString first sameAs: 'Smalltalk' asUnicodeString first --> true
'Smalltalk' asUnicodeString first sameAs: 'Smalltalk' first --> true
'Smalltalk' first sameAs: 'Smalltalk' asUnicodeString first --> false

'Smalltalk' asUnicodeString includesSubstring: 'Smalltalk' asUnicodeString --> true
'Smalltalk' asUnicodeString includesSubstring: 'Smalltalk' --> true
'Smalltalk' includesSubstring: 'Smalltalk' asUnicodeString --> false

This is just playing, but I was wondering how one in 'real life' would go about this (in an efficient way) in cases where you don't know what kind of strings you're dealing with?

Cheers,
Adriaan.

Seth Berman

May 2, 2022, 9:22:17 AM
to VAST Community Forum
Hi Adriaan,

Thanks for the question.  There are two answers to this question.

Fully transparent bridging between a legacy String/Character and UnicodeString/Grapheme/Unicode Scalar was simply too much to accomplish in a single release cycle. This is a multi-release endeavor, and the focus of VAST 2022 was implementing Unicode support and safely bringing it into the core of the platform. As we move to VAST 2023, we will focus more on integration with specific parts of the product, and changing code in String/Character would be considered part of this integration. We didn't want to rush how we changed other parts of the system, especially in highly optimized areas where even a small increase in code size (primarily VM code) can have a detrimental impact on performance.

"How one in 'real life' would go about this (in an efficient way) in cases where you don't know what kind of strings you're dealing with"
I'm not exactly sure how to answer this question. Some of it depends on your architecture. Some of it depends on what you expect the product to reasonably do on your behalf versus what you will be expected to do yourself if you opt in to Unicode support. I imagine that as we make other libraries in VAST Unicode-enabled, we will be better positioned to answer these questions, but it will require examples and use cases.

One approach is the "Unicode Sandwich" model, and String is just a special case of decoding "on the edge" into a UnicodeString, except we're decoding code-pages instead of UTF data.  And that will apply to many scenarios "in real life", but maybe not your situation.  Perhaps you could elaborate with an example?

Maybe I already answered your question with "we're looking at it for VAST 2023"? As I've done in a few places, you can always just call #asUnicodeString on the string-like object in question. If the string has any characters outside 7-bit ASCII, you're going to need to do a conversion anyway.
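
A minimal sketch of the #asUnicodeString approach, using only selectors that already appear in this thread. It assumes that a UnicodeString also responds to #asUnicodeString, so the conversion is safe when the kind of string is unknown; the variable names are illustrative.

```smalltalk
"Sketch: normalize both operands first, so the receiver is always a
 UnicodeString and the one-directional bridging no longer matters.
 Assumption: #asUnicodeString is also understood by a UnicodeString."
| legacy unicode |
legacy := 'Smalltalk'.
unicode := 'Smalltalk' asUnicodeString.

"Mixed comparison with a String receiver: reported as false earlier in the thread."
legacy = unicode.

"Both sides normalized: the receiver and argument are now UnicodeStrings."
legacy asUnicodeString = unicode asUnicodeString.
legacy asUnicodeString includesSubstring: unicode asUnicodeString.
```

The cost is one conversion per operand, which matches the point above that any string with characters outside 7-bit ASCII needs a conversion anyway.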

Again, depending on your needs, another set of APIs to look at is "Views". We actually wrote bridging code so that all the "Views" can be used on the String class as well. So, if you need to convert a code-page <String> to graphemes, Unicode scalars, UTF-8, UTF-16, or UTF-32, you can do so, and this allows your code to stay polymorphic in those areas. All of these are provided as extensions on String defined in the UnicodeSupport library.

Thanks for the question.

- Seth

Hans-Martin Mosner

Nov 22, 2023, 3:45:12 AM
to VAST Community Forum
May I bring up an observation we already made in 2017 about string collation/comparison?

Since 8.6.3 (I think) EsString comparison (which uses CurrentLCCollate) yields different results compared to String comparison (which uses primitive VMprStringLessEqual).
With UnicodeString, there's another class using yet another comparison method. Sadly, UnicodeString is placed in the class hierarchy separately from EsString and its subclasses, which will likely bite some people.
Some results from VA 12.0.0 with the strings 'a' and 'ö' in different representations show how messed up this is:

  'a' < 'A' --> false
  'a' asDBString < 'A' asDBString --> true
  'a' asUnicodeString < 'A' asUnicodeString --> false

  'ö' < 'Ö' --> false
  'ö' asDBString < 'Ö' asDBString --> false
  'ö' asUnicodeString < 'Ö' asUnicodeString --> true

This just can't be right: three String implementations disagreeing with each other on simple cases. Our patched 9.1 does this (in a German locale):

  'a' < 'A' --> true
  'a' asDBString < 'A' asDBString --> true

  'ö' < 'Ö' --> true
  'ö' asDBString < 'Ö' asDBString --> true

This also demonstrates (via the DBString comparison) that LCCollate>>compareString:with: differs between 9.1 and 12.0.

In our software, we've overridden the String (and Character) comparison to use the superclass method (collation), although this means system class changes whenever we migrate to a new VA version.

Since string comparison is something that most people just take for granted, it should work consistently and according to reasonable expectations out of the box. I don't know whether assuming that #<= collates according to the current locale is a reasonable expectation, but getting consistent results with the different String implementation classes should be one IMO.

Cheers,
Hans-Martin

Adriaan van Os

Nov 22, 2023, 12:49:51 PM
to VAST Community Forum
Hi Hans-Martin, I don't know about DBString, but for me UnicodeString agrees with String for 
(Character value: 246) asUnicodeString < (Character value: 214) asUnicodeString.

Cheers,
Adriaan.

Esteban A. Maringolo

Dec 1, 2023, 12:07:31 PM
to va-sma...@googlegroups.com
Hello Hans-Martin,

First of all, thanks for such a detailed review of the collation of all string classes in VAST.

One thing I'd like to note is that you might be running your assertions in a UTF-8 enabled workspace. To perform your tests properly, you should set ANSI as the encoding of the workspace and then paste/type the following:


  'a' < 'A' --> false
  'a' asDBString < 'A' asDBString --> true
  'a' asUnicodeString < 'A' asUnicodeString --> false

  'ö' < 'Ö' --> false
  'ö' asDBString < 'Ö' asDBString --> true
  'ö' asUnicodeString < 'Ö' asUnicodeString --> false

You'll see that the difference between String/DBString is still there, but not so for String/UnicodeString. The other alternative is to instantiate a UnicodeString as Adriaan pointed out in another email:


(Character value: 246) asUnicodeString < (Character value: 214) asUnicodeString.

But what you pointed out about the differences between DBString and String in how the collation is performed is correct, and that's due to what they use at the lowest level and has historical reasons.

As for locale-sensitive collation of Unicode strings [1], this is something we deliberately left out of our initial Unicode support implementation, as it is a large and complex issue in its own right, and somewhat orthogonal to the features we already have.

Spurred on by your reports, we've been discussing a viable solution and have since defined String collation as a major feature to be included in VAST 2025 (14.0).

As a general outline of how this would be implemented, we'll probably have a "String Collator" hierarchy to which we'll delegate the actual comparison, and within the hierarchy you'll have different collation implementations (e.g. VMprStringLessEqual, CurrentLCCollate), platform-dependent collation (e.g. CompareStringEx() on Windows), etc.
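
To make the delegation idea concrete, here is a purely hypothetical sketch; it is not the planned VAST API. Each "collator" is modelled as a two-argument block, and the sort hands every comparison to whichever block is chosen. The names defaultOrder and caseFolded are illustrative only.

```smalltalk
"Hypothetical sketch, not VAST API: a 'collator' is just a block that
 decides whether its first argument sorts before or equal to its second."
| defaultOrder caseFolded words |

"Delegates to whatever String>>#<= does in the running image."
defaultOrder := [:a :b | a <= b].

"Compares lowercase copies, so case no longer affects the order."
caseFolded := [:a :b | a asLowercase <= b asLowercase].

words := #('pear' 'Apple' 'banana').

"The same collection sorted under two different comparison strategies."
(words asSortedCollection: defaultOrder) asArray.
(words asSortedCollection: caseFolded) asArray.
```

A real collator hierarchy would presumably use classes rather than blocks, but the calling code would look similar: the caller picks a strategy and the comparison itself stays out of String.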

More importantly, we will be evaluating the integration of ICU4X [2] into VAST as another collator alternative, which would give us a superior implementation of Unicode collation, enable locale-sensitive collation of UnicodeStrings, and provide the performance level we already have for everything Unicode-related.

I hope you find this an interesting way to get better collation support in VAST.

Best regards,


Esteban Maringolo

Senior Software Developer

 emari...@instantiations.com
 @emaringolo
 /emaringolo
 instantiations.com

