UnicodeString vs. String

Bob Nemec

Jul 16, 2025, 2:33:45 PM
to VAST Community Forum
While testing GemBuilder in VAST 14.0.0 I came across a problem where GemBuilder views assume String instances and get UnicodeString instances instead. One failure is due to UnicodeString answering an empty string when created with new: and a size argument...

(UnicodeString new: 100) size -->  0
(String new: 100) size  --> 100

...this breaks in OSWidget class>>withCrLf: because the 'result' temp variable is created with 'aString class new: count', which answers an empty string; that causes an exception in UnicodeString>>replaceFrom:to:with:startingAt:

Since VAST 11.0.1 has only a partial Unicode implementation, the GemBuilder browser did not have an issue there. I showed this to Richard Sargent from GemTalk and he suggested this could be of interest to others.

I'm curious to know what should be expected from 'UnicodeString new: 100'.

Bob Nemec
A bad day in [ ] is better than a good day in { }

Marcus Wagner

Jul 17, 2025, 6:17:59 AM
to VAST Community Forum
UnicodeString requires separating the notions of size and capacity, in contrast to String.

String new: 100 delivers a string containing 100 nul characters. This was often used to support streams by providing them with a buffer of the capacity given in new:.
UnicodeString, however, has to cover a whole dimension of other problems that force this design split (individual characters occupy varying amounts of space, not necessarily a uniform 1 or 2 bytes). A side effect: filling with nul characters now has to be controlled internally, and nuls cannot be silently tolerated.
So UnicodeString new: 100 now delivers an empty string (size = 0 and capacity = 100), whereas String new: 100 (where capacity = size) was not empty but filled with nuls.

This refinement in design (size vs. capacity) requires a review of any application that replaces String with UnicodeString, including the (low-level) aspects of its design: size may deviate from capacity (an implicit requirement of Unicode).

All such traditional mix-ups have to be adapted, in particular all uses of String new: x that are intended to deliver some kind of work buffer.
Besides, such traditionally tolerated mix-ups have already caused many other problems in the past (recently discussed in connection with the mistaken use of Stream on: 'ABCD', growing buffers, and related topics). Now, when adapting to UnicodeString, is the time to solve such things.
A hint: inspect any UnicodeString new: x. In other words: inspect every old String new: x allocation and check whether it was actually meant to allocate a buffer. Such uses of UnicodeString would be critical because of the dynamic size behaviour behind it.
Also remember the read-only property: Stream on: 'ABCD' was not only a mistake but also kept the GC busy for nothing; what was actually meant was a buffer created by String new: 4.
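
As a minimal sketch of the kind of review described above: the first half shows the old work-buffer habit, which relies on size = capacity, and the second half shows a stream-based alternative that does not. The variable names are hypothetical, and the sketch assumes that writeStream is available on both string classes, as the streaming rewrite suggested later in this thread implies. It has not been verified against a particular VAST release.

```smalltalk
"Sketch only; 'source', 'buffer' and 'stream' are hypothetical names."
| source buffer stream |
source := 'ABCD'.

"Old habit: treat new: as 'give me n writable slots' and fill them by index.
 This relies on size = capacity, which holds for String."
buffer := String new: source size.
buffer replaceFrom: 1 to: source size with: source startingAt: 1.

"Size-independent alternative: new: only supplies the initial buffer,
 and the stream keeps track of how much has actually been written."
stream := (source class new: source size) writeStream.
stream nextPutAll: source.
stream contents
```

The second form never asks the receiver for an index that may not exist, so it does not care whether new: answers a string of size n or of size 0.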
-
M

Richard Sargent

Jul 17, 2025, 12:53:43 PM
to VAST Community Forum
These are the steps to reproduce the error. I just evaluated this in 11.0.1.
I am sure that the same thing happened in 14.0.0 to Bob. [GemStone source code is delimited with just LFs.]

OSWidget withCrLf: ('abc', String lf, ' ^#abc') asUnicodeString

A UnicodeString with size 0 cannot replace its contents, since the offsets don't exist.
'Primitive failed in: UnicodeString>>#replaceFrom:to:with:startingAt: due to Index out of range in argument 1' (args are 1, 3, the original UnicodeString, and 1)
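
A reduced form of the same allocate-then-replace pattern, as a sketch: it assumes the behaviour reported above (UnicodeString new: answering a string of size 0) and uses hypothetical variable names. Under that assumption it should fail in the same way as the withCrLf: call.

```smalltalk
"Sketch only; 'source' and 'result' are hypothetical names."
| source result |
source := 'abc' asUnicodeString.

"Per the report above, result has size 0 when source is a UnicodeString."
result := source class new: source size.

"Indexes 1..3 do not exist in an empty receiver, so this is expected to fail."
result replaceFrom: 1 to: source size with: source startingAt: 1
```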

Marcus Wagner

Jul 17, 2025, 4:18:30 PM
to VAST Community Forum
There are two bugs in it:
1) withCrLf: computes a wrong index, as in
(j := aString indexOf: Lf startingAt: i) == 0
ifTrue: [j := size + 1].
-> j > size, i.e. it addresses something behind the last character of the String. UnicodeString does not tolerate this!
2) In ASCII (the goal of the method, as its name indicates) each LF has to be paired with a CR (on Windows), so whenever an LF is detected a CR is inserted,
thus growing the string whenever CRs are needed to pair with embedded LFs. Bad, but hm. A string like the one in this example, initially 10 chars, becomes 11 chars long.
This does NOT hold under Unicode! CrLf is another grapheme of length 1! So every single LF has to be replaced by a CrLf grapheme of size 1.
So under Unicode the string does not grow at all. The whole ASCII logic of creating CrLf pairs (two chars) in place of single LFs changes into replacing every LF grapheme with a CrLf grapheme, without growing the string size.
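
One way to check the size claim in a workspace, as a sketch: it assumes String class>>crlf and asUnicodeString behave as they are used elsewhere in this thread, and that UnicodeString size counts graphemes as described above. The variable name is hypothetical.

```smalltalk
"Sketch only; 'crlf' is a hypothetical name. Evaluate each expression separately."
| crlf |
crlf := String crlf.

"In a byte String, CR and LF are two separate characters."
crlf size.

"If UnicodeString counts graphemes, the CR/LF pair should count as one element."
crlf asUnicodeString size
```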

In short, the method withCrLf: is buggy in several respects and has to be fixed
- under Windows, to not access (tolerated but wrong) indexes beyond the string size, and
- under Unicode, where exchanging one grapheme for another does not even change the size!
At this time I cannot recommend using this method at all, neither with normal Strings nor with UnicodeStrings.

From time to time enhancing some stuff uncovers hidden ancient bugs....
-
M

Marcus Wagner

Jul 18, 2025, 6:16:50 AM
to VAST Community Forum
To complete and rectify the picture: 
OSWidget withCrLf: and withLf: are private methods, thus internal to CommonWidgets and not meant for general use, and initially they were NOT UnicodeString aware.
They work around an ancient bug in certain modal Windows message boxes with icons and a text, where the text was not shown properly and the display looked strange because of the termination of the message string; this was once considered to be a Windows bug.
The purpose was to keep message senders platform independent (looking similar under X-Motif, AIX and Windows).

Hence my recommendation is reinforced: do not explicitly apply these two methods.

However, in case they are applied implicitly, e.g. when using an official API with UnicodeStrings that then causes a crash, CommonWidgets has to be fixed (again) to deal with those modal Windows message boxes, so that the original bugs remain covered and fixed.
 
I do not know whether the two methods are now considered to be UnicodeString aware or not.

The problem is the consistent handling of line termination. It is more than 30 years old.
Under Windows two characters, CR and LF, are in use; this pair was expected to be written explicitly (0D0A).
Under Unix, by convention, only LF aka \n was used to terminate lines (the CR was implied by printers and terminals).
Under Unicode this got a twist, as there is a single grapheme CrLf (size = 1, capacity = 2) serving this purpose.
In Smalltalk under Windows the Unix style and habit was supported, e.g. in Transcript cr and the like. The mandatory second character was provided implicitly, which somehow followed the Unix world though running under Windows. This led to the special linefeed constructs, e.g. in CommonFile.
Smalltalk is under stress here because of its originally implied platform independence, given some Windows inconsistency, and now under the aspect of the relatively new Unicode graphemes.
-
M

Henry Johansen

Jul 21, 2025, 10:05:34 AM
to VAST Community Forum
Hi!
Let me start by noting that support for UnicodeStrings in UI components is currently limited.
The primitive calls themselves (on Windows) were changed in 14.0 to use double dispatch to call the correct OS API for each string class, so they at least accept UnicodeString parameters correctly.
But the Windows themselves (... on Windows) are still created in ANSI mode, so although it is now possible to set Widget texts to UnicodeStrings, they will quickly revert to Locale strings if they receive updates through Window Events.
Enabling Unicode-aware Windows in a non-utf8 Windows locale is a larger task, as it basically involves rewriting the Event loop+system. 

That said, the coverage of support functions that handle both types of strings could be improved separately - withCrLf: / withLf: are among those not known to be UnicodeString aware.
The easiest way to make them work for both string classes is to rewrite them in a streaming fashion, along the lines of:
withCrLf: aString
    "Private - Answer a String that is delimited by CR/LF."
    | crlfStream prevLine |
    crlfStream := (aString class new: aString size) writeStream.
    aString linesDo: [:aLine |
        "Write a CR/LF separator before every line except the first."
        prevLine isNil ifFalse: [crlfStream nextPutAll: aString class crlf].
        crlfStream nextPutAll: aLine.
        prevLine := aLine].
    ^crlfStream contents
The alternative is to delegate the operation to each string class, where we can provide an optimized version for both.
But, although the performance of the above will not be on the same level as the current implementation, I don't think that's critical in this case.

Cheers,
Henry

Marcus Wagner

Jul 22, 2025, 12:55:33 PM
to VAST Community Forum
Hi Henry,
thank you for the information.
That gave me a clear picture of how you approached the challenge of making CommonWidgets Unicode aware.

You clearly stated where things might potentially be in danger, and you also gave hints about ways out if something turns out to fail.

I hope that my answers to Bob's initial observation did not come across negatively, as that was not my intention at all.

Kind regards
Marcus