Naive start with UTF-8 / unicode


Joachim Tuchel

Nov 28, 2025, 3:44:19 AM
to VAST Community Forum
Hi there,


I am currently doing a lot of stuff that runs very long, so I have time to experiment a little. I am really looking forward to entering the world of seamless handling of UTF-8 in VAST. It has been and still is a pain point in many areas. So I ported my stuff to VAST 14.1.0 and made a very first and naive experiment - and failed miserably.

Here's what I did:
  1. open a Workspace (it is UTF8 by default)
  2. Enter a line of text with a few German Umlauts and an emoji
  3. Save the Workspace to a file named unicode.txt
  4. open the file in Notepad -> Umlauts and Emoji are displayed correctly
  5. open a new Workspace and enter this snippet:

    |file|
    file := CfsReadFileStream open: 'unicode.txt'.
    file contents inspect.
    file close.

  6.  Look at the inspector and feel stupid (I'm a pro in this step, btw.)
  7. Feel a bit smarter and change the snippet to

    |file|
    file := CfsReadFileStream open: 'unicode.txt'.
    file contents asUnicodeString inspect.
    file close.

  8. Forget about the feeling smarter part
So I am obviously missing something very basic here. I cannot find anything like enabling Unicode for any Cfs*Stream. And I wonder how I would do any positioning per grapheme in a CfsRead*Stream.

The documentation is a bit sparse for this use case. So what do people do to store a simple Unicode string in a file and read it back? I am sure I am not asking too much here, it must be possible. All I'm asking for is somebody to make me a tiny bit smarter ;-)

Thanks for hints. I'll continue playing in parallel, I guess the solution is somewhat like reading the Stream as a ByteArray and converting it to a UnicodeString. Which may not actually work well for large files, so there sure is more to it... Am I missing some feature I should have loaded?

Joachim






Adriaan van Os

Nov 28, 2025, 4:23:57 AM
to VAST Community Forum
Hi Jochim,

If you want to read a UTF-8 encoded file, you can do

|file|
file := CfsReadFileStream open: 'unicode.txt'.
(UnicodeString utf8: file contents) inspect.
file close.

Does that help?

Cheers,
Adriaan

Adriaan van Os

Nov 28, 2025, 8:54:26 AM
to VAST Community Forum
Hi Joachim,

Taking this opportunity to write your name correctly...

For writing to file, you can use 'fileStream nextPutAll: aUnicodeString asUtf8Bytes' until CFS knows how to handle UnicodeStrings.
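
A minimal round-trip sketch that combines this with the read snippet above. It assumes that CfsWriteFileStream class>>openEmpty: creates or truncates the file and that both opens succeed; the file name is made up, and asUtf8Bytes and UnicodeString utf8: are used exactly as quoted in this thread.

```smalltalk
"Sketch only. Assumptions: openEmpty: creates or truncates the file, and
 both opens succeed (real code should check the result of each open for
 an error before using it as a stream)."
| text out in restored |
text := UnicodeString utf8: #[71 114 195 188 195 159 101].   "Grüße"

"Write: encode to UTF-8 bytes first, since CFS does not handle UnicodeStrings itself."
out := CfsWriteFileStream openEmpty: 'roundtrip.txt'.
[out nextPutAll: text asUtf8Bytes] ensure: [out close].

"Read: fetch the raw contents, then decode them as UTF-8."
in := CfsReadFileStream open: 'roundtrip.txt'.
restored := [UnicodeString utf8: in contents] ensure: [in close].
(restored = text) inspect.
```

The ensure: blocks close the streams even if encoding or decoding signals an error.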

Cheers,
Adriaan

Joachim Tuchel

Nov 28, 2025, 8:55:29 AM
to VAST Community Forum
Hi Adriaan,

yep, that's it. Perfect, thanks!
although... I am a bit surprised that there is no such thing as unicodeEnabled: on Cfs*Stream. The way you showed us here is a bit clumsy, isn't it?

Am I understanding this correctly that I am still required to know that the file is encoded in UTF-8? Is there no such thing as autodetection on a Cfs*Stream?
 
Joachim

Adriaan van Os

Nov 28, 2025, 9:13:34 AM
to VAST Community Forum
You can try '[UnicodeString utf8: bytes] on: ExError do: [:e | e exitWith: bytes asString]' ...
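
If a file happens to start with a UTF-8 byte order mark (the bytes 239 187 191), that is a cheap hint which can be checked before decoding. A minimal sketch on an in-memory ByteArray, using only UnicodeString utf8: from this thread plus plain collection protocol. It does not assume that utf8: strips or tolerates a leading BOM by itself.

```smalltalk
"Sketch only: strip a UTF-8 BOM, if present, before decoding."
| bytes bom hasBom payload |
bytes := #[239 187 191 71 114 195 188 195 159 101].   "BOM followed by Grüße"
bom := #[239 187 191].
hasBom := bytes size >= 3 and: [(bytes copyFrom: 1 to: 3) = bom].
payload := hasBom
    ifTrue: [bytes copyFrom: 4 to: bytes size]
    ifFalse: [bytes].
(UnicodeString utf8: payload) inspect.
```

A missing BOM proves nothing, since many UTF-8 files are written without one, so the fallback on a decoding error is still needed.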

Marcus Wagner

Nov 29, 2025, 5:27:41 AM
to VAST Community Forum
Hi Joachim,

in Smalltalk, an (almost) ideal world of objects, we tend to forget the real world: the CFS stream hierarchy is based on Strings.
Historically, this hierarchy already had to be duplicated (CfsLeadEncodedFileStream) to understand DBStrings (I will not go back further: display code was 6 bit, ASCII 7 bit, and EBCDIC was the first 8-bit, i.e. byte, code).

Now that UnicodeStrings exist to support UTF (8, 16, ...), such a (third) hierarchy does not exist yet.

To stay philosophical, this reminds me to halt and think about a redesign instead of duplicating something again without a profound argument.

The point to be solved is marshalling, that is, converting external-world representations (files of something) into the image (streams of something).

Streams are "of something" from the start and already object oriented; that is, conceptually the design of streams is clean.
The design of the closely related file streams, however, is not clean, because the information "of what", which the external world would have to provide, is missing. We have to open a file of what?
Traditionally (see the open dialogs of several programs) this missing information is provided from somewhere else, sometimes even hidden in the content of the file (BOM), or left to trial and error or to inspection on the fly while reading (as in most editors).
In Smalltalk, however, this missing information is hard coded (see CfsFileStream>>#initialize, String new: buffersize).
The decision to use DBStrings (via CfsLeadEncodedFileStream) was historically delegated to the locale, so it was traditionally configured from the outside.

And there is even a little-known third variant concerning bytes and characters when using streams: the CfsFileStream>>#isBytes: protocol (reflecting the historical difference between binary and text file representations; see the 6, 7, 8 bit bytes I mentioned earlier).

Now all of this has become insufficient. To stay competitive, this cannot simply be made configurable again based on a locale, as it was in the DBString case.

We already have large characters (larger than bytes) under Windows, and that has to be extended again to support UTF.

Besides, any extension has to be carefully checked against ANSI, which imposes rules on classes like streams.

My conceptual idea is something like ReadStream on: UnicodeString new, which provides the missing "of something" information.

This is going to become complex.
My experiment, simply replacing (in CfsFileStream) the hardcoded
CfsFileStream>>#initialize, String new: buffersize
by
CfsFileStream>>#initialize, UnicodeString new: buffersize
causes recursive walkbacks, as the whole underlying character support around locales in turn depends on streams of the old style.

Kind regards
M

Marcus Wagner

Nov 29, 2025, 6:46:40 AM
to VAST Community Forum
I forgot something, to make clear that a simple duplication won't save the day: look at, e.g., OsProcessStream.
Here again everything is based on byte/nonbyte (= Character, not UTF), a distinction implied by ANSI.
In this case, the valuable enhancements to OS process handling do not support UTF-8 and would not benefit if the traditional way, providing a third hierarchy, were followed.
There must be something more capable to cope with this.
As I said at the start, the whole thing comes down to the embedded marshalling, the "of something".
The already observable duplication of streams puts the finger on a weak point.

To finish my philosophical comment: newer programming languages introduced the notion of "traits" to cover this aspect.

Of course, on a meta-level, a trait is also an object.

But Smalltalk does not support this. In ancient times, class and instance aspects were recognised and covered using the Behavior and Class classes.
Concepts like traits would even revolutionize problems such as the exchange of the whole UI system, which is currently happening under Linux (X-motif -> Wayland and others).

Kind regards

Marcus Wagner

Nov 29, 2025, 11:52:51 AM
to VAST Community Forum
A closer look based on experiments revealed: an object oriented approach as sketched out by me recently would fail. 
To implement file I/O, VAST currently makes use of the ANSI-based implementation (covering SBCS and DBCS) but not other or even specialized APIs (see an overview here: https://learn.microsoft.com/en-us/windows/win32/intl/unicode-in-the-windows-api).

This definitely confirms the earlier suggestions provided by Adriaan van Os above.

The Common File System of VAST covers exactly what it says: different OSes (Unix, Windows, mainframes), but not the gaps in the use of a given OS (like specialized APIs under Windows, which have evolved over time).

At this moment, files containing UTF must be dealt with as binary files; this requires explicit handling of nasty boundary situations (like a grapheme crossing byte buffer boundaries).

That means reading and writing bytes (or byte arrays) and explicitly converting bytes to and from graphemes or UnicodeStrings when needed.
In particular, it requires implementing a nextUtf protocol to read UTF and a putUtf protocol to write UTF, both based on a byte stream.
In the long term this may lead to the observed plethora of different hierarchies (in addition to the existing ones: text, binary, double byte, ...).
Using the file stream variants isText or isCharacter would cause problems if graphemes are to be handled.
It is a mistake to assume that a character string can be read directly from a file stream.
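
The buffer boundary problem can be narrowed down with plain byte arithmetic. The sketch below is an illustration, not a VAST API: completePrefixSize is a made-up workspace block that answers how many leading bytes of a buffer can be decoded safely, so that the rest can be carried over to the next buffer. It only guards against split UTF-8 sequences (code points); a grapheme made of several code points can still straddle two buffers.

```smalltalk
"Sketch only: find the longest prefix of a byte buffer that does not end
 in the middle of a UTF-8 sequence. Plain Smalltalk, no VAST-specific calls."
| completePrefixSize chunk |
completePrefixSize := [:bytes | | i lead needed |
    i := bytes size.
    "Step back over at most three trailing continuation bytes (2r10xxxxxx)."
    [i > 0 and: [(bytes size - i) < 3
        and: [((bytes at: i) bitAnd: 2r11000000) = 2r10000000]]]
            whileTrue: [i := i - 1].
    i = 0
        ifTrue: [bytes size]   "empty or malformed: leave it to the decoder"
        ifFalse: [
            lead := bytes at: i.
            "Sequence length announced by the lead byte."
            needed := lead >= 2r11110000
                ifTrue: [4]
                ifFalse: [lead >= 2r11100000
                    ifTrue: [3]
                    ifFalse: [lead >= 2r11000000 ifTrue: [2] ifFalse: [1]]].
            (bytes size - i + 1) >= needed
                ifTrue: [bytes size]
                ifFalse: [i - 1]]].

"A buffer that ends after the first two bytes of a four-byte emoji."
chunk := #[101 32 240 159].
(completePrefixSize value: chunk) inspect.   "2"
```

The bytes after the answered prefix would be prepended to the next buffer before decoding it.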

Concerning the ANSI standard, which refers to an implementation based character set: as the current implementation is based on the traditional ANSI protocol of Kernel32, this does not mean a violation of the standard.

To sum up: my suggested approach of specializing character streams would fail in the current implementation.
Reading and writing Unicode means extra effort using bytes and byte streams.
Sorry.

Kind regards
M

Richard Sargent

Dec 1, 2025, 2:18:07 PM
to VAST Community Forum
On Saturday, November 29, 2025 at 2:27:41 AM UTC-8 Marcus Wagner wrote:
Hi Joachim,

in Smalltalk, an (almost) ideal world of objects, we tend to forget the real world: the CFS stream hierarchy is based on Strings.
Historically, this hierarchy already had to be duplicated (CfsLeadEncodedFileStream) to understand DBStrings (I will not go back further: display code was 6 bit, ASCII 7 bit, and EBCDIC was the first 8-bit, i.e. byte, code).

Now that UnicodeStrings exist to support UTF (8, 16, ...), such a (third) hierarchy does not exist yet.

To stay philosophical, this reminds me to halt and think about a redesign instead of duplicating something again without a profound argument.

I have seen other Smalltalk implementations use the Decorator pattern to address these problems. In fact, I think even the Zinc HTTP implementation uses encoders.
The idea is that the lowest level file stream simply reads or writes bytes from or to the file system.
After that, you decorate the primitive stream with an encoder or encoders to transform between your Smalltalk model's representation and the physical representation.

Possible examples: binary vs text, code pages, UTF-x encodings, line orientation or not (which delimiter to use), ASCII field and group separators, possibly CSV, etc.

Generally, I think this is a good approach although studying a stream in an Inspector is considerably more complicated.
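
The layering can be sketched in a workspace without any new classes: a plain byte stream underneath and a decoding step on top. nextCodePoint below is a made-up block, not something taken from VAST, Grease or Zinc, and it assumes well-formed UTF-8; UnicodeString utf8: is the call from earlier in the thread.

```smalltalk
"Sketch only: a byte stream wrapped by a decoding step that hands out
 one code point at a time. Assumes well-formed UTF-8 input."
| byteStream nextCodePoint decoded |
byteStream := ReadStream on: #[71 114 195 188 195 159 101].   "Grüße"
nextCodePoint := [ | lead extra seq |
    lead := byteStream next.
    "Number of continuation bytes announced by the lead byte."
    extra := lead >= 2r11110000
        ifTrue: [3]
        ifFalse: [lead >= 2r11100000
            ifTrue: [2]
            ifFalse: [lead >= 2r11000000 ifTrue: [1] ifFalse: [0]]].
    seq := ByteArray new: extra + 1.
    seq at: 1 put: lead.
    2 to: extra + 1 do: [:k | seq at: k put: byteStream next].
    UnicodeString utf8: seq].

decoded := OrderedCollection new.
[byteStream atEnd] whileFalse: [decoded add: nextCodePoint value].
decoded size inspect.   "5 pieces: G, r, ü, ß, e"
```

The byte stream never needs to know about encodings, and a different decoding block could be put on top of the same stream, which is the point of the decorator approach.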

Marcus Wagner

Dec 2, 2025, 5:29:01 AM
to VAST Community Forum
Hi Richard, 

I want to remind you: the Common File System is already a decorator (as the name implies), with the goal of being independent of the different OSes.
And this layer is governed by the rules of the ANSI standard (I think it is worth staying compliant).

In general, I cannot recommend decorating this existing decoration with another layer.

To make it clearer and name it precisely: the existing Windows decorator of file streams has a problem handling UTF files under Windows (I do not know the state of UTF support on the other supported platforms, like Unix or z/OS).

And besides, decoration elements already exist, like the LeadEncodedByte classes and, more rudimentarily, the handling of double bytes (another historic character set extension stemming from IBM mainframes).
Implicitly, there is already another technical decoration layer in the platform functions accessing the Windows API: to be able to handle UTF, Windows duplicated the traditional Windows API using wide characters (the A and W Kernel32 function families).

As file access in general is both widespread and a critical resource (even nowadays, despite the gains from modern storage technology), the topic also has to be seen from a distance.

Of course anybody is free to create an individual, isolated quick solution, with performance losses or other consequences.
But such a solution is unlikely to be integrated as a lasting solution.

I looked at it from a more general, long-term perspective. And up to now I have failed to find a minimal fix with general impact.

Kind regards
M

Marcus Wagner

Dec 2, 2025, 7:01:44 AM
to VAST Community Forum
Pro:

To reduce the complexity, I want to clarify further and correct my arguments concerning the ANSI standard:

the standard explicitly cites 8-bit bytes and 8-bit characters (chapter 5.10.1, protocol <FileStream>).
So UTF is NOT covered. Any deviation here to support UTF-8 is without risk, as it is not covered by the standard.

Con:
The decoration has to solve the position problem concerning protocols like next:, position:, etc.
The amount must always be interpreted as the number of objects "passed", that is, UTF characters, not bytes.
So, for example, after skipToEnd, asking for position has to answer the number of characters, not the size of the file in bytes.
As UTF characters have different sizes in bytes, random-access positioning cannot be supported by an OS, as it depends on the content of the file.
A positive side effect of this positioning concept: it does not allow access to misaligned boundaries, which would misinterpret multi-byte UTF sequences.
And it strictly follows object-oriented principles, matching the Smalltalk paradigm.
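
Why a character position cannot be turned into a byte offset without looking at the content can be shown with a small scan. byteIndexOf is a made-up workspace block for illustration only; it counts code points (not graphemes) and assumes well-formed UTF-8.

```smalltalk
"Sketch only: find the byte index where the n-th code point starts.
 Every byte before it has to be inspected; there is no direct formula."
| bytes byteIndexOf |
bytes := #[71 114 195 188 195 159 101].   "Grüße: 5 code points in 7 bytes"
byteIndexOf := [:n | | seen index |
    seen := 0.
    index := 0.
    [seen < n and: [index < bytes size]] whileTrue: [
        index := index + 1.
        "Every byte that is not a continuation byte starts a code point."
        ((bytes at: index) bitAnd: 2r11000000) = 2r10000000
            ifFalse: [seen := seen + 1]].
    seen = n ifTrue: [index] ifFalse: [0]].

(byteIndexOf value: 5) inspect.   "7: the final e starts at byte 7, not byte 5"
```

A decorating stream that answers character positions would have to do this kind of scan, or keep an index, for every position: request.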
Kind regards
M

Johan Brichau

Dec 2, 2025, 3:47:55 PM
to VAST Community Forum
Hi Joachim,

We agree that an abstraction for an 'EncodedCharacterStream', especially in the presence of Unicode strings, is a much desired addition to VAST.
While not immediately planned for the next VAST release, it is something that is on the radar of the development team.

Meanwhile... Adriaan's suggestions are the most succinct way to achieve what you need.
But I wanted to take the opportunity to also mention the Grease app, which offers some abstractions for UTF8 encoding and decoding as well as over streams.
While they are only implemented to support the cases of interest in Seaside, the following examples show they can be of interest beyond that:

message := UnicodeString utf8: #[86 105 101 108 101 32 71 114 195 188 195 159 101 32 97 117 115 32 66 101 108 103 105 101 110 32 240 159 153 130].
GRPlatform current
    writeFileStreamOn: 'test-utf8.txt'
    do: [:stream |
        codec := GRCodec forEncoding: 'utf-8'.
        encodedStream := codec encoderFor: stream.
        encodedStream nextPutAll: message]
    binary: true.


GRPlatform current
    readFileStreamOn: 'test-utf8.txt'
    do: [:stream |
        codec := GRCodec forEncoding: 'utf-8'.
        codec enableUnicode.
        decodedStream := codec decoderFor: stream.
        decodedStream contents inspect]
    binary: true.

Caveat: you will need to add the following method to Grease before the above works, though.

GRVASTConverterCodecStream>>#contents
^ self next: (stream size)

The Grease streams and abstractions obviously do not cover all use cases and were never intended to do that (which is demonstrated by the missing method), but the snippets may be of help to implement what you need. The Grease streams implement the decorator approach, where a file stream is wrapped with an encoding stream that is parameterized with an encoder. As mentioned by Richard, that approach is also implemented by Zinc.

best regards,
Johan

Marcus Wagner

Dec 12, 2025, 8:33:53 AM
to VAST Community Forum
Thanks, Johan, 
I want to contribute two more comments:

Comment 1) Back to the roots, to the beginning of this conversation by Joachim Tuchel:

I found a proof leading to the conclusion that it is impossible to extend Cfs*Stream to cover Unicode in general.
This strange-sounding result follows from a requirement which is silently implied but violated by Unicode (and DBCS, CDC NOS 6/12, Prime PRIMOS code [which was not byte based], and perhaps many, many other codes I do not know).

Proof: files of the existing operating systems, accessed via the existing streams, require that all basic elements have the same size (traditionally words or bytes). The ANSI standard, based on POSIX, has the highest visibility in this sense: bytes and characters uniformly have the same size (even though the standard does not say which size).

In VAST, the Cfs*Stream abstraction in turn implements a facade over several OSes to access the underlying files. So a couple of Smalltalk protocols necessarily depend on this requirement (all of them are position related).

Only UTF-32 fulfills this assumption. But this variant is expensive because of its space demand, so it is rare and seems not to be relevant in practical use.

The existing and the suggested solutions in VAST for UTF-8 or UTF-16, and in particular the approach with EncodedCharacterStream, attempt to close the gap of lacking OS support as well as possible.

Example:
your (Johan's) message above contains 30 bytes which represent 25 Unicode code points [in UTF-32 that would cost 100 bytes!].
Asking a fully functional (hypothetical) OS file system implementation for unicodeFileStream at: 25 should provide the last Unicode character, the expected smiley icon; any existing file stream, however, will yield Character $n.
This (OS) deficit already existed in the ancient implementations covering double byte extensions, but the effects did not matter then.
No OS would answer that the file size is 25 [Unicode characters]; it will answer 30 [bytes] or multiples thereof (like 1K).
So there is no support for random access or addressing on the basis of Unicode characters.
But random access is a necessary part of Cfs*Streams, based on the object (element) concept of Smalltalk.

Thus you can have Smalltalk streams on objects in Smalltalk, but the file systems of the outer world do not support this. Or, if the outer world did support it, it would require excessive space to cover the goals of Unicode.
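
The sizes for Johan's sample can be checked directly from its bytes. The sketch uses only plain collection protocol and relies on the fact that, in well-formed UTF-8, every byte that is not a continuation byte (2r10xxxxxx) starts exactly one code point.

```smalltalk
"Sketch only: byte count, code point count and UTF-32 size of the sample."
| bytes codePoints |
bytes := #[86 105 101 108 101 32 71 114 195 188 195 159 101 32 97 117 115 32
    66 101 108 103 105 101 110 32 240 159 153 130].
codePoints := (bytes select: [:b | (b bitAnd: 2r11000000) ~= 2r10000000]) size.

"Bytes in UTF-8, code points, and bytes needed at 4 bytes per code point."
(Array with: bytes size with: codePoints with: codePoints * 4) inspect.   "30, 25 and 100"

"What a byte-oriented stream answers at index 25: the byte 110, not the emoji."
(Character value: (bytes at: 25)) inspect.   "$n"
```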

Comment 2) (minor)
I could not find the readFileStreamOn:do:binary: protocol you used above. It is not contained in the Grease feature of VAST 14.0.1.
But I understand this in the sense that you already described it as an implementation covering Seaside's needs; likely I missed something to load here, sorry.

Kind regards
M