Hello All,
Happy New Year to everyone! At Instantiations, we're certainly looking forward to 2021.
First and foremost, the release of VAST Platform 2021 is on our minds. We're putting the finishing touches on it now and preparing to release it soon! After this release, one of the next important VAST additions is support for Unicode.
I recently hinted at our "Unicode Support" coming to VAST Platform 2022 in this post: https://groups.google.com/u/1/g/va-smalltalk/c/sG3x1rBBU-E.
Much exciting work has been done in this area in just the past couple of months, and I wanted to share some of it with you. We're also planning a webinar later this year to more formally demonstrate what has been developed.
So how does one define "Unicode Support"?
This is an important question to ask because the answer can vary widely. It is not a binary choice of "having Unicode" or "not having Unicode". In fact, I liked the way Joachim framed it in the aforementioned post as "Proper Unicode Support". Thinking about it in terms of it being "proper" support provides a great frame of reference. (Some programming languages refer to this as "Unicode-correctness".)
To me, "Proper Unicode Support" means functionality integrated into the product such that many of the various concepts in the Unicode standard are available as first-class objects in VAST. I also think "proper" support within VAST means automatically handling many of the complex issues that occur when using Unicode. (Most languages force the user to deal with these issues.)
To meet the above criteria, we've been moving forward with an ambitious implementation that will provide a set of Unicode-related features that, to date, only languages like Swift, Raku (Perl 6), and Elixir can match.
What needed to change inside the VAST Platform?
Unicode Support is truly not a single feature. We've been working towards "Proper Unicode Support" in VAST for many years through the continuing development of its many prerequisites. It's this group of prerequisite features that comes together to make Unicode work properly and holistically.
It's important to note that these foundational features are absolutely essential. After all, VAST was initially designed at a time when Unicode was just being standardized and almost everyone still operated using single-byte character set encodings.
Some of these foundational features included reorganizing, fixing, and improving our code page converter. We also had to develop the capability for UTF-8 encoded filenames in our zip streams. Even the internals of our new OsProcess framework, both in the VM and in the image, were designed around UTF-8 encoding by default. However, many features beyond this are still required to create the necessary foundation.
What are some of the technical considerations?
Many of you can attest (perhaps better than I can) to all the complexities of the digital representation and transmission of the world's languages. Issues are not magically solved because some bytes were thrown into a Unicode string.
The concept of a "character" itself is a complex topic when considered across the spectrum of all languages. Even with Unicode, users still face issues with encodings, whether off the wire or via the filesystem. Endianness in some of the encoded forms (like UTF-16LE or UTF-16BE) can also become an issue.
Normalization forms introduce new complexity, since many "user-perceived" characters can have several different codepoint representations (the precomposed Å, U+00C5, versus the visually identical decomposed form, U+0041 followed by the combining ring U+030A). As mentioned, there is also ambiguity about what a "character" is and how you can access it from a string. Even the Unicode standard's usage of the term "character" is not consistent.
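Since the two spellings render identically, here is the same example at the byte level in plain Smalltalk (just byte arrays, no new API involved): the single user-perceived character Å can legitimately arrive as either of two different UTF-8 byte sequences, so code that compares or measures text byte-by-byte will treat them as unrelated.

   "Å spelled as the precomposed codepoint U+00C5, encoded in UTF-8"
   precomposed := #[16rC3 16r85].
   "Å spelled as U+0041 'A' followed by U+030A COMBINING RING ABOVE, encoded in UTF-8"
   decomposed := #[16r41 16rCC 16r8A].

   precomposed = decomposed.   "false -- byte-wise they are different"
   precomposed size.           "2"
   decomposed size.            "3 -- yet a user perceives a single character in both cases"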
For further reading, there is an interesting history regarding ANSI, regional character encodings, and the Unicode standard. If you are interested, I recommend books such as O'Reilly's "Unicode Explained" and "Fonts and Encodings". Looking at the history of various generations of programming languages with regard to digital language representation is also fascinating and enlightening.
What are some of the features being added to VAST?
VAST Platform (Current State)
The following is a very brief overview of the relevant abstractions in the existing VAST system.
VAST Platform with Unicode Core (NEW -- Coming to VAST 2022)
There are three main abstractions we have developed to facilitate working with Unicode data; they are the Unicode counterparts to the locale-based String/Character: UnicodeScalar, Grapheme, and UnicodeString.
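Purely as an illustration of how these three relate, here is a hypothetical sketch (the constructor and accessor selectors below are placeholders of mine, not the final API): a UnicodeString is indexed by Grapheme, and each Grapheme is in turn composed of one or more Unicode scalars.

   "Hypothetical sketch: the decomposed spelling of Å (U+0041 + U+030A)"
   s := UnicodeString fromCodePoints: #(16r0041 16r030A).

   s size.           "1 -- one Grapheme, the user-perceived character Å"
   s first class.    "Grapheme"

   "Dropping down a level (accessor names assumed):"
   s first scalars.                                "two UnicodeScalar objects"
   s first scalars collect: [:e | e codePoint].    "an ordered collection of 16r0041 and 16r030A"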
Other New Features
Optimizations
Beyond this, we have some follow-on work that we'll be integrating into VAST Platform 2022: support for literals, an upgraded Scintilla editor with UTF-8 encoding by default, file system APIs, and Windows wide-character APIs.
I realize this was a huge amount of information to digest, so thank you for reading it. That said, there's more to come!
We look forward to showing our customers and the community these new features during a live webinar in the coming months!
-Seth
Greetings Philippe,
Thanks for these questions and good to hear from you.
UnicodeString
"Someone has been busy :-)"
Always :)
"There is risk and opportunity here"
Agreed. Given the current state of Unicode in the product, I concluded mostly opportunity.
"I assume #size would answer the number of graphemes."
Indeed, it does. And that size is cached, since it must be computed. A UnicodeString also has a capacity or 'usable size'. This is mostly managed internally to keep various write APIs like #addAll:, #replaceFrom:to:with:, #at:put:, and so on performant for variable-length elements.
"As most external systems (XSD, RDMS, ...) will likely
use code points or worse "Unicode code units" users will have to
remember to use the correct selector."
Good points. Many tokenization algorithms might be sensitive to this as well, making code point iteration the better choice.
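To illustrate why the selector choice matters, here is another hypothetical sketch (#codePoints and #utf16 are assumed names for the corresponding conversions, not confirmed API): the same short string gives three different counts depending on the unit you ask about.

   "Hypothetical sketch: grapheme count vs. code point count vs. UTF-16 code unit count"
   s := UnicodeString fromCodePoints: #(16r0041 16r030A 16r1F600).  "Å (decomposed), then an emoji"

   s size.              "2 -- graphemes: Å and the emoji"
   s codePoints size.   "3 -- U+0041, U+030A, U+1F600"
   s utf16 size.        "4 -- the emoji occupies a surrogate pair in UTF-16"

A database column whose length limit is defined in UTF-16 code units would measure this text as 4, so the caller has to reach for the matching selector rather than plain #size.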
Views
"How are they different from encoding and decoding
support? Is this orthogonal to encoding and decoding support?"
Decoding/encoding happens under the hood in most views. The internal representation of the actual data is UTF-8, and from there it must be transformed into a stream of graphemes, Unicode scalars, UTF-8 (easy), UTF-16, or UTF-32. We have other methods to convert various encoded forms to a UnicodeString, but views are not that method. Views can be treated as positionable read streams, so the next/atEnd APIs apply. What is a little different is that they are also bi-directional, so there is the ability to stream in reverse. Views can also be treated as a read-only interface to a collection, so #do:, #collect:, #select:, #inject:into:, and so on are available. Asking for #size, #contents, and various ranged copies is all optimized in the VM. Due to copy-on-write, views are consistent no matter what happens to the UnicodeString later. In that sense, views are immutable.
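A small usage sketch of the above (again hypothetical; selector names like #graphemeView and #scalarView are placeholders of mine): the same UnicodeString can be consumed either as a positionable, bi-directional stream or through the read-only collection protocol.

   "Hypothetical sketch: one UnicodeString, two views"
   s := UnicodeString fromCodePoints: #(16r0041 16r030A 16r1F600).

   "As a positionable, bi-directional read stream over graphemes"
   view := s graphemeView.
   view next.     "the grapheme Å"
   view next.     "the emoji grapheme"
   view atEnd.    "true"

   "As a read-only collection over Unicode scalars"
   s scalarView inject: 0 into: [:count :each | count + 1].   "3"

   "Copy-on-write: a view keeps answering the contents it was created over,
    even if the UnicodeString is modified later."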
"Did you consider making strings immutable? If so, what
were some of the considerations? Was there just too much code that expects
strings to be mutable?"
One of the main objectives was to create a String/Character drop-in replacement. In that respect, it could not practically be enforced as immutable. As mentioned, views can capture immutability and give just that part of the collection interface that would be appropriate for this constraint.
Additional Information
Certainly, there is a whole other list of challenges regarding coexistence with String and Character, but they are not as interesting to hear about, and my initial post in this thread was getting rather long. These challenges would not be unique to us, and a lot of preparation for this task was done by researching the lessons learned from various languages that have been augmented with Unicode, such as Delphi and Python 2. What we will not be doing is a Python 3-like transition where our existing String and Character just, all of a sudden, become Unicode. That would be a disaster in VAST on so many levels.
I do not know if the choice of making the basic unit an extended grapheme cluster was bold or not. What bothered me most about any other representation was how the Collection APIs would have the potential to just fall apart on you. It reminded me of what might happen if one viewed a collection of Integers as a bunch of indexable bytes. Sure, if all the integer values are < 256, this is going to work out seamlessly for everyone. Until it doesn't. So a 'byte' is probably not the appropriate way to canonically view a collection of Integers. Likewise, when working with the Collection API, it just did not seem appropriate to force the user to work with only part of what they probably consider a character to be. It certainly creates a lot more work for them to use the Collection API appropriately and correctly. There will always be cases that call for other representations of a UnicodeString, which is why we created performant views.
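To connect that analogy back to strings (one more hypothetical sketch; the constructor is an assumed name), grapheme-cluster indexing keeps 'the first element' meaning the whole user-perceived character, whereas codepoint indexing could hand back only part of it.

   "Hypothetical sketch: Å (decomposed, U+0041 + U+030A) followed by B"
   s := UnicodeString fromCodePoints: #(16r0041 16r030A 16r0042).

   s size.                 "2 -- the graphemes Å and B"
   s copyFrom: 1 to: 1.    "'Å' -- the whole user-perceived character"

   "With codepoint indexing, 'element 1' would be a bare U+0041 'A',
    silently dropping the combining ring -- the Integers-as-bytes problem."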
Many thanks for your questions, Philippe. I look forward to hopefully seeing you at a future ESUG event.
- Seth
" Since all the optimizations went into this I wonder if at one point users will start asking for CP-1252 views or similar for efficient access and then you'll have to maintain two encoding and decoding stacks."
- From my standpoint, if customers continue to invest in the product and ask for such things, then we will be happy to do it. Perhaps closer to your meaning from an implementation point of view: I haven't really talked much about the fact that the Unicode algorithms are implemented almost exclusively in Rust. Rust has genuinely nice native Unicode support, modelled appropriately for a systems-level programming language, and a super-trim runtime that we link in. We make use of its various 'crates' to help us get the functionality we need, and my sincere goal is to be able to give back to that community like we did with Dart. In regard to your comment, I would be taking a strong look at crates like 'encoding_rs' to get that kind of functionality for our customers.
"I believe going against established consensus and conventional wisdom among 20+ year old programming languages is bold."
- That sounds ominous :) But I understand your meaning, and it is undeniably something that had to be thought about. We find ourselves in the unusual position of being a 20+ year old programming language that currently has near zero formal support for Unicode, and we're just now implementing it. There is a whole era of languages that had to make those decisions long ago, chose things like UCS-2 as their internal representation, and based their indexing decisions on that. Time went on, lessons were learned about the inadequacies of UCS-2, compatibility was likely a top goal, and UTF-16 entered the picture. Their indexing strategy was probably set in stone at that point. In the modern era, more recent languages like Rust, Go, Julia, and Swift (only as of version 5, due to Objective-C baggage) chose UTF-8 internal representations. Some that don't, like Dart, have their hands tied because of JavaScript. There is a growing departure from placing such importance on constant-time indexing. We're seeing more complexity in the underlying structure of Unicode 'characters' like emoji, which undoubtedly puts more strain on codepoint-based implementations. So, in general, I think the consensus and wisdom in this area are still in motion, not established. But to be fair, you said established consensus/conventional wisdom among 20+ year old languages. To that I would say that those eras of languages are not our target for this new support. And I would wonder: if those languages had both the body of knowledge and the Unicode standard as they exist today, would they still make the same choices?
I think you are right: a grapheme-cluster-based UnicodeString has the potential to not do what everybody expects, whether from a functional or a performance point of view. We will work on optimizing the fast cases, but in the end it's not a byte string. Mostly, though, I believe the trade-offs are favorable. We can abstract away normalization and character boundaries, which I believe will be an ultimate win in this environment. I certainly don't believe in one-size-fits-all approaches; for example, systems-level programming languages probably shouldn't use grapheme-cluster indexing by default. It's not that way in Rust, and I wouldn't expect it to be.
Certainly, given a project of this magnitude, there are going to be quite a few decisions we've discussed that will need adjustment (or abandonment) along the way.
This is a good thought exercise, and I very much appreciate it, Philippe. I certainly respect your knowledge in this area and many others. I've tried to elaborate as much as possible, even though I know some things have been obvious to you. I hope you don't mind; I've done it for the benefit of the group, as I'm already being told people are getting a great deal out of this.
- Seth