Hello Joachim,
Thanks for your thoughts on this, much appreciated.
If I were to sum up your points, I would put them as "handling normalization for me" and "guessing the character encoding of an input string".
"Handling normalization for me"
I'll be honest with you: if having this abstracted away were the measure of "proper Unicode support", then most languages today would fail.
This is one of the reasons I chose to make the grapheme the basic unit of a UnicodeString, as opposed to answering Unicode scalars by default.
The other major reason is to properly deal with the plethora of APIs that a UnicodeString will have to support as a result of sitting so far down the
Collection hierarchy. Take something simple like 'reverse'. Try doing that with the NFD form of 'ü' and see what you get...you won't like it.
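The pitfall is easy to reproduce outside Smalltalk. Here is a minimal Python sketch (stdlib `unicodedata` only; this illustrates the general Unicode behavior, not our implementation) of what naive scalar-level reversal does to an NFD 'ü':

```python
import unicodedata

# NFD decomposes 'ü' into 'u' (U+0075) followed by COMBINING DIAERESIS (U+0308).
nfd = unicodedata.normalize("NFD", "ü")
print([hex(ord(c)) for c in nfd])   # ['0x75', '0x308']: two scalars, one character

# Reversing at the scalar level detaches the combining mark from its base:
# the diaeresis now precedes the 'u' and no longer renders as 'ü'.
print(nfd[::-1] == nfd)             # False: the "reversed" single character differs
```

A grapheme-level reverse, by contrast, would keep the base letter and its combining mark together and hand back the same 'ü'.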
Most languages, even the ones with so-called "Unicode support" (which, by the way, is a near-meaningless phrase), make you deal with normalization as just part of what you do.
For example, making sure that strings are in NFC (or NFD) before you do these kinds of operations is always on you.
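In languages that compare strings scalar by scalar, even equality silently depends on which form you happen to be holding. A small Python illustration:

```python
import unicodedata

nfc = "\u00fc"    # 'ü' as a single precomposed scalar (NFC)
nfd = "u\u0308"   # 'ü' as base letter + combining diaeresis (NFD)

print(nfc == nfd)                                 # False: scalar-level comparison
print(unicodedata.normalize("NFC", nfd) == nfc)   # True once both are in one form
```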
I have studied most languages' implementations of Unicode at this point, and I believe Swift and Perl 6 (now Raku) made the right decisions in how they represent this.
It certainly hoists far more complexity onto the implementer, but anything lower level just leads us back to the same conversations about normalization.
With what has been developed, 'überweisung' at: 1 in your example answers the grapheme 'ü'. It may be in NFC or NFD; it doesn't really matter for operations like '=', 'hash',
and '<'. However, we do maintain the original normalization under the hood because, for example, certain filesystems may require a name to be in a specific
normalized form or they won't find it.
The bottom line on this issue is that you will be dealing with graphemes (i.e., user-perceived characters) by default. You won't care about normalization until you hit a
scenario where ensuring a normalized form matters. And when you do, it's easy: we have asNFC, asNFD, asNFKC, and asNFKD. You have the option to normalize in place or make a copy,
or even just to test whether your Unicode string is already in some normalized form.
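For a reference point on that trio of operations (convert, compatibility-fold, test), Python's stdlib exposes the same four forms plus a membership test (`unicodedata.is_normalized`, available since Python 3.8). This is only an analogy to the asNFC/asNFD/asNFKC/asNFKD family above, not our API:

```python
import unicodedata

s = "u\u0308ber"   # 'über' held in NFD

# Convert between the four Unicode normalization forms.
print(unicodedata.normalize("NFC", s))      # precomposed 'über'
print(unicodedata.normalize("NFKD", "\ufb01"))  # NFK* also folds the 'fi' ligature to 'fi'

# Test whether a string is already in a given form, without building a copy.
print(unicodedata.is_normalized("NFD", s))  # True
print(unicodedata.is_normalized("NFC", s))  # False
```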
If you want to work with something else, we have the analog of Swift's "views". For us, these are bidirectional streams that subclass Stream and work on UnicodeString, Grapheme, and UnicodeScalar.
You can use the normal next, atEnd, do:, and inject:into:, and I'll be adding string-slicing equivalents.
Right now the views are graphemes, unicodeScalars, utf8, utf16, and utf32.
These views keep internal bookmarks into the bytes of the underlying implementation (which has nothing to do with how the string is presented to the user), so you get efficient O(n) streaming.
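The point of having multiple views is that the same string has a different length and different elements through each one. Python has no built-in grapheme view, but the scalar and encoded views are enough to show the discrepancy (again, just an illustration of the concept):

```python
s = "\u00fc\U0001F600"   # 'ü' followed by an emoji: two user-perceived characters

print(len(s))                            # unicodeScalars view: 2 scalars
print(len(s.encode("utf-8")))            # utf8 view:  6 code units (2 + 4 bytes)
print(len(s.encode("utf-16-le")) // 2)   # utf16 view: 3 code units (surrogate pair)
print(len(s.encode("utf-32-le")) // 4)   # utf32 view: 2 code units
```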
"1 byte = 1 char = 1 user-perceived character" is absolutely false, and trying to maintain that illusion is what got everybody into the codepage mess in the pre-Unicode era.
And it continues to get people into messes even in the Unicode era, because "1 Unicode code point = 1 character" is also absolutely false...as you have found out.
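A ZWJ emoji sequence makes the second fallacy concrete. Taking a family emoji as the example, Python reports five code points for what renders as a single user-perceived character:

```python
# MAN + ZWJ + GIRL + ZWJ + BOY: renders as one family emoji on most platforms.
family = "\U0001F468\u200D\U0001F467\u200D\U0001F466"

print(len(family))                   # 5 code points, one user-perceived character
print(len(family.encode("utf-8")))   # 18 bytes (4 + 3 + 4 + 3 + 4)
```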
"Guess the character encoding of an input string"
I think if you do some research on this you'll see that, in general, it can't be done in a way that is 100% guaranteed to work.
Anybody who attempts this has to report a confidence level of sorts, and you may even have to provide a sufficient amount of input for it to work at all.
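The ambiguity is fundamental: many byte sequences decode without error under several single-byte encodings, so a detector can only rank candidates by statistical likelihood, never prove one. A small Python illustration (the encodings here are arbitrary picks):

```python
data = b"\xe4\xf6\xfc"   # intended as Latin-1 'äöü'

# Every one of these decodes succeeds; the bytes alone cannot tell you
# which encoding the author meant, so any "detection" is a guess.
for enc in ("latin-1", "cp1252", "koi8-r", "mac-roman"):
    print(enc, data.decode(enc))
```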
We implement our Unicode support in Rust, so if I find a Rust crate that exposes this capability, I'll certainly look at wrapping it and would be happy to do so.
As I said previously, I plan to do a post showing some examples of where we are; it should be interesting.
Here is a small example. I tend to use emojis because they are complex under the hood.
Part of this effort is also to ensure that workspaces and inspectors use Scintilla editors in UTF-8 mode.
The first screenshot is the emoji as UnicodeString and it matches up with the website.
The second screenshot is the first of five unicodeScalars that make up the emoji.
- Seth