In the context of the WebAssembly specification/CG,
a discussion is ongoing about whether the envisioned UTF-8 'string' type in the Interface Types proposal (basically interop between WebAssembly modules and/or JavaScript) is minimally viable or not.
Since Java represents and exposes strings as potentially ill-formed UTF-16 (sometimes simply called
WTF-16), an Interface Types 'string' type that enforces well-formedness would make it impossible, and unrecoverable, to roundtrip a string containing an isolated surrogate (e.g. one half of the surrogate pair encoding a musical symbol or emoji), say
* to a separately compiled WebAssembly module written in the same language
* to JavaScript glue code or a JavaScript module (also WTF-16), or
* to other WTF-16 languages (e.g. C#, AssemblyScript).
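To make the failure mode concrete, here is a small sketch (plain TypeScript, no library assumptions) of what an isolated surrogate is: supplementary-plane characters like the musical symbol U+1D11E occupy two 16-bit code units, and either unit on its own is a perfectly representable WTF-16 string that any of the consumers above could otherwise pass along intact.

```typescript
// U+1D11E (musical symbol G clef) is outside the Basic Multilingual Plane,
// so a WTF-16 string stores it as a surrogate pair of two 16-bit code units.
const clef = "\u{1D11E}";
const high = clef.charCodeAt(0); // 0xD834, the high (leading) surrogate
const low = clef.charCodeAt(1);  // 0xDD1E, the low (trailing) surrogate

// Either half on its own is a representable WTF-16 string containing an
// isolated surrogate; WTF-16 consumers can store and compare it losslessly.
const isolated = String.fromCharCode(high);
console.log(isolated.length);                     // 1
console.log(isolated.charCodeAt(0).toString(16)); // "d834"
```

An enforcing 'string' type would reject (or mangle) exactly strings like `isolated` at the module boundary.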
The current direction of travel in the Wasm spec is to not account for this case and consider isolated surrogates a "bug" that must be "fixed", and I have grown very worried about potential detrimental effects on WTF-16 languages that have deliberately chosen to allow isolated surrogates for backwards-compatibility reasons of their string APIs, say when a requirement is to
* compile (legacy) code to WebAssembly without modification
* guarantee data integrity in between function calls, or
* cope with pre-existing string APIs that make it overly easy to accidentally split a surrogate pair in half (say with a 'substring(0, 1)'), which currently magically works, but when using the Interface Types 'string' type could introduce anything from annoyances to hazards.
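The 'substring(0, 1)' case above can be sketched directly: an emoji is one character but two WTF-16 code units, so index-based slicing silently produces a lone surrogate, and within WTF-16 this currently roundtrips without any loss.

```typescript
// "🎶" (U+1F3B6) is one emoji but two WTF-16 code units, so an innocent
// substring(0, 1) slices the surrogate pair in half:
const note = "\u{1F3B6}";
const firstUnit = note.substring(0, 1);

console.log(note.length);                          // 2 (code units, not characters)
console.log(firstUnit.charCodeAt(0).toString(16)); // "d83c" — a lone high surrogate

// Within WTF-16 this "magically works": concatenating the halves restores
// the original emoji bit-for-bit.
const restored = note.substring(0, 1) + note.substring(1);
console.log(restored === note);                    // true
```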
Once the decision is made and UTF-8/USVs are chosen, WTF-16 languages will have to live with it and document that potential data corruption can occur when using the Interface Types 'string' type. Potential alternatives to avoid this outcome are to specify a suitable escape hatch or to "relax" the 'string' type to match
WTF-8 semantics (basically un-disallowing the encoding of isolated surrogates), which unlike UTF-8 can roundtrip any WTF-16 string. However, it currently appears that these alternatives will not find consensus when it comes to a vote.
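For illustration, here is a minimal sketch of what "WTF-8 semantics" means (function name is mine, not from any proposal): the encoder is byte-for-byte identical to UTF-8, except that an isolated surrogate, which UTF-8 must reject, is simply encoded like any other code point in U+0800..U+FFFF, i.e. as a three-byte sequence. Well-matched surrogate pairs are still combined into a four-byte supplementary-plane sequence first.

```typescript
// Sketch of a WTF-8 encoder: generalized UTF-8 that also accepts
// isolated surrogates instead of rejecting them.
function encodeWtf8(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    // Combine a valid surrogate pair into one supplementary code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++;
      }
    }
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // This branch also accepts an isolated surrogate -- the one place
      // where WTF-8 deliberately differs from UTF-8.
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f),
      );
    }
  }
  return out;
}

// A lone high surrogate encodes to ED A0 BC instead of being rejected:
console.log(encodeWtf8("\uD83C").map(b => b.toString(16))); // ["ed", "a0", "bc"]
```

A matching WTF-8 decoder reverses this exactly, which is why any WTF-16 string survives the roundtrip.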
There are a few more unpleasant side effects, like having to unnecessarily double re-encode each 'string' parameter/return in an Interface Types call from WTF-16 to UTF-8 (lossily) and back to UTF-16, which could put affected languages at a performance and/or code-size disadvantage, but I am not sure how important these are in comparison.
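The lossiness of that double re-encode can be observed today with the standard TextEncoder/TextDecoder pair, which already implements the enforcing USV/UTF-8 behavior: any isolated surrogate is replaced with U+FFFD before encoding, so the WTF-16 → UTF-8 → WTF-16 roundtrip does not return the original string.

```typescript
// TextEncoder implements the USV conversion an enforcing 'string' type
// implies: an isolated surrogate is replaced with U+FFFD before encoding,
// so the WTF-16 -> UTF-8 -> WTF-16 roundtrip is lossy.
const loneHalf = "\uD83C"; // e.g. produced by "🎶".substring(0, 1)
const roundtripped = new TextDecoder().decode(new TextEncoder().encode(loneHalf));

console.log(roundtripped === loneHalf); // false
console.log(roundtripped);              // "\uFFFD" (the replacement character)
```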
If you have an opinion on the matter, I would be happy if you could share it in the respective WebAssembly CG
discussion thread before it is too late. The thread also includes a presentation video for those interested in more detailed background on the concepts involved, and there will be a discussion slot on June 22nd in the WebAssembly CG video meeting.
Please share this thread with those you think should be aware. Happy coding! :)