Java/TeaVM and the (incompatible) WebAssembly Interface Types 'string' type

Daniel Wirtz

Jun 17, 2021, 3:49:48 AM

to TeaVM

Within the WebAssembly Community Group (CG), a discussion is ongoing about whether the envisioned UTF-8 'string' type in the Interface Types proposal (basically interop between WebAssembly modules and/or JavaScript) would be sufficient as a minimum viable design.

Since Java represents and exposes strings as potentially ill-formed UTF-16 (sometimes simply called WTF-16), an Interface Types 'string' type that enforces well-formedness would make it impossible to roundtrip a string containing an isolated surrogate (e.g. half of a musical symbol or emoji) through it without unrecoverable loss, say

* to a separately compiled WebAssembly module written in the same language

* to JavaScript glue code or a JavaScript module (also WTF-16), or

* to other WTF-16 languages (e.g. C#, AssemblyScript).
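To make the roundtrip problem concrete, here is a minimal Java sketch. It uses the JDK's own UTF-8 encoder as a stand-in for a boundary that enforces well-formedness; the class and method names are made up for illustration, and what an Interface Types implementation would actually do with an isolated surrogate (trap, replace, or something else) is not something this example can show.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateRoundtrip {
    // Encode to UTF-8 and decode again, roughly what a strict UTF-8
    // boundary between two WTF-16 modules would force on every string.
    static String roundtripUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wellFormed = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF
        String isolated = "\uD834";         // only the high surrogate of that pair

        // A complete surrogate pair survives the roundtrip.
        System.out.println(roundtripUtf8(wellFormed).equals(wellFormed)); // true
        // The isolated surrogate is replaced during encoding and cannot be recovered.
        System.out.println(roundtripUtf8(isolated).equals(isolated));     // false
    }
}
```

The point is that the receiving side has no way to tell what the original code unit was, so the loss cannot be repaired after the fact.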

The current direction of travel in the Wasm spec is to not account for this case and to consider isolated surrogates a "bug" that must be "fixed". I have grown very worried about potential detrimental effects on WTF-16 languages that have deliberately chosen to allow isolated surrogates for backwards-compatibility reasons of their string APIs, say when

* (legacy) code must compile to WebAssembly without modification

* data integrity must be guaranteed in between function calls, or

* pre-existing string APIs make it overly easy to accidentally split a surrogate pair in half (say with a 'substring(0, 1)'), which currently just works, but with the Interface Types 'string' type could introduce anything from annoyances to hazards.
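As a small illustration of the 'substring(0, 1)' case, the following Java sketch splits a surrogate pair and then detects the result. The helper hasIsolatedSurrogate is a hypothetical name, not part of any existing API; it is only meant to show how easily such strings arise in ordinary code.

```java
public class SplitSurrogatePair {
    // Returns true if s contains a surrogate code unit that is not
    // part of a valid high+low surrogate pair.
    static boolean hasIsolatedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low surrogate as well
                } else {
                    return true; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";        // one code point, two UTF-16 code units
        String half = clef.substring(0, 1);  // legal in Java, yields "\uD834"

        System.out.println(hasIsolatedSurrogate(clef)); // false
        System.out.println(hasIsolatedSurrogate(half)); // true
        // Concatenating the halves again restores the original string,
        // which is why this "magically works" within a WTF-16 language.
        System.out.println((half + clef.substring(1)).equals(clef)); // true
    }
}
```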

Once the decision is made and UTF-8 with Unicode scalar values (USVs) is chosen, WTF-16 languages will have to live with it and document that data corruption can occur when using the Interface Types 'string' type. Potential alternatives to avoid this outcome are to specify a suitable escape hatch, or to "relax" the 'string' type to match WTF-8 semantics (which basically permits encoding isolated surrogates) so that, unlike UTF-8, it can roundtrip any WTF-16 string. However, it currently appears that these alternatives will not find consensus when it comes to a vote.

There are a few more unpleasant side-effects, like having to re-encode twice, from WTF-16 to UTF-8 (lossy) and back to UTF-16, for each 'string' parameter/return in an Interface Types call. This could put affected languages at a performance and/or code-size disadvantage, but I am not sure how important these are in comparison.
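For those unfamiliar with WTF-8, here is a rough Java sketch of the idea: the byte layout is the same as UTF-8, except that an isolated surrogate is encoded as a three-byte sequence instead of being rejected. The class Wtf8Sketch and its methods are illustrative names only, and the decoder assumes its input came from the encoder (it does no validation), so treat this as a demonstration of the roundtrip property and not as a conforming implementation.

```java
import java.io.ByteArrayOutputStream;

public class Wtf8Sketch {
    // Encode a (possibly ill-formed) UTF-16 string using the UTF-8 byte layout,
    // allowing isolated surrogates through as three-byte sequences.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // an unpaired surrogate is returned as itself
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) { // includes isolated surrogates U+D800..U+DFFF
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Decode bytes produced by encode() back to a UTF-16 string (no validation).
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < b.length; ) {
            int b0 = b[i] & 0xFF;
            int cp;
            int len;
            if (b0 < 0x80)      { cp = b0;        len = 1; }
            else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
            else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
            else                { cp = b0 & 0x07; len = 4; }
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (b[i + k] & 0x3F); // append 6 payload bits per byte
            }
            sb.appendCodePoint(cp); // a surrogate code point is appended as one char
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD834b\uD834\uDD1E"; // isolated surrogate plus a valid pair
        System.out.println(decode(encode(s)).equals(s)); // true
    }
}
```

For well-formed strings the output bytes are identical to UTF-8, so only strings that already contain isolated surrogates are affected by the relaxation.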

If you have an opinion on the matter, I would be happy if you could share it within the respective WebAssembly CG discussion thread before it is too late. The thread over there also has a presentation video for those who are interested in more detailed background on the concepts involved, and there will be a discussion slot on June 22nd in the WebAssembly CG video meeting.

Please share this thread with those you think should be aware. Happy coding! :)

ScraM Team

Jul 9, 2021, 8:53:06 PM

to TeaVM

Daniel,

Thanks for raising this issue here. While TeaVM WASM support is experimental right now, there is a lot of interest. It would be a shame for a technical decision in the WASM community to cause interoperability problems we could prevent through some advocacy.

I have read through your post and several documents on Unicode, UTF-8, WTF-8, and more. At this point I'm definitely concerned about future negative impacts on TeaVM's WASM backend.

I'm hoping you could clarify one point: If a TeaVM program compiled to WASM is passing strings to another TeaVM-compiled method in the same codebase, will each invocation potentially be impacted? Or will this only come into play when TeaVM is invoking an external WASM method? The former sounds disastrous, the latter not great but potentially bearable.