Explore a cross-PL implementation of ICU

Mike Samuel

Dec 12, 2025, 3:23:56 PM
to icu-support
I'm working on a programming language, Temper, for libraries that translate to all the other programming languages.

I was thinking of porting ICU to it.  The idea would be to extend ICU support to PLs that don't have it now like C# and Lua, and as new programming languages arise, they get good i18n support as soon as there's a Temper backend for that PL.

From our side, it'd help us prove out the language, help us refine our translation machinery (we'd want our Java and Rust translations to delegate to ICU4J and ICUX probably), and, down the road, probably help support template DSLs built on Temper.

I just wanted to say hi and make sure we're not stepping on any toes, see if this is something that the Unicode Consortium is interested in having under their umbrella.  And see if people have any advice on things to do / not to do when porting.

cheers,
mike

fyi, I chatted with some i18n/l10n folk about Temper at TPAC recently.  Slides with a bit of background on the Temper project here:
https://temperlang.github.io/tpac2025/Templates/Overview.html

Nebojša Ćirić Ꙉ

Dec 12, 2025, 3:43:22 PM
to icu-support, Mike Samuel
(my previous email was rejected)

Hi Mike,
thanks for the interest. I don't know if Unicode has any toes to be stepped on here, but I can talk about the multilingual approach you mentioned.

In the early days of ICU4X we looked for a write once, use everywhere approach - technically a transpiler solution. We looked into Clojure, Rust and other options as source language and also hoped that Wasm would become a much stronger player.

We chose Rust + Diplomat as the best solution at that moment, and reworked ICU internals to allow better data handling. If you want to base your new Temper library on something, I would suggest ICU4X as it has the most modern API and data handling implementation compared to C/J. Rust is great for performance, FFI (into Wasm/JS, C/C++) but there are questions about interoperability with Java (running numbers to see where the slowdowns are).

Hope this helps.

Nebojša

Mike Samuel

Dec 12, 2025, 3:58:26 PM
to icu-support, Nebojša Ćirić Ꙉ, Mike Samuel
Nebojša, lovely to hear from you. It's been a minute.

Thanks for the feedback.  Yeah, ICU4X (sorry, said ICUX earlier) sounds like the best base to port from, and probably easier to track to keep in sync.

The whole Parrot affair showed that it's hard to switch runtimes, but I'm surprised that more PLs haven't adopted Wasm as their primary runtime. Temper is a language, intentionally without a runtime; our approach to interop is just to work, by translation, into existing runtimes.

(On transpilation, our compiler produces source files, along with other outputs, but we do enough odd things that we fall into Rachit's space of compilers that produce non-traditional outputs)
But I wonder if there might be room for transpilation; it occurs to me that having a common test suite might be good.  Test code tends to be more formulaic / less performance sensitive, so maybe there's room there.

cheers,
mike

Nebojša Ćirić Ꙉ

Dec 12, 2025, 4:04:14 PM
to Mike Samuel, icu-support
I am afraid to count the actual minutes ;).

As for the common testing suite, we did start one, and it is being developed and maintained by the ICU team. It has a number of runners in C, Java and Rust (core is in Python) and operates on CLDR data. Our goal is to make sure our various implementations agree with each other.
It's an interesting idea to write the whole thing in 1 language and transpile - it would save time targeting various implementations.

Elango Cheran

Dec 12, 2025, 7:02:22 PM
to Nebojša Ćirić Ꙉ, Mike Samuel, icu-support
If you want to take a look at the lessons learned from the attempt at a transpiler in Clojure, watch this presentation: https://www.youtube.com/watch?v=terdLf0ribg There are interesting takeaways about PLs, higher level effects of simplicity / complexity, why Rust is an outlier as a target language, etc.

It's a hard problem to get a library that behaves identically across PLs with optimized native speed and idiomatic behavior in each of them. Whether you take a wrap-a-binary approach or a transpiler approach, the complexity will manifest differently, but it never goes away. It just manifests as different problems, whether they're immediately obvious or they surprise you a few years down the road.

Mike Samuel

Dec 15, 2025, 11:29:07 AM
to icu-support, Elango Cheran, Mike Samuel, icu-support, Nebojša Ćirić Ꙉ
Thanks Elango.  Great talk!

Yeah.  We did several years' worth of comparative PL linguistics (sadly not a thing that's been approached academically) before starting our language design and compiler implementation. We crafted a language specifically to translate well, to avoid a lot of those problems.  Not trying to translate an existing language let us avoid semantic traps like having to port the JVM string type's random access semantics onto other languages.  Not having a top type (I think Kalai uses `any`) let us craft a constrained casting semantics.

You talk about ownership briefly, and I see that Kalai.emit.langs includes C++.  Does your memory model rely on persistent data structures' lack of reference cyclicity?

Elango Cheran

Dec 15, 2025, 5:37:36 PM
to Mike Samuel, icu-support, Nebojša Ćirić Ꙉ, timothy...@gmail.com
Thanks! Yeah, the "cpp" you see is a vestige of the first implementation approach in Kalai, which targeted C++ and Java outputs, and used code to carefully traverse the AST of the input forms in order to emit the corresponding output forms. That approach sounds very common and reasonable at first blush, but as the talk mentions, you don't realize how much complexity is involved until you experience the simpler alternative. The simpler rewrite approach only targeted Java and Rust.

As far as what constructs we designed to support in the input source, it was a difficult set of decisions. Like you mentioned, not all languages have random access indexing into the characters of a string (ex: Rust), even though most do. More and more languages have a sequence/stream/iterator abstraction, but that depends on which version of Java or C++, and Python feels clunky, and C probably won't ever have it.

We faced the impedance mismatch early on of mutable collections vs. persistent immutable collections. Choosing to support one or the other better was a mutually exclusive decision, and that choice became a question of "which category of non-functional requirements & applications would we make our transpiler become better suited to support?" This is why we included a section in the talk on which PLs are designed for which types of problems (read: which sets of non-functional requirements). We ended up choosing to support persistent (immutable) data structures because, for high-level and/or general-purpose applications which have changing requirements, our opinionated position from our experience was that this was by far the most fruitful way to program.

Having an apex Any type was important to allow the heterogeneous kind of persistent data structures for maximum flexibility. As the talk shows, when that becomes a common interface for several libraries, the power automatically accrues when you can easily mix and match them. That's how the second Kalai implementation approach was able to easily combine a term rewrite library of macros that looks like a fancy Datalog with the analyzer AST library, where the Datalog style constraint satisfaction greatly simplified the wrangling of detailed verbose AST output. The difference in productivity was night & day. What you lose in static typing, you gain as much if not more in power and flexibility.

While the decision also took Kalai further from being a contender for something low level and performance-sensitive like ICU, it was better suited for a SQL query builder library like HoneySQL to save everyone from the dreadful historical mistake that is ORMs.

Although the decision of mutable vs. persistent immutable data collections was an inflection point, we felt okay knowing that most PLs have a 3P library implementation of it somewhere (even Rust!). Later on, we did find the ownership construct in Rust to be a much more difficult thing to accommodate syntactically, almost orthogonal to any other PL construct we had encountered, and we had to do some unnatural acts to get things just to compile, starting with the Any type. It's worth noting that even in regular Rust programming, when you use serde_json to parse & format JSON, which is heterogeneous data, the serde_json library has to subvert the type system, in a way, by creating the Value enum that enumerates all the possible types. Perhaps it's like a domain-specific Any? You don't ever escape the requirements you want to support and the tradeoffs of your earlier decisions. You just have to pick your poison, according to the category of problems you want to design towards.

Mike Samuel

Dec 16, 2025, 4:19:53 PM
to icu-support, Elango Cheran, icu-support, Nebojša Ćirić Ꙉ, timothy...@gmail.com, Mike Samuel
Thanks for explaining.  Is https://github.com/kalai-transpiler/kalai the latest incarnation of Kalai?  iiuc, you're using two translation strategies for different target languages.  The first two backends work by doing more-or-less a single pass, and the two later backends use a more traditional multi-pass compiler architecture.

Your observation on "approach sounds very common and reasonable at first blush" brought to mind this from Rachit's ["transpiler" rant](https://people.csail.mit.edu/rachit/post/transpiler/):

> Lie #4: Transpilers don't have backends ... Compilers already do things that “transpilers” are supposed to do. And they do it better because they are built on the foundation of language semantics instead of syntactic manipulation.

Yeah, on string random access, most languages have it.  Swift is an outlier, but closest to what we ended up going with.
https://temperlang.github.io/tld/6977ee7c59772b42/draftsite/blog/2025/03/25/a-tangle-of-strings/ captures a lot of our API design work for strings.
tldr: the need for consistent semantics despite native code unit differences led us to try both slice semantics and by-construction index types for low-level string processing.  The former allows perfect consistency but is confusing and inefficient on dynlangs.  The latter looks like indexing and is trivial to translate efficiently.
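
To make the "native code unit differences" concrete, here is a small Python sketch (an illustration only, not Temper code and not taken from the blog post). It measures the same string in code points, UTF-16 code units, and UTF-8 bytes, which is roughly what string length means on different runtimes. The helper name `unit_counts` is made up for this example.

```python
def unit_counts(s: str) -> dict[str, int]:
    """Return the length of s in three common native code units."""
    return {
        # Python's len() counts Unicode code points.
        "code_points": len(s),
        # JVM, JavaScript and C# strings count UTF-16 code units.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Rust's str::len and Go's len(string) count UTF-8 bytes.
        "utf8_bytes": len(s.encode("utf-8")),
    }


# "a" followed by U+1F600, which lies outside the Basic Multilingual Plane.
print(unit_counts("a\U0001F600"))
# {'code_points': 2, 'utf16_units': 3, 'utf8_bytes': 5}
```

An integer index such as `2` therefore points at a different place (or past the end) depending on the runtime, which is the inconsistency that slice semantics or by-construction index types are meant to avoid.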

The collections design space is really gnarly, but since your user community seems to skew towards FP people, persistence definitely seems the way to go.
Defining semantics for iteration over mutable data structures is really hard to get sane, much less consistent.  Java tried ConcurrentModificationException as a principled way to get fail-stop semantics, but I think there are holes in it that you can drive a truck through.
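
As a rough illustration of how inconsistent this gets even inside one language, here is a Python sketch (Python chosen only because it appears elsewhere in this thread; the function names are made up). Growing a dict during iteration fails loudly, while removing from a list during iteration silently skips elements.

```python
def mutate_dict_during_iteration() -> str:
    """Grow a dict while iterating over it and report what happens."""
    d = {"a": 1, "b": 2}
    try:
        for key in d:
            d[key + "x"] = 0  # changes the dict's size mid-iteration
    except RuntimeError as e:
        return str(e)
    return "no error"


def mutate_list_during_iteration() -> list[int]:
    """Remove from a list while iterating over it and return what was visited."""
    items = [1, 2, 3, 4]
    seen = []
    for x in items:
        seen.append(x)
        items.remove(x)  # shifts later elements left, so the iterator skips one
    return seen


print(mutate_dict_during_iteration())  # dictionary changed size during iteration
print(mutate_list_during_iteration())  # [1, 3]
```

The dict case is fail-stop in the spirit of ConcurrentModificationException; the list case is not, and a translated library would have to pick one behavior for every backend.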

On top/any types, I think what really drove home for me how hard it is to get both idiomaticity and semantic consistency was this Python dict with heterogeneous keys.

    Python 3.14.0 (main, Oct  7 2025, 09:34:52) [Clang 17.0.0 (clang-1700.0.13.3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> m = { True: "True", 1: "1", 1.0: "1.0" }
    >>> m[True]
    '1.0'

Even ignoring dynlangs' tendency to just have one scalar type for (numbers, strings, booleans), in key expressions, the conflation becomes super apparent.  And List.find operators tend to recreate those problems in non-relational contexts.
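
For anyone who wants to poke at why that happens, a short Python sketch: the three keys compare equal and hash alike, so the dict keeps the first key and the last value. List search, which is also equality-based, shows the same conflation.

```python
m = {True: "True", 1: "1", 1.0: "1.0"}

# All three keys are equal and share a hash, so they land in one slot.
assert True == 1 == 1.0
assert hash(True) == hash(1) == hash(1.0)

# The first key inserted survives; each later insert overwrites the value.
print(m)                      # {True: '1.0'}

# Equality-based search over a list conflates the same values.
print([1.0, True].index(1))   # 0
print([0.0, 1.0].count(True)) # 1
```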

Yeah.  On JSON {de,}serialization, that was a major design story for us too, but not one that drove us towards heterogeneity.  We allow

    @json
    class Price(
      public currencyCode: String,
      public amount: Int32,
    ) { ... }

@json is a decorator that applies at compile time to define a JSON adapter helper.  That allows doing explicit, type-directed decoding.

    Price.jsonAdapter().decodeFromJson(...)

Since we can't allow runtime type introspection, we do it at compile time.  Meta-programming lets us avoid the need for heterogeneity.  That's super important for things like:

    let someBooleans = [false, true];
    // Should encode `[false, true]` not `[0, 1]` even on runtimes that represent booleans using ints.
    List.jsonAdapter(Boolean.jsonAdapter()).encodeToJson(someBooleans, jsonOut);

Some older dynlangs (eg Perl5 & PHP) have added affordances to allow for JSON encoding ( https://www.php.net/manual/en/function.is-bool.php ), but they're super brittle in practice.
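
A rough Python sketch of the type-directed idea, for readers who don't know Temper. This is not Temper's generated code or its real adapter API; `Adapter`, `boolean_adapter`, `int_adapter` and `list_adapter` are hypothetical names for this example. The point is that the adapter chosen by the static type, not the runtime representation, decides the JSON output.

```python
import json
from typing import Any, Callable


class Adapter:
    """Hypothetical adapter: wraps a function mapping a value to JSON-ready data."""

    def __init__(self, encode: Callable[[Any], Any]):
        self.encode = encode

    def encode_to_json(self, value: Any) -> str:
        return json.dumps(self.encode(value))


# Element adapters normalise whatever the runtime stores into the declared type.
boolean_adapter = Adapter(lambda b: bool(b))
int_adapter = Adapter(lambda i: int(i))


def list_adapter(element: Adapter) -> Adapter:
    """Build a list adapter from the adapter for its element type."""
    return Adapter(lambda xs: [element.encode(x) for x in xs])


# Pretend this runtime represents booleans as ints.
some_booleans = [0, 1]

print(list_adapter(boolean_adapter).encode_to_json(some_booleans))  # [false, true]
print(list_adapter(int_adapter).encode_to_json(some_booleans))      # [0, 1]
```

The same stored data encodes two different ways because the adapter carries the type; no runtime introspection of the elements is needed.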

Yeah, Rust's linearity is definitely an outlier.  I was a huge fan of the Cyclone language back in the day, so I'm a fan of Rust too.
One thing we did was decide that there's no default reference identity operator, à la Clojure (identical? ...).

Elango Cheran

Dec 23, 2025, 5:34:44 PM
to Mike Samuel, icu-support, Nebojša Ćirić Ꙉ, timothy...@gmail.com
Yep, that URL is the Kalai GitHub repo. We ditched the single pass strategy early on, so you won't see it unless you go far enough back that the whole repo is completely different (and internally had a different name). We learned a lot through experience as we went along, including basics like that.

You definitely thought a lot about the design of your APIs for handling strings, but I suppose that makes sense since you have a loftier goal of making any language able to send & read data with another in a way that's seamless and not too inefficient over the boundary, and that requires untangling the mess of how strings are treated by programming languages natively.

It looks like your target language output code relies on helper libraries in the target language, and we did the same thing too. I guess that's unavoidable.

Yeah, we definitely had a high level general purpose FP design in mind, although not exactly clear and predefined to begin with. It was kind of exploratory, and it evolved after discussing it a lot over each decision point, and then we wrote it down, and jotted down a rationale for the more interesting / difficult decisions. Not quite ready to be properly public facing, though. Using heterogeneous collections works fine with the persistent collections because they're immutable, so long as they implement the hash function without collisions.

Homogeneous typed collections definitely make sense for lower level applications, and sending well-defined data through narrow touch points (as good design for services-based architecture requires). Designing for heterogeneous collections further took us away from that, yeah. We just liked the benefits too much that accrued from generic plain data in the world of general purpose programming (which are underappreciated IMO - we didn't have time to expound further on that in the talk).