[Proposal] U+FFFD Substitution of Maximal Subparts

Cameron Duley

Oct 5, 2023, 21:24:28

to elixir-lang-core

As far as I can tell, neither Elixir nor Erlang has a built-in function for replacing invalid sequences in Unicode. The Unicode standard suggests a method for handling this (U+FFFD substitution of maximal subparts). Several other languages (Go, Python, C#, etc.) now follow this spec.

Invalid Unicode is encountered frequently enough that I think it's worth incorporating a solution into Elixir itself.

Present alternatives for handling invalid Unicode (and JSON by extension) are:

- Crashing (not ideal in many cases)
- Roll your own (lot of overhead for accidental complexity)
- Depend on a package (+1 package towards dependency hell)

This is my college try, but I'm certain there's a performant and far cleaner solution to be had in pure Elixir. If not, perhaps this is a request for OTP.
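For readers unfamiliar with the practice, here is a minimal sketch of what "U+FFFD substitution of maximal subparts" means for UTF-8. It is not the implementation linked in this thread. The module name `MaximalSubpartsSketch` and its helpers are hypothetical, and the byte ranges are taken from the table of well-formed UTF-8 byte sequences in chapter 3 of the Unicode Standard. Each invalid byte, or each longest truncated prefix of an otherwise well-formed sequence, becomes a single U+FFFD.

```elixir
defmodule MaximalSubpartsSketch do
  # Hypothetical module for illustration only; not part of Elixir or OTP.
  @replacement "\uFFFD"

  # True when `b` may follow `lead` as the second byte of a well-formed
  # three- or four-byte UTF-8 sequence.
  defguardp is_second(lead, b)
            when (lead == 0xE0 and b in 0xA0..0xBF) or
                   (lead == 0xED and b in 0x80..0x9F) or
                   (lead == 0xF0 and b in 0x90..0xBF) or
                   (lead == 0xF4 and b in 0x80..0x8F) or
                   (lead in 0xE1..0xEC and b in 0x80..0xBF) or
                   (lead in 0xEE..0xEF and b in 0x80..0xBF) or
                   (lead in 0xF1..0xF3 and b in 0x80..0xBF)

  def replace_invalid(bytes) when is_binary(bytes), do: walk(bytes, [])

  defp walk(<<>>, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()

  # A well-formed code point is copied through unchanged.
  defp walk(<<cp::utf8, rest::binary>>, acc), do: walk(rest, [<<cp::utf8>> | acc])

  # Otherwise skip one maximal subpart and emit a single replacement.
  defp walk(bytes, acc) do
    n = subpart_length(bytes)
    rest = binary_part(bytes, n, byte_size(bytes) - n)
    walk(rest, [@replacement | acc])
  end

  # Three-byte prefix of a truncated four-byte sequence.
  defp subpart_length(<<lead, b2, b3, _::binary>>)
       when lead in 0xF0..0xF4 and is_second(lead, b2) and b3 in 0x80..0xBF,
       do: 3

  # Two-byte prefix of a truncated three- or four-byte sequence.
  defp subpart_length(<<lead, b2, _::binary>>) when is_second(lead, b2), do: 2

  # Anything else is replaced one byte at a time.
  defp subpart_length(_), do: 1
end

input = <<0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62, 0x80, 0x63, 0x80, 0xBF, 0x64>>
MaximalSubpartsSketch.replace_invalid(input)
# => "a\uFFFD\uFFFD\uFFFDb\uFFFDc\uFFFD\uFFFDd"
```

Here `F1 80 80` and `E1 80` each collapse into one replacement character because they are prefixes of well-formed sequences, while `C2`, `80`, `80` and `BF` are replaced individually.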

Kip

Oct 6, 2023, 18:26:37

to elixir-lang-core

Cameron, I think this is a useful proposal. Elixir has means to check validity (String.valid?/1) and a mechanism to split valid and invalid code points (String.chunk/2 with the :valid trait). But there isn't, to my knowledge, a means to coerce validity. A couple of thoughts:

1. Since Elixir strings are, by definition, UTF-8, I don't know that special handling of UTF-16 and UTF-32 code points makes much sense, although I accept this may be more Unicode compliant.

2. What would the function be called? Since we have String.valid?/1 maybe String.validate/2 with an option `replace_invalid: utf8_string`. The default `:replace_invalid` could be U+FFFD or it could be `nil`. If the default is `nil` then there could also be a `String.validate!/2` that raises if there is no `:replace_invalid` option.

3. I think the implementation could leverage the code of `String.chunk/2` which uses `String.next_codepoint/1`. That would simplify implementation and be more consistent in code style.
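To make points 1 to 3 concrete, here is a rough sketch of the existing building blocks and of a replacement built on `String.chunk/2`. The module name `ChunkReplaceSketch` is hypothetical and this is not the implementation proposed in the thread. Note that it emits one replacement per run of invalid bytes, which is simpler than, and not equivalent to, the maximal-subparts practice.

```elixir
defmodule ChunkReplaceSketch do
  # Hypothetical helper for illustration; not part of Elixir.
  # Replaces every run of invalid bytes with a single `replacement`.
  def replace_invalid(binary, replacement \\ "\uFFFD") when is_binary(binary) do
    binary
    |> String.chunk(:valid)
    |> Enum.map(fn chunk ->
      if String.valid?(chunk), do: chunk, else: replacement
    end)
    |> IO.iodata_to_binary()
  end
end

input = <<"a", 0xFF, 0xFE, "b">>

String.valid?(input)
# => false

String.chunk(input, :valid)
# => ["a", <<255, 254>>, "b"]

ChunkReplaceSketch.replace_invalid(input)
# => "a\uFFFDb"
```

Under maximal-subparts substitution the same input would yield two replacement characters, one for each of `0xFF` and `0xFE`.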

Kip

Oct 6, 2023, 19:03:20

to elixir-lang-core

Your implementation is definitely fast and memory efficient, so I retract my implementation comments. Now that I've run the benchmarking script and tested a few different approaches leveraging the standard library, I understand better why you've taken the approach you have. Nice work.

José Valim

Oct 7, 2023, 04:40:50

to elixir-l...@googlegroups.com

Hi Cameron,

If the goal is to include this handling for UTF-16 and UTF-32, I suggest proposing this to Erlang/OTP as new functions in the "unicode" module. Otherwise, Elixir only has facilities to deal with UTF-8. You could propose such a feature in their issues tracker.

Also note that "rolling your own" or "depending on packages" is usually not reason enough for adding features to Elixir. Otherwise, one could easily argue Decimal and Jason would be more important additions to the language. :) We do describe which features we would consider part of the language here: https://elixir-lang.org/development.html

Other than that, awesome job on the library and benchmarks. :)

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com.

Kip

Oct 7, 2023, 16:51:34

to elixir-lang-core

Cameron, I think this would be a useful addition to the Unicode library I maintain. If that works for you, please open an issue there and we can collaborate. I think making it part of the Erlang `:unicode` module makes good sense too, as José says, but that's a longer "sales" and implementation cycle.

José Valim

Oct 31, 2023, 12:35:48

to elixir-lang-core

Folks, I am following up on this, where did we land?

The new implementation is roughly 70 LOC for UTF-8, so at first I don't see an issue with adding it to Elixir. However, the Elixir version would be UTF-8 only (part of the String module).

Thoughts?

Cameron Duley

Oct 31, 2023, 13:52:28

to elixir-lang-core

This was the final version I'd landed on for UTF-8:

https://github.com/elixir-unicode/unicode/blob/main/lib/unicode/validation/utf8.ex

Along with the following modules for testing:

https://github.com/elixir-unicode/unicode/blob/main/test/support/unicode_validation_helpers.ex

https://github.com/elixir-unicode/unicode/blob/main/test/unicode_validation_test.exs

I think it's ideal functionality to have in the String module, and the implementation is "reasonable enough" until a native solution is available in OTP.

Testing is my only uncertainty: how much is prudent, and in what manner?

José Valim

Oct 31, 2023, 14:09:20

to elixir-l...@googlegroups.com

Does the specification provide tests for us to include? Otherwise we can include enough tests for full line coverage and a “brute force”/property test commented out.
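As an illustration of what a "brute force" check might look like, here is a small sketch that enumerates every binary up to a given length and reports inputs where a replacement function misbehaves. The module `BruteForceSketch` and its `failures/2` function are hypothetical. The check only asserts that the output is valid UTF-8 and that valid input is left untouched; it does not verify the maximal-subparts rule itself.

```elixir
defmodule BruteForceSketch do
  # Hypothetical helper for illustration. `replace` is any function that
  # takes a binary and returns a binary. Returns the inputs for which the
  # output is not valid UTF-8, or for which a valid input was modified.
  def failures(replace, max_len \\ 2) when is_function(replace, 1) do
    for len <- 1..max_len,
        input <- binaries(len),
        output = replace.(input),
        not String.valid?(output) or (String.valid?(input) and output != input),
        do: input
  end

  # Every binary of exactly `len` bytes (256^len of them, so keep len small).
  defp binaries(0), do: [<<>>]

  defp binaries(len) do
    for byte <- 0..255, rest <- binaries(len - 1), do: <<byte, rest::binary>>
  end
end

# A deliberately simple candidate: one U+FFFD per run of invalid bytes.
naive = fn bin ->
  bin
  |> String.chunk(:valid)
  |> Enum.map(fn c -> if String.valid?(c), do: c, else: "\uFFFD" end)
  |> IO.iodata_to_binary()
end

BruteForceSketch.failures(naive)
# => []
```

Because the input space grows as 256^n, a check like this is only practical for very short inputs, which fits the suggestion to keep it commented out of the regular suite.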

I would say the name “replace_invalid” is excellent.

Cameron Duley

Oct 31, 2023, 18:48:59

to elixir-l...@googlegroups.com

I'd originally looked for tests in the spec, browsers, and other languages to compare against.

W3C's current test suite doesn't seem comprehensive:

https://github.com/web-platform-tests/wpt/blob/master/encoding/replacement-encodings.any.js

Go has a few token tests that are relatively intelligible:

https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/bytes/bytes_test.go;l=1157

--

Thanks,

Cameron Duley

José Valim

Nov 1, 2023, 03:25:00

to elixir-l...@googlegroups.com

Yeah, so I would go with the full coverage route. If you want to provide a PR, it will be welcome. Thank you and sorry for the delay!
