Proposal: String normalization in compatibility mode and transliteration

Nicolas Goy

Feb 12, 2018, 8:30:48 AM

to elixir-lang-core

1. String.normalize should support the NFKC and NFKD Unicode normalization forms.

Reference: https://www.unicode.org/reports/tr15/

These forms are particularly useful for generating "machine identifiers", such as usernames, from user input.

2. The second part, which is independent but related, is support for Unicode transliteration.

Basically, this is a "non-destructive" Unicode-to-ASCII conversion.

There is an Elixir library that does this:

https://github.com/fcevado/unidecode

and a JavaScript example:

https://github.com/pid/speakingurl

Also some discussion on the forum:

https://elixirforum.com/t/how-to-replace-accented-letters-with-ascii-letters/539/8

My thinking is that all those libraries do it a bit differently, because, well, Unicode is hard.

And with Unicode being so hard, I think it should be implemented at the language level (or in a core library) so that it is done right and supported.

It might not matter much for English readers, but for other languages it is something you will implement eventually, often poorly.
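To illustrate the kind of hand-rolled approach this refers to, here is a minimal sketch of accent stripping: decompose to NFD, then drop the combining marks. The module and function names are hypothetical, and this is a common approximation rather than real transliteration. It assumes valid UTF-8 input and an OTP release that ships the normalization functions in Erlang's unicode module.

```elixir
defmodule AccentStripSketch do
  # Hypothetical helper for illustration only; not part of Elixir or of any
  # library mentioned in this thread.
  # Assumes `input` is valid UTF-8; on invalid input the Erlang call returns
  # an error tuple and String.replace/3 will raise.
  def strip(input) when is_binary(input) do
    input
    # Canonical decomposition: "é" becomes "e" followed by U+0301.
    |> :unicode.characters_to_nfd_binary()
    # Remove nonspacing combining marks (Unicode category Mn).
    |> String.replace(~r/\p{Mn}/u, "")
  end
end

AccentStripSketch.strip("Café Zürich")
# => "Cafe Zurich"

AccentStripSketch.strip("Michał")
# => "Michał" (the "ł" has no decomposition, so it passes through unchanged)
```

The second call shows the limitation: letters that are not a base letter plus a combining mark are left as they are, which is why a proper transliteration table is needed for anything beyond simple accents.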

Some references:

http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines

Michał Muskała

Feb 12, 2018, 8:53:22 AM

to elixir-l...@googlegroups.com

Is there some Unicode standard around transliteration?

As far as I know, transliteration is very language-specific: for example, Russian is transliterated differently when embedded in an English text than when embedded in a Polish text.

Michał.

Message written by Nicolas Goy <ku...@goyman.com> on 12.02.2018, at 14:30:


José Valim

Feb 12, 2018, 12:23:18 PM

to elixir-l...@googlegroups.com

For the first, you can use Erlang’s Unicode module: http://erlang.org/doc/man/unicode.html

From Elixir v1.8 it will be the preferred mechanism for normalization.
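As a minimal sketch of what that looks like for the "machine identifier" use case from the first message: the module and function names below are hypothetical, and lowercasing after normalization is an assumption about what an identifier should look like, not something specified in this thread.

```elixir
defmodule IdentifierSketch do
  # Hypothetical helper for illustration only.
  # Applies NFKC via Erlang's unicode module, then lowercases the result.
  def normalize(input) when is_binary(input) do
    case :unicode.characters_to_nfkc_binary(input) do
      normalized when is_binary(normalized) ->
        {:ok, String.downcase(normalized)}

      # Returned by the Erlang function when the input is not valid Unicode.
      {:error, _converted, _rest} ->
        :error
    end
  end
end

# U+FB01 is the "fi" ligature; NFKC maps it to the two letters "f" and "i".
IdentifierSketch.normalize("\uFB01le")
# => {:ok, "file"}

# Fullwidth Latin letters (U+FF2E, U+FF49, ...) map to their ASCII forms.
IdentifierSketch.normalize("\uFF2E\uFF49\uFF43\uFF4F")
# => {:ok, "nico"}
```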

--

José Valim
www.plataformatec.com.br
Founder and Director of R&D

Nicolas Goy

Feb 12, 2018, 5:51:09 PM

to elixir-lang-core

My bad, I found String.normalize and stopped there; I didn't think of checking Erlang.

About transliteration, that's what I am wondering. I have "hand-implemented" transliteration many times for the languages I use, and I always thought it was a pain. But I by no means have full knowledge of Unicode, and even less of the languages it represents.

Michał Muskała

Feb 14, 2018, 8:39:19 AM

to elixir-l...@googlegroups.com

On 12 Feb 2018, 18:23 +0100, José Valim <jose....@gmail.com>, wrote:

> For the first, you can use Erlang’s Unicode module: http://erlang.org/doc/man/unicode.html
>
> From Elixir v1.8 it will be the preferred mechanism for normalization.

Interesting. Do we have any benchmarks comparing our implementation with Erlang's? If ours is faster, maybe we can contribute it back?

Michał.

José Valim

Feb 14, 2018, 9:00:13 AM

to elixir-l...@googlegroups.com

Their normalization code is faster; we never benchmarked or improved ours.

The rest of the Unicode functionality in Elixir is faster, though, because we work strictly on binaries.
