[erlang-questions] Downcase Accented characters

Roberto Ostinelli

Oct 21, 2012, 3:14:37 PM
to Erlang
Dear list,

I have a binary string that includes accented characters and other Unicode, which I need to downcase.

Is my best option here really to convert everything to a list and downcase that?

Loïc Hoguin

Oct 21, 2012, 3:18:54 PM
to Roberto Ostinelli, Erlang
Your current best option is ux_string:to_lower/1 from the ux library
which will properly lower all characters, not just A-Z.

Should be at https://github.com/erlang-unicode/ux

--
Loïc Hoguin
Erlang Cowboy
Nine Nines
http://ninenines.eu
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Roberto Ostinelli

Oct 21, 2012, 3:25:11 PM
to Loïc Hoguin, Erlang
Thank you Loïc,

Did you happen to benchmark it? Would that be better/faster than a simple list_to_binary(string:to_lower(binary_to_list(Bin)))?

Loïc Hoguin

Oct 21, 2012, 3:33:25 PM
to Roberto Ostinelli, Erlang
For this comparison, ux would be slow and accurate, while your solution
would be fast and inaccurate. :)

Roberto Ostinelli

Oct 21, 2012, 3:39:22 PM
to Loïc Hoguin, Erlang
For the record, this just works:

start() ->
    Unicode = list_to_binary("∞-HOpe@☺.EXAMple.com/My❤"),
    Result = list_to_binary(string:to_lower(binary_to_list(Unicode))),
    "∞-hope@☺.example.com/my❤" = binary_to_list(Result).

Any downsides I'm not seeing?

Roberto Ostinelli

Oct 21, 2012, 3:45:21 PM
to Loïc Hoguin, Erlang
BTW,

ux dependencies are unsatisfied:

==> ux (get-deps)
Pulling abnfc from {git,"git://github.com/nygge/abnfc.git","master"}
Cloning into 'abnfc'...
Pulling metamodule from {git,"git://github.com/freeakk/metamodule.git",
                             "master"}
fatal: remote error: 
  Repository not found.
Cloning into 'metamodule'...
ERROR: git clone -n git://github.com/freeakk/metamodule.git metamodule failed with error: 128 and output:
fatal: remote error: 
  Repository not found.
Cloning into 'metamodule'...

ERROR: 'get-deps' failed while processing

Loïc Hoguin

Oct 21, 2012, 3:46:17 PM
to Roberto Ostinelli, Erlang
This only works for letters found in latin1, not for all the uppercases
found in unicode. If that's good enough for you then you don't need ux. :)
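
To make the Latin-1 point concrete, here is a minimal sketch (the module name lower_latin1_demo is made up for illustration). It passes string:to_lower/1 lists of code points rather than UTF-8 bytes: capitals in the Latin-1 range are lowered, while anything above 255 comes back unchanged.

```erlang
-module(lower_latin1_demo).
-export([run/0]).

%% string:to_lower/1 maps A-Z and the Latin-1 capitals
%% (16#C0-16#D6 and 16#D8-16#DE) and leaves every other value alone.
run() ->
    "hope"   = string:to_lower("HOpe"),
    [16#E9]  = string:to_lower([16#C9]),   % U+00C9 (É) -> U+00E9 (é)
    [16#394] = string:to_lower([16#394]),  % U+0394 (Greek capital delta): untouched
    [16#410] = string:to_lower([16#410]),  % U+0410 (Cyrillic capital A): untouched
    ok.
```

Calling lower_latin1_demo:run() should return ok if each of the matches holds.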

Roberto Ostinelli

Oct 21, 2012, 3:51:46 PM
to Loïc Hoguin, Erlang
Oh I see.

So if I want to downcase this string: "∞-HOpe@☺.ÉXAMple.com/My❤" I will need ux?

r.

Loïc Hoguin

Oct 21, 2012, 4:00:15 PM
to Roberto Ostinelli, Erlang
Yes and no, this example would still work I think? I'm no expert on how
Erlang deals with unicode, I just know what string:to_lower/1 does. :)
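
One thing worth checking for this string: when the binary holds UTF-8, binary_to_list/1 yields bytes, not characters. É is then the two bytes 16#C3 16#89, and string:to_lower/1 sees the lead byte 16#C3 as the Latin-1 letter Ã and turns it into 16#E3. A small sketch of that effect follows (the module name is hypothetical, and the binary is built from an explicit code point so the result does not depend on source-file encoding):

```erlang
-module(lower_utf8_bytes_demo).
-export([run/0]).

run() ->
    %% "ÉXAMple" as UTF-8: 16#C3, 16#89, then the ASCII bytes.
    Bin = <<16#C9/utf8, "XAMple">>,
    <<16#C3, 16#89, "XAMple">> = Bin,

    %% The byte-wise approach from earlier in the thread.
    Lowered = list_to_binary(string:to_lower(binary_to_list(Bin))),

    %% The ASCII part is lowered, but the lead byte of É changed as well.
    <<16#E3, 16#89, "xample">> = Lowered,

    %% The result no longer decodes as UTF-8
    %% (unicode:characters_to_list/2 returns a list only on success).
    false = is_list(unicode:characters_to_list(Lowered, utf8)),
    ok.
```

So on a UTF-8 binary the byte-wise version looks unsafe for É. On a Latin-1 binary, where É is the single byte 16#C9, the same call would produce é (16#E9).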

Roberto Ostinelli

Oct 21, 2012, 4:10:27 PM
to Loïc Hoguin, Erlang
ok, thank you :)

r.

Thomas Allen

Oct 21, 2012, 4:12:09 PM
to Roberto Ostinelli, Erlang
On Sun, October 21, 2012 3:39 pm, Roberto Ostinelli wrote:
> For the records, this just works..
>
> start() ->
> Unicode = list_to_binary("∞-HOpe@☺.EXAMple.com/My❤"),
> Result = list_to_binary(string:to_lower(binary_to_list(Unicode))),
> "∞-hope@☺.example.com/my❤" = binary_to_list(Result).
>
> any downsides I'm not seeing?

For what it's worth,

1> list_to_binary("∞-HOpe@☺.EXAMple.com/My❤").
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([8734,45,72,79,112,101,64,9786,46,69,88,65,77,112,108,101,
                                  46,99,111,109,47,77,121,10084])

I get that on my system if any of the special characters (∞, ☺, ❤) are
present (R15B02 on Debian 6.0.6 and OSX 10.7.2, both built from source).
So you might need to be careful with that technique.

Thomas Allen

Michael Uvarov

Oct 21, 2012, 4:12:52 PM
to Roberto Ostinelli, Erlang
> ux dependencies are unsatisfied.
Just update it.

Michael Uvarov

Oct 21, 2012, 4:16:37 PM
to tho...@oinksoft.com, Erlang, Roberto Ostinelli
> list_to_binary([8734,45,72,79,112,101,64,9786,46,69,88,65,77,112,108,101,
> 46,99,111,109,47,77,121,10084])

list_to_binary/1 only works for elements from 0 to 255.
Use unicode:characters_to_binary/1 instead.

If ux is too slow, then try i18n (NIF bindings for ICU).
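
A short sketch of the list_to_binary/1 versus unicode:characters_to_binary/1 difference (the module name is illustrative only). The last step is an assumption about the runtime: it needs OTP 20 or later, where string:lowercase/1 exists and accepts UTF-8 binaries. It is not available on the R15B02 mentioned earlier in the thread.

```erlang
-module(chars_to_bin_demo).
-export([run/0]).

run() ->
    %% Code points for "∞-HOpe"; 8734 is U+221E.
    check([8734, $-, $H, $O, $p, $e]).

check(Chars) ->
    %% list_to_binary/1 only accepts bytes (0..255), so this raises badarg.
    {'EXIT', {badarg, _}} = (catch list_to_binary(Chars)),

    %% unicode:characters_to_binary/1 encodes the code points as UTF-8.
    Bin = unicode:characters_to_binary(Chars),
    <<226, 136, 158, "-HOpe">> = Bin,

    %% Assumption: OTP 20+. string:lowercase/1 is Unicode-aware and
    %% returns a binary when given a binary.
    <<226, 136, 158, "-hope">> = string:lowercase(Bin),
    ok.
```

On an older release the final match would have to be replaced by a library such as ux or i18n, as suggested above.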

Yurii Rashkovskii

Oct 21, 2012, 6:02:36 PM
to erlang-pr...@googlegroups.com, Erlang, rob...@widetag.com
Roberto,

You might be able to achieve what you need by using one isolated bit of Elixir's distribution: the String.Unicode module.

It is compiled directly from UnicodeData.txt, so it has all the necessary data embedded, which saves you from talking to gen_servers or ETS tables.

And the best part is that you don't really need Elixir itself to be able to use it:

You can edit https://github.com/elixir-lang/elixir/blob/master/lib/elixir/priv/unicode.ex and rename the module from String.Unicode to, say, :string_unicode (so that it looks native in Erlang code). After compilation you'll get a beam file that you can use independently of Elixir, because it doesn't use anything from the elixir application.

Hope you'll find this helpful.

Marc Worrell

Oct 22, 2012, 5:44:59 AM
to Roberto Ostinelli, Erlang
If you only need to downcase a subset (most European languages), you can also check the z_string.erl module in z_stdlib.

https://github.com/zotonic/z_stdlib/blob/master/src/z_string.erl

We are in the process of splitting useful libraries from Zotonic, and z_stdlib is one of them.
Any additions/fixes are welcome.

- Marc

Richard O'Keefe

Oct 25, 2012, 11:05:52 PM
to Marc Worrell, Erlang, Roberto Ostinelli

Is there any reason why trim{,_left,_right}/1 don't strip
leading/trailing NBSP characters? (Or other Unicode white
space characters above U+0020.)
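
For illustration, a Unicode-aware trim could look roughly like the sketch below. This is not z_string's code: the module and function names are hypothetical, the whitespace set is taken to be the code points carrying the Unicode White_Space property, and the input is assumed to be a valid UTF-8 binary.

```erlang
-module(uni_trim_demo).
-export([trim/1]).

%% Trim leading and trailing Unicode white space from a UTF-8 binary.
%% Assumes valid UTF-8; invalid input would crash in the list functions.
trim(Bin) when is_binary(Bin) ->
    Chars = unicode:characters_to_list(Bin, utf8),
    Left = lists:dropwhile(fun is_space/1, Chars),
    Both = lists:reverse(lists:dropwhile(fun is_space/1, lists:reverse(Left))),
    unicode:characters_to_binary(Both).

%% Code points with the White_Space property, including NBSP (16#A0).
is_space(C) when C >= 16#09, C =< 16#0D -> true;
is_space(16#20) -> true;
is_space(16#85) -> true;
is_space(16#A0) -> true;
is_space(16#1680) -> true;
is_space(C) when C >= 16#2000, C =< 16#200A -> true;
is_space(16#2028) -> true;
is_space(16#2029) -> true;
is_space(16#202F) -> true;
is_space(16#205F) -> true;
is_space(16#3000) -> true;
is_space(_) -> false.
```

For example, uni_trim_demo:trim(<<16#A0/utf8, " hi ", 16#3000/utf8>>) strips the NBSP, the ASCII spaces, and the ideographic space, leaving <<"hi">>.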