to_binary and to_string changes


José Valim

Aug 12, 2013, 4:06:56 AM
to elixir-l...@googlegroups.com
Hello everyone,

This is a place to discuss some changes that are coming to Elixir master.

## The basics

Elixir has binaries and lists. A binary holds bytes; a list can hold anything.
A string is a binary encoded in UTF-8. A char list is a list of Unicode codepoints.

Strings and char lists are not tagged, i.e. there isn't anything in the list or in the binary marking it as UTF-8 encoded data. It is the responsibility of the IO interface to know and properly enforce their encoding. This means that, from the perspective of the system, a list of integers could mean anything.

## Codepoints and bytes

Char lists are made of codepoints and strings are made of bytes. This means that the conversion of a char list to a string is not simply a matter of "replacing [...] by <<...>>". Let's see an example:

    # Codepoint for È
    iex> [200]
    'È'
    iex> String.from_char_list [200]
    "È"
    iex> size String.from_char_list [200]
    2

In the example above, we can see that the codepoint 200 is represented by two bytes in UTF-8.
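The same check can be sketched as a small script. This assumes a recent Elixir release, where `List.to_string/1` plays the role of the `String.from_char_list/1` call used above and `byte_size/1` replaces `size/1`; the variable names are only for illustration.

```elixir
# Sketch for a recent Elixir release: List.to_string/1 stands in for the
# String.from_char_list/1 call from the example above.
char_list = [200]

# Encode the list of codepoints as a UTF-8 binary.
string = List.to_string(char_list)

IO.inspect(length(char_list), label: "codepoints")
IO.inspect(byte_size(string), label: "bytes")

# Codepoint 200 (U+00C8) is encoded in UTF-8 as the two bytes 0xC3 0x88.
IO.inspect(string == <<0xC3, 0x88>>, label: "matches the two-byte encoding")
```

One codepoint goes in and two bytes come out, which is why the conversion cannot be a plain copy of the integers.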

## The trouble in Elixir today

Strings were added to Elixir in later versions, which left some rough edges to smooth out.

The first issue concerns list_to_binary/1 and binary_to_list/1. Neither function makes any assumption about the encoding. Consider the code below:

    iex> list_to_binary [200]
    <<200>>

list_to_binary/1 performs a raw conversion from a list of integers up to 255 into a binary. The end result is not guaranteed, in any way, to be a valid string. From this perspective, list_to_binary/1 is a low-level operation. The issue is that developers are using list_to_binary/1 as the primary mechanism for converting char lists to strings, and that is going to yield the wrong result as soon as Unicode characters are involved. We even have this issue in the Elixir source code itself.
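To make the difference concrete, here is a minimal sketch comparing the raw conversion with a UTF-8 aware one. It assumes a recent Elixir release: the raw function is called through the `:erlang` module, and `List.to_string/1` is used as the codepoint-aware conversion.

```elixir
# Raw conversion: every integer becomes exactly one byte, with no encoding step.
raw = :erlang.list_to_binary([200])

# Codepoint-aware conversion: the integer is encoded as UTF-8.
encoded = List.to_string([200])

IO.inspect(raw, label: "raw")
IO.inspect(String.valid?(raw), label: "raw is valid UTF-8")
IO.inspect(String.valid?(encoded), label: "encoded is valid UTF-8")

# Integers above 255 do not fit in a byte, so the raw conversion refuses them.
over_255 =
  try do
    :erlang.list_to_binary([1024])
  rescue
    ArgumentError -> :badarg
  end

IO.inspect(over_255, label: "raw conversion of [1024]")
```

The raw result is a perfectly good binary, just not a string, which is the mismatch described above.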

Furthermore, string interpolation uses the to_binary/1 function (which is powered by the Binary.Chars protocol) to convert the interpolated content into a binary. Let's take a look at it:

    iex> to_binary [200]
    <<200>>

to_binary/1 uses the raw list_to_binary/1, which knows nothing about char lists or Unicode codepoints. Alexei has pointed out that this is very poor behaviour for a function that is supposed to interpolate contents into a String, which is meant to be in UTF-8 after all.

## The solution

In the coming days, list_to_binary/1 and binary_to_list/1 will be replaced by String.from_char_list/1 and String.to_char_list/1, respectively (the original functions will still be available from the :erlang module). to_binary/1 and the Binary.Chars protocol will be replaced by to_string/1 and String.Chars.
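As a rough illustration of where this leads, the sketch below assumes a recent Elixir release rather than the master branch discussed here. In such releases the char list conversion is spelled `String.to_charlist/1`, and `to_string/1` dispatches through the `String.Chars` protocol.

```elixir
# to_string/1 goes through the String.Chars protocol, so a char list is
# encoded as UTF-8 instead of being copied byte by byte.
from_protocol = to_string([200])

# String interpolation uses the same protocol.
interpolated = "#{[200]}"

# Converting back yields codepoints, not bytes.
codepoints = String.to_charlist(from_protocol)

# The raw, byte-oriented functions remain reachable through :erlang.
bytes = :erlang.binary_to_list(from_protocol)

IO.inspect(from_protocol, label: "to_string")
IO.inspect(interpolated, label: "interpolated")
IO.inspect(codepoints, label: "codepoints", charlists: :as_lists)
IO.inspect(bytes, label: "bytes", charlists: :as_lists)
```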

It is important to note that the remaining *_to_binary and binary_to_* functions won't change. I have contemplated many times getting rid of those functions and converting them to to_string variants. The truth, however, is that some functions, like is_binary/1, cannot be converted into is_string/1: is_string/1 would require a guarantee that the data is UTF-8 encoded, and we cannot give this guarantee inside a guard.
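A small sketch of what that means in practice: the guard can only establish that the value is a binary, and the UTF-8 check has to happen in the function body. The module name `GuardSketch` and the return atoms are hypothetical, and `String.valid?/1` is assumed to be available, as it is in recent Elixir releases.

```elixir
defmodule GuardSketch do
  # is_binary/1 is allowed in a guard; a UTF-8 validity check is not,
  # so it is performed in the body with String.valid?/1.
  def classify(data) when is_binary(data) do
    if String.valid?(data), do: :string, else: :raw_binary
  end

  def classify(_other), do: :not_a_binary
end

IO.inspect(GuardSketch.classify("hello"), label: "UTF-8 binary")
IO.inspect(GuardSketch.classify(<<200>>), label: "non-UTF-8 binary")
IO.inspect(GuardSketch.classify([200]), label: "list")
```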


José Valim
Skype: jv.ptec
Founder and Lead Developer

Oren Ben-Kiki

Aug 12, 2013, 4:43:42 AM
to elixir-l...@googlegroups.com
Surely there's still value in being able to do a list_to_binary and binary_to_list to process raw binary files in arbitrary formats? That is, I see the logic in adding the string/char_list functions, but I think these should be an addition and not a replacement.



José Valim

Aug 12, 2013, 4:48:28 AM
to elixir-l...@googlegroups.com
Those will still be there via the :erlang module. If their usage is common enough, it is worth bringing them back, but I want to see if we really need them first. Note that the "replacements" already exist today. The point of removing them is exactly to avoid confusion.



Oren Ben-Kiki

Aug 12, 2013, 4:57:12 AM
to elixir-l...@googlegroups.com
Perhaps rename them to byte_list_to_binary (to distinguish from char_list_to_binary)? Outright removing them and sending people to :erlang for "pedagogical" reasons seems a bit extreme.

Devin Torres

Aug 13, 2013, 12:06:50 PM
to elixir-l...@googlegroups.com
José,

It is already common enough to use both for non-string purposes, e.g. for binary protocols. Also, iolist_to_binary: is this suddenly removed too? What will it be replaced with? I write something using iolist_to_binary every other day.

Yurii Rashkovskii

Aug 13, 2013, 12:14:52 PM
to elixir-l...@googlegroups.com
José,

A lot of good points.

However, I think removing list_to_binary/1 and binary_to_list/1 is a drastic and unnecessary measure. As those function names indicate, they have nothing to do with strings, but binaries. Binaries are a superset of strings and surely don't need to be UTF-8.

Devin raised a really good point: I hope iolist_to_binary is not going away? I personally find myself using list_to_binary instead of iolist_to_binary when my list is a narrow, integer-only list, as opposed to an iolist.

So, my vote is not to remove list_to_binary and friends, but update their documentation to reflect their true nature.

This:

* def list_to_binary(char_list)

Returns a binary which is made from the content of `char_list`.

Is definitely not the best way to document it and I can see how this creates the confusion. It takes not a char list, but a byte list and returns a binary. Nothing to do with strings.

Please keep it and if necessary, I will be happy to send a PR with updated docs.



Devin Torres

Aug 13, 2013, 12:18:41 PM
to elixir-l...@googlegroups.com
Just take a look at any Erlang library for communicating a binary protocol to see how common it is for non-string purposes.

Yurii Rashkovskii

Aug 13, 2013, 12:25:01 PM
to elixir-l...@googlegroups.com
My point exactly.

José Valim

Aug 13, 2013, 12:26:59 PM
to elixir-l...@googlegroups.com

> However, I think removing list_to_binary/1 and binary_to_list/1 is a drastic and unnecessary measure. As those function names indicate, they have nothing to do with strings, but binaries. Binaries are a superset of strings and surely don't need to be UTF-8.

The fact that the names don't tell me anything for sure is exactly the root of the confusion. If we are keeping them around, I would at least rename them to what Oren proposed, byte_list_to_binary and binary_to_byte_list, and possibly add char_list_to_string and string_to_char_list to Kernel.

> Devin raised a really good point: I hope iolist_to_binary is not going away? I personally find myself using list_to_binary instead of iolist_to_binary when my list is a narrow, integer-only list, as opposed to an iolist.

iolist_to_binary is definitely staying.

I am not saying at any point that list_to_binary/binary_to_list is useless. All I am saying is that there is a lot of confusion and we need to work on solving it.

Yurii Rashkovskii

Aug 13, 2013, 12:28:52 PM
to elixir-l...@googlegroups.com
Well, the first step in removing the confusion is fixing the documentation. Renaming is always a harsher step. Can we update the docs first?


José Valim

Aug 13, 2013, 12:42:43 PM
to elixir-l...@googlegroups.com
Please do update the docs! It seems list_to_binary is not the only issue; we can see the same mistake in list_to_atom and friends.




