Non UTF8 Binaries in ExUnit.CaptureIO

97 views
Skip to first unread message

Jonatan Männchen

unread,
Apr 6, 2022, 1:15:18 PM4/6/22
to elixir-lang-core
Hi everyone,

Background

While developing tests for a mix task, that returns non UTF8 binaries into STDOUT (building block to be piped into a file / pipe), I found that ExUnit.CaptureIO can only handle UTF8 and Latin1.

Example Test that does not work:

Proposal

For testing this use case, it would be good if any raw binary would also be passed through. (Maybe via option "encoding: :raw_binary")

Additionally, it would be good if there was a proper error for invalid characters instead of the currently raised ArgumentError.

Real World Example

Here is a real test, that would be made possible by this change: https://github.com/elixir-gettext/expo/blob/9048fe242830614f6d4235cbd345de844693f28a/test/mix/tasks/expo.msgfmt_test.exs#L18

PR

I'm happy to provide a PR for this as well.

Thanks & Kind Regards,
Jonatan

Wojtek Mach

unread,
Apr 6, 2022, 1:57:35 PM4/6/22
to elixir-lang-core
I believe capture_io(encoding: :latin1, fun) should do the trick, can you check?

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/e35e97cf-00d6-422e-b3c1-ec508ff1e36fn%40googlegroups.com.

Jonatan Männchen

unread,
Apr 6, 2022, 2:17:51 PM4/6/22
to elixir-lang-core
That unfortunately gives me the same result.

Wojtek Mach

unread,
Apr 6, 2022, 3:29:12 PM4/6/22
to elixir-lang-core
> Additionally, it would be good if there was a proper error for invalid characters instead of the currently raised ArgumentError.

Yeah the error is pretty bad:

IO.puts(<<222>>)
** (ArgumentError) errors were found at the given arguments: unknown error: put_chars

It is slightly more informative when used inside capture io though:

    assert capture_io(fn ->
             IO.puts(<<222>>)
           end) == <<222>>
** (ArgumentError) errors were found at the given arguments: unknown error: {put_chars,unicode,[<<"Þ">>,10]}

We get error because stdio is with unicode encoding:

iex> :io.getopts(:standard_io)[:encoding]
:unicode

and we're writing <<222>> which isn't unicode.

For writing in raw mode, use IO.binwrite. This and the previously mentioned :encoding option will make the following test succeed:

    assert capture_io([encoding: :latin1], fn ->
             IO.binwrite(<<222>>)
           end) == <<222>>

Jonatan Männchen

unread,
Apr 7, 2022, 1:24:58 AM4/7/22
to elixir-lang-core
Thanks, that solved my specific issue. I still think that some improvements are needed.

Looking at the code for CaptureIO, I think those changes should be made directly in the StringIO / IO modules and not specifically for CaptureIO.

The following things are not great today IMHO:
  1. The encoding that lets everything through is called latin1. I think we should introduce & properly document a new encoding called something like raw_binary. It would work exactly the same though.
  2. IO.write with invalid characters into an io device with any encoding should have a better error message. (Something like "<<222>> is not a valid unicode string. Provide `encoding: :raw_binary` when opening the io device")
  3. IO.binwrite with invalid characters into an encoding: :unicode io device should probably at least warn if invalid characters are passed.
This would something like this in the form of a test: https://gist.github.com/maennchen/f428360a71d23a323538d9b7d51e638b

José Valim

unread,
Apr 7, 2022, 1:46:14 AM4/7/22
to elixir-lang-core
All three of them would require changes to Erlang and I would love for someone to chase this path. 2 sounds doable with EEP 54 but 1 would definitely require more discussion.

I am not sure about 3. binwrite pretty much means "I know what I am doing" and we should allow the user to write whatever they want as is.

Jonatan Männchen

unread,
Apr 7, 2022, 2:03:39 AM4/7/22
to elixir-l...@googlegroups.com
I guess leaving nr. 3 out would be fine.

What would he the proper path to do this in erlang? Just writing some erlang example code and opening issues?

Sent from my iPhone

On 7 Apr 2022, at 06:46, José Valim <jose....@dashbit.co> wrote:


You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-core" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-core/RR7nbeHsluQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/CAGnRm4KdzYkX%3DL%3D3%3DYHSJk4R8iq1%3Dbe9262Fg54eX1yLrx-c9g%40mail.gmail.com.

José Valim

unread,
Apr 7, 2022, 2:09:14 AM4/7/22
to elixir-lang-core
I would start with 2 by submitting a PR to Erlang/OTP and then access their appetite from there. :)

Jonatan Männchen

unread,
Apr 7, 2022, 2:12:21 AM4/7/22
to elixir-l...@googlegroups.com
I’ve never done a PR to OTP so far. I’ll check it out and ping back here for updates.

Sent from my iPhone

On 7 Apr 2022, at 07:09, José Valim <jose....@dashbit.co> wrote:



José Valim

unread,
Apr 7, 2022, 2:14:39 AM4/7/22
to elixir-lang-core

Jonatan Männchen

unread,
Apr 12, 2022, 3:38:05 AM4/12/22
to elixir-lang-core
Hi all,

I dug through the code and now know how the parts fits together.

Since it is a combination of Elixir & OTP, I would like to ask for your input before continuing:

When using the following code, the key code paths are:

{:ok, io} = StringIO.open("", encoding: :unicode) # Starts a StringIO GenServer
IO.write(io, <<222>>)

=> :io.request(pid, {:put_chars, :unicode,<<222>>}, _ref)

=> {{:error, req}, state}

=> badarg

=> raises argument error without any additional info

Based on this, I think multiple things are needed:

  1. `io:conv_reason/1` needs to accept a new tuple for the error
    1. I would return the error directly from `unicode:characters_to_binary/3`: {:error, {:encoding, result}}
    2. where result: {error, binary(), RestData} | {incomplete, binary(), binary()}
  2. `io:o_request/2` needs to handle that new error and raise a better error
    1. new error kind?
    2. existing badarg with more detail data?
  3. All those sources need to produce the new error (does not produce OTP incompatibility in Elixir sind everything unknown is converted to badarg):
    1. `StringIO.put_chars/4`
    2. `file_io_server:put_chars/3`
    3. `group:io_request/5`
    4. `standard_error:wrap_characters_to_binary/3`
    5. `user_drv:io_command/1`
    6. `user:wrap_characters_to_binary/2`
    7. `ssh_cli:io_request/4`
Does that sound like I'm going into the right direction?

If yes: Should that be one PR in OTP or several?

Thanks for any input & Best,
Jonatan

José Valim

unread,
Apr 12, 2022, 4:11:45 AM4/12/22
to elixir-lang-core
I would personally send a PR with steps one and two, and also changing one of the callsites for 3 so you have something to test. Then another PR to migrate the remaining call sites!

Jonatan Männchen

unread,
Apr 12, 2022, 5:14:47 AM4/12/22
to elixir-lang-core
Great :)

Last question for now: Would you add the details of the error to error_info of the badarg error or should this be an entirely different error?

José Valim

unread,
Apr 12, 2022, 6:39:36 AM4/12/22
to elixir-lang-core
I would do it based on the error_info. In fact if you could solve this all together with the error_info, then it would be even better.

Jonatan Männchen

unread,
Apr 13, 2022, 12:16:04 AM4/13/22
to elixir-lang-core
While working on it, I found that there's an already existing error called no_translation.

I believe that this is the intended error message since it says "device failed to transcode string from ~p to ~p". This is exactly our use case I think.

Based on this, I created two PRs:

  1. OTP: https://github.com/erlang/otp/pull/5885
  2. Elixir: https://github.com/elixir-lang/elixir/pull/11756
The Elixir PR does not depend on the OTP PR since the error was already in place and therefore also works in current OTP versions. It however has a small issue in the tests (ExUnit / iex somehow behave differently). I would therefore appreciate a hint on what is going wrong.

Since my proposal is represented completely in the two PRs, I will no longer update this thread. For anyone following along, please have a look at the PRs instead.

Reply all
Reply to author
Forward
0 new messages