[erlang-questions] "New" vs. "old" console behavior: bug or feature?

Scott Lystig Fritchie

Apr 23, 2013, 3:36:54 PM
to erlang-q...@erlang.org
Hi, all. I can't figure out if this message should be sent to the
erlang-bugs list or the erlang-questions list ... so I'll go for the
more general audience.

Summary: Starting Erlang with a tty/pseudo-tty can get you a different
console shell ("new" and "old", respectively) without you realizing
it.(*) If you don't know that you're using the old shell, and if a
process tries to send output to the 'user' registered process(**),
e.g. io:format(user, "Some message with ~p extra\n", [Extra]), then it
is possible that the io:format() call will not return for
seconds/minutes/hours/ever.

My question: Is the kind of indefinite blocking on I/O described below a
bug or a feature?

I have a test case that can reproduce this behavior. An automated
version (using Expect) can be found at:

https://gist.github.com/slfritchie/ad8e5cf1603cbe326be7

The basics of reproducing the hang are:

SSH session #1                     SSH session #2
--------------                     --------------
Start an Erlang daemon
using "run_erl".

Attach to the daemon's console
using "to_erl".

                                   Start another Erlang VM
                                   and connect to the first
                                   VM via "-remsh".

At the console, type the
following and press ENTER:
    {term1,

                                   Run this command:
                                   io:format(user, "Hey!\n", []).

The io:format/3 call in session #2 will behave differently if session
#1's "run_erl" command runs with a tty/pseudo-tty or without.

A. With a tty/pty: The io:format() call returns immediately.
B. Without a tty/pty: The io:format() call will hang indefinitely.
It will remain blocked until the Erlang term parser in session #1
has returned. For example, finishing the term with "term2}." and
then pressing ENTER.
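
A caller that cannot afford to block in case B can push the write into a
throwaway process and give up after a timeout. What follows is only a
sketch under assumptions: the module name safe_user_io, the function
format/3 and the timeout argument are invented for illustration and come
from neither OTP nor lager.

```erlang
-module(safe_user_io).
-export([format/3]).

%% Hypothetical helper: write to 'user' from a short-lived process so the
%% caller waits at most TimeoutMs milliseconds.
format(Format, Args, TimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() ->
        ok = io:format(user, Format, Args),
        Parent ! {Ref, ok}
    end),
    receive
        {Ref, ok} ->
            erlang:demonitor(MRef, [flush]),
            ok;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The helper died before reporting success.
            {error, Reason}
    after TimeoutMs ->
        exit(Pid, kill),
        receive {'DOWN', MRef, process, Pid, _} -> ok end,
        %% The helper may have finished just as the timer fired.
        receive {Ref, ok} -> ok after 0 -> {error, timeout} end
    end.
```

Calling safe_user_io:format("Hey!~n", [], 1000) would return
{error, timeout} instead of hanging. Killing the helper does not withdraw
the request already queued at 'user', so the text can still appear on the
console once the pending read completes.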

The same effect can be seen by forcing the use of the old shell, without
using SSH, by simply running "erl -oldshell" for session #1 (in an Xterm
or other terminal window, or at the machine's hardware console) instead
of using SSH + "run_erl" + "to_erl".

Riak was the application that triggered this bug hunt (in conjunction
with the Lager app)(***). Finding it has taken much longer than anyone
guessed. The reason is that the necessary precondition, starting Erlang
via 'run_erl' via SSH without an associated tty/pseudo-tty, is not
common. (Riak's packaging uses "sudo", which refuses to run if there
isn't a tty/pty available.)

All attempts to duplicate the behavior failed because we didn't
understand that the root cause of the bad behavior was the old console
being silently chosen at VM startup when no tty/pty is available.

-Scott

(*) See
https://github.com/erlang/otp/blob/maint/lib/kernel/src/user_drv.erl#L103
for how the choice is made.

(**) From the 'io' man page:

There is always a process registered under the name of user. This
can be used for sending output to the user.

... where "output to the user" really means "output to the Erlang
virtual machine console."

(***) For source code of Riak and Lager, respectively, see:
https://github.com/basho/riak
https://github.com/basho/lager

Fred Hebert

Apr 24, 2013, 9:46:02 AM
to Scott Lystig Fritchie, erlang-q...@erlang.org
Hi Scott,

The IO world of Erlang is a fun crazy thing :)

I've spent time trying to document how the shell works back at
http://ferd.ca/repl-a-bit-more-and-less-than-that.html. I'll do a quick
roundup of things just to be clear on everything.

Before going into the difference between 'new' and 'old' shells, there
is a 'user' process, which you mentioned, part of the IO system. The
'user' process acts as a default top-level group leader for all the
output coming from a process. All group leaders are inherited from the
process' parent. They can also be modified, so that you may have
different group leaders across a VM: they are local processes,
middle-men (like application_controller), or remote processes (this is
how RPC calls get printed to everyone any time).

By default, every OTP app will put its controller as a group leader for
all sub-processes. This group leader will redirect output, but overload
the feature to kill rogue processes on shutdown (it makes a list of all
processes, inspects their group leader, and if it's the current app's
pid, kills said process). Other tools like eunit and Common Test will
have the possibility of injecting themselves above test cases and pick
what to print or not. By sending IO directly to 'user', we bypass that
hierarchy and go straight to the node's main IO process. Other special
cases can be used, such as 'standard_error', which will redirect output
to the error channel.
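
To make the distinction concrete, here is a small sketch that prints
through the caller's group leader, then straight to 'user', then to
'standard_error'. The module and function names (gl_demo, show/0) are made
up for this example.

```erlang
-module(gl_demo).
-export([show/0]).

%% Hypothetical demo: compare the three output paths described above.
show() ->
    GL = group_leader(),      % inherited from the parent process
    User = whereis(user),     % the node's main IO process
    %% Goes through the group leader hierarchy (may be captured by eunit,
    %% an application controller, a remote shell, ...).
    io:format("group leader: ~p, user: ~p~n", [GL, User]),
    %% Bypasses the hierarchy and goes to the node's console.
    io:format(user, "straight to the console~n", []),
    %% Goes to the error channel.
    io:format(standard_error, "to the error channel~n", []),
    {GL, User}.
```

Run from a -remsh session, the first line should show up in the remote
shell while the second lands on the console of the node being inspected.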

That being said, there are two default implementations of a process that
registers itself as 'user' on a node: the new (current) shell, and the
'old' shell. The choice of which one to pick is determined at boot time
by the user_sup.erl module (part of kernel) through system flags:

- If the node is a slave node, the 'user' module will point to a remote
process.
- If the node is started with no special flag, the new shell is started
through 'user_drv'. This 'user' proc will act as a middle-man between
input and output with a tty program and the different Erlang groups
(see group.erl in kernel) to allow multiple jobs and concurrent shells
without messed up output. Evaluation is handled by shell.erl (stdlib)
- If the node is started with the -oldshell flag, the process in charge
is 'user.erl', which uses special IO devices ({fd,0,1} for IO) to deal
with the input and output channels for the node directly. It will send
the evaluation to shell.erl also.
- If the node is started with -noshell, the 'user.erl' module is still
booted, but will not evaluate any input nor forward it.
- If the node is started in -noinput mode, the 'user.erl' module is
still booted, but it will not forward any input, only output from the
node. It's a superset of -noshell and a bit safer because it opens the
IO port in a way that only has the 'out' channel open.
- There is an undocumented -nouser flag. Such a flag makes sure that
neither user.erl nor user_drv.erl are started. The node will crash
unless you specifically decide to start a process that registers
itself as 'user' and decides to handle IO for your node. This is what
you should use were you planning to provide your own Erlang shell and
boot it as 'erl -nouser -s custom_shell'.
- If it's not possible to boot the tty used by 'user_drv', it should
fall-back to 'user.erl' as an IO leader.

Alright. That covers most of it for the basics.

To figure out why it blocks, we need to figure out the evaluation. The
evaluation itself happens in a shell.erl process, which does an io
request to the 'user' process (technically, its own group_leader, so
that anyone may use the evaluator where they want. It just happens to be
the 'user' process in this case).

Input --> user.erl <---> shell.erl

The shell does an io-request to user, which asks to read characters.
The user.erl process forwards that data to the shell. The shell
attempts to evaluate it, and if there's not enough data, it asks for
more. user.erl then blocks until it can get more data to respond to the
io request.

When output is sent to 'user' it's sent as an additional io request, as
a message. This message will not be read until the shell can answer the
previous request. This is where you block.

Input --> user.erl <---> shell.erl
             ^----> other proc

The new shell does things differently by using a 'group.erl' process for
each IO group. Now each group.erl process has the same potential to
block, with the exception that user_drv.erl will start one very specific
'group.erl' process to be 'user', and will not return it as a potential
shell.erl input source (it would be 0 in '^G -> j', and it is not
possible to select it). user_drv will also consider it to be a special
group that can *always* output to tty, whereas other groups will only
have their output dumped by default if they're not the currently active
one (hence you do not get other shells' output by default when you
switch tasks). This means that while you could block things by finding
the specific 'group.erl' you're currently sending IO requests to by
default, it's unlikely to happen by accident, and 'user' is now a safe
process to send IO requests to.

I hope this explains things. I would find it difficult to call it a bug
given a solution exists to the problem already, but I do see why the
fallback to the old shell when no tty is available could be problematic.
I'm guessing it would be possible to make a 'raw shell', which does
tasks similar to user_drv, but using a user.erl-like adapter instead of
a tty program to communicate with and starting it with 'erl -nouser -s
rawshell' or something, or eventually making it the default user_drv
falls back to instead of 'user:start()'. I'm guessing this would be a
very low priority for the OTP team, though.

I hope this lengthy response answers your questions!

Regards,
Fred.

Robert Virding

Apr 24, 2013, 2:05:38 PM
to Fred Hebert, erlang-q...@erlang.org
Strange because both user.erl and group.erl "should" be able to handle output requests in the middle of getting input. But it is a little difficult to see in group as there is all this tricky search code. :-)

Robert

Fred Hebert

Apr 24, 2013, 3:01:38 PM
to Robert Virding, erlang-q...@erlang.org
On 04/24, Robert Virding wrote:
> Strange because both user.erl and group.erl "should" be able to handle output requests in the middle of getting input. But it is a little difficult to see in group as there is all this tricky search code. :-)
>
> Robert
>

The big difference I see is really about an io_request that requires
more input coming directly to user. In most cases, things will be fine.

When we have shell.erl requesting data to user.erl, we enter an
io_request to fetch for content through io:scan_erl_exprs/3. This one
has a get_until io request, which makes the user.erl process enter
get_chars/7. This one, until it gets the 'eof' value it set itself
internally, will keep looping in get_chars/8. In this spot, I believe it
should still be able to serve IO reqs, except when it starts getting
data from the port, at which point, it enters get_chars_bytes/8, which
may end up calling get_chars_apply/7.

Once you enter this one, there's a possibility of having incomplete
data, prompting for a new state, which calls get_chars_more/7, which
selectively receives and ignores io requests and can go recursively.

I'm guessing (at a glance, haven't checked if it actually enters these
paths with the actual code) that this is where the blocking happens. In
fact, a quick try with process_info/1 shows me that it's where the
'user' proc gets stuck:

(h...@ferdair.local)10> process_info(whereis(user)).
[{registered_name,user},
{current_function,{user,get_chars_more,7}},
{initial_call,{erlang,apply,2}},
{status,waiting},
{message_queue_len,1},
{messages,[{io_request,<0.63.0>,<0.29.0>,
{put_chars,unicode,io_lib,format,["hey",[]]}}]},
{links,[<0.6.0>,<0.28.0>,<0.33.0>,#Port<0.425>]},
{dictionary,[{encoding,latin1},
{read_mode,list},
{shell,<0.33.0>}]},
...
{suspending,[]}]
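
The same check can be scripted. The sketch below assumes the symptom is
exactly what the output above shows: 'user' sitting in
user:get_chars_more with requests queued. The module name user_probe and
the return values are invented for illustration.

```erlang
-module(user_probe).
-export([status/0]).

%% Hypothetical probe: is 'user' waiting for more input while other
%% requests pile up in its mailbox?
status() ->
    case whereis(user) of
        undefined ->
            no_user;
        Pid ->
            case process_info(Pid, [current_function, message_queue_len]) of
                undefined ->
                    %% The process died between whereis/1 and process_info/2.
                    no_user;
                [{current_function, {user, get_chars_more, _}},
                 {message_queue_len, N}] when N > 0 ->
                    {blocked, N};
                [{current_function, MFA}, {message_queue_len, N}] ->
                    {ok, MFA, N}
            end
    end.
```

A {blocked, N} result means N messages are waiting behind an unfinished
read on the console.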

Looking at this, I'm guessing it would be possible to have a workaround
where get_chars_more/7 could listen to io_requests, but only if they are
for output -- otherwise you could get into nasty cases where you start
accepting other io requests that pull for data and things would get even
more confusing really fast and turn the handling of io events inside
out.
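
As a toy illustration of that idea (not a patch to user.erl), the loop
below waits for more input but keeps answering put_chars requests, and
leaves every other io_request untouched in the mailbox. The module name
toy_io_wait, the {input, Data} message and the Out callback are all
assumptions made for this sketch; only the io_request/io_reply tuples
follow the documented I/O protocol.

```erlang
-module(toy_io_wait).
-export([wait_for_input/1]).

%% Hypothetical loop: Out is a fun that writes a binary somewhere.
%% Output requests are served at once; input requests (get_until etc.)
%% match no clause and therefore stay queued until the read completes.
wait_for_input(Out) ->
    receive
        {input, Data} ->
            {ok, Data};
        {io_request, From, ReplyAs, {put_chars, Enc, Chars}} ->
            Out(unicode:characters_to_binary(Chars, Enc)),
            From ! {io_reply, ReplyAs, ok},
            wait_for_input(Out);
        {io_request, From, ReplyAs, {put_chars, Enc, M, F, A}} ->
            %% The protocol lets the caller defer formatting to the server.
            Out(unicode:characters_to_binary(apply(M, F, A), Enc)),
            From ! {io_reply, ReplyAs, ok},
            wait_for_input(Out)
    end.
```

With a process running this loop as the IO device, io:format(Pid, "hey~n", [])
returns ok even though the "read" is still pending. Error handling for
malformed character data is left out to keep the sketch short.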

I'm thinking I recall seeing group.erl having a similar function that
could get locked up, but I haven't checked or tested to see if it was
indeed possible to lock it up in a similar way -- the fact that 'user'
is a distinct standalone group process makes the issue unlikely to
happen in the first place.

Regards,
Fred.

Scott Lystig Fritchie

Apr 24, 2013, 4:15:55 PM
to Robert Virding, erlang-q...@erlang.org
Robert Virding <robert....@erlang-solutions.com> wrote:

rv> Strange because both user.erl and group.erl "should" be able to
rv> handle output requests in the middle of getting input. But it is a
rv> little difficult to see in group as there is all this tricky search
rv> code. :-)

"should" /= "reality", alas. See the backtrace below when the problem
hits.

-Scott

--- snip --- snip --- snip --- snip --- snip --- snip ---

=proc:<0.30.0>
State: Waiting
Name: user
Spawned as: erlang:apply/2
Current call: user:get_chars_more/7
Spawned by: <0.29.0>
Started: Tue Apr 9 16:16:46 2013
Message queue length: 1
Message queue: [{io_request,<0.186.0>,<0.179.0>,{put_chars,unicode,<<821 bytes>>}}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{shell,<0.31.0>},{read_mode,list},{unicode,false}]
Reductions: 357542
Stack+heap: 610
OldHeap: 0
Heap unused: 152
OldHeap unused: 0
Stack dump:
Program counter: 0x00007f0c51d92930 (user:get_chars_more/7 + 232)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f0c51931710 Return addr 0x00007f0c51d8c498 (user:do_io_request/5 + 88)
y(0) unicode
y(1) {[],[]}
y(2) #Port<0.630>
y(3) {erl_scan,tokens,[1]}
y(4) get_until
y(5) io_lib
y(6) {erl_scan_continuation,[],no_col,[],2,{erl_scan,#Fun<erl_scan.3.84904554>,false,false,false},0,#Fun<erl_scan.25.84904554>}

0x00007f0c51931750 Return addr 0x00007f0c51d8c308 (user:server_loop/2 + 1408)
y(0) #Port<0.630>
y(1) <0.30.0>
y(2) <0.25333.1728>

0x00007f0c51931770 Return addr 0x00007f0c51d8b968 (user:catch_loop/3 + 112)
y(0) #Port<0.630>

0x00007f0c51931780 Return addr 0x0000000000836c78 (<terminate process normally>)
y(0) <0.31.0>
y(1) #Port<0.630>
y(2) Catch 0x00007f0c51d8b968 (user:catch_loop/3 + 112)

Scott Lystig Fritchie

Apr 24, 2013, 4:39:36 PM
to Fred Hebert, erlang-q...@erlang.org
Fred Hebert <mono...@ferd.ca> wrote:

fh> Input --> user.erl <---> shell.erl

fh> [...] The shell attempts to evaluate it, and if there's not enough
fh> data, it asks for more. user.erl then blocks until it can get more
fh> data to respond to the io request.

fh> When output is sent to 'user' it's sent as an additional io request,
fh> as a message. This message will not be read until the shell can
fh> answer the previous request. This is where you block.

Yup.

The combination of using "run_erl" + no tty/pty + lager prior to this
work(*) meant that any Erlang process that attempted to log a message
would be blocked arbitrarily until user.erl would return from
user:get_chars_more() et al.

If someone had used the Riak console recently, attaching to the VM's
console via "riak attach" (which in turn uses "to_erl"), and if the last
thing that they typed was not the end of an expression and then detached
from the console(**), every process that attempts to log a message
afterward will hang. Forever. Or until someone re-attaches to the
console and types "." and ENTER so that user.erl can finish its
simple-minded parsing.

The completely silent nature of the switch to the old shell was
baffling. Not to mention causing us & our customers serious grief. I
finally started finding the root cause when I realized that all cases
of the all-lager-events-blocked-arbitrarily mystery involved systems
where there was no 'user_drv' registered process.
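
That observation can be turned into a guard. This is a sketch only: it
assumes, as on the OTP releases discussed in this thread, that a missing
'user_drv' registration means the old shell is in charge. Newer releases
may behave differently. The module name console_guard and its functions
are invented for illustration.

```erlang
-module(console_guard).
-export([new_shell/0, log/2]).

%% Hypothetical check: true when a 'user_drv' process is registered.
new_shell() ->
    whereis(user_drv) =/= undefined.

%% Hypothetical logger: only write to 'user' when it looks safe to do so.
log(Format, Args) ->
    case new_shell() of
        true  -> io:format(user, Format, Args);
        false -> skipped
    end.
```

Dropping console output is a trade-off: nothing can hang, but messages
meant for the console are lost unless they are also sent elsewhere.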

-Scott

(*) https://github.com/basho/lager/pull/139

(**) Or if that person merely pressed ENTER before detaching, an
exceptionally easy thing to do.

Ignas Vyšniauskas

Apr 25, 2013, 4:02:41 AM
to Scott Lystig Fritchie, erlang-q...@erlang.org
Hi Scott, Fred,

On 04/23/2013 09:36 PM, Scott Lystig Fritchie wrote:
>
> The io:format/3 call in session #2 will behave differently if session
> #1's "run_erl" command runs with a tty/pseudo-tty or without.
>
> A. With a tty/pty: The io:format() call returns immediately.
> B. Without a tty/pty: The io:format() call will hang indefinitely.
>    It will remain blocked until the Erlang term parser in session #1
>    has returned. For example, finishing the term with "term2}." and
>    then pressing ENTER.

Thank you for pointing this out! I've been seeing this fairly often
(specifically in the combination with lager that you mentioned) and
never figured out what caused it.

On 04/24/2013 03:46 PM, Fred Hebert wrote:
> - If it's not possible to boot the tty used by 'user_drv', it should
> fall-back to 'user.erl' as an IO leader.

So how does Erlang determine whether TTY is available? Currently I've
worked around this by forcing a `SHELL=screen` variable in the boot
script and it seems to do the trick, but I don't really like this
approach. Any suggestions?

Regards,
Ignas

Fred Hebert

Apr 25, 2013, 7:51:35 AM
to Ignas Vyšniauskas, erlang-q...@erlang.org
On 04/25, Ignas Vyšniauskas wrote:
> Hi Scott, Fred,

>
> On 04/24/2013 03:46 PM, Fred Hebert wrote:
> > - If it's not possible to boot the tty used by 'user_drv', it should
> > fall-back to 'user.erl' as an IO leader.
>
> So how does Erlang determine whether TTY is available? Currently I've
> worked around this by forcing a `SHELL=screen` variable in the boot
> script and it seems to do the trick, but I don't really like this
> approach. Any suggestions?
>

There's a code snippet in user_drv.erl (lines 92-109) that does the
detection of whether it's possible to spawn things:

    server(Pname, Shell) ->
        process_flag(trap_exit, true),
        case catch open_port({spawn,Pname}, [eof]) of
            {'EXIT', _} ->
                %% Let's try a dumb user instead
                user:start();
            Port ->
                server1(Port, Port, Shell)
        end.

    server(Iname, Oname, Shell) ->
        process_flag(trap_exit, true),
        case catch open_port({spawn,Iname}, [eof]) of
            {'EXIT', _} -> %% It might be a dumb terminal lets start dumb user
                user:start();
            Iport ->
                Oport = open_port({spawn,Oname}, [eof]),
                server1(Iport, Oport, Shell)
        end.

The interesting clauses are the 'user:start()' fallbacks. Basically,
whenever the user_drv cannot spawn the desired port program (Erlang's
tty driver started as 'tty_sl -c -e'), it will fall back to the old
shell by default.

I'm thinking the two valid approaches are those I discussed with Robert
Virding and Andrew Thompson off the mailing list yesterday -- either
have lager not produce output to 'user' when it detects the old shell
(which is what Andrew and Scott ended up going with iirc), or fixing the
old shell the way I mentioned in another post. This would be the
solution where the function that currently blocks the shell is able to
forward output anyway -- something Robert told me he believes it did in
the past.

If the old behaviour was to forward the output and now it's gone (I
haven't checked, but I'm ready to trust Robert on that), then this would
be a regression bug that needs to be fixed by the OTP team (or any
contributor with the time to do it) within user.erl.

Regards,
Fred.

Scott Lystig Fritchie

Apr 25, 2013, 1:20:32 PM
to Ignas Vyšniauskas, erlang-q...@erlang.org
Ignas Vyšniauskas <bali...@gmail.com> wrote:

iv> Currently
iv> I've worked around this by forcing a `SHELL=screen` variable in the
iv> boot script and it seems to do the trick, but I don't really like
iv> this approach. Any suggestions?

Ignas, the Expect script that I put in my original/long message to this
list contains a magic workaround for the case where run_erl (on box A)
is started via SSH from a remote box (call it box B).

If box B uses "ssh -t" to force the allocation of a pseudo-tty on box A
before executing "run_erl", then the problem goes away because the VM
can use the new shell.

If box B uses the ssh defaults, which on (all/most??) platforms does not
force allocation of a pty, the problem exists.

Using 'screen' to force allocation of a pty is another hack. Using
"script" would be the same kind of hack. I agree, they're all a bit
disagreeable.

-Scott

Ignas Vyšniauskas

Apr 25, 2013, 4:48:09 PM
to Scott Lystig Fritchie, erlang-q...@erlang.org
Hi Scott,

On 04/25/2013 07:20 PM, Scott Lystig Fritchie wrote:
>> Ignas Vyšniauskas <bali...@gmail.com> wrote:
>>
>> Currently I've worked around this by forcing a `SHELL=screen`
>> variable in the boot script and it seems to do the trick, but I
>> don't really like this approach. Any suggestions?

[I obviously meant `TERM=screen`.]

> Ignas, the Expect script that I put in my original/long message to
> this list contains a magic workaround. In the case where run_erl
> (on box A) is started via SSH from a remote box (call it box B).

Well, in fact you don't need the whole "ssh and several boxes" setup to
reproduce the problem, I think this works too:

1. Generate a release of some project
2. Start the release pretending to have no tty capabilities:
   `TERM= ./rel/node/bin/node start`
3. Attach, input something without termination
4. Trigger some logging via `io:format(user, <..>)` (e.g. by lager)
5. Observe processes hanging, rejoice.

> If box B uses "ssh -t" to force the allocation of a pseudo-tty on
> box A before executing "run_erl", then the problem goes away because
> the VM can use the new shell.

Yes, but the problem with that is that it leaves it up to the user/admin
to ensure that the release is started in an environment which claims to
have decent term capabilities, which is something non-obvious and annoying.

After reading a bit, I realise setting TERM *is* essentially the only
way to advertise term capabilities, so I might settle for something like

if [[ -z $TERM || $TERM == "dumb" ]]; then export TERM=screen; fi

inside of the release start script, which is hackish, but I think it
prevents the problem from happening regardless of how the node is
started and also has the nice side-effect of proper tab completion and
etc everywhere.

--
Ignas

Max Lapshin

May 23, 2013, 4:25:55 PM
to erlang-pr...@googlegroups.com, erlang-q...@erlang.org
I'm deploying some of my Erlang servers with Capistrano, so I had to add:

default_run_options[:pty] = true

in deploy.rb to enable everything working as usual.

It really looks like black magic to me: why does Erlang's behaviour differ between ssh and ssh -t? =(
