[BUG] sman causes endless loop if libraries required by viewman do not exist


Qian Yun

Jun 11, 2022, 3:53:36 AM
to fricas-devel
The description of this bug:

1. First, I found this bug on macOS, when I ran the binary from
GitHub CI, which is built with X11 support. But the system
did not include any X11 libraries, because they were not yet
installed from homebrew. Starting "fricas" with no options causes
the screen to be flooded with error messages saying that viewman
has problems loading shared libraries.

2. This is easily reproduced on Linux as well. Simply rename
an X11 .so file (for example /usr/lib64/libXpm.so.4) to something else
and you can reproduce this bug. (Remember to rename it back!)

Furthermore, running a pre-built FriCAS distribution with X11
support on a server or in a container that does not have the X11
libraries will hit this bug.

====

The simplest workaround is to use "fricas -nogr".


The relevant code is in src/sman/sman.c:

static void
start_the_graphics(void)
{
    spawn_of_hell(GraphicsProgram, DoItAgain);
}

The "DoItAgain" flag causes sman to restart it endlessly.

I am not sure why it is designed this way. If we start with
"fricas -nox", we can start HyperDoc with ")hd", but we cannot use
"draw" anymore. This should be related.
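
To illustrate the failure mode, here is a minimal sketch (not the actual
sman code; "run_once" is a hypothetical helper). A supervisor that forks,
execs and waits only sees an exit status. When the program cannot be
started at all, the child exits almost immediately, so an unconditional
restart repeats that forever. The value 127 follows the shell convention
for "could not start"; that the dynamic loader reports a missing shared
library in the same way is an assumption here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of sman: start argv[0] once, wait for
 * it, and return its exit code. Returns -1 if the child was killed by
 * a signal or if fork/wait failed. */
int run_once(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);     /* exec failed: the program could not be started */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;  /* real wait error */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A loop of the form "for (;;) run_once(argv);" never terminates if every
call fails at once, which matches the flood of error messages described
above.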

- Qian

Waldek Hebisch

Jun 13, 2022, 9:16:01 PM
to fricas...@googlegroups.com
A missing library is a system/installation error. In such a case you
cannot expect FriCAS to work properly. Of course, an infinite loop
of error messages is not nice, but IMO this is a relatively low
priority issue; it is more important to ensure that the required
libraries are present.

Concerning the respawning loop, AFAICS if things work OK it
should not be needed. It seems that the original authors believed
that respawning masks some problems.

I hope that some day we will have saner code on the C side.
However, small tweaks are unlikely to converge to a better
structure unless there is enough advance planning. As I wrote,
I would like to change the communication protocol between the
various parts of FriCAS. A precondition for the protocol change
is getting the right structure on the Boot/Lisp side, in particular
isolating all I/O. We are close to this, but still not
there...

--
Waldek Hebisch

Qian Yun

Jun 14, 2022, 12:42:59 AM
to fricas...@googlegroups.com
I just took a deeper look.

If you start with "fricas -nox", you can start HyperDoc later with
")hd", which is just a shorthand for ")system hypertex &".

And it seems that
")system /usr/lib64/fricas/target/x86_64-pc-linux-gnu/lib/viewman &"
works as well.

So if I dig deeper and find that there is no special reason
for sman to spawn "viewman" endlessly, we can treat it the same
as "hypertex": sman will spawn it just once when launching fricas,
and the user can spawn it manually when needed.

- Qian

Waldek Hebisch

Jun 14, 2022, 12:28:39 PM
to fricas...@googlegroups.com
On Tue, Jun 14, 2022 at 12:42:20PM +0800, Qian Yun wrote:
> Just take a deeper look.
>
> If you start with "fricas -nox", you can start hyperDoc later with
> ")hd", which is just a shorthand for ")system hypertex &".
>
> And it seems that
> ")system /usr/lib64/fricas/target/x86_64-pc-linux-gnu/lib/viewman &"
> works as well.
>
> So if I dig deeper and find that there is no special reason
> for sman to spawn "viewman" endlessly, we can treat it the same
> as "hypertex": sman will just spawn it once when launch fricas,
> user can spawn it manually when needed.

If you take seriously the possibility that viewman may die, then
the natural thing is to restart it automatically. And this is
what 'sman' is doing now.

--
Waldek Hebisch

Qian Yun

Jun 14, 2022, 9:06:29 PM
to fricas...@googlegroups.com

On 6/15/22 00:28, Waldek Hebisch wrote:
>
> If you take seriously possibility that viewman may die, then
> natural thing is to automatically restart it. And this is
> what 'sman' is doing now.
>

First, the source code of viewman is pretty short and simple,
so it is unlikely to die.

Second, in the unlikely case that viewman dies for some
reason and sman restarts it, it is very likely that
viewman will die again for the same reason. And now it is
in an infinite loop, which is the problem I encountered in the
first place.

Even if the natural thing is to restart it, there should be
a limit on it.

- Qian

Ralf Hemmecke

Jun 15, 2022, 12:23:24 AM
to fricas...@googlegroups.com
> Even if the natural thing is to restart it, there should be
> a limit on it.

I agree. I guess 100 is big enough. If viewman does not come up
again after 100 restarts, there is something seriously wrong. In order
to save the rest of the session, let viewman die and perhaps enable a
switch to start it manually. (I do not even think that the latter is an
absolute must.)
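
A bounded restart could look roughly like the following sketch. It is
not the sman code and not the patch discussed in this thread;
"spawn_with_limit" and its parameters are hypothetical names, and the
limit (100 or anything else) is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical supervisor loop, not the sman code: run argv[0] and
 * restart it when it dies, but give up after max_restarts restarts.
 * Returns how many times the program was started, or -1 on a
 * fork/wait error. */
int spawn_with_limit(char *const argv[], int max_restarts)
{
    int starts = 0;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);             /* could not start the program */
        }
        starts++;

        int status;
        while (waitpid(pid, &status, 0) < 0) {
            if (errno != EINTR)
                return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return starts;          /* clean exit: nothing to restart */
        if (starts > max_restarts)
            return starts;          /* limit reached: stop respawning */
    }
}
```

With a limit of 3, a program that always fails is started four times
(once plus three restarts) and then left alone, so the rest of the
session stays usable.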

Ralf

Waldek Hebisch

Jun 17, 2022, 9:23:53 PM
to fricas...@googlegroups.com
On Wed, Jun 15, 2022 at 09:05:48AM +0800, Qian Yun wrote:
>
> On 6/15/22 00:28, Waldek Hebisch wrote:
> >
> >If you take seriously possibility that viewman may die, then
> >natural thing is to automatically restart it. And this is
> >what 'sman' is doing now.
> >
>
> First, the source code of viewman is pretty short and simple,
> so it is unlikely to die.
>
> Second, if in the unlikely cases that viewman dies for some
> reason, and sman restarts it, it is very likely that
> viewman will die again for the same reason. And now it is
> in infinite loop. Which is the problem I encountered in the
> first place.

_Assuming_ our programs are correct, a reasonable reason for dying
is some _intermittent_ system problem, like getting wrong bits
from RAM/HDD or the OOM killer making a wrong choice. Even in the
case of bugs, intermittent bugs are quite likely: viewman deals
with sockets and related timing issues, so a lot of things
are nondeterministic. Deterministic bugs can be found and fixed
much more easily than intermittent ones, so after some time spent
on debugging the remaining bugs are likely to be intermittent...

More to the point: you looked at the respawning issue because
there was a missing library. A missing library means a broken
installation. So the real fix is to make sure that the library is
present. IIUC when building from source the link stage will
fail in case of missing libraries. So only binary installs
should matter. If the install is done by some tool (script), the
tool is supposed to ensure that the libraries are present.
If the user is using a simple binary tarball like the one I provide,
this tarball has stated dependencies and the user is supposed
to install them. Failing to install them may lead to a
non-working FriCAS. Since the user will get an error message,
I do not see this as a significant issue: the user made a mistake,
the user got an error message, the user will correct the problem.

Coming back to respawning: I do not know if it is really
necessary. It may be just defensive programming
(sensible because modern OSes work essentially in a
probabilistic way: with small but nonzero probability
you may get essentially random failures). It may be an
attempt at masking errors: incorrect programming may
increase the chance of error enough to be a trouble, and
respawning may mask it.

I did eliminate some things of a similar spirit, but my
procedure was to make the modification in my local copy of
FriCAS, use it for some time (say a year), and commit
only if I saw no bad effects of the change. And in a
few cases I noticed that seemingly useless code was
in fact doing a useful thing, so I reverted the change.

--
Waldek Hebisch

Qian Yun

Jun 17, 2022, 9:50:26 PM
to fricas...@googlegroups.com


On 6/18/22 09:23, Waldek Hebisch wrote:
>
> _Assuming_ our programs are correct reasonable reason for dying
> is some _intermittent_ system problem. Like getting wrong bits
> from RAM/HDD or OOM killer making wrong choice. Even in case
> of bugs intermittent bugs are quite likely, viewman deals
> with sockets and related timing issues so lot of things
> are nondeterministic. Deterministic bugs can be found and fixed
> much easier than intermittent ones, so after some time spent on
> debugging remaining bugs are likely to be intermittent...

That is the programmer's point of view.

My point of view is the user's, about user friendliness:

1. A user downloads the fricas binary to a home Linux server,
a cloud Linux server, or a Linux supercomputer to try it out,
runs "fricas", and gets an endless stream of error messages.

2. A user downloads the fricas binary to a macOS computer,
double-clicks the icon, and gets an endless stream of error messages.

3. A user downloads the fricas binary to Windows
(I'm close to having sbcl with sman/hyperdoc working on Windows),
double-clicks the icon, and gets an endless stream of error messages.


> More to the point: you looked at respawning issue because
> there were missing library. Missing library means broken
> installation. So real fix is to make sure that library is
> present. IIUC when building from source link stage will
> fail in case of missing libraries. So only binary install
> should matter. If install is done by some tool (script) the
> tool is supposed to ensure that libraries are present.
> If user is using simple binary tarball like I provide,
> this tarball have stated dependencies and user is supposed
> to install them. Failing to install them may lead to
> non-working FriCAS. Since user will get error message
> I do not see this as significant issue: user made mistake,
> user got error message, user will correct the problem.

Again, this is a user friendliness issue. Windows and macOS users
cannot install dependencies as easily as Linux users.

I just want these users to be able to use the command line interface
of fricas when the X11 libs are not present, falling back gracefully.
If they go the extra mile and install the X11 libs, then they get
HyperDoc and drawing ability.
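
One way to get such a fallback, sketched under assumptions (this is not
the sman code; "start_optional" and the grace period are hypothetical):
start the optional program once, watch it briefly, and carry on without
it if it dies at once.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>     /* kill(), for a caller that stops the helper later */
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not the sman code: start an optional program and
 * watch it for about grace_ms milliseconds. Returns the child's pid if
 * it is still running after that, 0 if it already died (the caller
 * should continue without it), or -1 if fork failed. */
pid_t start_optional(char *const argv[], int grace_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* could not start the program */
    }

    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    for (int waited = 0; waited < grace_ms; waited += 10) {
        int status;
        if (waitpid(pid, &status, WNOHANG) == pid)
            return 0;               /* died early, e.g. missing libraries */
        nanosleep(&tick, NULL);
    }
    return pid;                     /* still alive: keep using it */
}
```

A caller would print one warning when the result is 0 and keep the
command line interface running. The grace period is a heuristic: a
program that dies just after it ends is not detected by this sketch.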

> I did eliminate some things of similar spirit, but my
> procedure was to make modification in my local copy of
> FriCAS, use it for some time (say a year), and commit
> only if I so no bad effects of the change. And in
> few cases I had noticed that seemingly useless code
> in fact was doing useful thing, to I reverted the change.
>

I will apply it locally and submit it again next year then.

- Qian

Qian Yun

Jun 24, 2022, 5:59:00 AM
to fricas...@googlegroups.com


On 6/18/22 09:23, Waldek Hebisch wrote:
> Coming back to respawning: I do not know if it is really
> necessary. It may be just defensive programming
> (sensible because modern OS-es work essentially in
> probabilistic way, with small but nonzero probability
> you may get essentially random failures). It may be
> attempt at masking errors: incorrect programming may
> increase chance of error enough to be a trouble,
> respawning may mask it.

I stumbled across this today:


(3) -> )what synonyms
------------------------- System Command Synonyms -------------------------
<snip>
)startGraphics ................. )system $FRICAS/lib/viewman &
)stopGraphics .................. )lisp (|sockSendSignal| 2 15)
<snip>



So there was support for easily starting and stopping viewman inside
FriCAS. This is evidence that viewman does not need automatic
respawning. The current behavior is that when ")stopGraphics" stops a
viewman process, sman starts another one. This should be considered
a bug.
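
A supervisor can tell a deliberate stop from a crash by looking at the
wait status. The sketch below is not the sman code; "spawn_child" and
"should_respawn" are hypothetical names. It assumes that the 15 in the
")stopGraphics" synonym is the signal number, which is SIGTERM on the
usual platforms, and that the signal is delivered to viewman as such.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec argv[0]; returns the child's pid,
 * or -1 if fork failed. */
pid_t spawn_child(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);     /* could not start the program */
    }
    return pid;
}

/* Hypothetical policy, not the sman code: given a wait status, decide
 * whether the child should be started again. */
int should_respawn(int status)
{
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;       /* finished normally */
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
        return 0;       /* asked to stop, as ")stopGraphics" appears to do */
    return 1;           /* crash or error exit */
}
```

Under this policy a viewman stopped on request stays stopped, while a
crash (for example SIGSEGV) would still trigger a restart.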

- Qian

Waldek Hebisch

Jun 26, 2022, 12:06:21 PM
to fricas...@googlegroups.com
There was a lot of "strange" code in the codebase. I tried to
remove things that were obviously wrong, but there were confusing
cases and each required a long investigation to decide if it was
useful or not. And some still remain.

There was some evidence in the code that the original developers
had to deal with intermittent errors and timing dependencies.

Also, note that killing viewman closes all graphic windows.
So the intent could be to have an easy way to close graphic windows.

--
Waldek Hebisch

Qian Yun

Jun 26, 2022, 12:21:15 PM
to fricas...@googlegroups.com


On 6/27/22 00:06, Waldek Hebisch wrote:
>
> There was a lot of "strange" code in the codebase. I tried to
> remove things that were obviously wrong, but there were confusing
> cases and each required long investigation to decide if it was
> useful or not. And some still remain.
>
> There was some evidence in the code that original developers
> had to deal with intermittent errors and timing dependencies.
>
> Also, note that killing viewman closes all graphic windows.
> So the intent could be to have easy way to close graphic windows.
>

OK, following your logic, viewman may fail. Then the auto restart
will mask this bug.

With my patch, we may identify this kind of failure and fix it.
(Or we may never meet such "intermittent errors and timing
dependencies".)

[This is my last argument for this patch. If it is not accepted this
time, I'll test it on my computer and submit it again before the next
release.]

- Qian

Waldek Hebisch

Jun 26, 2022, 7:01:12 PM
to fricas...@googlegroups.com
On Mon, Jun 27, 2022 at 12:20:07AM +0800, Qian Yun wrote:
>
>
> On 6/27/22 00:06, Waldek Hebisch wrote:
> >
> >There was a lot of "strange" code in the codebase. I tried to
> >remove things that were obviously wrong, but there were confusing
> >cases and each required long investigation to decide if it was
> >useful or not. And some still remain.
> >
> >There was some evidence in the code that original developers
> >had to deal with intermittent errors and timing dependencies.
> >
> >Also, note that killing viewman closes all graphic windows.
> >So the intent could be to have easy way to close graphic windows.
> >
>
> OK, following your logic, viewman may fail. Then the auto restart
> will mask this bug.

Just to point out an important difference: libraries that are not
installed are a deterministic event and are easy to correct. viewman
may fail due to system reasons which are non-deterministic and
not under our control.

> With my patch, we may identify this kind of failures and fix them.
> (Or we may never meet such "intermittent errors and timing
> dependencies".)
>
> [This is my last argument for this patch. If it is not accept this
> time, I'll test it on my computer, and will submit again before next
> release.]

Well, I have installed your patch in the copy of FriCAS that I use.
I will see if it has any visible effect.

--
Waldek Hebisch