Races in graphics


Waldek Hebisch

May 5, 2024, 10:16:50 AM
to fricas...@googlegroups.com
Apparently this did not get to the list, so sending once more.

Testing a recent patch by Qian I got a 'view3D' which showed an empty window
and used 100% of a core. A little investigation using 'gdb' shows
that this 'view3D' was trying to exit. This was triggered from a
signal handler: 'view3D' received a signal (SIGTERM I think) and
'goodbye' in 'stuff3d.c' was the handler. However, the signal arrived
during an X11 call, namely 'view3D' was calling 'XGetWindowAttributes'
from 'libX11'. IIUC this is incorrect: one is not allowed to
perform an X11 call when another call is in progress.

Why races: most of the time 'view3D' should be blocked in the 'select'
system call. A signal interrupts execution of the system call, and
in such a case calling X11 is safe. In other places 'view3D'
performs computations and calling X11 is also safe. So if we
are "lucky" SIGTERM works as desired. But if the signal arrives at
an inconvenient time, then we get a runaway process.

IIUC the approved way to solve this is to set a flag in the signal handler
and check the flag in the main program.
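
A minimal sketch of that flag approach in C, for illustration only. The
names 'exit_requested', 'note_sigterm', 'install_sigterm_flag' and
'should_exit' are hypothetical and do not come from 'stuff3d.c'; the point
is only that the handler records the signal and leaves all X11 work to the
main program.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, read by the main program.
   sig_atomic_t is the type the C standard guarantees is safe here. */
static volatile sig_atomic_t exit_requested = 0;

/* The handler only records the request: no Xlib, no stdio, no malloc. */
static void note_sigterm(int signo)
{
    (void)signo;
    exit_requested = 1;
}

/* Install the handler for SIGTERM; returns 0 on success, -1 on failure. */
int install_sigterm_flag(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* To be called from the main program at points where no X11 call
   is in progress; nonzero means a shutdown was requested. */
int should_exit(void)
{
    return exit_requested != 0;
}
```

The main loop would call 'should_exit()' between X11 calls and do the
actual shutdown (what 'goodbye' does now) from there, outside of any
signal handler.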

And to make things clearer: there is a bug in the original code (some
folks say that this is a bug in the X11 design, but the issue is simply
deferred to other libraries like libXt). Recent changes just give me extra
incentive to run 'make book.pdf'...

--
Waldek Hebisch

Ralf Hemmecke

May 5, 2024, 11:39:23 AM
to fricas...@googlegroups.com
On 5/5/24 16:16, Waldek Hebisch wrote:
> Testing recent patch by Qian I got 'view3D' which showed empty
> window and used 100% of a core. A little investigation using 'gdb'
> shows that this 'view3D' was trying to exit. This was triggered from
> a signal handler: 'view3D' received a signal (SIGTERM I think) and
> 'goodbye' in 'stuff3d.c' was the handler. However, signal arrived
> during X11 call, namely 'view3D' was calling 'XGetWindowAttributes'
> from 'libX11'. IIUC this is incorrect: one is not allowed to perform
> X11 call when another call is in progress.

So, IIUC what you are saying, X11 is to blame.
With Qian's patch, one file (for example, ug11.input) is processed under
xvfb. This is basically one X11 session running. If we could split the
commands into several xvfb+fricas calls, then they could not possibly
interact with each other, since these would be different X11 sessions.
Is this a correct picture?

Well, I am not saying that it is actually possible to split the \psXtc
stuff into separate parts, since some might depend on the execution of
previous \psXtc environments. But I guess such a splitting would also
avoid races. Am I wrong?

> Recent changes just give me extra incentive to run 'make
> book.pdf'...

I am against inclusion of Qian's patch if it forces me only to hope that
the book will be compiled correctly.

Waldek, what you are saying does not sound like these races come from
"make -jN ..." with N>1, since each file is processed under a separate
xvfb. These races should also happen with -j1, right?

Ralf

Waldek Hebisch

May 5, 2024, 2:27:02 PM
to fricas...@googlegroups.com
On Sun, May 05, 2024 at 05:39:21PM +0200, Ralf Hemmecke wrote:
> On 5/5/24 16:16, Waldek Hebisch wrote:
> > Testing recent patch by Qian I got 'view3D' which showed empty
> > window and used 100% of a core. A little investigation using 'gdb'
> > shows that this 'view3D' was trying to exit. This was triggered from
> > a signal handler: 'view3D' received a signal (SIGTERM I think) and
> > 'goodbye' in 'stuff3d.c' was the handler. However, signal arrived during
> > X11 call, namely 'view3D' was calling 'XGetWindowAttributes' from
> > 'libX11'. IIUC this is incorrect: one is not allowed to perform
> > X11 call when another call is in progress.
>
> So, IIUC what you are saying then X11 is to blame.

Well, X-related code.

> With Qian's patch one file, for example, ug11.input is processed under
> xvfb. This is basically one X11 session running. If we could split the
> commands to several xvfb+fricas calls, then they cannot possibly
> interact with each other since these would be different X11 sessions.
> Is this a correct picture?
>
> Well, I am not saying, that it is actually possible to split the \psXtc
> stuff into separate parts, since some might depend on the execution of
> previous \psXtc environments. But I guess such a splitting would also
> avoid races. Am I wrong?

Yes, you are wrong. Maybe it will help if I explain a bit of the design of
X11. First, there is the X server. The X server does the real graphic work
(that is, drawing images on the screen) and also some administrative
work. Programs connect to the X server using sockets and _each_ client
program has its own connection. AFAICS the X server is working fine and
the trouble is within the client program.

Theoretically a client program could talk to the X server directly. However,
clients usually use Xlib. The client calls Xlib functions and Xlib is
responsible for the actual communication with the X server. There are higher
level libraries, but our programs currently call Xlib directly.

The "100% cpu use" problem is due to signal handling. A client, like
'view3D', receives a signal and tries to shut down its connection to
the X server. However, signal handling is tricky: when a signal arrives
the program may be in the middle of modifying critical data structures.
In particular, most Unix/Linux libraries, including Xlib, do not
allow calls from signal handlers. There is a handful of Xlib calls
that are unconditionally safe, and _most_ of the time calling Xlib
from a signal handler works as naively expected. But sometimes
calls from a signal handler cause trouble. If you want to find
the guilty party, that is easy: our code in 'view3D' is doing a forbidden
thing which causes trouble.

Concerning possible fixes: Xt has special support for signal handlers,
it remembers that a signal arrived and calls the handler when it is safe.
We could implement a similar thing ourselves: set a flag inside the handler
and regularly test if the flag is set.
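
One way the "set a flag and test it" idea could look around a
'select'-style wait, as a sketch under assumptions rather than the actual
'view3D' code. It uses 'pselect' instead of plain 'select' (a substitution,
not something from the discussion above) so that SIGTERM stays blocked
while the program runs and can only be delivered while it is waiting.
Here 'fd' stands for whatever descriptor the loop waits on, and all
function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>

static volatile sig_atomic_t got_sigterm = 0;

/* Handler only sets the flag; it never touches Xlib. */
static void on_sigterm(int signo)
{
    (void)signo;
    got_sigterm = 1;
}

/* Install the handler and block SIGTERM for normal execution.
   On return *wait_mask is the previous mask with SIGTERM removed,
   i.e. the mask to use while waiting.  Returns 0 or -1. */
int setup_deferred_sigterm(sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) != 0)
        return -1;
    sigdelset(wait_mask, SIGTERM);
    return 0;
}

/* Wait until fd is readable or SIGTERM was noticed.
   Returns 1 if fd is readable, 0 if shutdown was requested,
   -1 on error. */
int wait_for_event(int fd, const sigset_t *wait_mask)
{
    for (;;) {
        fd_set rd;
        int n;

        if (got_sigterm)
            return 0;
        FD_ZERO(&rd);
        FD_SET(fd, &rd);
        /* SIGTERM is unblocked only for the duration of this call. */
        n = pselect(fd + 1, &rd, NULL, NULL, NULL, wait_mask);
        if (n > 0)
            return 1;
        if (n < 0 && errno != EINTR)
            return -1;
        /* EINTR: loop and re-check the flag. */
    }
}
```

With this arrangement the handler cannot run in the middle of an Xlib
call at all, and a return value of 0 from 'wait_for_event' tells the main
loop to close the X connection and exit in the normal way.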

In a sense the core trouble is that we have parallel channels of
communication. Data mostly goes via sockets, but there are also
signals, ordinary files and Unix standard input/output.

I know a clear reason for just one problem ("100% cpu use"). Sometimes
graphic processes just hang waiting for something; probably in this
case the signal to shut down was lost (but this is just a guess).

> > Recent changes just give me extra incentive to run 'make
> > book.pdf'...
>
> I am against inclusion of Qian's patch if it forces me only to hope that the
> book will be compiled correctly.

Sorry to say this: this is not a new problem, all the time you had to
hope that the book will be compiled correctly. Simply, probabilities were
in our favour and most of the time the book worked correctly. So now could
be the first time you see the problem.

> Waldek, what you are saying does not sound like these races come from "make
> -jN ..." with N>1, since each file is processed under a separate xvfb. These
> races should also happen with -j1, right?

Yes, races can happen with -j1. Simply changing machine load, adding
multiple jobs, or changing delays affects timing and may change the result
of the race.

--
Waldek Hebisch