
Re: [Caml-list] Severe loss of performance due to new signal handling


Christophe TROESTLER

Mar 17, 2006, 2:12:49 PM3/17/06
to OCaml Mailing List
Hi,

On Fri, 17 Mar 2006, "Markus Mottl" <markus...@gmail.com> wrote:
>
> Profiling using oprofile revealed that the function
> "caml_process_pending_signals" seems to be responsible for that.

An earlier related thread:
http://caml.inria.fr/pub/ml-archives/caml-list/2006/02/2858f1e4532daae90d5b0762e3fff3cd.en.html

But your code is even more striking!

> OCaml-3.08.4 does not exhibit any problems of that kind.

If somebody who has both OCaml 3.08 and 3.09 on their machine is willing
to spend some time checking whether the same thing happens with the
above-mentioned program, it would be appreciated.

Best regards,
ChriS

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Xavier Leroy

Mar 20, 2006, 4:32:04 AM3/20/06
to Markus Mottl, ocaml
> It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> can lead to a very significant loss of performance (up to several orders
> of magnitude!) in code that uses threads and performs I/O (tested on Linux).
> [...]
> Maybe some assembler guru can repeat this result and explain to us
> what's going on...

Short explanation: atomic instructions are dog slow.

Longer explanation:

OCaml 3.09 fixed a number of long-standing bugs in signal handling
that could cause signals to be "lost" (not acted upon). The fixes,
located mostly in the code that polls for pending signals
(caml_process_pending_signals), rely on an atomic "read-and-clear"
operation, implemented using atomic processor instructions on x86,
x86-64 and PPC. This makes signal handling correct (no signal can be
lost) but I didn't realize that it has such an impact on performance,
even on a uniprocessor machine. Thanks for pointing this out.

(To prevent a number of well-meaning but irrelevant posts, keep in
mind that we're using atomic instructions in a single-threaded
program, to get atomicity w.r.t. signals, not w.r.t. concurrent threads.
We don't need the latter kind of atomicity given OCaml's threading model.)

Now, you may wonder why the problem appears mainly with threaded
programs. The reason is that programs linked with the Thread library,
even if they do not create threads, check for signals much more
often, because they enter and leave blocking sections more often. In
your example, each call to "print_char" needs to lock and unlock the
stdout channel, causing two signal polls each time.

So, it's time to go back to the drawing board. Fortunately, it
appears that reliable polling of signals is possible without atomic
processor instructions. Expect a fix in 3.09.2 at the latest, and
probably within a couple of weeks in the CVS.

Regards,

- Xavier Leroy

Oliver Bandel

Mar 20, 2006, 5:42:18 AM3/20/06
to caml...@inria.fr
On Mon, Mar 20, 2006 at 10:29:49AM +0100, Xavier Leroy wrote:
> > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > can lead to a very significant loss of performance (up to several orders
> > of magnitude!) in code that uses threads and performs I/O (tested on
> > Linux).
> > [...]
> > Maybe some assembler guru can repeat this result and explain to us
> > what's going on...
>
> Short explanation: atomic instructions are dog slow.
>
> Longer explanation:
>
> OCaml 3.09 fixed a number of long-standing bugs in signal handling
> that could cause signals to be "lost" (not acted upon). The fixes,
[...]

> Now, you may wonder why the problem appears mainly with threaded
> programs. The reason is that programs linked with the Thread library,
> even if they do not create threads, check for signals much more
> often, because they enter and leave blocking sections more often. In
> your example, each call to "print_char" needs to lock and unlock the
> stdout channel, causing two signal polls each time.

Is this really necessary? Locking around every write to stdout,
even when it is not explicitly wanted?!


> So, it's time to go back to the drawing board. Fortunately, it
> appears that reliable polling of signals is possible without atomic
> processor instructions. Expect a fix in 3.09.2 at the latest, and
> probably within a couple of weeks in the CVS.

I'm not clear on what your problem is with lost signals,
but when using signals on Unix/Linux systems you can use the
Unix API: with sigaction/sigprocmask etc. you can do things well,
whereas with the signal function that C provides things are worse.
The C-API function signal(3) resets the signal handler
after it has been called. With the sigaction/sigprocmask/... functions
the handler remains installed.

Whether this is what you have in mind (and how it would be done
on Windows or other systems) I don't know, but maybe it is
a hint that matters.

BTW: I saw that the Unix signalling functions are now included
in the Unix module... (they were not in older versions of OCaml).

Ciao,
Oliver

Gerd Stolpmann

Mar 20, 2006, 7:39:51 AM3/20/06
to Oliver Bandel, caml...@inria.fr
Am Montag, den 20.03.2006, 11:39 +0100 schrieb Oliver Bandel:
> On Mon, Mar 20, 2006 at 10:29:49AM +0100, Xavier Leroy wrote:
> > > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > > can lead to a very significant loss of performance (up to several orders
> > > of magnitude!) in code that uses threads and performs I/O (tested on
> > > Linux).
> > > [...]
> > > Maybe some assembler guru can repeat this result and explain to us
> > > what's going on...
> >
> > Short explanation: atomic instructions are dog slow.
> >
> > Longer explanation:
> >
> > OCaml 3.09 fixed a number of long-standing bugs in signal handling
> > that could cause signals to be "lost" (not acted upon). The fixes,
>
> I'm not clear on what your problem is with lost signals,
> but when using signals on Unix/Linux systems you can use the
> Unix API: with sigaction/sigprocmask etc. you can do things well,
> whereas with the signal function that C provides things are worse.
> The C-API function signal(3) resets the signal handler
> after it has been called. With the sigaction/sigprocmask/... functions
> the handler remains installed.

The problem is the following: the O'Caml runtime cannot handle signals
immediately because this would break memory management (e.g. imagine a
signal arriving when memory has just been allocated but not yet initialized).
To get around this, the signal handler just sets a flag, and the compiler
emits instructions that regularly check this flag at safe points of
execution (i.e. where memory is known to be initialised). These instructions
are now atomic in 3.09. In 3.08, you have basically

  if "flag is set" then (
    (*)
    "clear flag";
    "call the signal handler function"
  )

If another signal happens at (*) it will be lost.

As you mention sigprocmask: Of course, you can block signals before
checking the flag and allow them again after clearing it, but this would
be even _much_ slower than the solution in 3.09, because sigprocmask
needs a context switch to do its work (it is a kernel function).

I don't know what Xavier has in mind to solve the problem, but I would
think about reducing the frequency of the atomic check.
This could work as follows:

- Revert the check to the 3.08 solution
- Use the alarm clock timer to regularly call a signal_manager
  function at a certain frequency (i.e. the signal flag is set
  at a certain frequency)
- Only the alarm clock timer signal is left unblocked. The
  other signals are normally blocked.
- In signal_manager, it is checked whether there are other
  pending signals, and if so, their functions are called.

Of course, it is again possible that alarm clock signals are lost, but
this is harmless, because it is a repeatedly emitted signal. The other
signals cannot be lost, but their execution is deferred to the next
alarm clock event.

> But if this is what you think about (and how it will be done
> on windows or other systems) I don't know, but maybe this is
> a hint that matters.
>
> BTW: I saw that in the Unix-module the unix-signalling functions are
> now included... (they were not in older versions of OCaml).

They have been included for a long time. New is Thread.sigmask.

Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
ge...@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------

Oliver Bandel

Mar 20, 2006, 8:16:09 AM3/20/06
to caml...@inria.fr

Well, I'm not an OCaml internals specialist, so I can't say
whether this would be necessary...

At first glance it looks like the problem one has when using
signal(3) instead of sigprocmask(), sigaction() and co.

>
> As you mention sigprocmask: Of course, you can block signals before
> checking the flag and allow them again after clearing it, but this would
> be even _much_ slower than the solution in 3.09, because sigprocmask
> needs a context switch to do its work (it is a kernel function).

Why call such functions often?

You can use sigaction() to handle signals when you want to;
even if signals are blocked, their occurrence will be saved.
When you want to handle them, then you can do it.

It's too long ago for me to give details here, but if wanted
I can look them up (not today, but tomorrow I will have some time
to do it).

(The only thing you can't find out with this mechanism is
which of the signals came first and which later.)

>
> I don't know what Xavier has in mind to solve the problem, but I would
> think about reducing the frequency of the atomic check.
> This could work as follows:
>
> - Revert the check to the 3.08 solution
> - Use the alarm clock timer to regularly call a signal_manager
> function at a certain frequency (i.e. the signal flag is set
> at a certain frequency)

Using alarm() is not reliable.


[...]


> > BTW: I saw that in the Unix-module the unix-signalling functions are
> > now included... (they were not in older versions of OCaml).
>
> They have been included for a long time. New is Thread.sigmask.

Depends on the definition of "long time" ;-)
When I first came in contact with OCaml, which really is some years ago,
they were not included (I think 3.04?).
I didn't look for these functions, and just noticed them while
looking for other things, at about 3.08 (?).
So I was astounded. This makes OCaml better suited for
applications in the real world, because C's signal(3) is unreliable.
(When a signal is caught, the handler is deactivated until it is
re-established again - that's the same problem you mentioned above.
So if a signal comes twice, you lose one. And since the handler must
re-install itself, there is a window that could make it
unreliable too. Only with sigprocmask()/sigaction() and so on
can you do it reliably and cleanly.)

Ciao,
Oliver

Xavier Leroy

Mar 20, 2006, 10:58:49 AM3/20/06
to Gerd Stolpmann, caml...@inria.fr
> The problem is the following: [...] In 3.08, you have basically

>
> if "flag is set" then (
>   (*)
>   "clear flag";
>   "call the signal handler function"
> )
>
> If another signal happens at (*) it will be lost.

Actually, the problematic code in 3.08 is:

  tmp <- flag;
  (*)
  flag <- 0;
  if (tmp) { process the signal; }

and indeed a signal can be lost (never processed) if it occurs at (*).

The solution I have in mind is to implement exactly the pseudocode you
give above. If a signal occurs at (*), it is not lost (the signal
handler function will be called just afterwards!), just conflated with
a previous occurrence of that signal, but this is fair game: POSIX
signals have the same behaviour. (Yes, I'm ignoring the queueing
behaviour of realtime POSIX signals.)

Note however that in 3.09 and in my proposed fix, there is one flag
per signal, which still improves over 3.08 (which had only one shared
flag) and ensures that two occurrences of different signals are not
conflated, again as per POSIX.

> I don't know what Xavier has in mind to solve the problem, but I would
> think about reducing the frequency of the atomic check.

That would be plan C, plan B being making the check even more efficient.
I'd rather not introduce timer signals if at all possible, though,
since these mess up many function calls.

- Xavier Leroy

Will Farr

Mar 20, 2006, 11:26:50 AM3/20/06
to ocaml
Hello all,

As an aside, if anyone is interested in techniques for making atomic
transactions fast with low latency, etc, the paper

Atomic heap transactions and fine-grain interrupts by Olin Shivers,
James W. Clark and Roland McGrath:
http://www-static.cc.gatech.edu/~shivers/papers/heap.ps

presents several *neat* hacks to do this efficiently. I'm sure that
the implementors on the list are already aware of this work, but I
just wanted to point it out as interesting reading for people (like
myself) who think this stuff is neat but don't necessarily have broad
experience with it.

Will

Robert Roessler

Mar 20, 2006, 8:35:20 PM3/20/06
to Caml-list
Xavier Leroy wrote:
> > It seems that changes to signal handling between OCaml 3.08.4 and 3.09.1
> > can lead to a very significant loss of performance (up to several orders
> > of magnitude!) in code that uses threads and performs I/O (tested on
> > Linux).
> > [...]
> > Maybe some assembler guru can repeat this result and explain to us
> > what's going on...
>
> Short explanation: atomic instructions are dog slow.

At the risk of being "irrelevant", I wanted to nail down exactly what
assertion is being made here: are we talking about directly executing
in assembly code the relevant x86[-64]/ppc/whatever instructions for
"read-and-clear", or going through OS-dependent access routines like
Windows' InterlockedExchange()?

Or: is the source of the dog slow behavior because of OS overhead, or
is it a low-level issue like memory barriers/cache lines getting
flushed/something else?

Robert Roessler
roes...@rftp.com
http://www.rftp.com

Brian Hurt

Mar 20, 2006, 11:05:35 PM3/20/06
to Markus Mottl, Robert Roessler, Caml-list

On Mon, 20 Mar 2006, Markus Mottl wrote:

> On 3/20/06, Robert Roessler <roes...@rftp.com> wrote:
>>
>> At the risk of being "irrelevant", I wanted to nail down exactly what
>> assertion is being made here: are we talking about directly executing
>> in assembly code the relevant x86[-64]/ppc/whatever instructions for
>> "read-and-clear", or going through OS-dependent access routines like
>> Windows' InterlockedExchange()?
>
> We are talking of the assembly code. See file byterun/signals_machdep.h,
> which contains the corresponding macros.

OK, poking around a little bit in byterun, I'm seeing this piece of code:

  for (signal_number = 0; signal_number < NSIG; signal_number++) {
    Read_and_clear(signal_state, caml_pending_signals[signal_number]);
    if (signal_state) caml_execute_signal(signal_number, 0);
  }

with Read_and_clear being defined as:

#if defined(__GNUC__) && defined(__i386__)

#define Read_and_clear(dst,src) \
    asm("xorl %0, %0; xchgl %0, %1" \
        : "=r" (dst), "=m" (src) \
        : "m" (src))


xchgl is the atomic operation (this is always atomic when referencing a
memory location, regardless of the presence or absence of a lock prefix).

Apropos of nothing, a better definition of that macro would be:

#define Read_and_clear(dst,src) \
    asm volatile ("xchgl %0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

as this gives gcc the choice of how to move 0 into the register (using an
xor will still be a popular choice, but it'll occasionally do a movl
depending upon instruction scheduling choices).

Some more poking around tells me that NSIG is defined on Linux to be 64.

I think the problem is not doing an atomic operation, but doing 64 of
them. I'd be inclined to move to a bitset implementation - allowing you
to replace 64 atomic instructions with 2.

On the x86, you can use the lock bts instruction to set the bit. Some
implementation like:

#if defined(__GNUC__) && defined(__i386__)

typedef unsigned long sigword_t;

#define Read_and_clear(dst,src) \
    asm volatile ("xchgl %0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

#define Set_sigflag(sigflags, NR) \
    asm volatile ("lock bts %1, %0" \
        : "+m" (*sigflags) \
        : "rN" (NR) \
        : "cc")

..

#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))

#define NR_SIGWORDS ((NSIG + SIGWORD_BITS - 1)/SIGWORD_BITS)

extern sigword_t caml_pending_signals[NR_SIGWORDS];

for (i = 0; i < NR_SIGWORDS; i++) {
    sigword_t temp;
    int j;

    Read_and_clear(temp, caml_pending_signals[i]);
    for (j = 0; temp != 0; j++) {
        if ((temp & 1ul) != 0) {
            caml_execute_signal((i * SIGWORD_BITS) + j, 0);
        }
        temp >>= 1;
    }
}


This is somewhat more code, but i, j, and temp would all end up in
registers, and it'd be two atomic instructions, not 64.

The x86 assembly code I can dash off from the top of my head. Similar
bits of assembly can be written for other CPUs - I just have to go dig out
the right books.

Brian

Robert Roessler

Mar 21, 2006, 7:58:20 AM3/21/06
to Caml-list
Markus Mottl wrote:
> On 3/20/06, *Robert Roessler* <roes...@rftp.com
>
> At the risk of being "irrelevant", I wanted to nail down exactly what
> assertion is being made here: are we talking about directly executing
> in assembly code the relevant x86[-64]/ppc/whatever instructions for
> "read-and-clear", or going through OS-dependent access routines like
> Windows' InterlockedExchange()?
>
>
> We are talking of the assembly code. See file
> byterun/signals_machdep.h, which contains the corresponding macros.

Thanks, Markus - in the case you cite (direct instruction use), I was
hoping for some illumination on this huge cost... reviewing the Intel
manuals, I note that:

1) there is *no* claim that cache lines are flushed just by doing the xchg

2) in fact, from the Pentium Pro on, the bus LOCK# operation will not
even happen if the data is cached - everything is left to the cache
coherency mechanism

3) there *is* mention of processor *cache locking*, but this is still
just in the context of cache coherency with multiple processors... so
nothing here is suggesting cache line flushing or anything else that
sounds horrendously expensive, particularly in the single CPU case

< 8 hours later, back to finish email :) >

Finally, it is interesting that you bring up this file - it appears as
if the msvc toolchain is no longer supported for doing "correct" (in
terms of Xavier's "atomicity w.r.t. signals") builds... at least that
is how I interpret the conditional compilation directives.
