Synchronization using a boolean variable

Emir Uner

Mar 27, 2003, 6:37:40 AM

Considering the following code, is there any problem that
I cannot see? When I run the program at a dual CPU x86 machine
it produces the numbers as expected.

------------- Program -------------
#include <stdio.h>
#include <pthread.h>

int next = 0;
volatile int data = 0;
volatile char avail = 0;

void* prod(void* arg)
{
    while(1)
    {
        while(avail);
        ++next;
        data = next;
        avail = 1;
    }
}

void* cons(void* arg)
{
    while(1)
    {
        while(!avail);
        printf("%d\n", data);
        avail = 0;
    }
}

int main()
{
    pthread_t p, c;

    pthread_create(&p, NULL, prod, NULL);
    pthread_create(&c, NULL, cons, NULL);
    while(1);
}
-------------------------------------

Alexander Supalov

Mar 27, 2003, 7:32:35 AM

Hi!

> Considering the following code, is there any problem that
> I cannot see? When I run the program at a dual CPU x86 machine
> it produces the numbers as expected.

As it should as long as you have interprocessor cache coherency properly
supported by the platform (NEC SX machines are the most noteworthy
exception, at least as of SX-4). You've actually implemented a rather
naive variety of a spin lock, which is known to work.

The problem with it is that you may spend a lot of time spinning on the
variable, thus squandering CPU time, since you may give another process
on the same processor no chance to run until your quantum is over. This
will hurt the overall system performance.

Next, if you happen to run this program on a single CPU machine, you may
well observe that the performance of your program will in fact depend on
that quantum. This may happen if the producer goes into its spin loop
before the consumer has had the chance to catch up.

You avoid this possibility in the consumer only thanks to the printf
call there, because it will probably make the thread calling it yield
the CPU somewhere deep in the C library. But, if you're explicitly
heading for a multiprocessor machine, it's no problem.

There are better varieties of the spin locks, which the experts on this
list won't hesitate to point you to. Let me just tell you that you don't
really want to spin unless you're going to break latency records, which
intention the printf call doesn't really fit with.

Best regards.

Alexander

--
Dr Alexander Supalov
Senior Software Engineer
--------------------------------------------------------------------
//// pallas / A Member of the ExperTeam Group
Pallas GmbH / Hermuelheimer Str. 10 / 50321 Bruehl / Germany
Alexande...@pallas.com / www.pallas.com
Tel +49-2232-1896-34 / Fax +49-2232-1896-29
--------------------------------------------------------------------

Alexander Terekhov

Mar 27, 2003, 7:56:53 AM

Alexander Supalov wrote:
>
> Hi!
>
> > Considering the following code, is there any problem that

Yes. There are many problems. Try the google groups search...

> > I cannot see? When I run the program at a dual CPU x86 machine
> > it produces the numbers as expected.
>
> As it should as long as you have interprocessor cache coherency properly

> supported by the platform ....

http://rsim.cs.uiuc.edu/~sadve/Publications/models_tutorial.ps

"5.2.1 Cache Coherence and Sequential Consistency

Several definitions for cache coherence (also referred to
as cache consistency) exist in the literature. The strongest
definitions treat the term virtually as a synonym for
sequential consistency. Other definitions impose
extremely relaxed ordering guarantees.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What does the programmer expect from the memory system to
ensure correct execution of this program fragment? One
important requirement is that the value read from the
data field within a dequeued record should be the same
as that written by P1 in that record. However, in many
commercial shared memory systems, it is possible for
processors to observe the old value of the data field
(i.e., the value prior to P1's write of the field),
leading to behavior different from the programmer's
expectations. "

regards,
alexander.

Alexander Supalov

Mar 27, 2003, 8:02:12 AM

Hi!

> What does the programmer expect from the memory system to
> ensure correct execution of this program fragment? One
> important requirement is that the value read from the
> data field within a dequeued record should be the same
> as that written by P1 in that record. However, in many
> commercial shared memory systems, it is possible for
> processors to observe the old value of the data field
> (i.e., the value prior to P1's write of the field),
> leading to behavior different from the programmer's
> expectations. "

Thanks for expanding on the word "properly".

Joseph Seigh

Mar 27, 2003, 8:22:38 AM

Except cache is transparent. If you could tell it was there
from its effect on programming (apart from performance) then
it wouldn't be transparent. Most of the memory behavior that
we discuss in this newsgroup derives from the fact
that most processors are pipelined and some use out of order
memory access. Cache has nothing to do with it.

Joe Seigh

Alexander Supalov

Mar 27, 2003, 8:56:19 AM

Hi!

> Except cache is transparent. If you could tell it was there
> from its effect on programming (apart from performance) then
> it wouldn't be transparent. Most of the memory behavior that
> we discuss in this newsgroup derives from the fact
> that most processors are pipelined and some use out of order
> memory access. Cache has nothing to do with it.

Wait a moment. Alex is certainly right in that we should agree on the
terms - and please correct me if I name something that I think of as
cache coherency somewhat differently from your point of view.

Let's consider two processors, P1 and P2 running over a piece of shared
memory, with whatever cache hierarchy between the main RAM and the CPU
registers.

Now, suppose we have P1 writing something into a memory cell C. Whatever
stage of the CPU pipeline that happens at is not so important: I've
written something into memory, and it's supposed to be there visible to
all who care once that stage has completed its job.

Now, P2 happens to request the value of C at an awkward moment, when
this new value has not yet been written back into the main memory of the
system. Note however that P1 has to the best of its knowledge already
performed the write to the respective memory address, for propagation of
the new value down the memory hierarchy is something other parts of the
processor or the memory subsystem are going to deal with.

What I'd expect a cache coherent system to do is as follows: P2 is able
to sense that the old value of C possibly contained in one of its caches
is in fact stale, and through whatever mechanism necessary, the system
ensures that the value just committed by P1 becomes available to P2
before the corresponding stage of the P2's pipeline considers the data
valid for further processing.

This I call "properly supported cache coherency". In return, please
explain to me what pipelining and out-of-order memory access have to
do with that: I'm eager to learn.

Joseph Seigh

Mar 27, 2003, 9:06:07 AM

Alexander Supalov wrote:
>
> Hi!
>
> > Except cache is transparent. If you could tell it was there
> > from its effect on programming (apart from performance) then
> > it wouldn't be transparent. Most of the memory behavior that
> > we discuss in this newsgroup derives from the fact
> > that most processors are pipelined and some use out of order
> > memory access. Cache has nothing to do with it.
>
> Wait a moment. Alex is certainly right in that we should agree on the
> terms - and please correct me if I name something that I think of as
> cache coherency somewhat differently from your point of view.
>

(snip)

>
> This I call "properly supported cache coherency". In return, please
> explain to me what pipelining and out-of-order memory access have to
> do with that: I'm eager to learn.
>

We're talking about memory visibility from a program point of view.
Cache has nothing to do with it because by definition it is transparent.
There is nothing a program can do that will let it determine whether or
not there is cache apart from performance effects as I have mentioned
previously.

The behavior you are interested in is the memory model and that is
documented in the processor's architecture manual and that memory
behavior exists whether cache is installed or not.

So, to discuss something which you can neither see nor detect does not make
a lot of sense.

Joe Seigh

Alexander Terekhov

Mar 27, 2003, 9:30:12 AM

Alexander Supalov wrote:
[...]

> This I call "properly supported cache coherency". In return, please
> explain to me what pipelining and out-of-order memory access have to
> do with that: I'm eager to learn.

Take a look at:

http://gene.wins.uva.nl/~jcarolus/java/MultiprocessorSafe.pdf

For example,

<quote>

Note that CPU caches are not shown in the diagram. If the CPU cache is
kept consistent by hardware, we can consider it part of the box labeled
"memory". If the CPU cache is not kept consistent, we can consider it
part of the boxes labeled "CPU". Either way, CPU caching is irrelevant
to our discussion [Schimmel].

</quote>

regards,
alexander.

Alexander Supalov

Mar 27, 2003, 10:01:59 AM

Hi!

> There is nothing a progam can do that will let it determine whether or
> not there is cache apart from performance effects as I have mentioned
> previously.

Well, some processors do provide a way to determine what they have
inside ;)

> The behavior you are interested in is the memory model and that is
> documented in the processor's architecture manual and that memory
> behavior exists whether cache is installed or not.

Can't agree without further investigation. If we were to have no cache
at all, all writes to memory would be write-throughs, and the other
processor would see the results immediately. If we have a cache layer,
there's a chance that something doesn't happen to match the contents of
the main memory, and here we have cache-coherent and -incoherent
systems.

I'll get to the bottom of this, thanks.

Alexander Supalov

Mar 27, 2003, 10:26:44 AM

Hi!

> Note that CPU caches are not shown in the diagram. If the CPU cache is
> kept consistent by hardware, we can consider it part of the box labeled
> "memory". If the CPU cache is not kept consistent, we can consider it
> part of the boxes labeled "CPU". Either way, CPU caching is irrelevant
> to our discussion [Schimmel].

Let me cite something from the article involved:

"In reality, the write may go only as far as a hardware-consistent
cache; but again, in our model, we can simply consider this being in
the shared memory."

Now, what I've been trying to say is that if you don't have that
"hardware-consistent cache", you'll have to take extra precautions that
the original program doesn't seem to do, for volatile is about loading
the registers anew, and on a cache-incoherent system nothing guarantees
that whatever is written by the producer becomes visible to the consumer
(i.e., reaches the main memory as seen by the consumer) at all.

My point has been and remains: if you don't get this step right (thanks
to the built-in hardware support or special actions effected by the
program), it doesn't matter whether finer facets of the memory model are
involved or not. Of course, when this is ensured, your points concerning
the memory model idiosyncrasies are correct, thanks for pointing that
out.

Alexander Supalov

Mar 27, 2003, 10:35:31 AM

Hi!

> So, to discuss something which you can neither see nor detect does not make
> a lot of sense.

Well, I think that I've finally got to the point at which we differ.

You're talking about the order in which whatever one processor does reaches the
other one. I'm talking about whether that reaches the other processor at
all. Note that "reaching" is more important in this case than the
"order": if you don't reach something you intend to, it doesn't really
matter in what order you fail.

Otherwise, I agree that those fine ripples do make a difference; thanks
for the point.

Patrick TJ McPhee

Mar 27, 2003, 12:53:49 PM

In article <a2959ae.03032...@posting.google.com>,
Emir Uner <emir...@hotmail.com> wrote:

% Considering the following code, is there any problem that
% I cannot see? When I run the program at a dual CPU x86 machine
% it produces the numbers as expected.

You're wasting a lot of CPU spinning. There is no reason at all
to spin in main() -- you should either have main() call one of the
thread functions directly or have it exit using pthread_exit().

There's no benefit to spinning on avail. From a design perspective,
what you're doing with avail is preventing the two threads from
running concurrently. Thus, the use of threads here is pointless.
In any case, what you ought to do is use condition variables to
efficiently signal the change in value of avail.

There's no point to using volatile. Nothing in POSIX suggests that
volatile relates to threading in any way, we have had a strong opinion
from a noted expert that volatile will not aid in thread safety in POSIX
implementations, and it undoubtedly slows things down. For your program
to be correct, none of your global variables should be volatile, and
you should protect access to them using one or more mutexes.

I would summarise by saying that there is absolutely nothing right
about this program. Whether there's a problem depends on what your
goals are.
--

Patrick TJ McPhee
East York Canada
pt...@interlog.com
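
Editor's sketch of the mutex plus condition variable approach suggested in the post above. This is a minimal illustration, not code posted by anyone in the thread. The bound ITEMS, the running sum, and the function name run_condvar_demo are hypothetical additions so that the example terminates and its result can be checked; the original program loops forever and prints each value instead. None of the shared variables is volatile, and every access happens with the mutex held.

```c
#include <pthread.h>
#include <stddef.h>

enum { ITEMS = 1000 };  /* hypothetical bound so the example terminates */

/* One-slot mailbox. Plain variables, only touched with `lock` held. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static int data = 0;
static int avail = 0;

static void *prod(void *arg)
{
    (void)arg;
    for (int next = 1; next <= ITEMS; ++next) {
        pthread_mutex_lock(&lock);
        while (avail)                        /* slot still full: sleep */
            pthread_cond_wait(&changed, &lock);
        data = next;
        avail = 1;
        pthread_cond_signal(&changed);       /* wake the consumer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *cons(void *arg)
{
    long *sum = arg;
    for (int i = 0; i < ITEMS; ++i) {
        pthread_mutex_lock(&lock);
        while (!avail)                       /* slot empty: sleep */
            pthread_cond_wait(&changed, &lock);
        *sum += data;                        /* stands in for printf() */
        avail = 0;
        pthread_cond_signal(&changed);       /* wake the producer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs both threads to completion; returns the sum of consumed values. */
long run_condvar_demo(void)
{
    pthread_t p, c;
    long sum = 0;

    pthread_create(&p, NULL, prod, NULL);
    pthread_create(&c, NULL, cons, &sum);
    pthread_join(p, NULL);                   /* block instead of spinning */
    pthread_join(c, NULL);
    return sum;
}
```

A single condition variable is enough here because there is exactly one producer and one consumer and they never wait for the same state of avail. With several producers or consumers, separate condition variables (or pthread_cond_broadcast) would probably be the safer choice.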

Alexander Supalov

Mar 28, 2003, 2:42:04 AM

Hi!

> There's no benefit to spinning on avail. From a design perspective,
> what you're doing with avail is preventing the two threads from
> running concurrently. Thus, the use of threads here is pointless.
> In any case, what you ought to do is use condition variables to
> efficiently signal the change in value of avail.

No, it'll run alright (as it does) on a reasonable multiprocessor
machine. Waste of time is sometimes acceptable when what one's actually
targeting is low latency.

Many lock-free MPI implementations work this way, and they are miles
ahead of any condvar/mutex based competition in latency and no worse in
bandwidth, to put it mildly. However, there one has interprocess
communication as opposed to the intraprocess one discussed here.

Nevertheless, this approach is legal as long as it does what it's
supposed to.

> There's no point to using volatile. Nothing in POSIX suggests that
> volatile relates to threading in any way, we have had a strong opinion
> from a noted expert that volatile will not aid in thread safety in POSIX
> implementations, and it undoubtedly slows things down. For your program
> to be correct, none of your global variables should be volatile, and
> you should protect access to them using one or more mutexes.

Sure, volatile is about getting something from memory instead of
using something that may accidentally be in a register. This may make
sense on some architectures and/or with some compiler settings.

I've personally observed a program with this kind of a loop that on a
high optimization level was happily assuming the value was zero while it
actually was not. Adding volatile (or lowering the optimization level)
did help.

The original poster may wish to try removing volatile and watching whether
that changes anything in his case. Perhaps it'll even run faster, who
knows.

> I would summarise by saying that there is absolutely nothing right
> about this program. Whether there's a problem depends on what your
> goals are.

Well, if a program runs, it's basically correct. One should however
understand the limitations of the approach, I'm with you here.

David Butenhof

Mar 28, 2003, 6:53:44 AM

Alexander Supalov wrote:

> Well, if a program runs, it's basically correct.

In a simple sequential program, that's often nearly true. In an asynchronous
program, most bugs are driven by TIMING, not by sequential logic errors,
and this statement is dangerously false.

The most you can say is that, if a program runs, no error has yet been
observed.

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\----[ http://homepage.mac.com/dbutenhof/Threads/Threads.html ]---/

Joseph Seigh

Mar 28, 2003, 7:35:35 AM

Alexander Supalov wrote:
>
> Hi!
>
> > There's no benefit to spinning on avail. From a design perspective,
> > what you're doing with avail is preventing the two threads from
> > running concurrently. Thus, the use of threads here is pointless.
> > In any case, what you ought to do is use condition variables to
> > efficiently signal the change in value of avail.
>
> No, it'll run alright (as it does) on a reasonable multiprocessor
> machine. Waste of time is sometimes acceptable when what one's actually
> targeting is low latency.
>

The OP only works because of happenstance. It depends on x86 not
requiring memory barriers and that the threads would be scheduled on
separate processors. But as Patrick pointed out, it turns a 2 processor
machine into a 1 processor machine, and a 1 processor machine into a
1/2 processor machine. I don't see how running at 50% efficiency
can be viewed as effecting "low latency".

> Many lock-free MPI implementations work this way, and they are miles
> ahead of any condvar/mutex based competition in latency and no worse in
> bandwidth, to put it mildly. However, there one has interprocess
> communication as opposed to the intraprocess one discussed here.
>

I don't think you can take some hypothetical solution without specifics
and use it to make a generalization in a completely different application
space. I know lock-free and would never offer the OP as a good
example of lock-free. It's not even clear that a single consumer,
single producer problem is something that lock-free applies
to since there is symmetric competition for shared resources.

> Nevertheless, this approach is legal as long as it does what it's
> supposed to.

If solving the problem in the worst possible way is what it is supposed
to do, then yes.

Joe Seigh

Alexander Supalov

Mar 28, 2003, 8:00:50 AM

Hi!

> The most you can say is that, if a program runs, no error has yet been
> observed.

You are generally right. However, we have a very precisely defined
problem here, and I hope you know just as well as I do that no
trouble (apart from that described above) will ever occur with this
program on that particular Intel.

That's the whole point: there are situations in which this naive
approach may indeed be good, and in fact it is better than the beloved
mutexes once in a while. And there are known situations when spin locks
indeed perform better than any heavier primitive.

If you don't accept that from the point of view that they may (and
certainly will) be worse elsewhere, you're indulging in dogmatism that
is known to lead nowhere. I personally prefer to be pragmatic: if it's
good where I need it, I use it. Period.

Alexander Supalov

Mar 28, 2003, 8:15:15 AM

Hi!

> I don't see how running at 50% efficiency can be viewed as effecting "low latency".

One rarely runs real HPC applications on fewer processors than there are
processes unless there's some latency hiding due to I/O or other time
consuming stuff. Generally, you do numerics on n processes over n
processors, and then you get the lowest possible latency by doing it
lock-free.

Joseph Seigh

Mar 28, 2003, 9:08:30 AM

Alexander Supalov wrote:
>
> Hi!
>
> > I don't see how running at 50% efficiency can be viewed as effecting "low latency".
>
> One rarely runs real HPC applications on fewer processors than there are
> processes unless there's some latency hiding due to I/O or other time
> consuming stuff. Generally, you do numerics on n processes over n
> processors, and then you get the lowest possible latency by doing it
> lock-free.
>

No, you get the lowest possible latency in this case by running everything as one
thread on one processor and removing the intra processor memory latency and cache
thrashing. Plus, as an added benefit, you free up a processor which you can use
to double your throughput.

Joe Seigh

Alexander Supalov

Mar 28, 2003, 9:44:09 AM

Hi!

> No, you get the lowest possible latency in this case by running everything as one
> thread on one processor and removing the intra processor memory latency and cache
> thrashing. Plus, as an added benefit, you free up a processor which you can use
> to double your throughput.

Wait a minute. We're talking about a program that has two threads
running on two processors that have to synchronize. Within that setting,
lock free is the way to go if the system works for you and not for the
marketing division.

So, I most kindly suggest that we stay within the original problem
definition. Otherwise I may also start feeling like popping aces up my
sleeve just to avoid admitting the fact that someone was in fact wrong
for a change.

Joseph Seigh

Mar 28, 2003, 11:12:16 AM

Alexander Supalov wrote:
>
> Hi!
>
> > No, you get the lowest possible latency in this case by running everything as one
> > thread on one processor and removing the intra processor memory latency and cache
> > thrashing. Plus, as an added benefit, you free up a processor which you can use
> > to double your throughput.
>
> Wait a minute. We're talking about a program that has two threads
> running on two processors that have to synchronize. Within that setting,
> lock free is the way to go if the system works for you and not for the
> marketing division.

There is an assumption on somebody's part that the two threads are simultaneously
running on separate processors. The OP never stated that this was in fact happening
and never offered any evidence that it was. POSIX threads doesn't explicitly support
this. Scheduling of threads is implementation dependent and you can't make any
assumptions on how threads will be scheduled.

So, if you accept the premise that something which would be more efficient and less
problematic if done as a single thread, should be done as two threads, then the
solution in the original posting is either optimal or really suboptimal depending on
how the actual thread scheduling works.

To look at it another way. You can do a lock-free single producer/single consumer solution
using Posix semaphores. If you compared it to the original solution it clearly works
better when the threads are scheduled on the same processor and (barring some really
bad Posix semaphore implementation) not much worse when scheduled on two processors.
So, all in all, given that you don't know what the actual scheduling is, you'd have
to conclude that a Posix semaphore solution would be the better way to go unless you
don't care about overall efficiency.

> So, I most kindly suggest that we stay within the original problem
> definition. Otherwise I may also start feeling like popping aces up my
> sleeve just to avoid admitting the fact that someone was in fact wrong
> for a change.

Those MPI and HPC aces didn't come from up my sleeves. :) Having done lock-free for
about 18 yrs now, longer if you count pushing stuff on a queue using compare and swap,
I'm not about to argue that there isn't an application for lock free. Just that this
is probably not one of them, at least in that form anyway.

Joe Seigh
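
Editor's sketch of a single producer/single consumer handoff built on POSIX semaphores, along the lines mentioned in the post above. It is an illustration under stated assumptions, not the poster's code: it assumes a platform where unnamed semaphores (sem_init) are available, and the names slot_empty, slot_full, wait_on, ITEMS, and run_semaphore_demo are hypothetical. A waiting thread sleeps inside sem_wait rather than spinning, so it behaves reasonably whether the two threads share a processor or not.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

enum { ITEMS = 1000 };      /* hypothetical bound so the example terminates */

static sem_t slot_empty;    /* counts free slots: starts at 1 */
static sem_t slot_full;     /* counts filled slots: starts at 0 */
static int data;            /* plain variable, handed over via the semaphores */

/* sem_wait() may be interrupted by a signal; retry in that case. */
static void wait_on(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

static void *prod(void *arg)
{
    (void)arg;
    for (int next = 1; next <= ITEMS; ++next) {
        wait_on(&slot_empty);   /* sleep until the consumer has taken the value */
        data = next;
        sem_post(&slot_full);   /* publish it */
    }
    return NULL;
}

static void *cons(void *arg)
{
    long *sum = arg;
    for (int i = 0; i < ITEMS; ++i) {
        wait_on(&slot_full);    /* sleep until a value is available */
        *sum += data;           /* stands in for printf() */
        sem_post(&slot_empty);  /* hand the slot back */
    }
    return NULL;
}

/* Runs both threads to completion; returns the sum of consumed values. */
long run_semaphore_demo(void)
{
    pthread_t p, c;
    long sum = 0;

    sem_init(&slot_empty, 0, 1);
    sem_init(&slot_full, 0, 0);
    pthread_create(&p, NULL, prod, NULL);
    pthread_create(&c, NULL, cons, &sum);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&slot_empty);
    sem_destroy(&slot_full);
    return sum;
}
```

POSIX lists sem_wait and sem_post among the functions that synchronize memory, so data needs neither volatile nor an explicit barrier in this version.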

Joseph Seigh

Mar 28, 2003, 11:37:45 AM

I wrote:
> So, if you accept the premise that something which would be more efficient and less
> problematic if done as a single thread, should be done as two threads, then the
> solution in the original posting is either optimal or really suboptimal depending on
> how the actual thread scheduling works.
>
> To look at it another way. You can do a lock-free single producer/single consumer solution
> using Posix semaphores. If you compared it to the original solution it clearly works
> better when the threads are scheduled on the same processor and (barring some really
> bad Posix semaphore implementation) not much worse when scheduled on two processors.
> So, all in all, given that you don't know what the actual scheduling is, you'd have
> to conclude that a Posix semaphore solution would be the better way to go unless you
> don't care about overall efficiency.

Slight correction. Technically, depending on the scheduling parameters, one of the two
threads may not run at all if both are always runnable. So suboptimal may be a bit
mild in this case.

Joe Seigh

Hillel Y. Sims

Mar 28, 2003, 10:02:36 PM

"Alexander Supalov" <sup...@pallas.com> wrote in message
news:3E830303...@pallas.com...

>
> Let's consider two processors, P1 and P2 running over a piece of shared
> memory, with whatever cache hierarchy between the main RAM and the CPU
> registers.

[..]

> What I'd expect a cache coherent system to do is as follows: P2 is able
> to sense that the old value of C possibly contained in one of its caches
> is in fact stale, and through whatever mechanism necessary, the system
> ensures that the value just committed by P1 becomes available to P2
> before the corresponding stage of the P2's pipeline considers the data
> valid for further processing.
>

Check out this link:
http://www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html

hys
--
(c) 2003 Hillel Y. Sims
hsims AT factset.com

Alexander Supalov

Mar 31, 2003, 1:56:34 AM

Hi!

I believe I covered most of the issues with the suggested
approach in the OR (original reply): cache coherency, quantization,
potential lockup, and niche applicability. The things that were missing
were memory model idiosyncrasies and scheduling problems; thanks for
them.

If a scheduler puts two worker process threads on the same processor on
an SMP machine when more processors are available, I pity the developers
who have to work there. Of course, the system may be so full that this
is the only right solution, but this is not our case, I hope.

Now, if an _application_ programmer has to take care of the order in which
memory writes are effected, I have pity for the architects of that
system - they are optimizing their stuff for the marketing division
instead of the developers and, ultimately, end users.

Joseph Seigh

Mar 31, 2003, 6:54:14 AM

Alexander Supalov wrote:
...

>
> If a scheduler puts two worker process threads on the same processor on
> an SMP machine when more processors are available, I pity the developers
> who have to work there. Of course, the system may be so full that this
> is the only right solution, but this is not our case, I hope.

Generally, you can't make any assumptions about the scheduling of threads
on processors since it is implementation dependent. You can make arguments
for either case, scheduling on same processor or on different processors.
The trade off involves the overhead of intra processor cache data transfers
vs. thread switching overhead. In the OP's case there is a lot
of shared data updating, which would normally indicate that it should run on the
same processor, except that the small granularity, plus busy waiting,
would indicate separate processors would be better. Ideally, hyperthreading
would be optimal: concurrency plus no cache problems. But the thread
scheduler generally has no awareness of these aspects, so it can't make
specific allowances for them. Remember, scheduling policy is based on
what the designers believe will improve overall system performance. They're
not trying to micro-optimise an application they never saw before. If you
are lucky, the platform you are on may allow you to make scheduling hints
or directives.

>
> Now, if an _application_ programmer has to take care of the order in which
> memory writes are effected, I have pity for the architects of that
> system - they are optimizing their stuff for the marketing division
> instead of the developers and, ultimately, end users.
>

Generally, they don't if they stick to Posix thread constructs. Otherwise
they have to use platform specific constructs, memory barriers, or use
stuff from other standards (if any) that have been ported to their platform.

Joe Seigh
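
Editor's sketch of what "use stuff from other standards" can look like today. It keeps the original flag-based handoff but expresses the ordering with C11 <stdatomic.h>, a standard that postdates this thread; in 2003 the same effect would have needed platform-specific barriers. The names ITEMS and run_atomic_demo are hypothetical, and the sched_yield() call in the wait loops is an added assumption so that the example also makes progress when both threads share one processor.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

enum { ITEMS = 1000 };      /* hypothetical bound so the example terminates */

static int data;            /* plain payload, ordered by the flag below */
static atomic_int avail;    /* 0 = slot empty, 1 = slot full */

static void *prod(void *arg)
{
    (void)arg;
    for (int next = 1; next <= ITEMS; ++next) {
        /* Acquire: the consumer's read of `data` is complete before we overwrite it. */
        while (atomic_load_explicit(&avail, memory_order_acquire))
            sched_yield();
        data = next;
        /* Release: the write to `data` is visible before the flag flips to 1. */
        atomic_store_explicit(&avail, 1, memory_order_release);
    }
    return NULL;
}

static void *cons(void *arg)
{
    long *sum = arg;
    for (int i = 0; i < ITEMS; ++i) {
        /* Acquire: pairs with the producer's release store. */
        while (!atomic_load_explicit(&avail, memory_order_acquire))
            sched_yield();
        *sum += data;           /* stands in for printf() */
        atomic_store_explicit(&avail, 0, memory_order_release);
    }
    return NULL;
}

/* Runs both threads to completion; returns the sum of consumed values. */
long run_atomic_demo(void)
{
    pthread_t p, c;
    long sum = 0;

    atomic_store(&avail, 0);
    pthread_create(&p, NULL, prod, NULL);
    pthread_create(&c, NULL, cons, &sum);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return sum;
}
```

The release/acquire pair states the ordering requirement in the source, so it does not depend on x86 happening to preserve store order, and the compiler may not reorder the data access across the flag access the way it could with plain or volatile variables.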

Joseph Seigh

Mar 31, 2003, 7:35:55 AM

I wrote:
>
> ... hyperthreading ...
>
...

> Generally, they don't if they stick to Posix thread constructs. Otherwise
> they have to use platform specific constructs, memory barriers, or use
> stuff from other standards (if any) that have been ported to their platform.
>

Actually, it's worse. If you are doing busy polling on a hyperthreaded Pentium
then you had better have a pause instruction in the polling loop. Otherwise,
to quote Intel's Pentium processor optimization manual, "On a processor with
Hyper-Threading Technology, spin-wait loops can consume a significant portion
of the execution bandwidth of the processor. One logical processor executing a
spin-wait loop could severely impact the performance of the other logical processor
doing useful work." And spin waiting can cause problems on non hyperthreaded
multi-processors as well.

Joe Seigh
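
Editor's sketch of a polling loop with a pause instruction in it, as recommended in the post above. Assumptions are labeled: _mm_pause() is the compiler intrinsic for the x86 PAUSE instruction as provided by GCC and Clang through <immintrin.h>; on other architectures the macro cpu_relax (a hypothetical name) expands to nothing. The flag uses C11 atomics, which postdate the thread, and spin_until, setter, and run_pause_demo are hypothetical helper names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()     /* x86 PAUSE: hint that this is a spin-wait */
#else
#define cpu_relax() ((void)0)       /* assumption: no equivalent hint used here */
#endif

/* Spins until *flag equals `wanted`; returns how many times it had to poll again. */
static unsigned long spin_until(atomic_int *flag, int wanted)
{
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) != wanted) {
        cpu_relax();
        ++spins;
    }
    return spins;
}

static atomic_int ready;

static void *setter(void *arg)
{
    (void)arg;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts a thread that sets the flag, spins until it is seen; returns the flag. */
int run_pause_demo(void)
{
    pthread_t t;

    atomic_store(&ready, 0);
    pthread_create(&t, NULL, setter, NULL);
    spin_until(&ready, 1);
    pthread_join(t, NULL);
    return atomic_load(&ready);
}
```

The pause hint only reduces the cost of spinning on the sibling logical processor; it does not make spinning a good idea when the other thread may not be running at all, which is the scheduling concern raised earlier in the thread.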

Alexander Supalov

Mar 31, 2003, 9:04:24 AM

Hi!

> Actually, it's worse. If you are doing busy polling on a hyperthreaded Pentium

> then you had better have a pause instruction in the polling loop...

Right, but Pentium 4 is peculiar overall. I wonder, for example, what
justified making increments, decrements, shifts, integer multiplication
and division, and some other operations frequently used for manual
optimization on other Intel processors, slower on that platform.
Predominance of the floating point ops in their target application mix?
Reliance on the clever compilers?

As for the hyperthreading, it looks pretty experimental at the moment,
and I won't be totally surprised if the next generation Pentium x will
do it in a radically different way. The typical performance increase
observed now (reportedly around 25% if the compiler is good) can hardly
justify the current hype. I'd dearly like to know how much of the MTA
experience they've taken into account while working on this feature.