Basically you overlap two 32-bit integers by one bit. The low-order count contains bits
0-31 and the high-order count contains bits 31-62. You use the overlapped bit to determine
whether the two 32-bit reads were "in sync". If they are not, you can correct the high-order
read without having to reread the counter.
To overlap by one bit, you increment the high-order count when bit 30 carries out, rather than
bit 31. These are the transitions:
0x7fffffff --> 0x80000000
0xffffffff --> 0x00000000
You can test for this with (CF==1 || OF==1) on x86. I don't know x86 assembler that well;
it looks like it would require 2 Jcc and 1 JMP instruction. Is that right?
The pseudo code for this would be
increment_atomic63:
    lock add low_count, 1
    if (CF == 1 || OF == 1)
        lock add hi_count, 1
To read the counter, you simply read the high and low counts in any order. If bit 0 of the
high-order value equals bit 31 of the low-order value, the reads were in sync and all you
have to do is shift the high-order value right by 1. If they are not in sync, you look at
bit 30 of the low-order value that was read to determine whether it's newer or older than
the high-order value that was read. If bit 30 is 0, it's newer: increment the high-order
value by 1 and shift right by 1; otherwise decrement it by 1 and shift right by 1.
Pseudo code would be
read_atomic63:
    low_temp = low_count;            // atomic fetch of aligned operand
    hi_temp  = hi_count;             // atomic fetch of aligned operand
    if (hi_temp:0 == low_temp:31)    // in sync?
        ;
    else if (low_temp:30 == 0)       // low_temp newer?
        hi_temp++;
    else                             // low_temp older
        hi_temp--;
    hi_temp >>= 1;
    return (hi_temp, low_temp)
A bit more complicated. The test for being in sync should be optimal.
The newer-or-older test on low_temp is very infrequent and doesn't need to be
as optimal.
This could be ported to any architecture that has equivalent condition-flag
testing, or anything equivalent to XADD that returns the old value so the
carry can be detected by bit testing that way.
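As a sanity check, the read-side reconciliation can be written out in plain C (a single-threaded sketch of the pseudo code above; the function name and the packing of the result are my own):

```c
#include <stdint.h>

/* Rebuild the 63-bit value from separately fetched words.  hi_temp
 * holds bits 31-62 of the counter, low_temp holds bits 0-31; bit 0 of
 * hi_temp overlaps bit 31 of low_temp. */
static uint64_t reconcile63(uint32_t hi_temp, uint32_t low_temp)
{
    if ((hi_temp & 1u) == (low_temp >> 31)) {
        /* in sync, nothing to adjust */
    } else if ((low_temp & 0x40000000u) == 0) {
        hi_temp++;      /* low word is newer */
    } else {
        hi_temp--;      /* low word is older */
    }
    return ((uint64_t)(hi_temp >> 1) << 32) | low_temp;
}
```

For instance, a stale high word read just after the low word wrapped 0xffffffff -> 0 gets corrected: reconcile63(1, 0) yields 0x100000000.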
Comments?
Joe Seigh
add dword [COUNTER],1 ;bits 0-31 ;set carry on overflow
adc dword [COUNTER+4],0 ;bits 32-63 ;add 0+CF
Zdenek Sojka
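In C, that add/adc pair amounts to incrementing the low word and propagating the carry (a non-atomic sketch just to show the carry logic; the function name is mine):

```c
#include <stdint.h>

/* Non-atomic C picture of the add/adc pair: counter[0] is bits 0-31,
 * counter[1] is bits 32-63. */
static void inc64_split(uint32_t counter[2])
{
    counter[0] += 1;        /* add [COUNTER],1  */
    if (counter[0] == 0)    /* carry out of bit 31? */
        counter[1] += 1;    /* adc [COUNTER+4],0 */
}
```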
"Joe Seigh" <spam...@crayne.org> wrote in message
news:41BC5D3D...@xemaps.com...
Some code is given in the posts, though I don't recall how much. I never got
better than abysmal performance out of any of the different versions of the
counter. I don't recall the exact performance figures, but they were more
than two orders of magnitude slower than typical increment operations on
non-shared memory. My testing was done with 4 threads on a 2-way system.
I can be convinced to dig up the code if you'd like to have a look.
-Matt
what about
fild qword [COUNTER]
fld1
faddp
fistp qword [COUNTER]
Yep, a bit slow :-(
Maybe using MMX would be faster.
Or you can set a bit somewhere and test it to see whether an add is in
progress; if so, wait until the other one finishes.
(Well, this post is just to say I know I understood you incorrectly.)
Zdenek Sojka
"Zdenek Sojka" <spam...@crayne.org> wrote in message
news:cpiba5$ee8$1...@ns.felk.cvut.cz...
I'm familiar with that discussion. The techniques either spin-wait if an update
appears to be in progress, which means they're not lock-free, or they assume
that there's a single writer or that readers will remember the previously read
value.
If you're doing lots of updates and relatively few reads, you can use a distributed
counter, which has very efficient updates and fairly expensive reads. Distributed
counters are a fairly well-known technique.
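A distributed counter can be sketched like this (slot count, layout, and padding are assumptions; each thread owns one slot, so updates need no atomics at all):

```c
#include <stdint.h>

#define NSLOTS 16   /* e.g. one slot per thread; size is illustrative */

/* Pad each slot to a cache line so writers don't share lines. */
struct slot {
    uint64_t n;
    char pad[64 - sizeof(uint64_t)];
};
static struct slot slots[NSLOTS];

/* Very efficient update: each thread increments only its own slot. */
static void dist_inc(int self)
{
    slots[self].n += 1;
}

/* Fairly expensive read: sum every slot.  Under concurrent updates
 * this is an approximate snapshot; it is exact when writers are idle. */
static uint64_t dist_read(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < NSLOTS; i++)
        sum += slots[i].n;
    return sum;
}
```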
The only restriction on the algorithm I've developed is that the fetches of the high-
and low-order words have to be within 2**30 increments of the counter of each other.
I have some gcc inline assembler I've been experimenting with here:
static __inline__ void atomic_inc63(uint64_t *v)
{
    __asm__ __volatile__(
        "lock addl $1, 0(%0) ;\n"   // low order word
        "jo 1f ;\n"
        "jc 1f ;\n"
        "jmp 2f ;\n"
        // increment high order word if bit 30 carries out
        "1: lock incl 4(%0) ;\n"    // high order word
        "2: \n"
        :
        : "r" (v)
        : "cc", "memory"
    );
}
static __inline__ uint64_t atomic_read63(uint64_t *v)
{
    uint64_t ret;
    __asm__ __volatile__(
        "mov 4(%0), %%edx ;\n"      // high order word
        "mov 0(%0), %%eax ;\n"      // low order word
        // if (high:0 != low:31)    # not in sync
        "mov %%edx, %%esi ;\n"      // copy high order word
        "shl $31, %%esi ;\n"        // bit 0 -> bit 31
        "xorl %%eax, %%esi ;\n"     // compare bit 31's
        "jns 1f ;\n"                // if equal goto 1
        // if (low:30 == 0b1)       # low order older than high order?
        "testl $0x40000000, %%eax ;\n"
        "jz 2f ;\n"
        "decl %%edx ;\n"            // high order is new, decrement
        "jmp 1f ;\n"
        // else (low:30 == 0b0)
        "2: incl %%edx ;\n"         // high order is old, increment
        // endif (high:0 != low:31)
        "1: shr %%edx ;\n"          // high order word >> 1
        "mov %%eax, 0(%1) ;\n"
        "mov %%edx, 4(%1) ;\n"
        :
        : "r" (v), "r" (&ret)
        : "cc", "memory", "eax", "edx", "esi"
    );
    return ret;
}
Joe Seigh
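For what it's worth, the carry-out test doesn't strictly need the flags: bits 0-30 of the old low word are all ones exactly at the two transitions listed above, so the same counter can be sketched portably with C11 atomics (a sketch, not the posted code; names are mine):

```c
#include <stdint.h>
#include <stdatomic.h>

static _Atomic uint32_t low_count, hi_count;

/* Portable sketch of increment_atomic63: the bit-30 carry-out happens
 * exactly when bits 0-30 of the pre-increment low word are all ones,
 * i.e. at 0x7fffffff -> 0x80000000 and 0xffffffff -> 0x00000000. */
static void atomic_inc63_c11(void)
{
    uint32_t old = atomic_fetch_add(&low_count, 1);
    if ((old & 0x7fffffffu) == 0x7fffffffu)
        atomic_fetch_add(&hi_count, 1);
}

/* Portable sketch of read_atomic63: reconcile via the overlapped bit. */
static uint64_t atomic_read63_c11(void)
{
    uint32_t hi = atomic_load(&hi_count);
    uint32_t lo = atomic_load(&low_count);
    if ((hi & 1u) != (lo >> 31)) {      /* not in sync */
        if ((lo & 0x40000000u) == 0)
            hi++;                       /* low word is newer */
        else
            hi--;                       /* low word is older */
    }
    return ((uint64_t)(hi >> 1) << 32) | lo;
}
```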
Actually, you don't need the lock: prefix; read-modify-write
instructions are guaranteed atomic. And you are good
for 2^32 counts between
add [count0],1
adc [count1],0
It works until a carry gets lost, which may not be
possible if all incrementers follow the same semantics.
The problem comes with readers if there are carries
yet to be adc'ed.
There is an atomic `cmpxchg8b` instruction that
might be useful for you.
-- Robert
Reading it atomically is a problem. cmpxchg8b would be
an expensive way to read it.
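For reference, the cmpxchg8b read trick looks roughly like this (a sketch using a GCC __sync builtin, which emits lock cmpxchg8b on 32-bit x86; it's expensive because the instruction always runs a locked transaction even when it stores nothing useful):

```c
#include <stdint.h>

/* Atomic 64-bit read via compare-and-swap: cmpxchg8b always returns
 * the current memory value, and comparing against 0 while
 * "exchanging" 0 leaves the value unchanged (if it really was 0,
 * storing 0 is still a no-op). */
static uint64_t atomic_read64_cas(uint64_t *p)
{
    return __sync_val_compare_and_swap(p, (uint64_t)0, (uint64_t)0);
}
```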
Actually, Terje's pointer-swapping technique can be made to work
for multiple writers if you use 3 instead of 2 buffers. His
technique involves switching the pointer to a new 64-bit buffer
instead of carrying out bit 31. With 3 buffers you can arrange
that the switched-to buffer is already set up, so the switch is
atomic. So for example, if you have
p -> 00000003 ffffffff
00000004 00000000
00000002 ffffffff
The increment_unless_bit31carry fails, so instead you try to switch
p to point to the next buffer. If the switch works, you prepare
the buffer after that for the next bit 31 carry. If the switch
fails, you try to increment the count in the next buffer. So after
the switch, and after the next buffer has been prepared, it looks like
00000003 ffffffff
p -> 00000004 00000000
00000005 00000000
Slightly more expensive increment operation, since you're using compare
and swap, but reads are very fast.
Joe Seigh
I'm really not sure how pointer switching helps.
How does it safeguard the data when the process is
interrupted between the set/test and the switch?
> fails you try to increment the count in the next buffer. So after
> the switch and after the next buffer was prepared, it looks like
>
> 00000003 ffffffff
> p -> 00000004 00000000
> 00000005 00000000
>
> Slightly more expensive increment operation since you're using compare
> and swap but reads are very fast.
-- Robert
Robert Redelmeier wrote:
>
> Joe Seigh <spam...@crayne.org> wrote:
> > Actually, Terje's pointer-swapping technique can be made to work
> > for multiple writers if you use 3 instead of 2 buffers. His
> > technique involves switching the pointer to a new 64-bit buffer
> > instead of carrying out bit 31. With 3 buffers you can arrange
> > that the switched-to buffer is already set up, so the switch is
> > atomic. So for example, if you have
> >
> > p -> 00000003 ffffffff
> > 00000004 00000000
> > 00000002 ffffffff
> >
> > The increment_unless_bit31carry fails so instead you try to switch
> > p to point to the next buffer. If the switch works, you prepare
> > the buffer after that for the next bit 31 carry. If the switch
>
> I'm really not sure how pointer switching helps.
> How does it safeguard the data when the process is
> interrupted between the set/test and the switch?
There's no set/test. It's all done with one successful compare
and swap, plus maybe some unsuccessful ones prior to that. So just
one atomic update to storage, plus exclusive access to the next
available buffer for as long as it takes the other threads to do
2**32 increments of the current buffer.
Atomically swapping pointers is a standard technique used in lock-free
programming. RCU (read-copy-update), DCL (double-checked locking),
lock-free LIFO and FIFO queues, and some stuff in Java all use it.
Google "lock-free" for more information.
Joe Seigh
Read-modify-write instructions are not atomic from the bus's perspective.
The lock prefix is necessary for the code to run on multi-CPU machines.
-Matt
I thought that bus-snooping and MESI cache protocols took care of this?
-- Robert
Timings on an 800 MHz Pentium are (seconds per operation):
*(uint64_t *) = 0.000000003
(*(uint64_t *))++ = 0.000000006
atomic_inc63 = 0.000000032
atomic_read63 = 0.000000009
cas64 = 0.000000080
cas32 = 0.000000043
atomic_inc64 = 0.000000066
atomic_inc64 is the modified pointer-swapping technique written
in C. It could be optimized a bit, closer to the cas32 timing.
Joe Seigh
It doesn't have anything to do with cache coherency. Read-modify-write ops
are broken up internally into 3 discrete operations. If someone else
completes a store after the load completes but before the store completes,
the answer will be wrong. Excepting rep/repe prefixed instructions, the
processor makes every x86 instruction atomic with respect to interrupts, but
not with respect to bus operations. Just to make sure my understanding was
correct, I verified with the following program on a dual-processor machine:
#include <stdint.h>
#include <stdio.h>

uint32_t counter = 0;

void thread_proc(void)
{
    // Loop 2^31 times
    for (unsigned int i = 0; i < 0x80000000; i++) {
        __asm__ __volatile__("incl %0" : "+m" (counter));
    }
}

int main(void)
{
    // Calling twice means 2^32 increments and the counter wraps to 0
    thread_proc();
    thread_proc();
    printf("ctr = %u\n", counter);
    // Create 2 threads of thread_proc and join them. Omitted for brevity
    printf("ctr = %u\n", counter);
    return 0;
}
The output is this:
ctr = 0
ctr = 3029308901
Adding the lock prefix causes the output to become this:
ctr = 0
ctr = 0
-Matt
Nope. Without the lock, a waiting CPU could theoretically acquire the
relevant cache line between the read and the write/update bus cycle.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Are you willing to post source for these? Where can one obtain it? :-)
-Matt
The code for the 63 bit atomic counter has been posted already so I
assume you mean the 64 bit counter. I only did a quick and dirty
prototype in C to verify the carry out logic and pointer switching.
typedef struct atomic64_tt {
    union {
        uint32_t count_[2];
        uint64_t count;
    } x;
    struct atomic64_tt *next;
} atomic64_t;

inline void atomic_inc64(atomic64_t **p) {
    uint32_t oldcount0;
    atomic64_t *lp;     // local

    lp = *p;
    do {
        oldcount0 = lp->x.count_[0];
        while (oldcount0 < UINT32_MAX
            && !cas32(&((*p)->x.count_[0]), &oldcount0, (oldcount0 + 1)));
    }
    // if carryout switch to next counter;
    // exit if switch successful, else increment next counter
    while (oldcount0 == UINT32_MAX && !cas32(p, &lp, (uint32_t)(lp->next)));

    // if switch successful, set up next counter
    if (oldcount0 == UINT32_MAX) {
        lp = *p;
        lp->next->x.count_[1] = lp->x.count_[1] + 1;
        lp->next->x.count_[0] = 0;
    }
}
The cas32 I use returns 1 if the cas succeeds, 0 otherwise. The 2nd parameter is the
compare value, the 3rd the exchange value.
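That interface can be sketched with a GCC builtin (an assumption about the exact semantics: the compare value is taken and updated through a pointer, mirroring the way lock cmpxchg leaves the old value in eax so the caller can retry without rereading):

```c
#include <stdint.h>

/* cas32 as described: returns 1 on success, 0 on failure; on failure
 * the value actually found is written back through the compare
 * pointer. */
static int cas32(uint32_t *dest, uint32_t *cmp, uint32_t xchg)
{
    uint32_t old = __sync_val_compare_and_swap(dest, *cmp, xchg);
    if (old == *cmp)
        return 1;
    *cmp = old;
    return 0;
}
```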
initialization is more or less
atomic64_t *p;      // global pointer
atomic64_t z[3];

z[0] = {0x0000000000000000ULL, &z[1]};
z[1] = {0x0000000100000000ULL, &z[2]};
z[2] = {0x0000000000000000ULL, &z[0]};
That's probably not valid C for the initialization but you get the idea.
The inner loop is the fast path and you'd probably code that in asm, optimize
the redundant oldcount0 tests, and use slightly cleaner endian-aware word access logic.
Reading the value goes through the same pointer passed to atomic_inc64():
p->x.count
Joe Seigh