Basically you overlap two 32-bit integers by one bit. The low-order count contains bits
0-31 and the high-order count contains bits 31-62. You use the overlapped bit to determine
whether the two 32-bit reads were "in sync". If they are not, you can correct the high-order
read without having to reread the counter.
To overlap by one bit, you increment the high-order count when bit 30 carries out, rather than
bit 31. These are the transitions:
0x7fffffff --> 0x80000000
0xffffffff --> 0x00000000
You can test for this with (CF==1 || OF==1) on x86. I don't know x86 assembler that well;
it looks like it would require 2 Jcc and 1 JMP instruction. Is that right?
The pseudo code for this would be
increment_atomic63:
    lock add low_count, 1
    if (CF == 1 || OF == 1)
        lock add hi_count, 1
To read the counter, you simply read the high and low counts in any order. If bit 0 of the
high-order value equals bit 31 of the low-order value, the reads were in sync and all you
have to do is shift the high-order value right by 1. If they are not in sync, you look at
bit 30 of the low-order value that was read to determine whether it's newer or older than
the high-order value that was read. If bit 30 is 0, it's newer: increment the high-order
value by 1 and shift right by 1; otherwise decrement it by 1 and shift right by 1.
Pseudo code would be
read_atomic63:
    low_temp = low_count;            // atomic fetch of aligned operand
    hi_temp  = hi_count;             // atomic fetch of aligned operand
    if (hi_temp:0 == low_temp:31)    // in sync?
        ;
    else if (low_temp:30 == 0)       // low_temp newer?
        hi_temp++;
    else                             // low_temp older
        hi_temp--;
    hi_temp >>= 1;
    return (hi_temp, low_temp)
A bit more complicated. The test for being in sync should be optimal.
The newer-or-older test on low_temp is very infrequent and doesn't need to be
as optimal.
This could be ported to any architecture that has equivalent condition-flag
testing, or anything equivalent to XADD that returns the old value so the
carry can be detected by bit testing that way.
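As a sanity check, the read-side reconciliation can be written out in plain C (a single-threaded sketch of the pseudo code above; the function name and the packing of the result are my own):

```c
#include <stdint.h>

/* Rebuild the 63-bit value from separately fetched words.  hi_temp
 * holds bits 31-62 of the counter, low_temp holds bits 0-31; bit 0 of
 * hi_temp overlaps bit 31 of low_temp. */
static uint64_t reconcile63(uint32_t hi_temp, uint32_t low_temp)
{
    if ((hi_temp & 1u) == (low_temp >> 31)) {
        /* in sync, nothing to adjust */
    } else if ((low_temp & 0x40000000u) == 0) {
        hi_temp++;      /* low word is newer */
    } else {
        hi_temp--;      /* low word is older */
    }
    return ((uint64_t)(hi_temp >> 1) << 32) | low_temp;
}
```

For instance, a stale high word read just after the low word wrapped 0xffffffff -> 0 gets corrected: reconcile63(1, 0) yields 0x100000000.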
Comments?
Joe Seigh
add dword [COUNTER],1 ;bits 0-31 ;set carry on overflow
adc dword [COUNTER+4],0 ;bits 32-63 ;add 0+CF
Zdenek Sojka
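In C, that add/adc pair amounts to incrementing the low word and propagating the carry (a non-atomic sketch just to show the carry logic; the function name is mine):

```c
#include <stdint.h>

/* Non-atomic C picture of the add/adc pair: counter[0] is bits 0-31,
 * counter[1] is bits 32-63. */
static void inc64_split(uint32_t counter[2])
{
    counter[0] += 1;        /* add [COUNTER],1  */
    if (counter[0] == 0)    /* carry out of bit 31? */
        counter[1] += 1;    /* adc [COUNTER+4],0 */
}
```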
"Joe Seigh" <spam...@crayne.org> wrote in message
news:41BC5D3D...@xemaps.com...
Some code is given in the posts, though I don't recall how much. I never got
better than abysmal performance out of any of the different versions of the
counter. I don't recall the exact performance figures, but they were more
than two orders of magnitude slower than typical increment operations on
non-shared memory. My testing was done with 4 threads on a 2-way system.
I can be convinced to dig up the code if you'd like to have a look.
-Matt
what about
fild qword [COUNTER]
fld1
faddp
fistp qword [COUNTER]
Yep, a bit slow :-(
Maybe using MMX would be faster.
Or you can set a bit somewhere and test it to see whether an add is in
progress; if so, wait until the other one finishes.
(Well, this post is just to say I know I understood you incorrectly.)
Zdenek Sojka
"Zdenek Sojka" <spam...@crayne.org> wrote in message
news:cpiba5$ee8$1...@ns.felk.cvut.cz...
I'm familiar with that discussion. The techniques either spin-wait if an update
appears to be in progress, which means they're not lock-free, or they assume
that there's a single writer or that readers will remember the previously read
value.
If you're doing lots of updates and relatively few reads, you can use a distributed
counter, which has very efficient updates and fairly expensive reads. Distributed
counters are a fairly well-known technique.
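A distributed counter can be sketched like this (slot count, layout, and padding are assumptions; each thread owns one slot, so updates need no atomics at all):

```c
#include <stdint.h>

#define NSLOTS 16   /* e.g. one slot per thread; size is illustrative */

/* Pad each slot to a cache line so writers don't share lines. */
struct slot {
    uint64_t n;
    char pad[64 - sizeof(uint64_t)];
};
static struct slot slots[NSLOTS];

/* Very efficient update: each thread increments only its own slot. */
static void dist_inc(int self)
{
    slots[self].n += 1;
}

/* Fairly expensive read: sum every slot.  Under concurrent updates
 * this is an approximate snapshot; it is exact when writers are idle. */
static uint64_t dist_read(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < NSLOTS; i++)
        sum += slots[i].n;
    return sum;
}
```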
The only restriction on the algorithm I've developed is that the fetches of the high-
and low-order words have to be within 2**30 increments of the counter of each other.
I have some gcc inline assembler I've been experimenting with here:
static __inline__ void atomic_inc63(uint64_t *v)
{
    __asm__ __volatile__(
        "lock addl $1, 0(%0) ;\n"   // low order word
        "jo 1f ;\n"
        "jc 1f ;\n"
        "jmp 2f ;\n"
        // increment high order word if bit 30 carries out
        "1: lock incl 4(%0) ;\n"    // high order word
        "2: \n"
        :
        : "r" (v)
        : "cc", "memory"
    );
}
static __inline__ uint64_t atomic_read63(uint64_t *v)
{
    uint64_t ret;
    __asm__ __volatile__(
        "mov 4(%0), %%edx ;\n"      // high order word
        "mov 0(%0), %%eax ;\n"      // low order word
        // if (high:0 != low:31)    # not in sync
        "mov %%edx, %%esi ;\n"      // copy high order word
        "shl $31, %%esi ;\n"        // bit 0 -> bit 31
        "xorl %%eax, %%esi ;\n"     // compare bit 31's
        "jns 1f ;\n"                // if equal goto 1
        // if (low:30 == 0b1)       # low order older than high order?
        "testl $0x40000000, %%eax ;\n"
        "jz 2f ;\n"
        "decl %%edx ;\n"            // high order is new, decrement
        "jmp 1f ;\n"
        // else (low:30 == 0b0)
        "2: incl %%edx ;\n"         // high order is old, increment
        // endif (high:0 != low:31)
        "1: shr %%edx ;\n"          // high order word >> 1
        "mov %%eax, 0(%1) ;\n"
        "mov %%edx, 4(%1) ;\n"
        :
        : "r" (v), "r" (&ret)
        : "cc", "memory", "eax", "edx", "esi"
    );
    return ret;
}
Joe Seigh
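For what it's worth, the carry-out test doesn't strictly need the flags: bits 0-30 of the old low word are all ones exactly at the two transitions listed above, so the same counter can be sketched portably with C11 atomics (a sketch, not the posted code; names are mine):

```c
#include <stdint.h>
#include <stdatomic.h>

static _Atomic uint32_t low_count, hi_count;

/* Portable sketch of increment_atomic63: the bit-30 carry-out happens
 * exactly when bits 0-30 of the pre-increment low word are all ones,
 * i.e. at 0x7fffffff -> 0x80000000 and 0xffffffff -> 0x00000000. */
static void atomic_inc63_c11(void)
{
    uint32_t old = atomic_fetch_add(&low_count, 1);
    if ((old & 0x7fffffffu) == 0x7fffffffu)
        atomic_fetch_add(&hi_count, 1);
}

/* Portable sketch of read_atomic63: reconcile via the overlapped bit. */
static uint64_t atomic_read63_c11(void)
{
    uint32_t hi = atomic_load(&hi_count);
    uint32_t lo = atomic_load(&low_count);
    if ((hi & 1u) != (lo >> 31)) {      /* not in sync */
        if ((lo & 0x40000000u) == 0)
            hi++;                       /* low word is newer */
        else
            hi--;                       /* low word is older */
    }
    return ((uint64_t)(hi >> 1) << 32) | lo;
}
```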
Actually, you don't need the lock: prefix; read-modify-write
instructions are guaranteed atomic. And you are good
for 2^32 counts between
add [count0],1
adc [count1],0
It works until a carry gets lost, which may not be
possible if all incrementers follow the same semantics.
The problem comes with readers if there are carries
yet to be adc'ed.
There is an atomic `cmpxchg8b` instruction that
might be useful for you.
-- Robert
Reading it atomically is a problem. cmpxchg8b would be
an expensive way to read it.
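For reference, the cmpxchg8b read trick looks roughly like this (a sketch using a GCC __sync builtin, which emits lock cmpxchg8b on 32-bit x86; it's expensive because the instruction always runs a locked transaction even when it stores nothing useful):

```c
#include <stdint.h>

/* Atomic 64-bit read via compare-and-swap: cmpxchg8b always returns
 * the current memory value, and comparing against 0 while
 * "exchanging" 0 leaves the value unchanged (if it really was 0,
 * storing 0 is still a no-op). */
static uint64_t atomic_read64_cas(uint64_t *p)
{
    return __sync_val_compare_and_swap(p, (uint64_t)0, (uint64_t)0);
}
```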
Actually, Terje's pointer-swapping technique can be made to work
for multiple writers if you use 3 instead of 2 buffers. His
technique involves switching the pointer to a new 64-bit buffer
instead of carrying out bit 31. With 3 buffers you can arrange
that the switched-to buffer is already set up, so the switch is
atomic. So for example, if you have
p -> 00000003 ffffffff
00000004 00000000
00000002 ffffffff
The increment_unless_bit31carry fails, so instead you try to switch
p to point to the next buffer. If the switch works, you prepare
the buffer after that for the next bit 31 carry. If the switch
fails, you try to increment the count in the next buffer. So after
the switch, and after the next buffer has been prepared, it looks like
00000003 ffffffff
p -> 00000004 00000000
00000005 00000000
Slightly more expensive increment operation, since you're using compare
and swap, but reads are very fast.
Joe Seigh
I'm really not sure how pointer switching helps.
How does it safeguard the data when the process is
interrupted between the set/test and the switch?
> fails you try to increment the count in the next buffer. So after
> the switch and after the next buffer was prepared, it looks like
>
> 00000003 ffffffff
> p -> 00000004 00000000
> 00000005 00000000
>
> Slightly more expensive increment operation since you're using compare
> and swap but reads are very fast.
-- Robert
Robert Redelmeier wrote:
>
> Joe Seigh <spam...@crayne.org> wrote:
> > Actually, Terje's pointer-swapping technique can be made to work
> > for multiple writers if you use 3 instead of 2 buffers. His
> > technique involves switching the pointer to a new 64-bit buffer
> > instead of carrying out bit 31. With 3 buffers you can arrange
> > that the switched-to buffer is already set up, so the switch is
> > atomic. So for example, if you have
> >
> > p -> 00000003 ffffffff
> > 00000004 00000000
> > 00000002 ffffffff
> >
> > The increment_unless_bit31carry fails so instead you try to switch
> > p to point to the next buffer. If the switch works, you prepare
> > the buffer after that for the next bit 31 carry. If the switch
>
> I'm really not sure how pointer switching helps.
> How does it safeguard the data when the process is
> interrupted between the set/test and the switch?
There's no set/test. It's all done with one successful compare
and swap, plus maybe some unsuccessful ones prior to that. So just
one atomic update to storage, plus exclusive access to the next
available buffer for as long as it takes the other threads to do
2**32 increments of the current buffer.
Atomically swapping pointers is a standard technique used in lock-free
programming. RCU (read-copy-update), DCL (double-checked locking),
lock-free LIFO and FIFO queues, and some stuff in Java all use it.
Google "lock-free" for more information.
Joe Seigh
Read-modify-write instructions are not atomic from the bus's perspective.
The lock prefix is necessary for the code to run on multi-CPU machines.
-Matt
I thought that bus-snooping and MESI cache protocols took care of this?
-- Robert
Timings on an 800 MHz Pentium are (seconds per operation):
*(uint64_t *) = 0.000000003
(*(uint64_t *))++ = 0.000000006
atomic_inc63 = 0.000000032
atomic_read63 = 0.000000009
cas64 = 0.000000080
cas32 = 0.000000043
atomic_inc64 = 0.000000066
atomic_inc64 is the modified pointer-swapping technique written
in C. It could be optimized a bit, closer to the cas32 timing.
Joe Seigh
It doesn't have anything to do with cache coherency. Read-modify-write ops
are broken up internally into 3 discrete operations. If someone else
completes a store after the load completes but before the store completes,
the answer will be wrong. Excepting rep/repe prefixed instructions, the
processor makes every x86 instruction atomic with respect to interrupts, but
not with respect to bus operations. Just to make sure my understanding was
correct, I verified with the following program on a dual-processor machine:
#include <stdint.h>
#include <stdio.h>

uint32_t counter = 0;

void thread_proc(void)
{
    // Loop 2^31 times
    for (unsigned int i = 0; i < 0x80000000; i++) {
        __asm__ __volatile__("incl %0" : "+m" (counter));
    }
}

int main(void)
{
    // Calling twice means 2^32 increments and the counter wraps to 0
    thread_proc();
    thread_proc();
    printf("ctr = %u\n", counter);
    // Create 2 threads of thread_proc and join them. Omitted for brevity
    printf("ctr = %u\n", counter);
    return 0;
}
The output is this:
ctr = 0
ctr = 3029308901
Adding the lock prefix causes the output to become this:
ctr = 0
ctr = 0
-Matt
Nope. Without the lock, a waiting CPU could theoretically acquire the
relevant cache line between the read and the write/update bus cycle.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Are you willing to post source for these? Where can one obtain it? :-)
-Matt
The code for the 63 bit atomic counter has been posted already so I
assume you mean the 64 bit counter. I only did a quick and dirty
prototype in C to verify the carry out logic and pointer switching.
typedef struct atomic64_tt {
    union {
        uint32_t count_[2];
        uint64_t count;
    } x;
    struct atomic64_tt *next;
} atomic64_t;

inline void atomic_inc64(atomic64_t **p) {
    uint32_t oldcount0;
    atomic64_t *lp;     // local

    lp = *p;
    do {
        oldcount0 = lp->x.count_[0];
        while (oldcount0 < UINT32_MAX
            && !cas32(&((*p)->x.count_[0]), &oldcount0, (oldcount0 + 1)));
    }
    // if carryout switch to next counter;
    // exit if switch successful, else increment next counter
    while (oldcount0 == UINT32_MAX && !cas32(p, &lp, (uint32_t)(lp->next)));

    // if switch successful, set up next counter
    if (oldcount0 == UINT32_MAX) {
        lp = *p;
        lp->next->x.count_[1] = lp->x.count_[1] + 1;
        lp->next->x.count_[0] = 0;
    }
}
The cas32 I use returns 1 if the cas succeeds, 0 otherwise. The 2nd parameter is the
compare value, the 3rd the exchange value.
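That interface can be sketched with a GCC builtin (an assumption about the exact semantics: the compare value is taken and updated through a pointer, mirroring the way lock cmpxchg leaves the old value in eax so the caller can retry without rereading):

```c
#include <stdint.h>

/* cas32 as described: returns 1 on success, 0 on failure; on failure
 * the value actually found is written back through the compare
 * pointer. */
static int cas32(uint32_t *dest, uint32_t *cmp, uint32_t xchg)
{
    uint32_t old = __sync_val_compare_and_swap(dest, *cmp, xchg);
    if (old == *cmp)
        return 1;
    *cmp = old;
    return 0;
}
```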
initialization is more or less
atomic64_t *p;      // global pointer
atomic64_t z[3];

z[0] = {0x0000000000000000ULL, &z[1]};
z[1] = {0x0000000100000000ULL, &z[2]};
z[2] = {0x0000000000000000ULL, &z[0]};
That's probably not valid C for the initialization but you get the idea.
The inner loop is the fast path and you'd probably code that in asm, optimize
the redundant oldcount0 tests, and use slightly cleaner endian-aware word access logic.
Reading the value goes through the same pointer passed to atomic_inc64():
p->x.count
Joe Seigh