I need to have multiple threads increment/decrement an int atomically, and I
would like to avoid locking for performance reasons.
I thought about using memory barriers, but I'm not used to that. How should I
proceed?
Or should/can I rely on gcc atomic builtins?
In the end, I need it to be as fast as possible.
--
Olivier
--
You received this message because you are subscribed to the Google Groups "android-ndk" group.
To post to this group, send email to andro...@googlegroups.com.
To unsubscribe from this group, send email to android-ndk...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/android-ndk?hl=en.
On 06/18/2010 03:55 PM, mic _ wrote:
> ARM doesn't support arithmetic operations like ADD/SUB on memory
> locations. The value needs to be read from memory, modified, and written
> back in separate instructions. I'm not an expert on threads but it seems
> to me like the scheduler could switch to another thread at any of those
> three steps if you leave the variable without any kind of lock.
>
> /Michael
>
> On Fri, Jun 18, 2010 at 3:05 PM, Olivier Guilyardi <li...@samalyse.com
> <mailto:li...@samalyse.com>> wrote:
>
> Hi,
>
> I need to have multiple threads increment/decrement an int
> atomically, and I
> would like to avoid locking for performance reasons.
>
> I thought about using memory barriers, but I'm not used to that. How
> should I
> proceed?
>
> Or should/can I rely on gcc atomic builtins?
>
> In the end, I need it to be as fast as possible.
>
> --
> Olivier
>
>
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "android-ndk" group.
> To post to this group, send email to andro...@googlegroups.com
> <mailto:andro...@googlegroups.com>.
> To unsubscribe from this group, send email to
> android-ndk...@googlegroups.com
> <mailto:android-ndk%2Bunsu...@googlegroups.com>.
Right, that was irrelevant.
>
> Atomic operations on ARMv6 and later can be implemented with LDREX/
> STREX; referencing a different page on that same site:
>
> http://www.keil.com/support/man/docs/armasm/armasm_cihbghef.htm
>
> Be really sure you want to do this. Memory barriers *do* matter on
> SMP hardware, and the penalties for getting it wrong can be subtle.
> The pthread calls will always be correct.
>
> I'm not sure offhand what the state of the gcc builtins is.
I tried to use __sync_fetch_and_add(), compiling was ok, but it wasn't found
when linking.
The trouble with LDREX/STREX is that this isn't supported on ARMv5.
So... I made quite a lot of tests, using:
- a custom atomic_add(): http://lwn.net/Articles/314561/
- Bionic's __atomic_inc() and __atomic_dec() from NDK's <sys/atomics.h>
- and using a semaphore as lock, for performance comparison
The code is here:
http://svn.samalyse.com/android/atomics/jni/atomicstest.c
It starts two threads one which increment an int N times, and another N-1 times.
A result of 1 is /likely/ to indicate success.
You can actually run it with:
$ http://svn.samalyse.com/android/atomics
$ cd atomics
$ ./bootstrap.sh
$ make
This assumes that ndk-build, ant and adb are on your PATH, and that you have a
single device plugged in.
Here are the results on a HTC Magic 32B, running Android 1.6:
=============== Starting Atomics Tests ===============
ADD thread starting for 10000000 iterations using simple incrementation
SUB thread starting for 9999999 iterations using simple decrementation
ADD thread done in 2183ms
SUB thread done in 2190ms
Threads joined, result: 755919 (FAILURE)
ADD thread starting for 10000000 iterations using custom atomic_add()
SUB thread starting for 9999999 iterations using custom atomic_add()
ADD thread done in 3240ms
SUB thread done in 3226ms
Threads joined, result: 1 (SUCCESS)
ADD thread starting for 10000000 iterations using Bionic's __atomic_int()
SUB thread starting for 9999999 iterations using Bionic's __atomic_dec()
ADD thread done in 4563ms
SUB thread done in 4553ms
Threads joined, result: 1 (SUCCESS)
ADD thread starting for 10000000 iterations using locks
SUB thread starting for 9999999 iterations using locks
SUB thread done in 69140ms
ADD thread done in 69236ms
Threads joined, result: 1 (SUCCESS)
=============== Finished Atomics Tests ===============
First thing is that without using atomics or locks, the test fails, which was
expected.
Then, using locks is about 20 times slower than atomics. Also, the custom
atomic_add() seems to perform a little better than Bionic calls, but are these
later public anyway?
I suppose it might worth trying with a pthread mutex too, to see how fast that is.
--
Olivier
Errata:
> It starts two threads one which increment an int N times, and another N-1 times.
> A result of 1 is /likely/ to indicate success.
I meant: ... and another which decrement it N-1 times.
> You can actually run it with:
> $ http://svn.samalyse.com/android/atomics
> $ cd atomics
> $ ./bootstrap.sh
> $ make
First line should be: svn co http://svn.samalyse.com/android/atomics
--
Olivier
If I understand you correctly, the kernel will issue LDREX/STREX from "my"
atomic_add() if it's running on ARMv6, but using LDREX/STREX directly should
still be 5x faster. That's quite a lot.
I see that the NDK targets -march=armv5te, but are there really devices which
are limited to the ARMv5 instruction set? Or are they all/mostly ARMv6?
>> Then, using locks is about 20 times slower than atomics. Also, the custom
>> atomic_add() seems to perform a little better than Bionic calls, but are these
>> later public anyway?
>
> The bionic mutex and sem calls are simple atomic operations in the
> absence of contention. When locks start colliding they'll sleep on
> futexes which is going to slow you down some. 20x seems a bit
> extreme. Are the numbers closer if you only execute one thread,
> avoiding contention?
Starting only the incrementing thread, the runtime is 34s instead of 69s, which
is half runtime for locking 1M times instead of 2M. So it really seems to come
from the locks themselves, because with one thread they don't collide at all.
I think it would be interesting to see if such a thing happen on a standard
Linux system. I may try that..
> The bionic atomic call is going to take longer than a local atomic
> call because it's in a different .so; I found that executing the same
> atomic op across .so boundaries took 56ns, vs. 25ns for a local
> function, vs. 8ns for an inline. (We're in the process of making
> substantial changes to the atomic ops right now -- DO NOT use the
> private bionic functions, it's highly likely they'll change.)
Okay, private, got it. Can you recommend the custom atomic_add() in my tests
then? Is this safe?
Actually, I would quite have like to hear about someone running those tests on a
device with an ARMv7 cpu, since I don't own one. The tests are very easy to run.
It's not only a matter of speed. I'd like to know if those things are reliable.
>> I suppose it might worth trying with a pthread mutex too, to see how fast that is.
>
> I expect it'll be similar to the sem results; possibly a bit worse
> since there's slightly more mechanism involved in a mutex than a
> semaphore.
Are you sure? David recently told that mutexes should be faster in Froyo,
because they rely on private futexes instead of standard ones. But I don't have
a device running Froyo (and didn't check yet the latest ROMs out there ;).
--
Olivier
> Starting only the incrementing thread, the runtime is 34s instead of 69s, which
> is half runtime for locking 1M times instead of 2M. So it really seems to come
> from the locks themselves, because with one thread they don't collide at all.
I meant: ".. for locking 10M times instead of 20M.".
--
Olivier
Aside: is there a stable futex() syscall that will survive this
reorganisation, or should I embed my own syscall stub? (will even the
kernel ABI be preserved?)
- Gus
--
Olivier
--
You received this message because you are subscribed to the Google Groups "android-ndk" group.
Okay, let's forget it, ARMv5 is the target, and I don't want to have a different
implementation for that kind of thing if I ever target the new ARMv7 support.
> Results from a Nexus One:
Good cpu I see :)
>
> Clearly gcc is turning the trivial add-loop into a single store.
> Introducing a new function:
That's weird, it doesn't do that here with APP_OPTIM:=debug. I suppose you're
not using the Application.mk which is provided in the test suite.
> I added a new test for inline LDREX/STREX, and got:
Would you mind doing an svn diff and posting it here so that I patch the tests?
> I'm a little baffled by the performance of the sem_* functions.
Actually, I have noticed for a long time that they were slow, and I now try to
avoid locking in my app. Whenever it's not critical I don't lock anymore,
although it often looks non-academic and could cause subtle (harmless) issues.
--
Olivier
svn diff?
> I'll file a bug about the semaphore performance.
Cool, my tests sound useful then :)
> I'll reiterate my recommendation to use mutexes over custom lock
> code. It *will* break on SMP if you don't also get the memory
> barriers in, and my experiments have shown that using barrier
> instructions on non-SMP hardware is a very bad idea.
There's one thing that you didn't answer: is the custom atomic_add() ok? Can I
use that? It doesn't qualify as "custom lock code" to me. Or would it be better
to reuse the __atomic_inc() implementation from arch-arm/bionic/atomics_arm.S?
(which looks similar actually)
--
Olivier
> I'll reiterate my recommendation to use mutexes over custom lockThere's one thing that you didn't answer: is the custom atomic_add() ok? Can I
> code. It *will* break on SMP if you don't also get the memory
> barriers in, and my experiments have shown that using barrier
> instructions on non-SMP hardware is a very bad idea.
use that? It doesn't qualify as "custom lock code" to me. Or would it be better
to reuse the __atomic_inc() implementation from arch-arm/bionic/atomics_arm.S?
(which looks similar actually)
--
Olivier
--
You received this message because you are subscribed to the Google Groups "android-ndk" group.
Okay, I thought Fadden was maybe answering Angus about futexes, since an atomic
operation isn't "lock code" to me.
But I have now tried with pthread mutexes (svn updated) and they do provide 6x
speedup over semaphores on Android 1.6. That's *very* interesting to know.
And this speedup doesn't come from Froyo's private futexes apparently (I thought
that was the case in Fadden's tests).
> do not reuse assembly fragment from the C library, they will break on SMP
> (and they are also conditionally compiled depending on the device's
> target CPU,
> something you cannot afford in an application).
> moreover, the __atomic_inc is private to the C library, do not use them.
> They could
> be removed, and they must be used with memory barriers on SMP hardware.
Okay, understood, anyway I found a workaround, and atomic operations are not
crucial anymore right now.
I think that one conclusion of this thread simply is that atomic operations are
not available to user space code in Anrdoid currently.
But please optimize the semaphores ;-)
Thanks
--
Olivier
Tim