
Coroutines or threads ?


rami17

Apr 25, 2017, 5:03:58 PM
Hello,


I have come to an interesting subject...

Is my Object coroutines library still useful?

Can threads do the same job as coroutines?

I think that coroutines are still useful because a lock or a cache-line
transfer is expensive for threads running on multicore systems:
a cache-line transfer is around 400 CPU cycles on x86, and that is far
too expensive compared to coroutines. The semaphore and mutex of my
coroutines library, and the contention on them, are thus much less
expensive. Beyond that, coroutines can yield to other functions or
procedures from inside a function or procedure, so I think that
coroutines are still useful.

My object-oriented stackful coroutines library has been updated.

I have now made it completely stackful for both 32-bit and 64-bit
Delphi and FreePascal.

It is portable to both Windows and Mac OS X.

I think it is also portable to Linux: just compile it with Delphi Tokyo
for Linux.

You can download my new version from:

https://sites.google.com/site/aminer68/object-oriented-stackful-coroutines-library-for-delphi-and-freepascal

Thank you,
Amine Moulay Ramdane.




rami17

Apr 25, 2017, 5:04:42 PM
Hello...

Bonita Montero

May 7, 2017, 4:52:07 AM
> ..., a cache-line transfer is around 400 CPU cycles on x86, ...

LOL.

Bonita Montero

May 7, 2017, 10:16:11 AM
OK, on the one hand you're an idiot, but on the other hand you
weren't right in this case either: you actually underestimated
the time of a cache-line transfer.
I wrote a little program that measures the time of a 32-bit
LOCK CMPXCHG when the cache line was previously in another L1 cache
(or was changed by a sibling thread on SMT).
Here it is:
https://pastebin.com/M5W3tvHd
There are two spawned threads in this program. Both check whether a
DWORD value is even or odd. The one looking for even values increments
it through LOCK CMPXCHG (InterlockedCompareExchange maps directly
to the intrinsic when you use MSVC++) when it is even, the other when
it is odd. So both play ping-pong with the cache line. The time to do
this is taken with RDTSC (OK, RDTSC isn't absolutely accurate on my
Ryzen 1800X due to XFR). Each thread does a fixed number of iterations
and counts the successful swaps. Both threads signal their number of
ticks and successful swaps to the main thread. The main thread then
plays the "ping-pong" alone, for the number of iterations minus
the number of successful swaps, but does only unsuccessful CMPXCHGs so
that it gets the overhead of the unsuccessful CMPXCHGs in both threads.
This overhead is the time of all iterations with unsuccessful CMPXCHGs.
This time is subtracted from the average time of both threads, so that
we get the pure time spent doing successful CMPXCHGs, i.e. we get
the time mostly spent transferring the cache lines between the two
cores.
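
For readers without access to the pastebin, the ping-pong idea described
above can be sketched roughly as follows. This is not the posted program,
only a minimal portable approximation under several assumptions:
std::atomic's compare_exchange_strong stands in for
InterlockedCompareExchange, std::chrono::steady_clock stands in for RDTSC,
and the threads are not pinned to specific processors (thread affinity is
platform-specific). The names PingPongResult, bounce and run_ping_pong are
made up for this sketch. It also does not subtract the overhead of the
unsuccessful iterations, so the figure it reports is only an upper bound
on the transfer cost.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical result type for this sketch (not from the posted program).
struct PingPongResult {
    std::uint64_t swaps_even;   // successful swaps by the thread waiting for even values
    std::uint64_t swaps_odd;    // successful swaps by the thread waiting for odd values
    std::uint32_t final_value;  // shared counter after both threads finished
    double ns_per_swap;         // wall-clock ns per successful swap, failed attempts included
};

// One side of the ping-pong: for a fixed number of attempts, increment the
// shared value with a compare-and-swap whenever it has the wanted parity.
static std::uint64_t bounce(std::atomic<std::uint32_t>& value,
                            std::uint32_t parity,
                            std::uint64_t attempts) {
    std::uint64_t swaps = 0;
    for (std::uint64_t i = 0; i < attempts; ++i) {
        std::uint32_t seen = value.load(std::memory_order_relaxed);
        if ((seen & 1u) != parity) continue;  // other thread's turn: failed attempt
        // On x86 this compiles to LOCK CMPXCHG; a success forces the cache
        // line to move to the other core before that core can proceed.
        if (value.compare_exchange_strong(seen, seen + 1u)) ++swaps;
    }
    return swaps;
}

// Runs both sides concurrently and reports the swap counts and a rough timing.
PingPongResult run_ping_pong(std::uint64_t attempts) {
    assert(attempts > 0);
    std::atomic<std::uint32_t> value{0};
    std::uint64_t even = 0, odd = 0;

    auto start = std::chrono::steady_clock::now();
    std::thread t_even([&] { even = bounce(value, 0u, attempts); });
    std::thread t_odd([&] { odd = bounce(value, 1u, attempts); });
    t_even.join();
    t_odd.join();
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::uint64_t total = even + odd;
    return {even, odd, value.load(), total ? ns / total : 0.0};
}
```

Calling run_ping_pong(1000000) from main and printing ns_per_swap gives a
rough figure. Because the swaps strictly alternate, swaps_even and
swaps_odd never differ by more than one. With GCC or Clang the program
needs to be built with -pthread. To compare specific processor pairs as in
the post, the two threads would additionally have to be pinned, e.g. with
SetThreadAffinityMask on Windows.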
The program tests the overhead of processor 0 versus all other
processors in the system. With my Ryzen, the overhead looks like this:
processor 0 and processor 1: 117.606
processor 0 and processor 2: 146.972
processor 0 and processor 3: 159.324
processor 0 and processor 4: 148.491
processor 0 and processor 5: 150.407
processor 0 and processor 6: 175.156
processor 0 and processor 7: 198.427
processor 0 and processor 8: 810.89
processor 0 and processor 9: 806.403
processor 0 and processor 10: 799.987
processor 0 and processor 11: 796.437
processor 0 and processor 12: 795.243
processor 0 and processor 13: 797.405
processor 0 and processor 14: 795.879
processor 0 and processor 15: 792.173
As we can see, there's a huge overhead even between core 0 and its
SMT sibling, and the overhead increases from core 1 to core 2, i.e.
where transfers between two cores become necessary. And there's
a huge increase from core 7 to core 8 because Ryzen is organized
in two CCX modules connected by a crossbar that isn't able to
transfer faster than the RAM (regarding both throughput and
latency).

Bonita Montero

May 7, 2017, 10:54:01 AM
There was a small flaw in the PingPong function.
Here is the corrected code:
https://pastebin.com/Ya3TFcB5