
Thread-local storage


Anton Ertl

Mar 3, 2017, 11:06:04 AM

Thread-local storage (TLS) is implemented on Linux-AMD64 by having the
instances for all threads in the address space and using an FS: prefix
to access the right instance. I have measured that this costs two
cycles of latency on a Haswell and one cycle on a K8 (compared to
doing the same thing in global storage). On other CPUs accessing
thread-local storage also has a cost.
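
For concreteness, a minimal sketch of what this looks like (assuming gcc
-O2 and the local-exec TLS model for a variable defined in the
executable; the exact relocation syntax may differ):

__thread long t;   /* one instance per thread            */
long g;            /* one instance shared by all threads */

long get_t(void) { return t; }   /* movq %fs:t@tpoff, %rax ; ret */
long get_g(void) { return g; }   /* movq g(%rip), %rax     ; ret */

The only difference in the generated code is the %fs segment prefix and
its addressing mode, which is where the extra cycles come from.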

An alternative would be to map the thread-local storage in a certain
address range, i.e., different threads have different mappings for
that address range.

Advantages and disadvantages:

+ Avoids the overhead of using a segment register prefix (or
alternatively, reserving a general-purpose register for keeping the
base address of the thread-local storage).

+ Thread-local memory can be treated like all other memory instead of
needing conversion from thread-local to normal addresses when taking
the address of a thread-local object (and the cost of this
conversion is also avoided).

+? This results in more flexibility when dealing with thread-local
storage, e.g. thread-local memory could be allocated dynamically
(not sure if this is a real advantage; static thread-local pointers
to normal dynamic memory are probably sufficient).

- Thread switching now needs kernel intervention, user-space threads
(green threads) cannot use this TLS mechanism.

The following presumes that threads of a process share the address
space (and use the same address-space ID (ASID), if they use ASIDs),
which is not necessarily the case.

- Thread switching on the same core is now more expensive; in the
cheapest case it is as cheap as loading a new address-space id on
thread switch, which should be relatively cheap, but when that does
not suffice, it becomes quite expensive (TLB shootdowns or TLB
flushes). Of course, if you use threads mainly for making use of
multiple cores, this will not occur often.

- Sharing of TLB entries between threads is eliminated. Again, if you
use threads to make use of multiple cores, this is not an issue.

- More ASIDs are active at the same time.

Did I forget anything?

Overall, it seems to me that the advantages of having thread-local
memory at the same addresses in different threads are bigger than the
disadvantages, but nobody does it that way. Why?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Casper H.S. Dik

Mar 3, 2017, 11:32:38 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

>- Sharing of TLB entries between threads is eliminated. Again, if you
> use threads to make use of multiple cores, this is not an issue.

>- More ASIDs are active at the same time.

>Did I forget anything?

Not possible to ever share the thread-local storage between threads?

Casper

Anton Ertl

Mar 3, 2017, 11:47:53 AM

If you want that (not sure if anybody wants to), also map the
thread-local memories of all threads in the process's address space.

Or, to put it another way: take the current model and also map the
per-thread memory of the current thread in a fixed place in the
address space, and access it through that if you want to access it as
thread-local storage. If you want to access another thread's memory,
do what you would do now.

Bruce Hoult

Mar 3, 2017, 12:04:51 PM

Not really useful once you've got on the order of 16 or more general purpose registers to burn. A little bit tight on x86_64 or ARM32 or VAX, true. But when you have 32...

Anton Ertl

Mar 3, 2017, 1:24:40 PM

Let's test that theory:

__thread long a[]={0};

long foo(long l)
{
  return a[l];
}

On Aarch64 this compiles to:

with __thread:                          without __thread:
   0:  mrs x1, tpidr_el0                   0:  adrp x1, 0 <foo>
   4:  add x1, x1, #0x0, lsl #12           4:  add x1, x1, #0x0
   8:  add x1, x1, #0x0                    8:  ldr x0, [x1,x0,lsl #3]
   c:  ldr x0, [x1,x0,lsl #3]              c:  ret
  10:  ret

On a 2GHz Cortex A53 a loop calling this function 100M times takes:

0.71s user time                         0.51s user time

Another function:

__thread long a;

void foo()
{
  a=10;
}

with __thread:                          without __thread:
   0:  mov x1, #0xa                        0:  mov x1, #0xa
   4:  mrs x0, tpidr_el0                   4:  adrp x0, 8 <foo+0x8>
   8:  add x0, x0, #0x0, lsl #12           8:  str x1, [x0]
   c:  add x0, x0, #0x0                    c:  ret
  10:  str x1, [x0]
  14:  ret

0.71s user time                         0.52s user time

Looks to me like your theory has been falsified.

timca...@aol.com

Mar 3, 2017, 1:55:49 PM

On Friday, March 3, 2017 at 11:06:04 AM UTC-5, Anton Ertl wrote:
... Analysis of TLS snipped.
>
> Overall, it seems to me that the advantages of having thread-local
> memory at the same addresses in different threads are bigger than the
> disadvantages, but nobody does it that way. Why?
>

The trick is to use TLS as little as possible. Big data structures are allocated/stored in global memory; you store a pointer to them in TLS. So, if you have a bunch of thread parameters, you put them in a data structure in global memory, and just a pointer to them in TLS. The FS/GS override trick (depending on 32 or 64 bit, Windows or Linux) is pretty neat for avoiding reloading the TLB, although I guess using ASIDs is pretty good too (although I suspect 32 bit IDs will be necessary pretty quick).
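
A minimal sketch of that arrangement (the struct and field names here are made up for illustration):

#include <stdlib.h>

struct thread_params {            /* bulk per-thread state, in ordinary heap memory */
    long id;
    char scratch[4096];
};

static __thread struct thread_params *params;   /* only this pointer lives in TLS */

void thread_start(long id)
{
    params = malloc(sizeof *params);   /* allocated in global (shared) memory */
    params->id = id;
}

long my_id(void)
{
    return params->id;                 /* one TLS access, then ordinary loads */
}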

- Tim

Bruce Hoult

Mar 3, 2017, 3:07:53 PM

Not being at work, where there's an Odroid C2, I tried it on my RISC-V based Arduino clone. It's only 320 MHz, so I did only 10M loops. Single-issue, in-order, by the way.

After adding "volatile" to the array and __attribute__ ((noinline)) to foo() and adding the results of foo() to a total that I printed later, so that the compiler didn't optimize everything away, I got (the timings are absolutely repeatable, to the us, every time by the way):

Thread local: 312546 us

204000d6 <_Z3fool>:
204000d6: 050a slli a0,a0,0x2
204000d8: 00020313 mv t1,tp
204000dc: 00a303b3 add t2,t1,a0
204000e0: 0003a503 lw a0,0(t2)
204000e4: 8082 ret

Global: 312546 us

204000d6 <_Z3fool>:
204000d6: 050a slli a0,a0,0x2
204000d8: 80818293 addi t0,gp,-2040 # 80000430 <a>
204000dc: 00a28333 add t1,t0,a0
204000e0: 00032503 lw a0,0(t1)
204000e4: 8082 ret

Hmm .. slightly different code .. but identical times. Fancy that. Exactly 10 instructions and 10 cycles per loop as the driver loop has 5 instructions:

20400134: 2a01 jal 20400244 <micros>
20400136: 89aa mv s3,a0
20400138: 00989537 lui a0,0x989
2040013c: 68050413 addi s0,a0,1664 # 989680

20400140: 4501 li a0,0
20400142: 3f51 jal 204000d6 <_Z3fool>
20400144: 147d addi s0,s0,-1
20400146: 94aa add s1,s1,a0
20400148: fc65 bnez s0,20400140 <setup+0x58>

2040014a: 28ed jal 20400244 <micros>

Interesting to note that the function call, function return, and conditional branch are all single-cycle instructions, thanks to branch prediction, BTB, and return-address stack. Not bad for a microcontroller. Load latency is 2 cycles.


This is of course a kind of "speed of light" test. You could hardly make a worse case example. If I stop going out of my way to prevent the compiler inlining foo() then I get:

Thread local: 125046 us

20400124: 2a09 jal 20400236 <micros>
20400126: 009893b7 lui t2,0x989
2040012a: 89aa mv s3,a0
2040012c: 68038613 addi a2,t2,1664 # 989680
20400130: 00020513 mv a0,tp

20400134: 4118 lw a4,0(a0)
20400136: 167d addi a2,a2,-1
20400138: 943a add s0,s0,a4
2040013a: fe6d bnez a2,20400134 <setup+0x5c>

2040013c: 28ed jal 20400236 <micros>

Global: 125046 us

20400120: 2239 jal 2040022e <micros>
20400122: 009893b7 lui t2,0x989
20400126: 89aa mv s3,a0
20400128: 68038513 addi a0,t2,1664 # 989680

2040012c: 4098 lw a4,0(s1)
2040012e: 157d addi a0,a0,-1
20400130: 943a add s0,s0,a4
20400132: fd6d bnez a0,2040012c <setup+0x54>

20400134: 28ed jal 2040022e <micros>

Once again, different code, but exactly the same times, and 2.499 times faster because there is no function call/return and no recalculation of the address of a[] on every iteration. Either version is 4 cycles per loop instead of 10. (The load latency is covered by the loop control.)

No doubt your ARM will also benefit considerably from allowing foo() to be inlined ... and in fact I predict identical times for global and thread-local there too, if you allow at least -O1 anyway.

EricP

Mar 3, 2017, 3:13:07 PM

Anton Ertl wrote:
> Thread-local storage (TLS) is implemented on Linux-AMD64 by having the
> instances for all threads in the address space and using an FS: prefix
> to access the right instance. I have measured that this costs two
> cycles of latency on a Haswell and one cycle on a K8 (compared to
> doing the same thing in global storage). On other CPUs accessing
> thread-local storage also has a cost.
>
> An alternative would be to map the thread-local storage in a certain
> address range, i.e., different threads have different mappings for
> that address range.

This doesn't quite work but something very similar does.
See below.

> Advantages and disadvantages:
>
> + Avoids the overhead of using a segment register prefix (or
> alternatively, reserving a general-purpose register for keeping the
> base address of the thread-local storage).
>
> + Thread-local memory can be treated like all other memory instead of
> needing conversion from thread-local to normal addresses when taking
> the address of a thread-local object (and the cost of this
> conversion is also avoided).
>
> +? This results in more flexibility when dealing with thread-local
> storage, e.g. thread-local memory could be allocated dynamically
> (not sure if this is a real advantage; static thread-local pointers
> to normal dynamic memory are probably sufficient).
>
> - Thread switching now needs kernel intervention, user-space threads
> (green threads) cannot use this TLS mechanism.

You could store the pointer to the current user mode thread (aka fiber)
header in the OS thread local store. When the OS thread switches fibers,
it updates the current fiber header pointer.

There would have to be separate routines to access
Fiber Local Store pointers.

> The following presumes that threads of a process share the address
> space (and use the same address-space ID (ASID), if they use ASIDs),
> which is not necessarily the case.

The term 'thread' is pretty much defined as sharing the same
address space and handle table.

If you want something different, perhaps a different name would
be less confusing.

> - Thread switching on the same core is now more expensive; in the
> cheapest case it is as cheap as loading a new address-space id on
> thread switch, which should be relatively cheap, but when that does
> not suffice, it becomes quite expensive (TLB shootdowns or TLB
> flushes). Of course, if you use threads mainly for making use of
> multiple cores, this will not occur often.
>
> - Sharing of TLB entries between threads is eliminated. Again, if you
> use threads to make use of multiple cores, this is not an issue.
>
> - More ASIDs are active at the same time.
>
> Did I forget anything?
>
> Overall, it seems to me that the advantages of having thread-local
> memory at the same addresses in different threads are bigger than the
> disadvantages, but nobody does it that way. Why?

This would be "thread private virtual memory". It's possible.
Reasons why not:
nobody needs it, and the current tools work well enough?

>
> - anton

The way TLS works on Windows, and I'll assume Linux is similar,
is that each thread has a user mode header, a 4KB struct on x86/x64.
All thread headers reside within the same process virtual space
and are therefore concurrently mapped and accessible from all
cpus mapping the same process address space.

On x86/x64, the virtual address of the thread header of the
thread currently running on each cpu is pointed at by the FS
segment register. FS is set to the proper value by the OS
thread switcher.

(There is also a kernel mode thread header and a kernel mode
Processor State Descriptor that on x86/x64 are pointed at
by FS and GS segment registers.)

For TLS, inside that thread header struct is an array of 64 pointers
called the Thread Local Store array. A pointer can be set
and fetched at its index. On x86/x64 that array is accessed
in each thread header as an offset from the FS segment register.
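
For reference, a minimal sketch of using that slot array through the Win32 API, which hands out the indexes into it (error handling omitted):

#include <windows.h>

static DWORD slot;

void setup(void)
{
    slot = TlsAlloc();             /* reserve an index in the per-thread array */
}

void per_thread_use(void *p)
{
    TlsSetValue(slot, p);          /* writes this thread's slot only */
    void *q = TlsGetValue(slot);   /* each thread reads back its own value */
    (void)q;
}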

One consequence of the above statements is that the address of
a field in one thread header is valid in the execution contexts
of any thread in the same process.
For example, if thread T1 executing on core C1 takes the address
of its header field and passes that address to thread T2 executing
on core C2, then T2 can access that address as normal.

That precludes mapping different thread headers on different cores.

However, instead you could have the same virtual page mapped as
user mode read-only to a different physical frame on each core,
and that page contains a pointer to the thread header that is current
on this cpu. Since each cpu has a different physical frame backing
that same virtual page, each can store a different pointer value.
The OS thread switch code, as part of thread activation,
writes the pointer to its thread header at that address.
If a thread switch does not include a process switch,
this does not require a whole TLB shootdown; a single
virtual address invalidate of the pointer page would do.

Accessing, say, the local store array would be:
   gblCurrentThreadHeadPtr->LocalStoreArray[index]
This is wrapped in access routines, so the changes are invisible
to applications.
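
A hypothetical sketch of such an access routine (names as above; the kernel is assumed to remap the page holding gblCurrentThreadHeadPtr to a per-cpu frame and rewrite the pointer on each thread switch):

struct thread_header {
    void *LocalStoreArray[64];     /* the 64-slot TLS pointer array */
    /* ... other per-thread fields ... */
};

/* Lives in the one virtual page that is backed by a different physical
   frame on each cpu; mapped read-only to user mode. */
extern struct thread_header *const gblCurrentThreadHeadPtr;

void *tls_get(unsigned index)
{
    return gblCurrentThreadHeadPtr->LocalStoreArray[index];
}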

On x86/x64, accessing the thread header via a pointer in memory,
even when it is a cache hit, takes more clocks than an FS-relative
access. But on an ISA without spare registers hanging about, it would
suffice.

Eric




Bruce Hoult

Mar 3, 2017, 4:56:21 PM

On Friday, March 3, 2017 at 9:24:40 PM UTC+3, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Friday, March 3, 2017 at 7:06:04 PM UTC+3, Anton Ertl wrote:
> >> Overall, it seems to me that the advantages of having thread-local
> >> memory at the same addresses in different threads are bigger than the
> >> disadvantages, but nobody does it that way. Why?
> >
> >Not really useful once you've got on the order of 16 or more general purpose registers to burn. A little bit tight on x86_64 or ARM32 or VAX, true. But when you have 32...
>
> Let's test that theory:
>
> __thread long a[]={0};
>
> long foo(long l)
> {
> return a[l];
> }
>
> On Aarch64 this compiles to:
>
> with __thread: without __thread:
> 0: mrs x1, tpidr_el0 0: adrp x1, 0 <foo>
> 4: add x1, x1, #0x0, lsl #12 4: add x1, x1, #0x0
> 8: add x1, x1, #0x0 8: ldr x0, [x1,x0,lsl #3]
> c: ldr x0, [x1,x0,lsl #3] c: ret
> 10: ret
>
> On a 2GHz Cortex A53 a loop calling this function 100M times takes:
>
> 0.71s user time 0.51s user time

So, looking more closely at your test, it seems you're getting 10 clock cycles per iteration for the global version on Aarch64 A53, the same as me on RISC-V E310 (Berkeley "Rocket" core). But you're seeing 14 cycles per loop for the thread-local version.

When I said "Not really useful once you've got on the order of 16 or more general purpose registers to burn. A little bit tight on x86_64 or ARM32 or VAX, true. But when you have 32..." I of course meant that a register should be dedicated to holding a pointer to the thread local variables, as happens on RISC-V with "tp" (x4), just as globals are referenced from "gp" (x3), and locals from "sp" (x2).

The decision to keep the thread local storage pointer in a system register and need an "mrs" (move from system register) instruction instead of in one of the plentiful integer registers seems not only a bit strange but even foolish, especially as it seems to cost you an extra 4 clock cycles.

Section 5.1.1 of the Procedure Call Standard for the ARM 64-bit Architecture (Aarch64) seems to suggest that r18 should be used for thread-local storage, just as r9 is used on Aarch32. So I really don't know what's going on there. Googling around didn't produce much illumination.

Anton Ertl

Mar 4, 2017, 12:24:51 PM

timca...@aol.com writes:
>On Friday, March 3, 2017 at 11:06:04 AM UTC-5, Anton Ertl wrote:
>... Analysis of TLS snipped.
>>
>> Overall, it seems to me that the advantages of having thread-local
>> memory at the same addresses in different threads are bigger than the
>> disadvantages, but nobody does it that way. Why?
>>
>
>The trick is to use TLS as little as possible. Big data structures are
>allocated/stored in global memory; you store a pointer to them in TLS.
>So, if you have a bunch of thread parameters, you put them in a data
>structure in global memory, and just a pointer to them in TLS.

Which means that using TLS costs another indirection, at least sometimes.

> The FS/GS override trick (depending on 32 or 64 bit, Windows or Linux)
>is pretty neat for avoiding reloading the TLB, although I guess using
>ASIDs is pretty good too (although I suspect 32 bit IDs will be
>necessary pretty quick).

I just looked at the number of threads in our servers, and they all
had <1000 threads; that's probably different on servers hosting many
virtual machines, and/or when people use threads a lot.

Anyway, ASIDs are not permanent, but instead are assigned when needed,
and possibly on a per-core basis. So, as long as you run fewer than
4096 threads on an Intel core (12-bit ASIDs, called PCIDs), there is no
need to invalidate an ASID (or flush the TLB to invalidate all ASIDs).
And even if you run more threads on a core, you only need to flush the
TLB on the 4097th unique thread you run on the core. That's how the
big servers of yesteryear managed to get by with 6-bit and 8-bit ASIDs.

Anton Ertl

Mar 4, 2017, 1:11:56 PM

Bruce Hoult <bruce...@gmail.com> writes:
>After adding "volatile" to the array and __attribute__ ((noinline)) to
>foo() and adding the results of foo() to a total that I printed later,
>so that the compiler didn't optimize everything away,

Separate compilation did the trick for me.

> I got (the timings are absolutely repeatable, to the us, every time by
>the way):
>
>Thread local: 312546 us
>
>204000d6 <_Z3fool>:
>204000d6: 050a slli a0,a0,0x2
>204000d8: 00020313 mv t1,tp
>204000dc: 00a303b3 add t2,t1,a0
>204000e0: 0003a503 lw a0,0(t2)
>204000e4: 8082 ret
>
>Global: 312546 us
>
>204000d6 <_Z3fool>:
>204000d6: 050a slli a0,a0,0x2
>204000d8: 80818293 addi t0,gp,-2040 # 80000430 <a>
>204000dc: 00a28333 add t1,t0,a0
>204000e0: 00032503 lw a0,0(t1)
>204000e4: 8082 ret
>
>Hmm .. slightly different code .. but identical times. Fancy that.
>Exactly 10 instructions and 10 cycles per loop as the driver loop has 5
>instructions:
>
>20400134: 2a01 jal 20400244 <micros>
>20400136: 89aa mv s3,a0
>20400138: 00989537 lui a0,0x989
>2040013c: 68050413 addi s0,a0,1664 # 989680
>
>20400140: 4501 li a0,0
>20400142: 3f51 jal 204000d6 <_Z3fool>
>20400144: 147d addi s0,s0,-1
>20400146: 94aa add s1,s1,a0
>20400148: fc65 bnez s0,20400140 <setup+0x58>
>
>2040014a: 28ed jal 20400244 <micros>

I feel right at home: Almost the same mnemonics as MIPS (except that
add and addi probably do not correspond to MIPS add/addi, but to MIPS
addu/addiu). Obviously a tighter encoding. Interestingly, even
instructions that are probably encoded with two register specifier
look as if they were three-address instructions on the assembly level
(e.g., "add s1,s1,a0").

>This is of course a kind of "speed of light" test. You could hardly
>make a worse case example. If I stop going out of my way to prevent the
>compiler inlining foo() then I get:

Sure, but in a real program the compiler does not inline everything,
in particular not indirect calls as used in OO method dispatch. So
you are likely to incur the extra cost of getting at the TLS pointer
at least once in every non-inlined function that accesses TLS.

>No doubt your ARM will also benefit considerably from allowing foo() to
>be inlined ... and in fact I predict identical times for global and
>thread-local there too, if you allow at least -O1 anyway.

I used -O for the timings above, but sure, as long as your whole
program is inlined and you don't run out of registers, the TLS
overhead is virtually non-existent on Aarch64.

And given that global accesses are typically also through a register
on most RISCs these days, globals are just as inefficient as TLS on
them when more complex addressing is involved. For IA-32/AMD64 the
GS:/FS: technique allows using the same addressing modes as for
globals, but incurs extra cycles at every access.

Anton Ertl

Mar 4, 2017, 1:32:44 PM

EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> Thread-local storage (TLS) is implemented on Linux-AMD64 by having the
>> instances for all threads in the address space and using an FS: prefix
>> to access the right instance. I have measured that this costs two
>> cycles of latency on a Haswell and one cycle on a K8 (compared to
>> doing the same thing in global storage). On other CPUs accessing
>> thread-local storage also has a cost.
>>
>> An alternative would be to map the thread-local storage in a certain
>> address range, i.e., different threads have different mappings for
>> that address range.
>
>This doesn't quite work but something very similar does.
>See below.
...
>> - Thread switching now needs kernel intervention, user-space threads
>> (green threads) cannot use this TLS mechanism.
>
>You could store that pointer to the current user mode thread (aka fiber)
>header in the OS thread local store. When the OS thread switched fiber,
>it updates the current fiber header pointer.
>
>There would have to be separate routines to access
>Fiber Local Store pointers.

You lost me here. The idea of user-space threads is to perform thread
switching without going through the OS kernel.

>The term 'thread' is pretty much defined as sharing the same
>address space and handle table.
>
>If you want something different, perhaps a different name would
>be less confusing.

Possibly. I disagree with your definition of "thread", however.
E.g., SMT and hyperthreading contain "thread", but do not require any
sharing.

In the present context, people looking at the *implementation* of
threads in some OSs may consider "sharing the address space" as a
defining feature, but *users* of these threads probably see the affair
a little differently: For them a part of the address space is shared,
and a part is not mapped; if the TLS is mapped in some previously
unmapped part of the address space to allow them to access TLS more
conveniently, will they see it as something fundamentally different
from a thread? I doubt it.

>nobody needs it, and the current tools work well enough?

I read that programmers put in extra effort to avoid TLS accesses
because they are slow on Windows and ARM (probably Android), so
the current tools don't work well enough.

Anton Ertl

Mar 4, 2017, 1:46:54 PM

Bruce Hoult <bruce...@gmail.com> writes:
>The decision to keep the thread local storage pointer in a system
>register and need an "mrs" (move from system register) instruction
>instead of in one of the plentiful integer registers seems not only a
>bit strange but even foolish, especially as it seems to cost you an
>extra 4 clock cycles.
>
>Section 5.1.1 of the Procedure Call Standard for the ARM 64-bit
>Architecture (Aarch64) seems to suggest that r18 should be used for
>thread-local storage, just as r9 is used on Aarch32. So I really don't
>know what's going on there. Googling around didn't produce much
>illumination.

Yes, given that Aarch64 and its ABI were defined after multi-cores
made multi-threading really important, I would expect that they
dedicate a GPR to it in the ABI. But apparently not in the ABI that's
used by gcc-5.3 on Ubuntu 16.04.

OTOH, maybe the idea is that everybody works hard to reduce accesses
to __thread variables because they are so slow on some platforms, so
it does not pay to sacrifice a GPR for it even if you have 31, and
that putting the base in some system register is good enough.

Ivan Godard

Mar 4, 2017, 1:52:40 PM

On 3/3/2017 6:10 AM, Anton Ertl wrote:
> Thread-local storage (TLS) is implemented on Linux-AMD64 by having the
> instances for all threads in the address space and using an FS: prefix
> to access the right instance. I have measured that this costs two
> cycles of latency on a Haswell and one cycle on a K8 (compared to
> doing the same thing in global storage). On other CPUs accessing
> thread-local storage also has a cost.
<snip proposal>

FWIW, Mill has a dedicated base register for TLS addressing, somewhat
akin to the FS: approach. All threads' TLS is in the global shared
address space (as is everything), so a thread switch is a permissions
change and not an address change.


Bruce Hoult

Mar 4, 2017, 2:23:17 PM

On Saturday, March 4, 2017 at 9:46:54 PM UTC+3, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >The decision to keep the thread local storage pointer in a system
> >register and need an "mrs" (move from system register) instruction
> >instead of in one of the plentiful integer registers seems not only a
> >bit strange but even foolish, especially as it seems to cost you an
> >extra 4 clock cycles.
> >
> >Section 5.1.1 of the Procedure Call Standard for the ARM 64-bit
> >Architecture (Aarch64) seems to suggest that r18 should be used for
> >thread-local storage, just as r9 is used on Aarch32. So I really don't
> >know what's going on there. Googling around didn't produce much
> >illumination.
>
> Yes, given that Aarch64 and its ABI were defined after multi-cores
> made multi-threading really important, I would expect that they
> dedicate a GPR to it in the ABI. But apparently not in the ABI that's
> used by gcc-5.3 on Ubuntu 16.04.
>
> OTOH, maybe the idea is that everybody works hard to reduce accesses
> to __thread variables because they are so slow on some platforms, so
> it does not pay to sacrifice a GPR for it even if you have 31, and
> that putting the base in some system register is good enough.

It seems crazy to me, for a new architecture with so many registers! And that's not the only inexplicable thing in Aarch64 .. see the other reply. Generally I think they did some nice things in Aarch64, but I suspect it's not going to be a terribly "sticky" upgrade for everyone who's been using ARM in the past -- and all the more so once Aarch64-only CPUs become common (e.g. Cavium).

At least it's not as bad as MIPS which, as I learned while trying to find the answer for Aarch64, gets the TLS base by using a "rdhwr" instruction with an illegal argument ($29), which is then trapped and emulated by the kernel!!

Bruce Hoult

Mar 4, 2017, 3:53:22 PM

Right. Definitely MIPS-inspired, but with improvements from 30+ years of experience since 1981. In some ways more like Alpha (e.g. no arithmetic exceptions .. if you want to check, do it yourself), or even Aarch64 (which no doubt started development earlier, but was announced after RISC-V was already well underway).

Other improvements include provision for 32, 64, or 128 bit register sizes and corresponding load/store etc, from the start (and 8/16 bit load/store, unlike early Alpha).

Also provision for optional variable length instruction encodings integrated from the start.

While it retains the Set-on-Less-Than instruction, the compare-and-branch instruction implements the full set of signed and unsigned comparisons so SLT is needed only to create explicit booleans.

Immediates are only 12 bits vs 16 on MIPS. But LUI and AUIPC are 20 bits instead of 16 to compensate, so you can still load a 32 bit constant in two instructions. Or a 4KB page boundary with one instruction.
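
As a sketch, the lui+addi split for an arbitrary 32-bit constant (the example constants are my own; this assumes the usual arithmetic right shift for signed types):

#include <stdint.h>

/* Split constant c into a 20-bit lui immediate and a sign-extended
   12-bit addi immediate, the way the assembler's %hi/%lo modifiers do. */
void split_const(uint32_t c, uint32_t *lui_imm, int32_t *addi_imm)
{
    *addi_imm = (int32_t)(c << 20) >> 20;          /* low 12 bits, sign-extended */
    *lui_imm  = (c - (uint32_t)*addi_imm) >> 12;   /* upper 20 bits, adjusted    */
}
/* e.g. 0x12345678 -> lui a0,0x12345 ; addi a0,a0,0x678
        0x12345fff -> lui a0,0x12346 ; addi a0,a0,-1   */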

> (except that add and addi probably do not correspond to MIPS
> add/addi, but to MIPS addu/addiu).

Right. No arithmetic exceptions.

> Obviously a tighter encoding. Interestingly, even
> instructions that are probably encoded with two register specifier
> look as if they were three-address instructions on the assembly level
> (e.g., "add s1,s1,a0").

The base instruction set is pure fixed-length 32 bit 3-address RISC, and you could build a CPU like that -- but it's looking likely that such CPUs will only be student projects or maybe the very simplest cacheless 2- or 3-stage-pipe embedded processors in uses where code size is trivial.

The mixed-length encoding lets you get 35% - 50% more functionality in a given size ROM in an embedded application, but also 35% - 50% more functionality in a given size L1 cache in a high-performance desktop or server.

I think it's a huge blunder that Aarch64 has made no provision for variable-length encoding! This from the same company that took the world by storm with Thumb2.

As with original ARM and MIPS, there is no room in the encoding scheme to retro-fit smaller instructions without using two instruction sets and mode changes on function call/return (or similar) like Thumb1 or MIPS16.

Aarch64 is pretty cleverly designed and has easily the densest encoding of any pure fixed length 32-bit instruction set, but it's far worse than x86_64, Thumb2, or RISCV.

The RISC-V 16 bit instructions fit in with the 32 bit instruction encoding scheme (like Thumb2, and unlike Thumb1 or MIPS16). They are a pure functional subset of the 32 bit instructions. Current gcc and llvm don't know anything about the 16 bit instructions. The assembler just opportunistically uses the 16 bit encoding for any instruction that has one. So, yes, the assembly language looks like 3-address even when the encoding used is 2-address. Thumb2 "unified" syntax does the same.

Implementations of RISC-V with 64 or 128 bit register sizes can also use the 16 bit instructions. The encoding varies a little bit. The 16 bit JAL instruction you see in my code here disappears to make room for 16 bit encodings of load/store double/quad. Not a huge loss, as the short JAL only has a range of +/-2KB, so it wouldn't get used much in larger desktop/server software anyway. On the other hand, 16 bit encodings for load/store in the full register size get used in every non-leaf function prolog and epilog (for the return address, if nothing else).

The instruction length encoding works like this:

16 bit if the 2 LSBs (1..0) are not 11, otherwise
32 bit if the next 3 LSBs (4..2) are not 111, otherwise
48 bit if the next LSB (5) is not 1, otherwise
64 bit if the next LSB (6) is not 1, otherwise
80-176 bits if bits 14..12 are not 111, otherwise
... scheme for >=192 bits is not yet defined.

(bits 11..7 are skipped because they are always the dest register)

Most of that is far future stuff, and probably would be used for custom extensions not general purpose ones. All currently-mooted implementations only have to look at the bottom two bits to determine instruction length :-)
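
As a sketch, the length rule above fits in a few lines of C, needing only the first 16-bit parcel of an instruction (the >=80-bit cases are left out):

#include <stdint.h>

/* Returns the instruction length in bytes, or -1 for the long
   encodings not handled here. */
static int rv_insn_length(uint16_t low)
{
    if ((low & 0x0003) != 0x0003) return 2;   /* bits 1..0 != 11  -> 16-bit */
    if ((low & 0x001c) != 0x001c) return 4;   /* bits 4..2 != 111 -> 32-bit */
    if ((low & 0x0020) == 0)      return 6;   /* bit 5 == 0       -> 48-bit */
    if ((low & 0x0040) == 0)      return 8;   /* bit 6 == 0       -> 64-bit */
    return -1;                                /* 80-bit and longer encodings */
}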

RISC-V isn't anything radical like The Mill, but I do believe it's a best of breed modern RISC ISA combining all the lessons learned from 30 years experience with MIPS and ARM, the transition from 32 bit to 64 bit addresses/registers (and looking forward to 128), and more recent designs such as Alpha and Thumb2.

A number of critical projects got forked by the RISC-V effort. The binutils support has just this month been included in the current official release, and the gcc support has been accepted upstream and will be in the 7.1 release in April. qemu support will probably be the next thing to get upstreamed.


> >This is of course a kind of "speed of light" test. You could hardly
> >make a worse case example. If I stop going out of my way to prevent
> >the compiler inlining foo() then I get:
>
> Sure, but in a real program the compiler does not inline everything,
> in particular not indirect calls as used in OO method dispatch. So
> you are likely to incur the extra cost of getting at the TLS pointer
> at least once in every non-inlined function that accesses TLS.

Yes, on ABIs where there is not a dedicated TLS register that is changed only on thread switch.


> And given that global accesses are typically also through a register
> on most RISCs these days, globals are just as inefficient as TLS on
> them when more complex addressing is involved. For IA-32/AMD64 the
> GS:/FS: technique allows using the same addressing modes as for
> globals, but incurs extra cycles at every access.

If you don't have variable-length instructions that let you append a full size absolute address to instructions then dedicating a register to point to globals is the fastest option for anything that is within reach of a 12 or 16 bit or whatever offset.

If you want PIC data (rwpi in ARM-talk) then you're not going to use absolute addresses anyway.

MitchAlsup

Mar 5, 2017, 1:20:11 PM

On Saturday, March 4, 2017 at 12:46:54 PM UTC-6, Anton Ertl wrote:
>
> Yes, given that Aarch64 and it's ABI were defined after multi-cores
> made multi-threading really important, I would expect that they
> dedicate a GPR to it in the ABI. But apparently not in the ABI that's
> used by gcc-5.3 on Ubuntu 16.04.

GPUs have a kind of thread local memory, and have memory ref instructions
directed towards TLM. The AGEN process does the conversion from a
per-thread virtual address into a unified physical address--mostly by
address munging hidden from the application.

TLM is the means by which threads running in SIMT mode have a stack.

Bruce Hoult

Mar 5, 2017, 2:23:09 PM

Yes. Strangely enough, at the moment I'm helping write a compiler for a brand new GPU architecture. Well .. implementing OpenCL built-ins (and adding intrinsics for the weird but useful instructions in the ISA, as needed) and compiler memory management at the moment. Possibly improving instruction selection soon.