Anton Ertl wrote:
> Thread-local storage (TLS) is implemented on Linux-AMD64 by having the
> instances for all threads in the address space and using an FS: prefix
> to access the right instance. I have measured that this costs two
> cycles of latency on a Haswell and one cycle on a K8 (compared to
> doing the same thing in global storage). On other CPUs accessing
> thread-local storage also has a cost.
>
> An alternative would be to map the thread-local storage in a certain
> address range, i.e., different threads have different mappings for
> that address range.
This doesn't quite work but something very similar does.
See below.
> Advantages and disadvantages:
>
> + Avoids the overhead of using a segment register prefix (or
> alternatively, reserving a general-purpose register for keeping the
> base address of the thread-local storage).
>
> + Thread-local memory can be treated like all other memory instead of
> needing conversion from thread-local to normal addresses when taking
> the address of a thread-local object (and the cost of this
> conversion is also avoided).
>
> +? This results in more flexibility when dealing with thread-local
> storage, e.g. thread-local memory could be allocated dynamically
> (not sure if this is a real advantage; static thread-local pointers
> to normal dynamic memory are probably sufficient).
>
> - Thread switching now needs kernel intervention, user-space threads
> (green threads) cannot use this TLS mechanism.
You could store a pointer to the current user-mode thread (aka fiber)
header in the OS thread-local store. When the OS thread switches fibers,
it updates the current fiber header pointer.
There would have to be separate routines to access
Fiber Local Store pointers.
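
A minimal sketch of that idea in C, assuming C11 _Thread_local stands in
for the OS thread-local store. The names FiberHeader, current_fiber,
fiber_switch_to, fls_get and fls_set are made up for illustration, and a
real fiber switch would also save and restore registers and the stack:

```c
#include <assert.h>
#include <stddef.h>

#define FLS_SLOTS 64  /* hypothetical slot count */

/* Hypothetical per-fiber header holding the Fiber Local Store slots. */
typedef struct FiberHeader {
    void *fls[FLS_SLOTS];
} FiberHeader;

/* One pointer per OS thread, kept in ordinary thread-local storage. */
static _Thread_local FiberHeader *current_fiber;

/* Called by the user-mode scheduler when this OS thread switches fibers.
   Only the header pointer update is shown here. */
void fiber_switch_to(FiberHeader *next)
{
    current_fiber = next;
}

/* Separate access routines for Fiber Local Store pointers. */
void *fls_get(size_t index)
{
    assert(current_fiber != NULL && index < FLS_SLOTS);
    return current_fiber->fls[index];
}

void fls_set(size_t index, void *value)
{
    assert(current_fiber != NULL && index < FLS_SLOTS);
    current_fiber->fls[index] = value;
}
```

Each fiber sees its own slot values after fiber_switch_to(), at the cost
of one extra indirection through the OS thread-local pointer.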
> The following presumes that threads of a process share the address
> space (and use the same address-space ID (ASID), if they use ASIDs),
> which is not necessarily the case.
The term 'thread' is pretty much defined as sharing the same
address space and handle table.
If you want something different, perhaps a different name would
be less confusing.
> - Thread switching on the same core is now more expensive; in the
> cheapest case it is as cheap as loading a new address-space id on
> thread switch, which should be relatively cheap, but when that does
> not suffice, it becomes quite expensive (TLB shootdowns or TLB
> flushes). Of course, if you use threads mainly for making use of
> multiple cores, this will not occur often.
>
> - Sharing of TLB entries between threads is eliminated. Again, if you
> use threads to make use of multiple cores, this is not an issue.
>
> - More ASIDs are active at the same time.
>
> Did I forget anything?
>
> Overall, it seems to me that the advantages of having thread-local
> memory at the same addresses in different threads are bigger than the
> disadvantages, but nobody does it that way. Why?
This would be "thread private virtual memory". It's possible.
Reasons why not:
nobody needs it, and the current tools work well enough?
>
> - anton
The way TLS works on Windows, and I'll assume Linux is similar,
is that each thread has a user mode header, a 4KB struct on x86/x64.
All thread headers reside within the same process virtual space
and are therefore concurrently mapped and accessible from all
CPUs mapping the same process address space.
On x86/x64, the virtual address of the thread header of the
thread currently running on each cpu is pointed at by the FS
segment register. FS is set to the proper value by the OS
thread switcher.
(There is also a kernel mode thread header and a kernel mode
Processor State Descriptor that on x86/x64 are pointed at
by FS and GS segment registers.)
For TLS, inside that thread header struct is an array of 64 pointers
called the Thread Local Store array. Each pointer can be set
and fetched by its index. On x86/x64 that array is accessed
in each thread header as an offset from the FS segment register.
One consequence of the above statements is that the address of
a field in one thread header is valid in the execution context
of any thread in the same process.
For example, if thread T1 executing on core C1 takes the address
of its header field and passes that address to thread T2 executing
on core C2, then T2 can access that address as normal.
That precludes mapping different thread headers on different cores.
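
The T1/T2 example can be tried with ordinary compiler TLS. This is only
a sketch, assuming POSIX threads and C11 _Thread_local; tls_value,
writer and cross_thread_write are illustrative names:

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own instance of this variable. */
static _Thread_local int tls_value = 0;

/* Runs in the second thread. arg is the address of the creating
   thread's instance of tls_value, which is an ordinary pointer. */
static void *writer(void *arg)
{
    int *creators_instance = arg;
    *creators_instance = 42;  /* writes the creator's instance */
    tls_value = 7;            /* writes this thread's own instance */
    return NULL;
}

/* Takes the address of the caller's thread-local variable, hands it to
   another thread, and returns the caller's value afterwards. */
int cross_thread_write(void)
{
    pthread_t t;
    tls_value = 1;
    if (pthread_create(&t, NULL, writer, &tls_value) != 0)
        return -1;
    pthread_join(t, NULL);  /* join orders the write before the read */
    return tls_value;       /* 42: the other thread's store is visible */
}
```

The store through the passed pointer lands in the creator's instance,
which only works because all instances live in the one shared address
space.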
However, instead you could have the same virtual page mapped as
user mode read-only to a different physical frame on each core,
and that page contains a pointer to the thread header that is current
on this cpu. Since each cpu has a different physical frame backing
that same virtual page, each can store a different pointer value.
The OS thread switch code as part of thread activation
writes the pointer to its thread header at that address.
If a thread switch does not include a process switch,
this does not require a whole TLB shootdown; a single
virtual address invalidate for the pointer page would do.
Accessing, say, the local store array would then be:
    gblCurrentThreadHeadPtr->LocalStoreArray[index]
This is covered by access routines, so the changes are invisible
to applications.
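
Roughly what such access routines might look like, as a single-CPU
sketch in C. The plain global here stands in for the pointer stored in
the per-CPU page, and os_thread_switch, local_store_get and
local_store_set are hypothetical names, not any real OS API:

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_SLOTS 64

/* Simplified thread header: only the local store array is modelled. */
typedef struct ThreadHeader {
    void *LocalStoreArray[LOCAL_STORE_SLOTS];
} ThreadHeader;

/* Stand-in for the pointer in the per-CPU page. In the scheme above
   each CPU would back this address with its own physical frame. */
static ThreadHeader *gblCurrentThreadHeadPtr;

/* What the OS thread switcher would do on thread activation. */
void os_thread_switch(ThreadHeader *next)
{
    gblCurrentThreadHeadPtr = next;
}

/* Access routines that hide the mechanism from applications. */
void *local_store_get(size_t index)
{
    assert(index < LOCAL_STORE_SLOTS);
    return gblCurrentThreadHeadPtr->LocalStoreArray[index];
}

void local_store_set(size_t index, void *value)
{
    assert(index < LOCAL_STORE_SLOTS);
    gblCurrentThreadHeadPtr->LocalStoreArray[index] = value;
}
```

Application code only ever calls the get/set routines, so swapping an
FS-relative access for a load through this pointer would not change any
caller.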
On x86/x64, accessing the thread header via a pointer in memory,
even on a cache hit, costs more clocks than an FS-relative access.
But on an ISA without spare registers hanging about, it would suffice.
Eric