TLS (thread-local storage) in Java

Sergey Melnikov

Nov 21, 2016, 1:58:40 PM

to mechanica...@googlegroups.com

Hi,

I was digging into the ThreadLocal class implementation in Java, and it looks a bit heavyweight. Details:

In the best case it requires a few loads (the ThreadLocal.get and ThreadLocalMap.getEntry methods) to get a thread-specific value. In the worst case it has to do a lot of additional work (the ThreadLocalMap.getEntryAfterMiss method).
On Linux (I'm not sure about Windows, but I think it works the same way), the kernel maps the thread-local segment for the current thread via the FS segment register. So in C/C++ it's possible to read a value from TLS with only 1 (!) instruction:

10: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
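
For contrast with the single instruction above, the Java-side path can be modelled roughly as a per-thread open-addressed table that is probed by key. The sketch below is a simplified model, not the JDK source: `ProbeModel`, `Key` and `Table` are made-up names, the fixed table size of 16 is an assumption, and there is no resizing or weak-reference handling. It only shows why a hit costs a few dependent loads while a miss falls into a probing loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of a per-thread, open-addressed lookup table (not the JDK source). */
final class ProbeModel {
    /** Stand-in for a ThreadLocal instance: it only carries a spread-out hash code. */
    static final class Key {
        private static final AtomicInteger NEXT = new AtomicInteger();
        final int hash = NEXT.getAndAdd(0x61c88647);
    }

    /** Stand-in for the table that each thread would own. Never resized in this sketch. */
    static final class Table {
        final Key[] keys = new Key[16];
        final Object[] values = new Object[16];
        int probes; // number of slots inspected by get()

        void put(Key k, Object v) {
            int i = k.hash & (keys.length - 1);
            while (keys[i] != null && keys[i] != k) {
                i = (i + 1) & (keys.length - 1); // linear probing on collision
            }
            keys[i] = k;
            values[i] = v;
        }

        Object get(Key k) {
            int i = k.hash & (keys.length - 1);
            probes++;
            if (keys[i] == k) {
                return values[i];                // fast path: direct hit
            }
            while (keys[i] != null) {            // slow path: keep probing
                i = (i + 1) & (keys.length - 1);
                probes++;
                if (keys[i] == k) {
                    return values[i];
                }
            }
            return null;                         // key was never stored
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        Key a = new Key();
        Key b = new Key();
        table.put(a, "A");
        table.put(b, "B");
        System.out.println(table.get(a) + " " + table.get(b) + " probes=" + table.probes);
    }
}
```

Even the fast path here is index computation, array load, key comparison and value load, all after the thread object itself has been found.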

So, what do you think: is it worth having a lightweight TLS API (inspired by C/C++) in Java? Or maybe it's possible to implement it in the JVM right now?

--Sergey

Georges Gomes

Nov 21, 2016, 4:00:50 PM

to mechanica...@googlegroups.com

Hi, I would be pretty interested.

It would be nice if it could be intrinsified behind the existing API...

I think that ThreadLocals are very much used in "the field". (At least we use them intensively.)

The JDK implementation is not the fastest, and on top of that it is not very predictable/stable in terms of performance.

It's hard to have high-performance code paths using ThreadLocals... lots of references or cache misses or both...

I don't know but it hurts.

@Nitsan I think the MPSC queue of JCTools using threadlocal SPSC queues would be significantly faster ;)

My 2 cents

GG

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Aleksey Shipilev

Nov 21, 2016, 4:08:13 PM

to mechanica...@googlegroups.com

Hi,

On 11/21/2016 07:58 PM, Sergey Melnikov wrote:
> On Linux (I'm not sure about windows, but I think windows does it the
> same way), kernel maps thread-local segment for current thread via FS
> segment register. So, in C/C++ it's possible to get value from TLS
> with only 1 (!) instruction:

> 10: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax

This is JVM- and platform-specific, but HotSpot x86_64 does keep the TLS
pointer in a register:
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/225b91f1b118/src/cpu/x86/vm/x86_64.ad
-- you could do the same for 32-bit and FS segment.

> So, what do you think, is it worth to have a lightweight TLS API
> (inspired by C/C++) in Java? Or may be it's possible to implement it
> in JVM right now?

Note that java.lang.ThreadLocal and native thread TLS are beasts from
different worlds. Mapping one to the other would require passing
through the JDK<->VM boundary in several places.

An obvious idea would be reserving the indexed slot per each ThreadLocal
instance's value in native TLS, right? Then you can "just" poll
ThreadLocal.idx, and do mov 0x$idx(%r15), %dst on read -- voila! The
same would go if we ditch ThreadLocal and do straight TLS.{get|set}(int
idx, T val).
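
The indexed-slot idea can be imitated in plain Java to see its shape. This is only a sketch under stated assumptions: `SlotThread` and `SlotLocal` are hypothetical names, the slot array lives on a Thread subclass rather than in native TLS, and nothing here is what a VM would actually emit.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

final class SlotDemo {
    /** Runs two threads sharing one SlotLocal and returns what each of them read back. */
    static String run() {
        SlotLocal<String> name = new SlotLocal<>();
        String[] seen = new String[2];
        SlotThread a = new SlotThread(() -> { name.set("a"); seen[0] = name.get(); });
        SlotThread b = new SlotThread(() -> { seen[1] = name.get(); }); // never set in this thread
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0] + "," + seen[1];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

/** Hypothetical thread type carrying a slot array that stands in for native TLS. */
class SlotThread extends Thread {
    Object[] slots = new Object[4];

    SlotThread(Runnable r) {
        super(r);
    }
}

/** Hypothetical ThreadLocal-like handle: every instance owns one process-wide slot index. */
final class SlotLocal<T> {
    private static final AtomicInteger NEXT_IDX = new AtomicInteger();
    private final int idx = NEXT_IDX.getAndIncrement();

    @SuppressWarnings("unchecked")
    T get() {
        Object[] s = ((SlotThread) Thread.currentThread()).slots;
        return idx < s.length ? (T) s[idx] : null; // read is one indexed load once the thread is known
    }

    void set(T value) {
        SlotThread t = (SlotThread) Thread.currentThread();
        if (idx >= t.slots.length) {
            // The index is global, so each thread's array must grow to cover it.
            t.slots = Arrays.copyOf(t.slots, Math.max(idx + 1, t.slots.length * 2));
        }
        t.slots[idx] = value;
    }
}
```

Because the index is allocated globally, every thread's array has to be able to cover it, whether or not that thread ever uses the slot.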

But then complications emerge: you basically want to store the Java
references in native TLS storage. Which means you need to make sure it
works nicely with GC: e.g. slots get recycled properly when ThreadLocal
objects die, GC detects the reachability via TLS slots (which probably
requires TLS to get scanned as part of rootset now? or some special case
in all GCs?), all barriers for stores and loads are in place, etc.

Then, you'd realize doing this from Java is complicated, because user
code has no business roaming around native TLS where some interesting
VM-specific things lie (e.g. GC/runtime flags, queues, etc). You might
probably poll VM about thread specifics, and what is available and what
is not. Coexistence would be interesting, because there is already one
heavy TLS user -- the JVM itself.

Not to mention that instantiating the ThreadLocal now has to have global
effects on all TLSes, because slots must match between the threads. (One
of those nice properties of thread-local Map<ThreadLocal, V> map is that
I can init a ThreadLocal from a single thread only, with no cost to
other threads).
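
That per-thread laziness is easy to observe with the existing API. In the sketch below (the class name `LazyInitDemo` is made up for illustration), three threads run but only one touches the ThreadLocal, and the counter records how often the initializer executes.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LazyInitDemo {
    /** Returns how many times the initializer ran when one of three threads used the ThreadLocal. */
    static int run() {
        AtomicInteger inits = new AtomicInteger();
        ThreadLocal<StringBuilder> buf = ThreadLocal.withInitial(() -> {
            inits.incrementAndGet();
            return new StringBuilder();
        });

        Thread user = new Thread(() -> buf.get().append("x")); // the only thread that touches buf
        Thread idle1 = new Thread(() -> { });
        Thread idle2 = new Thread(() -> { });
        Thread[] all = {user, idle1, idle2};
        for (Thread t : all) {
            t.start();
        }
        try {
            for (Thread t : all) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return inits.get();
    }

    public static void main(String[] args) {
        System.out.println("initializer runs: " + run());
    }
}
```

The initializer runs once, in the single thread that called get(); the idle threads and the launching thread never create an entry.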

After all that bi-directional thing is done, you'd need to prove this
works equally well across all other architectures OpenJDK supports ;)

This is to say the whole ordeal is not as easy as it might sound. "Just
do the intrinsics for them!" is not gonna cut it.

Thanks,
-Aleksey


Всеволод Толстопятов

Nov 21, 2016, 4:27:47 PM

to mechanica...@googlegroups.com

Hi,

Why do you need such a low-level abstraction? It doesn't look like a good candidate for the standard library/JVM (Aleksey already wrote about this while I was writing my reply :)).

In HotSpot, the pointer to the native thread object is always stored in the %r15 register (on x86 at least), and Thread.currentThread() is just a memory read from the address in %r15 at a constant offset. So if you actually care about performance *that* much, you can provide your own single-key TLS implementation by subclassing Thread, e.g.:

class MyThread extends Thread {
    public int tls = 42;
    ...
}

private int getTls() {
    return ((MyThread) Thread.currentThread()).tls;
}

getTls will be compiled to five instructions:

mov 0x1d0(%r15),%r10 ;*invokestatic currentThread
mov 0x8(%r10),%r11d ; get class pointer from Thread header
cmp $0xf800c14d,%r11d ; compare it with MyThread class pointer
jne 0x0000000104f111f3 ; throw CCE
mov 0x178(%r10),%eax ;*getfield tls

But I doubt the performance gain will be visible at all.
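
A self-contained version of the subclassing trick, for anyone who wants to try it. The wrapper class `SubclassTlsDemo` and the `run` helper are illustrative additions; only `MyThread`, `tls` and `getTls` come from the snippet above.

```java
final class SubclassTlsDemo {
    /** Thread subclass whose plain field acts as a single-key thread-local. */
    static final class MyThread extends Thread {
        int tls = 42;

        MyThread(Runnable r) {
            super(r);
        }
    }

    /** Reads the current thread's value; throws ClassCastException on any other thread type. */
    static int getTls() {
        return ((MyThread) Thread.currentThread()).tls;
    }

    /** Starts one MyThread, bumps its private value and returns what getTls() saw there. */
    static int run() {
        int[] out = new int[1];
        MyThread t = new MyThread(() -> {
            ((MyThread) Thread.currentThread()).tls += 1;
            out[0] = getTls();
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The cast is the price of this approach: it only works on threads you create yourself, so code running on pool threads or the main thread needs a fallback.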

--

Best regards,

Tolstopyatov Vsevolod


Sergey Melnikov

Nov 21, 2016, 4:51:07 PM

to mechanica...@googlegroups.com

Hi Vsevolod,

Thank you for the idea of using a Thread subclass across my application.

> Why do you need such a low level abstraction? It doesn't look like a good
> candidate to standard library/

Yes, it's not the best candidate for the standard library. I'd just like to point out that TLS in Java may not be as efficient as it could be (and as it is at the OS level).

--Sergey


Sergey Melnikov

Nov 21, 2016, 5:45:02 PM

to mechanica...@googlegroups.com

Hi,

> An obvious idea would be reserving the indexed slot per each ThreadLocal
> instance's value in native TLS, right? Then you can "just" poll
> ThreadLocal.idx, and do mov 0x$idx(%r15), %dst on read -- voila! The
> same would go if we ditch ThreadLocal and do straight TLS.{get|set}(int
> idx, T val).

Yes, you are right.

Let's start from something like this:

class Tls {
    static class TlsKey {
        private int key;
    }

    public static Object getTlsObject(TlsKey k);
    public static void setTls(TlsKey k, Object obj);

    public static TlsKey claimKey() throws NoSufficientTls;
    public static void releaseKey(TlsKey k);
}

I'm not as familiar with the OpenJDK sources as you are, so please correct me if I'm wrong.

- The user claims a TLS key via the static function claimKey.
- Only a designated region of the low-level TLS slots may be used.
- To get the value for a TLS key, we do a bounds check that the key belongs to the user region. If it does, we just return the reference from TLS.
- To set a value for a TLS key, we do the bounds check, create a Global Ref for the object (in JNI terms) and store it in the TLS slot. If the slot already contains a reference, we release that global reference.
- Therefore, the lifecycle of Java objects in a thread's TLS follows the lifecycle of the Thread only. After the Thread.run method has finished, we can go through all TLS slots and delete the global reference for each object.
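
To make the semantics of the steps above concrete, here is a pure-Java emulation. It is a sketch only: the slot region is backed by an ordinary ThreadLocal array instead of native TLS (so it says nothing about speed), `USER_SLOTS` is an assumed size, NoSufficientTls is made unchecked for brevity, and there are no Global Refs because the objects never leave the Java heap.

```java
import java.util.BitSet;

final class Tls {
    static final int USER_SLOTS = 8; // assumed size of the user-visible region

    static final class TlsKey {
        private final int key;

        private TlsKey(int key) {
            this.key = key;
        }
    }

    static final class NoSufficientTls extends RuntimeException {
        NoSufficientTls() {
            super("no free TLS slot in the user region");
        }
    }

    private static final BitSet CLAIMED = new BitSet(USER_SLOTS);

    // Stand-in for the per-thread native TLS block.
    private static final ThreadLocal<Object[]> SLOTS =
            ThreadLocal.withInitial(() -> new Object[USER_SLOTS]);

    static synchronized TlsKey claimKey() {
        int k = CLAIMED.nextClearBit(0);
        if (k >= USER_SLOTS) {
            throw new NoSufficientTls();
        }
        CLAIMED.set(k);
        return new TlsKey(k);
    }

    static synchronized void releaseKey(TlsKey k) {
        // Only the index is freed; values stored by any thread under it are NOT cleared here.
        CLAIMED.clear(check(k));
    }

    static Object getTlsObject(TlsKey k) {
        return SLOTS.get()[check(k)];
    }

    static void setTls(TlsKey k, Object obj) {
        SLOTS.get()[check(k)] = obj;
    }

    /** Bounds check: the key must fall inside the user region. */
    private static int check(TlsKey k) {
        if (k.key < 0 || k.key >= USER_SLOTS) {
            throw new IndexOutOfBoundsException("key outside user region");
        }
        return k.key;
    }

    public static void main(String[] args) {
        TlsKey k = claimKey();
        setTls(k, "hello");
        System.out.println(getTlsObject(k));
        releaseKey(k);
    }
}
```

Note what the emulation leaves open: releaseKey frees the index but cannot reach into other threads to clear what they stored, so a re-claimed key may observe stale values.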

> After all that bi-directional thing is done, you'd need to prove this
> works equally well across all other architectures OpenJDK supports ;)

Surely, it would be the most difficult part of task if someone did it ;-)

> This is to say the whole ordeal is not as easy as it might sound. "Just
> do the intrinsics for them!" is not gonna cut it.

I'm not saying "Hey, let's implement it!". My point is: if something can be made fast (especially if it is widely used in low-latency/high-performance applications), why not just talk about it? Maybe someone relevant will become interested.

--Sergey

Aleksey Shipilev

Nov 22, 2016, 10:26:15 AM

to mechanica...@googlegroups.com

On 11/21/2016 11:44 PM, Sergey Melnikov wrote:
> Let's start from something like this:
>
> class Tls {
> static class TlsKey {
> private int key;
> }
>
> public static Object getTlsObject(TlsKey k);
> public static void setTls(TlsKey k, Object obj);
>
> public static TlsKey claimKey() throws NoSufficientTls;
> public static void releaseKey(TlsKey k);
> }

This is a very special API, and it would be hard to justify its
inclusion into the JDK. On the other hand, it would be hard to do in a
high-performance manner without compiler support (or at least before
Panama arrives and allows machine-code snippets).

Also, it brings all the peculiarities of manual memory management: e.g.
memory leaks through dangling TLS slots (which you might have also lost
the TlsKey for, oops).

> - If user wants to set a value for specified TLS key, we make bound
> check, create Global Ref for object (in terms of JNI) and set value
> to specified TLS item. If TLS item already contains reference, we
> should release global reference.

Using JNI Global Refs is a cute trick, but it does not work as one would
naively think. You cannot store an "ordinary object pointer" (oop) into
unmanaged memory without telling GC about it. You can store a Global Ref
itself into TLS. But that gets you almost to square one: there is a load
of Global Ref from TLS slot, and _then_ the load of oop from the Global
Ref.

These are already two loads that you have with subclassing a Java
thread, without all this TLS madness. I do think that memory-access-wise
you can get the performance improvement if you are able to store the
oops straight in the TLS, which requires heavy VM support.

>> This is to say the whole ordeal is not as easy as it might sound. "Just
>> do the intrinsics for them!" is not gonna cut it.

> I don't want to say "Hey, let's implement it!". My point is if
> something may be done fast (especially, if this is widely used in
> low-latency/high performance applications), why not just to talk about
> it? May be someone relevant become interested.

My point stands on "it cannot be made faster than alternatives without
heavy VM support". Unfortunately.

Thanks,
-Aleksey
