Fwd: Lazy vals in dotty


Dark_Dimius

unread,
Feb 15, 2014, 4:15:40 PM2/15/14
to scala-i...@googlegroups.com
Hi all,

I'm working on the lazy vals implementation for Dotty, and I'm considering implementing them differently from their current implementation in scalac.
There are 2 implementation details in scalac that I want to overcome:
1) lazy vals in Scala can lead to deadlocks, even if there are no circular dependencies between them;
2) concurrent initialization of different lazy vals in the same object isn't possible: all initializers are synchronized on the class instance.
Both issues are explained in detail in SIP-20.

My implementation prototype is based on Version4-CAS by Aleksandar Prokopec, with 2 differences:
1) if initialization fails, the bitmap status for the current field is reset and other threads will try to initialize it once again. This is also the default behaviour of the current scalac lazy vals.
2) instead of locking on 'this', the monitor is taken from a per-application global array. This is done to make sure that the user can't share a synchronization object with the lazy val. Because Version4-CAS releases the monitor while user-controlled code is executing, a user who can access the same monitor as the one used by the lazy val could create a deadlock, as in this example:
class A { self =>
  lazy val x: Int = {
    (new Thread() {
      override def run() = self.synchronized { while (true) {} }
    }).start()
    1
  }
}
This code won't deadlock with the scalac implementation, since lazy val initialization holds the monitor on self until initialization succeeds.
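For readers unfamiliar with the scheme, here is a minimal single-field sketch of the CAS-based idea with the two differences above (class name, state encoding and the spin-wait are hypothetical simplifications; the real prototype parks waiters on a monitor from the global array instead of spinning):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch of the Version4-CAS idea for one lazy field.
// States: 0 = uninitialized, 1 = initializer running, 2 = initialized.
final class LazyCell(init: => Int) {
  private val state = new AtomicInteger(0)
  private var value: Int = _

  def get: Int = {
    while (state.get != 2) {
      if (state.compareAndSet(0, 1)) {
        try {
          value = init          // user code runs with no monitor held
          state.set(2)          // volatile write publishes `value`
        } catch {
          case e: Throwable =>
            state.set(0)        // difference 1: reset so other threads retry
            throw e
        }
      } else Thread.`yield`()   // another thread is initializing; wait
    }
    value
  }
}
```

Reading `value` only after observing state 2 via the volatile read gives the required happens-before edge, so no lock is needed on the fast path.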
 

The prototype behaves better in both the contended and the single-threaded scenario; see "lazy-simulation-D3Try" on the graphs. Benchmark sources: uncontended, contended.

It should be noted that the proposed version behaves differently in case of circular dependencies. Consider this code:
object A {
  lazy val a = B.b
}
object B {
  lazy val b = A.a
}


In the current scalac implementation, if a single thread tries to initialize A.a, it fails with a StackOverflowError. In the provided prototype the thread will deadlock itself. I believe such code is inherently broken by design, and I see no way to maintain the previous behavior without sacrificing performance. I would like to hear some feedback from the community on this issue.

There's also the idea of giving the user a way to request the simplest implementation, which is not thread-safe. As can be seen on the graph, it is substantially faster. It should be noted, though, that such vals are even less thread-safe than normal vals, because the JVM always issues a memory barrier at the end of a constructor.
Another question is how such 'thread-unsafe lazy vals' should be flagged by the user. From the 'keyword' point of view, we could say that if the user wants a thread-safe lazy val, he should write

@volatile lazy val a = 1 // thread-safe

while writing just

lazy val a = 1

will give a thread-unsafe implementation. However, I believe we should leave @volatile behaviour as the default, following the principle of least surprise, but I haven't come up with an annotation that has a non-misleading name for such 'thread-unsafe lazy vals'.
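To make the distinction concrete, here is roughly what the two expansions could look like (class and field names are hypothetical, and the thread-safe variant is simplified to plain double-checked locking):

```scala
// Hypothetical expansions of `lazy val x = computeX()`.
class SafeHolder {
  @volatile private var bitmap = false
  private var x0: Int = _
  def x: Int = {                 // thread-safe: double-checked locking
    if (!bitmap) synchronized {
      if (!bitmap) { x0 = computeX(); bitmap = true }
    }
    x0
  }
  private def computeX(): Int = 42
}

class UnsafeHolder {
  private var bitmap = false     // no volatile, no locking: racing threads
  private var x0: Int = _        // may run computeX() twice, or observe a
  def x: Int = {                 // half-published value
    if (!bitmap) { x0 = computeX(); bitmap = true }
    x0
  }
  private def computeX(): Int = 42
}
```

The unsafe variant is just a check and a field read, which is why it is so much faster on the graphs.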


Cheers,
Dmitry


Nils Kilden-Pedersen

unread,
Feb 15, 2014, 6:10:21 PM2/15/14
to scala-i...@googlegroups.com
+1000

I have this on my private Scala wish list.
 




--
You received this message because you are subscribed to the Google Groups "scala-internals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-interna...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Simon Ochsenreither

unread,
Feb 15, 2014, 7:30:10 PM2/15/14
to scala-i...@googlegroups.com
I'd prefer just having one version of lazy. If it is slower, because it's thread-safe, then so be it.
Having two lazies is not worth the trouble and confusion imho.

Rex Kerr

unread,
Feb 15, 2014, 7:57:43 PM2/15/14
to scala-i...@googlegroups.com
@noThreads

I think it's pretty clear what the assumptions are with that name.

  --Rex



Dmitry Petrashko

unread,
Feb 15, 2014, 8:14:35 PM2/15/14
to scala-i...@googlegroups.com, dotty-i...@googlegroups.com
Just realized that the server I was serving the graphs from isn't accessible from the Internet.
Here are the updated links:
Graphs: http://tnc.d-d.me/30/lazy/report/
and in particular thread-unsafe-volatile performance: http://goo.gl/ZySoex

Ryan LeCompte

unread,
Feb 15, 2014, 8:19:37 PM2/15/14
to scala-i...@googlegroups.com, dotty-i...@googlegroups.com
While we're talking about Dotty --

Neat LRUCache by @odersky in the upcoming Dotty Scala compiler, based
on sixteen nibbles!

https://github.com/lampepfl/dotty/blob/master/src/dotty/tools/dotc/util/SixteenNibbles.scala
https://github.com/lampepfl/dotty/blob/master/src/dotty/tools/dotc/util/LRUCache.scala

Ryan

virtualeyes

unread,
Feb 16, 2014, 3:45:49 AM2/16/14
to scala-i...@googlegroups.com, dotty-i...@googlegroups.com
So, der Dotty cometh?

Vis-à-vis @paulp and his critiques of Scala in its current state, what effect will Dotty have on the language?

I can't imagine Dotty going mainstream, if it happens, before 2016/2017, but I'm very curious about its potential impact on the Scala ecosystem.

Simon Ochsenreither

unread,
Feb 16, 2014, 8:45:02 AM2/16/14
to scala-i...@googlegroups.com, dotty-i...@googlegroups.com

Vis-à-vis @paulp and his critiques of Scala in its current state, what effect will Dotty have on the language?

I can't imagine Dotty going mainstream, if it happens, before 2016/2017, but I'm very curious about its potential impact on the Scala ecosystem.

I kind of hope that those responsible manage to merge improvements gradually, so that we don't suffer yet another upgrade nightmare (like Python 2/Python 3).
Thinking further about this, I'd really like to drop the "2." from Scala's version number ... its effect is kind of insane from a marketing POV anyway.

Grzegorz Kossakowski

unread,
Feb 16, 2014, 8:56:55 AM2/16/14
to scala-internals, dotty-i...@googlegroups.com
Dotty is just an exploration in improving both Scala's type system and compiler design. It's an experiment that doesn't have to result in Scala 3 (especially not the implementation that just got open-sourced). It's really too early to talk about the possible impact of dotty or about migration strategies.

Let's stick to technical discussions for now. E.g. better implementation of lazy vals :)

--
Grzegorz Kossakowski
Scalac hacker at Typesafe
twitter: @gkossakowski

Nils Kilden-Pedersen

unread,
Feb 16, 2014, 8:56:46 AM2/16/14
to scala-i...@googlegroups.com
On Sat, Feb 15, 2014 at 6:30 PM, Simon Ochsenreither <simon.och...@gmail.com> wrote:
I'd prefer just having one version of lazy. If it is slower, because it's thread-safe, then so be it.
Having two lazies is not worth the trouble and confusion imho.

Do you feel the same way about variables in general? I.e. should they all just be implicitly volatile, because it's too confusing otherwise?

Juha Heljoranta

unread,
Feb 16, 2014, 9:15:08 AM2/16/14
to scala-i...@googlegroups.com

On Sunday, February 16, 2014 05:45:02 Simon Ochsenreither wrote:
> Thinking further about this, I'd really like to drop the "2." from Scala's
> version number ... the effect from it is kind of insane from a marketing
> POV anyway.

Or like Java did it: version 1.6.0 is *marketed* as Java SE 6.

Cheers,
Juha


Simon Ochsenreither

unread,
Feb 16, 2014, 9:45:38 AM2/16/14
to scala-i...@googlegroups.com, juha.he...@iki.fi

Or like Java did it: version 1.6.0 is *marketed* as Java SE 6.

Yes ... currently we have the worst of both worlds.

Just imagine how Scala's versioning scheme looks to new people:

  • Yes, we happily break compatibility in pretty much every minor version ...
  • ... but hey, we have reserved the major part of the version number for times we really want to break things, so you have basically seen nothing yet!
  • Ehhh ... why do you want to stay with Java?!

Nils Kilden-Pedersen

unread,
Feb 16, 2014, 10:03:39 AM2/16/14
to scala-i...@googlegroups.com

On Sat, Feb 15, 2014 at 6:57 PM, Rex Kerr <ich...@gmail.com> wrote:

@noThreads

I think it's pretty clear what the assumptions are with that name.

It’s technically wrong though. Although I prefer having to use @volatile, if we must stay backwards compatible, a better name would be @singleThread.

Chris Hodapp

unread,
Feb 16, 2014, 12:11:31 PM2/16/14
to scala-i...@googlegroups.com, ni...@kilden-pedersen.net
@threadUnsafe is more accurate than either @singleThread or @noThreads. @threadUnsafeInit is probably more accurate still, but it's kind of unwieldy.

Rex Kerr

unread,
Feb 16, 2014, 6:25:52 PM2/16/14
to scala-i...@googlegroups.com, Nils Kilden-Pedersen
@threadUnsafe is fine also.  It describes a state of being, while @noThreads is more of a command regarding how one should use it.  I don't think either is more accurate in a meaningful way, though the former is literally closer to the truth.  ("No threads?  Why not?  Obviously it can't mean literally none, as you can't compute anything, so it must mean no _other_ threads.  But then the only thing that differs is thread safety.  Ah, I bet it's not safe." or "Thread unsafe?  Well, I can use it here; this is all single-threaded.")

Anyway, I do think this would be great to have under any name, if the benchmarks are accurate, as there is a huge difference in performance.  But the benchmark is measuring two things: loading from a by-name parameter _and_ the synchronized block.  Since most of the work is in creation, this makes a big difference.  Maybe the whole thing.

Can you run it again where Cell is

  class Cell(x: => Int)

to get a fair comparison?
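For concreteness, the two Cell shapes being compared would be something like this (names hypothetical); the by-name version pays for a Function0 allocation and an indirect call on every access, on top of the lazy logic itself:

```scala
// By-name: the initializer is captured as a thunk and re-evaluated on access.
class CellByName(x: => Int) { def get: Int = x }

// Non-by-name: the value is computed once, before the constructor runs.
class CellEager(x: Int) { def get: Int = x }
```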

  --Rex

Naftoli Gugenheim

unread,
Feb 16, 2014, 6:27:22 PM2/16/14
to scala-internals, Nils Kilden-Pedersen
I think at one point there was discussion about making the lazy val implementation pluggable. Whatever happened with that?

Chris Hodapp

unread,
Feb 16, 2014, 9:42:44 PM2/16/14
to scala-i...@googlegroups.com, Nils Kilden-Pedersen
The reason "noThreads" is inaccurate is that you really are not barred from using the thing on multiple threads, you just need to do something to make it safe (e.g. locks, planning, etc.).

I agree that, discussion of naming aside, the feature sounds great.

Rex Kerr

unread,
Feb 16, 2014, 9:51:11 PM2/16/14
to scala-i...@googlegroups.com, Nils Kilden-Pedersen
I want to see a revised benchmark where the initialization is properly lazy before I commit to agreeing that the feature sounds great.  By-name parameters aren't free either, and right now the benchmark mixes the two issues.

There should be head to head non-by-name tests, and head to head by-name tests.

*If* the performance wins hold in the uncontested case *then* I think it's great.

  --Rex

√iktor Ҡlang

unread,
Feb 17, 2014, 7:19:50 AM2/17/14
to scala-i...@googlegroups.com
Hi Dmitry,

using global (even if striped) locks for lazy vals is _extremely_ unsatisfactory (especially when combined with identityHashCode, as it requires the hash code to be stored in the object header due to compaction by the GC (at least this is what I was told)).

Using global locks will make lazy vals basically unusable for multithreaded systems, as seemingly unrelated things will contend with each other, creating nondeterministic contention that depends on TLAB address ranges and allocation rates between runs.

Some ideas come to mind:
1) deadlock detection + blow up
2) being able to specify what monitor to use for lazy vals

Have you looked into either of those and what does that look like?







--
Cheers,

———————
Viktor Klang
Chief Architect - Typesafe

Twitter: @viktorklang

martin odersky

unread,
Feb 17, 2014, 7:37:53 AM2/17/14
to scala-internals
On Mon, Feb 17, 2014 at 1:19 PM, √iktor Ҡlang <viktor...@gmail.com> wrote:
Hi Dmitry,

using global (even if striped) locks for lazy vals is _extremely_ unsatisfactory (especially when used with identityHashCode, as it requires the hashCode to be stored in the object header due to compaction by GC (at least this is what I was told)).

Using global locks will make lazy vals basically unusable for multithreaded systems as seemingly unrelated things will create contention–creating nondeterministic contention due to TLAB address ranges and allocation rates between runs.

Some ideas come to mind:
1) deadlock detection + blow up
2) being able to specify what monitor to use for lazy vals

Have you looked into either of those and what does that look like?


The simpler alternative would be to use the object containing the lazy val itself for the locking. The only disadvantage is then that user code that takes a lock on this same object can mess up things. So we would have to state a policy that objects containing lazy vals may not be used as locks in user code themselves. I am personally OK with that and think it might be more predictable than throwing too much magic with global lock pools at the problem. What do others think?

Cheers

 - Martin



--
Martin Odersky
EPFL

Jason Zaugg

unread,
Feb 17, 2014, 7:43:21 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 1:37 PM, martin odersky <martin....@epfl.ch> wrote:
The simpler alternative would be to use the object containing the lazy val itself for the locking. The only disadvantage is then that user code that takes a lock on this same object can mess up things. So we would have to state a policy that objects containing lazy vals may not be used as locks in user code themselves. I am personally OK with that and think it might be more predictable than throwing too much magic with global lock pools at the problem. What do others think?

I once prototyped a scheme whereby an object could declare a member `__lazyLock` which would be used in preference to `this` as the monitor for all lazy vals inside it.

-jason

Jason Zaugg

unread,
Feb 17, 2014, 7:45:41 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 1:43 PM, Jason Zaugg <jza...@gmail.com> wrote:
The simpler alternative would be to use the object containing the lazy val itself for the locking. The only disadvantage is then that user code that takes a lock on this same object can mess up things. So we would have to state a policy that objects containing lazy vals may not be used as locks in user code themselves. I am personally OK with that and think it might be more predictable than throwing too much magic with global lock pools at the problem. What do others think?

I once prototyped a scheme whereby an object could declare a member `__lazyLock` which would be used in preference to `this` as the monitor for all lazy vals inside it.

We could also change the defaults a bit, and synthesize that member as a val containing a fresh object. If you don't want to burn the extra field/allocation, and are happy to live with the risk of other code contending on the lock, you could define it manually to return `this`.
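A sketch of what that synthesized member could look like (`__lazyLock` is the name from the prototype; the expansion shown is a simplified double-checked-locking one):

```scala
class C {
  // Synthesized default: a fresh private lock object per instance.
  // To avoid the extra field/allocation, a user could override this
  // to return `this`, accepting the risk of external contention.
  protected val __lazyLock: AnyRef = new Object

  @volatile private var bitmap = false
  private var x0: Int = _
  def x: Int = {                       // expansion of `lazy val x = 42`
    if (!bitmap) __lazyLock.synchronized {
      if (!bitmap) { x0 = 42; bitmap = true }
    }
    x0
  }
}
```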

-jason

√iktor Ҡlang

unread,
Feb 17, 2014, 7:46:36 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 1:37 PM, martin odersky <martin....@epfl.ch> wrote:
The simpler alternative would be to use the object containing the lazy val itself for the locking. The only disadvantage is then that user code that takes a lock on this same object can mess up things. So we would have to state a policy that objects containing lazy vals may not be used as locks in user code themselves. I am personally OK with that and think it might be more predictable than throwing too much magic with global lock pools at the problem. What do others think?

That is the current situation, right? I'm also thinking about the case where you have multiple lazy vals in the same object that are not interdependent, and you may want to avoid locking on the same monitor.

√iktor Ҡlang

unread,
Feb 17, 2014, 7:47:30 AM2/17/14
to scala-i...@googlegroups.com
Can we make something where you can specify different monitors for different (independent) lazy vals in the same instance?
 



Jason Zaugg

unread,
Feb 17, 2014, 7:50:44 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 1:47 PM, √iktor Ҡlang <viktor...@gmail.com> wrote:

We could also change the defaults a bit, and synthesize that member as a val containing a fresh object. If you don't want to burn the extra field/allocation, and are happy to live with the risk of other code contending on the lock, you could define it manually to return `this`.

Can we make something where you can specify different monitors for different lazy vals (independent) in the same instance.

In my prototype, I didn't support that by design. All lazy vals in an object shared the initialized bitmap, so they had to agree on the locking strategy. But perhaps the constraints are different with the newer proposals.

-jason

Grzegorz Kossakowski

unread,
Feb 17, 2014, 7:56:03 AM2/17/14
to scala-internals
On 17 February 2014 13:47, √iktor Ҡlang <viktor...@gmail.com> wrote:

Can we make something where you can specify different monitors for different lazy vals (independent) in the same instance.

I'm wondering if the proposals that allow customization of lazy vals aren't going a bit too far. How often do you need to specify a different monitor for a lazy val? Would it be OK to just roll your own implementation of the lazy val in such cases?

√iktor Ҡlang

unread,
Feb 17, 2014, 7:59:27 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 1:56 PM, Grzegorz Kossakowski <grzegorz.k...@gmail.com> wrote:
On 17 February 2014 13:47, √iktor Ҡlang <viktor...@gmail.com> wrote:

Can we make something where you can specify different monitors for different lazy vals (independent) in the same instance.

I'm wondering if the proposals that allow customization of lazy val are not going a bit too far? How often do you need to specify different monitors for lazy vals? Would it be ok to just roll your own implementation of the lazy val in such cases?

That's a fair point; perhaps we just need a Lazy[T] for that purpose.
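Such a Lazy[T] could be a small library class along these lines (a hypothetical sketch using double-checked locking on the wrapper itself, so the monitor is isolated from the enclosing object):

```scala
// Hypothetical library-level lazy cell: the wrapper object itself is the
// monitor, so the enclosing instance stays free for user locking.
final class Lazy[T](init: => T) {
  @volatile private var computed = false
  private var value: T = _

  def apply(): T = {
    if (!computed) synchronized {
      if (!computed) { value = init; computed = true }
    }
    value   // safe: the volatile read of `computed` publishes `value`
  }
}
```

A field would then be declared as `val x = new Lazy(expensive())` and read with `x()`, at the cost of `x` no longer being a stable identifier.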
 


martin odersky

unread,
Feb 17, 2014, 8:00:52 AM2/17/14
to scala-internals
On Mon, Feb 17, 2014 at 1:56 PM, Grzegorz Kossakowski <grzegorz.k...@gmail.com> wrote:
On 17 February 2014 13:47, √iktor Ҡlang <viktor...@gmail.com> wrote:

Can we make something where you can specify different monitors for different lazy vals (independent) in the same instance.

I'm wondering if the proposals that allow customization of lazy val are not going a bit too far? How often do you need to specify different monitors for lazy vals? Would it be ok to just roll your own implementation of the lazy val in such cases?

I tend to agree. In any case, if there is only one class containing the lazy val, the __lazyLock workaround does not seem to buy us much.

Instead of

   class C {
      val __lazyLock = new Object
      lazy val x = ...
   }

one could equally well write

   class C {
      object lazies { lazy val x = ... }
   }

In both cases, the C instances would then be free to take locks.

Cheers

 - Martin

 



--
Martin Odersky
EPFL

√iktor Ҡlang

unread,
Feb 17, 2014, 8:03:59 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 2:00 PM, martin odersky <martin....@epfl.ch> wrote:




I tend to agree. In any case, if there is only one class containing the lazyval, the __lazyLock workaround does not seem to buy us much.

Instead of

   class C {
      val __lazyLock = new Object
      lazy val x = ...
   }

one could equally well write

   class C {
      object lazies { lazy val x = ... }

you'd need:
           def x = lazies.x
 
   }

In both cases, the C instances would then be free to take locks.


Jason Zaugg

unread,
Feb 17, 2014, 8:06:15 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 2:00 PM, martin odersky <martin....@epfl.ch> wrote:
I tend to agree. In any case, if there is only one class containing the lazyval, the __lazyLock workaround does not seem to buy us much.

Instead of

   class C {
      val __lazyLock = new Object
      lazy val x = ...
   }

one could equally well write

   class C {
      object lazies { lazy val x = ... }
   }

In both cases, the C instances would then be free to take locks.

The motivating use case was for the cake pattern, where you want the entire cake to share a lazy lock so as to avoid deadlocks. This was for the compiler cake.

Instead, we eagerly force *every* lazy val during startup of the reflection cake: https://github.com/scala/scala/blob/master/src/reflect/scala/reflect/runtime/JavaUniverseForce.scala

In any case, this problem seems to be addressed more directly by not holding the lock during lazy val initialization, and instead using the three-state initialization.

-jason 

√iktor Ҡlang

unread,
Feb 17, 2014, 8:09:10 AM2/17/14
to scala-i...@googlegroups.com
Oh, a small thing: don't look up Unsafe yourself; delegate to scala.concurrent's Unsafe lookup (your version doesn't work on Android, for instance).



Jason Zaugg

unread,
Feb 17, 2014, 8:10:00 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 2:03 PM, √iktor Ҡlang <viktor...@gmail.com> wrote:

   class C {
      object lazies { lazy val x = ... }

you'd need:
           def x = lazies.x

...and to make `lazies` private, so others can't lock on it.

That `def x` part actually highlights a problem with DIY lazies: they can't be stable values. This limits their usefulness in cakes.

It would be nice to let us annotate a def as stable. (Internally, the compiler already marks certain synthetic vars as stable, such as the ones that back lazy vals.)

-jason

Dmitry Petrashko

unread,
Feb 17, 2014, 9:10:03 AM2/17/14
to scala-i...@googlegroups.com
On 17 February 2014 03:51, Rex Kerr <ich...@gmail.com> wrote:
I want to see a revised benchmark where the initialization is properly lazy before I commit to agreeing that the feature sounds great.  By-name parameters aren't free either, and right now the benchmark mixes the two issues.

There should be head to head non-by-name tests, and head to head by-name tests.
Thanks for pointing this out!
Here are the updated graphs: http://tnc.d-d.me/30/lazy1/report/
and here is the comparison for the uncontended case: http://goo.gl/uhWlu5 The lower line is the non-call-by-name variant and is indeed 2 times faster than call-by-name. But the speed difference is still substantial.

--
You received this message because you are subscribed to a topic in the Google Groups "scala-internals" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/scala-internals/4sjw8pcKysg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to scala-interna...@googlegroups.com.

Dmitry Petrashko

unread,
Feb 17, 2014, 9:28:34 AM2/17/14
to scala-i...@googlegroups.com
On 17 February 2014 13:19, √iktor Ҡlang <viktor.klang@gmail.com> wrote:
Hi Dmitry,

using global (even if striped) locks for lazy vals is _extremely_ unsatisfactory (especially when used with identityHashCode, as it requires the hashCode to be stored in the object header due to compaction by GC (at least this is what I was told)).

Using global locks will make lazy vals basically unusable for multithreaded systems as seemingly unrelated things will create contention–creating nondeterministic contention due to TLAB address ranges and allocation rates between runs.
Hi Viktor,
That's true; I've spent a week digging into this potential issue.

Please note that each lock is held only for a small period of time that doesn't depend on user-provided initialization code. That's actually why I'm creating a number of monitors quadratic in the number of processors. I've been benchmarking on a 2x E5645 (12 cores, 24 threads) and collecting contention information with the Intel tool "VTune Amplifier XE". The test: 50 threads concurrently creating lazy vals, initializing them and throwing them away. With 2*CPU_COUNT^2 monitors, contention was only 15% higher than with independent monitors. But with independent monitors there was a 5% slowdown due to monitor inflation (monitors start out as minimal biased locks, but are inflated when there's contention). With 5*CPU_COUNT^2 monitors, the 'global-monitors' implementation is actually 2-3% faster.

But note that those benchmarks are EXTREMELY specific to the hardware configuration.
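The monitor table described above could be sketched as follows (object name and the exact sizing constant are hypothetical):

```scala
// Hypothetical striped global monitor table, sized relative to CPU count.
object LazyMonitors {
  private val cpus = Runtime.getRuntime.availableProcessors()
  private val size = 2 * cpus * cpus          // e.g. 2 * CPU_COUNT^2 stripes
  private val monitors = Array.fill(size)(new Object)

  // The same holder object always maps to the same monitor, and unrelated
  // objects only rarely collide, so lock hold times stay short.
  def monitorFor(holder: AnyRef): AnyRef =
    monitors(Math.floorMod(System.identityHashCode(holder), size))
}
```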

Some ideas come to mind:
1) deadlock detection + blow up
2) being able to specify what monitor to use for lazy vals
I believe this is a good idea; I'll think about adding an annotation for this. Thanks!

Dmitry Petrashko

unread,
Feb 17, 2014, 9:33:24 AM2/17/14
to scala-i...@googlegroups.com
It's the current situation indeed, but the proposed approach holds the lock only for a very short period that doesn't depend on the user-provided initializer. If initializers take substantial time to run, concurrent initialization is still possible.



√iktor Ҡlang

unread,
Feb 17, 2014, 9:57:30 AM2/17/14
to scala-i...@googlegroups.com
Hi Dmitry,


On Mon, Feb 17, 2014 at 3:28 PM, Dmitry Petrashko <dmitry.p...@gmail.com> wrote:



 
Please note that each lock is held only for small period of time, that doesn't depend on user-provided initialization code. That's actually why I'm creating number of monitors quadratic to number of processors: I've been benchmarking on 2XE5645 (12 cores, 24 threads) and collecting contention information by Intel tool "Vtune Amplifier XE". Test was: 50 threads concurrently creating LazyVals, inititalizing them and throwing them away. For  2*CPU_COUNT^2 contention was only 15% bigger that if using independent monitors. But: if using independent monitors there was a 5% slowdown, due to monitor expanding(monitors start as minimalistic biased locking, but are expended when there's contention). If 5*CPU_COUNT^2 monitors are used, the 'global-monitors' implementation is actually 2-3% faster.

What does performance look like in comparison to a spinlock approach as done in FJP? : https://github.com/scala/scala/blob/master/src/forkjoin/scala/concurrent/forkjoin/ForkJoinPool.java#L1622
 

But note that those benchmarks are EXTREMELY specific to the hardware configuration.

Yep, I know ;)
It'll look different on AMD and others.

Dmitry Petrashko

unread,
Feb 17, 2014, 10:01:15 AM2/17/14
to scala-i...@googlegroups.com

On 17 February 2014 13:19, √iktor Ҡlang <viktor...@gmail.com> wrote:
especially when used with identityHashCode, as it requires the hashCode to be stored in the object header due to compaction by GC (at least this is what I was told).
Both the identityHashCode and the lock information are stored (or referenced) in the 'MarkWord' field of the object header.
The transitions between the different representations are explained here: https://wikis.oracle.com/display/HotSpotInternals/Synchronization
Note that by default biased locking IS enabled, but as soon as the first contention is detected it's disabled for the whole class by bulk revocation. And this IS slow. Note that with CAS, biased locking isn't required, as CAS effectively replaces it with a better mechanism.

The proposed approach with a predefined set of monitors has the advantage of not expanding the monitors on lazy-val holder objects into heavyweight monitors, but it does switch objects to the 'unlocked non-biasable' state.

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.
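The striped-monitor scheme being discussed could be sketched roughly like this (an illustration only, not the prototype's actual code: the object and method names are made up, and the sizing follows the ~2*CPU_COUNT^2 figure from the benchmarks mentioned above):

```scala
// Illustrative sketch: `LazyValMonitors` and `getMonitor` are invented names,
// not the prototype's API.
object LazyValMonitors {
  private val cpus = Runtime.getRuntime.availableProcessors()
  // round 2 * cpus^2 up to a power of two so we can mask instead of modulo
  private val size = Integer.highestOneBit(2 * cpus * cpus - 1) << 1
  private val monitors = Array.fill(size)(new Object)

  // The monitor is chosen by identity hash code, so it is stable for the
  // object's lifetime and is never an object that user code can lock on.
  def getMonitor(owner: AnyRef): Object =
    monitors(System.identityHashCode(owner) & (size - 1))
}
```

Because the returned monitor is private to the runtime, user code can never synchronize on it, which is the point of not locking on 'this'.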


√iktor Ҡlang

unread,
Feb 17, 2014, 10:07:33 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 4:01 PM, Dmitry Petrashko <dmitry.p...@gmail.com> wrote:

On 17 February 2014 13:19, √iktor Ҡlang <viktor...@gmail.com> wrote:
especially when used with identityHashCode, as it requires the hashCode to be stored in the object header due to compaction by GC (at least this is what I was told).
Both information about identityHashCode and lock is stored(or referenced) in 'MarkWord' field of object header.
The transitions between different representations are explained here: https://wikis.oracle.com/display/HotSpotInternals/Synchronization
Note that by default, biased locking IS enabled, but as soon as first contention is found it's disabled for all class by bulk revocation. And this IS slow. Note that due to CAS biased locking isn't required, as CAS effectively replaces it with a better implementation.

this is only in HotSpot though. We have J9, Zing, JRockit etc.
 

The proposed approach with predefined set of monitors has advantage of not expanding monitors on lazy-val holder objects to heavyweight monitors, but it indeed switches objects to 'unlocked non-biasable' states.


Yes, it would also most likely throw lock coarsening and lock elision out the window. Amirite?
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
 


--
You received this message because you are subscribed to the Google Groups "scala-internals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-interna...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Dmitry Petrashko

unread,
Feb 17, 2014, 10:30:22 AM2/17/14
to scala-i...@googlegroups.com
On 17 February 2014 16:07, √iktor Ҡlang <viktor...@gmail.com> wrote:



 

The proposed approach with predefined set of monitors has advantage of not expanding monitors on lazy-val holder objects to heavyweight monitors, but it indeed switches objects to 'unlocked non-biasable' states.


Yes, it would also most likely throw lock coarsening and lock elision out the window. Amirite?
AFAIK lock coarsening is independent of the MarkWord structure, and I've observed lock elision in my implementation just by reading the native code. I believe they both still work.
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in the proposed implementation the logic (both ours and the JVM's) works on a per-instance basis, while a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of the same class. In total we have low cost for the uncontended case (a simple CAS), and we also do better in the contended case. In the contended case we're better because:
1) if the initializer is fast, in most cases the initializing thread finishes before anyone is able to successfully CAS the bitmap to state 2 (the state meaning that someone is waiting for initialization to finish). These 'failed' CASes of other threads effectively replace a short spinlock;
2) if the initializer is slow, waiting IS required.

There can be a case where an additional hand-written spinlock would be beneficial: when the initializer's cost is comparable to the cost of a CAS. But I've tried this, and with a spinlock the method is already too big to inline, and benchmarks show a 15% slowdown in the more common cases.
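For reference, the state machine described above can be sketched like this (a simplified model of the Version4-CAS scheme plus the two differences from the first message; the class and field names are invented, a plain AtomicInteger stands in for the generated bitmap, and a private lock object stands in for the striped global monitor):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Simplified model: 0 = uninitialized, 1 = being initialized,
// 2 = being initialized with at least one waiter, 3 = initialized.
class CasLazy[T](init: () => T) {
  private val state = new AtomicInteger(0)
  private val lock  = new Object // stands in for the striped global monitor
  @volatile private var value: T = null.asInstanceOf[T]

  def get: T =
    if (state.get == 3) value else initialize()

  private def initialize(): T = {
    while (state.get != 3) {
      state.get match {
        case 0 =>
          if (state.compareAndSet(0, 1)) {
            try value = init()
            catch {
              case e: Throwable =>
                state.set(0) // reset on failure so other threads retry
                lock.synchronized(lock.notifyAll())
                throw e
            }
            // a failed CAS here means a waiter already moved us to state 2
            if (!state.compareAndSet(1, 3)) {
              state.set(3)
              lock.synchronized(lock.notifyAll())
            }
          }
        case 1 =>
          // announce the intent to wait; the failed CASes of racing
          // threads here effectively act as a short spinlock
          state.compareAndSet(1, 2)
        case 2 =>
          lock.synchronized { if (state.get == 2) lock.wait() }
        case _ => // state 3: fall out of the loop
      }
    }
    value
  }
}
```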

 





--
Cheers,

———————
Viktor Klang
Chief Architect - Typesafe

Twitter: @viktorklang


√iktor Ҡlang

unread,
Feb 17, 2014, 10:59:59 AM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 4:30 PM, Dmitry Petrashko <dmitry.p...@gmail.com> wrote:



AFAIK  lock coarsening is independent of MarkWord structure, and I've been observing lock elision in my implementation just by reading native code. I believe that they still work.

I don't see how it could elide it since you'll make the striped locks contended.
 
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in proposed implementation on logic(both our and jvm's) will work on per-instance basis. While a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of same class. In total we have both low cost for uncontended instance(simple CAS), and we are also better worse for contended case. In contended case were better because:
1) in case initializer is fast, in most cases initializer thread is fast enough to finish initialization before anyone is able to successfully CAS bitmap to state 2(this state means that someone is waiting for initialization to finish). This 'failed' CASes of other threads effectively replace a short spinlock.
2) in case initializer is slow, waiting IS required.

But the problem is that _all_ contended initializations that hash to the same lock that is already used during initialization will essentially park.
Doesn't that come with a huge domino-warning?
Shouldn't the stripe length be configured as a System property? With followup: What is a good value for the stripe length?
 

There can be the case, that additional hand-written spinlock will be beneficial: the case where initializers's cost is comparative to cost of CAS, but I've tried to do this, and with spin lock method is already big to inline, and benchmarks show see a 15% slowdown in more common cases.

Did you try the technique I linked? And if so, what was the result?
 

 






Dmitry Petrashko

unread,
Feb 17, 2014, 12:03:28 PM2/17/14
to scala-i...@googlegroups.com
On 17 February 2014 16:59, √iktor Ҡlang <viktor...@gmail.com> wrote:



Yes, it would also most likely throw lock coarsening and lock elision out the window. Amirite?
AFAIK  lock coarsening is independent of MarkWord structure, and I've been observing lock elision in my implementation just by reading native code. I believe that they still work.

I don't see how it could elide it since you'll make the striped locks contended.
As the lock is held for a very short period it's very hard to make it contended. But it is indeed possible. As I said, the direct benchmark that actually measures this locking contention shows that the contention isn't big.
 
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in proposed implementation on logic(both our and jvm's) will work on per-instance basis. While a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of same class. In total we have both low cost for uncontended instance(simple CAS), and we are also better worse for contended case. In contended case were better because:
1) in case initializer is fast, in most cases initializer thread is fast enough to finish initialization before anyone is able to successfully CAS bitmap to state 2(this state means that someone is waiting for initialization to finish). This 'failed' CASes of other threads effectively replace a short spinlock.
2) in case initializer is slow, waiting IS required.

But the problem is that _all_ contended initializations that hash to the same lock that is already used during initialization will essentially park.
They will only park if they intend to wait on this monitor, which is a very rare occurrence for fast initializers and, as I believe, expected behavior for slow ones.
Doesn't that come with a huge domino-warning?
Shouldn't the stripe length be configured as a System property? With followup: What is a good value for the stripe length?
I performed the benchmark for the sake of finding this good value. But indeed, a system property can be provided. Still, we're back to the question: if a user wants a highly customizable lazy val, maybe they'd be better off with their own preferred implementation Lazy[T]?
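For illustration, a user-level Lazy[T] along those lines could look like this (a hypothetical sketch, not an actual proposal: double-checked locking on a private lock object rather than on 'this', with a volatile flag to publish the value safely):

```scala
// Hypothetical user-level lazy cell; names are illustrative.
final class Lazy[T](init: => T) {
  private[this] val lock = new Object        // never exposed to user code
  @volatile private[this] var computed = false
  private[this] var value: T = null.asInstanceOf[T]

  def get: T =
    if (computed) value // fast path: a single volatile read
    else lock.synchronized {
      if (!computed) {
        value = init     // published by the volatile write below
        computed = true
      }
      value
    }
}

object Lazy {
  def apply[T](init: => T): Lazy[T] = new Lazy(init)
}
```

Usage: `val x = Lazy { expensiveComputation() }; x.get`. The user controls the locking granularity entirely, at the price of an extra object per lazy field.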
 

There can be the case, that additional hand-written spinlock will be beneficial: the case where initializers's cost is comparative to cost of CAS, but I've tried to do this, and with spin lock method is already big to inline, and benchmarks show see a 15% slowdown in more common cases.

Did you try the technique I linked? And if so, what was the result?
If I understood the technique correctly, it is a loop that with probability 1/2 decrements a counter initially equal to 256?
I've been trying a technique with a predefined spinlock of 500 iterations.

Here are the results of the technique you referenced: http://goo.gl/xXQArr (if lazy-simulation-D3TrySpin isn't there press Ctrl-F5; it's a static page with aggressive caching). Source: https://github.com/DarkDimius/lazy-val-bench/blob/CallSites/src/test/scala/example/package.scala#L323
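As I read it, the randomized-spin idea looks roughly like this (a paraphrase of the technique under discussion, not the FJP source; the object and method names are mine):

```scala
import java.util.concurrent.ThreadLocalRandom

object RandomizedSpin {
  // Spin until `done` holds or the budget runs out; the counter is only
  // decremented when a cheap random draw is non-negative (probability 1/2),
  // so the expected number of iterations is about 2 * maxSpins.
  def spinThenCheck(done: () => Boolean, maxSpins: Int = 256): Boolean = {
    val rnd = ThreadLocalRandom.current()
    var spins = maxSpins
    while (spins > 0) {
      if (done()) return true
      if (rnd.nextInt() >= 0) spins -= 1
    }
    done() // a real implementation would park/wait on a monitor here
  }
}
```

The randomization avoids all spinning threads hammering the shared state in lockstep.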

√iktor Ҡlang

unread,
Feb 17, 2014, 12:25:43 PM2/17/14
to scala-i...@googlegroups.com
On Mon, Feb 17, 2014 at 6:03 PM, Dmitry Petrashko <dmitry.p...@gmail.com> wrote:




I don't see how it could elide it since you'll make the striped locks contended.
As lock is held for a very short period it's very hard to make it contended. But indeed possible. As I told, the direct benchmark that actually meassures this locking contention shows that contention isn't big.

What's your testbed and which are the test-cases?
 
 
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in proposed implementation on logic(both our and jvm's) will work on per-instance basis. While a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of same class. In total we have both low cost for uncontended instance(simple CAS), and we are also better worse for contended case. In contended case were better because:
1) in case initializer is fast, in most cases initializer thread is fast enough to finish initialization before anyone is able to successfully CAS bitmap to state 2(this state means that someone is waiting for initialization to finish). This 'failed' CASes of other threads effectively replace a short spinlock.
2) in case initializer is slow, waiting IS required.

But the problem is that _all_ contended initializations that hash to the same lock that is already used during initialization will essentially park.
They will only park if they intend to wait on this monitor, which is a very rare occurrence for fast initializers and, as I believe, expected behavior for slow ones.
Doesn't that come with a huge domino-warning?
Shouldn't the stripe length be configured as a System property? With followup: What is a good value for the stripe length?
I performed the benchmark for sake of detecting this good value. But, indeed system property can be provided. But we're back to question: if user wants a highly customizable lazy val, maybe he'd better go with his custom preferred implementation Lazy[T]?

True, but are you expecting him/her to rewrite all his/her lazy-val code and get the added benefit of higher memory usage + worse locality?
 
 

There can be the case, that additional hand-written spinlock will be beneficial: the case where initializers's cost is comparative to cost of CAS, but I've tried to do this, and with spin lock method is already big to inline, and benchmarks show see a 15% slowdown in more common cases.

Did you try the technique I linked? And if so, what was the result?
If I understood technique correctly, this is a cycle, that with probability of 1/2 decreases counter that was initially equal to 256?
I've been trying technique with predefined spinlock on 500 iterations.

AFAICS the contended benchmark has at most 4 Threads.

Scott Carey

unread,
Feb 17, 2014, 5:04:34 PM2/17/14
to scala-i...@googlegroups.com
If you can use Method Handles (why not in Dotty? this is research, not android-ville), this looks to be vastly superior in many ways:

https://groups.google.com/forum/#!topic/scala-internals/SCZhEN0DyPY

Dmitry Petrashko

unread,
Feb 18, 2014, 8:47:52 AM2/18/14
to scala-i...@googlegroups.com


On Monday, 17 February 2014 18:25:43 UTC+1, √iktor Klang wrote:



I don't see how it could elide it since you'll make the striped locks contended.
As lock is held for a very short period it's very hard to make it contended. But indeed possible. As I told, the direct benchmark that actually meassures this locking contention shows that contention isn't big.

What's your testbed and which are the test-cases?
Testbeds: 1) i7-3930K, Oracle JDK 1.7.0_45; 2) 2x Xeon E5645, Oracle JDK 1.7.0_51; 3) i7-2640M, Oracle JDK 1.7.0_51.
Test case: given an array of 1,000,000 objects, start locking on them concurrently from 50 threads at different starting locations; compared to locking on getMonitor(object) as implemented in the prototype.
The body of the synchronized method is a per-thread public volatile update.
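The setup described above could be sketched roughly as follows (an illustration of the described test, not the actual benchmark code; getMonitor stands in for the prototype's striped lookup and the monitor-array size is an assumption):

```scala
import java.util.concurrent.CountDownLatch

// Rough sketch of the described contention test; not the real benchmark.
object LockContentionBench {
  private val objects  = Array.fill(1000000)(new Object)
  private val monitors = Array.fill(1024)(new Object)

  private def getMonitor(o: AnyRef): Object =
    monitors(System.identityHashCode(o) & (monitors.length - 1))

  // Lock every object once from `threads` workers starting at different
  // offsets; with striped = true, lock getMonitor(o) instead of o itself.
  // Returns elapsed nanoseconds.
  def run(threads: Int, striped: Boolean): Long = {
    val done  = new CountDownLatch(threads)
    val start = System.nanoTime()
    for (t <- 0 until threads) {
      new Thread(new Runnable {
        @volatile private var tick = 0L // the per-thread volatile update
        def run(): Unit = {
          val n = objects.length
          var i = t * (n / threads) // each thread starts elsewhere
          var steps = 0
          while (steps < n) {
            val o = objects(i % n)
            val m = if (striped) getMonitor(o) else o
            m.synchronized { tick += 1 }
            i += 1; steps += 1
          }
          done.countDown()
        }
      }).start()
    }
    done.await()
    System.nanoTime() - start
  }
}
```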
 
 
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in proposed implementation on logic(both our and jvm's) will work on per-instance basis. While a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of same class. In total we have both low cost for uncontended instance(simple CAS), and we are also better worse for contended case. In contended case were better because:
1) in case initializer is fast, in most cases initializer thread is fast enough to finish initialization before anyone is able to successfully CAS bitmap to state 2(this state means that someone is waiting for initialization to finish). This 'failed' CASes of other threads effectively replace a short spinlock.
2) in case initializer is slow, waiting IS required.

But the problem is that _all_ contended initializations that hash to the same lock that is already used during initialization will essentially park.
They will only park if they intend to wait on this monitor, which is a very rare occurrence for fast initializers and, as I believe, expected behavior for slow ones.
Doesn't that come with a huge domino-warning?
Shouldn't the stripe length be configured as a System property? With followup: What is a good value for the stripe length?
I performed the benchmark for sake of detecting this good value. But, indeed system property can be provided. But we're back to question: if user wants a highly customizable lazy val, maybe he'd better go with his custom preferred implementation Lazy[T]?

True, but are you expecting him/her to rewrite all his/her lazy-val code and get the added benefit of higher memory usage + worse locality?
I'm trying to come up with an implementation that does reasonably well in the common cases and is easy to maintain.
 
 

There can be the case, that additional hand-written spinlock will be beneficial: the case where initializers's cost is comparative to cost of CAS, but I've tried to do this, and with spin lock method is already big to inline, and benchmarks show see a 15% slowdown in more common cases.

Did you try the technique I linked? And if so, what was the result?
If I understood technique correctly, this is a cycle, that with probability of 1/2 decreases counter that was initially equal to 256?
I've been trying technique with predefined spinlock on 500 iterations.

AFAICS the contended benchmark has at most 4 Threads.
This particular contended benchmark shows only a constant measurable difference in running time for different numbers of threads (until reaching the number of cores). Increasing the number of threads makes the test less informative, as it begins to benchmark the OS scheduler.
While running this test on the 2x Xeon box I pin threads to different CPUs during warmup.

√iktor Ҡlang

unread,
Feb 18, 2014, 8:51:44 AM2/18/14
to scala-i...@googlegroups.com
On Tue, Feb 18, 2014 at 2:47 PM, Dmitry Petrashko <darkd...@gmail.com> wrote:


What's your testbed and which are the test-cases?
testbeds: 1) i7-3930K; oracleJDK 1.7.0_45;  2) 2x Xeon E5645; oracle JDK 1.7.0_51; 3) i7-2640M oracleJDK 1.7.0_51
test case: given array of 1 000 000 objects, start locking on them concurrently in 50 threads from different starting locations, compared to: start locking on getMonitor(object) implemented in prototype.
Body of synchronized method is per-thread public volatile update.

If you ping Toni Cunei you might get to run on our 48 core Opteron box.
 
 
 
 

I believe that identityHashCode is a lot less evil than locking on 'this', and actually more deterministic.

How so?
Because in proposed implementation on logic(both our and jvm's) will work on per-instance basis. While a single contended lazy val synchronizing on 'this' can disable biased locking for ALL instances of same class. In total we have both low cost for uncontended instance(simple CAS), and we are also better worse for contended case. In contended case were better because:
1) in case initializer is fast, in most cases initializer thread is fast enough to finish initialization before anyone is able to successfully CAS bitmap to state 2(this state means that someone is waiting for initialization to finish). This 'failed' CASes of other threads effectively replace a short spinlock.
2) in case initializer is slow, waiting IS required.

But the problem is that _all_ contended initializations that hash to the same lock that is already used during initialization will essentially park.
They will only park if they intend to wait on this monitor, which is a very rare occurrence for fast initializers and, as I believe, expected behavior for slow ones.
Doesn't that come with a huge domino-warning?
Shouldn't the stripe length be configured as a System property? With followup: What is a good value for the stripe length?
I performed the benchmark for sake of detecting this good value. But, indeed system property can be provided. But we're back to question: if user wants a highly customizable lazy val, maybe he'd better go with his custom preferred implementation Lazy[T]?

True, but are you expecting him/her to rewrite all his/her lazy-val code and get the added benefit of higher memory usage + worse locality?
I'm trying to come up with implementation that does reasonably good for common cases and is easy to maintain.

How are you collecting these common cases and how are you measuring maintainability?
 
 
 

There can be the case, that additional hand-written spinlock will be beneficial: the case where initializers's cost is comparative to cost of CAS, but I've tried to do this, and with spin lock method is already big to inline, and benchmarks show see a 15% slowdown in more common cases.

Did you try the technique I linked? And if so, what was the result?
If I understood technique correctly, this is a cycle, that with probability of 1/2 decreases counter that was initially equal to 256?
I've been trying technique with predefined spinlock on 500 iterations.

AFAICS the contended benchmark has at most 4 Threads.
 This particular contended benchmark has only a constant measurable difference for running time for different number of threads(until reaching number of cores). Increasing number of threads makes test less informative, as it begins to benchmark OS scheduler.
While running this test on 2xXeon I pin threads to different CPUs during warmup.

The sad fact is that we _are_ talking about OS scheduler and context-switch overhead; that's where the bottleneck lies.

Dmitry Petrashko

unread,
Feb 18, 2014, 8:56:55 AM2/18/14
to scala-i...@googlegroups.com
Hi Scott,

If you have a look at the source you can see that I have an MH prototype and a benchmark for it.
Long story short: creating MHs costs you more than all of these implementations and doesn't pay off, even on early JDK 8 builds. And you still need a VolatileCallSite to make sure it's thread-safe. In total, MH version initialization is slower than current lazy vals by 50-1000 times depending on JVM version and flags.
I believe MHs can be a good tool if you allocate them during startup and have a reasonable (constant?) number of them. What I observe is that the first 10,000 MHs are reasonably slow, but as soon as you've allocated several million of them their performance degrades substantially.

If somebody wants to explore MHs, this will help to get started: https://github.com/DarkDimius/lazy-val-bench/blob/CallSites/src/test/java/example/LazyValCallSite.java

Cheers,
Dmitry