Mechanically sympathetic Functional Programming


Fernando Racca

unread,
Jul 22, 2013, 5:43:53 AM7/22/13
to mechanica...@googlegroups.com
Immutability and functional programming look like very reasonable ideas in a highly concurrent environment.

But what's the impact of generating millions of very short lived objects in a low latency environment?

I'm particularly thinking of Scala, which offers great tools for building sane financial applications. Remember, even though performance is important, not all applications require microsecond end-to-end responses.

What advice, or better, what best practices would you give to minimise latency while retaining as much immutability as possible?

For example, is G1 the best option these days?

I saw some interesting comments in other threads:

| Peter Lawrey:
 that said, even if you have something like Zing, I still recommend you keep your garbage produced to a minimum, not 
 because of GC pauses, but to avoid filling your CPU caches with garbage. A thread which produces 32 MB of garbage a second will fill the L1 cache with garbage in about one milli-second.  The more garbage you produce, the more likely you will be working from a lower cache or main memory, slowing down your application.  By minimising garbage produced I have improved the performance of code by a factor of 2-5x, in between the GC pauses, just by making it more likely you will be using your L1 cache more effectively, rather than say your L3 cache which is shared.
 
 | Rüdiger Möller
 
The hidden cost of allocation and fine grained class design is much higher than most people suspect. Even a high allocation rate of very short lived objects (as typical for fancy Scala and Groovy code) has a massive impact on overall system performance (eden gets filled up faster, frequently leading to a high (false) promotion rate). Allocation and GC make up the largest part of CPU consumption in a lot of Java applications out there. This is where micro benchmarks don't tell the full story.

Thanks

Fernando Racca

Rüdiger Möller

unread,
Jul 22, 2013, 8:40:04 AM7/22/13
to mechanica...@googlegroups.com
I don't think creating millions of short lived objects is a good idea at all, and no, it does not help concurrent programming performance-wise imo. There might be certain use cases where immutables pay off. 

Regarding locality: overuse of immutables will have a negative impact on CPU caches. 
Regarding GC: as long as escape analysis is not working that well (Peter has done some detailed analysis on this topic), get a huge Eden to reduce GC overhead. CMS is the easiest to tune and most predictable at the moment. I have done some testing which might save you some time: http://java-is-the-new-c.blogspot.com/
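For illustration, a minimal sketch of that advice assuming a HotSpot JVM of that era; the sizes below are placeholders to be tuned against the measured allocation rate, not recommendations:

    # illustrative only: a large, explicitly sized young generation with CMS for the old generation
    java -Xms8g -Xmx8g -Xmn6g \
         -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:+PrintGCDetails -jar app.jar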

jamie...@typesafe.com

unread,
Jul 22, 2013, 12:25:28 PM7/22/13
to mechanica...@googlegroups.com
Fernando - First off, a disclaimer - I'm from Typesafe and therefore I'm obviously a fan of Scala.  I think it's important that we all keep in mind that every developer has to choose the right tool for the right job.  There will be some tasks for which Scala is very well suited, some for which Java is the better approach and some for which neither is the right answer.  Your experience will vary depending on your application, what you're doing and how it's written.  I can say that several large financial institutions, as in just about any that you've heard of, are now using Scala.  I can't say which - many of the large companies with which we work would prefer to stay quiet about it.  However, they like it so much that these institutions pay us to build new libraries in support of their needs which we then open source, such as the Slick data access library and the new Async library.

Martin will attest that I've been a long-time follower of his Mechanical Sympathy tenets, having written a Scala port of the Disruptor v1.0 in July of 2011.  I see tremendous value in applying the principles he, Peter Lawrey and others espouse to maximize the performance of my application, such as sizing my working set to fit within the L2 cache that isn't shared between cores on Intel architectures.  I apply these in Scala as well, at varying times.  I write all of my methods/functions to be smaller than 35 bytecode instructions to take advantage of inlining while I'm coding.  Most others, like cache line isolation and working set sizing, I apply once I've measured where my biggest performance pain points are and how I want to address them.  This likely differs from Peter Lawrey's approach, because HFT is an extremely performance critical task where even nanoseconds matter.

Functional programming is not the best approach for high performance when talking about the greatest throughput you will get on a single thread, and it will use a greater memory footprint.  The implementation of FP will vary between languages.  For example, look at collections libraries.  Some libraries/languages use full Copy On Write (COW), which is extremely expensive but guarantees the most isolation of data between threads of execution.  Others, like Scala's standard library collections, use structural sharing, which can generate short-lived garbage but does not copy all of the data in the heap - instead, the references to shared data are shared between collections.  This is transparent to developers, while also providing extremely rich collections functionality.
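To make the structural-sharing point concrete, here is a tiny sketch with the standard Scala immutable List (nothing framework-specific): prepending allocates only a new head cell, while the existing list is reused as the tail.

    val base     = List(1, 2, 3)
    val extended = 0 :: base          // allocates one new cons cell
    assert(extended.tail eq base)     // the tail *is* the old list -- structure is shared
    assert(base == List(1, 2, 3))     // the original is untouched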

Note that Scala can be written to be just as performant as Java.  We do not have the ability to create bytecode that doesn't execute on the JVM.  You can always write code in Java and work directly between the two, if you feel more comfortable doing so - they are completely interoperable (a caveat - it's always easier going from Java to Scala than vice versa).  You can delegate work from a Scala application to a Java module very easily.  Kent Beck famously said, "Make it work, make it work well, make it work fast."  My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java.  I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.  This is particularly true with a team of developers of varying skill developing simultaneously.  Once we have something we know works the way we intended, we must then profile it and find the areas where optimization is required.  How much time is spent in a method?  How much garbage is being created and which collector best suits the object lifecycle patterns for this application?  How do we size our regions to handle those patterns most effectively?

One other area I'd like to point out - Java has wonderful constructs at a very primitive level.  Threads, executors, atomic CAS types, etc.  They work very well and form the basis of my multithreaded applications.  However, they are very difficult to compose, and even more so for handling errors that can occur.  Maybe superstar developers like Martin and Peter can figure out what happened inside of a thread that failed inside of a task executed with ForkJoin, but most devs can't.  And even if they wanted to use Thread.UncaughtExceptionHandler, they've now got to figure out all kinds of things they didn't before such as parallelism level.  Scala makes defining work to be spread across cores extremely simple, even when things go wrong.  Scala also has a completely asynchronous Future implementation (based on JSR166y and therefore usable on Java 1.6+), whereas Java still requires you to block a thread just to see if one or more futures are completed.
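For illustration, a minimal sketch of composing such asynchronous Futures in Scala 2.10; the price and fxRate lookups are made-up stand-ins for I/O-bound calls.

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    // hypothetical I/O-bound lookups, each started asynchronously
    def price(instrumentId: Long): Future[BigDecimal] = Future { BigDecimal("100.25") }
    def fxRate(ccy: String): Future[BigDecimal]       = Future { BigDecimal("1.32") }

    // composed without blocking any thread; the yield runs once both results arrive
    val quoted: Future[BigDecimal] =
      for {
        p  <- price(42L)
        fx <- fxRate("GBP")
      } yield p * fx

    quoted.onComplete(result => println(result))   // callback instead of a blocking get()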

Hope that helps!

Gil Tene

unread,
Jul 22, 2013, 1:15:30 PM7/22/13
to mechanica...@googlegroups.com
Let me step in in defense of short lived object allocation.

I don't think there is a strong rule of thumb here, but short lived object allocation is in most cases superior to alternatives. I've certainly seen many cases where short lived object allocation (not the kind that escape analysis eliminates) provides better performance than object pooling, for example, as it's simply cheaper and faster to manage (from a CPU cycles point of view). I've also seen several concurrent algorithms that get faster when they can rely on the magic cleanup that GC provides, and would otherwise need to safely deal with destruction and other lifecycle challenges in a concurrent environment. Concurrent algorithms are often much easier to reason about in such cases, making you able to tackle bigger problems.

I'm certainly not advocating allocating stuff for no good reason, and I fully agree that eliminating unneeded allocation reduces cycles and improves performance. But while the concerns that Peter and Rüdiger raise about the cache impacts of allocating short lived objects have some merit when very short latency operations are concerned in extremely idle systems, I think that people often overestimate the impact of moving this stuff through the cache. A set-associative LRU-managed CPU cache (which is the case for most) will successfully avoid pushing out hot data from the cache to a large degree, even in the presence of streaming operations going through the same cache. It's the cold data that is usually susceptible to cache pressure, and that cold data is just as susceptible to packet data moving through the cache as it is to object allocation. Adding the two together does mean increased pressure on the cache, of course, but cold data is "hard" to keep in your cache no matter what.

When we look at very short latency critical paths (and we are talking microseconds here), successfully reducing cold-data misses by avoiding cache pressure usually comes hand in hand with dedicating cores to threads and isolating them from the OS, as you don't want anyone using that core or its L1 if you are trying to keep cold (not hot) data in it. Even context switches (and the TLB flushes that come with them on x86) can be responsible for some key cache-miss latency of otherwise-cached cold stuff like the higher parts of the page table hierarchy if you don't keep that core completely in one process context. (Unfortunately, even with core isolation, a socket-wide shared inclusive L3 can kick out your cold L1 and L2 data if it really is cold, even when you've kept anyone else from using your core....)

The approach of dedicating cores to specific threads is certainly valid, but it also tends to significantly restrict what you can do with a given piece of hardware. While I know several people here do use these tricks, hand-craft and map their application to the underlying hardware, and are able to shave some microseconds or parts of microseconds off by doing so, I find that most latency sensitive apps (the ones that have 15-50 usec total processing time that I often see in Java) won't spend the effort, as they often have bigger fish to fry in the latency critical path. Those applications are usually happy just sticking to socket affinity and process segregation, allowing the OS to schedule stuff on the various cores, but keeping dedicated sockets for their critical applications and making sure (through taskset on init, interrupt vectoring control, etc.) that no other process will end up being scheduled on the same socket.

BTW, I've seen some "pretty exotic" practices that try to make latency-path-critical things that would normally be relatively cold stay warm in caches by intentionally accessing them unnecessarily as part of regular work. I've even heard of a game where a cache line is intentionally modified regularly (in an interleaved manner) by two different threads on two different cores in the same socket, with the purpose being to keep the L3 aware of the warmness of the line, presumably protecting against it evicting an actually warm L1/L2 line because it's seen no L3 traffic on it to keep the LRU up to date. There are certainly tradeoffs here (e.g. wasted cycles keeping things warm, and accepting L1/L2 misses to reduce the likelihood of an L3 miss), and I don't personally know how well this stuff works in practice, but these are good examples of the lengths that people trying to keep things cached in relatively idle but latency critical paths will go to.

Bottom line: allocating short lived objects is often cheaper and faster than alternatives, leading to lower latency common-case results, if we discount the L1 cache pressure arguments outside of the area of single-usec cold data latency. Its only real "down side" is the requirement for periodic collection of that stuff, and the increased frequency of that collection if you increase allocation pressure. Of course, when a pausing collector is used to collect newgen (as is unfortunately the case with all current HotSpot collectors in both Oracle and OpenJDK JVMs), this concern becomes very real, as higher allocation rates will cause more frequent pauses, worsening the percentiles picture for your application. When the newgen pausing question is taken out of the picture, the main downside to techniques that increase allocation rate disappears with it...

For example, Is G1 the best option these days?

G1/CMS/ParallelGC will all behave the same in this respect, as none of the OldGen collector algorithms matter for this question. Those short lived objects will all be collected by the newgen, and the OldGen collector parts will never see them. All mainstream HotSpot collectors use virtually identical stop-the-world, parallel, generational newgen collector mechanisms to do this work. So if the stop-the-world newgen effects are ones you can live with in your application, then the efficiency is there to be had with a well sized generational heap, and the common-case speed will be there too. However, if the newgen pause effects become too much for the application's low latency requirements (showing up as outlier and percentile requirements not being met), you'll need to somehow deal with them, either by using a non-stop-the-world newgen collector (hint hint), or by somehow reducing allocation pressure and newgen collection frequency enough to meet your requirements (which is what many in the low latency Java world end up having to do).

Peter Lawrey

unread,
Jul 22, 2013, 4:04:19 PM7/22/13
to mechanica...@googlegroups.com
Often when people talk about concurrency, it is not clear to me why this is being used.

A common argument seems to be: I have all these cores, so I have to use them, and concurrency seems a logical choice.  The problem I have with this is that it is like saying: I have all this disk space, I need to find a way to use 100% of it.

A good answer to me is to say: you need to improve performance or reduce the impact of IO delays.

Now, if you have the position that performance is not an issue, then I have to wonder why you would want to add complexity by using multiple threads when you know you don't need to.

In summary, I would have your core code single threaded, and perhaps reading/writing IO in additional threads.  I would only add threads when you know you need them.


> while retaining as much as possible immutability.

I would suggest having thread-local recyclable mutable data and recyclable immutable data, as much as possible.  This leads to a minimum of garbage.
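A minimal sketch of the thread-local recyclable-mutable-data idea (the names are made up; the point is that the scratch object is cleared and reused rather than reallocated on the hot path):

    // one mutable scratch buffer per thread, recycled on every use
    object Scratch {
      private val local = new ThreadLocal[java.lang.StringBuilder] {
        override def initialValue() = new java.lang.StringBuilder(256)
      }

      def withBuffer[T](work: java.lang.StringBuilder => T): T = {
        val sb = local.get()
        sb.setLength(0)                      // recycle: clear instead of allocating
        work(sb)
      }
    }

    // usage: no new builder is allocated per call
    val line = Scratch.withBuffer(sb => sb.append("id=").append(42).toString)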





Peter Lawrey

unread,
Jul 22, 2013, 4:17:12 PM7/22/13
to mechanica...@googlegroups.com
>  I apply these in Scala as well, at varying times.  I write all of my methods/functions to be smaller than 35 bytecode instructions to take advantage of inlining while I'm coding.

Scala is a very good fit for mathematically based problems.

Java will inline larger methods if they are called "frequently" (the size limit depends on the JVM).  Even less frequently called methods can have a higher size limit for inlining.
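For reference, a hedged illustration of the two HotSpot thresholds involved, plus the diagnostic switch for watching inlining decisions (flag names as shipped in HotSpot 6/7; the values shown are the usual defaults, not recommendations):

    # MaxInlineSize:  bytecode-size limit for inlining ordinary call sites
    # FreqInlineSize: larger limit applied to frequently executed ("hot") call sites
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
         -XX:MaxInlineSize=35 -XX:FreqInlineSize=325 -jar app.jar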

> Note that Scala can be written to be just as performant as Java.

For problems naturally expressed in Scala, the performance is likely to be so close it doesn't matter.  For libraries not so suited to Scala, they can be implemented in Java and called from Scala.

> Scala makes defining work to be spread across cores extremely simple, even when things go wrong.

IMHO, that is only a win if the performance is consistently better than writing in Java with one thread.  Spreading work across cores is not an aim in itself; performance should be the requirement.


> Scala also has a completely asynchronous Future implementation (based on JSR166y and therefore usable on Java 1.6+), whereas Java still requires you to block a thread just to see if one or more futures are completed.

In Java, if you want a task to be performed after another task, you put the code together, e.g.

    new Runnable() {
        public void run() {
            task1();
            task2(); // task to be performed when task1 finishes
        }
    }

Not that complicated



Peter Lawrey

unread,
Jul 22, 2013, 4:36:17 PM7/22/13
to mechanica...@googlegroups.com
On 22 July 2013 18:15, Gil Tene <g...@azulsystems.com> wrote:
Let me step in in defense of short lived object allocation.

Don't get me wrong, short lived objects are great, medium lived objects are the pain (unless you have something like Zing)
For most applications, short lived objects are the simplest, natural choice.  My view is that it doesn't have to be this way and in rare cases, you actually want to avoid it.

 I've certainly seen many cases where short lived object allocation (not the kind that escape analysis eliminates) provides better performance than object pooling,

In my experience this falls under the common practice of making a change for performance reasons without checking that it helps performance. If you make the object pooling lightweight enough it can be faster.  This means adding complexity to your application, and I am only in favour of adding complexity when you have measured that you need it.

Concurrent algorithms are often much easier to reason about in such cases, making you able to tackle bigger problems.

Immutable objects across threads are the easiest to reason about.

A set-associative LRU-managed CPU cache (which is the case for most) will successfully avoid pushing out hot data form the cache to a large degree, even in the presence of streaming operations going through the same cache.

I agree, but for the 99th percentile latencies and the worst case, this doesn't help so much.
 
 
I find that most latency sensitive apps (the ones that have 15-50 usec total processing time that I often see in Java) won't spend the effort, as they often have bigger fish to fry in the latency critical path.

I agree.
 
 I don't personally know how well this stuff works in practice, but these are good examples of the lengths that people trying to keep things cached in relatively idle but latency critical paths will go to.

I would have to wonder this too.  Often when you think you are helping you are making the high percentile latencies worse.
 

Bottom line: allocating short lived objects is often cheaper and faster than alternatives, leading to lower latency common-case results, if we discount the L1 cache pressure arguments outside of the area of single-usec cold data latency. Its only real "down side" is the requirement for periodic collection of that stuff, and the increased frequency of that collection if you increase allocation pressure.

There is a point at which the gains from reducing the amount of garbage created become smaller than other causes of jitter and performance issues in your system.  At this point you risk adding complexity with little benefit, which makes keeping your application performant when you change it more difficult, i.e. ultimately it can make performance worse.


jamie...@typesafe.com

unread,
Jul 22, 2013, 6:07:26 PM7/22/13
to mechanica...@googlegroups.com
IMHO, that is only a win if the performance is consistently better than writing in Java with one thread.  Spreading work across cores is not an aim in itself; performance should be the requirement.

I'm torn about this - it depends on whether or not the team is able to effectively write code that executes very fast on one thread, and whether or not the application has to have fault tolerance above error handling on the single thread, which can be tough to get right.

I posted a tweet a few weeks ago about how Klout is handling 1 billion requests per day on 20 servers using Play/Akka/Scala.  Broken down to the second, that's really only 11,574 per second, which isn't terrible but also isn't mind-bogglingly great.  Someone on Twitter called that out and said they could do that in their sleep, and I agree, they probably could.  But does that necessarily make sense? That load average doesn't take into account burstiness, nor the ability for those 20 servers to withstand failures (including latency spikes) that might occur for whatever reason.  At the end of the day, companies like Klout have to spread their load across multiple data centers in case Amazon-East goes down, and the remaining servers need to be able to handle the same traffic until the others can be brought back online.  So while it would be great to get more throughput out of their servers, that might be counterproductive to what they're actually trying to achieve with their architecture.
 
In Java, if you want a task to be performed after another task, you put the code together, e.g.
    new Runnable() {
        public void run() {
             task1();
             task2(); // task to be performed when task1 finishes
        }
   }
Not that complicated

Absolutely true.  I don't always want to chain tasks together unless there is a relationship, but yes, this works great if I want sequential task execution on the same thread.

The futures stuff I'm pointing out is much more related to I/O bound tasks and blocking operations, like database access.  Context switches are expensive, but so is wasting a core, as Java's Future currently forces you to do.  Every developer looking to maximize performance in an application with I/O bound tasks has to make the choice between whether they want to busy spin the handler or defer execution via callbacks.

Fernando Racca

unread,
Jul 22, 2013, 7:14:21 PM7/22/13
to mechanica...@googlegroups.com
First of all, thanks a lot for the comments, I'm very pleased to have your feedback.

Let me describe in a bit more detail the environment that I'm considering.

The system I'm describing is a generic one, but the principles would apply to several architectures I've worked in.

In these environments, latency is to be kept to a "considerable minimum", but the hard SLAs are closer to hundreds of milliseconds, so there's some wiggle room.

Let's take a hypothetical example: static data is pre-cached to enrich some fast moving data coming in, 
some business logic applies, validation happens, analytics get calculated, and results are pushed out. 

If the process is too busy, we can create new sharded instances, i.e., we can trade space for speed.

Given these constraints, these applications should push data out as fast as possible, on a soft real-time basis. 

Concurrency is considered mainly because:

- "Static" data can and probably will change during the course of this application
- Fast moving data will arrive from a topic subscription and we want to minimize time spent reading from the source
- We need to perform expensive computations on the fast moving data (including in some cases native calls)
- push data out, again blocking on I/O
- Mixed environment, where one application doesn't own the hardware it's running on. Think extra large instances in AWS

Thrashing on the CPU caches is, unfortunately, going to happen no matter what one application does, 
since there will be no less than 10 heterogeneous processes running, competing for resources. 
The best one can do is to be a good citizen, and play nice during your time slice.

Under these circumstances, the main priority I consider is correctness and simplicity, not raw speed:
a rich, well tested domain model over bytes and cache lines. However, I'm conscious of the performance implications, and that's
why I would be interested in opinions on how to further improve the design.

Functional programming is particularly applicable to the domain of these applications, especially since they have a fair amount of 
logic, data transformation and math. Immutability offers a critical improvement over designs they had to deal with in previous
iterations of these systems, where behaviour, state and concurrency are one and the same thing, all blended together.

Immutable data structures and functional constructs, when used carefully, can greatly benefit the design. In fact, I have obtained
performance improvements in memory usage/object churn over conventional Java by using lightweight case classes which only store 
primitives instead of objects.
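As a hedged illustration of that primitives-only style (the field names are invented for the example):

    // flat, primitive-only case class: one small object per tick, no nested boxed fields
    final case class Tick(instrumentId: Int, priceMantissa: Long, priceExponent: Byte, epochNanos: Long)

    // versus carrying the same data as reference-heavy fields, which costs extra
    // allocations and pointer chasing per tick:
    //   case class BoxedTick(instrument: Instrument, price: BigDecimal, timestamp: java.util.Date)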

Now onto some of your comments:

| Rüdiger Möller 

Get a huge Eden to reduce GC overhead. CMS is easiest tuneable+predictable at the
moment. I have done some testing which might save you some time

What if most of the objects are either very short lived, or almost static? Once the application's static data is fully pre-cached, 
it goes into a stable mode. Static data should have minimal references to young objects. 
CMS used to be a reasonable alternative, but G1/Zing should perform even better in this situation. 

| Jamie

Kent Beck famously said, "Make it work, make it work well, make it work fast."  My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java.  I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.

Completely agree.

| Gil Tene

- I don't think there is a strong rule of thumb here, but short lived object allocation is in most cases superior to alternatives. I've certainly seen many cases where short lived object allocation (not the kind that escape analysis eliminates) provides better performance than object pooling, for example, as it's simply cheaper and faster to manage (from a CPU cycles point of view).

Thanks Gil, I have read many papers and benchmarks and did some myself, and I agree with that statement. I can obviously understand that fixed preallocation is efficient, but my experience with the JVM is that churn is rarely a problem as long as it is kept local.

- The approach of dedicating cores to specific threads is certainly valid, but it also tends to significantly restrict what you can do with a given piece of hardware. 

That's certainly a valid choice with commodity hardware trying to do one thing as fast as possible, but I'm not so sure on the kind of expensive hardware I'm describing, where one easily runs a dozen applications.

- Bottom line: allocating short lived objects is often cheaper and faster than alternatives, leading to lower latency common-case results, if we discount the L1 cache pressure arguments outside of the area of single-usec cold data latency. Its only real "down side" is the requirement for periodic collection of that stuff, and the increased frequency of that collection if you increase allocation pressure.

I would try to perform some measurements by writing two similar applications, one where mutability is commonplace and references are kept for a while, and a mainly immutable application. I'm pretty sure that the immutable one will outperform the former. I have witnessed some rewritten applications work better this way, but they aren't quite similar enough to prove it. I'll look into jHiccup for measuring this.

- G1/CMS/ParallelGC will all behave the same in this respect, as none of the OldGen collector algorithms matter for this question. Those short lived objects will all be collected by the newgen, and the OldGen collector parts will never see them. All mainstream HotSpot collectors use virtually identical stop-the-world, parallel, generational newgen collector mechanisms to do this work. So if the stop-the-world newgen effects are ones you can live with in your application, then the efficiency is there to be had with a well sized generational heap, and the common-case speed will be there too.

Correct me if I'm wrong, but wouldn't G1 basically work closer to what a NewGen is instead of an OldGen? I.e., it is optimised for frequent collection of short lived objects + compaction. 

| Peter Lawrey

 I would have your core code single threaded, and perhaps reading/writing IO in additional threads.  I would only add threads when you know you need them.
 
 To a certain extent, that is the case.
 
-  Don't get me wrong, short lived objects are great, medium lived objects are the pain (unless you have something like Zing) 
For most applications, short lived objects are the simplest, natural choice.  My view is that it doesn't have to be this way and in rare cases, you actually want to avoid it.

Agree

Fernando Racca

Gil Tene

unread,
Jul 22, 2013, 8:51:01 PM7/22/13
to mechanica...@googlegroups.com
- G1/CMS/ParallelGC will all behave the same in this respect, as none of the OldGen collector algorithms matter for this question. Those short lived objects will all be collected by the newgen, and the OldGen collector parts will never see them. All mainstream HotSpot collectors use virtually identical stop-the-world, parallel, generational newgen collector mechanisms to do this work. So if the stop-the-world newgen effects are ones you can live with in your application, then the efficiency is there to be had with a well sized generational heap, and the common-case speed will be there too.
 
... 
Correct me if I'm wrong, but wouldn't G1 basically work closer to what a NewGen is instead of Old Gen? I.e, is optimised for frequent collection of short lived objects + compaction. 

G1's collection logic is still split into a monolithic stop-the-world newgen, and an incremental compacting oldgen. While the newgen implementation in G1 differs under the hood from ParallelGC and CMS (it doesn't use the same contiguous newgen heap concept, and instead uses the G1 regions), the newgen collection algorithm is still the same. Basically, when newgen is "full", a monolithic stop-the-world, parallel, single-pass-copying, using-a-card-table-based-oldgen-to-newgen-remembered-set collection is run on the newgen regions. Things there either die, survive and stay in newgen regions, or get promoted to oldgen regions. The key thing is that there are no increments to newgen. It's an all-or-nothing sort of thing.

Oldgen for G1 is where the actual "G1" algorithm comes in. Per my understanding/classification, a single G1 oldgen cycle uses a mostly concurrent multi-pass marker (with a concurrent pass followed by a STW catch-up pass to finish marking mutated stuff and avoid the classic concurrent marking race), followed by multiple increments of incremental stop-the-world compaction of selected region sets. Each compaction increment evacuates a chosen set of regions (a CSet), and corrects all references in the heap that might have pointed to evacuated objects. G1's RSet tracking is usually used to reduce the set of references that need to be scanned on each increment (to something smaller than the whole heap per increment when things work out the right way). CSets are selected based on target stop-the-world pause latencies, and based on estimates of the stop-the-world work needed to evacuate the CSet (which is the "easy" part in my experience) and correct all the references to it (which is the "harder" part, IMO). CSets are also selected so that the sparsest regions get evacuated first, hence the name "garbage first".
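For context, the pause-time goal and region size that drive that CSet selection are exposed as tuning flags; a hedged illustration (circa JDK 7; the values are placeholders, not recommendations):

    # target pause length that G1 sizes its CSets against, plus an explicit region size
    java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:G1HeapRegionSize=8m -jar app.jar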

It's misleading to think of G1's oldgen as "frequent collections". G1 is not more efficient than other oldgens. It's less efficient, but attempts to contain oldgen pause times. What G1's oldgen really does is "frequent" compaction increments within one oldgen collection. Basically, it will take what would have been one big oldgen pause and try to break it into multiple shorter pauses, where the program is allowed to run in between. Those pauses will still amount to at least as much (stop-the-world) work per oldgen cycle as the non-incremental ParallelGC does. E.g. the bulk of the cost in an oldgen collection is the marking and pointer fixup passes, and G1 will do more of that oldgen work per promoted unit when compared to ParallelGC. Significantly more sometimes (ParallelGC deals with work that is, in the worst case, still linear to the live set and heap size, where G1's becomes N^2 to heap size).

So back to the question: In the newgen/oldgen split above, the newgen would be picking up all your short lived objects, and none of them would make it into the oldgen for the incremental work to deal with...

Peter Lawrey

unread,
Jul 23, 2013, 2:29:52 AM7/23/13
to mechanica...@googlegroups.com
On 22 July 2013 23:07, <jamie...@typesafe.com> wrote:
In Java, if you want a task to be performed after another task, you put the code together, e.g.
    new Runnable() {
        public void run() {
             task1();
             task2(); // task to be performed when task1 finishes
        }
   }
Not that complicated

Absolutely true.  I don't always want to chain tasks together unless there is a relationship, but yes, this works great if I want sequential task execution on the same thread.

The futures stuff I'm pointing out is much more related to I/O bound tasks and blocking operations, like database access.  Context switches are expensive, but so is wasting a core, as Java's Future currently forces you to do.  Every developer looking to maximize performance in an application with I/O bound tasks has to make the choice between whether they want to busy spin the handler or defer execution via callbacks.


There is only one thread here so it doesn't make sense to say it forces you to waste a thread, and even if you did wait for the Future in another thread, it wouldn't make sense to busy spin.

Peter Lawrey

unread,
Jul 23, 2013, 2:56:29 AM7/23/13
to mechanica...@googlegroups.com
It's important to have a clear idea of your performance requirements.  There is no such requirement as "as fast as possible".


On 23 July 2013 00:14, Fernando Racca <fra...@gmail.com> wrote:
In these environments, latency is to be kept to a "considerable minimum", but the hard SLAs are closer to hundreds of milliseconds, so there's some wiggle room.

You need to have some idea of what percentiles you need: a couple of hundred milliseconds 99.9% of the time may mean you need a typical latency of 20 milliseconds.  This translates to about 50,000 events per second.

If the process is too busy, we can create new sharded instances, i.e., we can trade space for speed.

Before you start assuming you need multiple threads, you need to look at the throughput.  If you need 50,000 per second you only need one thread; if you need 100,000 you need two or three threads.

 
Given these constraints, these application should push data out as fast as possible, in a soft real time basis. 

Whenever I have heard "as fast as possible" all reason goes out the window.  You need to engineer your system and make it as fast as it needs to be, but no more complicated than it needs to be.

Concurrency is considered mainly because:

- "Static" data can and probably will change during the course of this application
- Fast moving data will arrive from a topic subscription and we want to minimize time spent reading from the source
- We need to perform expensive computations on the fast moving data (including in some cases native calls)
- push data out, again blocking on I/O

None of these alone say you can't use just one thread to do the whole job.
 
- Mixed environment, where one application doesn't own the hardware it's running on. Think extra large instances in AWS

This is where you might need concurrency to minimise the impact of uncontrolled jitter. A better, simpler solution is to have a controlled system.
 

Thrashing on the CPU caches is unfortunately, going to happen no matter what one application does, 
since there will be no less than 10 heterogenous processes running, competing for resources. 
The best one can do is to be a good citizen, and play nice during your time slice.

For the time constraints you have, I would start with one thread, you don't need to worry about any of these issues.
 

Under these circumstances, the main priority I consider is correctness and simplicity, not raw speed.

So start with one key thread IMHO that will be the simplest.

A rich, well tested domain model over bytes and cache lines. However, I'm concious of the performance implications, and that's
why I would be interested in opinions to further improve the design.  

Cache line issues make a real difference if you are looking at over one million events per second in one thread, or latencies around one microsecond.

Kristian Rosenvold

unread,
Jul 23, 2013, 2:58:31 AM7/23/13
to mechanica...@googlegroups.com
2013/7/23 <jamie...@typesafe.com>:
> I posted a tweet a few weeks ago about how Klout is handling 1 billion
> requests per day on 20 servers using Play/Akka/Scala. Broken down to the
> second, that's really only 11,574 per second, which isn't terrible but also
> isn't mind-bogglingly great. Someone on Twitter called that out and said
> they could do that in their sleep, and I agree, they probably could.

Sorry for trolling about this; but that performance is really not very
good at all.
As a bystander in scala world it makes me wonder if this is something about
scala/fp or something about your code/application - did you make any
measurements
to determine where your performance gets lost ?

Kristian

Martin Thompson

unread,
Jul 23, 2013, 4:55:20 AM7/23/13
to mechanica...@googlegroups.com
On Monday, July 22, 2013 10:43:53 AM UTC+1, Fernando Racca wrote:
Immutability and functional programming look very reasonable ideas in a highly concurrent environment.
 
Fernando,

This statement I see often and it fascinates me.  The big plus I see with FP and immutability in general is the ability to reason about code in a nice way.  If you throw concurrency into the mix then the ability to reason about code is just never going to be as easy as with single threaded code, regardless of paradigm.

The main reason I see for multi-threaded programming is IO, or parsing text based protocols, which is pretty much always associated with IO.  If you measure most real-world systems, the business logic takes up a tiny fraction of total CPU time.  There are a few applications for which this is not the case, but they are rare.  I cannot tell you how many times I've seen people try to make their business logic run faster by making it parallel, only to find out that the costs of going parallel are greater than the benefits they bring.  It is often much easier to profile and optimise the single thread to make it faster.  Unfortunately we don't seem to use science or know anything about computers these days, even though we are supposed to be doing Computer Science :-)

If you cannot get all your work done on a single thread, even after profile lead optimisation, then you can often split the work down into a pipeline of stages with queues between each single threaded stage.  

To the subject of short lived objects.  Algorithms can be made much cleaner and easier to reason about with some object allocation.  They are usually easier to test and, let's face it, correct code is what needs to be front and centre.  I've found stack based allocation to give the best performance and it is such a shame more effort has not been put into escape analysis.  JRockit led the way with object explosion but it is no more.  Hopefully this area will get more focus in the future.  Hint to Gil: because you have such a good GC maybe you've neglected the potential performance gains from escape analysis and stack based allocation ;-)

Most of our production systems run on multi-socket servers and NUMA is a major issue.  Try binding your application to a single socket for CPU and memory resources and measure the performance.  Often restricting your application like this is a major improvement.  If you do allocate a lot of short lived objects most likely they do not escape the allocating thread.  If you are running with ParallelOldGC try adding -XX:+UseNUMA so that objects are allocated in a part of eden that is in a node-local memory region.  I've seen this bring significant performance improvements.  Shame this option is not available with other collectors.
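A hedged sketch of the two knobs mentioned above: the NUMA-aware eden option, and binding the whole JVM to one socket (the numactl example assumes Linux with the tool installed, and node 0 as the target socket):

    # NUMA-aware eden allocation; only honoured with the parallel collector
    java -XX:+UseParallelOldGC -XX:+UseNUMA -jar app.jar

    # bind the process to socket 0's cores and memory, then measure the difference
    numactl --cpunodebind=0 --membind=0 java -XX:+UseParallelOldGC -jar app.jar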

For me the bottom line is that if you share objects across threads then there is a cost to be paid.  Objects that do not escape a thread can be much more efficient.  An object pool that is single threaded is much simpler and faster than one that is concurrent.  Concurrent protection of an object pool is very unlikely to beat the efficiency of a garbage collector.  I'm going to leave aside all discussion of GC pauses as they have been done to death in other discussions in this group.

Now to the real issue of immutable data structures in a concurrent environment.  In FP data structures are described as "persistent", which means that if you have a reference to a structure it will appear to be immutable even in the presence of concurrent updates.  This is really nice in that you get a read consistent view.  To achieve this a data structure cannot totally copy itself on update and remain efficient.  Most FP persistent data structures take the form of a tree and perform a path copy from leaf to root for the change, reusing as much of the tree as possible.  What this results in is a single contention point at the root of the tree for CAS operations.  Now expand this view to a larger domain model of many entities, e.g. Customers, Orders, Products, etc., all interrelated, and that model needs one or more entry points.  Each of these entry points becomes a contention hotspot as the model is mutated.  Also, with trees and linked lists being the underlying data structures, performance is limited due to indirection causing cache misses.
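To make the "single contention point at the root" concrete, here is a small Scala sketch (names invented): a persistent map published through one AtomicReference, where every writer path-copies and then races on the same compare-and-swap.

    import java.util.concurrent.atomic.AtomicReference
    import scala.annotation.tailrec

    final class CustomerModel {
      // the whole immutable model hangs off this one reference -- the contention point
      private val root = new AtomicReference(Map.empty[Long, String])

      @tailrec def put(id: Long, name: String): Unit = {
        val current = root.get()
        val next    = current.updated(id, name)       // path copy; the rest of the structure is shared
        if (!root.compareAndSet(current, next)) put(id, name)   // lost the race: retry
      }

      def snapshot: Map[Long, String] = root.get()    // readers get a consistent view for free
    }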

If you have a relatively low level of mutation in the data model then pure FP can perform reasonably well and it is much easier to reason about.  If you have significant levels of mutation then I've never seen it perform anywhere close to what is possible with other mutation based techniques.  There is typically an order of magnitude performance delta here.  Personally I'm a big fan of functional techniques and my coding style draws on FP and OO.  Often an FP approach of treating things as functions makes more sense, and gives greater performance than objects, by having less indirection.  John Carmack wrote a nice blog on a practical approach to this.

I keep hearing that "immutable objects can be optimised by the runtime".  My answer to this is "How?"  In most JVMs I know we do not even have constant propagation for optimisations with instance final fields, yet we have it for static final fields.

Martin...

Rüdiger Möller

unread,
Jul 23, 2013, 5:41:26 AM7/23/13
to mechanica...@googlegroups.com
| Rüdiger Möller 

Get a huge Eden to reduce GC overhead. CMS is easiest tuneable+predictable at the
moment. I have done some testing which might save you some time

What if most of the objects are either very short lived, or almost static? Once the application static data is fully pre-cached, 
it goes into a stable mode. Static data should have minimum references to young objects. 
CMS used to be a reasonable alternative, but G1/ Zing should perform even better in this situation. 



Actually the "age" of objects is not measured on a time basis, but by how many newgen collections have been "survived". So a high allocation rate and a smallish Eden+survivor leads to "false" promotion of temporary objects to OldSpace, as Eden collections happen too frequently. The cost of false promotion and its indirect impact on OldGen collection is hard to quantify and should be taken into account when benchmarking pooling against object creation. From my experience, reducing allocation/reusing objects is 90% of the time superior to allocation. Of course one can overdo it; especially explicit object pooling using non-primitive data structures (e.g. collections) is easily counterproductive. Additionally, experiments show that a large Eden improves throughput for apps with a high permanent alloc rate.
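One way to watch the (false) promotion effect described above, assuming a HotSpot JVM of that vintage: enlarge the young generation and log the tenuring distribution (sizes are placeholders):

    java -Xmn2g -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 \
         -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -jar app.jar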

 
| Jamie

Kent Beck famously said, "Make it work, make it work well, make it work fast."  My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java.  I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.

Completely agree.


Imho no. One has to have a clear idea in advance of how to optimize the system later on. Then apply the golden rule.

jamie...@typesafe.com

unread,
Jul 23, 2013, 3:59:56 PM7/23/13
to mechanica...@googlegroups.com
It's not my application, and the point is that high throughput is not always the concern.  They could certainly try to get a higher number of requests per second on fewer servers, but that comes at the cost of fault tolerance.  For you to say "that performance is really not very good at all" without knowing their application requirements isn't realistic, because you have no idea how bursty their traffic can be nor their requirements for uptime.  A site that needs 9 nines of uptime will not be architected to maximize throughput on the minimal number of servers - you need redundancy across JVMs, VMs/nodes and data centers, and the ability to handle all of your traffic on a minimal subset of each of these when major failures occur.

jamie...@typesafe.com

unread,
Jul 23, 2013, 4:00:55 PM7/23/13
to mechanica...@googlegroups.com
That may work fine for you.  I'd say it depends on what you're doing.  Donald Knuth would say that premature optimization is the root of all evil.

jamie...@typesafe.com

unread,
Jul 23, 2013, 4:06:19 PM7/23/13
to mechanica...@googlegroups.com
I think we're talking past each other.  I was under the expectation that we were discussing using futures to utilize multiple threads, not defining a runnable of a single thread with sequential execution of tasks.  My bad.

John A. De Goes

unread,
Jul 23, 2013, 4:09:12 PM7/23/13
to mechanica...@googlegroups.com

Ditto on the below.

To which I'd add: servers are cheap, especially compared to human labor. If the nature of your work can be highly distributed, then scaling out with high-level abstractions that make it easy to write and maintain quality code is likely to be substantially cheaper than over-engineering a high-performance solution in an ultimately pointless attempt to save a few servers.

Regards,

John

Gil Tene

unread,
Jul 23, 2013, 6:09:46 PM7/23/13
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
In Zing, BTW, age is time, and not that silly notion of the number of times around an arbitrarily sized block going at a different and varying arbitrary speed.

Sent from Gil's iPhone

Kristian Rosenvold

unread,
Jul 23, 2013, 6:36:39 PM7/23/13
to mechanica...@googlegroups.com
Well of course I don't know your application, but the numbers are
sufficiently low to warrant asking if you know why they're so low.

Most of the discussion on this mailing list seems to be with people
who have studied the bottlenecks in their systems and worked through
most of the usual suspects. Then you change logging frameworks a few
times because the "defaults" suck. After that you come to the
interesting stuff about synchronization and locality issues. At this
point we might be looking at performance numbers in the area 5-20x
your numbers.

Hosting 20 servers when you could've gotten away with 4 (give or take
a few) is not necessarily a clear cut business case if you're looking
at premium high-availability type rack space in a secure mountain,
which still amounts to serious $$ for a 5 year period.


If you're willing to look at /why/ it's slow, we might all learn
something. Since you're so clearly selling the Scala kool-aid it
certainly makes me curious if there's something inherently
non-performant
about scala or the current generation of scala frameworks; I'm trying
to not troll here; I'm genuinely interested in what this might bring.
If you don't have sufficient interest in your application to know
/why/ it's slow, I just think you should bring your discussion
elsewhere.


Kristian



Gil Tene

unread,
Jul 23, 2013, 7:36:27 PM7/23/13
to <mechanical-sympathy@googlegroups.com>
It's worth noting that in most sites and applications, the peak 30 seconds of traffic can easily exceed 100x or even 1000x the average load. The total messages per day and the load you need to be able to carry at peak time are almost as unrelated as cats are to hummingbirds, making the exercise of extrapolating capacity carrying requirements from total daily traffic roughly as useful as counting beans on a rosary is for an algo trader.

But Jamie, in all fairness it's your silly metric that started this, and you are the one that started the extrapolating number-hurling by calculating the average per second to begin with... ;-).

Maybe we should focus on extrapolating standard deviation instead.

Jason Koch

unread,
Jul 23, 2013, 7:47:54 PM7/23/13
to mechanica...@googlegroups.com
To OP - As with most comp sci problems, choosing the right data structures for your problem will help your performance immensely. Often you can also wrap mutable data structures to give a similar "no side effects" outcome while using mutable data.
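A small Scala sketch of that wrapping idea: the mutable builder is confined to one function call, so callers only ever see the immutable result.

    // mutable builder that never escapes this function
    def squares(xs: Seq[Int]): Vector[Int] = {
      val buf = Vector.newBuilder[Int]
      xs.foreach(x => buf += x * x)
      buf.result()                     // immutable snapshot handed out to callers
    }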

I'd highly recommend doing a little work on reimplementing the set and map and list type implementations in Scala so you have a good understanding of where you are allocating objects. That will help you understand when to make decisions about which collection type to use, and know when to take advantage of mutability, streaming, etc.

If you're using actors, or threads connected by queues, or futures, or whatever you might find that mutable data structures are OK if you can avoid sharing the objects across actors.

Also, I'm not clear why anyone thinks 11K transactions per second is slow or fast; you can't tell a lot from the number alone without knowing what's involved in handling a transaction.

11K/second for print("message received") is different from a sequential order book match which is different from analysing differences between currency pairs and making predictions on price movements. It's not hard to imagine that recomputing node weights on a complex social graph would be a slower calculation again.


jamie...@typesafe.com

unread,
Jul 23, 2013, 7:52:44 PM7/23/13
to mechanica...@googlegroups.com
Gil - Well-said.  I was attempting to make the point that sometimes throughput isn't the only concern, and that's probably not a discussion for this mailing list.

Michael Barker

unread,
Jul 23, 2013, 11:30:42 PM7/23/13
to mechanica...@googlegroups.com
One thing that makes life difficult for functional programming on the
JVM is that supporting pure functional languages is relatively new
territory, meaning that a number of optimisations/features that are
probably available in runtimes built specifically for functional
languages (e.g. Erlang, Haskell) are not yet implemented on the JVM.
Examples would include value types[1] and tail recursion[2]. The
JVM's sweet spot for optimisation is still imperative OO code*. While
lambdas will be appearing in JDK8 there is still a lot of optimisation
work that needs to go on in order to make associated features like the
streaming API comparable to standard array iteration[3]. I'm fairly
sure that this will improve over time as versions 8, 9 and 10 come
out. But right now, to squeeze the most out of the JVM you'll
probably need to code in Java (including Scala that looks like Java,
but with less boilerplate) with mutability in the appropriate places.
I'm fairly confident that an OO/imperative approach will always be
faster on the JVM, but the gap will close, much the same way the gap
between Java and C has closed, and the dominant factor when it comes
to performance will become less about language choice or coding style.

* One example of this was when I looked at the parallel string
splitting algorithm that Guy Steele talked about in his talk on
parallel programming[4]. He demonstrated a functional divide and
conquer approach to solving the problem that was easy to parallelise.
However, I found that I was able to write a trivial single-threaded loop
that beat the parallel implementation I wrote in Scala
(on a 16 core workstation). That was after I spent many days
optimising the parallel implementation, e.g. using my own mutable list
implementation, with O(1) concatenation, instead of the built-in
immutable version.
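For illustration, a trivial single-threaded splitter of the sort described above might look roughly like this (a sketch with invented names, not the actual benchmark code):

import scala.collection.mutable.ArrayBuffer

// Split on single spaces, accumulating into a mutable buffer instead of
// building an immutable list through repeated concatenation.
def splitWords(s: String): ArrayBuffer[String] = {
  val words = ArrayBuffer.empty[String]
  var start = 0
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == ' ') {
      if (i > start) words += s.substring(start, i)
      start = i + 1
    }
    i += 1
  }
  if (start < s.length) words += s.substring(start)
  words
}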

[1] https://blogs.oracle.com/jrose/entry/value_types_in_the_vm
[2] https://blogs.oracle.com/jrose/entry/tail_calls_in_the_vm
[3] http://markmail.org/thread/6w36zmcizw3xivpg
[4] http://www.infoq.com/presentations/Thinking-Parallel-Programming

Mike.

Rüdiger Möller

unread,
Jul 24, 2013, 5:46:49 AM7/24/13
to mechanica...@googlegroups.com
Wherever I go, Gil has been there ;-)

Gil Tene

unread,
Jul 24, 2013, 11:23:52 AM7/24/13
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
Or maybe it's a cloned version. ;-) Sometimes I don't remember posting these...

Sent from my iPad

Kirk Pepperdine

unread,
Jul 25, 2013, 2:59:06 AM7/25/13
to mechanica...@googlegroups.com

Wherever I go, Gil has been there ;-)

+1

On Wednesday, 24 July 2013 00:09:46 UTC+2, Gil Tene wrote:
In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary sized block going at a different and varying arbitrary speed.

The notion of the number of times around an arbitrary sized block is a very very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)

 
| Jamie

Kent Beck famously said, "Make it work, make it work well, make it work fast."  My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java.  I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.

What Kent Beck *didn't* say, and which is *often* if not *always* read into this quote, is that you shouldn't plan for performance. We do have tools and techniques to help plan for performance, one of which is called benchmarking.



Imho no. One has to have a clear idea, in advance, of how to optimize the system later on. Then apply the golden rule.

What I've found is that if you've made a fundamental mistake at the beginning of the process it can be very, very difficult to back out of it. You need to plan for performance, which means you need to do more than simply hack away and grab requirements as needed, as people who (mis)understand agile and scrum etc. tend to do. BTW, and slightly OT, I've always suggested agile isn't about being faster, it's about being agile, which means being able to recover from mistakes or react to changing conditions very quickly. It takes planning (and practice, to be able to create flexible, adaptive plans) to be agile.

Regards,
Kirk

Fernando Racca

unread,
Jul 25, 2013, 3:23:15 AM7/25/13
to mechanica...@googlegroups.com


On Wednesday, 24 July 2013 04:30:42 UTC+1, mikeb01 wrote:
One thing that makes life difficult for functional programming on the
JVM is that supporting pure functional languages is something that is
relatively new.  Meaning that a number of optimisations/features that 
are probably available in runtimes built specifically for functional
languages (e.g. Erlang, Haskell) are not yet implemented on the JVM.

Indeed
 
Examples would include value types[1] and tail recursion[2].  

On value types, until they are provided by the JVM, there's a very interesting approach in Scala 2.10 using value classes, which avoid allocating runtime objects while retaining flexibility: http://docs.scala-lang.org/overviews/core/value-classes.html

On tail recursion, you can annotate a method with @tailrec and it will be compiled into a more efficient while loop. Both are techniques that, although not provided by the JVM, let Scala offer a significant improvement over the same approach in plain Java.
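For example, a minimal sketch of both techniques (hypothetical types, illustrative only):

import scala.annotation.tailrec

// A value class: at runtime a Price is usually represented as the bare
// Double, so the wrapper itself costs no allocation at most call sites.
// (It does get boxed when stored in generic collections such as List.)
final class Price(val value: Double) extends AnyVal {
  def +(other: Price): Price = new Price(value + other.value)
}

object Totals {
  // @tailrec makes scalac verify the call is in tail position and compile
  // it down to a loop, so there is no stack growth per element.
  @tailrec
  def total(prices: List[Price], acc: Price = new Price(0.0)): Price =
    prices match {
      case Nil          => acc
      case head :: tail => total(tail, acc + head)
    }
}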

 
But right now, to squeeze the most out of the JVM you'll
probably need to code in Java (including Scala that looks like Java,
but with less boiler plate) with mutability in the appropriate places.

This may be correct to say if you are looking to max out what you can get from the JVM. In my case, I'm looking to max out what I can get from the JVM while staying almost purely functional.

 
 I'm fairly confident that an OO/imperative approach will always be
faster on the JVM, but the gap will close.  Much the same way the gap
between Java and C has closed, and the dominant factor when it comes
to performance will be less about language choice or coding style. 


Precisely. I'm trying to identify the best techniques for delivering high-performance functional applications.

So far, in terms of garbage collection, the advice seems to be either a large Eden/young gen or Zing, which is quite useful; I need to benchmark it properly.

Also, to consider mutable collections more. In this regard, Scala has a cleanly split family of mutable and immutable collections to choose from. Some useful examples: http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.ArrayBuffer and http://www.scala-lang.org/api/current/index.html#scala.Array
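A rough sketch of the trade-off (illustrative only): the idiomatic immutable version allocates per element, while the Array version works on primitives with no intermediate garbage.

// Idiomatic, immutable: a new cons cell (plus boxing) per element.
def squaresImmutable(xs: List[Int]): List[Int] =
  xs.map(x => x * x)

// Imperative, mutable: one primitive int array, no per-element garbage.
def squaresMutable(xs: Array[Int]): Array[Int] = {
  val out = new Array[Int](xs.length)
  var i = 0
  while (i < xs.length) {
    out(i) = xs(i) * xs(i)
    i += 1
  }
  out
}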

Fernando

Martin Thompson

unread,
Jul 25, 2013, 3:46:07 AM7/25/13
to mechanica...@googlegroups.com
+1 on the whole reply.

Premature optimisation is a poor excuse for the unprofessional behaviour of not considering future performance needs.  Continuous performance testing, profiling, and debugging should become common practice.  The most important value from Agile is having short feedback cycles to guide action and decision making.  Without the tight feedback cycles we can go really far in the wrong direction and it can be costly to correct and just plain unprofessional.


On Thursday, July 25, 2013 7:59:06 AM UTC+1, Kirk Pepperdine wrote:

Wherever I go, Gil has been there ;-)
+1
On Wednesday, 24 July 2013 00:09:46 UTC+2, Gil Tene wrote:
In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary sized block going at a different and varying arbitrary speed.

The notion of the number of times around an arbitrary sized block is a very very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)

Peter Lawrey

unread,
Jul 25, 2013, 4:06:34 AM7/25/13
to mechanica...@googlegroups.com
On 25 July 2013 08:46, Martin Thompson <mjp...@gmail.com> wrote:
 Continuous performance testing, profiling, and debugging should become common practice.  

I would agree with this even if performance is not critical for you. A profiler gives you another view of what your system is doing. I have found bugs using a profiler that I wouldn't have found another way, e.g. in one application the most common object was an XxxxxException (which was being caught and handled, but shouldn't have been happening so often); in another case, a resource with a thread that should have been cleaned up wasn't being cleaned up. This led to a thread leak which didn't show in unit tests and which I might not have seen in production for a very long time.

The most important value from Agile is having short feedback cycles to guide action and decision making.  Without the tight feedback cycles we can go really far in the wrong direction and it can be costly to correct and just plain unprofessional.

Also, not releasing regularly can lead to trying to get too many changes working at once. The more things you try to change at once, the more interactions you have between those changes and the longer it takes to get it all understood and working, i.e. sometimes N changes take O(N^2) time or worse to get working/stable. When you have to deploy regularly, it changes the design by forcing you to make changes which are more incremental and less likely to blow out. It doesn't work so well in all cases, as sometimes you need something more big-bang, but in most cases incremental changes can be much more efficient and effective IMHO.

What I've found is that if you've made a fundamental mistake at the beginning of the process it can be very very very difficult to back out of it.

Worse, if the performance is not there from the start, figuring out where your latency went can be very hard in a complex system, because all your resource usage tends to interact. For me, tuning a reasonably well written application can help by about 30%, but if you need significantly faster timings, e.g. 10x faster, you need to start again, testing the performance of the critical paths as you build the system.

-- Peter.

Kirk Pepperdine

unread,
Jul 25, 2013, 4:25:59 AM7/25/13
to mechanica...@googlegroups.com
On 2013-07-23, at 10:00 PM, jamie...@typesafe.com wrote:

That may work fine for you.  I'd say it depends on what you're doing.  Donald Knuth would say that premature optimization is the root of all evil.

Again, Tony Hoare's infamous quote is being misused. Planning for performance is *NOT* a premature optimization. It's doing what is necessary to ensure that your application will meet its performance goals.

Regards,
Kirk


steview....@googlemail.com

unread,
Jul 25, 2013, 10:45:59 AM7/25/13
to
Indeed, I ended up getting so fed up with people saying this that it formed a sort of mini-rant. (apologies for the self reference).


-----------


On the wider point, I always thought it a bit strange that performance and optimisation tooling has lagged behind the other types of tooling in the recent past. I suppose Moore's law has squeezed that aspect out for a lot of programmers in a day-to-day sense. Although I can easily see this becoming far more of an issue in the lateral expansion into larger multi-cores, as the CPUs/compilers are not going to pick up the slack and you cannot scale an application by just throwing more cores at it without a bit more thought as to design, profiling to see what interactions are actually happening, and optimisation.

Perhaps multi-core is the shift that will move this more into the mainstream; to me, that issue seems to be essentially the reason for the pickup in groups like this.



------

Gil Tene

unread,
Jul 25, 2013, 10:59:03 AM7/25/13
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
On Jul 24, 2013, at 11:59 PM, "Kirk Pepperdine" <ki...@kodewerk.com> wrote:



Wherever I go, Gil has been there ;-)

+1

On Wednesday, 24 July 2013 00:09:46 UTC+2, Gil Tene wrote:
In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary sized block going at a different and varying arbitrary speed.

The notion of the number of times around an arbitrary sized block is a very very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)

For diagnostics, wall clock time reporting would give you the same (or better) information. But the silliness is not about reporting, it's about what happens to GC behavior under increased load. That silly behavior may then lead people to need to diagnose it ;-).

Here is a simple thought exercise: think about what happens to newgen efficiency and premature promotion when throughput grows in a multithreaded, multi-session system, if age is counted in number of newgen cycles instead of time...

- The purpose of keeping objects in newgen is to "give them enough time to die young" before promoting them if they don't. This is key to maintaining efficiency in all generational collectors, and losing this filter makes efficiency collapse.

- A very large newgen helps, but objects that are very young at collection time would be prematurely promoted regardless of the size of the newgen they come from (this is the main reason GCs that count age in cycles keep objects around for at least one additional cycle even in very large newgens).

- On the other hand, promoting too late causes oldgen to do unnecessary work, which hurts efficiency (and, if done in a stop-the-world pause, increases pause time significantly).

There are two negative effects to counting age in units that compress under load. The first is the unfortunate behavior of premature promotion as load increases if you do this. The other is the behavior during program phase changes. I'll explain the premature promotion downside here (the phase change one can go in another post if people are interested):

As load grows on an (e.g. app-server-style) environment, it usually grows in the form of additional concurrent work (as opposed to "harder" or longer individual operations). The natural length and object lifetime behavior of computations and sessions either stays the same or elongates. It virtually never gets shorter with higher load.

If age is counted in GC cycles, then the higher the load and the faster the allocation rate is, the sooner (in wall clock time) objects will get promoted. Once the load gets high enough for objects that would have previously died in newgen to be promoted because they are "old enough in cycles", efficiency drops, oldgen rates grow, and the snowball starts rolling down the hill. As load grows, not only will the allocation rate grow, but more premature promotion will occur as a percentage of allocation. That in turn means that not just GC work, but GC efficiency (measured as the % of overall CPU work spent on GC) gets worse with load.

In contrast, when age is counted as time, promotion decisions remain more "semi-constant under load" and are either right or wrong for a computation's object lifetime behavior regardless of load. This keeps GC efficiency closer to constant over a wider range of load compared to cycle-count based timing.
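A rough back-of-the-envelope illustration (all numbers invented for the example): with a 2 GB newgen, a tenuring threshold of 4 cycles and a 200 MB/s allocation rate, a newgen cycle happens roughly every 10 seconds, so an object has to survive about 40 seconds of wall clock time before being promoted. Push the load to 2 GB/s and cycles come roughly every second, so the same threshold now promotes anything older than about 4 seconds; session-scoped objects that used to die comfortably in newgen start being promoted exactly when the system is busiest. A time-based age of ~40 seconds would apply the same filter under both loads.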

Kirk Pepperdine

unread,
Jul 25, 2013, 2:35:27 PM7/25/13
to mechanica...@googlegroups.com
Nice, I wrote a review/rant for one of Andy Hunt's books on Agile a while back. Of course Andy had it right, whereas others just don't seem to get that there are no shortcuts for some of this stuff....

As for tooling... there is a long rant I could go on about but.....

-- Kirk


Kirk Pepperdine

unread,
Jul 25, 2013, 2:48:26 PM7/25/13
to mechanica...@googlegroups.com
On 2013-07-25, at 4:59 PM, Gil Tene <g...@azulsystems.com> wrote:

On Jul 24, 2013, at 11:59 PM, "Kirk Pepperdine" <ki...@kodewerk.com> wrote:



Wherever I go, Gil has been there ;-)

+1

On Wednesday, 24 July 2013 00:09:46 UTC+2, Gil Tene wrote:
In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary sized block going at a different and varying arbitrary speed.

The notion of the number of times around an arbitrary sized block is a very very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)

For diagnostics, wall clock time reporting would give you the same (or better) information.

I disagree. Wall clock time is important, no doubt, but it can't replace aging using survival counts as a measure.

But the silliness is not about reporting, it's about what happens to GC behavior under increased load. That silly behavior may then lead people to need to diagnose it ;-).

Ain't going to argue with you here.


Here is a simple thought exercise: think about what happens to newgen efficiency and premature promotion when throughput grows in a multithreaded, multi-session system, if age is counted in number of newgen cycles instead of time...

I see the effects and I can tune (or hopefully auto-tune) in response to the delta in frequency.


- The purpose of keeping objects in newgen is to "give them enough time to die young" before promoting them if they don't. This is key to maintaining efficiency in all generational collectors, and losing this filter makes efficiency collapse.

Time in this case is a measure that is somewhat orthogonal to application behaviour... and I would still argue that the best measure of "time" is age by survival of collections. I would not find wall clock time helpful in these cases.


- A very large newgen helps, but objects that are very young at collection time would be prematurely promoted regardless of the size of the newgen they come from (this is the main reason GCs that count age in cycles keep objects around for at least one additional cycle even in very large newgens).

This is due to the orthogonal behaviour. Survivor spaces do help with this problem.. quite a bit!!!


- On the other hand, promoting too late causes oldgen to do unnecessary work, which hurts efficiency (and, if done in a stop-the-world pause, increases pause time significantly).

Tuning tenuring thresholds solves this problem also.
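For concreteness, the HotSpot knobs involved are roughly these (the flag values are purely illustrative and the application name is made up):

java -Xmn2g -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=8 \
     -XX:+PrintTenuringDistribution -verbose:gc MyApp

-XX:+PrintTenuringDistribution prints the per-age survivor occupancy at each young collection, which is what lets you see at what ages objects actually die and set the threshold accordingly.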


There are two negative effects to counting age in units that compress under load. The first is the unfortunate behavior of premature promotion as load increases if you do this. The other is the behavior during program phase changes. I'll explain the premature promotion downside here (the phase change one can go in another post if people are interested):

This is why I've been arguing for new adaptive sizing policies for more than a few years..... 


As load grows on an (e.g. app-server-style) environment, it usually grows in the form of additional concurrent work (as opposed to "harder" or longer individual operations). The natural length and object lifetime behavior of computations and sessions either stays the same or elongates. It virtually never gets shorter with higher load.

If age is counted in GC cycles, then the higher the load and the faster the allocation rate is, the sooner (in wall clock time) objects will get promoted. Once the load gets high enough for objects that would have previously died in newgen to be promoted because they are "old enough in cycles", efficiency drops, oldgen rates grow, and the snowball starts rolling down the hill. As load grows, not only will the allocation rate grow, but more premature promotion will occur as a percentage of allocation. That in turn means that not just GC work, but GC efficiency (measured as the % of overall CPU work spent on GC) gets worse with load.

Again, proper adaptive sizing should be able to compensate for this... right now I do this manually. It is this type of calculation that motivated me to write Censum.


In contrast, when age is counted as time, promotion decisions remain more "semi-constant under load" and are either right or wrong for a computation's object lifetime behavior regardless of load. This keeps GC efficiency closer to constant over a wider range of load compared to cycle-count based timing.

Again, understanding promotion rates under different load conditions is important but to tune (Oracle's) collector I need to know age.

W.R.T. premature promotion, all of the full GC implementations are broken. Adaptive sizing is also broken. This is due to the original authors getting the cost models wrong. And even though we've known the cost models were wrong for almost 10 years, these poor adaptive sizing implementations still persist! This is one of the reasons why I think there is a ton more that can be done to improve the efficiency of the current set of OpenJDK GC implementations. It's only recently that key people on the OpenJDK team have started to recognize this. There are supposedly two new implementations but I've yet to see them checked in. I guess it's time for me to poke the bear again.

Regards,
Kirk
