For example, Is G1 the best option these days?
Fernando Racca
Fernando - First off, a disclaimer: I'm from Typesafe, and therefore I'm obviously a fan of Scala. I think it's important that we all keep in mind that every developer has to choose the right tool for the right job. There will be some tasks for which Scala is very well suited, some for which Java is the better approach and some for which neither is the right answer. Your experience will vary depending on your application, what you're doing and how it's written. I can say that several large financial institutions, as in just about any that you've heard of, are now using Scala. I can't say which - many of the large companies with which we work would prefer to stay quiet about it. However, they like it so much that these institutions pay us to build new libraries in support of their needs, which we then open source, such as the Slick data access library and the new Async library.
Martin will attest that I've been a long-time follower of his Mechanical Sympathy tenets, having written a Scala port of the Disruptor v1.0 in July of 2011. I see tremendous value in applying the principles he, Peter Lawrey and others espouse to maximize the performance of my application, such as sizing my working set to fit within the L2 cache that isn't shared between cores on Intel architectures. I apply these in Scala as well, at varying times. While I'm coding, I write all of my methods/functions to be smaller than 35 bytecode instructions to take advantage of inlining. Most others, like cache line isolation and working set sizing, I apply once I've measured where my biggest performance pain points are and how I want to address them. This likely differs from Peter Lawrey's approach, because HFT is an extremely performance-critical domain where even nanoseconds matter.
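As an aside for anyone who wants to check this on their own code: 35 bytecodes matches HotSpot's default -XX:MaxInlineSize, and the JIT's inlining decisions can be watched directly. A minimal sketch, assuming a HotSpot JVM with default settings (the Hot class below is made up for illustration):

// Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Hot
// HotSpot inlines cold call sites only when the callee is <= MaxInlineSize
// (35 bytecodes by default); hot call sites get the larger FreqInlineSize budget.
public class Hot {
    private final long value;

    Hot(long value) { this.value = value; }

    // A tiny accessor like this is only a handful of bytecodes, comfortably
    // under the 35-bytecode threshold, so it inlines everywhere.
    long value() { return value; }

    public static void main(String[] args) {
        Hot h = new Hot(42);
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += h.value();   // watch PrintInlining report this call as inlined
        }
        System.out.println(sum);
    }
}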
Functional programming is not the best approach if you're after the greatest throughput you can get on a single thread, and it will use a larger footprint. The implementation of FP also varies between languages. For example, look at collections libraries. Some libraries/languages use full Copy On Write (COW), which is extremely expensive but guarantees the most isolation of data between threads of execution. Others, like Scala's standard library collections, use structural sharing, which can generate short-lived garbage but does not copy all of the data in the heap - instead, references to shared data are shared between collections. This is transparent to developers, while also providing extremely rich collections functionality.

Note that Scala can be written to be just as performant as Java. We do not have the ability to create bytecode that doesn't execute on the JVM. You can always write code in Java and work directly between the two, if you feel more comfortable doing so - they are completely interoperable (a caveat: it's always easier going from Java to Scala than vice versa). You can delegate work from a Scala application to a Java module very easily.

Kent Beck famously said, "Make it work, make it right, make it fast." My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java. I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values, and when. This is particularly true with a team of developers of varying skill developing simultaneously. Once we have something we know works the way we intended, we must then profile it and find the areas where optimization is required. How much time is spent in a method? How much garbage is being created, and which collector best suits the object lifecycle patterns for this application? How do we size our regions to handle those patterns most effectively?

One other area I'd like to point out - Java has wonderful constructs at a very primitive level. Threads, executors, atomic CAS types, etc. They work very well and form the basis of my multithreaded applications. However, they are very difficult to compose, and even more so when handling errors that can occur. Maybe superstar developers like Martin and Peter can figure out what happened inside of a thread that failed inside of a task executed with ForkJoin, but most devs can't. And even if they wanted to use Thread.UncaughtExceptionHandler, they've now got to figure out all kinds of things they didn't before, such as parallelism level. Scala makes defining work to be spread across cores extremely simple, even when things go wrong. Scala also has a completely asynchronous Future implementation (based on JSR166y and therefore usable on Java 1.6+), whereas Java still requires you to block a thread just to see if one or more futures are completed.

Hope that helps!
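To make the structural-sharing idea concrete, here is a minimal sketch in plain Java of how persistent collections share data instead of copying it (PList and ShareDemo are made-up names; Scala's real implementations are far more sophisticated):

// A minimal persistent (immutable) singly linked list. "Adding" an element
// allocates one small node; everything behind it is shared, not copied.
final class PList<T> {
    final T head;
    final PList<T> tail;   // shared rest of the list (null = empty)

    private PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

    // O(1): allocates exactly one node; the old list remains valid and shared.
    static <T> PList<T> cons(T head, PList<T> tail) { return new PList<T>(head, tail); }
}

class ShareDemo {
    public static void main(String[] args) {
        PList<Integer> base = PList.cons(2, PList.cons(1, null));  // [2, 1]
        PList<Integer> a = PList.cons(3, base);                    // [3, 2, 1]
        PList<Integer> b = PList.cons(4, base);                    // [4, 2, 1]
        // a and b each cost one new (short-lived if discarded) node; the
        // two-node base is shared between them, and immutability makes
        // that sharing safe across threads.
        System.out.println(a.head + " " + b.head + " " + a.tail.head);
    }
}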
On Monday, July 22, 2013 5:40:04 AM UTC-7, Rüdiger Möller wrote:
Let me step in in defense of short-lived object allocation.
I've certainly seen many cases where short-lived object allocation (not the kind that escape analysis eliminates) provides better performance than object pooling.
Concurrent algorithms are often much easier to reason about in such cases, letting you tackle bigger problems.
A set-associative, LRU-managed CPU cache (which is the case for most) will, to a large degree, successfully avoid pushing hot data out of the cache, even in the presence of streaming operations going through the same cache.
I find that most latency sensitive apps (the ones that have 15-50 usec total processing time that I often see in Java) won't spend the effort, as they often have bigger fish to fry in the latency critical path.
I don't personally know how well this stuff works in practice, but these are good examples of the lengths people will go to when trying to keep things cached in relatively idle but latency-critical paths.
Bottom line: allocating short-lived objects is often cheaper and faster than the alternatives, leading to lower-latency common-case results, if we discount the L1 cache pressure arguments outside of the area of single-usec cold-data latency. Its only real "down side" is the requirement for periodic collection of that stuff, and the increased frequency of that collection if you increase allocation pressure.
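As an illustration of the trade-off, a JMH-style sketch contrasting the two approaches might look like the following (hypothetical: AllocVsPool and Event are made-up names, and real numbers depend entirely on the workload and collector):

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.ArrayDeque;

// Fresh allocation is a TLAB bump-pointer operation and the object dies young;
// the pool avoids allocation but adds bookkeeping and keeps objects live.
@State(Scope.Thread)
public class AllocVsPool {
    static final class Event { long a, b, c, d; }

    private final ArrayDeque<Event> pool = new ArrayDeque<Event>();

    @Setup
    public void fill() {
        for (int i = 0; i < 1024; i++) pool.push(new Event());
    }

    @Benchmark
    public void allocate(Blackhole bh) {
        Event e = new Event();   // TLAB allocation; collected cheaply in newgen
        e.a = 1;
        bh.consume(e);
    }

    @Benchmark
    public void pooled(Blackhole bh) {
        Event e = pool.pop();    // no allocation, but extra work per operation
        e.a = 1;
        bh.consume(e);
        pool.push(e);            // returned to the pool; the object stays live
    }
}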
IMHO, that is only a win if the performance is consistently better than writing it in Java with one thread. Spreading work across cores is not an aim in itself; performance should be the requirement.
In Java, if you want a task to be performed after another task, you put the code together, e.g.

new Runnable() {
    public void run() {
        task1();
        task2(); // task to be performed when task1 finishes
    }
}
Not that complicated
- G1/CMS/ParallelGC will all behave the same in this respect, as none of the OldGen collector algorithms matter for this question. Those short-lived objects will all be collected by the newgen, and the OldGen collector parts will never see them. All mainstream HotSpot collectors use virtually identical stop-the-world, parallel, generational newgen collector mechanisms to do this work. So if the stop-the-world newgen effects are ones you can live with in your application, then the efficiency is there to be had with a well-sized generational heap, and the common-case speed will be there too.
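You can verify this on your own workload with the standard HotSpot logging flags (a sketch; the sizes and MyApp are placeholders, and the flag spellings match Java-7-era HotSpot):

java -Xms8g -Xmx8g -Xmn2g \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:+PrintTenuringDistribution \
     MyApp

-Xmn sizes the newgen explicitly, and -XX:+PrintTenuringDistribution shows how long objects survive before promotion; with a healthy generational workload, the short-lived objects should vanish from the youngest age buckets.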
...
Correct me if I'm wrong, but wouldn't G1 basically work closer to what a NewGen is, instead of an Old Gen? I.e., it is optimised for frequent collection of short-lived objects + compaction.
In Java, if you want a task to be performed after another task, you put the code together, e.g.

new Runnable() {
    public void run() {
        task1();
        task2(); // task to be performed when task1 finishes
    }
}
Not that complicated
Absolutely true. I don't always want to chain tasks together unless there is a relationship, but yes, this works great if I want sequential task execution on the same thread.

The futures stuff I'm pointing out is much more related to I/O-bound tasks and blocking operations, like database access. Context switches are expensive, but so is wasting a core, as Java's Future currently forces you to do. Every developer looking to maximize performance in an application with I/O-bound tasks has to choose between busy-spinning the handler and deferring execution via callbacks.
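To show the blocking constraint concretely, here is a sketch against java.util.concurrent as it stands pre-Java 8 (BlockingFutureDemo is a made-up name):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingFutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Future<String> result = pool.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(100);   // stand-in for a database or network call
                return "row";
            }
        });

        // The calling thread is now pinned: get() parks it until the task
        // completes. java.util.concurrent.Future offers no way to say "run
        // this when done", which is the point about callback-based futures
        // freeing the thread for other work.
        String row = result.get();
        System.out.println(row);

        pool.shutdown();
    }
}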
In these environments, latency is to be kept to a "considerable minimum", but the hard SLAs are closer to hundreds of milliseconds, so there's some wiggle room.
If the process is too busy, we can create new sharded instances, i.e., we can trade space for speed.
Given these constraints, these applications should push data out as fast as possible, on a soft real-time basis.
Concurrency is considered mainly because:
- "Static" data can and probably will change during the course of this application
- Fast-moving data will arrive from a topic subscription, and we want to minimize time spent reading from the source
- We need to perform expensive computations on the fast-moving data (including in some cases native calls)
- We push data out, again blocking on I/O
- Mixed environment, where one application doesn't own the hardware it's running on. Think extra large instances in AWS
Thrashing of the CPU caches is, unfortunately, going to happen no matter what one application does, since there will be no fewer than 10 heterogeneous processes running, competing for resources. The best one can do is to be a good citizen and play nice during your time slice.
Under these circumstances, the main priority I consider is correctness and simplicity, not raw speed.
A rich, well-tested domain model over bytes and cache lines. However, I'm conscious of the performance implications, and that's why I would be interested in opinions to further improve the design.
Immutability and functional programming look like very reasonable ideas in a highly concurrent environment.
| Get a huge Eden to reduce GC overhead. CMS is the easiest tuneable + predictable at the moment. I have done some testing which might save you some time.

What if most of the objects are either very short-lived or almost static? Once the application's static data is fully pre-cached, it goes into a stable mode. Static data should have minimum references to young objects. CMS used to be a reasonable alternative, but G1/Zing should perform even better in this situation.
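For concreteness, a huge-Eden CMS configuration along these lines might look like this (a sketch only; the sizes are placeholders and MyApp is a made-up name):

java -Xms16g -Xmx16g -Xmn12g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     MyApp

The large -Xmn gives short-lived objects room to die young, while the occupancy flags make CMS start its concurrent old-gen cycles at a predictable fill level instead of relying on heuristics.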
| Jamie
| Kent Beck famously said, "Make it work, make it right, make it fast." My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java. I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.

Completely agree.
Wherever I go, Gil has been there ;-)
On Wednesday, July 24, 2013 00:09:46 UTC+2, Gil Tene wrote:

| In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary-sized block going at a different and varying arbitrary speed.
| Jamie
| Kent Beck famously said, "Make it work, make it right, make it fast." My approach to building fast Scala applications is to first write it as correctly as possible, which is much easier in Scala than Java. I write less code which is easier to read and maintain, and with immutability and referential transparency I can be certain of who has the ability to change what values and when.
Imho, no. One has to have a clear idea, in advance, of how to optimize the system later on. Then apply the golden rule.
One thing that makes life difficult for functional programming on the JVM is that supporting pure functional languages is relatively new, meaning that a number of optimisations/features that are probably available in runtimes built specifically for functional languages (e.g. Erlang, Haskell) are not yet implemented on the JVM. Examples would include value types[1] and tail recursion[2].

But right now, to squeeze the most out of the JVM you'll probably need to code in Java (including Scala that looks like Java, but with less boilerplate) with mutability in the appropriate places. I'm fairly confident that an OO/imperative approach will always be faster on the JVM, but the gap will close, much the same way the gap between Java and C has closed, and the dominant factor when it comes to performance will become less about language choice or coding style.
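To make the tail-recursion point concrete, here is a hedged sketch in plain Java (TailCalls is a made-up name). The recursive form is the natural FP style, but without tail-call elimination every call burns a JVM stack frame, which is why compilers like scalac rewrite self-recursion (@tailrec) into the loop form:

public class TailCalls {
    // Tail-recursive in form, but HotSpot allocates a real stack frame per
    // call: the JVM performs no tail-call elimination.
    static long sumTo(long n, long acc) {
        if (n == 0) return acc;
        return sumTo(n - 1, acc + n);
    }

    // The loop a tail-call-optimising compiler would effectively generate
    // from the recursion above: constant stack, same result.
    static long sumToLoop(long n) {
        long acc = 0;
        while (n > 0) { acc += n; n--; }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sumToLoop(10_000_000));   // fine: constant stack
        try {
            System.out.println(sumTo(10_000_000, 0));
        } catch (StackOverflowError e) {
            System.out.println("blew the stack: ~10M frames on default stack sizes");
        }
    }
}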
| Wherever I go, Gil has been there ;-)

+1
On Wednesday, July 24, 2013 00:09:46 UTC+2, Gil Tene wrote:

| In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary-sized block going at a different and varying arbitrary speed.
The notion of the number of times around an arbitrary-sized block is a very, very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)
Continuous performance testing, profiling, and debugging should become common practice.
The most important value from Agile is having short feedback cycles to guide action and decision making. Without tight feedback cycles we can go really far in the wrong direction, and it can be costly to correct, and just plain unprofessional.
What I've found is that if you've made a fundamental mistake at the beginning of the process, it can be very, very difficult to back out of it.
That may work fine for you. I'd say it depends on what you're doing. Donald Knuth would say that premature optimization is the root of all evil.
Indeed, I ended up getting so fed up with people saying this that it formed a sort of mini-rant (apologies for the self-reference).
On the wider point, I always thought it a bit strange that performance and optimisation tooling has lagged behind other types of tooling in the recent past. I suppose Moore's law has squeezed that aspect out for a lot of programmers in a day-to-day sense. Although I can easily see this becoming far more of an issue in the lateral expansion into larger multi-cores, as the CPUs/compilers are not going to pick up the slack, and you cannot scale an application by just throwing more cores at it without a bit more thought as to design, profiling to see what interactions are actually happening, and optimisation.
Perhaps multi-core is the shift that will move this more into the mainstream, and to me it seems the reason for the pickup in groups like this is essentially this issue.
On Thursday, July 25, 2013 9:25:59 AM UTC+1, Kirk Pepperdine wrote:
On 2013-07-23, at 10:00 PM, jamie...@typesafe.com wrote:

| That may work fine for you. I'd say it depends on what you're doing. Donald Knuth would say that premature optimization is the root of all evil.

Again, Tony Hoare's infamous quote is being misused. Planning for performance is *NOT* a premature optimization. It's doing what is necessary to ensure that your application will meet its performance goals.

Regards,
Kirk
On Jul 24, 2013, at 11:59 PM, "Kirk Pepperdine" <ki...@kodewerk.com> wrote:
| Wherever I go, Gil has been there ;-)

+1
On Wednesday, July 24, 2013 00:09:46 UTC+2, Gil Tene wrote:

| In Zing, BTW, age is time, and not that silly notion of number of times around an arbitrary-sized block going at a different and varying arbitrary speed.
The notion of the number of times around an arbitrary-sized block is a very, very good diagnostic tool which cannot be replicated using wall clock time. So I'd say this alternate view of time isn't silly :-)
For diagnostics, wall clock time reporting would give you the same (or better) information.
But the silliness is not about reporting, it's about what happens to GC behavior under increased load. That silly behavior may then lead people to need to diagnose it ;-).
Here is a simple thought exercise: think about what happens to newgen efficiency and premature promotion when throughput grows in a multithreaded, multi-session system, if age is counted in number of newgen cycles instead of time...
- The purpose of keeping objects in newgen is to "give them enough time to die young" before promoting them if they don't. This is key to maintaining efficiency in all generational collectors, and losing this filter makes efficiency collapse.
- A very large newgen helps, but objects that are very young at collection time would be prematurely promoted regardless of the size of the newgen they come from (this is the main reason GCs that count age in cycles keep objects around for at least one additional cycle, even in very large newgens).
- On the other hand, promoting too late causes the oldgen to do unnecessary work, which hurts efficiency (and, if done in a stop-the-world pause, increases pause time significantly).
There are two negative effects to counting age in units that compress under load. The first is the unfortunate premature-promotion behavior as load increases. The other is the behavior during program phase changes. I'll explain the premature promotion downside here (the phase change one can go in another post if people are interested):
As load grows in an (e.g. app-server-style) environment, it usually grows in the form of additional concurrent work (as opposed to "harder" or longer individual operations). The natural length and object lifetime behavior of computations and sessions either stays the same or elongates. It virtually never gets shorter with higher load.
If age is counted in GC cycles, then the higher the load and the faster the allocation rate, the sooner (in wall clock time) objects will get promoted. Once the load gets high enough for objects that would have previously died in newgen to be promoted because they are "old enough in cycles", efficiency drops, oldgen rates grow, and the snowball starts rolling down the hill. As load grows, not only will allocation rate grow, but more premature promotion will occur as a percentage of allocation. That in turn means that not just GC work, but GC efficiency (measured as the % of overall CPU work spent on GC) gets worse with load.
In contrast, when age is counted as time, promotion decisions remain more "semi-constant under load" and are either right or wrong for a computation's object lifetime behavior regardless of load. This keeps GC efficiency closer to constant over a wider range of load compared to cycle-count based timing.
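A back-of-envelope sketch of that arithmetic (the eden size, tenuring threshold and allocation rates below are made-up numbers, purely for illustration):

// If age is counted in newgen cycles, wall-clock promotion age = threshold *
// (eden size / allocation rate), so it shrinks as load grows.
public class PromotionAge {
    public static void main(String[] args) {
        double edenBytes = 2e9;      // ~2 GB eden (assumed)
        int tenuringThreshold = 4;   // promote after 4 newgen cycles (assumed)

        for (double allocRate : new double[]{0.5e9, 1e9, 2e9}) {  // bytes/sec
            double secsPerCycle = edenBytes / allocRate;
            double promotionAge = tenuringThreshold * secsPerCycle;
            System.out.printf(
                "alloc %.1f GB/s -> newgen cycle every %.1f s, promotion after ~%.1f s%n",
                allocRate / 1e9, secsPerCycle, promotionAge);
        }
        // Doubling the allocation rate halves the wall-clock time an object
        // gets to die young: sessions whose objects survived 8 s at 1 GB/s are
        // promoted after 4 s at 2 GB/s - exactly the premature promotion
        // described above, while a time-based age stays constant under load.
    }
}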