Re: Java Performance And Scalability A Quantitative Approach Pdf 12


Francis Caya

Jul 16, 2024, 3:25:26 PM
to sockgloucballro

This is the first book to take a quantitative approach to the subject of software performance and scalability. It brings together three unique perspectives to demonstrate how your products can be optimized and tuned for the best possible performance and scalability.

Software Performance and Scalability gives you a specialized skill set that will enable you to design and build performance into your products with immediate, measurable improvements. Complemented with real-world case studies, it is an indispensable resource for software developers, quality and performance assurance engineers, architects, and managers. It is an ideal text for university courses related to computer and software performance evaluation and can also be used to supplement a course in computer organization or in queuing theory for upper-division and graduate computer science students.

Our work quantitatively examines power, performance, and scaling during this period of disruptive software and hardware changes (2003–2011). Voluminous research explores performance analysis and a growing body of work explores power (see Section 6), but our work is the first to systematically measure the power, performance, and energy characteristics of software and hardware across a range of processors, technologies, and workloads.

We execute 61 diverse sequential and parallel benchmarks written in three native languages and one managed language, all widely used: C, C++, Fortran, and Java. We choose Java because it has mature virtual machine technology and substantial open source benchmarks. We choose eight representative Intel IA32 processors from five technology generations (130 nm to 32 nm). Each processor has an isolated processor power supply with stable voltage on the motherboard, to which we attach a Hall effect sensor that measures power supply current, and hence processor power. We calibrate and validate our sensor data. We find that power consumption varies widely among benchmarks. Furthermore, relative performance, power, and energy are not well predicted by core count, clock speed, or reported Thermal Design Power (TDP). TDP is the nominal amount of power the chip is designed to dissipate (i.e., without exceeding the maximum transistor junction temperature).
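The measurement setup above infers processor power from the current through the isolated processor supply. A minimal sketch of that conversion follows; the sensor gain and supply voltage here are illustrative placeholders, not the paper's calibration constants.

```python
# Sketch: converting a Hall-effect sensor reading into processor power.
# The gain and supply voltage are hypothetical example values; a real
# setup would calibrate and validate them, as the authors describe.

def sensor_to_power(sensor_volts, gain_amps_per_volt=10.0, supply_volts=12.0):
    """Estimate processor power (watts) from a Hall-effect sensor reading.

    The sensor outputs a voltage proportional to the current drawn
    through the isolated processor supply; multiplying the inferred
    current by the (stable) supply voltage yields power.
    """
    current_amps = sensor_volts * gain_amps_per_volt
    return current_amps * supply_volts

# With these example constants, a 0.5 V reading implies 5 A, i.e. 60 W.
print(sensor_to_power(0.5))  # 60.0
```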

Using controlled hardware configurations, we explore the energy impact of hardware features and workload. We perform historical and Pareto analyses that identify the most power- and performance-efficient designs in our architecture configuration space. We make all of our data publicly available in the ACM Digital Library as a companion to our original ASPLOS 2011 paper. Our data quantifies a large number of workload and hardware trends with precision and depth, some known and many previously unreported. This paper highlights eight findings, which we list in Figure 1. Two themes emerge from our analysis: workload and architecture.

Workload. The power, performance, and energy trends of native workloads substantially differ from managed and parallel native workloads. For example, (a) the SPEC CPU2006 native benchmarks draw significantly less power than parallel benchmarks and (b) managed runtimes exploit parallelism even when executing single-threaded applications. The results recommend that systems researchers include managed and native, sequential and parallel workloads when designing and evaluating energy-efficient systems.

Architecture. Hardware features such as clock scaling, gross microarchitecture, simultaneous multithreading, and chip multiprocessors each elicit a huge variety of power, performance, and energy responses. This variety and the difficulty of obtaining power measurements recommend exposing on-chip power meters and, when possible, power meters for individual structures, such as cores and caches. Modern processors include power management techniques that monitor power sensors to minimize power usage and boost performance. However, only in 2011 (after our original paper) did Intel first expose energy counters, in their production Sandy Bridge processors. Just as hardware event counters provide a quantitative grounding for performance innovations, future architectures should include power and/or energy meters to drive innovation in the power-constrained computer systems era.

We systematically explore workload selection and show that it is a critical component for analyzing power and performance. Native and managed applications embody different trade-offs between performance, reliability, portability, and deployment. It is impossible to meaningfully separate language from workload and we offer no commentary on the virtue of language choice. We create four workloads from 61 benchmarks.

We execute the Java benchmarks on the Oracle HotSpot 1.6.0 virtual machine because it is a mature, high-performance virtual machine. The virtual machine dynamically optimizes each benchmark on each architecture. We use best practices for virtual machine measurement of steady-state performance. We compile the native non-scalable workload with icc at -O3. We use gcc at -O3 for the native scalable workload because icc did not correctly compile all benchmarks. The icc compiler generates better-performing code than gcc. We execute the same native binaries on all machines. All the parallel native benchmarks scale up to eight hardware contexts. The Java scalable workload is the subset of Java benchmarks that scale well.
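The steady-state practice mentioned above amounts to running untimed warm-up iterations, so that dynamic optimization settles, before collecting timed samples. A generic sketch of that harness (in Python rather than on the JVM, with illustrative iteration counts that are not the paper's methodology parameters):

```python
import time

def measure_steady_state(benchmark, warmup=10, timed=5):
    """Run `benchmark` repeatedly, discarding warm-up iterations so that
    a JIT compiler (or caches) reach steady state, then return the mean
    of the timed iterations. Iteration counts here are illustrative."""
    for _ in range(warmup):
        benchmark()          # untimed: lets dynamic optimization settle
    samples = []
    for _ in range(timed):
        start = time.perf_counter()
        benchmark()
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)
```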

To explore the influence of architectural features, we selectively down-clock the processors, disable cores on these chip multiprocessors (CMP), disable simultaneous multithreading (SMT), and disable Turbo Boost using BIOS configuration.

We execute each benchmark multiple times on every architecture, log its power values, and then compute average power consumption. The aggregate 95% confidence intervals of execution time and power range from 0.7% to 4%. The measurement error in time and power for all processors and benchmarks is low. We compute arithmetic means over the four workloads, weighting each workload equally. To avoid biasing performance measurements to any one architecture, we compute a reference performance for each benchmark by averaging the execution time on four architectures: Pentium 4 (130), Core 2D (65), Atom (45), and i5 (32). These choices capture four microarchitectures and four technology generations. We also normalize energy to a reference, since energy = power × time. The reference energy is the average benchmark power on the four processors multiplied by their average execution time.
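The normalization just described can be sketched directly from its definitions; the numbers below are illustrative, not measurements from the paper:

```python
def reference_time(times_on_refs):
    """Reference execution time: arithmetic mean over the four
    reference machines (Pentium 4, Core 2D, Atom, i5)."""
    return sum(times_on_refs) / len(times_on_refs)

def relative_performance(measured_time, times_on_refs):
    """Higher is better: speedup over the reference average."""
    return reference_time(times_on_refs) / measured_time

def reference_energy(powers_on_refs, times_on_refs):
    """Energy = power x time, so the reference energy is the average
    reference power multiplied by the average reference time."""
    avg_power = sum(powers_on_refs) / len(powers_on_refs)
    return avg_power * reference_time(times_on_refs)

# Hypothetical example numbers:
refs_t = [40.0, 20.0, 60.0, 10.0]   # seconds on the four references
refs_p = [60.0, 30.0, 5.0, 45.0]    # watts on the four references
print(relative_performance(13.0, refs_t))  # 32.5 / 13.0 = 2.5
print(reference_energy(refs_p, refs_t))    # 35.0 W * 32.5 s = 1137.5 J
```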

We measure the 45 processor configurations (8 stock and 37 BIOS configurations) and produce power and performance data for each benchmark and processor. Figure 2 shows an example of this data, plotting the power versus performance characteristics for one of the 45 processor configurations, the stock i7 (45).

We organize our analysis into eight findings, as summarized in Figure 1. The original paper contains additional analyses and findings. We begin with broad trends. We show that applications exhibit a large range of power and performance characteristics that are not well summarized by a single number. This section conducts a Pareto energy efficiency analysis for all of the 45 nm processor configurations. Even with this modest exploration of architectural features, the results indicate that each workload prefers a different hardware configuration for energy efficiency.
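A Pareto energy-efficiency analysis of this kind keeps only the configurations that no other configuration beats on both axes at once. A minimal sketch, treating each configuration as a (performance, power) point where higher performance and lower power are better:

```python
def pareto_frontier(points):
    """Return the (performance, power) points not dominated by any
    other point. A point is dominated if some other point has at
    least its performance at no more than its power, with at least
    one of the two strictly better."""
    frontier = []
    for perf, power in points:
        dominated = any(
            (p2 >= perf and w2 <= power) and (p2 > perf or w2 < power)
            for p2, w2 in points
        )
        if not dominated:
            frontier.append((perf, power))
    return frontier

# Hypothetical configurations, not the paper's data:
configs = [(1.0, 30.0), (1.5, 60.0), (1.4, 80.0), (0.8, 25.0)]
print(pareto_frontier(configs))  # (1.4, 80.0) is dominated by (1.5, 60.0)
```

This quadratic scan is fine for a few dozen configurations; a sort-based sweep would be preferable at larger scale.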

Figure 2 plots power versus relative performance for each benchmark on the i7 (45), which has eight hardware contexts and is the most recent of the 45 nm processors. Native (red) and managed (green) are differentiated by color, whereas scalable (triangle) and non-scalable (circle) are differentiated by shape. Unsurprisingly, the scalable benchmarks (triangles) tend to perform the best and consume the most power. More unexpected is the range of power and performance characteristics of the non-scalable benchmarks. Power is not strongly correlated with performance across workloads or benchmarks; if the correlation were strong, the points would form a straight line. For example, the point at the bottom right of the figure achieves nearly the best relative performance at nearly the lowest power.

Figure 4(a) plots the average power and performance for each processor in their stock configuration relative to the reference performance, using a log/log scale. For example, the i7 (45) points are the average of the workloads derived from the points in Figure 2. Both graphs use the same color for all of the experimental processors in the same family. The shapes encode release age: a square is the oldest, the diamond is next, and the triangle is the youngest, smallest technology in the family.

Contemporaneous comparisons also reveal the tension between power and performance. For example, the contrast between the Core 2D (45) and i7 (45) shows that the i7 (45) delivers 75% more performance than the Core 2D (45), but this performance is very costly in power, with an increase of nearly 100%. These processors thus span a wide range of energy trade-offs within and across the generations. Overall, these results indicate that optimizing for both power and performance is proving a lot more challenging than optimizing for performance alone.
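The i7 (45) versus Core 2D (45) comparison above can be restated as energy per unit of work: since energy per task is power divided by performance, roughly 1.75× the performance at roughly 2× the power means about 14% more energy per task. A small worked sketch:

```python
def energy_ratio(perf_ratio, power_ratio):
    """Energy per task = power * time = power / performance, so the
    relative energy cost of a faster-but-hungrier chip is simply
    power_ratio / perf_ratio."""
    return power_ratio / perf_ratio

# i7 (45) vs Core 2D (45): ~75% more performance at ~100% more power.
print(round(energy_ratio(1.75, 2.0), 2))  # 1.14, i.e. ~14% more energy per task
```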

Figure 4(b) explores the effect of transistors on power and performance by dividing both by the number of transistors in the package for each processor. We include all transistors because our power measurements occur at the level of the package, not the die. This measure is rough and will downplay results for the i5 (32) and Atom D (45), each of which has a Graphics Processing Unit (GPU) in its package. Even though the benchmarks do not exercise the GPUs, we cannot discount them because the GPU transistor counts on the Atom D (45) are undocumented. Note the similarity between the Atom (45), Atom D (45), Core 2D (45), and i5 (32), which at the bottom right of the graph are the most efficient processors by the transistor metric. Even though the i5 (32) and Core 2D (45) have five to eight times more transistors than the Atom (45), they all eke out very similar performance and power per transistor. There are likely bigger differences to be found in power efficiency per transistor between chips from different manufacturers.
