Message: Adding in Erik's description of the workload generator.
Commit: fc3eccbef55a1671325a4c2828e6168f467a6703
Date: Tue Apr 12 00:16:35 2011 -0400
Author: Wolfgang Richter <wo...@cs.cmu.edu>
The workload generator is a tool for building distributions of MapReduce
workloads. The generator allows us to abstract experimentation away from
particular example workloads and focus on higher-level characteristics.

This tool not only generates the initial data for the workload, but also builds
synthetic PAO functions that map primary key-value pairs to intermediate
key-value pairs (at some potentially random ‘dummy’ computational cost).
Additionally, the user can simulate failures by causing PAO operations to
randomly terminate without warning or loop infinitely. These three elements
(initial data, PAO functions, and injected failures) together describe
everything that a MapReduce job needs to do; the only remaining factors that
can affect a MapReduce run are the job scheduling and management.
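
As a rough illustration of the synthetic PAO functions and failure injection
described above, here is a minimal Python sketch. It is not the generator's
actual code: the names make_synthetic_pao, SimulatedCrash, fanout, max_cost,
crash_prob, and hang_prob are all hypothetical, and the shape of the
intermediate keys is an assumption made only for the example.

```python
import random


class SimulatedCrash(Exception):
    """Stands in for a PAO operation that terminates without warning."""


def make_synthetic_pao(rng, fanout=2, max_cost=1000,
                       crash_prob=0.0, hang_prob=0.0):
    """Build a synthetic PAO function (all parameters are hypothetical)."""

    def pao(key, value):
        roll = rng.random()
        if roll < crash_prob:
            # Failure mode 1: terminate without warning.
            raise SimulatedCrash(key)
        if roll < crash_prob + hang_prob:
            # Failure mode 2: loop infinitely (this call never returns).
            while True:
                pass
        # 'Dummy' computational cost: burn a random number of iterations.
        sink = 0
        for i in range(rng.randint(0, max_cost)):
            sink += i
        # Map one primary pair to `fanout` intermediate key-value pairs.
        return [((key, i), value) for i in range(fanout)]

    return pao
```

Passing in a seeded random.Random instance keeps a generated workload
reproducible, which matters when comparing scheduling or management policies
against the same synthetic job.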

This synthetic workload approach is not a substitute for investigating
performance on real-world jobs, but it is complementary. For example, after
observing several MapReduce workloads that seem to have a similar
distribution of keys and values, we may be interested in learning this
distribution and generating additional samples of this workload type. Or, as an
alternative example, we might be interested in simulating different PAO
aggregation schemes operating on a variety of intermediate key-value
pair distributions.
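
One simple way to picture "learning a distribution and generating additional
samples" is to fit an empirical key-frequency table and draw new keys from it.
The sketch below assumes exactly that; it is only one possible approach, and
the function names learn_key_distribution and sample_keys are hypothetical
rather than part of the generator.

```python
import random
from collections import Counter


def learn_key_distribution(observed_keys):
    """Estimate the relative frequency of each key seen in a workload."""
    counts = Counter(observed_keys)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def sample_keys(distribution, n, rng):
    """Draw n new keys according to the learned frequencies."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=n)
```

A richer model (for example, one that also captures value sizes or the
correlation between keys and values) would follow the same two-step pattern:
fit on observed workloads, then sample.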

We allow the user to describe a workload distribution in template PAO function
files and XML configuration files. This scheme is flexible and is intended to
make the generator easy to extend and easy to use.
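
The text above does not specify the configuration schema, so the following
Python sketch is purely illustrative of how an XML workload description might
be read. Every element name, attribute name, and value in EXAMPLE_CONFIG
(workload, keys, pao, failures, sum.template, and so on) is an assumption made
for the example, as is the function load_workload_config.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: all element and attribute names are assumptions.
EXAMPLE_CONFIG = """
<workload>
  <keys distribution="uniform" count="1000"/>
  <pao template="sum.template" max_cost="500"/>
  <failures crash_prob="0.01" hang_prob="0.001"/>
</workload>
"""


def load_workload_config(xml_text):
    """Parse a workload description into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    keys = root.find("keys")
    pao = root.find("pao")
    failures = root.find("failures")
    return {
        "key_distribution": keys.get("distribution"),
        "key_count": int(keys.get("count")),
        "pao_template": pao.get("template"),
        "max_cost": int(pao.get("max_cost")),
        "crash_prob": float(failures.get("crash_prob")),
        "hang_prob": float(failures.get("hang_prob")),
    }
```

Keeping the distribution parameters in a configuration file and the PAO logic
in a separate template file means that either can be varied without touching
the other.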