Our application is very performance-sensitive. We have been trying to identify where it spends its execution time so that we can focus on optimizing the slow parts. We have been using hprof for low-level profiling, but we would also like to capture high-level metrics by recording the total execution time of each custom operation in our workflows.
The numbers don't seem to add up, though. Is what we're trying to do impossible, or are we implementing it wrong?
public void operate(FlowProcess flowProcess, FunctionCall<CoresWrapper> functionCall) {
    long startTime = System.nanoTime();
    // do heavy lifting
    long endTime = System.nanoTime();
    incrementExecutionCounter(flowProcess, endTime - startTime);
}
And these utility methods:
protected final void incrementExecutionCounter(FlowProcess flowProcess, long value) {
    incrementCounter(flowProcess, getOperationName(), PlatformConstants.ACC_COUNTER_EXEUCTION_TIME,
        Utils.toMilliSeconds(value));
}

protected final void incrementCounter(FlowProcess flowProcess, String group, String counter, long value) {
    if (flowProcess instanceof HadoopFlowProcess) {
        ((HadoopFlowProcess) flowProcess).getReporter().getCounter(group, counter).increment(value);
    }
}
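For completeness, Utils.toMilliSeconds is assumed here to be a plain nanosecond-to-millisecond conversion along these lines (a hypothetical sketch; our real helper may differ):

```java
import java.util.concurrent.TimeUnit;

public class Utils {
    // Hypothetical sketch of the toMilliSeconds helper used above:
    // converts a nanosecond delta from System.nanoTime() to milliseconds,
    // truncating any fractional millisecond.
    static long toMilliSeconds(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    public static void main(String[] args) {
        // 2,500,000 ns = 2.5 ms, which truncates to 2
        System.out.println(toMilliSeconds(2_500_000L));
    }
}
```

Note that truncation like this can only undercount per call, so it would not explain totals that come out too large.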
What we are seeing is that the total times for our two operations in the reduce phase add up to more time than the reduce tasks actually ran. As an example:
Wall clock time: 2.15 hours
Map Tasks: 606
Reduce Tasks: 207
Avg Map Time: 117 seconds
Avg Reduce Time: 945 seconds
CPU_MILLISECONDS: 20.18891 days (For Reduce only)
VCORES_MILLIS_REDUCES: 16.5976 days
Custom Operation 1: 13.34499 days
Custom Operation 2: 13.19409 days
I would have expected Custom Operation 1 + Custom Operation 2 + shuffle + sort to roughly equal one of the total reduce-time metrics, but that isn't the case; something appears to go badly wrong when these numbers are aggregated. These aren't fast operations, either: a single pass through the entire reduce phase averages around 0.5-1 second per tuple.
Are we doing something wrong? Or is it possible that some multi-threading is happening in the reduce tasks?
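To illustrate the multi-threading suspicion: if the framework ever invokes operations concurrently within a single task, then summing per-call wall-clock deltas will legitimately exceed the task's elapsed time. A self-contained sketch of the effect (names hypothetical, not Cascading API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentTimingDemo {
    public static void main(String[] args) throws Exception {
        AtomicLong summedNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        long wallStart = System.nanoTime();
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // Each task measures its own wall-clock duration,
                // just like the operate() method above.
                long start = System.nanoTime();
                try {
                    Thread.sleep(200);
                } catch (InterruptedException ignored) {
                }
                summedNanos.addAndGet(System.nanoTime() - start);
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        long wallElapsed = System.nanoTime() - wallStart;
        // With 4 tasks running concurrently, the summed per-task time
        // (~800 ms) exceeds the wall-clock elapsed time (~200 ms).
        System.out.println(summedNanos.get() > wallElapsed);
    }
}
```

If the reduce-side operations are being driven by more than one thread, counters summed this way would show exactly the kind of inflation we're seeing.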
We are running Cascading 2.6.1 on MapR 4.0.1 (planning to do upgrades very soon)