Performance Debugging for Distributed Systems of Black Boxes
Author(s)
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick
Reynolds, Athicha Muthitacharoen
Date
SOSP'03 - Symposium on Operating Systems Principles, 2003
Novel Idea
Passively obtaining system-level message traces and, without
reasoning about message or system semantics, inferring causal paths
and the main node/path latency bottlenecks.
Main Result(s)
The paper presents two algorithms for inferring node/path bottlenecks
from message traces. The nesting algorithm works by scoring candidate
causal parents for pairs of related messages, while the convolution
algorithm correlates inter-node messages in a whole-trace analysis.
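As a rough illustration of the nesting idea (not the authors' actual scoring, which also involves penalties and shared credit among competing candidate parents), one could count how often call pairs of one type nest completely inside pairs of another type; the record format here is a hypothetical simplification:

```python
def nesting_scores(pairs):
    """Score candidate parent/child relationships between call-pair
    types by counting complete nestings.

    `pairs` is a list of (call_ts, return_ts, src, dst) tuples, i.e.
    already-matched RPC call/return pairs from the trace.
    """
    scores = {}  # ((outer src, dst), (inner src, dst)) -> nesting count
    for o_call, o_ret, o_src, o_dst in pairs:
        for i_call, i_ret, i_src, i_dst in pairs:
            # the inner call must be issued by the outer call's callee
            # and fall strictly within the outer call's interval
            if i_src == o_dst and o_call < i_call and i_ret < o_ret:
                key = ((o_src, o_dst), (i_src, i_dst))
                scores[key] = scores.get(key, 0) + 1
    return scores
```

For example, if A's call to B always encloses B's call to C, the (A→B, B→C) count grows with the trace, hinting that the A→B call causally precedes the B→C call.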
Impact
The techniques are clearly useful when absolutely no knowledge about
system behavior is available. In this worst-case scenario, inferring
causal correlations between messages may be the only tool available
for latency diagnostics.
Evidence
The authors describe the nesting algorithm in appropriate detail,
including the score attribution and penalties involved. They briefly
describe the "discrete convolution" of two functions (the s_1
functions) at the core of the convolution algorithm, appropriately
contextualized within the complete algorithm.
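A toy version of that correlation step might look like the following (binned counting plus brute-force cross-correlation; the paper's actual algorithm operates on indicator functions of the trace and includes refinements not shown here, and all names and parameters below are illustrative):

```python
def estimate_delay(recv_times, send_times, bin_width, max_delay):
    """Cross-correlate binned timestamps of messages arriving at a
    node with timestamps of messages it sends out; a spike at lag d
    suggests the node typically forwards work about d time units
    after receiving it.
    """
    horizon = int(max(recv_times + send_times) / bin_width) + 1
    lags = int(max_delay / bin_width) + 1
    s1 = [0] * (horizon + lags)
    s2 = [0] * (horizon + lags)
    for t in recv_times:
        s1[int(t / bin_width)] += 1
    for t in send_times:
        s2[int(t / bin_width)] += 1
    # corr[d] = sum_t s1[t] * s2[t + d], for lags d = 0 .. max_delay
    corr = [sum(s1[t] * s2[t + d] for t in range(horizon))
            for d in range(lags)]
    return corr.index(max(corr)) * bin_width  # lag of the tallest spike
```

With receives at t = 0, 10, 20 and sends at t = 2, 12, 22, the correlation peaks at a lag of 2.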
They also describe some practical issues involved in collecting trace
data; finally, they describe some experiments. In the experimental
section, they evaluate (1) artificial traces generated by a tool of
their own design; (2) J2EE traces produced under a load generator
(unspecified); and (3) an email-header trace (artificially built from
one user's messages).
Prior Work
They mention Pinpoint, which focuses on faults but is also based on
message tracing; they also cite techniques based on perturbing system
components (one by Brown et al. and another by Bagchi et al.).
Other statistical techniques for establishing correlation are also
mentioned (one by Zhang and Paxson).
Competitive work
They mention Magpie, which treats components as black boxes but,
according to the authors, has a "more sophisticated tracing
infrastructure" and "concentrates more [...] on rare anomalies". A
system by Hrischuk et al. is also mentioned, as are commercial
products such as AppAssure, PerformaSure, and OptiBench.
The paper cites NetLogger and also ETE. These solutions, however, are
not black-box techniques (they require changes to the application).
Reproducibility
It is difficult to evaluate their results. They acknowledge that they
did not have access to "full-scale, real-world applications", and they
mention that testing the tool on such applications is important future
work. Additionally, the experimental setups are not completely
specified: the tracelet parameters of the artificial traces, the J2EE
load generator, and the details of the email-header test setting are
all left unclear.
Questions + Criticism
[Criticism] The authors acknowledge that they take a "radical"
approach to analyzing black-box components (Sec 3.1). However,
although such an approach is certainly an option for worst-case
situations, where absolutely no knowledge of the distributed system is
assumed (by the analyzing tool), there are other "external" factors
that could be considered, perhaps without losing the "black-box"
classification: the network's message-loss ratio, link-contention
problems (which could be measured on the fly), and some "known"
causality information (perhaps the users who set up the system know
that certain communication patterns are actually impossible).
[Question] Why not use this external network information, and any
"assumed known" correlation characteristics, as input hints to the
algorithms?
[Criticism] They try to _infer_ network metrics _using_ message
tracing, but perhaps _using_ network metrics (possibly measured on the
fly) could yield better message correlation (for instance, by
adjusting the delay tolerance of some parent-attribution scores) and a
better analysis (perhaps information like "the high latency of link X
is responsible for numerous retransmissions of messages from component
A to component B").
In any case, their results for "pathological cases", high parallelism,
message loss, or high variance in delay might improve if their
algorithms "stretched" their scoring/tolerance parameters in response
to network metrics. Whether that would indeed be the case, and by how
much, is another [Question].
[Question] Extending this argument to a wider view: how does this
approach compare to the tactic of promoting application changes that
might improve overall system instrumentation (specific log formats,
etc.)? Do such application-level changes make a big difference in
_latency analysis_?
Ideas for further work
One future improvement, mentioned by the authors, is providing a
suitable visualization mechanism. Additionally, modifying the
algorithms to report several alternative causal paths for the message
correlations, each with an associated global score, might be useful,
especially when "pathological cases", high parallelism, message loss,
or high variance in delay are present. Another possible improvement
was mentioned in the "Questions + Criticism" section: using network
metrics to improve message correlation instead of only doing the
opposite.
Paper Title |
Performance Debugging for Distributed Systems of Black Boxes
|
Author(s) |
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, Athicha Muthitacharoen |
Date | 2003 |
Novel Idea | 'Solution vendors' need to be able to identify component interactions that are causing delays, but the interacting components may come from different vendors who are unwilling or unable to provide support for the interaction between them. Many solution vendors also do not have the resources (source code, expertise, money) to do in-house modification of system components for diagnostic purposes. Therefore, the authors set out to create a diagnostic system which treats components as black boxes and requires minimal system knowledge to provide useful performance data. |
Main Result(s) |
Their goal is to listen to a system's TCP traffic for a while, then assemble likely call paths and profile those paths. They describe two main algorithms for assembling call paths. The first is specific to RPC-based software, and attempts to identify 'call pairs' (an RPC call from A to B, and the subsequent return call from B to A) simply by matching a call with the earliest unmatched return call [see 'Criticism' below]. Then they try to identify causal links by looking at call pairs that are completely nested inside others, and scoring them by how often call pairs similar to the inner one occur completely inside call pairs similar to the outer one. The second is more general. It uses a signal-processing strategy: starting from a node i that sends a message to node j, they assume that any subsequent messages from j to [k, l...] within d units of time from receipt of i's message are caused by that message, and so forth. It scans the entire trace this way, and builds up a set of possible causal links. Then, for pairs of connected nodes, it computes the convolutions of pairs of functions based on a fixed time epsilon, such that the convolution shows a spike if the second node often sends a message within d units of receiving one from the first. |
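The 'earliest unmatched return' rule can be sketched in a few lines (using a hypothetical pre-parsed record format; the paper actually reconstructs such records from passively captured traces):

```python
from collections import deque

def match_call_pairs(trace):
    """Pair each RPC call with the earliest unmatched return going the
    other way. `trace` is a time-ordered list of
    (timestamp, src, dst, kind) tuples, kind being "call" or "return".
    """
    pending = {}  # (caller, callee) -> deque of unmatched call timestamps
    pairs = []    # (call_ts, return_ts, caller, callee)
    for ts, src, dst, kind in trace:
        if kind == "call":
            pending.setdefault((src, dst), deque()).append(ts)
        else:
            # a return from src to dst answers a call from dst to src
            calls = pending.get((dst, src))
            if calls:
                pairs.append((calls.popleft(), ts, dst, src))
    return pairs
```

Note how a single lost return would shift every subsequent match for that node pair, which is exactly the fragility criticized below.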
Impact | This paper has been cited by 349 others, including a few from 2010. Unfortunately, they are behind paywalls, and I'm not on campus, and I'm terminally lazy so I don't feel like X-forwarding a browser to read them. One of them, "Ganesha: blackBox diagnosis of MapReduce systems", also cites a paper by one R. Fonseca, though, so it must be of extremely high quality. |
Evidence | They made up a fake trace and then ran their tool on it. They added some random delay, but don't seem to have dropped any packets or tried particularly hard to interleave things confusingly. I think they also don't use the patterns they specifically mentioned they would have a hard time with. Unsurprisingly, they do OK. |
Prior Work | They mention Magpie as the most similar existing project - it does basically the same thing, but requires more intrusive modification of system messages, and 'concentrates on detecting relatively rare anomalies', which is very vague. |
Reproducibility | With lines like "Further improvements to the convolution algorithm remove “noise” (low-frequency paths), suppress the detection of accidentally short paths and of cycles, and suppress edges with negative apparent delay. Space limitations prevent us from describing them here", I don't think so. |
Question | Is the nesting-based approach really viable? It seems insanely fragile compared to the signal processing algorithm, and its drawbacks seem much more severe. The only real flaw with the signal processing algo is that it sometimes reports extra, uninteresting paths that are probably trivial to filter out. By contrast, the nesting version can't handle a large number of common call graphs, gets much more confused by a noisy trace, and is specific to RPC-based software. They seem to be trying to present them on a level playing field, and they say that they think they can fix a lot of the nesting algo's problems, but I don't really buy it. I don't see a great reason for them not to just scrap the nesting one and concentrate on making the signal processing one better. |
Criticism |
"noise in the trace—extraneous and dropped messages—does not impact the rest of the algorithm." This is disingenuous. They don't show up in its input, but of course the absences they cause do. Much worse though is their next statement about ignoring the problems with their assumption that no call is faster than its predecessors. They mention that this means they "cannot handle extraneous or dropped messages," but that seems to understate the problem - it seems that one single dropped message could in this case cause /every/ future message between those two nodes to be misclassified. In general, this paper seems less professional than others we have read. "Figure 11: Expanded multi-tier results from the nesting algorithm (you are not expected to be able to read this)" Really? Why is it here then? |
Ideas for further work | This paper contains the phrase "true false positives". I believe a promising direction of future research would be to attempt to publish papers with increasingly long strings of simple opposites. At first we can simply build off the excellent work pioneered here, perhaps extending to "true false positive negatives," or simply "true false true false." However, I think there is much more potential here than meets the eye, and that future researchers should not let the artificial limits of the past constrain them. One day it may be common for computer science research to be peppered with sentences containing 4, 5, even 7 consecutive pairs of contradictory terms! |
Authors: Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen
Date: SOSP 2003
Novel idea: The authors present a performance debugging system for distributed systems with black-box components.
Main results: The authors present two algorithms which infer causal path patterns and latencies from low-level message traces. The "nesting algorithm" requires RPC-style communication to make its inferences, while the "convolution algorithm" uses signal-processing techniques. Visualizations are available for both methods.
Impact: It goes without saying that better performance debugging tools will enable programmers to more easily eliminate performance bottlenecks.
Evidence: To test the nesting algorithm, the authors traced inter-EJB calls in the PetStore J2EE application. To test the convolution algorithm, the authors used headers from an individual's email. The authors validate the correctness of their algorithms with case-specific metrics.
Prior work: Previous systems, such as NetLogger, required programmers to modify their applications and insert logging statements.
Competitive work: In comparison to Magpie, the authors have a less sophisticated tracing architecture and a more sophisticated post-processing architecture.
Reproducibility: The algorithms are described well, and pseudocode is provided for both nesting and convolution, so these results are fairly reproducible.