Paper Title: Performance Debugging for Distributed Systems of Black Boxes
Authors: Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, Athicha Muthitacharoen.
Date: 2003
Novel Idea: The paper proposes isolating performance bottlenecks in a distributed system of black boxes by tracing system activity at the message level. The approach relies only on traces of the messages exchanged between nodes, and then uses offline algorithms to infer the causal path patterns and their latency statistics. The paper gives two such algorithms: the nesting algorithm and the convolution algorithm. The former uses timing information from RPC messages to infer inter-call causality. The latter converts traces into time signals and then uses signal-processing techniques to find the cross-correlations between those signals.
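To make the two ideas concrete for myself, here is a minimal sketch of both, not the paper's actual algorithms. The names (`Call`, `infer_parents`, `to_signal`, `cross_correlation`, `most_likely_delay`) and the trace format are my own assumptions. The nesting sketch simply attributes a call to the most recently started enclosing call, whereas the paper handles ambiguous parents more carefully. The convolution sketch computes the cross-correlation directly instead of using a fast transform, and only reports the single strongest lag.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Call:
    """One RPC seen in the trace (hypothetical format): call and return times."""
    src: str
    dst: str
    t_call: float
    t_return: float


def infer_parents(calls: List[Call]) -> Dict[int, Optional[int]]:
    """Nesting idea: a call issued by node X is attributed to a call received
    by X whose [t_call, t_return] interval encloses it."""
    parents: Dict[int, Optional[int]] = {}
    for i, child in enumerate(calls):
        candidates = [
            j for j, p in enumerate(calls)
            if j != i
            and p.dst == child.src
            and p.t_call <= child.t_call
            and child.t_return <= p.t_return
        ]
        # Simplifying assumption: pick the enclosing call that started last.
        parents[i] = max(candidates, key=lambda j: calls[j].t_call) if candidates else None
    return parents


def to_signal(timestamps: List[float], quantum: float, length: int) -> List[int]:
    """Convert message timestamps into a count-per-time-bin signal."""
    signal = [0] * length
    for t in timestamps:
        idx = int(t / quantum)
        if 0 <= idx < length:
            signal[idx] += 1
    return signal


def cross_correlation(s1: List[int], s2: List[int], max_lag: int) -> List[int]:
    """C[d] = sum over t of s1[t] * s2[t + d], for d = 0..max_lag."""
    n = min(len(s1), len(s2))
    return [sum(s1[t] * s2[t + d] for t in range(n - d)) for d in range(max_lag + 1)]


def most_likely_delay(in_times: List[float], out_times: List[float],
                      quantum: float = 1.0, max_lag: int = 20) -> Tuple[float, List[int]]:
    """Convolution idea: the lag with the highest correlation between messages
    entering a node and messages leaving it suggests a causal delay."""
    length = int(max(max(in_times), max(out_times)) / quantum) + 1
    corr = cross_correlation(to_signal(in_times, quantum, length),
                             to_signal(out_times, quantum, length), max_lag)
    best = max(range(len(corr)), key=corr.__getitem__)
    return best * quantum, corr


# Toy traces: two A->B calls, each enclosing one B->C call.
calls = [Call("A", "B", 0, 10), Call("B", "C", 2, 5),
         Call("A", "B", 20, 30), Call("B", "C", 22, 27)]
print(infer_parents(calls))

# Messages into B, and messages out of B about 3 time units later plus one stray.
in_times = [0.0, 10.0, 25.0, 40.0, 58.0]
out_times = [3.0, 13.0, 28.0, 43.0, 61.0, 17.0]
print(most_likely_delay(in_times, out_times)[0])
```

The nesting sketch needs paired call and return messages, which is why it only fits RPC-style traffic; the correlation sketch needs only timestamps per edge, which matches the review's description of the convolution algorithm working on time signals.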
Main Result & Impact: The paper presents an approach for locating performance bottlenecks in large distributed systems without manual instrumentation. It describes how to trace the system at the message level and gives two offline algorithms that infer causal path patterns from the message traces.
Evidence: The evaluation section of the paper demonstrates the practical feasibility of the tool and the accuracy of its tracing results. The paper discusses the accuracy of each algorithm as well as the execution cost of the tool.
Prior Work: The paper discusses work that uses similar approaches to address similar problems, such as Magpie, which also aims to analyze the performance of a distributed system; work that uses different methods to address similar problems, such as NetLogger and ETE; and work that uses similar methods to address different problems, such as Pinpoint, which aims to identify the component of a distributed system that most likely causes a fault.
Reproducibility: I don’t think we can reproduce the work in the paper.
Question & Criticism: none