Authors
"Aguilera, Marcos K., Mogul, Jeffrey C., Wiener, Janet L., Reynolds,
Patrick, and Muthitacharoen, Athicha"
Date
SOSP 2003
Novel Idea
This project captures RPC-style messages in a distributed application
and extracts causal relationships using two algorithms: the first is a
"nesting algorithm" which identifies nested pairs of requests/
responses, and the second treats the flow of messages as a time-
varying signal as uses cross-correlation to identify causal
relationships.
Main Results
The paper presents a sizable working system which implements these
techniques on the captured RPC messages. Psuedocode for the algorithms
are also presented. The paper discusses a complete effort to trace a
J2EE application, "PetStore," from the mechanics of doing the tracing,
to the graphical output.
Impact
This paper is cited by numerous tracing-framework papers in the last
five years.
Evidence
The authors experiment with their system by running it on synthetic
traces and a variety of pseudo-real world workloads.
Prior Work
Magpie, PinPoint, NetLogger, and various commercial products. Also,
general efforts in tracing such as gprof inspired their design.
Competitive Work
The authors indicate that MagPie is the closest competitive work,
although MagPie adds tags to the RPC messages, whereas this system
does sophisticated post-processing to organize the messages instead of
adding tags.
Reproducibility
I think this project is available as a (part of a) commercial
application from HP, so it might be possible to evaluate it.
Question/Criticism
Can anyone think of a better way to write this paper? I think it
suffered from poor organization and too much detail, although it is
tough to complain about overly-detailed papers from industrial labs.
The ideas are good, and I wish they were better presented.
Ideas for further work
None.
Novel Idea
The authors set out to create network traces based on statistical
inferencing from a network log, developing two algorithms to
accomplish this task. One using RPC messages and one using
signal-processing techniques (looks at aggregations of messages)
Main Result(s)
The authors create two polynomial time algorithms and test them
empirically on several different scenarios. The results show a fairly
high false positive rate for assigning calls (from nodes on a network)
to paths under many of the scenarios, but false negatives seem to be
better contained.
Impact
This work provides an interesting template for future work, one that I
think has given rise to numerous ideas for tracing programs and
network activity.
Evidence
The authors test their algorithms on several scenarios:
trace-generator, J2EE environment (from the "PetStore" application),
email received-header traces, and "several other traces."
Prior Work
Magpie is cited as the closest work to this paper, dealing with
performance analysis of distributed systems. However, Magpie tags
messages with an identifier which this work does not do.
Reproducibility
Yes, given access to the datasets used by the authors the algorithms
they present look implementable.
Question
What guarantees do the two algorithms give us outside of runtime guarantees?
Criticism
First, I think the results section is a bit unclear at times, and I
can't find a direct comparison of the nested and convolution
algorithms. (Perhaps I am just too tired to see it sitting in front of
my face, but I can't find it).
Also, I think one thing that is missing are any guarantees (or lack
there of) about the algorithms along with some intuition about why
these algorithms should work well.
Ideas for further work
More sophisticated ML techniques could probably be used to determine
good pairings of trace calls. (since it is an offline analysis anyway,
cpu time isn't a primary concern)