11/5 Reviews - Project5


Andrew Ferguson

Nov 4, 2009, 11:58:30 PM
to brown-cs...@googlegroups.com
Paper Title
"Performance debugging for distributed systems of black boxes"

Authors
"Aguilera, Marcos K., Mogul, Jeffrey C., Wiener, Janet L., Reynolds,
Patrick, and Muthitacharoen, Athicha"

Date
SOSP 2003

Novel Idea
This project captures RPC-style messages in a distributed application
and extracts causal relationships using two algorithms: the first is a
"nesting algorithm" that identifies nested pairs of requests and
responses, and the second treats the flow of messages as a time-varying
signal and uses cross-correlation to identify causal relationships.
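
To check my own understanding of the second algorithm, here is a rough
sketch of the cross-correlation idea (my own toy version, not the
paper's pseudocode); the node names, bin width, and peak-picking rule
are all invented for illustration:

```python
# Toy sketch of the convolution algorithm's core idea: treat the messages on
# each edge as a binned, time-varying indicator signal, then cross-correlate
# two signals to find the lag at which traffic on one edge best "explains"
# traffic on the next. Bin width and trace values are made up.
import numpy as np

BIN_WIDTH = 0.01  # seconds per bin (assumed)

def indicator_signal(timestamps, t_end):
    """Bin message timestamps into a spike-count signal."""
    signal = np.zeros(int(round(t_end / BIN_WIDTH)) + 1)
    for t in timestamps:
        signal[int(round(t / BIN_WIDTH))] += 1
    return signal

def likely_causal_lag(ts_in, ts_out, t_end):
    """Return the lag (seconds) at which the B->C signal best follows A->B."""
    s_in = indicator_signal(ts_in, t_end)
    s_out = indicator_signal(ts_out, t_end)
    corr = np.correlate(s_out, s_in, mode="full")
    lags = np.arange(-len(s_in) + 1, len(s_out))
    positive = lags > 0                      # only consider "effect after cause"
    best = lags[positive][np.argmax(corr[positive])]
    return best * BIN_WIDTH

# toy trace: every A->B message is followed ~50 ms later by a B->C message
a_to_b = [0.10, 0.35, 0.70, 1.20]
b_to_c = [t + 0.05 for t in a_to_b]
print(likely_causal_lag(a_to_b, b_to_c, t_end=2.0))  # ~0.05
```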

Main Results
The paper presents a sizable working system that implements these
techniques on the captured RPC messages. Pseudocode for both algorithms
is also presented. The paper describes a complete effort to trace a
J2EE application, "PetStore," from the mechanics of performing the
tracing to the graphical output.

Impact
This paper has been cited by numerous tracing-framework papers over the
last five years.

Evidence
The authors experiment with their system by running it on synthetic
traces and a variety of pseudo-real-world workloads.

Prior Work
Magpie, PinPoint, NetLogger, and various commercial products. Also,
general efforts in tracing such as gprof inspired their design.

Competitive Work
The authors indicate that Magpie is the closest competitive work,
although Magpie adds tags to the RPC messages, whereas this system
instead does sophisticated post-processing to organize the untagged
messages.

Reproducibility
I think this project is available as (part of) a commercial
application from HP, so it might be possible to evaluate it.

Question/Criticism
Can anyone think of a better way to write this paper? I think it
suffered from poor organization and too much detail, although it is
tough to complain about overly-detailed papers from industrial labs.
The ideas are good, and I wish they were better presented.

Ideas for further work
None.

Steve Gomez

Nov 5, 2009, 12:15:19 AM
to CSCI2950-u Fall 09 - Brown
Author: Marcos K. Aguilera, et al.
Paper Title: "Performance Debugging for Distributed Systems of Black
Boxes"
Date: In SOSP '03

The paper addresses the problem that debugging system behavior is
difficult in distributed systems with "black-box" components. The
novel solution is to infer causal path patterns probabilistically,
given trace data, as a way to isolate bottlenecks in performance. The
main contributions of the paper are two algorithms - "nesting" and
"convolution" - that work to infer these paths.

In nesting, the algorithm "uses timing information from RPC messages
to infer inter-call causality." This relies on RPC-style nesting, where
communication between nodes flows like a stack of requests (pushes) and
returns (pops) before a result is returned to the original requesting
node. The other algorithm, convolution, handles "free-form message-based
communication, and uses signal-processing techniques to extract causal
information from traces". It is called 'convolution' because the
algorithm computes the cross-correlation of message-arrival indicator
functions to find which messages (requests and responses) match up. I
thought both of these algorithms were clever (especially the signal-
processing approach, which seems slightly more creative).
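
Here is a toy version of the nesting intuition as I understand it (my
own sketch, not the authors' pseudocode; their real algorithm resolves
ambiguous parents probabilistically rather than taking every containing
interval):

```python
# Toy sketch of the nesting idea: pair each call with its matching return,
# then treat call B->C as caused by call A->B if it happens entirely inside
# A->B's call/return window on node B. Trace values are made up.
from collections import namedtuple

Msg = namedtuple("Msg", "timestamp sender receiver kind")  # kind: 'call' | 'return'

def pair_calls(trace):
    """Match each call with the earliest later return travelling the opposite way."""
    pairs, open_calls = [], {}
    for m in sorted(trace, key=lambda m: m.timestamp):
        if m.kind == "call":
            open_calls.setdefault((m.sender, m.receiver), []).append(m)
        else:  # a return travels from the callee back to the caller
            pending = open_calls.get((m.receiver, m.sender), [])
            if pending:
                pairs.append((pending.pop(0), m))
    return pairs

def infer_nesting(pairs):
    """Report (parent_call, child_call) when the child fits inside the parent."""
    edges = []
    for p_call, p_ret in pairs:
        for c_call, c_ret in pairs:
            if (c_call.sender == p_call.receiver        # child issued by the callee
                    and p_call.timestamp < c_call.timestamp
                    and c_ret.timestamp < p_ret.timestamp):
                edges.append((p_call, c_call))
    return edges

trace = [
    Msg(1.0, "A", "B", "call"), Msg(1.1, "B", "C", "call"),
    Msg(1.4, "C", "B", "return"), Msg(1.5, "B", "A", "return"),
]
for parent, child in infer_nesting(pair_calls(trace)):
    print(f"{parent.sender}->{parent.receiver} likely caused "
          f"{child.sender}->{child.receiver}")
```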

For that reason the work may have some impact, but the vagueness about
how to effectively collect (and evaluate) passive traces makes me
wonder: will the difficulty, real or perceived, of getting reasonable
trace data out of an uninstrumented system prevent users from ever
reaching the stage of applying these algorithms?

The evaluation section presents lots of graphs for pathological
cases. These don't seem very helpful -- many are not easy to read and
interpret. It would have been nice to see a plot of the execution
costs for these algorithms (like the other plots in the evaluation)
instead of that massive table at the end.

This work could be reproduced ... if one can find trace data. The
authors provide pseudocode for both the nesting and convolution
algorithms. The tough part is getting reasonable trace data, which
seems like a big challenge (big enough that the authors remain very
vague and hand-wavy about it).

One criticism related to this: the authors claim that passively
collected traces are the preferred input for their system, but do not
evaluate the system on any such traces. The authors even claim to have
that data, yet chose the traces that best demonstrated their
hypotheses! Including only favorable results while knowing that other
traces might be more realistic (and more helpful for implementers) for
real use cases seems like a no-no in science. Another criticism is
that, in the explanation of the convolution algorithm, the authors
mention that certain tunings can be made to accommodate network
conditions (higher delay variance, for example), but these are not
explored well at all.

James Tavares

Nov 4, 2009, 10:44:09 PM
to CSCI2950u
Project5


Paper Title: Performance Debugging for Distributed Systems of Black Boxes

Author(s): Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener,
Patrick Reynolds, Athicha Muthitacharoen

Date: SOSP’03. October 19-22, 2003, Bolton Landing, New York.

Novel Idea: The paper presents a pair of algorithms for the offline
processing of RPC and message traces of distributed systems whose
components can be treated as black boxes. Causal paths are statistically
inferred and latencies are assigned to each step of the transaction.
Assuming passive traces can be obtained which preserve (sender,
receiver, timestamp) tuples (in addition to ‘call’ or ‘return’ for RPC
traces), no modification to “application, middleware, or messages” is
necessary.

Main Result(s)/Impact: The primary outputs are a description and
comparison of the two proposed algorithms, a discussion of related work
and how it differs, and a rather detailed experimental analysis of the
effectiveness and sensitivities of the proposed algorithms (see
‘evidence’). I believe that the detail and intricacies present in this
paper point to how difficult a problem reverse-engineering causal paths
from traces is.

Evidence: The authors were not shy in attempting to evaluate the
effectiveness of the proposed algorithms at reconstructing call paths
from traces, in addition to investigating how noise (from various
sources) impacts output quality. The authors claimed that it was too
difficult to obtain real trace data, so a variety of traces from an
application-level logger, a trace generator, and email histories were
used after being stripped of their detail. A few experiments on call
path reconstruction and latency revealed that both algorithms were able
to correctly identify the call paths and the high-latency component in
each scenario provided. Algorithm accuracy (vs. “ground truth”) was
measured against un-stripped data sets, with few false positives and
few false negatives reported. Delay variation, message loss, clock
skew, and parallelism were all also tested and found to be of little
impact in relatively small quantities. All in all, the authors’
algorithms proved fruitful over the data sets the authors chose to test
with.

Prior/Competitive Work: The authors list a great deal of prior art in
dealing with both similar and different problems. Their essential
argument is that prior similar work either demanded application-level
instrumentation or the insertion of application-level event logs,
whereas their work treats applications as black boxes and never tags or
modifies messages in flight.

Reproducibility: Not at all reproducible – the experiments were
extremely dependent on the particular data sets used, as well as on an
algorithm which appears to have many, many “magic numbers” and
tweakable parameters. Section 5.1.1 reveals nearly a half-dozen such
parameters, plus those related to averaging and smoothing.

Question: In Section 4, Figure 2, the authors claim that “a key feature
of our approach is that we can separately report the delays of steps
2, 6, and 10”… Why can’t they also report the delays of steps 4 and 8?
It seems simple enough to extract the necessary deltas.

Criticism:

1.) The authors contend that a “tool which requires application-specific
knowledge…is much less likely to be used”. I contend that if a tool such
as the one proposed by the authors is so sensitive as to require a dozen
or more tunable parameters, rendering it so difficult to use and/or so
inaccurate that very little value comes from its use, then THAT tool is
much less likely to be used.

2.) I would think that developers would very rarely need to profile an
*entire* distributed system. Instead, for example, it is probably known
in advance what outward facing service of a web service is suffering
from poor performance, and what is needed is a specific trace from that
outward facing service as it flows through some set of internal servers.
The path is unknown, but the starting point is—why not just selectively
trace on that path? This system forces one to instrument nearly an
entire system just to obtain one trace. To borrow an expression from the
2008 presidential election, this system appears to be the “hatchet” when
what might more typically be needed is a “scalpel”.

3.) With regard to Section 2.1 and the authors’ second hypothesis: if
the authors are not willing to make a claim as to whether or not their
system will help programmers debug distributed systems, then why should
the reader?

Future Work: None.

Dan Rosenberg

Nov 4, 2009, 11:16:28 PM
to brown-cs...@googlegroups.com
Paper Title
Performance Debugging for Distributed Systems of Black Boxes

Authors

Marcos K. Aguilera, et al.

Date
October, 2003

Novel Idea
The authors propose strategies aimed at identifying high-latency causal
path patterns in inter-module network traffic and at identifying which
nodes in the network are responsible for these patterns.  These
insights should prove useful to programmers attempting to debug a
distributed system's performance.

Main Result
The authors develop two algorithms to extract causal information from
communications with both RPC-style and message-based semantics.  They evaluate
these algorithms on traces collected from live systems.

Impact
This seems like a starting point for future work, but more development would
have to be put in before producing anything usable on a live system.

Evidence
The authors evaluate their techniques using primarily synthetic traces.  The
primary goal is evaluating the accuracy of their algorithms, so they devote
several pages to their description and verification.  Whether or not this would
work in a real system seems to remain an open question.

Prior Work
The authors expand on prior work in debugging performance in distributed
settings, such as Magpie, DPM, and several commercial solutions.

Reproducibility
The techniques in this paper are clearly described, but not enough details are
provided on implementation to reproduce these findings.

Criticism
Implementation details seem to be lacking - they discuss what they didn't do to
obtain trace information, but go into little depth on what they DID do.  The
authors admit that they did not stick to their own requirement of black-box
testing, in the interest of testing their algorithms - why couldn't they have
done both?  Finally, not using application-layer data contained in requests
seems like it would limit the functionality, especially in situations with
multiple concurrent requests being handled by the same application and being
responded to in potentially different ways.

Questions/Ideas for Further Work
A means of visualization is important to this work, so developing a way of
visualizing these causal relationships should be the first priority.

Kevin Tierney

Nov 4, 2009, 10:48:09 PM
to brown-cs...@googlegroups.com
Title: Performance Debugging for Distributed Systems of Black Boxes
Author(s): Aguilera et al.
Date: SOSP '03

Novel Idea
The authors set out to reconstruct causal paths by statistical
inference over a network message log, developing two algorithms to
accomplish this task: one uses RPC call/return messages, and the other
uses signal-processing techniques (looking at aggregations of
messages).

Main Result(s)
The authors create two polynomial time algorithms and test them
empirically on several different scenarios. The results show a fairly
high false positive rate for assigning calls (from nodes on a network)
to paths under many of the scenarios, but false negatives seem to be
better contained.

Impact
This work provides an interesting template for future work, one that I
think has given rise to numerous ideas for tracing programs and
network activity.

Evidence
The authors test their algorithms on several scenarios:
trace-generator, J2EE environment (from the "PetStore" application),
email received-header traces, and "several other traces."

Prior Work
Magpie is cited as the closest work to this paper, dealing with
performance analysis of distributed systems. However, Magpie tags
messages with an identifier, which this work does not do.

Reproducibility
Yes; given access to the datasets used by the authors, the algorithms
they present look implementable.

Question
What guarantees do the two algorithms give us outside of runtime guarantees?

Criticism
First, I think the results section is a bit unclear at times, and I
can't find a direct comparison of the nesting and convolution
algorithms. (Perhaps I am just too tired to see it sitting in front of
my face, but I can't find it.)

Also, I think one thing that is missing is any guarantee (or lack
thereof) about the algorithms, along with some intuition about why
these algorithms should work well.

Ideas for further work
More sophisticated ML techniques could probably be used to determine
good pairings of trace calls (since it is an offline analysis anyway,
CPU time isn't a primary concern).

Rodrigo

Nov 5, 2009, 2:27:07 AM
to CSCI2950-u Fall 09 - Brown
Review for Qiao

Performance Debugging for Distributed Systems of Black Boxes

Large-scale distributed systems built from communicating black boxes
are hard to debug. The paper presents an approach that uses low-level
message traces, features offline processing, and assumes a strict
black-box model. The goal is to help programmers isolate performance
bottlenecks in black-box distributed systems. Previous research does
not consider black-box components.
Their approach relies on tracing messages between the nodes and using
offline algorithms to construct causality from these traces. No
modification to the application is needed. It has three phases:
exposing and tracing communication, inferring causal paths and
patterns, and visualization. They have two algorithms to construct
causality: the nesting algorithm for RPC-style communication and the
convolution algorithm for free-form message-based communication.
The nesting algorithm constructs causal paths more precisely than the
convolution algorithm, but only works for some RPC-based distributed
systems. The convolution algorithm has the drawback that if a node
appears multiple times on a path, a large number of redundant paths
will be generated. The convolution algorithm needs just the timestamp,
sender, and receiver, while the nesting algorithm also needs to know
whether messages are calls or returns.
The experiments are simulation-based, which means further research is
needed in real-world environments. They are able to construct some
proper causal path patterns using both of the algorithms. The quality
of the results and the execution cost are not made clear in this paper.
For future work, they will trace and analyze full-scale, real-world
applications.

Spiros E.

Nov 5, 2009, 7:31:26 AM
to CSCI2950-u Fall 09 - Brown
The paper presents tools that enable programmers to isolate
performance bottlenecks in distributed systems composed of black boxes.
These tools assume nothing about the implementation of the black boxes
comprising the system, nor do they assume anything about the protocols
the black boxes use to communicate. They are offline tools that are
not suitable for real-time monitoring.

Given a trace of network messages between nodes, the paper presents
two algorithms for determining causal paths comprising node-to-node
messages. The first is well-suited for RPC-style communication, while
the other can more generally infer causal paths of RPC communication
as well as asynchronous (non-returning) requests.

This paper is more about analyzing traces through the distributed
system and largely ignores the issue of collecting the traces to begin
with. In fact, the paper suggests a passive method of trace collection,
while using traces collected from instrumented systems in its
evaluation. In addition, the paper assumes that message boundaries of
an arbitrary protocol can be inferred, thus obviating the need for any
knowledge of the protocols used in a system. I question this
assumption, especially in the face of abnormal network behavior.

While the paper presents two trace analysis algorithms, only the first
(nesting) algorithm's accuracy is evaluated.

小柯

Nov 5, 2009, 8:52:14 AM
to brown-cs...@googlegroups.com
Paper Title:    Performance Debugging for Distributed Systems of Black Boxes

Authors:        Marcos K. Aguilera
                    Jeffrey C. Mogul
                    Janet L. Wiener
                    Patrick Reynolds
                    Athicha Muthitacharoen

Date:            2003

Novel Idea:
   

Main Result:
    Requiring no modification to, or semantic knowledge of, the distributed system itself, the authors propose an approach to trace communications between components and services. The recorded data can later be analyzed by different algorithms, and causal paths or patterns containing the nodes with the greatest impact on the system can be identified. This is very helpful for programmers who want to trace and improve their systems.

Impact:
    Much research on monitoring tools that require no semantic knowledge of, or modification to, existing systems has been conducted, because such tools are much more practical and convenient to use.

Evidence:
    The authors differentiate the message styles used by various distributed systems and introduce different algorithms (nesting and convolution) to process them. This is itself an example of using knowledge of the communication style to obtain a better analysis.

Prior Work:

Competitive work:
    BorderPatrol

Reproducibility:
    Since most details are provided, it seems feasible to build and test this system the same way the authors did.

Question:
    As mentioned before, the authors state that they need no knowledge about the system; however, without information about the messages, how do they decide which algorithm should be used to process their tracing data?

Criticism:
  
Ideas for further work:

joeyp

Nov 5, 2009, 9:01:46 AM
to CSCI2950-u Fall 09 - Brown
Performance Debugging for Distributed Systems of Black Boxes

Aguilera et al

SOSP 2003

This paper presents a system for debugging systems composed of black
boxes - components that adhere to an interface but do not expose any
internal workings. Examining systems of black boxes is desirable
because of the number of proprietary products that are often used in
systems like these. They cannot be easily (or sometimes ever)
instrumented, so a solution outside the application is needed.

The main result I get from this paper is that given data about the
communication between components in a system, it is possible to create
an impression of the paths that requests likely took. From this
paper, I didn't get that this data would be easy to acquire, only that
it could be used after the fact to infer information about the
network. This isn't an insignificant result, but ensuring that there
is a way to collect this data that works with the algorithms presented
would have really been cool.

I was disappointed in Section 6. The paper claims to be motivated by
providing as close to a zero-knowledge prerequisite as possible to use
the system, avoiding instrumentation and changes to components. For
this to work, there needs to be a very clear way to acquire the data
and get the process started. I was not convinced by this section of
the paper that this is always (or even most of the time) possible. To
me, the results of experiments based on passive collection are the
exciting ones, and the ones that would really sell me on this system.
Evaluating on synthetic traces, or real whitebox traces "converted" to
blackbox, can confirm that the algorithms work to some extent, but
doesn't make an argument for the entire targeted use case.

They also seem conflicted on this issue - tracing packet-level data is
"not a novel issue," but it's a challenge and they're discussing
possible approaches. Which do they recommend? Which has worked for
them? Is this in fact a really easy problem to solve that I just don't
know about?
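
For what it's worth, here is the kind of passive collection I was
picturing while reading Section 6. This is my own sketch using scapy
(nothing the authors describe), and it only yields packet-level
(timestamp, sender, receiver) tuples; turning those into the
message-level records the algorithms actually want is the part I still
don't see.

```python
# Illustrative-only sketch: passively reduce observed TCP packets to
# (timestamp, sender, receiver) tuples for later offline analysis.
# Assumes scapy is installed and the process has capture privileges;
# message boundaries and call/return labels are NOT recovered here.
from scapy.all import IP, TCP, sniff

trace = []

def record(pkt):
    if IP in pkt and TCP in pkt:
        trace.append((float(pkt.time), pkt[IP].src, pkt[IP].dst))

sniff(filter="tcp", prn=record, count=100)  # capture 100 packets, then stop
print(trace[:5])
```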
