Performance Debugging for Distributed Systems of Black Boxes
Author(s)
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick
Reynolds, Athicha Muthitacharoen
Date
SOSP'03 - Symposium on Operating Systems Principles, 2003
Novel Idea
Passively obtaining system-level message traces and, without
reasoning about message or system semantics, inferring causal paths
and the main node/path latency bottlenecks.
Main Result(s)
The paper presents two algorithms for inferring node/path bottlenecks
from message traces. The nesting algorithm works by scoring candidate
causal parents for pairs of related messages, while the convolution
algorithm correlates inter-node messages in a whole-trace analysis.
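As a rough illustration of the nesting idea (not the authors' actual scoring, which also involves penalties and shared credit among competing candidate parents), one could count how often call pairs of one type nest completely inside pairs of another type; the record format here is a hypothetical simplification:

```python
def nesting_scores(pairs):
    """Score candidate parent/child relationships between call-pair
    types by counting complete nestings.

    `pairs` is a list of (call_ts, return_ts, src, dst) tuples, i.e.
    already-matched RPC call/return pairs from the trace.
    """
    scores = {}  # ((outer src, dst), (inner src, dst)) -> nesting count
    for o_call, o_ret, o_src, o_dst in pairs:
        for i_call, i_ret, i_src, i_dst in pairs:
            # the inner call must be issued by the outer call's callee
            # and fall strictly within the outer call's interval
            if i_src == o_dst and o_call < i_call and i_ret < o_ret:
                key = ((o_src, o_dst), (i_src, i_dst))
                scores[key] = scores.get(key, 0) + 1
    return scores
```

For example, if A's call to B always encloses B's call to C, the (A→B, B→C) count grows with the trace, hinting that the A→B call causally precedes the B→C call.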
Impact
The techniques are clearly useful when absolutely no knowledge about
system behavior is available. In this worst-case scenario, inferring
causal correlations between messages may be the only tool available
for latency diagnostics.
Evidence
The authors describe the nesting algorithm in appropriate detail,
including the score attribution and penalties involved. They briefly
describe the "discrete convolution" of two functions (the s_1
functions) at the core of the convolution algorithm, appropriately
contextualized within the complete algorithm.
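A toy version of that correlation step might look like the following (binned counting plus brute-force cross-correlation; the paper's actual algorithm operates on indicator functions of the trace and includes refinements not shown here, and all names and parameters below are illustrative):

```python
def estimate_delay(recv_times, send_times, bin_width, max_delay):
    """Cross-correlate binned timestamps of messages arriving at a
    node with timestamps of messages it sends out; a spike at lag d
    suggests the node typically forwards work about d time units
    after receiving it.
    """
    horizon = int(max(recv_times + send_times) / bin_width) + 1
    lags = int(max_delay / bin_width) + 1
    s1 = [0] * (horizon + lags)
    s2 = [0] * (horizon + lags)
    for t in recv_times:
        s1[int(t / bin_width)] += 1
    for t in send_times:
        s2[int(t / bin_width)] += 1
    # corr[d] = sum_t s1[t] * s2[t + d], for lags d = 0 .. max_delay
    corr = [sum(s1[t] * s2[t + d] for t in range(horizon))
            for d in range(lags)]
    return corr.index(max(corr)) * bin_width  # lag of the tallest spike
```

With receives at t = 0, 10, 20 and sends at t = 2, 12, 22, the correlation peaks at a lag of 2.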
They also describe some practical issues involved in collecting trace
data; finally, they describe some experiments. In the experimental
section, they evaluate (1) artificial traces generated by a tool of
their own design; (2) J2EE traces produced under a load generator
(unspecified); and (3) an email-header trace (artificially built from
one user's messages).
Prior Work
They mention Pinpoint, which focuses on faults but is also based on
message tracing; they also cite techniques based on perturbing system
components (one by Brown et al. and another by Bagchi et al.).
Other statistical techniques for establishing correlation are also
mentioned (one by Zhang and Paxson).
Competitive work
They mention Magpie, which treats components as black boxes but,
according to the authors, has a "more sophisticated tracing
infrastructure" and "concentrates more [...] on rare anomalies". A
system by Hrischuk et al. is also mentioned, as are commercial
products such as AppAssure, PerformaSure, and OptiBench.
The paper cites NetLogger and also ETE. These solutions, however, are
not black-box techniques (they require changes to the application).
Reproducibility
It is difficult to evaluate their results. They acknowledge that they
did not have access to "full-scale, real-world applications", and they
mention that testing the tool on such applications is important future
work. Additionally, the experimental setups are not completely
specified: the tracelet parameters of the artificial traces, the J2EE
load generator, and the details of the email-header test setting are
all left unclear.
Questions + Criticism
[Criticism] The authors acknowledge that they take a "radical"
approach to analyzing black-box components (Sec 3.1). However,
although such an approach is certainly an option for worst-case
situations, where absolutely no knowledge of the distributed system is
assumed (by the analyzing tool), there are other "external" factors
that could be considered, perhaps without losing the "black-box"
classification: the network's message-loss ratio, link-contention
problems (which could be measured on the fly), and some "known"
causality information (perhaps the users who set up the system know
that certain communication patterns are actually impossible).
[Question] Why not use this external network information, and any
"assumed known" correlation characteristics, as input hints to the
algorithms?
[Criticism] They try to _infer_ network metrics _using_ message
tracing, but perhaps _using_ network metrics (possibly measured on the
fly) could yield better message correlation (for instance, by
adjusting the delay tolerance of some parent-attribution scores) and a
better analysis (perhaps information like "the high latency of link X
is responsible for numerous retransmissions of messages from component
A to component B").
In any case, their results for "pathological cases", high parallelism,
message loss, or high variance in delay might improve if their
algorithms "stretched" their scoring/tolerance parameters in response
to network metrics. Whether that would indeed be the case, and by how
much, is another [Question].
[Question] Extending this argument to a wider view: how does this
approach compare to the tactic of promoting application changes that
might improve overall system instrumentation (specific log formats,
etc.)? Do such application-level changes make a big difference in
_latency analysis_?
Ideas for further work
One future improvement, mentioned by the authors, is providing a
suitable visualization mechanism. Additionally, modifying the
algorithms to report several alternative causal paths for the message
correlations, each with an associated global score, might be useful,
especially when "pathological cases", high parallelism, message loss,
or high variance in delay are present. Another possible improvement
was mentioned in the "Questions + Criticism" section: using network
metrics to improve message correlation instead of only doing the
opposite.
Paper Title |
Performance Debugging for Distributed Systems of Black Boxes
|
Author(s) |
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, Athicha Muthitacharoen |
Date | 2003 |
Novel Idea | 'Solution vendors' need to be able to identify component interactions that are causing delays, but the interacting components may come from different vendors who are unwilling or unable to provide support for the interaction between them. Many solution vendors also do not have the resources (source code, expertise, money) to do in-house modification of system components for diagnostic purposes. Therefore, the authors set out to create a diagnostic system which treats components as black boxes and requires minimal system knowledge to provide useful performance data. |
Main Result(s) |
Their goal is to listen to a system's TCP traffic for a while, then assemble likely call paths and profile those paths. They describe two main algorithms for assembling call paths. The first is specific to RPC-based software, and attempts to identify 'call pairs' (an RPC call from A to B, and the subsequent return call from B to A) simply by matching a call with the earliest unmatched return call [see 'Criticism' below]. Then they try to identify causal links by looking at call pairs that are completely nested inside others, and scoring them by how often call pairs similar to the inner one occur completely inside call pairs similar to the outer one. The second is more general. It uses a signal-processing strategy: starting from a node i that sends a message to node j, they assume that any subsequent messages from j to [k, l...] within d units of time from receipt of i's message are caused by that message, and so forth. It scans the entire trace this way, and builds up a set of possible causal links. Then, for pairs of connected nodes, it computes the convolutions of pairs of functions based on a fixed time epsilon, such that the convolution shows a spike if the second node often sends a message within d units of receiving one from the first. |
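The 'earliest unmatched return' rule can be sketched in a few lines (using a hypothetical pre-parsed record format; the paper actually reconstructs such records from passively captured traces):

```python
from collections import deque

def match_call_pairs(trace):
    """Pair each RPC call with the earliest unmatched return going the
    other way. `trace` is a time-ordered list of
    (timestamp, src, dst, kind) tuples, kind being "call" or "return".
    """
    pending = {}  # (caller, callee) -> deque of unmatched call timestamps
    pairs = []    # (call_ts, return_ts, caller, callee)
    for ts, src, dst, kind in trace:
        if kind == "call":
            pending.setdefault((src, dst), deque()).append(ts)
        else:
            # a return from src to dst answers a call from dst to src
            calls = pending.get((dst, src))
            if calls:
                pairs.append((calls.popleft(), ts, dst, src))
    return pairs
```

Note how a single lost return would shift every subsequent match for that node pair, which is exactly the fragility criticized below.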
Impact | This paper has been cited by 349 others, including a few from 2010. Unfortunately, they are behind paywalls, and I'm not on campus, and I'm terminally lazy so I don't feel like X-forwarding a browser to read them. One of them, "Ganesha: blackBox diagnosis of MapReduce systems", also cites a paper by one R. Fonseca, though, so it must be of extremely high quality. |
Evidence | They made up a fake trace and then ran their tool on it. They added some random delay, but don't seem to have dropped any packets or tried particularly hard to interleave things confusingly. I think they also don't use the patterns they specifically mentioned they would have a hard time with. Unsurprisingly, they do OK. |
Prior Work | They mention Magpie as the most similar existing project - it does basically the same thing, but requires more intrusive modification of system messages, and 'concentrates on detecting relatively rare anomalies', which is very vague. |
Reproducibility | With lines like "Further improvements to the convolution algorithm remove “noise” (low-frequency paths), suppress the detection of accidentally short paths and of cycles, and suppress edges with negative apparent delay. Space limitations prevent us from describing them here", I don't think so. |
Question | Is the nesting-based approach really viable? It seems insanely fragile compared to the signal processing algorithm, and its drawbacks seem much more severe. The only real flaw with the signal processing algo is that it sometimes reports extra, uninteresting paths that are probably trivial to filter out. By contrast, the nesting version can't handle a large number of common call graphs, gets much more confused by a noisy trace, and is specific to RPC-based software. They seem to be trying to present them on a level playing field, and they say that they think they can fix a lot of the nesting algo's problems, but I don't really buy it. I don't see a great reason for them not to just scrap the nesting one and concentrate on making the signal processing one better. |
Criticism |
"noise in the trace—extraneous and dropped messages—does not impact the rest of the algorithm." This is disingenuous. They don't show up in its input, but of course the absences they cause do. Much worse though is their next statement about ignoring the problems with their assumption that no call is faster than its predecessors. They mention that this means they "cannot handle extraneous or dropped messages," but that seems to understate the problem - it seems that one single dropped message could in this case cause /every/ future message between those two nodes to be misclassified. In general, this paper seems less professional than others we have read. "Figure 11: Expanded multi-tier results from the nesting algorithm (you are not expected to be able to read this)" Really? Why is it here then? |
Ideas for further work | This paper contains the phrase "true false positives". I believe a promising direction of future research would be to attempt to publish papers with increasingly long strings of simple opposites. At first we can simply build off the excellent work pioneered here, perhaps extending to "true false positive negatives," or simply "true false true false." However, I think there is much more potential here than meets the eye, and that future researchers should not let the artificial limits of the past constrain them. One day it may be common for computer science research to be peppered with sentences containing 4, 5, even 7 consecutive pairs of contradictory terms! |
Authors: Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen
Date: SOSP 2003
Novel idea: The authors present a performance debugging system for distributed systems with black-box components.
Main results: The authors present two algorithms which infer causal path patterns and latencies from low-level message traces. The "nesting algorithm" requires RPC-style communication to make its inferences, while the "convolution algorithm" uses signal-processing techniques. Visualizations are available for both methods.
Impact: It goes without saying that better performance debugging tools will enable programmers to more easily eliminate performance bottlenecks.
Evidence: To test the nesting algorithm, the authors traced inter-EJB calls in the PetStore J2EE application. To test the convolution algorithm, the authors used headers from an individual's email. The authors validate the correctness of their algorithms with case-specific metrics.
Prior work: Previous systems, such as NetLogger, required programmers to modify their applications and insert logging statements.
Competitive work: In comparison to Magpie, the authors have a less sophisticated tracing architecture and a more sophisticated post-processing architecture.
Reproducibility: The algorithms are described well, and pseudocode is provided for both nesting and convolution, so these results are fairly reproducible.