James Tavares
November 3, 2009
NetMedic
Paper Title: Detailed Diagnosis in Enterprise Networks
Author(s): Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad
Agarwal, Jitendra Padhye, Paramvir Bahl
Date: SIGCOMM ’09. August 17-21, 2009. Barcelona, Spain.
Novel Idea: The paper’s novel contribution is the use of historical
data to predict the extent to which a failure in one application can
affect another on the network, and the use of those relationships to
predict causes of current failures in an enterprise network. Data is
collected at the machine level and the application level from a
variety of performance-related counters, along with configuration
change histories. Each of these inputs guides NetMedic in attempting
to make the most specific failure-cause estimate possible.
All variable correlation analysis is done without knowledge of the
application’s semantics.
Main Result(s): The authors claim a first-of-its-kind study in the
analysis of failures in small enterprise networks. The paper also
provides a fairly detailed description of the classification and
prediction algorithms used within the NetMedic system, as well as a
description of how data is collected from system performance counters.
An evaluation is performed which attempts to show NetMedic’s
superiority over a coarse-grained tool.
Impact: I would predict that these results would have little impact in
the field unless the system can be proven to be more robust over
varying networks.
Evidence: Results of the survey are provided, including a list of 10
(possibly non-representative) example problems and a table showing
weight by problem impact, symptom, and cause. In the performance
evaluation section, a variety of graphs are presented to show
NetMedic’s capabilities. Specifically, the authors claim NetMedic is
able to correctly identify faults on the first try 80% of the time,
and “almost always” asserts the correct cause within its list of top
five culprits. These claims are based on an evaluation performed
against failures injected into an 11-machine “live” system operated
with the assistance of volunteers.
Prior/Competitive Work: The authors primarily compare their work to
that of systems which assign only a single state variable to each
application (a measure of the service’s healthiness), and claim that
this single metric/state variable is insufficient.
Reproducibility: The logic involved is convoluted enough that building
a duplicate system for experimental purposes would be rather
difficult, to say nothing of the fact that the exact error conditions,
network conditions, and users are all variables that could not be
accurately reproduced.
Question:
1. The authors claim the 10+1 testbed was a “live” production
environment, yet they had to deploy their own servers and custom
clients. Can this really be considered a live test?
2. How would NetMedic capture a situation where a server was giving
the ‘wrong’ answer? Presumably, it is possible that none of the
application-level performance counters would appear abnormal in this
case (e.g., the web server is still serving the same number of
requests per second, etc.).
Criticism:
1. This “study” left much to be desired: only looking at 450
problem reports out of 450,000? Further, how were the 450 ‘random’
problems chosen? Were they representative of the organizations included
in the study (as determined by size, type, industry, etc.)? Were they
representative of times of day, weekends, etc.?
2. Given that NetMedic was only tested on a single network of 11
computers, it would be interesting to have seen how NetMedic compared
to the analysis of a seasoned systems administrator. For that matter,
is there any evidence that for networks of 10–100 computers (the
range NetMedic targets), a simple service-monitoring system
such as Nagios is insufficient?
3. I think that a key question that remains to be answered is: How
does NetMedic’s prediction capability change as network size
increases? The number of variables could skyrocket as network size
increases, making it very difficult for NetMedic to determine correct
correlations and inferences amongst the noise.
4. The authors state in 7.6 that “30 minutes of historical data
suffices for most faults”… That is a bit broad: actually, it suffices
for the handful of faults that they injected into the system, at best.
Future Work: It may be interesting to add a framework whereby
applications could export their internal state to NetMedic. This may
allow NetMedic to make more interesting parallels than it otherwise
could with performance counter data alone.
Detailed diagnosis in enterprise networks
Author(s)
Kandula, Srikanth, Mahajan, Ratul, Verkaik, Patrick, Agarwal, Sharad,
Padhye, Jitendra, and Bahl, Paramvir
Date
SIGCOMM 2009
Novel Idea
I think the use of history-based joint-behavior estimation between
components is the only novel idea in this paper. Perhaps the attempt
to abstract the model away from application semantics could also be
considered novel.
Main Result(s)
There is a need for diagnostic tools that detect both generic faults
and application faults, so in this paper the authors propose a
solution (NetMedic) based on an intuitive technique: using the joint
historical behavior of components to estimate the likelihood of them
impacting each other in the present. NetMedic has two major functions:
first, it formulates detailed diagnosis of a problem as an inference
problem; second, it estimates when two entities in the network are
impacting each other without knowledge of how they interact (i.e.,
without knowledge of application-specific issues).
The model represents the network as a dependency graph between
components and assigns a weighted directed edge from a source to a
destination if the source impacts the destination. Construction of the
dependency graph is automated. Using the weighted directed edges,
visible changes in any component are attributed to other components;
the weights determine the contribution of each component to the change
in state (ranking likely causes), from which the main cause is finally
determined.
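As a hedged illustration of this ranking step (a minimal sketch; the component names, edge weights, and simple weight-sorted ranking below are invented for illustration, not taken from the paper):

```python
# Hypothetical sketch of cause ranking over a weighted dependency
# graph. Component names and weights are invented; NetMedic's real
# ranking combines abnormality and edge weights in a richer way.

# Directed edges: (source, destination) -> impact weight in [0, 1],
# estimating how strongly the source affects the destination.
edges = {
    ("sql_config", "sql_server"): 0.9,
    ("sql_server", "web_app"): 0.7,
    ("machine_A", "sql_server"): 0.2,
    ("network_path", "web_app"): 0.1,
}

def rank_causes(affected, edges):
    """Rank components pointing at `affected` by edge weight,
    highest-impact candidate first."""
    candidates = [(src, w) for (src, dst), w in edges.items()
                  if dst == affected]
    return sorted(candidates, key=lambda c: c[1], reverse=True)

print(rank_causes("web_app", edges))
# -> [('sql_server', 0.7), ('network_path', 0.1)]
```

An operator would inspect the top-ranked candidates first, which is how the paper's "list of top five culprits" claim is framed.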
Finally, the authors explain NetMedic's workflow, which can be divided
into three phases: capturing component state, generating the
dependency graph, and diagnosis. From the paper I infer that this is
an attempt to build a diagnostic tool, or at least a model, that can
diagnose faults or abnormal behavior in systems without
application-specific data or cognizance of domain semantics.
Evidence
The evaluation is done in two environments: one with 10 clients
running custom client processes, and another with 3 clients in a
controlled environment. Ideally I would have liked them to evaluate
the model at a larger scale, at least >30 machines; this work was done
at MSR, and I don't think they have funding issues! Moreover, they did
not evaluate on real network systems, where complexity is higher. But
to be fair, it is a difficult problem they are trying to address.
Question
How efficient is the history-based mechanism? Is a recent history of
30–60 minutes enough? How long would it take for components to
generate enough historical data to make decisions?
Criticism
I am not really convinced about the efficiency and accuracy of the
model given the evaluation presented in the paper.
Multiple simultaneous faults could have been examined more deeply, as
they are very common in practice.
Authors:
Srikanth Kandula Ratul Mahajan Patrick Verkaik
Sharad Agarwal Jitendra Padhye Paramvir Bahl
Date:
Aug. 2009
Novel Idea:
Analyze the joint behavior of two components in the past and estimate
the impact of current events.
Main Result(s):
The authors studied small enterprise networks and realized that they
differ from large enterprises in that their administration is less
sophisticated. They developed a diagnosis system, NetMedic, which is
scalable to large networks.
Impact:
The paper presents an approach that enables detailed analysis at a
finer granularity with little application specific knowledge.
Evidence:
- Modeling the network as a dependency graph and then using history to
detect abnormalities and likely causes. The nodes of this graph are
network components such as application processes, machines,
configurations, and network paths. There is a directed edge from a
node A to a node B if A impacts B, and the weight of this edge
represents the magnitude of this impact.
- NetMedic's workflow: capturing component state, generating the
dependency graph, and diagnosing by computing state abnormalities and
ranking the edges.
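The "computing abnormalities" step above might be sketched as follows, assuming a simple z-score baseline (a hedged illustration; the paper's actual statistical machinery differs, and the counter values here are invented):

```python
import statistics

def abnormality(history, current):
    """Score how far `current` deviates from its historical values,
    mapped to [0, 1] via a capped z-score (>= 3 sigma counts as
    fully abnormal)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # guard constant history
    z = abs(current - mean) / stdev
    return min(z / 3.0, 1.0)

# Example: a requests/sec counter that suddenly drops.
history = [100, 98, 102, 99, 101]
print(abnormality(history, 100))  # -> 0.0 (normal)
print(abnormality(history, 40))   # -> 1.0 (highly abnormal)
```

In NetMedic's terms, components whose state variables score high here become candidates whose causes are then ranked via the dependency graph.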
Prior Work
N/A
Competitive work
None for small enterprise networks.
Reproducibility
Gathering component states (analyzing the logs) is a barrier.
Question
What made them come to this idea of studying small enterprise networks?
Criticism:
Obtaining the dependency graph is complex when an operator wants to
debug a performance problem. Future work could therefore reform how
logs are organized, keeping the data in a form that makes diagnosis
easy.
--
J.W
Paper Title: Detailed Diagnosis in Enterprise Networks
Authors: Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl.
Date: 2009
Novel Idea: The paper presents a diagnostic system, “NetMedic”. This system aims at detecting and diagnosing fine-grained, application-specific issues in small enterprise networks. It uses a chain of dependency edges to represent the relationships between components, and decides whether a component is affected by another by comparing the similarity of the present and historical states of these components. By exploiting the links between components, the system can not only detect abnormal states, but also identify the culprits behind them.
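A rough sketch of this present-vs-historical state comparison, under invented state vectors, an assumed similarity measure, and an arbitrary threshold (not the paper's actual formulas):

```python
# Hedged sketch of the history-based impact idea: pick historical
# time steps where the source's state resembled its current state,
# and check whether the destination also behaved then as it does now.
# All values, the similarity measure, and the 0.8 threshold are
# illustrative inventions.

def similarity(a, b):
    """Crude similarity of two state vectors, in [0, 1]."""
    diffs = [abs(x - y) / (abs(x) + abs(y) + 1e-9) for x, y in zip(a, b)]
    return 1.0 - sum(diffs) / len(diffs)

def impact_weight(src_hist, dst_hist, src_now, dst_now, thresh=0.8):
    """Average destination similarity over history steps where the
    source looked like it does now; a high weight means the source
    plausibly explains the destination's current state."""
    scores = [similarity(dst_hist[t], dst_now)
              for t in range(len(src_hist))
              if similarity(src_hist[t], src_now) >= thresh]
    return sum(scores) / len(scores) if scores else 0.0

# Toy single-counter states: when the source was at 10, the
# destination was at 5 -- just as both are now.
src_hist = [[10], [50], [10]]
dst_hist = [[5], [25], [5]]
print(impact_weight(src_hist, dst_hist, src_now=[10], dst_now=[5]))
# -> 1.0 (history fully supports the source explaining the destination)
```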
Main Result & Impact: The paper describes the design and implementation of NetMedic, and the evaluation shows that its rate of correctly identifying the culprits behind system behaviors is high. By studying the joint behavior of two components, the system can cut down by a factor of three the number of source components deemed to be affecting a destination component, and this is the main reason for its effectiveness. The system is better suited to small enterprises because in small enterprise networks, workload and performance are less important than specific application problems, and the system's fine-grained diagnosis addresses exactly those.
Evidence: The paper describes the design of the system and its basic idea in great detail. The evaluation also reflects its effectiveness.
Prior work & competitive: The paper introduces existing diagnostic systems, but all of them are mentioned as examples of coarse-grained diagnosis systems for large enterprise networks. Dependency graphs are also used in some other formulations, but the paper contrasts its formulation with those existing models.
Reproducibility: I think it’s possible to reproduce the work of this paper. It’s described in great detail.
Question & Criticism: none.
Novel Idea
NetMedic is a system that leverages information provided by modern
applications and operating systems to provide detailed fault diagnosis
to operators of small-scale networks. It is capable of providing
a detailed response without semantic knowledge of applications,
through an inference framework that formulates the discovery of
relations between components of different granularities using a more
informative dependency graph.
Main Result(s)
From the observations and live experiments, the authors found that a
great number of faults are application-specific defects. The
history-based inference technique of NetMedic, augmented with the
extensions described in the paper, is capable of precisely identifying
the culprit in roughly 80% of the cases considered in the controlled
and live-environment experiments. The remaining 20% of the cases are
due to performance-related issues, which, as the authors note, are
outside the scope of the system.
The extensions show that NetMedic has almost the same performance as
the case where semantic knowledge is given a priori. Finally, NetMedic
has shown promising results when detecting multiple faults.
Impact
NetMedic is a relevant system as it provides detailed diagnosis about
hard-to-find faults over different domains of a network (from multiple
connections to single applications pertaining to one machine). Given the
level of granularity of NetMedic, a system administrator or network
operator can quickly locate the culprit and repair the misbehaving node.
Evidence
The authors based their work on reports of real problems that are common
in enterprise networks. After interviews with operators, they identified
the main issues they have when dealing with interconnected systems and
errors from different domains and used the findings as part of their
motivation to develop NetMedic.
Prior Work
A considerable number of articles are mentioned throughout the paper, in
special in Section 9. Previous diagnostic systems are divided into four
broad categories and described in details: (1) inference-based, (2)
rule-based, (3) cluster-based, and (4) single-machine.
Competitive work
The authors mention that since they do not know of any detailed
diagnosis techniques to compare NetMedic against, they develop a method
called Coarse that is inspired by Sherlock and Score.
Reproducibility
As far as I know, NetMedic is not available for download. Although the
description of how NetMedic was implemented is clear, it lacks enough
detail about the system's intricacies, which would complicate an
independent reimplementation. Finally, not enough information about
the experiments was provided, such as the workload on each client and
server and the types of applications and timestamps of the fault
injections. Given all these constraints, it can be said that the
paper's findings would not be easy to reproduce.
Question(s)
1.) In Section 6, it is said that NetMedic reads the values of all
exported application counters periodically, even though some of them
are cumulative. NetMedic can identify those that are cumulative and
correctly extract the latest value. How is this possible if no
application knowledge is assumed in the system?
Criticism
1.) Free Microsoft propaganda.
2.) The authors mention that their system could easily be applied to a
network of 100 machines, but their experiments consider a maximum of ...
11 <sigh>. Lack of resources is definitely not an excuse for MSR! Is
NetMedic really what it claims to be?
3.) If the real clients communicating with the real machines cannot be
accessed, so as not to disturb the live environment, why the need for
volunteers? How does this differ from a synthetic workload? The
authors should have been clearer about their methodology.
4.) The authors mention that active probing is used to measure path
loss rate. It is well known that active probing can influence the
measurements. In the case of NetMedic this is even worse, since path
pathologies are one of the possible causes of abnormal behavior, which
means that, in extreme cases, NetMedic could induce the very abnormal
behavior it is trying to identify.
Ideas for further work
The main idea for future work is to actually verify whether NetMedic
scales to large-enterprise networks. In addition, the history-based
techniques can be applied to identify patterns in other types of
diagnosis systems, e.g., IDS, network tomography, etc.