James Tavares
November 3, 2009
NetMedic
Paper Title: Detailed Diagnosis in Enterprise Networks
Author(s): Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad
Agarwal, Jitendra Padhye, Paramvir Bahl
Date: SIGCOMM ’09. August 17-21, 2009. Barcelona, Spain.
Novel Idea: The paper’s novel contribution is the use of historical
data to predict the extent to which a failure in one application can
affect another on the network, and the use of those relationships to
predict causes of current failures in an enterprise network. Data is
collected at the machine level and the application level from a
variety of performance-related counters, along with configuration
change histories. Each of these inputs guides NetMedic in attempting
to make the most specific failure-cause estimate possible.
All variable correlation analysis is done without knowledge of the
application’s semantics.
Main Result(s): The authors claim a first-of-its-kind study in the
analysis of failures in small enterprise networks. The paper also
provides a fairly detailed description of the classification and
prediction algorithms used within the NetMedic system, as well as a
description of how data is collected from system performance counters.
An evaluation is performed which attempts to show NetMedic’s
superiority over a coarse-grained tool.
Impact: I would predict that these results would have little impact in
the field unless the system can be proven to be more robust over
varying networks.
Evidence: Results of the survey are provided, including a list of 10
(possibly non-representative) example problems and a table showing
weight by problem impact, symptom, and cause. In the performance
evaluation section, a variety of graphs are presented to show
NetMedic’s capabilities. Specifically, the authors claim NetMedic is
able to correctly identify faults on the first try 80% of the time,
and “almost always” asserts the correct cause within its list of top
five culprits. These claims are based on an evaluation performed
against failures injected into an 11-machine “live” system operated
with the assistance of volunteers.
Prior/Competitive Work: The authors primarily compare their work to
that of systems which assign only a single state variable to each
application (a measure of the service’s healthiness), and claim that
this single metric/state variable is insufficient.
Reproducibility: The logic involved is convoluted enough that building
a duplicate system for experimental purposes would be rather
difficult, to say nothing of the fact that the exact error conditions,
network conditions, and users are all variables that could not be
accurately reproduced.
Question:
1. The authors claim the 10+1 testbed was a “live” production
environment, yet they had to deploy their own servers and custom
clients. Can this really be considered a live test?
2. How would NetMedic capture a situation where a server was giving
the ‘wrong’ answer? Presumably, it is possible that none of the
application-level performance counters would appear abnormal in this
case (e.g., the web server is still serving the same number of
requests per second, etc.).
Criticism:
1. This “study” left much to be desired: only looking at 450
problem reports out of 450,000? Further, how were the 450 ‘random’
problems chosen? Were they representative of the organizations included
in the study (as determined by size, type, industry, etc.)? Were they
representative of times of day, weekends, etc.?
2. Given that NetMedic was only tested on a single network of 11
computers, it would be interesting to have seen how NetMedic compared
to the analysis of a seasoned systems administrator. For that matter,
is there any evidence that for networks of 10–100 computers (the
range NetMedic targets), a simple service-monitoring system
such as Nagios is insufficient?
3. I think that a key question that remains to be answered is: How
does NetMedic’s prediction capability change as network size
increases? The number of variables could skyrocket as network size
increases, making it very difficult for NetMedic to determine correct
correlations and inferences amongst the noise.
4. The authors state in 7.6 that “30 minutes of historical data
suffices for most faults”… That is a bit broad: actually, it suffices
for the handful of faults that they injected into the system, at best.
Future Work: It may be interesting to add a framework whereby
applications could export their internal state to NetMedic. This may
allow NetMedic to make more interesting parallels than it otherwise
could with performance counter data alone.
Detailed diagnosis in enterprise networks
Author(s)
Kandula, Srikanth, Mahajan, Ratul, Verkaik, Patrick, Agarwal, Sharad,
Padhye, Jitendra, and Bahl, Paramvir
Date
SIGCOMM 2009
Novel Idea
I think the use of history-based joint-behavior estimation between
components is the only novel idea in this paper. Perhaps the attempt
to abstract the model away from application semantics could also be
considered novel.
Main Result(s)
There is a need for diagnostic tools that detect both generic faults
and application faults, so in this paper the authors propose a
solution (NetMedic) based on an intuitive technique: using the joint
historical behavior of components to estimate the likelihood of them
impacting each other in the present. NetMedic has two major functions:
first, it formulates detailed diagnosis of a problem as an inference
problem; second, it estimates when two entities in the network are
impacting each other without knowledge of how they interact (i.e.,
without knowledge of application-specific issues).
The model represents the network as a dependency graph between
components and assigns a weighted directed edge from a source to a
destination if the source impacts the destination. Construction of the
dependency graph is automated. Using the weighted directed edges,
visible changes in any component are attributed to other components;
the weights determine the contribution of each component to the change
in state (ranking likely causes), from which the main cause is finally
determined.
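As a hedged illustration of this ranking step (a minimal sketch; the component names, edge weights, and simple weight-sorted ranking below are invented for illustration, not taken from the paper):

```python
# Hypothetical sketch of cause ranking over a weighted dependency
# graph. Component names and weights are invented; NetMedic's real
# ranking combines abnormality and edge weights in a richer way.

# Directed edges: (source, destination) -> impact weight in [0, 1],
# estimating how strongly the source affects the destination.
edges = {
    ("sql_config", "sql_server"): 0.9,
    ("sql_server", "web_app"): 0.7,
    ("machine_A", "sql_server"): 0.2,
    ("network_path", "web_app"): 0.1,
}

def rank_causes(affected, edges):
    """Rank components pointing at `affected` by edge weight,
    highest-impact candidate first."""
    candidates = [(src, w) for (src, dst), w in edges.items()
                  if dst == affected]
    return sorted(candidates, key=lambda c: c[1], reverse=True)

print(rank_causes("web_app", edges))
# -> [('sql_server', 0.7), ('network_path', 0.1)]
```

An operator would inspect the top-ranked candidates first, which is how the paper's "list of top five culprits" claim is framed.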
Finally, the authors explain NetMedic's workflow, which can be divided
into three phases: capturing component state, generating the
dependency graph, and diagnosis. From the paper I infer that this is
an attempt to build a diagnostic tool, or at least a model, that can
diagnose faults or abnormal behavior in systems without
application-specific data or cognizance of domain semantics.
Evidence
The evaluation is done in two environments: one with 10 clients
running custom client processes, and another with 3 clients in a
controlled environment. Ideally I would have liked them to evaluate
the model at a larger scale, at least >30 machines; this work was done
at MSR, and I don't think they have funding issues! Moreover, they did
not evaluate on real network systems, where complexity is higher. But
to be fair, it is a difficult problem they are trying to address.
Question
How efficient is the history-based mechanism? Is a recent history of
30–60 minutes enough? How long would it take for components to
generate enough historical data to make decisions?
Criticism
I am not really convinced about the efficiency and accuracy of the
model given the evaluation presented in the paper.
Multiple simultaneous faults could have been examined more deeply, as
they are very common in practice.
Authors:
Srikanth Kandula Ratul Mahajan Patrick Verkaik
Sharad Agarwal Jitendra Padhye Paramvir Bahl
Date:
Aug. 2009
Novel Idea:
Analyze the joint behavior of two components in the past and estimate
the impact of current events.
Main Result(s):
The authors studied small enterprise networks and realized that they
differ from large enterprises in that their administration is less
sophisticated. They developed a diagnosis system, NetMedic, which is
scalable to large networks.
Impact:
The paper presents an approach that enables detailed analysis at a
finer granularity with little application specific knowledge.
Evidence:
- Modeling the network as a dependency graph and then using history to
detect abnormalities and likely causes. The nodes of this graph are
network components such as application processes, machines,
configurations, and network paths. There is a directed edge from a
node A to a node B if A impacts B, and the weight of this edge
represents the magnitude of this impact.
- NetMedic's workflow: capturing component state, generating the
dependency graph, and diagnosing by computing state abnormalities and
ranking the edges.
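The "computing abnormalities" step above might be sketched as follows, assuming a simple z-score baseline (a hedged illustration; the paper's actual statistical machinery differs, and the counter values here are invented):

```python
import statistics

def abnormality(history, current):
    """Score how far `current` deviates from its historical values,
    mapped to [0, 1] via a capped z-score (>= 3 sigma counts as
    fully abnormal)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # guard constant history
    z = abs(current - mean) / stdev
    return min(z / 3.0, 1.0)

# Example: a requests/sec counter that suddenly drops.
history = [100, 98, 102, 99, 101]
print(abnormality(history, 100))  # -> 0.0 (normal)
print(abnormality(history, 40))   # -> 1.0 (highly abnormal)
```

In NetMedic's terms, components whose state variables score high here become candidates whose causes are then ranked via the dependency graph.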
Prior Work
N/A
Competitive work
None for small enterprise networks.
Reproducibility
Gathering component states (analyzing the logs) is a barrier.
Question
What made them come to this idea of studying small enterprise networks?
Criticism:
Obtaining the dependency graph is complex when an operator wants to
debug a performance problem. Future work could therefore reform how
logs are organized, keeping the data in a form that makes diagnosis
easy.
--
J.W
Paper Title: Detailed Diagnosis in Enterprise Networks
Authors: Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl.
Date: 2009
Novel Idea: The paper presents a diagnostic system, “NetMedic”. This system aims at detecting and diagnosing fine-grained, application-specific issues in small enterprise networks. It uses a chain of dependency edges to represent the relationships between components, and decides whether a component is affected by another by comparing the similarity of the present and historical states of these components. By exploiting the links between components, the system can not only detect abnormal states, but also identify the culprits behind them.
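A rough sketch of this present-vs-historical state comparison, under invented state vectors, an assumed similarity measure, and an arbitrary threshold (not the paper's actual formulas):

```python
# Hedged sketch of the history-based impact idea: pick historical
# time steps where the source's state resembled its current state,
# and check whether the destination also behaved then as it does now.
# All values, the similarity measure, and the 0.8 threshold are
# illustrative inventions.

def similarity(a, b):
    """Crude similarity of two state vectors, in [0, 1]."""
    diffs = [abs(x - y) / (abs(x) + abs(y) + 1e-9) for x, y in zip(a, b)]
    return 1.0 - sum(diffs) / len(diffs)

def impact_weight(src_hist, dst_hist, src_now, dst_now, thresh=0.8):
    """Average destination similarity over history steps where the
    source looked like it does now; a high weight means the source
    plausibly explains the destination's current state."""
    scores = [similarity(dst_hist[t], dst_now)
              for t in range(len(src_hist))
              if similarity(src_hist[t], src_now) >= thresh]
    return sum(scores) / len(scores) if scores else 0.0

# Toy single-counter states: when the source was at 10, the
# destination was at 5 -- just as both are now.
src_hist = [[10], [50], [10]]
dst_hist = [[5], [25], [5]]
print(impact_weight(src_hist, dst_hist, src_now=[10], dst_now=[5]))
# -> 1.0 (history fully supports the source explaining the destination)
```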
Main Result & Impact: The paper describes the design and implementation of NetMedic, and the evaluation shows that its rate of correctly identifying the culprits behind system behaviors is high. By studying the joint behavior of two components, the system can cut down by a factor of three the number of source components deemed to be affecting a destination component, and this is the main reason for its effectiveness. The system is better suited to small enterprises because in small enterprise networks, workload and performance are less important than specific application problems, and the system's fine-grained diagnosis addresses exactly those.
Evidence: The paper describes the design of the system and its basic idea in great detail. The evaluation also reflects its effectiveness.
Prior work & competitive: The paper introduces existing diagnostic systems, but all of them are mentioned as examples of coarse-grained diagnosis systems for large enterprise networks. Dependency graphs are also used in some other formulations, but the paper contrasts its formulation with those existing models.
Reproducibility: I think it’s possible to reproduce the work of this paper. It’s described in great detail.
Question & Criticism: none.
Novel Idea
NetMedic is a system that leverages information provided by modern
applications and operating systems to provide detailed fault diagnosis
to operators of small-scale networks. It is capable of providing
a detailed response without semantic knowledge of applications,
through an inference framework that formulates the discovery of
relations between components of different granularities using a more
informative dependency graph.
Main Result(s)
From the observations and live experiments, the authors found that a
great number of faults are application-specific defects. The
history-based inference technique of NetMedic, augmented with the
extensions described in the paper, is capable of precisely identifying
the culprit in roughly 80% of the cases considered in the controlled
and live-environment experiments. The remaining 20% of the cases are
due to performance-related issues, which, as the authors note, are
outside the scope of the system.
The extensions show that NetMedic has almost the same performance as
the case where semantic knowledge is given a priori. Finally, NetMedic
has shown promising results when detecting multiple faults.
Impact
NetMedic is a relevant system as it provides detailed diagnosis about
hard-to-find faults over different domains of a network (from multiple
connections to single applications pertaining to one machine). Given the
level of granularity of NetMedic, a system administrator or network
operator can quickly locate the culprit and repair the misbehaving node.
Evidence
The authors based their work on reports of real problems that are common
in enterprise networks. After interviews with operators, they identified
the main issues they have when dealing with interconnected systems and
errors from different domains and used the findings as part of their
motivation to develop NetMedic.
Prior Work
A considerable number of articles are mentioned throughout the paper, in
special in Section 9. Previous diagnostic systems are divided into four
broad categories and described in details: (1) inference-based, (2)
rule-based, (3) cluster-based, and (4) single-machine.
Competitive work
The authors mention that since they do not know of any detailed
diagnosis techniques to compare NetMedic against, they develop a method
called Coarse that is inspired by Sherlock and Score.
Reproducibility
As far as I know, NetMedic is not available for download. Although the
description of how NetMedic was implemented is clear, it lacks enough
detail about the system's intricacies, which would complicate an
independent reimplementation. Finally, not enough information about
the experiments was provided, such as the workload on each client and
server and the types of applications and timestamps of the fault
injections. Given all these constraints, it can be said that the
paper's findings would not be easy to reproduce.
Question(s)
1.) In Section 6, it is said that NetMedic reads the values of all
exported application counters periodically, even though some of them
are cumulative. NetMedic can identify those that are cumulative and
correctly extract the latest value. How is this possible if no
application knowledge is assumed in the system?
Criticism
1.) Free Microsoft propaganda.
2.) The authors mention that their system could easily be applied to a
network of 100 machines, but their experiments consider a maximum of ...
11 <sigh>. Lack of resources is definitely not an excuse for MSR! Is
NetMedic really what it claims to be?
3.) If the real clients communicating with the real machines cannot be
accessed, so as not to disturb the live environment, why the need for
volunteers? How does this differ from a synthetic workload? The
authors should have been clearer about their methodology.
4.) The authors mention that active probing is used to measure path
loss rate. It is well known that active probing can influence the
measurements. In the case of NetMedic this is even worse, since path
pathologies are one of the possible causes of abnormal behavior, which
means that, in extreme cases, NetMedic could induce the very abnormal
behavior it is trying to identify.
Ideas for further work
The main idea for future work is to actually verify whether NetMedic
scales to large-enterprise networks. In addition, the history-based
techniques can be applied to identify patterns in other types of
diagnosis systems, e.g., IDS, network tomography, etc.