Detailed Diagnosis in Enterprise Networks
Authors
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal,
Jitendra Padhye, Paramvir Bahl
Date
SIGCOMM'09 - August 2009
Novel Idea
Providing detailed diagnosis for distributed systems through
fine-grained analysis of historical data.
Main Results
The paper presents NetMedic, a tool that, with little
application-specific knowledge, uses statistical inference over
historical data to point to likely faulty components in a distributed
system. The project's two main goals are application agnosticism and
detailed diagnosis.
Impact
Small networks whose software exposes the appropriate named counters
can benefit from the tool. In other words, although NetMedic itself is
agnostic to applications, the applications must still expose the
appropriate information.
Evidence
The authors describe limitations of existing tools, namely
coarse-grained indicator variables, a uniform failure-propagation
abstraction (a failing component is assumed to affect all of its
dependents equally), and the lack of support for cycles in failure
propagation (a small sketch of the coarse vs. fine-grained contrast
follows below).
They then describe the mechanism behind their history-based inference,
and only afterwards present the system design in appropriate detail.
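To make that contrast concrete, here is a minimal sketch of how I read
it; the component and counter names below are my own inventions, not
taken from the paper.

    from dataclasses import dataclass, field
    from typing import Dict

    # Coarse model (the limitation noted above): one up/down flag per
    # component, and a "down" component is assumed to affect all of its
    # dependents equally.
    coarse_state: Dict[str, bool] = {"WebServer": True, "SqlClient": False}

    # Fine-grained, NetMedic-style model: each component is a vector of
    # named counters, sampled periodically (roughly once a minute in the
    # paper).
    @dataclass
    class ComponentState:
        name: str
        counters: Dict[str, float] = field(default_factory=dict)

    web_server = ComponentState("WebServer", {
        "cpu_percent": 12.0,            # hypothetical counter names
        "requests_per_sec": 240.0,
        "error_responses_per_sec": 0.2,
    })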
The evaluation is mostly carried out on a deployment of one server plus
ten clients. The comparison point is a generic diagnosis method with
coarse-grained indicator variables and simple failure-propagation
semantics. They evaluate the system, with good results, along several
dimensions, including cases where simultaneous failures occur, where
the historical data is constrained, and where unusual faults are
injected.
Prior Work
The authors mainly build upon inference-based techniques for analyzing
failure scenarios. On the implementation side, they also depend on the
Windows Performance Counter framework.
They also mention that the techniques they use to gather information,
analyze historical data sets, and monitor the system draw on previous
literature on identifying faults within a single machine.
Competitive Work
How do their results compare to related prior or contemporary work?
The authors place related projects in four categories: (a)
inference-based, which is NetMedic's own category; they claim that
prior work in this area focuses on large networks. (b) rule-based; they
argue these systems are not flexible enough, because they depend on a
set of a priori rules. (c) classifier-based; these train a classifier
on indicators captured both when the system malfunctions and when it
functions correctly. (d) single-machine; work that focuses on
identifying faults within a single machine (see the previous section).
Reproducibility
It is difficult to reproduce the experiments. Not enough detail was provided.
Criticism
I believe that this system strikes a good trade-off between black-box
approaches and techniques that rely on information provided from
within the applications (richer information, not just indicator
variables). I still believe that truly fine-grained diagnostic
information for large systems should come from detailed tracing
generated within the applications.
In other words, I believe that this paper presents an interesting
technique for small networks (and with good evaluation results!), but
as a system grows in complexity, so do the variables involved.
Identifying those complex variables with generic "templates" therefore
probably becomes more and more difficult, up to the point of actually
requiring application-specific data generation.
To finish, I would restate that this system is well thought out, and a
good trade-off between black-box techniques and application-generated
complex traces.
On Mon, Nov 1, 2010 at 8:22 PM, Rodrigo Fonseca
<rodrigo...@gmail.com> wrote:
Paper Title |
Detailed Diagnosis in Enterprise Networks |
Author(s) |
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl |
Date | 2009 |
Novel Idea | Operators of small networks need diagnostic tools that report problems at the granularity of processes and configurations rather than machines. However, this is hard to do in a system-independent way. One primary idea behind this paper is that past system behavior can be used to model expected future behavior, and thus to point to probable causes for erroneous behavior. |
Main Result(s) |
They describe NetMedic, which consists of a number of parts. First, they formalize the components that the system will recognize. Each component is recorded as a vector of state variables, where different component types have different variables with different meanings, e.g., the 'Machine' type has state variables such as CPU utilization, memory usage, disk usage, and the amount of network and other IO. The value of each state variable is recorded once a minute. Then they use a set of dependency templates (mappings from each type of component to each type of component that depends on it) to create a dependency graph over the actual components observed, using the actual connections observed. Then they run the diagnosis algorithm, which takes the current, problematic system state together with a range of historical system states that are not presumed fault-free, but are assumed unaffected by the current problem. They compute the abnormality of each variable in the current state, assuming that its historical values approximately follow a normal distribution, and take the maximum abnormality of a component's variables as the abnormality of the entire component (see the sketch below). Then they do a bunch of math on the differences between the current and historical states for each connected (component, dependent component) pair in the graph and assign the edge between them a weight that indicates their estimate of how likely it is that the component is responsible for the abnormality in the dependent component. They also describe a number of additional extensions to this method that try to approximate the benefits that deeper knowledge of state variable semantics would provide. Finally, they rank each component->dependent component edge based on their estimate of the likelihood that the component is the root cause of the problem. This estimate is based on the product of the path weight between them and the component's global impact. |
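As a rough illustration of the abnormality step just described, here is
a minimal sketch of my reading of it; the exact formula (a two-sided
normal tail via erf) is my own assumption, not the paper's actual code.

    import math
    from statistics import mean, stdev
    from typing import Dict, List

    def variable_abnormality(history: List[float], current: float) -> float:
        # Treat the variable's historical values as roughly normal and ask
        # how far out in the tails the current value sits. Returns a value
        # in [0, 1]; values near 1 mean "very unusual relative to history".
        if len(history) < 2:
            return 0.0
        mu, sigma = mean(history), stdev(history)
        if sigma == 0.0:
            return 1.0 if current != mu else 0.0
        z = abs(current - mu) / sigma
        # Fraction of a normal distribution lying closer to the mean than
        # the current value (erf-based, no external dependencies).
        return math.erf(z / math.sqrt(2.0))

    def component_abnormality(history: Dict[str, List[float]],
                              current: Dict[str, float]) -> float:
        # Component abnormality = max over its variables, as described above.
        return max(variable_abnormality(history[v], current[v]) for v in current)

For example, component_abnormality({"cpu_percent": [10, 12, 11, 13]},
{"cpu_percent": 95}) comes out close to 1, flagging the component as
abnormal.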
Impact | There was so much to write about in this paper... and it's 7 AM... |
Evidence |
They run NetMedic for a month in a real enterprise setting, though they had to run their own servers on which to inject faults. They find that they can correctly rank the at-fault component #1 80% of the time, compared to only 15% for a coarse version with only one variable per component and a drastically simpler edge-weighting scheme. Additionally, they rank it in the top 5 almost 100% of the time. They also run a more controlled experiment, where the coarse method does much better. They somewhat disingenuously use this to claim that NetMedic is even better, since it doesn't degrade as much when moved to a live environment, but this is a bit of a straw man. However, it may illustrate the importance of NetMedic's ability to do more fine-grained differentiation between multiple abnormal components, since a more naive system that appears to work well in a controlled environment might be prone to failure when exposed to many simultaneously abnormal components. They do a few experiments to determine the usefulness of the history given to the diagnosis algorithm. They find a performance drop-off below 30 minutes' worth of history, though this is clearly extremely system-dependent. The more interesting note here is that they find a difference between sampling from active historical periods vs. more passive ones, like day vs. night. This makes a lot of sense, but as they note, which time periods to gather history from for a given problem is an interesting open question -- if a component fails at night, is historical daytime data still more useful? |
Prior Work | They give a good overview of other kinds of fault-detection software, but kinda don't say which ones they mostly got their ideas from. They do classify themselves with existing inference-based fault-detection schemes, but say that most of them target large-scale networks and are thus really different. |
Reproducibility | They do show a lot of math, but show little of the custom code they had to write to augment the Windows Performance Counter framework (and also don't give a lot of details about their exact configuration of that software). Something similar could be made, but to deploying it a lot of research would have to be duplicated. |
Question | This research claims to be aimed at small networks, but also makes a big deal out of being application-agnostic. Is this really a good design choice to push for so hard? Certainly programming special cases for ever application is a bad idea, but would mandating/allowing a larger degree of configurability allow this system to perform well in a much larger range of environments, without sacrificing an unacceptable amount of ease-of-use? |
Criticism |
It seems like they could do better than the full command line for identifying processes -- since plenty of user configuration already appears to be required, they could at least let the operator setting up the system define a regex over command lines that would, say, treat the same executable with different options as one component (a sketch of this idea follows below). The abnormality computation makes a *lot* of assumptions. They look OK for /most/ of the sorts of variables they track, but it's the kind of thing where an operator defining a new kind of component could easily choose variables that adhere to none of their assumptions, and thus produce garbage values. For instance, a component with many interrelated high-variance variables might only really be abnormal when *all* of its state variables take on abnormal values, rather than just one, as they assume by taking the max. They also pick a totally arbitrary threshold for binary abnormality, which they do at least acknowledge, but it's the kind of thing that seems worth revisiting. |
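A minimal sketch of the regex idea mentioned above; the rules and
process names here are invented examples, not anything NetMedic
actually supports.

    import re
    from typing import List, Tuple

    # Operator-supplied rules: a regex over the full command line maps
    # matching processes onto one logical component name (hypothetical).
    GROUPING_RULES: List[Tuple[re.Pattern, str]] = [
        (re.compile(r".*\bsqlservr\.exe\b.*"), "SQL Server"),
        (re.compile(r".*\bw3wp\.exe\b.*"), "IIS worker process"),
    ]

    def component_for(cmdline: str) -> str:
        # Map a process command line to a logical component name, so the
        # same executable launched with different options is treated as
        # one component.
        for pattern, name in GROUPING_RULES:
            if pattern.match(cmdline):
                return name
        # Fall back to the raw command line, which is effectively what
        # NetMedic does today.
        return cmdline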
Ideas for further work |
There are so many little logic bits to fiddle with here! This paper seems ripe for future research. It is a very cool and very useful idea, but they make many assumptions and choose many scoring functions somewhat arbitrarily. It is true that rigorously exploring the space of scoring functions to find the one that will perform best in the real world is fundamentally intractable -- not only is the space infinite, but every real-world scenario will have a different character and be better served by a different function. However, there are many intuitive directions one could take this. Including more user customizability is one; trying to come up with a more complete picture of component abnormality is another. One interesting idea might be to try to come up with a theoretical justification for the path weight * global influence heuristic, and perhaps tweak it if it turns out not to look that good (a toy sketch of the heuristic follows below). This could be done by assuming that abnormalities are assigned correctly (i.e., actually have the assigned probability of being abnormal) and running simulations to determine how well the heuristic identifies the actual causes, or to find the mathematical reasons behind its false positives. |
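To make that heuristic concrete, here is a toy sketch under my own
assumptions: the graph, the edge weights, and the definition of "global
impact" as the mean path weight to the currently abnormal components
are all invented; the paper's exact scoring differs.

    from typing import Dict, FrozenSet, List, Tuple

    # Toy weighted dependency graph: (cause, effect) -> estimated likelihood
    # that abnormality in the cause explains abnormality in the effect.
    EDGES: Dict[Tuple[str, str], float] = {
        ("SQL Server", "Web app"): 0.9,
        ("Web app", "Client"): 0.7,
        ("DNS", "Client"): 0.1,
    }

    def best_path_weight(src: str, dst: str,
                         edges: Dict[Tuple[str, str], float],
                         seen: FrozenSet[str] = frozenset()) -> float:
        # Maximum product of edge weights over simple paths from src to dst
        # (brute-force search; the `seen` set guards against cycles).
        if src == dst:
            return 1.0
        best = 0.0
        for (a, b), w in edges.items():
            if a == src and b not in seen:
                best = max(best, w * best_path_weight(b, dst, edges, seen | {src}))
        return best

    def rank_causes(affected: str, abnormal: List[str],
                    edges: Dict[Tuple[str, str], float]) -> List[Tuple[str, float]]:
        # Score = (best path weight from candidate to the affected component)
        #   * (candidate's "global impact", here approximated as its mean
        #      path weight to every currently abnormal component).
        candidates = {a for a, _ in edges}
        scores = {}
        for c in candidates:
            reach = best_path_weight(c, affected, edges)
            impact = sum(best_path_weight(c, x, edges) for x in abnormal) / len(abnormal)
            scores[c] = reach * impact
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Example: rank_causes("Client", ["Web app", "Client"], EDGES) ranks
    # "Web app" and "SQL Server" above "DNS" for this toy graph.

A simulation along the lines suggested above could then perturb the
assigned abnormalities and check how often this kind of score still
places the injected root cause first.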