Detailed Diagnosis in Enterprise Networks
Authors
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal,
Jitendra Padhye, Paramvir Bahl
Date
SIGCOMM'09 - August 2009
Novel Idea
Providing detailed diagnosis for distributed systems through
fine-grained analysis of historical data.
Main Results
The paper presents NetMedic, a tool that, with little
application-specific knowledge, uses statistical inference over
historical data to point to likely faulty components in a distributed
system. The project's two main goals are application agnosticism and
detailed diagnosis.
Impact
Small networks whose software exposes the appropriate named counters
can benefit from the tool. In other words, although NetMedic itself is
agnostic to applications, the applications must still expose the
appropriate information.
Evidence
The authors describe limitations of existing tools, namely
coarse-grained indicator variables, a uniform failure-propagation
abstraction (a failing component is assumed to affect all of its
dependents equally), and the lack of support for cycles in failure
propagation (a small sketch of the coarse vs. fine-grained contrast
follows below).
They then describe the mechanism behind their history-based inference,
and only afterwards present the system design in appropriate detail.
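To make that contrast concrete, here is a minimal sketch of how I read
it; the component and counter names below are my own inventions, not
taken from the paper.

    from dataclasses import dataclass, field
    from typing import Dict

    # Coarse model (the limitation noted above): one up/down flag per
    # component, and a "down" component is assumed to affect all of its
    # dependents equally.
    coarse_state: Dict[str, bool] = {"WebServer": True, "SqlClient": False}

    # Fine-grained, NetMedic-style model: each component is a vector of
    # named counters, sampled periodically (roughly once a minute in the
    # paper).
    @dataclass
    class ComponentState:
        name: str
        counters: Dict[str, float] = field(default_factory=dict)

    web_server = ComponentState("WebServer", {
        "cpu_percent": 12.0,            # hypothetical counter names
        "requests_per_sec": 240.0,
        "error_responses_per_sec": 0.2,
    })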
The evaluation is mostly carried out on a deployment of one server plus
ten clients. The comparison point is a generic diagnosis method with
coarse-grained indicator variables and simple failure-propagation
semantics. They evaluate the system, with good results, along several
dimensions, including cases where simultaneous failures occur, where
the historical data is constrained, and where unusual faults are
injected.
Prior Work
The authors mainly build upon inference-based techniques for analyzing
failure scenarios. On the implementation side, they also depend on the
Windows Performance Counter framework.
They also mention that the techniques they use to gather information,
analyze historical data sets, and monitor the system draw on previous
literature on identifying faults within a single machine.
Competitive Work
How do their results compare to related prior or contemporary work?
The authors place related projects in four categories: (a)
inference-based, which is NetMedic's own category; they claim that
prior work in this area focuses on large networks. (b) rule-based; they
argue these systems are not flexible enough, because they depend on a
set of a priori rules. (c) classifier-based; these train a classifier
on indicators captured both when the system malfunctions and when it
functions correctly. (d) single-machine; work that focuses on
identifying faults within a single machine (see the previous section).
Reproducibility
It is difficult to reproduce the experiments. Not enough detail was provided.
Criticism
I believe that this system strikes a good trade-off between black-box
approaches and techniques that rely on information provided from
within the applications (richer information, not just indicator
variables). I still believe that truly fine-grained diagnostic
information for large systems should come from detailed tracing
generated within the applications.
In other words, I believe that this paper presents an interesting
technique for small networks (and with good evaluation results!), but
as a system grows in complexity, so do the variables involved.
Identifying those complex variables with generic "templates" therefore
probably becomes more and more difficult, up to the point of actually
requiring application-specific data generation.
To finish, I would restate that this system is well thought out, and a
good trade-off between black-box techniques and application-generated
complex traces.
On Mon, Nov 1, 2010 at 8:22 PM, Rodrigo Fonseca
<rodrigo...@gmail.com> wrote:
Paper Title |
Detailed Diagnosis in Enterprise Networks |
Author(s) |
Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl |
Date | 2009 |
Novel Idea | Operators of small networks need diagnostic tools that report problems at the granularity of processes and configurations rather than machines. However, this is hard to do in a system-independent way. One primary idea behind this paper is that past system behavior can be used to model expected future behavior, and thus to point to probable causes for erroneous behavior. |
Main Result(s) |
They describe NetMedic, which consists of a number of parts. First, they formalize the components that the system will recognize. Each component is recorded as a vector of state variables, where different component types have different variables with different meanings, e.g., the 'Machine' type has state variables such as CPU utilization, memory usage, disk usage, and the amount of network and other IO. The value of each state variable is recorded once a minute. Then they use a set of dependency templates (mappings from each type of component to each type of component that depends on it) to create a dependency graph over the actual components observed, using the actual connections observed. Then they run the diagnosis algorithm, which takes the current, problematic system state together with a range of historical system states that are not presumed fault-free, but are assumed unaffected by the current problem. They compute the abnormality of each variable in the current state, assuming that its historical values approximately follow a normal distribution, and take the maximum abnormality of a component's variables as the abnormality of the entire component (see the sketch below). Then they do a bunch of math on the differences between the current and historical states for each connected (component, dependent component) pair in the graph and assign the edge between them a weight that indicates their estimate of how likely it is that the component is responsible for the abnormality in the dependent component. They also describe a number of additional extensions to this method that try to approximate the benefits that deeper knowledge of state variable semantics would provide. Finally, they rank each component->dependent component edge based on their estimate of the likelihood that the component is the root cause of the problem. This estimate is based on the product of the path weight between them and the component's global impact. |
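As a rough illustration of the abnormality step just described, here is
a minimal sketch of my reading of it; the exact formula (a two-sided
normal tail via erf) is my own assumption, not the paper's actual code.

    import math
    from statistics import mean, stdev
    from typing import Dict, List

    def variable_abnormality(history: List[float], current: float) -> float:
        # Treat the variable's historical values as roughly normal and ask
        # how far out in the tails the current value sits. Returns a value
        # in [0, 1]; values near 1 mean "very unusual relative to history".
        if len(history) < 2:
            return 0.0
        mu, sigma = mean(history), stdev(history)
        if sigma == 0.0:
            return 1.0 if current != mu else 0.0
        z = abs(current - mu) / sigma
        # Fraction of a normal distribution lying closer to the mean than
        # the current value (erf-based, no external dependencies).
        return math.erf(z / math.sqrt(2.0))

    def component_abnormality(history: Dict[str, List[float]],
                              current: Dict[str, float]) -> float:
        # Component abnormality = max over its variables, as described above.
        return max(variable_abnormality(history[v], current[v]) for v in current)

For example, component_abnormality({"cpu_percent": [10, 12, 11, 13]},
{"cpu_percent": 95}) comes out close to 1, flagging the component as
abnormal.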
Impact | There was so much to write about in this paper... and it's 7 AM... |
Evidence |
They run NetMedic for a month in a real enterprise setting, though they had to run their own servers on which to inject faults. They find that they can correctly rank the at-fault component #1 80% of the time, compared to only 15% for a coarse version with only one variable per component and a drastically simpler edge-weighting scheme. Additionally, they rank it in the top 5 almost 100% of the time. They also run a more controlled experiment, where the coarse method does much better. They somewhat disingenuously use this to claim that NetMedic is even better, since it doesn't degrade as much when moved to a live environment, but this is a bit of a straw man. However, it may illustrate the importance of NetMedic's ability to do more fine-grained differentiation between multiple abnormal components, since a more naive system that appears to work well in a controlled environment might be prone to failure when exposed to many simultaneously abnormal components. They do a few experiments to determine the usefulness of the history given to the diagnosis algorithm. They find a performance drop-off below 30 minutes' worth of history, though this is clearly extremely system-dependent. The more interesting note here is that they find a difference between sampling from active historical periods vs. more passive ones, like day vs. night. This makes a lot of sense, but as they note, which time periods to gather history from for a given problem is an interesting open question -- if a component fails at night, is historical daytime data still more useful? |
Prior Work | They give a good overview of other kinds of fault-detection software, but kinda don't say which ones they mostly got their ideas from. They do classify themselves with existing inference-based fault-detection schemes, but say that most of them target large-scale networks and are thus really different. |
Reproducibility | They do show a lot of math, but show little of the custom code they had to write to augment the Windows Performance Counter framework (and also don't give a lot of details about their exact configuration of that software). Something similar could be made, but to deploying it a lot of research would have to be duplicated. |
Question | This research claims to be aimed at small networks, but also makes a big deal out of being application-agnostic. Is this really a good design choice to push for so hard? Certainly programming special cases for ever application is a bad idea, but would mandating/allowing a larger degree of configurability allow this system to perform well in a much larger range of environments, without sacrificing an unacceptable amount of ease-of-use? |
Criticism |
It seems like they could do better than the full command line for identifying processes -- since plenty of user configuration already appears to be required, they could at least let the operator setting up the system define a regex over command lines that would, say, treat the same executable with different options as one component (a sketch of this idea follows below). The abnormality computation makes a *lot* of assumptions. They look OK for /most/ of the sorts of variables they track, but it's the kind of thing where an operator defining a new kind of component could easily choose variables that adhere to none of their assumptions, and thus produce garbage values. For instance, a component with many interrelated high-variance variables might only really be abnormal when *all* of its state variables take on abnormal values, rather than just one, as they assume by taking the max. They also pick a totally arbitrary threshold for binary abnormality, which they do at least acknowledge, but it's the kind of thing that seems worth revisiting. |
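A minimal sketch of the regex idea mentioned above; the rules and
process names here are invented examples, not anything NetMedic
actually supports.

    import re
    from typing import List, Tuple

    # Operator-supplied rules: a regex over the full command line maps
    # matching processes onto one logical component name (hypothetical).
    GROUPING_RULES: List[Tuple[re.Pattern, str]] = [
        (re.compile(r".*\bsqlservr\.exe\b.*"), "SQL Server"),
        (re.compile(r".*\bw3wp\.exe\b.*"), "IIS worker process"),
    ]

    def component_for(cmdline: str) -> str:
        # Map a process command line to a logical component name, so the
        # same executable launched with different options is treated as
        # one component.
        for pattern, name in GROUPING_RULES:
            if pattern.match(cmdline):
                return name
        # Fall back to the raw command line, which is effectively what
        # NetMedic does today.
        return cmdline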
Ideas for further work |
There are so many little logic bits to fiddle with here! This paper seems ripe for future research. It is a very cool and very useful idea, but they make many assumptions and choose many scoring functions somewhat arbitrarily. It is true that rigorously exploring the space of scoring functions to find the one that will perform best in the real world is fundamentally intractable -- not only is the space infinite, but every real-world scenario will have a different character and be better served by a different function. However, there are many intuitive directions one could take this. Including more user customizability is one; trying to come up with a more complete picture of component abnormality is another. One interesting idea might be to try to come up with a theoretical justification for the path weight * global influence heuristic, and perhaps tweak it if it turns out not to look that good (a toy sketch of the heuristic follows below). This could be done by assuming that abnormalities are assigned correctly (i.e., actually have the assigned probability of being abnormal) and running simulations to determine how well the heuristic identifies the actual causes, or to find the mathematical reasons behind its false positives. |
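To make that heuristic concrete, here is a toy sketch under my own
assumptions: the graph, the edge weights, and the definition of "global
impact" as the mean path weight to the currently abnormal components
are all invented; the paper's exact scoring differs.

    from typing import Dict, FrozenSet, List, Tuple

    # Toy weighted dependency graph: (cause, effect) -> estimated likelihood
    # that abnormality in the cause explains abnormality in the effect.
    EDGES: Dict[Tuple[str, str], float] = {
        ("SQL Server", "Web app"): 0.9,
        ("Web app", "Client"): 0.7,
        ("DNS", "Client"): 0.1,
    }

    def best_path_weight(src: str, dst: str,
                         edges: Dict[Tuple[str, str], float],
                         seen: FrozenSet[str] = frozenset()) -> float:
        # Maximum product of edge weights over simple paths from src to dst
        # (brute-force search; the `seen` set guards against cycles).
        if src == dst:
            return 1.0
        best = 0.0
        for (a, b), w in edges.items():
            if a == src and b not in seen:
                best = max(best, w * best_path_weight(b, dst, edges, seen | {src}))
        return best

    def rank_causes(affected: str, abnormal: List[str],
                    edges: Dict[Tuple[str, str], float]) -> List[Tuple[str, float]]:
        # Score = (best path weight from candidate to the affected component)
        #   * (candidate's "global impact", here approximated as its mean
        #      path weight to every currently abnormal component).
        candidates = {a for a, _ in edges}
        scores = {}
        for c in candidates:
            reach = best_path_weight(c, affected, edges)
            impact = sum(best_path_weight(c, x, edges) for x in abnormal) / len(abnormal)
            scores[c] = reach * impact
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Example: rank_causes("Client", ["Web app", "Client"], EDGES) ranks
    # "Web app" and "SQL Server" above "DNS" for this toy graph.

A simulation along the lines suggested above could then perturb the
assigned abnormalities and check how often this kind of score still
places the injected root cause first.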