Hello,
I've been doing enterprise monitoring for a couple of decades and am ramping up on thinking in Prometheus. A few areas jumped out at me, and I wanted to see how folks handle them in production with Prometheus and its cohorts.
1. Alert rule management. In large environments, political and technical rules can get excessive. For example, IBM's NcKL rules run to a million lines of glorified if-then-else assignments. Obviously Prometheus will cut that down by an order of magnitude, but the rule count will still be significant. How do folks manage large numbers of alert rules?
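(To make sure I have the right mental model: my read of the docs is that each alert lives in a YAML rule group roughly like the below; the group and alert names here are just placeholders I made up.)

    groups:
      - name: node-health
        rules:
          - alert: InstanceDown
            # fires when a scrape target has been unreachable for 5 minutes
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.instance }} unreachable for 5 minutes"

So the question is really about organizing hundreds or thousands of blocks like that across teams.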
2. Aggregated notification. Is there a nice way to aggregate notifications so actionable events are not lost but someone's cell phone isn't overwhelmed with texts?
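(From what I can tell, Alertmanager's grouping settings are the knob for this -- something along these lines, where the receiver name and label choices are just examples:)

    route:
      receiver: team-pager              # example receiver name
      group_by: ['alertname', 'cluster']
      group_wait: 30s       # batch up the initial burst for a new group
      group_interval: 5m    # minimum gap between notifications for a group
      repeat_interval: 4h   # how often to re-notify on still-firing alerts

Is that the intended way to keep the phone from melting, or do folks layer something else on top?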
3. Delayed notification/event queuing. As event numbers grow, event lists and queue systems start to make sense, along with concepts of severity ranking (urgency + importance), etc. Are there add-ons that can handle this? If not, how does Prometheus approach it?
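(My rough guess is that severity ranking is approximated by routing on a severity label, roughly as below; the receiver names are invented:)

    route:
      receiver: default-email
      routes:
        - matchers:
            - severity="critical"
          receiver: oncall-pager
        - matchers:
            - severity="warning"
          receiver: team-email
    receivers:
      - name: default-email    # notification settings elided here
      - name: oncall-pager
      - name: team-email

But that still feels more like routing than true queuing/ranking, hence the question.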
4. Change management. When scheduled maintenance occurs on a component/device/etc., is there an easy way to suppress the associated alerts?
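(I believe silences are the mechanism here -- something like the below with amtool, where the matcher value and comment are made up:)

    amtool silence add instance="db-01:9100" \
      --duration=4h \
      --comment="scheduled maintenance window for db-01" \
      --author=daniel

Is that how folks actually tie this into their change-management process?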
5. Exception faults. Performance metrics can catch a lot of things but will miss component failures such as fan or battery failures, transaction failures, etc. How does Prometheus manage this?
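(My working assumption is that these get turned into metrics too -- e.g., blackbox_exporter's probe_success for transaction-style checks, and some hardware/SNMP gauge for fans or batteries. probe_success is real; the fan metric name below is purely hypothetical:)

    groups:
      - name: fault-style-alerts
        rules:
          - alert: EndpointProbeFailing
            expr: probe_success == 0      # blackbox_exporter synthetic check
            for: 2m
            labels:
              severity: critical
          - alert: ChassisFanFailed
            # hypothetical metric; the real name depends on the exporter/MIB
            expr: hardware_fan_status != 1
            labels:
              severity: warning

But I'm curious how people handle failures that only ever show up as traps/events and never as a pollable value.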
6. Graphical reporting tools. Grafana is great as a dashboard. Is there an equivalent for historical reporting, top failures, etc.? How about a nifty ad hoc report generator? What are folks using in the field?
7. Tracing tools. Prometheus, with its time series data, aligns well with protocol tracing. Are there any tools that provide a nifty front end for consuming, processing, and presenting traces in a more ad hoc and abstracted way (i.e., so they don't have to be reinvented from scratch)?
Sorry, I know it's a lot. Any insights and thoughts are appreciated!!!
Excellent! Thanks! A couple of follow-ups.

#1. Thanks so much for such a detailed response. I fully agree with #1 and the doc, if only because I would have a lot fewer grey hairs. In the past I have compared it to the battle of the 300: when you have well-defined boundaries on input, output, and process, you can really simplify (and mow down) complexity. Often the starting point is reducing variance via configuration control and organized change control, which kills problems before they begin. The only thing to watch out for is putting faith in process instead of people. "It's not my job" becomes much more common, and you lose the heroes who cover the cracks when you implement the spartan approach. One issue, though, is that reality has a way of not complying when companies grow to a certain size, due to a mixture of politics (multiple groups/requirements outside your control) and sheer size (hundreds of thousands of elements on heterogeneous and hybrid systems, networks, clouds, etc.). In these cases false negatives are minimized over false positives, since you do not know what you do not know. Symptom-based monitoring assumes problems can be detected by outward signs before they exceed the threshold of pain. Self-healing and catastrophic cascading events are still common in large/dinosaur organizations with mainframes processing payroll. Reality sucks sometimes. 8-/
That said, I overwhelmingly like the "Sparta" approach, a.k.a. dictating terms for monitoring, over the "you cannot know what you do not know" approach. And there is a case for redirecting pain back to the areas that need to change; the "Sparta" approach definitely does that.

#2. Ah, thanks for the RTFM -- sorry for the noob question. 8-)

#3. Thanks! So I am assuming the design is to plug "alerts" into a ticketing system rather than have an event repository as an intermediary, which does make sense if you focus on actionable events via the #1 Spartan approach.

#4. Cool. I'll RTFM this. I suspect that with the identity emphasis and a nice clean hierarchy for the keys, it will be relatively easy to aggregate as needed.

#5. My thought is that some metrics, such as component failures, are difficult to poll when you scale to very large numbers, and some events are simply events and not pollable. I suspect this will be difficult to integrate, but in the cloud, component failures and some app failures will already be mitigated by the self-healing "holistic" nature of cloud -- transferring loads and such.
#6. What I am wondering here is whether there are tools folks already use for this, so that, for example, ETL-type folks can pull up and run ad hoc queries to their heart's content. (This is really pushing the fringe.)
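(For instance, I'm assuming the plain HTTP API is the hook for that kind of thing; the endpoint is real, while the host and query are just illustrative:)

    curl -s 'http://prometheus.example.com:9090/api/v1/query_range' \
      --data-urlencode 'query=sum by (job) (rate(http_requests_total[5m]))' \
      --data-urlencode 'start=2019-01-01T00:00:00Z' \
      --data-urlencode 'end=2019-01-02T00:00:00Z' \
      --data-urlencode 'step=300'

Something an ETL person could wrap in a script or point a BI tool at.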
#7. Any thoughts or insights on Zipkin or Loki?

Thanks!!
Daniel
I really appreciate the insights and input! I know I am definitely kicking at the periphery here and stretching the product over too huge a scope. No product can cover everything for everyone all the time, no matter how well it is constructed -- and Prometheus is well constructed. This exercise is to help me "grok" the product and "think in Prometheus" by pushing it until it breaks (in thought experiments, initially), which reveals not only its weaknesses but also its strengths; the two are almost always flip sides of the same coin. To do this I am going to delve a bit deeper to better convey what I meant earlier.
For example, imagine if we did this for Netcool, which is still in most Fortune 1000s. When pushed on volume, Netcool stays fast by using a "memory resident" database, much as Spark is faster than traditional Hadoop/MapReduce. When the load first becomes too much, Netcool breaks by sacrificing timeliness (i.e., it buffers) rather than dropping events when things overload. When pressed further, the memory-resident database cannot "scale out." Netcool handles this by commoditizing the "actionable things" into events and making them stateful. It then allows additional Netcool instances to be installed so "information" can be exchanged, pooled, and prioritized via the commoditized events. These can be consumed much like packets in a network protocol to drive dashboards, correlations, etc. Of course, the commoditization and flexibility create too many human-authored rules at multiple levels (all using different proprietary languages, thanks to a string of acquisitions). Further, to be completely flexible at the expense of applicability, Netcool sacrificed "identity." That is, much as DNS BIND is a dot-delimited database with no knowledge of domains, hosts, and IPs, Netcool is not "aware" of host, interface, or property (e.g., CRC errors, drops, counts, etc.). This allowed complex and arbitrary correlations, aggregations, and enrichment. However, if machine learning were used to replace some of these human rules, for example, the arbitrariness of the design would make it impossible to anchor true causation among the many meaningless correlations... Anyway, hopefully that gives you an idea of what I am doing.
So looking at Prometheus, it falls into the traditional performance-management/historical-reporting space with some really cool upgrades and simplifications. It exploits horizontal scaling and the holistic nature of cloud. It emphasizes notification/ticketing rules at a higher level of abstraction -- as generalized queries. It also exploits the development of NoSQL, with massive improvements in identity and speed via label/value pairs. Its "discovery" mechanism is closer to how complex systems are managed in nature -- as trusted, registered wholes versus a collection of parts. (The latter approach is an unfortunate byproduct of the Internet being invented by DARPA, because the military always sees things through the lens of security; in nature, the immune system doesn't pound the living hell out of every cell to discover what constitutes the body. 8-) ) All these traits put Prometheus at the clarity (vs. completeness) end of the Tao. That is, in environments you can influence and control, Prometheus emphasizes compliance and transparency (via identity), which significantly reduces the human rules and configuration needed for everything from alerting to enrichment (topology relationships such as circuit numbers and downstream suppression; business relationships such as which group owns an app, vendor support contracts, rotating on-call notification targets/tags, escalation processes, etc.). The bells and whistles required are quite attainable because of its emphasis on enforcing compliance rather than adapting to the whims of the monitored elements.
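(To make the "generalized queries" point concrete: one label-based expression stands in for what would have been hundreds of per-device rules in the old world. The metric and label names below are only illustrative:)

    # one expression covers every job/instance pair that exposes the metric
    sum by (job, instance) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (job, instance) (rate(http_requests_total[5m]))
      > 0.05

That is the identity/clarity payoff in a nutshell.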
The flip side of these strengths becomes a weakness where the group's ability to control and influence wanes, yet the group is still responsible for a holistic view of situational awareness in IT. Imagine a ridiculously large and diverse environment -- for example, the dynamic, headless state of the Pentagon in the US federal government right now. This ramps up the number and diversity of monitored elements to the point where there are too many gauges and screens. Unless I am willing to scale staff horizontally to cover actionable items, some sort of "summarization" is required, especially if I am the group responsible for the entire mess. In Prometheus these "actionable things" are "alerts," which are effectively events, whether they manifest as tickets, pages, emails, etc. Events transferred between components are like the hidden layer in a deep learning model: they resemble tickets, pages, and emails, only not yet actualized.

Now this situation is compounded by the fact that as you get bigger, the governance of the "monitoring" group covers a much smaller subset of the "monitored elements." Sometimes you can get strong executive sponsorship to force the minions to comply (i.e., not Trump), but even then, if you are on an aggressive acquisition schedule, the transition time for IT migration can effectively be infinite. Worse, there are former CIOs and CTOs in these other companies (i.e., Obama staff) who were not part of the nice takeover and are now politically motivated to make the merger a hard thing. In this case, there are three choices when actionable items outnumber available staff: you can "buffer" and sacrifice timeliness, "drop" and sacrifice completeness, or prioritize and get a mix of the two. In most cases the third option is chosen. The monitoring "patterns" I have seen that mirror Prometheus basically extend the alert area of Prometheus (and the applications north of it) to handle this.
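(The closest built-in lever I have found for the "prioritize" option is Alertmanager inhibition, which mutes lower-severity alerts while a related critical one is firing; the label names here are assumptions on my part:)

    inhibit_rules:
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity="warning"
        # only inhibit when the alerts refer to the same logical thing
        equal: ['alertname', 'cluster', 'service']

It helps, but it is still a long way from a full event-prioritization queue, which is the pattern I described above.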
All this said, you might have found another way. For example, sometimes what I do is decide this sort of customer is a bit too much of a headache and walk away. 8-) Again, thanks so much for your input so far. I have a much better feel for what Prometheus is and how to apply it. Of course there is still a lot to see and learn.
--
DANIEL L. NEEDLES
Principal Consultant
NMSGuru, Inc.
512.233.1000
gu...@nmsguru.com