Enterprise Approaches in Prometheus.


Daniel Needles

Oct 23, 2017, 2:48:39 PM
to Prometheus Users
Hello,
   I've been doing enterprise monitoring for a couple of decades and am ramping up on thinking in Prometheus.  A few areas jumped out at me, and I wanted to see how folks handle them in production with Prometheus and its cohorts.

1.  Alert rule management.  In large environments political and technical rules can get excessive.  For example, IBM's NcKL rules run a million lines of a glorified if-then-else assignment.  Obviously Prometheus will cut that down by an order of magnitude but it will still be significant.  What do folks do around managing significant numbers of alert rules?

2.  Aggregated notification.  Is there a nice way to aggregate notifications so actionable events are not lost but someone's cell phone isn't overwhelmed with texts?

3.  Delayed notification/event queuing.  As event numbers grow, event lists and queue systems start to make sense, along with concepts of severity ranking (urgency + importance), etc.  Are there add-ons that can handle this?  If not, how does Prometheus approach this?

4.  Change management.  When scheduled maintenance occurs on a component/device/etc., is there an easy way to suppress the associated alerts?

5.  Exception faults.  Performance metrics can catch a lot of things but will miss component failures such as fan or battery failures, transaction failures, etc.  How does Prometheus manage this?

6.  Graphical reporting tools.  Grafana is great as a dashboard.  Is there an equivalent for historical reporting, top failures, etc.?  How about a nifty ad hoc report generator?  What are folks using in the field?

7.  Tracing tools.  Prometheus with its time series data aligns well with protocol tracing.  Are there any tools that provide a nifty front end for consuming, processing, and presenting traces in a more ad hoc and abstracted way (i.e. so they don't have to be reinvented from scratch)?

Sorry I know it's a lot.  Any insights and thoughts are appreciated!!!

dnee...@stonedoorgroup.com

Oct 23, 2017, 5:24:07 PM
to Prometheus Users
A bit of clarification on #3.  Prometheus is NOT an event-based system, but it produces events.  That is, the checks it performs will produce exceptions, threshold triggers, etc.  These are effectively events.  Now, small enterprises can simply forward these as pages, texts, etc.  However, for larger organizations this will produce quite a few events.  Instead of hiring a lot of people to field these events, the events are buffered and prioritized.  It seems to me that Prometheus could "feed" a system like this, similar to how IBM ITM feeds IBM Netcool.  Or is the expectation to keep the events to a low number?  In short, does Prometheus "punt" on this problem, or am I missing something?

Brian Brazil

Oct 23, 2017, 6:06:59 PM
to Daniel Needles, Prometheus Users
On 23 October 2017 at 19:48, Daniel Needles <dnee...@gmail.com> wrote:
Hello,
   I've been doing enterprise monitoring for a couple of decades and am ramping up on thinking in Prometheus.  A few areas jumped out at me, and I wanted to see how folks handle them in production with Prometheus and its cohorts.

1.  Alert rule management.  In large environments political and technical rules can get excessive.  For example, IBM's NcKL rules run a million lines of a glorified if-then-else assignment.  Obviously Prometheus will cut that down by an order of magnitude but it will still be significant.  What do folks do around managing significant numbers of alert rules?

We recommend source control for managing rules. 

A million lines worth of alert rules sounds excessive. I'd suggest reading https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit for how we think about things. As a rough rule of thumb, I've heard it claimed that every 100 alerts costs one full-time engineer, between handling the alerts when they fire and updating them over time.
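For a concrete sketch of what goes into source control: in the YAML rule format of Prometheus 2.x, an alerting rule file might look roughly like this (the job name, metric, and threshold are purely illustrative):

# rules/frontend.yml -- illustrative sketch only
groups:
  - name: frontend
    rules:
      - alert: HighErrorRate
        # assumes a standard http_requests_total counter with a 'code' label
        expr: sum(rate(http_requests_total{job="frontend",code=~"5.."}[5m])) / sum(rate(http_requests_total{job="frontend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of frontend requests are failing"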
 

2.  Aggregated notification.  Is there a nice way to aggregate notifications so actionable events are not lost but someone's cell phone isn't overwhelmed with texts?

This is core functionality of the Alertmanager; you configure how to group alerts into notifications.
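As a rough sketch of that configuration (receiver name, labels, and timings here are just placeholders):

route:
  receiver: team-oncall
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s      # wait briefly so related alerts land in one notification
  group_interval: 5m   # how often to send updates for an existing group
  repeat_interval: 4h  # how often to re-send while alerts keep firing
receivers:
  - name: team-oncall
    email_configs:
      - to: 'oncall@example.com'   # placeholder address

Grouping by labels like cluster or service means one page summarising many related alerts rather than one text per alert.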
 

3.  Delayed notification/event queuing.  As event numbers grow, event lists and queue systems start to make sense, along with concepts of severity ranking (urgency + importance), etc.  Are there add-ons that can handle this?  If not, how does Prometheus approach this?

Incident management is out of scope for Prometheus; we recommend some form of ticketing system for this.
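If you do want alerts to flow into such a system, the Alertmanager's generic webhook receiver is the usual integration point; a minimal sketch (the ticketing endpoint is hypothetical):

receivers:
  - name: ticketing
    webhook_configs:
      # hypothetical endpoint that opens or updates tickets from alert payloads
      - url: 'https://ticketing.example.com/prometheus-hook'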
 

4.  Change management.  When scheduled maintenance occurs on a component/device/etc., is there an easy way to suppress the associated alerts?

The Alertmanager supports silencing alerts in advance.
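Silences are created ahead of time through the Alertmanager web UI or its API rather than in the config file. As a config-level complement, an inhibit rule can mute alerts while a maintenance alert is firing; the alert name below is hypothetical:

inhibit_rules:
  - source_match:
      alertname: MaintenanceWindow   # hypothetical alert raised for the duration of planned work
    target_match:
      severity: page
    equal: ['instance']              # only mute alerts for the same instance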
 

5.  Exception faults.  Performance metrics can catch a lot of things but will miss component failures such as fan or battery failures, transaction failures, etc.  How does Prometheus manage this?

If it's a metric, you can create an alert on it.
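For example, hardware health is usually surfaced as metrics by exporters (node_exporter's hwmon collector, the SNMP exporter, IPMI exporters, and so on), so a fan failure becomes an ordinary alerting rule; the metric name here is hypothetical and depends on the exporter:

groups:
  - name: hardware
    rules:
      - alert: ChassisFanFailure
        expr: chassis_fan_status != 1   # hypothetical gauge where 1 means OK
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fan failure reported on {{ $labels.instance }}"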
 

6.  Graphical reporting tools.  Grafana is great as a dashboard.  Is there an equivalent for historical reporting, top failures, etc.?  How about a nifty ad hoc report generator?  What are folks using in the field?

The same API Grafana uses is available for you to use in your reporting scripts.
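The HTTP API (/api/v1/query and /api/v1/query_range) takes ordinary PromQL, so historical and "top failures" reports are just queries run from a script on whatever schedule you like; an illustrative query (metric name assumed):

# top 5 jobs by number of 5xx responses over the past 30 days
topk(5, sum by (job) (increase(http_requests_total{code=~"5.."}[30d])))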
 

7.  Tracing tools.  Prometheus with its time series data aligns well with protocol tracing.  Are there any tools that provide a nifty front end for consuming, processing, and presenting traces in a more ad hoc and abstracted way (i.e. so they don't have to be reinvented from scratch)?

Prometheus is not designed for tracing or event logging; we'd recommend using tools specifically designed for this purpose.

Brian
 

Sorry I know it's a lot.  Any insights and thoughts are appreciated!!!

Daniel Needles

Oct 23, 2017, 6:58:28 PM
to Brian Brazil, Prometheus Users
Excellent!  Thanks!  A couple follow ups.  

#1  Thanks so much for such a detailed response.  I fully agree with #1 and the doc, simply for the fact that I would have a lot fewer grey hairs.  In the past I have compared it to the battle of the 300.  When you have well-defined boundaries on input, output and process, you can really simplify (and mow down) complexity.  Often the starting point is reduction of variance via configuration control and organized change control, which kills problems before they begin.  The only thing you have to watch out for is faith in process instead of people.  "It's not my job" becomes much more common and you lose your heroes that cover the cracks when you implement the Spartan approach.

One issue, though, is that reality has a way of not complying when companies grow to a certain size, due to a mixture of politics (multiple groups/requirements outside your control) and sheer size (hundreds of thousands of elements on heterogeneous and hybrid systems, networks, clouds, etc.).  In these cases false negatives are minimized over false positives, as you do not know what you do not know.  Symptom-based monitoring assumes problems can be detected by outward signs prior to exceeding the threshold of pain.  Self-healing and catastrophic cascading events are still common in large/dinosaur organizations with mainframes processing payroll.  Reality sucks sometimes. 8-/

That said, I do overwhelmingly like the "Sparta" approach, aka dictating terms for monitoring, over the "you cannot know what you do not know" approach to monitoring.  And there is a case for redirecting pain back to the areas that need to change.  The "Sparta" approach definitely does that.

#2.  Ah thanks for the RTFM -- sorry for the noob question.    8-)

#3.  Thanks!  So I am assuming the design is to plug "alerts" into a ticketing system rather than have an event repository intermediary, which does make sense if you focus on actionable events via #1 Spartan approach.

#4.  Cool.  I'll RTFM this.  I suspect that with the emphasis on identity and a nice clean hierarchy for the keys, it will be relatively easy to aggregate as needed.

#5.  My thought is that some metrics are difficult to poll when you scale to very large numbers, such as component failures, and some events are simply events and not pollable.  I suspect this will be difficult to integrate, but with the cloud, component failures and some app failures will already be mitigated by the self-healing, "holistic" nature of cloud - transferring loads and such.

#6.  What I am wondering here is whether there are tools folks already use that perform this function, so that, for example, ETL-like folks can pull up and run ad hoc queries to their delight.  (This is really pushing the fringe.)

#7.  Any thoughts, insights on Zipkin or Loki?

Thanks!!
Daniel

--
DANIEL L. NEEDLES
Principal Consultant
NMSGuru, Inc.
512.233.1000
gu...@nmsguru.com

Ben Kochie

Oct 24, 2017, 3:18:19 AM
to Daniel Needles, Brian Brazil, Prometheus Users
On Tue, Oct 24, 2017 at 12:58 AM, Daniel Needles <dnee...@gmail.com> wrote:
Excellent!  Thanks!  A couple follow ups.  

#1  Thanks so much for such a detailed response.  I fully agree with #1 and the doc, simply for the fact that I would have a lot fewer grey hairs.  In the past I have compared it to the battle of the 300.  When you have well-defined boundaries on input, output and process, you can really simplify (and mow down) complexity.  Often the starting point is reduction of variance via configuration control and organized change control, which kills problems before they begin.  The only thing you have to watch out for is faith in process instead of people.  "It's not my job" becomes much more common and you lose your heroes that cover the cracks when you implement the Spartan approach.

One issue, though, is that reality has a way of not complying when companies grow to a certain size, due to a mixture of politics (multiple groups/requirements outside your control) and sheer size (hundreds of thousands of elements on heterogeneous and hybrid systems, networks, clouds, etc.).  In these cases false negatives are minimized over false positives, as you do not know what you do not know.  Symptom-based monitoring assumes problems can be detected by outward signs prior to exceeding the threshold of pain.  Self-healing and catastrophic cascading events are still common in large/dinosaur organizations with mainframes processing payroll.  Reality sucks sometimes. 8-/

One of the things we designed into Prometheus was the idea that there can be many, many Prometheus instances in an organization.  This allows for high granularity of political control over monitoring.  It can provide natural isolation against the reality you're talking about.  For example, at SoundCloud there are over 20 different Prometheus servers handling different uses: one for all the hardware node monitoring, one for this front-end app, another for all database metrics.

All of these Prometheus servers forward their alerts to a single organization-wide alertmanager cluster that knows all of the notification methods, teams, etc.  This allows for central silencing and de-duplication.

The big advantage here is that a team can fuck up their Prometheus server, say with too much data, or a bad recording rule, or a bad dashboard, without causing an outage for other well managed production critical Prometheus instances.
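For illustration, each team's Prometheus points its alerting at the shared Alertmanager cluster in its own prometheus.yml; the host names below are made up:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1.example.com:9093'
            - 'alertmanager-2.example.com:9093'
rule_files:
  - '/etc/prometheus/rules/*.yml'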
 

That said, I do overwhelmingly like the "Sparta" approach, aka dictating terms for monitoring, over the "you cannot know what you do not know" approach to monitoring.  And there is a case for redirecting pain back to the areas that need to change.  The "Sparta" approach definitely does that.

#2.  Ah thanks for the RTFM -- sorry for the noob question.    8-)

#3.  Thanks!  So I am assuming the design is to plug "alerts" into a ticketing system rather than have an event repository intermediary, which does make sense if you focus on actionable events via #1 Spartan approach.

#4.  Cool.  I'll RTFM this.  I suspect that with the emphasis on identity and a nice clean hierarchy for the keys, it will be relatively easy to aggregate as needed.

#5.  My thought is that some metrics are difficult to poll when you scale to very large numbers, such as component failures, and some events are simply events and not pollable.  I suspect this will be difficult to integrate, but with the cloud, component failures and some app failures will already be mitigated by the self-healing, "holistic" nature of cloud - transferring loads and such.

Again, the way we do things is we don't do events, but we like to poll aggregated event counts.  This scales very well.
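To make that concrete: instead of shipping each failure as an event, the application exports a counter and the alert fires on its rate; the counter name below is hypothetical:

groups:
  - name: payroll
    rules:
      - alert: PayrollJobFailures
        # hypothetical counter incremented by the application on each failed job
        expr: increase(payroll_job_failures_total[15m]) > 0
        labels:
          severity: page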

I'm not sure exactly what you're getting at with "cloud component failures", but Prometheus was developed to monitor highly dynamic container/VM environments.  By using dynamic discovery APIs, Prometheus will easily keep up with high-churn infrastructure.


#6.  What I am wondering here is whether there are tools folks already use that perform this function, so that, for example, ETL-like folks can pull up and run ad hoc queries to their delight.  (This is really pushing the fringe.)

#7.  Any thoughts, insights on Zipkin or Loki?

Thanks!!
Daniel

Daniel Needles

Oct 24, 2017, 9:46:30 AM
to Ben Kochie, Brian Brazil, Prometheus Users

I really appreciate the insights and input!   I know I am definitely kicking at the periphery here and stretching the product over too huge a scope.  No product can cover everything for everyone all the time, and that is despite how well Prometheus is constructed.  This exercise is to help me “grok” the product and “think in Prometheus” by pushing it until it breaks (in thought experiments initially), which reveals not only its weaknesses but also its strengths, which are almost always flip sides of the same coin.   To do this I am going to delve a bit deeper to better convey what I meant earlier.

For example, imagine if we did this for Netcool, which is still in most Fortune 1000s.  When pushed on volume, Netcool stays fast and quick by using a “memory resident” database, similar to how Spark is faster than traditional Hadoop/MapReduce.   When the load initially becomes too much, Netcool breaks by sacrificing timeliness (i.e. it buffers) instead of dropping events when things overload.  When pressed further, the memory-resident database cannot “scale out”.  Netcool handles this by commoditizing the “actionable things” into events and making them stateful.  Then it allows the installation of additional Netcools so “information” can be exchanged, pooled, and prioritized via the commoditized events.  These can be consumed much like packets in a network protocol to drive dashboards, correlations, etc.  Of course the commoditization and flexibility create too many human-based rules at multiple levels (all using different proprietary languages due to a string of acquisitions).  Further, to be completely flexible at the expense of applicability, Netcool sacrificed “identity”.  That is, much like DNS BIND is a dot-delimited database with no knowledge of domains, hosts and IPs, Netcool is not “aware” of host, interface, or property (i.e. CRC errors, drops, counts, etc.).  This allowed complex and arbitrary correlations, aggregations, and enrichment.  However, if, for example, machine learning were used to replace some of these human rules, the arbitrariness of the design would make it impossible to anchor to true causations among the many meaningless data correlations.  Anyway, hopefully that gives you an idea of what I am doing.

 

So looking at Prometheus, it falls into the traditional performance management/historical reporting space with some really cool upgrades and simplifications.  It exploits scaling horizontally and the holistic nature of cloud.  It emphasizes notification/ticketing rules at a higher level of abstraction – as generalized queries.  It also exploits the development of NoSQL, with massive improvements in identity and speed via the variable/value pairs.  Its “discovery” mechanism is closer to how complex systems are managed in nature – as trusted, registered wholes versus a collection of parts.  (The latter approach is an unfortunate byproduct of the Internet being invented by DARPA, because the military always sees things through the lens of security, but in nature the immune system doesn’t pound the living hell out of every cell to discover what constitutes the body. 8-)  All these traits put Prometheus on the clarity (vs. completeness) end of the Tao.  That is, in environments you can influence and control, Prometheus emphasizes compliance and transparency (via identity), which will significantly reduce the necessary human rules and configurations, from alerting to enrichment (topology relationships (circuit numbers, downstream suppression, etc.), business relationships (what group owns this app, contract to vendor support, rotating on-call notification targets/tags, escalation process, etc.)).  The bells and whistles required are quite attainable due to its emphasis on enforcing compliance over adapting to the whims of the monitored elements.

The flip side of these strengths becomes a weakness where the group’s ability to control and influence wanes and yet it is still responsible for a holistic view of situational awareness in IT.  Imagine a ridiculously large and diverse environment – for example, the dynamic, headless state of the Pentagon in the US Federal Government right now.  This ramps up the number and diversity of monitored elements to the point where there are too many gauges and screens.  Unless I am willing to scale horizontally on staff to cover actionable items, some sort of “summarization” is required, especially if I am the group responsible for the entire mess.  In Prometheus these “actionable things” are “alerts”, which effectively are events, whether they are manifested as tickets, pages, emails, etc.  Events transferred between components are like the hidden layer in a deep learning model, in that they are like tickets, pages, and emails but simply haven’t been actualized.  Now this situation is compounded by the fact that as you get bigger, the governance of the “monitoring” group will cover a much smaller subset of the “monitored elements.”  Sometimes you can get strong executive sponsorship to force the minions to comply (i.e. not Trump), but even with that, if you are on an aggressive acquisition schedule, the transition time for IT migration can effectively be infinite.   Worse, there are former CIOs and CTOs in these other companies (i.e. Obama staff) who were not part of the nice takeover and are now politically motivated to make the merger a hard thing.  In this case, there are three choices when actionable items outnumber available staff.  You can “buffer” and sacrifice timeliness, “drop” and sacrifice completeness, or prioritize and get a mix of the two.  In most cases the third option is chosen.  The monitoring “patterns” I have seen that mirror Prometheus in this case basically extend the alerting area of Prometheus (and the applications north of it) to handle this.

 

All this said, you might have found another way.   For example, sometimes what I do is decide this sort of customer is a bit too much of a headache and walk away.  8-)   Again, thanks so much for your input so far.  I have a much better feel for what Prometheus is and how to apply it.  Of course there is still a lot to see and learn.


Daniel Needles

Oct 24, 2017, 5:52:08 PM
to Prometheus Users
I had a bit more of an epiphany.  

The Prometheus approach actually models traditional methods, but with the significant improvements mentioned earlier.

Netcool, Nimsoft, ITM, etc. all basically have a loosely categorized three-layer model - collection, aggregation, and display.  In Prometheus, collection equates to the Exporters and Client Libraries, and aggregation to Retrieval, Storage, and exposing PromQL (along with alertmgr and push).  The display layer equates to Grafana and the like.

For example, the Netcool equivalents are Probes and SSM for Client Libraries and Exporters.  The OMNIbus core is Retrieval, Storage, and PromQL (OMNIbus similarly scrapes from the probes on a regular basis using a variable granularity, which defaults to a minute).  Prometheus' Alertmanager is contained within the functionality of IBM Impact, while Prometheus push is what Netcool calls a gateway.  Grafana is what IBM tried to do with WebSphere/TIP with spotty success - sort of what you get from using a Portal/CMS developed in the 90s.

There are differences.  Effectively, the rows in Prometheus' NoSQL store represent "events".  In fact, in Netcool the events are also stored in a flat table - alerts.status.  One big difference is that Netcool uses a SQL table, so a single row in Netcool is broken out in Prometheus (for better accessibility), and every value in Prometheus is strongly tied to its source, something not hardwired into Netcool.  What takes up a million lines of mapping code in Netcool is equivalent to the work of the Client Libraries and Exporters, which normalize the data to key-value pairs.  Because of "strong typing" in Prometheus this is a few orders of magnitude cleaner, though some flexibility is lost.  Also, Prometheus mirrors IBM's ITM product (originally Candle), which also monitors by polling systems and applications.  However, Prometheus uses modern constructs; though it has fewer widgets, Prometheus is much more flexible and a "tad" cheaper.  8-)

 Anyway thanks for all the help.  I see the mapping now and can explain it to the client.  8-)


dnee...@redhat.com

Oct 27, 2017, 12:20:39 PM
to Prometheus Users
One more interesting detail.   The "scraping" behavior is slightly different for IBM Netcool vs Prometheus: Netcool falls into the traditional fault management category, while Prometheus is modeled as a performance management tool that can also be used for fault detection, similar to Candle (now IBM ITM).  The omni.dat file in Netcool is the equivalent of prometheus.yml in Prometheus.  In Netcool, external systems and gateways, as well as "presentation" to users, are scraped on a periodic basis based on granularity; but because Netcool is a fault management system, ingestion is done in real time via its probes.  This is not done in Prometheus.  What this does is change the sync/async line of the tool.  In Netcool it sits between (users & external systems) and the datastore.  In Prometheus this line is pushed down into the equivalent of the Netcool probes, at the RESTful endpoints they expose: things below the RESTful calls are realtime/async, while the RESTful API provides a periodic/sync pull point.  This costs granularity and buys simplicity and scalability.
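For reference, the periodic pull described above is configured in prometheus.yml roughly like this (targets and interval are illustrative):

global:
  scrape_interval: 15s   # the Prometheus analogue of OMNIbus "granularity"
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['db01.example.com:9100', 'db02.example.com:9100']   # hypothetical hosts running node_exporter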