CloudWatch Exporter Feature Request: Config option for how to treat 'missing' data per metric


tlo...@rmn.com

unread,
Apr 30, 2018, 5:51:02 PM4/30/18
to Prometheus Developers
tl;dr -- I'd like to add a feature that allows users to use per-metric config options to specify how to treat missing data from CloudWatch. This will allow me to treat missing Sum(HTTPCode_Backend_4XX) data as 0, which then allows me to perform Prometheus operations on that data.


Currently, I am unable to calculate my availability SLI using CloudWatch ELB metrics. The cloudwatch_exporter reports incorrect values (either missing, or equal to the previous gauge value)[0], which then interferes with Prometheus operator math. When Prometheus calculates (2xx+4xx)/total, the result is often either over 100% or missing entirely. In production, 4xx datapoints are often missing, so my SLIs are missing as well.
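Concretely, the query looks roughly like this (metric names here are my assumption of the exporter's naming convention and may differ in your setup):

```
(
  sum(aws_elb_httpcode_backend_2_xx_sum)
+ sum(aws_elb_httpcode_backend_4_xx_sum)
)
/
sum(aws_elb_request_count_sum)
```

When no 4xx datapoints were reported, the 4xx series is absent, the addition matches nothing, and the whole expression evaluates to no data.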

CloudWatch expects users to interpret missing data on a per-metric basis, based on "Reporting criteria". In some cases, missing means no data. In others (e.g. ELB HTTPCode_Backend_*) missing data means no occurrences, i.e. zero[1].

CloudWatch Alarms is one example where users are intended to interpret missing data[2]. For example, an ELB HTTPCode_Backend_5XX alarm would treat missing data as 'notBreaching', when 5xx errors occur infrequently.

For the prometheus/cloudwatch_exporter to fully support the CloudWatch Metrics API, it also needs a mechanism that lets users decide how to handle missing data. I propose we add an additional per-metric configuration option to MetricRule that triggers post-processing of the Datapoints in each scrape. It might look like this:

```
  - aws_namespace: 'AWS/ELB'
    aws_metric_name: 'HTTPCode_Backend_4XX'
    aws_dimensions: ['LoadBalancerName', 'AvailabilityZone']
    aws_dimension_select:
      LoadBalancerName: ['my-loadbalancer']
    aws_statistics: ['Sum']
    treat_missing_data_as: 0

  - aws_namespace: 'AWS/ELB'
    aws_metric_name: 'HealthyHostCount'
    aws_dimensions: ['LoadBalancerName', 'AvailabilityZone']
    aws_dimension_select:
      LoadBalancerName: ['my-loadbalancer']
    aws_statistics: ['Sum']
    # treat_missing_data_as not specified, defaults to treat-as-missing
```

By default we would keep the existing behavior: missing data is treated as missing. Optionally, users can specify a substitute value. That could be a predefined string ('zero'), an integer (0, 100, etc.), or something else.
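To make the intended semantics concrete, here is a rough sketch of the post-processing step (in Python purely for illustration -- the exporter itself is Java, and `treat_missing_data_as` is just the proposed option name):

```python
def apply_missing_data_policy(datapoints, rule):
    """If CloudWatch returned no datapoints for a configured metric and the
    rule opts in via treat_missing_data_as, fabricate a single datapoint
    with the configured value; otherwise return the result unchanged."""
    fill = rule.get("treat_missing_data_as")
    if not datapoints and fill is not None:
        return [{"value": float(fill)}]
    return datapoints

# A 4XX rule that opts in, and a HealthyHostCount rule that does not.
rule_4xx = {"aws_metric_name": "HTTPCode_Backend_4XX", "treat_missing_data_as": 0}
rule_hosts = {"aws_metric_name": "HealthyHostCount"}

print(apply_missing_data_policy([], rule_4xx))    # [{'value': 0.0}]
print(apply_missing_data_policy([], rule_hosts))  # []
```

The key point is that the substitution only happens for rules that explicitly opt in; everything else keeps today's behavior.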

I'd be happy to help write or review a Pull Request on this.


[0] https://github.com/prometheus/cloudwatch_exporter/blob/master/src/main/java/io/prometheus/cloudwatch/CloudWatchCollector.java#L367-L370
[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-cloudwatch-metrics.html
[2] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data

Brian Brazil

unread,
Apr 30, 2018, 5:56:56 PM4/30/18
to tlo...@rmn.com, Prometheus Developers
On 30 April 2018 at 22:51, tlovett via Prometheus Developers <prometheus...@googlegroups.com> wrote:
tl;dr -- I'd like to add a feature that allows users to use per-metric config options to specify how to treat missing data from CloudWatch. This will allow me to treat missing Sum(HTTPCode_Backend_4XX) data as 0, which then allows me to perform Prometheus operations on that data.


Currently, I am unable to calculate my availability SLI using CloudWatch ELB metrics. The cloudwatch_exporter reports incorrect values (either missing or equivalent to previous gauge value)[0], which then interferes with Prometheus operator math. When Prometheus calculates (2xx+4xx)/total, the result is often either over 100% or missing data. In production, 4xx's are often missing, so my SLIs are missing as well.

The Cloudwatch exporter only exposes the data that Cloudwatch exposes; it doesn't invent data that doesn't exist, as that's not fundamentally possible - we don't know which time series are meant to exist and which are not. This is something that Cloudwatch really needs to change on their end, so I suggest using the techniques at https://www.robustperception.io/existential-issues-with-metrics/ to deal with such suboptimal metrics.
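[The core of the technique in that article is to fall back to an always-present series multiplied by zero, along these lines -- metric names assumed, adjust to your exporter's output:]

```
(
  (aws_elb_httpcode_backend_2_xx_sum or aws_elb_request_count_sum * 0)
+ (aws_elb_httpcode_backend_4_xx_sum or aws_elb_request_count_sum * 0)
)
/ aws_elb_request_count_sum
```

The `or` operator keeps the left-hand series where it exists and fills in the zeroed request-count series (with its labels) where it doesn't, so the arithmetic never goes empty.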

Brian
 


tlo...@rmn.com

unread,
Apr 30, 2018, 6:21:52 PM4/30/18
to Prometheus Developers
Yeah, CloudWatch Metrics does not cleanly fit the Prometheus model, but I think we can support this translation of CloudWatch to Prometheus:

We do know which series are _meant_ to exist -- any series which the user lists in their config file.

From there it _is_ fundamentally possible to invent data (at least with certain statistic types), by simply creating a DataPoint with value 0, for example, if the CW API returns an empty list of Datapoints.

As long as the user explicitly configures the exporter to do that, it seems reasonable to do so. And for the gauge metric type used by the library, with simple Sum-of-counts metrics, it seems to fit within the Prometheus model. Certainly it may not be possible to do in _all_ circumstances, but we can write validation logic around it.

Am I misunderstanding something?

Brian Brazil

unread,
Apr 30, 2018, 7:10:14 PM4/30/18
to tlo...@rmn.com, Prometheus Developers
On 30 April 2018 at 23:21, tlovett via Prometheus Developers <prometheus...@googlegroups.com> wrote:
Yeah, CloudWatch Metrics does not cleanly fit the Prometheus model, but I think we can support this translation of CloudWatch to Prometheus:

We do know which series are _meant_ to exist -- any series which the user lists in their config file.

The user doesn't list series in their config file, they list metrics.
 

From there it _is_ fundamentally possible to invent data (at least with certain statistic types), by simply creating a DataPoint with value 0, for example, if the CW API returns an empty list of Datapoints.

We don't know the labels for the series. We don't know when the series would start and stop. We can't fabricate samples out of thin air.
 

As long as the user explicitly configures the exporter to do that, it seems reasonable to do so. And for the gauge metric typed used by the library, with simple Sum-of-counts metrics, it seems to fit within the Prometheus model. Certainly it may not be possible to do in _all_ circumstances, but we can write validation logic around it.

Am I misunderstanding something?

If you want to try and do something like this you're better off working with PromQL. An exporter like the CloudWatch exporter only exposes the data that CloudWatch exposes; it's not the right place to add heuristics specific to certain users' environments.
