CloudWatch Exporter / Prometheus / Alertmanager and Lambda

Stuart Pelton

Jan 9, 2023, 6:50:59 AM
to Prometheus Users
Hello all,
I have inherited a Prometheus system to look after. It was set up by the person before me, and I'm just finding my feet with how it all works.

So as far as I can see, CloudWatch Exporter gets the info from CloudWatch, passes it to Prometheus, then on to Alertmanager, which then posts to (in this case) PagerDuty.
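(For reference, a minimal Prometheus scrape job for the exporter would look something like the below — a sketch only, assuming the exporter runs on its default port 9106 and the job name is just an example:)

  - job_name: cloudwatch_exporter
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9106']   # CloudWatch Exporter's default port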

My question is: the Lambda alerts do not seem to fire when there is an issue, so I'm unsure if this is set up correctly.

Does anyone have any example CloudWatch Exporter and Prometheus alert files they could share for scraping Lambda errors? I have the setup below but it doesn't seem to work (sorry, noob to this). Or is there a better option than CloudWatch Exporter?

CloudWatch Exporter file:

  - aws_namespace: AWS/Lambda
    aws_metric_name: ConcurrentExecutions
    aws_dimensions: []
    aws_statistics: [Average]
  - aws_namespace: AWS/Lambda
    aws_metric_name: Errors
    aws_dimensions: [FunctionName,Resource]
    aws_statistics: [Sum]

  - aws_namespace: AWS/Lambda
    aws_metric_name: lambda_auth_errors
    aws_dimensions: []
    aws_statistics: [Sum]

Prometheus alert rules file content:

  #QUEUEPROCESSOR_ERRORS
  - alert: FUNCTION-QUEUEPROCESSOR_ERRORS
    expr: (aws_lambda_errors_sum{functionname="function-QueueProcessor"} offset 8m) > 0
    labels:
      severity: error
      capability: function
      service: aws/lambda
    annotations:
      summary: "Multiple LAMBDA Errors "
      description: "There has been more than 1 LAMBDA errors within 30 minutes for Function Capability"
      category: "Software/System"
      subcategory: "Problem/Bug"
      instance: "Function Capability - P"
      environment: "Production"

Alertmanager yml content:

    #FUNCTION-LAMBDA
    - match:
        capability: function
        service: aws/lambda
      receiver: function-lambda


#FUNCTION-LAMBDA
- name: 'function-lambda'
  pagerduty_configs:
  - routing_key: 'xxxxxxx'
    severity: '{{if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower}}{{ else }}error{{ end}}'
    description: '[FIRING:{{ .Alerts.Firing | len }}] {{ .CommonAnnotations.summary }}'

Brian Candler

Jan 9, 2023, 10:03:03 AM
to Prometheus Users
You'll have to find out which bit isn't working:
- the data collection from aws?
- the alerting expression?
- the alert delivery from alertmanager to pagerduty?

Start by doing PromQL queries in the prometheus' own web interface:

    aws_lambda_errors_sum     # do you get any results?

    aws_lambda_errors_sum{functionname="function-QueueProcessor"}    # do you get any results?

    (aws_lambda_errors_sum{functionname="function-QueueProcessor"} offset 8m) > 0    # do you get any results when there's an error?

(This last expression, by the way, is a bit silly. By including "offset 8m" you've just delayed the alert by 8 minutes, but you've not done anything to hide spurious alerts. You'll still get the same alert, just 8 minutes late!)
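(If the intent in the annotation really is "more than 1 error within 30 minutes", something along these lines would be closer — a sketch only, since the exporter publishes the Sum statistic as a gauge, so the window needs tuning against the exporter's period and your scrape interval to avoid double-counting samples:)

    - alert: FUNCTION-QUEUEPROCESSOR_ERRORS
      expr: sum_over_time(aws_lambda_errors_sum{functionname="function-QueueProcessor"}[30m]) > 1
      for: 5m   # require the condition to hold for a while, to suppress one-off blips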

If you do get data from the last query at times when there's a problem, then focus on why your alerts aren't being delivered.  There are metrics from alertmanager(*) which will tell you how many alerts have been received, how many delivery attempts have been made, and how many delivery failures there have been.
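(For example, once you're scraping Alertmanager as per the footnote below, these are the standard Alertmanager metric names to query:)

    alertmanager_alerts_received_total                                # alerts received from Prometheus
    alertmanager_notifications_total{integration="pagerduty"}         # delivery attempts per integration
    alertmanager_notifications_failed_total{integration="pagerduty"}  # failed deliveries per integration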

If you don't get data from these queries, then your problem is with data collection - dig further on that side.

Good luck,

Brian.

(*) Since you didn't show the full prometheus.yml I don't know if you're collecting alertmanager metrics.  You'd need something like this:

  - job_name: alertmanager
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9093']

Then you can find out what metrics are being collected with this query:

{job="alertmanager"}

Stuart Pelton

Jan 10, 2023, 4:29:43 AM
to Prometheus Users
Hi Brian,

Thanks for getting back to me. OK, so it seems that all the above queries, when run, come back with empty results. Does that mean there are no alerts or errors, or does it mean that no data is being collected?
I can see the following on the exporter's metrics page:

# HELP cloudwatch_requests_total API requests made to CloudWatch
# TYPE cloudwatch_requests_total counter
cloudwatch_requests_total{action="listMetrics",namespace="AWS/RDS",} 1164900.0
cloudwatch_requests_total{action="getMetricStatistics",namespace="Account",} 3880587.0
cloudwatch_requests_total{action="listMetrics",namespace="AWS/ApplicationELB",} 582450.0
cloudwatch_requests_total{action="getMetricStatistics",namespace="AWS/ApplicationELB",} 9809199.0
cloudwatch_requests_total{action="listMetrics",namespace="AWS/StorageGateway",} 291225.0
cloudwatch_requests_total{action="listMetrics",namespace="Account",} 1456125.0
cloudwatch_requests_total{action="listMetrics",namespace="System/Linux",} 291225.0
cloudwatch_requests_total{action="getMetricStatistics",namespace="AWS/RDS",} 3.9647125E7
cloudwatch_requests_total{action="listMetrics",namespace="AWS/Usage",} 291225.0
cloudwatch_requests_total{action="getMetricStatistics",namespace="AWS/Lambda",} 291225.0
cloudwatch_requests_total{action="listMetrics",namespace="AWS/Lambda",} 291225.0


and I get the below:
# HELP aws_lambda_concurrent_executions_average CloudWatch metric AWS/Lambda ConcurrentExecutions Dimensions: [] Statistic: Average Unit: Count
# TYPE aws_lambda_concurrent_executions_average gauge
aws_lambda_concurrent_executions_average{job="aws_lambda",instance="",} 4.781021897810219 1673340420000

When I run that it pulls back data in the graph, but if I run aws_lambda_concurrent_executions_average{job="aws_lambda",instance=""} and add in the relevant Lambda function label, it pulls back nothing.

Do you have any example configs I could take a look at and learn from?

Stu

Stuart Pelton

Jan 10, 2023, 6:03:48 AM
to Prometheus Users
Hi, found the issue and now it's pulling back the Lambda data OK. It seems the config.yml had some stray spaces in it that it didn't like.
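(For anyone else who hits this: YAML indentation has to be consistent, so even one extra space breaks the parse. An illustrative example, not my actual config:)

  metrics:
    - aws_namespace: AWS/Lambda
      aws_metric_name: Errors
       aws_dimensions: [FunctionName, Resource]   # one stray extra space here and the file fails to parse
      aws_statistics: [Sum]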

<phew>
