Help with Exporting Kueue Metrics to CloudWatch Using CloudWatch Agent + Prometheus

Nishanth Reddy

Jun 26, 2025, 3:32:40 AM
to wg-batch
Hi Kueue team,

I’m currently working on exporting Kueue metrics to Amazon CloudWatch using the CloudWatch Agent with Prometheus integration, and I’ve run into an issue that I was hoping to get your input on.

What I’m trying to do:
I’ve set up a Prometheus scrape_config targeting the kueue-controller-manager pod, and metrics are being scraped successfully. I confirmed this by curling the /metrics endpoint from within the CloudWatch Agent container (a sketch of an equivalent check follows the config below).

The relevant prometheus.yaml includes:

# ConfigMap: Prometheus scrape config
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: amazon-cloudwatch
data:
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'kueue-controller-manager'
        metrics_path: /metrics
        sample_limit: 10000
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: .*
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: Namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container_name
          - source_labels: [__meta_kubernetes_pod_controller_name]
            action: replace
            target_label: pod_controller_name
          - source_labels: [__meta_kubernetes_pod_controller_kind]
            action: replace
            target_label: pod_controller_kind
          - source_labels: [__meta_kubernetes_pod_phase]
            action: replace
            target_label: pod_phase
          - target_label: job
            replacement: kueue-controller-manager
          # Optional: pull in all Kubernetes pod labels as Prometheus labels
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
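For anyone reproducing this, here is a minimal sketch of an equivalent check run from outside the cluster. The kueue-system namespace, the control-plane=controller-manager pod label, and port 8080 are assumptions based on a default Kueue install, not values from my cluster; adjust to your deployment (metrics may sit behind kube-rbac-proxy on 8443, in which case use https and a bearer token):

    # Sketch -- namespace, pod label, and port are assumed defaults.
    POD=$(kubectl -n kueue-system get pod \
      -l control-plane=controller-manager \
      -o jsonpath='{.items[0].metadata.name}')

    # List the container port names -- the first relabel rule above
    # keys off __meta_kubernetes_pod_container_port_name.
    kubectl -n kueue-system get pod "$POD" \
      -o jsonpath='{range .spec.containers[*].ports[*]}{.name}{"\n"}{end}'

    # Confirm the kueue_* series are exposed (add an Authorization:
    # Bearer header and https on 8443 if kube-rbac-proxy fronts metrics).
    kubectl -n kueue-system port-forward "pod/$POD" 8080:8080 &
    sleep 2
    curl -s http://localhost:8080/metrics | grep -c '^kueue_'
    kill $!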
And my cwagentconfig.json includes:

# ConfigMap: CloudWatch Agent config for Prometheus metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "agent": {
        "region": "eu-north-1",
        "debug": true
      },
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "log_group_name": "/aws/containerinsights/Cluster9EE0221C-test12345/prometheus",
            "cluster_name": "Cluster9EE0221C-7e214e03e5374e27bd58561567510b4c",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["job"],
                  "label_matcher": "kueue-controller-manager",
                  "dimensions": [["ClusterName"]],
                  "metric_selectors": [
                    "^kueue_admitted_workloads_total$",
                    "^kueue_cluster_queue_status$",
                    "^kueue_evicted_workloads_total$",
                    "^kueue_pending_workloads$",
                    "^kueue_quota_reserved_workloads_total$",
                    "^kueue_reserving_active_workloads$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }
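For reference, a sketch of how one can check whether any kueue_ EMF events reach the log group at all, as opposed to being dropped inside the agent before emission (log group name and region copied from the config above):

    # Sketch: search the Prometheus log group for kueue_ EMF events.
    aws logs filter-log-events \
      --region eu-north-1 \
      --log-group-name "/aws/containerinsights/Cluster9EE0221C-test12345/prometheus" \
      --filter-pattern '"kueue_"' \
      --max-items 5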
What’s going wrong:
The CloudWatch Agent log shows that metrics are scraped, but many are being dropped with messages like:

Dropped metric: no metric declaration matched metric name

Metrics like controller_runtime_webhook_requests_total and workqueue_depth are expected to be dropped, but I don’t see any of the kueue_* metrics being emitted to CloudWatch, even though they are visible in the /metrics output. I’ve also verified that the job label on those metrics is correctly set to kueue-controller-manager. I suspect a label mismatch or an unexpected metric structure that doesn’t match the metric_declaration logic in the CloudWatch Agent.
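Since the metric_selectors above are anchored with ^...$, the exposed metric names have to match exactly. A quick cross-check is to dump the distinct kueue_ names from the /metrics output and compare them against the selectors; a sketch, assuming the same port-forward to localhost:8080 as in the earlier check:

    # Sketch: list the distinct kueue_ metric names actually exposed and
    # compare them against the anchored metric_selectors above.
    curl -s http://localhost:8080/metrics \
      | grep -oE '^kueue_[a-zA-Z0-9_]+' | sort -u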
Question:
Can you confirm whether the Kueue metrics (kueue_admitted_workloads_total, etc.) are expected to be exposed under the job=kueue-controller-manager label? And if not, what is the correct label or configuration to capture them via Prometheus scraping?

Appreciate any guidance or examples you could provide!