Hi Kueue team,

I’m currently working on exporting Kueue metrics to Amazon CloudWatch using the CloudWatch Agent with Prometheus integration, and I’ve run into an issue that I was hoping to get your input on.

What I’m trying to do:

I’ve set up a Prometheus scrape_config targeting the kueue-controller-manager pod, and metrics are being scraped successfully. I confirmed this by curling the /metrics endpoint from within the CloudWatch Agent container.

The relevant prometheus.yaml includes:

# ConfigMap: Prometheus scrape config
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: amazon-cloudwatch
data:
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'kueue-controller-manager'
        metrics_path: /metrics
        sample_limit: 10000
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: .*
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: Namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod_name
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container_name
          - source_labels: [__meta_kubernetes_pod_controller_name]
            action: replace
            target_label: pod_controller_name
          - source_labels: [__meta_kubernetes_pod_controller_kind]
            action: replace
            target_label: pod_controller_kind
          - source_labels: [__meta_kubernetes_pod_phase]
            action: replace
            target_label: pod_phase
          - target_label: job
            replacement: kueue-controller-manager
          # Optional: pull in all Kubernetes pod labels as Prometheus labels
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
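
For what it’s worth, the keep rule on __meta_kubernetes_pod_container_port_name with regex .* matches every pod, so this job currently scrapes far more than the Kueue controller. A tighter selection along the lines of the sketch below is what I’d move to; it assumes the default kueue-system namespace and the control-plane: controller-manager pod label from the standard Kueue manifests, neither of which I’ve double-checked for my install:

        # Sketch only: restrict pod discovery to the Kueue controller.
        # kueue-system and control-plane=controller-manager are assumptions
        # based on the default Kueue install; adjust to your deployment.
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace]
            action: keep
            regex: kueue-system
          - source_labels: [__meta_kubernetes_pod_label_control_plane]
            action: keep
            regex: controller-manager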

And my cwagentconfig.json includes:

# ConfigMap: CloudWatch Agent config for Prometheus metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "agent": {
        "region": "eu-north-1",
        "debug": true
      },
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "log_group_name": "/aws/containerinsights/Cluster9EE0221C-test12345/prometheus",
            "cluster_name": "Cluster9EE0221C-7e214e03e5374e27bd58561567510b4c",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["job"],
                  "label_matcher": "kueue-controller-manager",
                  "dimensions": [["ClusterName"]],
                  "metric_selectors": [
                    "^kueue_admitted_workloads_total$",
                    "^kueue_cluster_queue_status$",
                    "^kueue_evicted_workloads_total$",
                    "^kueue_pending_workloads$",
                    "^kueue_quota_reserved_workloads_total$",
                    "^kueue_reserving_active_workloads$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }
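
Eventually I’d also like per-queue dimensions, roughly like the sketch below. This assumes the kueue_* series carry a cluster_queue label; I haven’t confirmed that label name, so treat it as a placeholder:

{
  "source_labels": ["job"],
  "label_matcher": "^kueue-controller-manager$",
  "dimensions": [["ClusterName"], ["ClusterName", "cluster_queue"]],
  "metric_selectors": [
    "^kueue_pending_workloads$",
    "^kueue_admitted_workloads_total$"
  ]
}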
What’s going wrong:
The CloudWatch Agent log shows that metrics are being scraped, but many of them are dropped with messages like:

Dropped metric: no metric declaration matched metric name

Metrics such as controller_runtime_webhook_requests_total and workqueue_depth are expected to be dropped, but I don’t see any of the kueue_* metrics being emitted to CloudWatch either, even though they are visible in the /metrics output. I’ve also verified that the job label on those metrics is correctly set to kueue-controller-manager.

I suspect this is either a label mismatch or an unexpected metric structure that doesn’t match the metric_declaration logic in the CloudWatch Agent.
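
To rule out the metric_selectors regexes themselves, a deliberately broad declaration like the sketch below is the next thing I’d try; if even this emits nothing, the mismatch is presumably in the label matching rather than in the selectors:

{
  "source_labels": ["job"],
  "label_matcher": ".*",
  "dimensions": [["ClusterName"]],
  "metric_selectors": ["^kueue_.*"]
}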
Question:
Can you confirm whether the Kueue metrics (kueue_admitted_workloads_total, etc.) are expected to be exposed under the job=kueue-controller-manager label? And if not, what is the correct label or configuration to capture them via Prometheus scraping?

I’d appreciate any guidance or examples you could provide!