Re: [kubernetes/kubernetes] Allow specifying window over which metrics for HPA are collected (#57660)


Frederic Branczyk

unread,
Jan 2, 2018, 10:20:22 AM1/2/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

@kubernetes/sig-instrumentation-feature-requests @kubernetes/sig-api-machinery-api-reviews

Note that, by design, the core/custom metrics APIs are intentionally not historical: they only return a single value. Historical metrics APIs are still to be done, possibly as extensions to the existing APIs or as entirely new APIs; there are a lot of open questions.

Calculations/aggregations are a non-goal of the metrics APIs; it's up to the backing monitoring system to perform these. Otherwise it wouldn't be a backing monitoring system but a backing TSDB, and we'd be back to Heapster.

More precisely, the custom-metrics-api returns a single scalar rather than a vector or a series of data points, meaning the "query" can be anything that obeys that rule.
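
To make the single-scalar contract concrete, here is a minimal sketch in Python. The function name `reduce_to_scalar`, the sample series, and the two reducers are illustrative assumptions and not part of any Kubernetes API; the point is only that whatever the backing monitoring system computes internally, the consumer of the API sees one number.

```python
from statistics import mean
from typing import Callable, Sequence


def reduce_to_scalar(points: Sequence[float],
                     reducer: Callable[[Sequence[float]], float]) -> float:
    """Collapse a series of data points into the single value an API would return."""
    if not points:
        raise ValueError("cannot reduce an empty series")
    return float(reducer(points))


# Hypothetical series held by the backing monitoring system.
series = [0.9, 0.6, 0.8, 0.7]

# Two different "queries" that both satisfy the single-scalar rule.
latest = reduce_to_scalar(series, lambda pts: pts[-1])  # most recent sample
average = reduce_to_scalar(series, mean)                # average over the series

print(latest, average)
```

Either reducer is acceptable under the rule described above; which one is appropriate is a decision for the monitoring system, not the metrics API.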


You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Frederic Branczyk

unread,
Jan 2, 2018, 10:20:50 AM1/2/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

discordianfish

unread,
Jan 2, 2018, 10:28:32 AM1/2/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Okay, so the types I linked to would be returned by the backing monitoring system based on a "query"? So Window wouldn't be set by the reporter (as I assumed) but by the monitoring system?

So in the case of the HPA, it would require configuration of a query (whether abstracted or specific to the backing system)? That would work I assume.

Solly Ross

unread,
Jan 3, 2018, 3:30:02 PM1/3/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

> So Window wouldn't be set by the reporter (as I assumed) but by the monitoring system?

Correct -- it's more intended as information that consumers can use to judge the freshness of rate metrics (like CPU).

Historically, we've shied away from exposing the interval knob directly to users because it was felt to be an implementation detail, and because we'd like to make it unnecessary to set that knob at all (e.g. by automatically adjusting the refresh interval, or something similar). Additionally, for the spike case, we could implement better algorithms that use more than one data point to calculate a better derivative, use a PID loop, etc. However, I could definitely buy that the above is highly hypothetical, and that we should have a more concrete solution to the problems people have now.
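
As a rough illustration of using the window to judge freshness, the sketch below treats a sample as covering the interval [Timestamp-Window, Timestamp]. The `MetricSample` class, its field names, and the `max_age_seconds` threshold are hypothetical and chosen for this example only; they are not the actual API types.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    timestamp: float       # end of the measured interval, in seconds
    window_seconds: float  # length of the interval; 0 for instantaneous metrics
    value: float


def covered_interval(sample: MetricSample) -> tuple:
    """Return the (start, end) interval the rate was calculated over."""
    return (sample.timestamp - sample.window_seconds, sample.timestamp)


def is_fresh(sample: MetricSample, now: float, max_age_seconds: float) -> bool:
    """True if the sample's interval ended no more than max_age_seconds before now."""
    return now - sample.timestamp <= max_age_seconds


sample = MetricSample(timestamp=1000.0, window_seconds=30.0, value=0.8)
print(covered_interval(sample))                     # interval the value describes
print(is_fresh(sample, now=1020.0, max_age_seconds=60.0))
```

A consumer can read the window this way without being able to choose it, which matches the answer above: the window is reported by the monitoring system, not requested by the caller.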

Daniel Smith

unread,
Jan 4, 2018, 4:23:08 PM1/4/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

I think this is not api machinery?

Solly Ross

unread,
Jan 10, 2018, 10:53:41 AM1/10/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

> I think this is not api machinery?

Correct, I think the API review request added that tag.

Frederic Branczyk

unread,
Jan 10, 2018, 11:04:38 AM1/10/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Sorry about that, that was my mistake.

Piotr Szczesniak

unread,
Jan 11, 2018, 11:11:27 AM1/11/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

fejta-bot

unread,
Apr 11, 2018, 12:44:58 PM4/11/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

discordianfish

unread,
Apr 11, 2018, 1:15:42 PM4/11/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale
/lifecycle freeze

fejta-bot

unread,
Jul 10, 2018, 2:04:27 PM7/10/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Solly Ross

unread,
Jul 16, 2018, 2:39:35 PM7/16/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Oct 14, 2018, 3:23:32 PM10/14/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

Frederic Branczyk

unread,
Oct 15, 2018, 3:21:28 AM10/15/18
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Jan 13, 2019, 2:49:36 AM1/13/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

discordianfish

unread,
Jan 17, 2019, 5:32:50 PM1/17/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

It's beyond me how this bot is acceptable to anyone. I'm going to just unsubscribe here.

Frederic Branczyk

unread,
Jan 18, 2019, 3:05:16 AM1/18/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Apr 18, 2019, 4:19:57 AM4/18/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

Frederic Branczyk

unread,
Apr 18, 2019, 5:16:04 AM4/18/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Jul 17, 2019, 5:32:42 AM7/17/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

Anton Bessonov

unread,
Jul 17, 2019, 2:22:05 PM7/17/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

Anton Bessonov

unread,
Jul 17, 2019, 2:22:07 PM7/17/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Activity!

fejta-bot

unread,
Oct 15, 2019, 3:08:25 PM10/15/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale



Anton Bessonov

unread,
Oct 15, 2019, 3:11:57 PM10/15/19
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

Matthias Bertschy

unread,
Jan 5, 2020, 12:22:46 PM1/5/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

I just came after reading an article linking to here...
@brancz if you really want this to advance you should open a KEP and find a sponsoring SIG.
This is the standard way to propose API changes.

Frederic Branczyk

unread,
Jan 6, 2020, 2:08:05 AM1/6/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

I'm fully aware of how this project works, and I am actively thinking about this in a larger scope than just the HPA, potentially iterating on the metrics APIs we defined as SIG Instrumentation.

fejta-bot

unread,
Apr 5, 2020, 3:26:51 AM4/5/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Anton Bessonov

unread,
Apr 5, 2020, 5:56:37 PM4/5/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Jul 4, 2020, 6:50:50 PM7/4/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

Anton Bessonov

unread,
Jul 5, 2020, 5:00:53 AM7/5/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

Han Kang

unread,
Aug 26, 2020, 12:22:52 PM8/26/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/assign @serathius

fejta-bot

unread,
Nov 24, 2020, 11:29:22 AM11/24/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Anton Bessonov

unread,
Nov 24, 2020, 2:29:24 PM11/24/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

Sam Weston

unread,
Nov 24, 2020, 5:30:25 PM11/24/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

@Bessonov Have you seen the new autoscaling features in 1.18? If not, what is missing from that for your use case?

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-configurable-scaling-behavior

Anton Bessonov

unread,
Nov 24, 2020, 7:09:13 PM11/24/20
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

@cablespaghetti

> Have you seen the new autoscaling features in 1.18?

Thank you, I hadn't seen it.

> If not, what is missing from that for your use case?

Hm, I'm not sure that it's the same:

> When the metrics indicate that the target should be scaled down the algorithm looks into previously computed desired states and uses the highest value from the specified interval.

It seems like something different from the original request:

> Now imagine a traffic pattern with short dips. If there is a dip in the calculation window, it will cause the HPA to scale down aggressively. Same applies the other way around, if there is a spike spanning the calculation window, it will add lots of replicas even if the spike is gone already.
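
The difference can be sketched in a few lines of Python. The function below mimics only the quoted scale-down rule (take the highest previously computed desired state within an interval); the name `stabilized_replicas`, the history format, and the numbers are assumptions made for illustration, not the HPA controller's implementation.

```python
def stabilized_replicas(history, now, window_seconds):
    """Pick the highest desired replica count computed within the window.

    history is a list of (timestamp_seconds, desired_replicas) pairs and
    should include the most recent recommendation.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


# Hypothetical recommendations: a dip in traffic lowers the desired count.
history = [(0, 10), (60, 4), (120, 3)]

print(stabilized_replicas(history, now=120, window_seconds=300))
print(stabilized_replicas(history, now=120, window_seconds=90))
```

Note that this smooths the decisions derived from the metric. It does not change the window over which the metric itself was calculated, which is what the original request is about.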

fejta-bot

unread,
Feb 22, 2021, 7:49:16 PM2/22/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

Anton Bessonov

unread,
Feb 23, 2021, 12:05:31 AM2/23/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

fejta-bot

unread,
May 24, 2021, 1:50:41 AM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Anton Bessonov

unread,
May 24, 2021, 3:26:36 AM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/remove-lifecycle stale

Marek Siarkowicz

unread,
May 24, 2021, 3:48:15 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/lifecycle frozen

Marek Siarkowicz

unread,
May 24, 2021, 4:14:22 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

What you are proposing is basically changing the logic to make decisions over a larger window than just the latest data. As such, it's not clear where this problem should be solved (either in the HPA by collecting multiple samples, or in Metrics Server by allowing a window to be picked).

The current direction of Metrics Server is to speed up metric collection and shorten the reporting window, to provide the freshest data available when the HPA makes scaling decisions, which is a somewhat different direction than this issue. Forcing Metrics Server to store metrics over a large window would result in resource overhead, as it is not aware of HPAs and would have to store metrics to match the largest window configured for any pod running in the cluster.

Looking at currently existing solutions, this problem is somewhat similar to VPA, which collects resource usage over a larger time window and makes decisions based on that. As such, I would first delegate this problem to SIG Autoscaling, to see if they can introduce similar behavior in the HPA.

Marek Siarkowicz

unread,
May 24, 2021, 4:15:04 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Marek Siarkowicz

unread,
May 24, 2021, 4:15:14 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/unassign

Anton Bessonov

unread,
May 24, 2021, 4:26:26 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

> What you are proposing is basically changing the logic to make decision over a larger window then just latest data.

I'm not the OP, but I'm interested in the opposite: a more responsive scaling reaction for some pods. We have some pods which are cheap to scale up and down. It would be nice to reduce the window for the average.

Marek Siarkowicz

unread,
May 24, 2021, 4:48:26 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

For the next release of Metrics Server, v0.5.0, we are proposing a default metrics resolution of 15s. This reduces the average scale-up time from around 2 minutes in v0.4.4 to around 60s. This is defined as the time needed from a container starting to the HPA making its next autoscaling decision based on metrics collected from it. It includes a 10-20s cAdvisor calculation window, a 15s HPA decision period, a 15-20s container start time and a 15s metric resolution window, but doesn't include your node provisioning time. More details here: kubernetes-sigs/metrics-server#763

Is this something that will help for your use case?
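
As a quick sanity check on the "around 60s" figure, the components listed above can be summed. The sketch below assumes the two ranges quoted without units are in seconds; the dictionary keys are just labels for this example.

```python
# Per-component delays as (low, high) in seconds, taken from the message above.
# Assumption: the "10-20" and "15-20" ranges are in seconds.
components = {
    "cAdvisor calculation window": (10, 20),
    "HPA decision period": (15, 15),
    "container start time": (15, 20),
    "metric resolution window": (15, 15),
}


def total_range(parts):
    """Sum the low and high ends of every component."""
    low = sum(lo for lo, _ in parts.values())
    high = sum(hi for _, hi in parts.values())
    return low, high


low, high = total_range(components)
midpoint = (low + high) / 2
print(f"scale-up latency: {low}-{high}s, midpoint {midpoint}s")
```

Under that assumption the total falls between 55s and 70s, which is consistent with the stated average of around 60s.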

Anton Bessonov

unread,
May 24, 2021, 5:06:28 PM5/24/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

Yeah, I think it would be a great improvement for my use case! Thank you for the information!

Marek Siarkowicz

unread,
May 25, 2021, 3:10:16 AM5/25/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

/unassign @gjtempleton @mwielgus

Jonathan Juares Beber

unread,
Jul 1, 2021, 3:13:43 PM7/1/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

I still fail to understand how this window can remain so hidden. In the issue description, @discordianfish wrote: "...without specifying the window over which this was observed. Haven't found where it's defined but it seems like it's ~1-5 minutes." That is still the case.

The resource metrics pipeline docs say that:

> This value is derived by taking a rate over a cumulative CPU counter provided by the kernel (in both Linux and Windows kernels). The kubelet chooses the window for the rate calculation.

There's no documentation on the kubelet to identify that.

I understand that the rate interval over which the metric is captured is an implementation detail, but it affects users. Here is an example:

Take a very spiky CPU-consuming application.
If I query a metric like sum(rate(container_cpu_usage_seconds_total{pod_name=~"PODNAME", namespace="NAMESPACE"}[1m]))
[image: graph of the query above]

I'm not sure what value the kubelet would capture and, as a consequence, what value the HPA would use. It varies from 0.9 to 0.6 cores. If I request 1 core, will the HPA trigger at 60%?

Now if the interval window is 5 minutes: sum(rate(container_cpu_usage_seconds_total{pod_name=~"PODNAME", namespace="NAMESPACE"}[5m]))
[image: graph of the query above]

The value still varies, but only between 0.75 and 0.81, and clearly wouldn't trigger HPA actions (considering a 1 CPU request and a 60% trigger).
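
The effect of the window can be reproduced with a small simulation. This is a simplified sketch, not what the kubelet or Prometheus actually does: the workload (alternating 1.0 and 0.5 cores each minute), the 15s sample step, and all function names are assumptions made for illustration.

```python
STEP = 15  # seconds between counter samples (assumed)


def build_counter(duration_s):
    """Cumulative CPU-seconds counter for a workload alternating 1.0/0.5 cores per minute."""
    samples, total = [], 0.0
    for t in range(0, duration_s + STEP, STEP):
        samples.append((t, total))
        usage = 1.0 if (t // 60) % 2 == 0 else 0.5
        total += usage * STEP
    return samples


def rate(samples, end, window_s):
    """Average increase per second over the window ending at samples[end]."""
    start = end - window_s // STEP
    (t0, v0), (t1, v1) = samples[start], samples[end]
    return (v1 - v0) / (t1 - t0)


def spread(samples, window_s):
    """Minimum and maximum rate seen once 5 minutes of history exist."""
    first = 300 // STEP
    values = [rate(samples, i, window_s) for i in range(first, len(samples))]
    return min(values), max(values)


samples = build_counter(600)
one_minute = spread(samples, 60)
five_minutes = spread(samples, 300)
print("1m window:", one_minute)
print("5m window:", five_minutes)
```

With this synthetic workload the 1m rate swings between 0.5 and 1.0 cores while the 5m rate stays between 0.7 and 0.8, so the same application can look very different depending on a window the user cannot see.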

Currently, I'm struggling to understand a) how to get a value close to what the HPA will use, and b) what to report to the application developers in this case.

A bit of searching suggests there's no consensus on what the right interval is:

I can see the value in letting applications like that configure their interval. I also see a lot of value in making it clear in the documentation, so that monitoring systems do not have a huge discrepancy between what's reported and the decisions made by the HPA.

Alexi Kessler

unread,
Aug 19, 2021, 5:55:58 PM8/19/21
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention

I would personally see a ton of value in exposing the interval knob. I'm not sure I understand why it's reasonable to expose

spec:
  behavior:
    scaleDown:
      policies:
      - type: Percent
        value: 20
        periodSeconds: 30

but it counts as "leaking implementation details" to add a more tunable metric threshold. At my org we tune our Prometheus queries and alerts to different levels of sensitivity in different environments, so that alerts need different thresholds and threshold durations to trigger in each one. When an alert triggers, somebody goes in and edits the replicas or resource requests in the Deployment.

The value add I see in migrating to HPAs is to remove human actors from the replica side of that process. If I'm going to hand over the scaling of my services to automation, I would hope that the scaling behavior would be at least as customizable as what I have now.

As a more concrete use case:

  • In my test cluster I want to scaleUp and scaleDown in big bounds, as it's fine to have a few 5xx errors. However, I only want to make those changes when I'm sure that the test cluster has reached a steady state of requiring more resources, whether due to more users or systems. I don't want to react to spikes in test, as, to be honest, they're not worth the resources. I basically want the scaling to be intentionally insensitive, and large.
  • In my prod cluster, I want to scaleUp and scaleDown very conservatively, but start to do so immediately when I surpass a threshold. I am perfectly fine wasting some short-lived replicas on a spike if it means that I can get ahead of a building negative user experience.

The two cases strike me as opposite behaviors. Without being able to tune the threshold period, I find it hard to trust that the underlying metric calculation will optimally handle both of my use cases, especially when I can't find it in the docs.

@cablespaghetti I think this might answer your question about why the 1.18 scaling behaviors are not fully sufficient? To be clear, I absolutely love having the scaling behaviors; they make HPAs tons more customizable! I just feel like HPAs as they stand right now in v2beta2 are missing the final knob that I need to fully migrate to them.



Pranshu Srivastava

unread,
Oct 3, 2025, 11:53:51 AM10/3/25
to kubernetes/kubernetes, k8s-mirror-api-machinery-api-reviews, Team mention
rexagod left a comment (kubernetes/kubernetes#57660)

/assign

	// indicates the window ([Timestamp-Window, Timestamp]) from
	// which these metrics were calculated, when returning rate
	// metrics calculated from cumulative metrics (or zero for
	// non-calculated instantaneous metrics).
	WindowSeconds *int64 `json:"windowSeconds,omitempty" protobuf:"bytes,4,opt,name=windowSeconds"`

Hello, is this effort still relevant?


Message ID: <kubernetes/kubernetes/issues/57660/3366263743@github.com>
