Hello Fellow Knativers,
Currently, Knative has a significant cold-start time relative to other serverless platforms - about 7 seconds with Istio and 4 without, compared to the sub-second cold starts which seem common elsewhere. In addition to the obvious user-experience issues this causes during scale-up windows, it also has significant cost effects for service providers: how much we have to overshoot capacity to avoid blocking requests and how quickly we can scale down are both functions of this cold-start time, and the result is a cost that must be passed on to users.
AFAIK we don't have a stated target for cold-start time anywhere, so I think getting agreement on this is a good place to start. For reference, we see about a 500ms upper bound on cold-start times in OpenWhisk; Lambda (according to random internet articles) varies wildly by language but generally seems to be less than this. I've seen 1 second mentioned in several conversations and that seems like a good initial target to me, with the caveat that long term it is probably too conservative a target considering the landscape. Thoughts?
In addition to getting consensus on a target, I'd also like to get some organization around the current time sinks and potential mitigations (a revival of https://github.com/knative/serving/issues/1297). While we seem well aware that Istio accounts for a little over 2 seconds, there are still 4 seconds to account for, and it'd be great to drill into those! I've also created https://github.com/knative/serving/issues/2485 to track work related to reducing the remaining Istio startup time.
Approximately 2 of these seconds I've measured to be the baseline pod startup time for Kubernetes, which is concerning given the proposed (conservative) target of 1 second. I am very interested to know whether anyone has looked into where this time is spent and how we might work around it. I could certainly see us doing something like a custom scheduler and DaemonSet in the short term while we investigate upstream solutions. I've created https://github.com/knative/serving/issues/2484 to track work and information related to this.
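For anyone who wants to reproduce that baseline number, here's a rough sketch of the kind of measurement I mean (this is not the actual harness; it assumes a recent client-go, a kubeconfig in the default location, and a pause image already pulled on the node so image pulls don't skew the result):

// Rough sketch (not a real harness): time how long it takes a trivial pod
// with a pre-pulled image to go from "created" to "Running".
// Assumes a recent client-go and a kubeconfig in the default location.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "coldstart-probe-"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:            "probe",
				Image:           "k8s.gcr.io/pause:3.1", // pre-pulled so image pull time is excluded
				ImagePullPolicy: corev1.PullIfNotPresent,
			}},
			RestartPolicy: corev1.RestartPolicyNever,
		},
	}

	start := time.Now()
	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// Poll until the pod reports Running; crude, but fine for a ballpark number.
	for {
		p, err := client.CoreV1().Pods("default").Get(context.TODO(), created.Name, metav1.GetOptions{})
		if err == nil && p.Status.Phase == corev1.PodRunning {
			fmt.Printf("pod running after %v\n", time.Since(start))
			return
		}
		time.Sleep(50 * time.Millisecond)
	}
}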
This still leaves us with another ~2.5 seconds unaccounted for, which I'd love to get information on. There's a great breakdown at https://github.com/knative/serving/issues/1297#issuecomment-399226386 which seems to have info on this, but it may be a bit stale at this point. Any components of time we can break out into an issue we can drive down on would be great.
Finally, maybe we could get a milestone created to burn down this startup-time work? I don't really see this as something we ever 'complete', but I find these milestones very helpful for pointing contributors to sets of issues around a particular topic.
Hi Greg, thanks for putting this together!

Can you put these numbers into perspective? How were they produced and measured, for instance? What was the cluster size, and how far did you scale? It will be very important going forward to have a continuous and reproducible way of generating these numbers.
I think that we will ultimately need to push upstream for the ability to make node-local scheduling decisions, for a handful of reasons, this included. The major reason in my mind is that cold starts put the k8s control plane on our data plane, which gives me concerns about cold-start reliability SLOs. Writing this up has been on my TODO list for a while. :(

A strawman I've proposed in the past is that we run the activator as a privileged DaemonSet that lame-ducks when its node reaches capacity, but otherwise co-schedules pods to handle buffered traffic while the "real" pods are started.
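To make the lame-ducking part of that strawman a bit more concrete, here's a minimal sketch of the idea (the handler names and the capacity check are made up for illustration, not an actual activator API): the DaemonSet pod fails its readiness probe once its node is at capacity, so it drops out of the Service endpoints while it keeps draining the traffic it has already buffered.

// Hypothetical sketch only: a node-local activator that "lame ducks" by
// failing its readiness probe once it hits capacity, while it keeps serving
// the traffic it has already buffered.
package main

import (
	"net/http"
	"sync/atomic"
)

type activator struct {
	inflight int64 // requests currently buffered/being proxied on this node
	capacity int64 // made-up knob: how much this node's activator will take on
}

func (a *activator) atCapacity() bool {
	return atomic.LoadInt64(&a.inflight) >= a.capacity
}

// readyz would be wired to the DaemonSet's readinessProbe. Failing it removes
// this node's activator from the Service endpoints (the lame-duck state)
// without killing the process, so buffered requests can still drain.
func (a *activator) readyz(w http.ResponseWriter, r *http.Request) {
	if a.atCapacity() {
		http.Error(w, "lame duck: node at capacity", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// handle stands in for the real work: buffer the request and forward it once
// the revision's pod is up (or a co-scheduled local pod can take it).
func (a *activator) handle(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&a.inflight, 1)
	defer atomic.AddInt64(&a.inflight, -1)
	// ... buffer / forward to the user pod here ...
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	a := &activator{capacity: 100}
	http.HandleFunc("/readyz", a.readyz)
	http.HandleFunc("/", a.handle)
	http.ListenAndServe(":8080", nil)
}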
There are some rudimentary facilities for local pod creation today, but they are poorly documented and of questionable support. I also don't know their latency SLOs, and thockin waved us off of them.
Inject Envoy locally: resurrect https://github.com/istio/istio/issues/3816 for a function we could run locally, and then figure out the Istio mesh and mTLS.
On Wed, Nov 14, 2018 at 8:50 AM Joe Burnett <joseph...@google.com> wrote:

I don't know if a new area is necessary. How about a Milestone like Markus was suggesting, "Scaling: Sub-Second Cold-Start"? Initial issues might be:
- Automate cold-start timing collection. I think that +se...@redhat.com was working on this previously.
- Kubelet API for creating pods locally. Mostly tracking the upstream work.
- Activator as a daemon set. We can do this early and schedule locally as an experiment. This will allow us to play around with lame-ducking based on capacity too.
- Prototype local scheduling. A hack using https://kubernetes.io/docs/tasks/administer-cluster/static-pod/#configuration-files (rough sketch below).
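For flavor, here's a very rough sketch of what that static-pod hack might look like: a node-local agent with write access to the kubelet's manifest directory drops a pod manifest there, so the kubelet starts it without a round trip through the scheduler. The path, image, and pod spec are placeholders, and this ignores all of the lifecycle questions static pods raise.

// Illustrative only: drop a pod manifest into the kubelet's static-pod
// directory so the kubelet starts it without going through the scheduler.
// Paths and the pod spec are placeholders.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Typical value of the kubelet's --pod-manifest-path; varies by installer.
const staticPodDir = "/etc/kubernetes/manifests"

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "coldstart-hack", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "user-container",
				Image: "gcr.io/example/user-app", // placeholder image
			}},
		},
	}

	// The kubelet watches this directory; manifests can be YAML or JSON.
	data, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(staticPodDir, "coldstart-hack.json"), data, 0o644); err != nil {
		panic(err)
	}
}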
Just hearing you say this will make me wake up with cold sweats for the next couple of months. They are indeed of questionable support.
Is having the k8s control plane in the data plane a true blocker, or just a risk? The k8s control plane has to meet some base reliability SLOs - if those aren't sufficient for us, I'd love to see more about why they would be sufficient for other users (other than latency, which I agree should be faster).
Pod startup time: 99% of pods and their containers (with pre-pulled images) start within 5s.
While a fast cold start also helps a free tier, I actually don't see the "free-tier scenario" as the main driver here. A common FaaS scenario is to run CPU- and/or memory-intensive workloads in a massively parallel fashion (e.g. image/audio/PDF processing), often latency sensitive (e.g. in RPA scenarios, but certainly not limited to them). With that, many instances (potentially thousands) need to be spun up in parallel, each with its own dedicated memory and CPU. Keeping a few warm instances wouldn't solve that problem. So for us to be able to serve these kinds of production workloads, rapid scaling is crucial.