Hello Fellow Knativers,
Currently, Knative has a significant cold-start time relative to other serverless platforms - about 7 seconds with Istio and 4 without, compared to the sub-second cold starts which seem common elsewhere. In addition to the obvious user-experience issues this causes during scale-up windows, it also has significant cost effects for service providers: how much we have to overshoot capacity to avoid blocking requests and how quickly we can scale down are both functions of this cold-start time, and the result is a cost that must be passed on to users.
AFAIK we don't have a stated target for cold-start time anywhere, so I think getting agreement on this is a good place to start. For reference, we see about a 500ms upper bound on cold-start times in OpenWhisk; Lambda (according to random internet articles) varies wildly by language but generally seems to be less than this. I've seen 1 second mentioned in several conversations and that seems like a good initial target to me, with the caveat that long term it is probably too conservative a target considering the landscape. Thoughts?
In addition to getting consensus on a target, I'd also like to get some organization around the current time sinks and potential mitigations (a revival of https://github.com/knative/serving/issues/1297). While we seem well aware that Istio accounts for a little over 2 seconds, there are still 4 seconds to account for, and it'd be great to drill into those! I've also created https://github.com/knative/serving/issues/2485 to track work related to reducing the remaining Istio startup time.
Approximately 2 of these seconds I've measured to be the baseline pod startup time for Kubernetes, which is concerning given the proposed (conservative) target of 1 second. I am very interested to know whether anyone has looked into where this time is spent and how we might work around it. I could certainly see us doing something like a custom scheduler and DaemonSet in the short term while we investigate upstream solutions. I've created https://github.com/knative/serving/issues/2484 to track work and information related to this.
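For anyone who wants to reproduce that baseline number, here's a rough sketch of the kind of measurement I mean (this is not the actual harness; it assumes a recent client-go, a kubeconfig in the default location, and a pause image already pulled on the node so image pulls don't skew the result):

// Rough sketch (not a real harness): time how long it takes a trivial pod
// with a pre-pulled image to go from "created" to "Running".
// Assumes a recent client-go and a kubeconfig in the default location.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "coldstart-probe-"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:            "probe",
				Image:           "k8s.gcr.io/pause:3.1", // pre-pulled so image pull time is excluded
				ImagePullPolicy: corev1.PullIfNotPresent,
			}},
			RestartPolicy: corev1.RestartPolicyNever,
		},
	}

	start := time.Now()
	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// Poll until the pod reports Running; crude, but fine for a ballpark number.
	for {
		p, err := client.CoreV1().Pods("default").Get(context.TODO(), created.Name, metav1.GetOptions{})
		if err == nil && p.Status.Phase == corev1.PodRunning {
			fmt.Printf("pod running after %v\n", time.Since(start))
			return
		}
		time.Sleep(50 * time.Millisecond)
	}
}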
This still leaves us with another ~2.5 seconds unaccounted for, which I'd love to get information on. There's a great breakdown at https://github.com/knative/serving/issues/1297#issuecomment-399226386 which seems to have info on this, but it may be a bit stale at this point. Any components of time we can break out into an issue we can drive down on would be great.
Finally, maybe we could get a milestone created to burn down this startup-time work? I don't really see this as something we ever 'complete', but I find these milestones very helpful for pointing contributors to sets of issues around a particular topic.
Hi Greg, thanks for putting this together!

Can you put these numbers into perspective? How were they produced and measured, for instance? What was the cluster size, and how far did you scale? It will be very important going forward to have a continuous and reproducible way of generating these numbers.
I think that we will ultimately need to push upstream for the ability to make node-local scheduling decisions, for a handful of reasons, this included. The major reason in my mind is that cold starts put the k8s control plane on our data plane, which gives me concerns about cold-start reliability SLOs. Writing this up has been on my TODO list for a while. :(

A strawman I've proposed in the past is that we run the activator as a privileged DaemonSet that lame-ducks when its node reaches capacity, but otherwise co-schedules pods to handle buffered traffic while the "real" pods are started.
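To make the lame-ducking part of that strawman a bit more concrete, here's a minimal sketch of the idea (the handler names and the capacity check are made up for illustration, not an actual activator API): the DaemonSet pod fails its readiness probe once its node is at capacity, so it drops out of the Service endpoints while it keeps draining the traffic it has already buffered.

// Hypothetical sketch only: a node-local activator that "lame ducks" by
// failing its readiness probe once it hits capacity, while it keeps serving
// the traffic it has already buffered.
package main

import (
	"net/http"
	"sync/atomic"
)

type activator struct {
	inflight int64 // requests currently buffered/being proxied on this node
	capacity int64 // made-up knob: how much this node's activator will take on
}

func (a *activator) atCapacity() bool {
	return atomic.LoadInt64(&a.inflight) >= a.capacity
}

// readyz would be wired to the DaemonSet's readinessProbe. Failing it removes
// this node's activator from the Service endpoints (the lame-duck state)
// without killing the process, so buffered requests can still drain.
func (a *activator) readyz(w http.ResponseWriter, r *http.Request) {
	if a.atCapacity() {
		http.Error(w, "lame duck: node at capacity", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// handle stands in for the real work: buffer the request and forward it once
// the revision's pod is up (or a co-scheduled local pod can take it).
func (a *activator) handle(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&a.inflight, 1)
	defer atomic.AddInt64(&a.inflight, -1)
	// ... buffer / forward to the user pod here ...
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	a := &activator{capacity: 100}
	http.HandleFunc("/readyz", a.readyz)
	http.HandleFunc("/", a.handle)
	http.ListenAndServe(":8080", nil)
}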
There are some rudimentary facilities for local pod creation today, but they are poorly documented and of questionable support. I also don't know their latency SLOs, and thockin waved us off of them.
Inject Envoy locally: resurrect https://github.com/istio/istio/issues/3816 for a function we could run locally, and then figure out the Istio mesh and mTLS.
On Wed, Nov 14, 2018 at 8:50 AM Joe Burnett <joseph...@google.com> wrote:

I don't know if a new area is necessary. How about a Milestone like Markus was suggesting, "Scaling: Sub-Second Cold-Start"? Initial issues might be:
- Automate cold-start timing collection. I think that +se...@redhat.com was working on this previously.
- Kubelet API for creating pods locally. Mostly tracking the upstream work.
- Activator as a daemon set. We can do this early and schedule locally as an experiment. This will allow us to play around with lame-ducking based on capacity too.
- Prototype local scheduling. A hack using https://kubernetes.io/docs/tasks/administer-cluster/static-pod/#configuration-files (rough sketch below).
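For flavor, here's a very rough sketch of what that static-pod hack might look like: a node-local agent with write access to the kubelet's manifest directory drops a pod manifest there, so the kubelet starts it without a round trip through the scheduler. The path, image, and pod spec are placeholders, and this ignores all of the lifecycle questions static pods raise.

// Illustrative only: drop a pod manifest into the kubelet's static-pod
// directory so the kubelet starts it without going through the scheduler.
// Paths and the pod spec are placeholders.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Typical value of the kubelet's --pod-manifest-path; varies by installer.
const staticPodDir = "/etc/kubernetes/manifests"

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "coldstart-hack", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "user-container",
				Image: "gcr.io/example/user-app", // placeholder image
			}},
		},
	}

	// The kubelet watches this directory; manifests can be YAML or JSON.
	data, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(staticPodDir, "coldstart-hack.json"), data, 0o644); err != nil {
		panic(err)
	}
}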
Just hearing you say this will make me wake up with cold sweats for the next couple of months. They are indeed of questionable support.
Is having the k8s control plane in the data plane a true blocker, or just a risk? The k8s control plane has to meet some base reliability SLOs - if those aren't sufficient for us, I'd love to see more about why they would be sufficient for other users (other than latency, which I agree should be faster).
Pod startup time: 99% of pods and their containers (with pre-pulled images) start within 5s.
While a fast cold start also helps a free tier, I actually don't see the "free-tier scenario" as the main driver here. A common FaaS scenario is to run CPU- and/or memory-intensive workloads in a massively parallel fashion (e.g. image/audio/PDF processing), often latency sensitive (e.g. in RPA scenarios, but certainly not limited to them). With that, many instances (potentially thousands) need to be spun up in parallel, each with its own dedicated memory and CPU. Keeping a few warm instances wouldn't solve that problem. So for us to be able to serve these kinds of production workloads, rapid scaling is crucial.