Stateful functions in Knative


James Roper

May 8, 2019, 9:53:50 PM
to knati...@googlegroups.com
Hi,

This is my first post to this group, so allow me to introduce myself. I'm James Roper, Architect of OSS at Lightbend, the company behind Akka, Play, and Scala.

At Lightbend we've been experimenting with something we call stateful serverless. The idea is to bring state to serverless functions, allowing a much broader range of use cases than is currently possible, without subverting the serverless architecture (e.g., if a function goes directly to a database, autoscaling decisions can no longer be meaningfully made by the serverless framework, because there's no way to know whether it's the function or the database that is the cause of latency).

Our approach has been somewhat validated in the past few days, when Microsoft announced a very similar feature for Azure, called Stateful Entities:


So we think it's time that Knative seriously considers this approach to providing stateful functions. To this end, we've created a screencast that describes our approach, and demonstrates writing a stateful function in JavaScript, from nothing to deployed and running in Kubernetes in 15 minutes (with liberal time taken along the way to describe exactly what the code and descriptors are doing):


Our next step is to actually integrate this proof of concept into Knative, but we need guidance as to the right approach to do that. Any feedback or thoughts would be welcomed.

Regards,

James

--
James Roper
OSS Architect, Lightbend, Inc.
@jroper

Carlos Santana

May 8, 2019, 10:51:10 PM
to ja...@lightbend.com, knati...@googlegroups.com
Hi James, very interesting pattern using an Akka sidecar.
 
I'm guessing a start would be looking into an annotation on the Serving API to enable/disable the sidecar injection/hook.
 

Carlos Santana
Senior Technical Staff Member (STSM) - IBM Hybrid Cloud
Cloud Architecture and Solutions Engineering (CASE)
https://www.ibm.com/cloud/garage/architectures
Mobile/SMS 919.332.9619
Twitter @csantanapr
Time Zone GMT-5
 
 
--
You received this message because you are subscribed to the Google Groups "Knative Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to knative-dev...@googlegroups.com.
To post to this group, send email to knati...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/knative-dev/CABY0rKMKAaSK7VG5TmBbppEw%2BE_NhftaLYh1ZG0Regwa91UEWA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
 

Markus Thömmes

May 9, 2019, 6:47:14 AM
to Knative Developers
Hi James,

Interesting demo, thanks for bringing it up! Disclaimer: I know the Akka framework and have built stuff with it over at OpenWhisk.

A few questions (enumerating them to ease referring later):
  1. As you say, state is brought to serverless architectures via database access today. Scaling these backend services according to the incoming load is indeed an issue, but not necessarily one specific to serverless. It is very much an issue in FaaS platforms that only allow a single request at a time, where the volume of database connections created can be unexpectedly high. Setting that aside, I don't quite see why the scalability characteristics of databases should be fundamentally different for serverless applications.
  2. If we assume that to be a problem, how does your proposed solution help to fix it? Event sourcing and CRDTs are neat approaches to distributed state sharing, but ultimately they are backed by databases as well. The Akka cluster can alleviate that somewhat by caching some of it, but ultimately these databases might become slow and need to be scaled too. Does that not impact end-user latency, and thus transitively concurrency?
  3. Knative Serving scales based on request concurrency, not request latency. Transitively, increased latency can cause increased concurrency, as requests pile up. The database in turn needs its own autoscaling rules to handle the load. Is that different in your model?
  4. You propose a very specific gRPC protocol to achieve what you're showing. Does this necessarily rely on Akka and the sidecar you're introducing, or could event sourcing also be implemented in the user function (via good libraries that hide the nasty details)?
Sorry for the long mail and many questions. I hope I haven't completely missed something that makes them all obsolete.

Cheers,
Markus

James Roper

May 9, 2019, 7:44:48 AM
to Markus Thömmes, Knative Developers
On Thu, 9 May 2019 at 20:47, 'Markus Thömmes' via Knative Developers <knati...@googlegroups.com> wrote:
Hi James,

Interesting demo, thanks for bringing it up! Disclaimer: I know the Akka framework and have built stuff with it over at OpenWhisk.

A few questions (enumerating them to ease referring later): 
 
1. As you say, state is brought to serverless architectures via database access today. Scaling these backend services according to the incoming load is indeed an issue, but not necessarily one specific to serverless. It is very much an issue in FaaS platforms that only allow a single request at a time, where the volume of database connections created can be unexpectedly high. Setting that aside, I don't quite see why the scalability characteristics of databases should be fundamentally different for serverless applications.
 
The difference is that if the platform does the database access, then the platform can collect metrics on the database access and on the function invocation separately. If the function itself accesses the database, the only thing the platform can measure is the function invocation, which includes the database access. Note that I'm assuming the Akka sidecar will replace the Knative sidecar for these stateful functions; one of the Knative sidecar's primary roles is to collect metrics, so that responsibility will pass to the Akka sidecar, which will be able to collect distinct metrics for database access and for function invocation.

2. If we assume that to be a problem, how does your proposed solution help to fix it? Event sourcing and CRDTs are neat approaches to distributed state sharing, but ultimately they are backed by databases as well.
 
Actually, CRDTs are not necessarily backed by databases: Akka supports both durable and non-durable CRDTs, and non-durable ones are just held in memory, replicated across the cluster.

The Akka cluster can alleviate that somewhat by caching some of it, but ultimately these databases might become slow and need to be scaled too. Does that not impact end-user latency, and thus transitively concurrency?

Yes, absolutely. And that's why having the platform collect metrics on the database specifically, distinguished from the user-function invocation, is helpful: the platform can at a minimum raise alerts, if not do some scaling of the database itself (though of course scaling databases and scaling compute workloads are completely different things; it's not usually possible to just add nodes to a database as needed).

3. Knative Serving scales based on request concurrency, not request latency. Transitively, increased latency can cause increased concurrency, as requests pile up. The database in turn needs its own autoscaling rules to handle the load. Is that different in your model?

I must admit that I am very naive about how Knative Serving works, and I don't have a lot of experience with autoscaling in general, so it's very possible I've made some bad assumptions. That said, if scaling is based on request concurrency, I can imagine the Akka sidecar measuring how many requests are outstanding and/or queued waiting on the database versus how many are outstanding and/or queued waiting on the user function. If no requests are queued waiting for the user function, and all of them are queued waiting to access the database, then the decision is easy: scale the database, not the user function.

4. You propose a very specific gRPC protocol to achieve what you're showing. Does this necessarily rely on Akka and the sidecar you're introducing, or could event sourcing also be implemented in the user function (via good libraries that hide the nasty details)?

This is all very experimental at the moment, but our current vision is that the protocol would become a sort of standard, so that multiple technologies, not just Akka, could implement it and be provided as a sidecar. The Akka part would be pluggable. And indeed, we're looking at something bigger than just event sourcing here: there are more patterns for working with state than what Akka supports, and we could imagine other protocols created for other patterns. Some of the wilder ideas we have include sidecars that could map incoming requests to database queries, and inject the result of the query into the request for the user function to process (it could be modelled as a merge). Again, the advantage is that the platform has full visibility into where requests are backing up: is it when you hit the database, or when you hit the user function? Intelligent scaling decisions can be made from that. No idea whether that particular idea will fly, but for this to be successful in Knative, I think it has to be bigger than Akka.

Sorry for the long mail and many questions. I hope I haven't completely missed something that makes them all obsolete.

There's a high chance I've misunderstood something about where you're coming from so please keep the questions coming.
 

Cheers,
Markus


James Roper

May 9, 2019, 7:52:52 AM
to Carlos Santana, knati...@googlegroups.com
Hi Carlos,

On Thu, 9 May 2019 at 12:51, Carlos Santana <csan...@us.ibm.com> wrote:
Hi James, very interesting pattern using an Akka sidecar.
 
I'm guessing a start would be looking into an annotation on the Serving API to enable/disable the sidecar injection/hook.

I'm not 100% sure what you mean here, but let me take a guess. Are you saying that if we could create an annotation to stop Knative Serving from handling, for example, a Serving resource that's been deployed, we could create our own operator that jumped in and handled it instead? So essentially we'd plug into Knative by disabling it selectively on a function-by-function basis, and handle the functions where it was disabled with our own operator. If that's what you mean, it sounds like a pretty good idea: it would allow us to prove the technology somewhat outside of Knative while still giving users the experience of just using Knative, not something different, and hopefully we'd still be able to integrate with Knative Build and perhaps get some of the scaling functionality too. If any of what we've developed makes sense to eventually pull into Knative, there'll be a path forward for that.
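To make that guess concrete, I imagine something like the following, where the annotation key and value are entirely made up (no such annotation exists in Knative today):

```yaml
# Hypothetical only: this annotation does not exist. The idea is that a
# Service could opt out of default reconciliation so a third-party
# operator (e.g. an Akka-based one) could handle it instead.
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: shopping-cart
  annotations:
    serving.knative.dev/reconciler: akka-stateful  # made-up key and value
```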

Markus Thömmes

May 9, 2019, 7:59:26 AM
to Knative Developers
Hi James,

Thank you for the detailed answers. That clears a lot of things up! Having all of the persistence done by "the system" is indeed a very nice approach.

In terms of integrating with the system: you can already exchange the sidecar image for created Revisions. There is a ConfigMap property that defines which image to load as what we call the "queue-proxy"; that's the sidecar we launch today. It is, however, a global switch, so you can't change it per Revision.
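Concretely, that switch looks roughly like this; the exact ConfigMap name and key have varied across releases, so treat it as a sketch and check your installed version:

```yaml
# Sketch: Knative Serving reads the queue-proxy sidecar image from a
# ConfigMap in the knative-serving namespace. Name and key may differ
# per release; the image value here is a made-up example.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  queueSidecarImage: docker.io/example/akka-sidecar:latest
```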

To your mesh problems: you can selectively disable the mesh for the pods of a given Revision by adding the sidecar injection annotation and setting it to false.
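Concretely, that's the standard Istio annotation on the pod template:

```yaml
# Disables Istio sidecar injection for these pods.
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
```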

What I'm a little unclear about is how tight the integration with Knative itself needs to be here. Do you essentially "only" need to selectively exchange the sidecar for a Revision and provide some general information to that sidecar (where to get data from, etc.), or is there tighter integration you need? I could see us making Knative Serving pluggable at such a level.

Cheers,
Markus

James Roper

May 9, 2019, 8:21:35 AM
to Markus Thömmes, Knative Developers
On Thu, 9 May 2019 at 21:59, 'Markus Thömmes' via Knative Developers <knati...@googlegroups.com> wrote:
Hi James,

Thank you for the detailed answers. That clears a lot of things up! Having all of the persistence done by "the system" is indeed a very nice approach.

In terms of integrating with the system: you can already exchange the sidecar image for created Revisions. There is a ConfigMap property that defines which image to load as what we call the "queue-proxy"; that's the sidecar we launch today. It is, however, a global switch, so you can't change it per Revision.

Perhaps as a start we can use the global switch, and get a better idea through that of how we fit in with Knative.

To your mesh problems: you can selectively disable the mesh for the pods of a given Revision by adding the sidecar injection annotation and setting it to false.

Actually, I solved the mesh problems today. When I dived into the Istio codebase I found a lot of what we needed was already there; it just needed some features filled out a little, nothing crazy new or too controversial, I hope. I should have a PR for that up soon, so that's a non-issue now.
 
What I'm a little unclear about is how tight the integration with Knative itself needs to be here. Do you essentially "only" need to selectively exchange the sidecar for a Revision and provide some general information to that sidecar (where to get data from, etc.), or is there tighter integration you need? I could see us making Knative Serving pluggable at such a level.

This is exactly the sort of question I'm asking now. I'm not at all familiar with Knative internals at the moment (though that will be slowly changing in the coming weeks), so I really don't know at what level it makes sense to plug into Knative, if at all. My hope is to gain insight through these kinds of discussions and end up heading in the right direction; I don't want to charge off hacking on the Knative codebase only to find I'm doing something completely contrary to the direction the project maintainers want to go. I think what you describe is probably the right level, but one big question in my mind is whether it would be our responsibility to implement scaling logic for our functions, or whether that would be Knative's, or whether the responsibilities would be split somehow. I'm also not quite on top of how routing works in Knative (though I guess it'll only take a day or two of poking around to understand it), so I'm not sure if we need anything special there. Some of my questions can probably be answered by trying to use Knative more in anger.


James Roper

May 23, 2019, 7:31:33 AM
to Markus Thömmes, Knative Developers
Hi all,

I've submitted a PR that implements what we need from Knative to provide a nice integration point for us to plug into. Using this plugin point, I've been able to deploy the event-sourced functions demoed in the screencast above as Knative Services that get activated/scaled by Knative just like any other service; the difference is that we've supplied our own sidecar and deployment configuration to do the database access, clustering, and state management.


I realised I forgot to post a link to the work we're doing; it can be found here:


Regards,

James