On 15.06.21 20:59, Bartłomiej Płotka wrote:
>
> Let's now talk about FaaS/Serverless.
Excellent! That's my 2nd favorite topic after histograms. (And while I
provably talked about histograms as my favorite topic since early
2015, I have only started to talk about FaaS/Serverless as an
important gap to fill in the Prometheus story since 2018.)
I think "true FaaS" means that the function calls are
lightweight. The additional overhead of sending anything over the
network defeats that purpose. So, similar to what has been said
before, and what Bartek has already nicely worked out, I think the
metrics have to be managed by the FaaS runtime, in the same path as
billing is managed.
And that's, of course, what cloud providers are doing, and it's also a
formidable way of locking their customers into their own metrics and
monitoring system.
And that's in turn precisely where I think Prometheus can use its
weight. Prometheus has already proven that cloud providers can
essentially not get away with ignoring it, and even halfhearted
integrations won't be enough. With more or less native Prometheus
support by cloud providers, it might actually just require a small
step to agree on a convention for how to collect and present FaaS
metrics in a "Promethean" way. If all cloud providers do it the same
way, the lock-in is gone.
I think it would be very valuable to study what OpenFaaS has already
done:
https://docs.openfaas.com/architecture/metrics/
In the simplest case, we could just say: Please, dear cloud providers,
please expose exactly the same metrics for general benefit. If there
is anything to improve with the OpenFaaS approach, I'm sure they will
be delighted to get help. (Spontaneously, I'm missing a way to define
custom metrics, e.g. how many records a function call has processed.)
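To make the custom-metrics point concrete, here is a minimal sketch of a runtime that accounts for each call in-process, so the function itself never touches the network. It is not how OpenFaaS or any cloud provider does it. The class, the metric names (faas_invocations_total, faas_records_processed_total), and the convention that a function returns (result, records) are all assumptions made up for illustration.

```python
from collections import defaultdict


class RuntimeMetrics:
    """In-process accounting kept by a hypothetical FaaS runtime."""

    def __init__(self):
        self.invocations = defaultdict(int)        # (function, status) -> count
        self.records_processed = defaultdict(int)  # function -> custom counter

    def invoke(self, name, fn, *args):
        """Run fn and account for the call.

        Assumed convention: fn returns (result, records_processed).
        """
        status = "error"
        try:
            result, records = fn(*args)
            self.records_processed[name] += records
            status = "ok"
            return result
        finally:
            # Counted on success and on failure, like a billing record would be.
            self.invocations[(name, status)] += 1

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        lines = ["# TYPE faas_invocations_total counter"]
        for (name, status), count in sorted(self.invocations.items()):
            lines.append(
                f'faas_invocations_total{{function="{name}",status="{status}"}} {count}'
            )
        lines.append("# TYPE faas_records_processed_total counter")
        for name, count in sorted(self.records_processed.items()):
            lines.append(f'faas_records_processed_total{{function="{name}"}} {count}')
        return "\n".join(lines) + "\n"
```

The only per-call cost here is a couple of dictionary updates. Everything else happens when the runtime is scraped.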
> * Suggestion to use event aggregation proxy
>   <https://github.com/weaveworks/prom-aggregation-gateway>
> * Pushgateway improvements
>   <https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ> for
>   serverless cases
Despite all of what I said above, I think there _are_ quite a few users
of FaaS who have fairly heavy-weight function calls. For them, pushing
counter increments etc. via the network might actually be more
convenient than funneling metrics through the FaaS runtime. This is
then just another use-case of the "distributed counter" idea, which
the Pushgateway quite prominently does not cater for. As discussed
in the thread linked above and in countless other places, I strongly
recommend not shoehorning the Pushgateway into this use-case, but
creating a separate project for it, designed from the
beginning for this use-case. Perhaps
weaveworks/prom-aggregation-gateway is just that. I haven't studied it
in detail yet. In a way, we need "statsd done right". Again, I would
suggest looking at what others have already done. For example, there are
tons of statsd users out there. What have they done in recent years
to overcome the known shortcomings? Perhaps statsd instrumentation and
the Prometheus statsd exporter just need a bit of development in that
direction to become a viable solution.
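The "distributed counter" idea can be sketched in a few lines: many function calls push increments, and an aggregator sums them and presents the totals for scraping. The sketch below is an illustration only and is not taken from prom-aggregation-gateway or the statsd exporter. The class name is hypothetical, it accepts only plain statsd counter lines of the form name:value|c, and it ignores sample rates, labels, and every other statsd metric type.

```python
from collections import defaultdict


class CounterAggregator:
    """Sums counter increments pushed by many short-lived function calls."""

    def __init__(self):
        self.totals = defaultdict(float)  # metric name -> running sum

    def ingest(self, line):
        """Accept one statsd-style counter line, e.g. 'records_processed:3|c'."""
        name, _, rest = line.strip().partition(":")
        value, _, kind = rest.partition("|")
        if not name or kind != "c":
            raise ValueError(f"not a plain counter line: {line!r}")
        # Increments are added up, never overwritten.
        self.totals[name] += float(value)

    def render(self):
        """Render the running sums in the Prometheus text exposition format."""
        lines = []
        for name, total in sorted(self.totals.items()):
            lines.append(f"# TYPE {name}_total counter")
            lines.append(f"{name}_total {total:g}")
        return "\n".join(lines) + "\n"
```

The key design point is in ingest(): each push is an increment to be summed, which is the behavior a distributed counter needs and a store of last-pushed values cannot provide.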
> I think the main problem appears if those FaaS runtimes are short-living
> workloads that automatically spins up only to run some functions (batch
> jobs). In some way, this is then a problem of short-living jobs and the
> design of those workloads.
>
> For those short-living jobs, we again see users try to use the push model.
> I think there is room to either streamline those initiatives OR propose
> an alternative. A quick idea, yolo... why not killing the job after the
> first successful scrape (detecting usage on /metric path)?
Ugh, that doesn't sound right. I think this problem should be solved
within the FaaS runtime in the way they prefer. Cloud providers need
billing in any case (they want to make money after all), so they have
already solved reliable metrics collection for that. They just need to
hook in a simple exporter to present Prometheus metrics. See how
OpenFaaS has done it. Knative seems to have gone down the OTel path,
but that could be seen as an implementation detail. If they in the end
expose a /metrics endpoint with the desired metrics for Prometheus to
scrape, all is good. It's just a terribly overengineered exporter,
effectively. (o;
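For illustration, "hooking in a simple exporter" could look roughly like the sketch below: a small HTTP server that reads whatever the runtime already counts and serves it on /metrics. The get_invocation_counts callback and the metric name faas_invocations_total are hypothetical stand-ins for a runtime's existing accounting. This is not how OpenFaaS or Knative actually implement it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_exporter(get_invocation_counts, port=0):
    """Build an HTTP server that serves invocation counts on /metrics.

    get_invocation_counts is a hypothetical hook into the runtime's existing
    accounting. It returns a dict of function name -> invocation count.
    Port 0 lets the OS pick a free port.
    """

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            lines = ["# TYPE faas_invocations_total counter"]
            for name, count in sorted(get_invocation_counts().items()):
                lines.append(f'faas_invocations_total{{function="{name}"}} {count}')
            body = ("\n".join(lines) + "\n").encode("utf-8")
            self.send_response(200)
            # Content type of the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling make_exporter(counts_fn).serve_forever() would then give Prometheus something to scrape, while the function calls themselves stay untouched.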
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email]
bjo...@rabenste.in