Best practice for Prometheus + GCP Cloud Run

Rohit Ramkumar

Dec 18, 2020, 9:50:16 AM
to Prometheus Users
Hi,

I'm running a service in Cloud Run (https://cloud.google.com/run) and wondering what the best practice is here for setting up Prometheus. Specifically, I'm wondering how to handle the case when there are multiple container instances running behind a single Cloud Run API endpoint.

If there is only one container instance ever, then this is easy. I can simply deploy the Prometheus server along with my application server and expose it. Clients can hit the Cloud Run endpoint and get the metrics. However, if there is more than one container instance (during autoscaling, for example), how will this work? Wouldn't a client request for metrics get sent to an arbitrary one of the backends? Is using a push gateway the best practice in this case?
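To be concrete, in the single-instance case I'd just run Prometheus as a sidecar scraping the app over localhost, roughly something like this (the port and path are placeholders for whatever my app actually exposes):

scrape_configs:
  - job_name: 'my-service'
    metrics_path: /metrics            # wherever the app exposes its metrics
    static_configs:
      - targets: ['localhost:8080']   # placeholder for the app's port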

Thanks!

Stuart Clark

Dec 18, 2020, 11:28:28 AM
to Rohit Ramkumar, Prometheus Users
I'll start by saying that I'm not all that familiar with Google Cloud, as we mostly use AWS, but in terms of good practice for Prometheus the answer is always to scrape the underlying instances/pods/containers directly and not via a load balancer. I'd normally use one of the Service Discovery (SD) mechanisms to find them (e.g. Kubernetes SD for pods or EC2 SD for EC2 instances). Hopefully you can do something similar with the GCE SD (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#gce_sd_config).
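As a very rough sketch (untested, and assuming Cloud Run instances are even discoverable this way, which you'd need to verify; the project, zone and port below are placeholders), a GCE SD scrape job looks something like this:

scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: my-project      # placeholder GCP project ID
        zone: us-central1-a      # zone to discover instances in
        port: 9090               # port your metrics endpoint listens on
    relabel_configs:
      # use the instance name rather than the raw IP as the instance label
      - source_labels: [__meta_gce_instance_name]
        target_label: instance

The same pattern applies with kubernetes_sd_configs (role: pod) or ec2_sd_configs; only the SD block and the meta labels change.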

If it isn't possible to connect to such instances directly (for example, for Lambdas in AWS), I would then look to connect the cloud's native metrics system to Prometheus. So for AWS I'd look at using the CloudWatch Exporter. It looks like there is a Stackdriver Exporter, which I think would be the equivalent for GCP?
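I've not run it myself, but assuming the community stackdriver_exporter listening on its default port (9255, I believe), the Prometheus side would just be a normal scrape job pointed at the exporter:

scrape_configs:
  - job_name: 'stackdriver'
    static_configs:
      # placeholder hostname for wherever you run the exporter
      - targets: ['stackdriver-exporter:9255']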

The Push Gateway isn't designed for, and is a very poor fit for, this sort of use case. The Push Gateway is really for short-lived processes that can't be scraped directly because of the limited time they exist (for example cron jobs). Equally, it works best when there is only a single instance (or a fixed number of parallel instances) of that short-lived process (e.g. for a cron job you'd expect only a single run every configured period). When you send metrics to the Push Gateway you replace the previous set, so for multiple instances (or jobs) you'd use different grouping keys (the job/instance labels in the push URL). If the number of instances is dynamic you'd end up with metrics in the Push Gateway for instances that no longer exist in reality. People then engineer something that tries to keep the Push Gateway "tidy", but you end up with something that is complex and probably not that reliable.
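(For completeness: if you did scrape a Push Gateway you'd set honor_labels: true so the pushed job/instance labels survive the scrape, but none of that fixes the stale-group problem above; the hostname and port below are the usual defaults.)

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true    # keep the job/instance labels from the pushed groups
    static_configs:
      # each push to /metrics/job/<job>/instance/<instance> replaces the
      # previous metrics in that group, and the group lingers until deleted
      - targets: ['pushgateway:9091']   # 9091 is the Push Gateway's default port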

So in short, the Push Gateway is unlikely to be useful at all for your use case. Instead, try to connect to the instances directly (behind the load balancer), and if that isn't possible look at integration with the Google metrics system.

Rohit Ramkumar

Dec 18, 2020, 2:43:07 PM
to Stuart Clark, Prometheus Users
I always knew Stackdriver existed but didn't think to incorporate it here. 

Thanks! This is helpful.
