Removing namespace component in metric's keys


Manjula Amunugama

Mar 1, 2021, 1:44:43 AM
to Prometheus Users
Hi all,

In our environment, we use Prometheus and Grafana to monitor about 200 microservices.

From one application to another, developers have used different strings as the namespace component.
For example, we have Prometheus keys like "booking_engine_driver_eta_location_service_outboundcall_latency_microseconds_count" to track the latency of calls from "BookingEngine.Driver-ETA" to "Location-Service".
Here "Booking Engine" is the "Service Group", "Driver-ETA" is the service, and "Location-Service" is the outbound service.

For API-based requests, it is a must to monitor "Inbound Request Rates by Endpoint", "Inbound Request Error Rates by Endpoint", "Processing Latency by Endpoint", "Outbound Request Rates by Endpoint", and "Outbound Request Error Rates by Endpoint".

We can monitor all the services with about 3 dashboards ("Inbound Service Monitor Rates", "Outbound Service Monitor Rates", "Processing Latencies"), as long as we know the Prometheus keys used.
So we want to standardize the Prometheus keys as follows:
- We use the namespace to define the "Development Team"
- The application name will be a label on the key, i.e. the label will be "app"
- The endpoint will also be a label on the key
- The error will be a label on the key

So the previous key, now with labels, will change to "outboundcall_latency_microseconds_count{app="booking_engine_driver_eta_location_service"}".
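
To make the rename concrete, here is a minimal Python sketch of how a legacy flat key could be split into the standardized metric name plus an "app" label. The function name `split_legacy_key` and the `STANDARD_SUFFIXES` list are assumptions made for illustration; only the metric names themselves appear in this thread.

```python
# Hypothetical list of standardized metric names (taken from names used in this thread).
STANDARD_SUFFIXES = (
    "outboundcall_latency_microseconds_count",
    "outboundcall_latency_microseconds_sum",
    "inboundcall_latency_microseconds_count",
    "inboundcall_latency_microseconds_sum",
)


def split_legacy_key(key):
    """Split a legacy flat key into (metric_name, labels).

    Everything before the known suffix becomes the value of the "app" label.
    """
    for suffix in STANDARD_SUFFIXES:
        if key.endswith("_" + suffix):
            app = key[: -len(suffix) - 1]  # drop the suffix and the joining "_"
            return suffix, {"app": app}
    raise ValueError(f"no known metric suffix in {key!r}")


name, labels = split_legacy_key(
    "booking_engine_driver_eta_location_service_outboundcall_latency_microseconds_count"
)
print(name, labels)
```

The explicit suffix list is deliberate: the legacy prefix has no fixed number of components, so it cannot be split reliably by position alone.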

Doing this, we can automate most things related to dashboarding and alerting.

By doing this, about 200 time series will be grouped into about 4 groups, so 200 time series become 4 time series.

Will doing so be a big hit to Prometheus performance?

As I understand it, Prometheus can handle millions of time series, and even after this change we would have only about 50,000. But a single time series would then hold data for about 200 services. The expression to get the latency count would be "outboundcall_latency_microseconds_count{app="booking_engine_driver_eta_location_service"}".

Is it advisable to do it like this?


Stuart Clark

Mar 1, 2021, 4:58:22 AM
to Manjula Amunugama, Prometheus Users
A time series is different to a metric.

A metric has a name and an optional selection of labels.

A time series is one specific metric & label combination.

So, for example, a metric could be called "requests_count", but two time
series could be "requests_count{response_code='200'}" or
"requests_count{system='frontend',authenticated='false'}".

As a result, in terms of the number of time series there is no
difference between 100 metrics with no labels and a single metric with a
label with 100 values.
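
The point about series counts can be sketched in a few lines of Python. This is only an illustration of the identity rule (metric name plus full label set), not how Prometheus stores data internally; the metric and label names are made up.

```python
def series_id(name, labels=None):
    """A time series is identified by its metric name plus its full label set."""
    return (name, frozenset((labels or {}).items()))


# Layout A: 100 distinct metric names, no labels (names are made up).
flat = {series_id(f"service_{i}_requests_count") for i in range(100)}

# Layout B: one metric name with an "app" label taking 100 values.
labelled = {series_id("requests_count", {"app": f"service_{i}"}) for i in range(100)}

print(len(flat), len(labelled))  # both layouts yield 100 distinct series
```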

How the difference affects performance will depend on how things are
being used. There is likely to be little difference in performance
during scraping, but query usage could make a bigger difference. A
metric with labels is expected to be aggregatable, so it would make
sense to arrange the data in that way if that would be true. If you were
to sum together all the different label combinations of a particular
metric, would the result make sense? For example, a metric which counts
requests and has labels for error code would still make sense if you
summed everything together (rather than requests per code you would have
the total number of requests).

Would it make sense in your case to use labels within a single metric?
If the different systems are completely unrelated that might not be the
case - a sum wouldn't mean anything and an average would be equally
useless as the different systems do a totally different selection of
work. However if you are looking at latencies end-to-end across multiple
systems in a flow, or have multiple instances of a system, then it does
sound like the use of labels would make more sense - sum would give you
the overall end-to-end latency or you could produce averages for a
particular system across instances.
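
The "does the sum make sense" test can be tried on paper before changing any instrumentation. Below is a minimal Python sketch that mimics summing over label combinations, loosely in the spirit of PromQL's `sum by (...)`; the helper `sum_by` and all sample values are invented for illustration.

```python
from collections import defaultdict

# Made-up samples for one metric: (label set, current counter value).
samples = [
    ({"app": "frontend", "code": "200"}, 950),
    ({"app": "frontend", "code": "500"}, 50),
    ({"app": "backend", "code": "200"}, 400),
]


def sum_by(samples, keep):
    """Sum values, grouping only by the label names listed in `keep`."""
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


total = sum_by(samples, keep=())            # all requests, every label dropped
per_app = sum_by(samples, keep=("app",))    # requests per app, "code" dropped
print(total, per_app)
```

If the totals produced this way are numbers you would actually put on a dashboard, the label layout is probably a reasonable fit.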

--
Stuart Clark

Manjula Amunugama

Mar 1, 2021, 5:47:05 AM
to Prometheus Users
Hi Clark,

There will be no duplicated values with the approach we are planning.
If you read through carefully, we will end up with metrics like the following.
The call flow is like this:
incoming request ==> Service 1 (endpoint 1)
Service 1 ==> Service 2 (endpoint 1)
Service 1 ==> Service 2 (endpoint 2)
Service 1 ==> Service 3 (endpoint 1)

For each incoming request to Service 1, Service 1 in turn generates multiple outbound calls, as shown above.

In Service 1
------------------
inboundcall_latency_microseconds_sum{system="booking_engine",app="driver_eta",endpoint="endpoint1"}
inboundcall_latency_microseconds_count{system="booking_engine",app="driver_eta",endpoint="endpoint1"}

outboundcall_latency_microseconds_sum{system="booking_engine",app="driver_eta",outbound_service="Service2",endpoint="endpoint1"}
outboundcall_latency_microseconds_count{system="booking_engine",app="driver_eta",outbound_service="Service2",endpoint="endpoint1"}

outboundcall_latency_microseconds_sum{system="booking_engine",app="driver_eta",outbound_service="Service2",endpoint="endpoint2"}
outboundcall_latency_microseconds_count{system="booking_engine",app="driver_eta",outbound_service="Service2",endpoint="endpoint2"}

outboundcall_latency_microseconds_sum{system="booking_engine",app="driver_eta",outbound_service="Service3",endpoint="endpoint1"}
outboundcall_latency_microseconds_count{system="booking_engine",app="driver_eta",outbound_service="Service3",endpoint="endpoint1"}

In Service 1 we expose the above sums and counts by service and endpoint

So there will be no duplicated values.
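
As a rough sketch of how the sums and counts above could be combined, the Python below computes a mean outbound latency per outbound service. All values are made up, and the helper `mean_latency_us` is hypothetical. In Prometheus itself this is usually done on rates over a time window rather than on raw cumulative totals.

```python
# Made-up cumulative values keyed by (outbound_service, endpoint).
latency_sum_us = {
    ("Service2", "endpoint1"): 1_200_000,
    ("Service2", "endpoint2"): 300_000,
    ("Service3", "endpoint1"): 500_000,
}
latency_count = {
    ("Service2", "endpoint1"): 400,
    ("Service2", "endpoint2"): 100,
    ("Service3", "endpoint1"): 250,
}


def mean_latency_us(sums, counts, service):
    """Mean latency in microseconds across all endpoints of one outbound service."""
    total = sum(v for (svc, _), v in sums.items() if svc == service)
    calls = sum(v for (svc, _), v in counts.items() if svc == service)
    return total / calls if calls else None  # None when no calls were recorded


print(mean_latency_us(latency_sum_us, latency_count, "Service2"))  # 3000.0
print(mean_latency_us(latency_sum_us, latency_count, "Service3"))  # 2000.0
```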

The issue, as far as I know, is that one series will contain a lot of data related to multiple services.

The plus point is that we can generate Grafana widgets (meters, graphs, etc.) automatically because we know the keys, and we can automate alerts too.
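
A minimal sketch of what that automation might look like: once the keys follow one pattern, the query expression for a widget can be built from the app name alone. The function `mean_latency_expr` and the 5m window are assumptions for illustration, not an existing Grafana or Prometheus API; the generated string is ordinary PromQL.

```python
def mean_latency_expr(direction, app, window="5m"):
    """Build a PromQL expression for mean call latency of one app.

    `direction` is "inbound" or "outbound"; the metric names follow the
    standardized pattern described in this thread.
    """
    base = f"{direction}call_latency_microseconds"
    selector = f'{{app="{app}"}}'
    return (
        f"sum(rate({base}_sum{selector}[{window}])) / "
        f"sum(rate({base}_count{selector}[{window}]))"
    )


expr = mean_latency_expr("outbound", "driver_eta")
print(expr)
```

The same function could be called in a loop over all app names to produce one panel or alert rule per service.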

Your feedback is very important.

Best Regards,
Manjula 