The other big reason to split into multiple servers is organisational rather than technical.
Having servers per team or per service allows them to be owned by the same people as the service, rather than by a central team. For larger organisations this can give a lot more control over monitoring to the people who actually use it. For example, a team can control its own alerts, instead of relying on a central team's generic alerts or a slow process to get things changed.
Hi Stuart
I agree with your points.
About this section: "The options which run queries on the "local" Prometheus servers require
those services to be available and not too busy - you can have the
situation that a query from somewhere else breaks a server because it is
too big/too slow. Equally a server being unavailable (down/network
issues) will cause a query to fail."
You didn't mention promxy or Thanos Query. These could help to avoid failing the whole query if a single Prometheus instance is not responding.
It could help (or hinder) depending on the failure mode & query purpose.
If you are trying a query across multiple sharded servers (e.g.
different environments) Thanos/promxy isn't going to help with the
missing data. However if you have HA pairs of servers everywhere
it can be very useful if a single server has issues.
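
To illustrate the difference between HA pairs and sharded servers, here is a minimal Python sketch of a fan-out query layer. It is not how Thanos or promxy are implemented; the names (`fan_out`, `healthy`, `down`) and the idea of modelling a server as a plain callable are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a backend is a callable that returns samples keyed by
# series name, or raises ConnectionError when the server is unavailable.
Samples = Dict[str, float]
Backend = Callable[[], Samples]


def fan_out(shards: Dict[str, List[Backend]]) -> Tuple[Samples, List[str]]:
    """Query every shard, using the first replica that answers.

    Returns the merged samples plus the names of shards where no replica
    answered (their data is simply absent from the result).
    """
    merged: Samples = {}
    missing: List[str] = []
    for shard_name, replicas in shards.items():
        for replica in replicas:
            try:
                merged.update(replica())
                break  # one healthy replica is enough for this shard
            except ConnectionError:
                continue  # try the next replica, if there is one
        else:
            missing.append(shard_name)  # every replica failed
    return merged, missing


def healthy(samples: Samples) -> Backend:
    """Build a backend that always answers with the given samples."""
    return lambda: dict(samples)


def down() -> Samples:
    """A backend that is unreachable."""
    raise ConnectionError("server unavailable")


if __name__ == "__main__":
    shards = {
        # HA pair: one replica is down, the other still answers.
        "prod": [down, healthy({'up{env="prod"}': 1.0})],
        # Single sharded server with no replica: its data is lost.
        "staging": [down],
    }
    result, missing = fan_out(shards)
    print(result)   # contains the prod series only
    print(missing)  # lists the staging shard
```

The "prod" shard survives the loss of one replica, while the "staging" shard has no second copy to fall back on, so the query can at best return a partial result.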
If you have queries which stress a server (either due to the number of time series covered or just overall query volume), systems which duplicate queries could in certain situations make things worse: maybe every server is now overloaded.
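
A rough back-of-the-envelope sketch of that load effect, in Python. It assumes the simplest possible model: queries are either spread evenly across servers, or every query is sent to every server. Real systems may cache results or only select servers that hold relevant data, so treat the function `per_server_qps` and its numbers as hypothetical.

```python
def per_server_qps(total_qps: float, servers: int, fan_out: bool) -> float:
    """Queries per second that each server sees under a simple model.

    fan_out=False: queries are spread evenly, so each server sees a share.
    fan_out=True: every query is sent to every server.
    """
    if servers <= 0:
        raise ValueError("servers must be positive")
    return total_qps if fan_out else total_qps / servers


if __name__ == "__main__":
    print(per_server_qps(100.0, 4, fan_out=False))  # 25.0
    print(per_server_qps(100.0, 4, fan_out=True))   # 100.0
```

Under this model, adding servers reduces per-server load only when queries are spread out; with full fan-out each server still carries the whole query volume.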
As I say, the exact "best option" very much depends on your
particular situation. Is it a single environment in one location,
or lots of environments globally? Do you have a single easily
defined set of users (dashboards/alerts) or lots of different
teams with different needs & requirements (e.g. some needing
longer term querying for capacity management, while others are
just short term incident management)? Does the way you operate fit
into a more hierarchical structure/process (e.g. region ->
environment -> service -> instance) or are things more
"flat"?
-- Stuart Clark