Thanos - partial response behavior


Karthik J

Jun 2, 2021, 2:05:58 PM
to Prometheus Users

Hello team,

We are currently evaluating Thanos as a solution for horizontally scaling our Prometheus setup. 

For global rule evaluation with Thanos Ruler, one has to make a tradeoff between availability and accuracy. For our use case, we favor accuracy over availability, but we are wondering whether the availability side of that tradeoff can be improved.

Thanos Querier declares a response partial when at least one instance exposing the Store API is down. Systems that prefer accuracy will “abort” rule evaluations during partial responses. But since a typical Prometheus HA setup contains replicas of each Prometheus instance, it is very inconvenient to abort alert rule evaluations every time any single replica is down. Any one instance could be down for various reasons (scheduled maintenance, patching, deployment, etc.).

Is there any way to improve the availability of Global alert rules? 

Does it make sense to enhance the Store APIs to be replica-aware? During partial responses, can the querier indicate whether the error is in retrieving data from all replicas or from only a subset of them?

Thanks

Bartłomiej Płotka

Jun 2, 2021, 5:27:19 PM
to Karthik J, Prometheus Users
Hi, this question might be better suited for either a GitHub Discussion/Issue at https://github.com/thanos-io/thanos or the #thanos Slack channel.

> Thanos Querier declares a response partial when at least one instance exposing the Store API is down. Systems that prefer accuracy will “abort” rule evaluations during partial responses. But since a typical Prometheus HA setup contains replicas of each Prometheus instance, it is very inconvenient to abort alert rule evaluations every time any single replica is down. Any one instance could be down for various reasons (scheduled maintenance, patching, deployment, etc.).
>
> Is there any way to improve the availability of Global alert rules?
>
> Does it make sense to enhance the Store APIs to be replica-aware? During partial responses, can the querier indicate whether the error is in retrieving data from all replicas or from only a subset of them?

Thanks, good idea, but unfortunately it's not that simple for Prometheus. The reason is that the data each Prometheus replica stores is not replicated between instances. This means that if one instance goes down and comes back up, it keeps a gap for the time it was down. Queried together, the replicas show data without gaps, but if we can query only one of them, we can still see gaps.
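
To make the gap problem concrete, here is a minimal sketch. The timestamps, the 15s scrape interval, and the helper names are made up for illustration, and the set union is only a simplified stand-in for query-time deduplication, not how Thanos implements it.

```python
# Hypothetical sample timestamps (seconds) for one series, scraped every 15s
# by two independent Prometheus replicas of the same HA pair.
INTERVAL = 15
replica_a = [0, 15, 30, 75, 90]      # replica A was down around t=45..60
replica_b = [0, 15, 30, 45, 60, 75]  # replica B went down after t=75


def merge_replicas(*series):
    """Union the timestamps of all replicas (simplified stand-in for dedup)."""
    return sorted(set().union(*series))


def find_gaps(timestamps, interval):
    """Return (start, end) pairs where consecutive samples are too far apart."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > interval]


merged = merge_replicas(replica_a, replica_b)
gaps_merged = find_gaps(merged, INTERVAL)  # both replicas reachable
gaps_a = find_gaps(replica_a, INTERVAL)    # only replica A reachable

print("gaps with both replicas:", gaps_merged)
print("gaps with replica A only:", gaps_a)
```

With both replicas reachable the merged series has no gaps, but if only replica A answers, its old outage shows up as a hole in the data even though that replica is healthy right now. That is why "at least one replica responded" does not by itself guarantee an accurate result.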

There are a few things one can do:

* Deploy three or more Prometheus instances and make the Store API layer replication-aware, so that it understands that not seeing one replica is fine, while not seeing two means the evaluation is aborted. This would also require all rollouts to be gradual. If you are interested in such a flow, it would be valid to request it in the Thanos GitHub issues, and it wouldn't be hard to implement.
* Don't use global alert rules; evaluate most rules locally so alerts stay close to the data source.
* If you really want more availability, remote write might be a better approach, as Thanos Receivers can easily do 3x or greater replication (needed for the ingestion part). Since alerting then runs on top of receivers, with the data in the same network, the querying path is more reliable too. You can of course mix those two deployment models (Receive for some clusters, Prometheus + sidecar for others).
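
The rule from the first option (one missing replica is fine, two means abort) could look roughly like the sketch below. This is not existing Thanos behavior: the `Endpoint` type, the `group` field, and the `max_missing` parameter are all hypothetical names used only to illustrate the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    group: str     # hypothetical: identifies one HA set of Prometheus replicas
    replica: str   # hypothetical: replica name within that set
    healthy: bool  # whether the Store API of this replica answered


def decide(endpoints, max_missing=1):
    """Return 'ok', 'warn', or 'abort' for one rule evaluation.

    'abort' is returned as soon as any replica group is missing more than
    `max_missing` replicas; otherwise missing replicas only produce 'warn'.
    """
    missing = {}
    for ep in endpoints:
        missing.setdefault(ep.group, 0)
        if not ep.healthy:
            missing[ep.group] += 1
    worst = max(missing.values(), default=0)
    if worst == 0:
        return "ok"
    return "warn" if worst <= max_missing else "abort"


all_up = [Endpoint("cluster-1", r, True) for r in ("a", "b", "c")]
one_down = [Endpoint("cluster-1", "a", False),
            Endpoint("cluster-1", "b", True),
            Endpoint("cluster-1", "c", True)]
two_down = [Endpoint("cluster-1", "a", False),
            Endpoint("cluster-1", "b", False),
            Endpoint("cluster-1", "c", True)]

print(decide(all_up), decide(one_down), decide(two_down))
```

Gradual rollouts matter here because the rule only helps if at most `max_missing` replicas of a group are ever unavailable at the same time.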

Kind Regards,
Bartek Płotka (@bwplotka)

