Accepting features into TSDB required by other users (but not Prometheus)

187 views
Skip to first unread message

Goutham Veeramachaneni

unread,
Jan 31, 2020, 8:27:26 AM1/31/20
to Prometheus Developers
Hi All,

So we now know that the TSDB has users beyond just Prometheus now and users that are directly contributing to the ecosystem, mainly Thanos and Cortex. I know we have a long-standing policy of not allowing code in Prometheus that is not directly required by it, but I want to make an exception for TSDB.

This is because TSDB was written as a general purpose embedded database with one of the goals being that it's useful outside of just Prometheus. And both Thanos and Cortex as users of TSDB would like to have a limit to the number of series as a feature. We have tried to do it outside of TSDB, but it's ugly and near impossible. Only way I see is to reject _all writes_ if someone hits the limit, not just the writes that try to create new series. The change to support it in TSDB is in itself small, and this doesn't affect Prometheus at all. We will never use this option in Prometheus.

But I do realise that this might set the precedent for even more features to make it into TSDB that are not useful for Prometheus, but I think it's okay as long as we're careful and make sure that everything that comes is simple changes and are guaranteed to be maintained. We can have similar requirements as SD, that the new feature is accepted only if _both/all_ the current maintainers of TSDB say yes, and if one of the maintainers takes responsibility for the new code.

I am curious to know what others think.

Thanks,
Goutham.

Goutham Veeramachaneni

unread,
Jan 31, 2020, 8:34:35 AM1/31/20
to Prometheus Developers
Ugh, the link to the change was lost: https://github.com/prometheus/tsdb/pull/617

Note: That is not complete yet, but the idea is that we reject scrapes that add new series once the limit is hit.

Thanks
Goutham.

Frederic Branczyk

unread,
Jan 31, 2020, 8:46:09 AM1/31/20
to Goutham Veeramachaneni, Prometheus Developers
I personally even think there is space for this in Prometheus itself as a protection mechanism. I am open for both discussing this feature itself as well as the case of "helping integrations flourish".

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/391e0852-f0ec-4c1a-ad97-86ab1f9db28a%40googlegroups.com.

Bjoern Rabenstein

unread,
Jan 31, 2020, 9:49:10 AM1/31/20
to Goutham Veeramachaneni, Prometheus Developers
On 31.01.20 05:27, Goutham Veeramachaneni wrote:
>
> I know we have a long-standing policy of not allowing code in Prometheus that is
> not directly required by it, but I want to make an exception for TSDB.

Same could be said about PromQL, SD, and likely other parts of the
codebase.

In general, we like it of projects like Cortex or Thanos reuse our
codebase, and in turn we are irritated if projects like Victoria
Metrics recreate the same functionality with their own code, thereby
accidentally or deliberately introducing compatibility problems.

Encouraging reuse doesn't go along nicely with unwillingness to invest
not even the tiniest of efforts to support external users.

This is related to the discussion about versioning and how to do it in
a way that plays well with Go Modules. Only that we talk about actual
features in the code, which understandably appears as more invasive.

I don't think we should have a strict rule to not accept any code
changes in Prometheus that are not directly required by it. The
maintainers of the affected area shold be able to use their best
judgement along the following criteria:

- The change is limited in scope and doesn't negatively affect any
existing functionality.

- The change helps projects that are part of the wider ecosystem (so
that ultimately Prometheus will benefit indirectly).

- It would be much harder to implement the same change outside of the
Prometheus code.

- Maintenance is assured.

Needless to say, we should always be explicit that none of those added
features are covered by our stability commitment for the Prometheus
server as a whole. And of course, we have the freedom to adjust
criteria and revise our decisions. I don't buy the "slippery slope"
line of argument to justify a blanket ban of any of those
changes. Such strict rules are easy to follow but don't gain us much
while they can be very destructive for the ecosystem as a whole.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Brian Brazil

unread,
Jan 31, 2020, 10:57:06 AM1/31/20
to Bjoern Rabenstein, Goutham Veeramachaneni, Prometheus Developers
This is the sticking point for me. Should we accept something if it's not unlikely that either a) we're only going to break it later, as we don't use it and it's in the way of some new feature/optimization or b) we don't break it, and development of Prometheus is slowed or stalled? How would this actually play out in future?

This is not theoretical, we've already had such things get into the TSDB codebase with no non-obvious-semantics, no unitests, that are in the way of development, but it'd be a bit impolite to just remove it.


We have had a few places where code structure/factoring has been tweaked to be more amenable to external usage and that seems to have worked out fine so far. Most recently for example one of our importers requested that PromQL not depend on the config package as part of adding a new feature. I suspect there's a few trivial accessors around that we don't use ourselves too.

Brian

 
I don't buy the "slippery slope"
line of argument to justify a blanket ban of any of those
changes. Such strict rules are easy to follow but don't gain us much
while they can be very destructive for the ecosystem as a whole.


 

Julien Pivotto

unread,
Jan 31, 2020, 2:10:46 PM1/31/20
to Brian Brazil, Bjoern Rabenstein, Goutham Veeramachaneni, Prometheus Developers
On 31 Jan 15:56, Brian Brazil wrote:
> On Fri, 31 Jan 2020 at 14:49, Bjoern Rabenstein <bjo...@rabenste.in> wrote:
>
> > On 31.01.20 05:27, Goutham Veeramachaneni wrote:
> > >
> > > I know we have a long-standing policy of not allowing code in Prometheus
> > that is
> > > not directly required by it, but I want to make an exception for TSDB.
> >
> > Same could be said about PromQL, SD, and likely other parts of the
> > codebase.
> >
> > In general, we like it of projects like Cortex or Thanos reuse our
> > codebase, and in turn we are irritated if projects like Victoria
> > Metrics recreate the same functionality with their own code, thereby
> > accidentally or deliberately introducing compatibility problems.
> >
> > Encouraging reuse doesn't go along nicely with unwillingness to invest
> > not even the tiniest of efforts to support external users.

As things goes we are happy to help them. Lately we informed them of
some changes related to the engine that impacted them, and we tweaked
some pull requests too.

> > This is related to the discussion about versioning and how to do it in
> > a way that plays well with Go Modules. Only that we talk about actual
> > features in the code, which understandably appears as more invasive.
> >
> > I don't think we should have a strict rule to not accept any code
> > changes in Prometheus that are not directly required by it. The
> > maintainers of the affected area shold be able to use their best
> > judgement along the following criteria:
> >
> > - The change is limited in scope and doesn't negatively affect any
> > existing functionality.
> >
> > - The change helps projects that are part of the wider ecosystem (so
> > that ultimately Prometheus will benefit indirectly).
> >
> > - Maintenance is assured.

(I reordered the points)

Those 2 bullets (wider ecosystem+maintenance) will definitively play in
favour of thanos and cortex, as they have maintainers in the prometheus
core team. What will we do if a third party solution starts asking such
features too? That will be either unfair or unmaintainable. I think we
should be clear at some point about the cortex and thanos. Do we want to
make them first class prometheus remote write because they are written
in go, so can import prom, and part of the CNCF? This is quite complex.

> > - It would be much harder to implement the same change outside of the
> > Prometheus code.
> >
> >
> > Needless to say, we should always be explicit that none of those added
> > features are covered by our stability commitment for the Prometheus
> > server as a whole. And of course, we have the freedom to adjust
> > criteria and revise our decisions.

I understand that sentence but that would defeat the point. If those
feature for our importers get broken or we don't promise anything, how
can they rely on it?

> This is the sticking point for me. Should we accept something if it's not
> unlikely that either a) we're only going to break it later, as we don't use
> it and it's in the way of some new feature/optimization or b) we don't
> break it, and development of Prometheus is slowed or stalled? How would
> this actually play out in future?
>
> This is not theoretical, we've already had such things get into the TSDB
> codebase with no non-obvious-semantics, no unitests, that are in the way of
> development, but it'd be a bit impolite to just remove it.
>
>
> We have had a few places where code structure/factoring has been tweaked to
> be more amenable to external usage and that seems to have worked out fine
> so far. Most recently for example one of our importers requested that
> PromQL not depend on the config package as part of adding a new feature. I
> suspect there's a few trivial accessors around that we don't use ourselves
> too.

+1

>
> Brian
>
>
>
> > I don't buy the "slippery slope"
> > line of argument to justify a blanket ban of any of those
> > changes. Such strict rules are easy to follow but don't gain us much
> > while they can be very destructive for the ecosystem as a whole.

I agree that we probably should avoid a corthanos tsdb/promql fork, as
it looks like this is the risk.

> >
>
>
>
>
> --
> Brian Brazil
> www.robustperception.io
>
> --
> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLp%2Bz5ZLhojhQhY9zf_JtJvwrs99pdvCvYykq-PtkDFiuA%40mail.gmail.com.

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu
signature.asc

Goutham Veeramachaneni

unread,
Jan 31, 2020, 4:51:59 PM1/31/20
to Julien Pivotto, Brian Brazil, Bjoern Rabenstein, Prometheus Developers
 > This is not theoretical, we've already had such things get into the TSDB codebase with no non-obvious-semantics, no unitests, that are in the way of development, but it'd be a bit impolite to just remove it.

That is unfortunate tbh. Could you list the feature? If it doesn't have enough tests and is actually hindering dev, I am all for improving or removing it. And please call them out as you see them.

> a) we're only going to break it later, as we don't use it and it's in the way of some new feature/optimization

If it really hinders a new optimisation/feature, I'm all for breaking it. I'd rather break Cortex/Thanos than stand in the way of your sweet optimisations :) If it's an inconvenience for optimisations, I can guarantee that the Thanos/Cortex maintainers can step in remove those inconveniences. You could always remove it, do the optimisation and suggest a way to implement the feature along side the new optimisation, and I am sure the maintainers would be glad to do the extra work of reimplementing it. I know the removal might be considered extra work, but I think its a justified price to pay in the light of the community benefit.

> Lately we informed them of some changes related to the engine that impacted them, and we tweaked some pull requests too.
+++1, really appreciate it!

> What will we do if a third party solution starts asking such features too? That will be either unfair or unmaintainable.

They would need to get the approval and sign off of the maintainers for sure. I do realise that this means an additional advantage for Cortex and Thanos (given both maintainers are involved directly in one of the projects), but I'd rather have someone responsible for a piece of code. Asking a non-maintainer to be responsible doesn't work out in practice.

> If those feature for our importers get broken or we don't promise anything, how can they rely on it?
Actually, I don't think we should guarantee anything tbh. While this does give external users a subpar experience, I completely agree with upstream Prometheus. Prometheus never promises anything for those importing the internal code. And interfaces break quite frequently (we vendor ruler and it breaks _all the time_ as new features get added) and we are quite okay with it. As long as we can get our work done ;)

But we would need to fork things when we can't get our work done :/ When support for the v1 storage was removed, we just copied the code into Cortex. While I don't think Prometheus should carry v1 code around, I really would not approve of copying TSDB code in and apply patches.

> I agree that we probably should avoid a corthanos tsdb/promql fork, as it looks like this is the risk.

I just want to make it clear that don't think I'll ever approve of a fork, and neither would the Thanos maintainers. I'd rather make the changes in TSDB, and if that is not possible, I'd prefer a crappy implementation of limits or no limits to a fork. But I really hope that changes in TSDB are approved.

Thanks,
Goutham.

Brian Brazil

unread,
Feb 1, 2020, 2:44:41 AM2/1/20
to Goutham Veeramachaneni, Julien Pivotto, Bjoern Rabenstein, Prometheus Developers
On Fri, 31 Jan 2020 at 21:51, Goutham Veeramachaneni <gout...@gmail.com> wrote:
 > This is not theoretical, we've already had such things get into the TSDB codebase with no non-obvious-semantics, no unitests, that are in the way of development, but it'd be a bit impolite to just remove it.

That is unfortunate tbh. Could you list the feature? If it doesn't have enough tests and is actually hindering dev, I am all for improving or removing it. And please call them out as you see them.

It was a single function that only Thanos used, and there were already (unrelated) plans for them to move to a different way of doing things. This was one that iirc they had asked really really nicely for as there was no sane alternative.

> a) we're only going to break it later, as we don't use it and it's in the way of some new feature/optimization

If it really hinders a new optimisation/feature, I'm all for breaking it. I'd rather break Cortex/Thanos than stand in the way of your sweet optimisations :) If it's an inconvenience for optimisations, I can guarantee that the Thanos/Cortex maintainers can step in remove those inconveniences. You could always remove it, do the optimisation and suggest a way to implement the feature along side the new optimisation, and I am sure the maintainers would be glad to do the extra work of reimplementing it. I know the removal might be considered extra work, but I think its a justified price to pay in the light of the community benefit.

Here we have Thanos/Cortex not merely asking for an accessor or otherwise to make some internals public (which we're generally fine with), but to add a whole first-class feature that Prometheus itself doesn't need. Having to build up and maintain code infrastructure for a 3rd party feature is messy, and it'll also have a performance impact on Prometheus itself (we'd be adding contention on series creation).  We'd also now to implement Thanos/Cortex specific design decisions in the tsdb. What happens when a WAL replay hits the limit? What happens when a scrape hits the limit or otherwise fails? You can't rollback a series creation, only sample ingestion.

Basically I don't think adding this sort of feature to TSDB is viable purely for the sake of 3rd parties, it'd have to be justified and designed to be added to Prometheus too. Given the above implementation considerations, I don't think it's a sane feature for Prometheus itself.

Complexity isn't free. This sort of thing tends to be death by a thousand cuts, until it's no longer practical to change the TSDB as it's just too complicated to safely/sanely change (and it's already quite complicated already).

Brian

Bartłomiej Płotka

unread,
Feb 2, 2020, 11:06:35 AM2/2/20
to Brian Brazil, Goutham Veeramachaneni, Julien Pivotto, Bjoern Rabenstein, Prometheus Developers
Thanks for this discussion. I think it was already mentioned that from Thanos perspective we are relying on TSDB and other Prometheus packages heavily. We reuse certain components in order to join forces and make something better together instead of trying to implement something from scratch. So far this strategy was amazing: Both sides are maintaining the code, fixing some issues because we found it on Thanos or Cortex, and vice versa, e.g recently we gained a lot in Thanos thanks to a couple of Prometheus-focused changes optimizing block loading and compaction. Isn't that amazing? I keep saying that this is how open source should work.

Yes. It is harder to make sure TSDB solve more cases than just Prometheus, without introducing some complexity. 
Yes, we might want to add more public methods to TSDB even if Prometheus does not use it. 
And yes, it might be slower to do than just forking and doing it non-open-source way. 

But we do that because we believe that we can help each other and there are benefits for both sides: Prometheus and Prometheus Ecosystem. That's also why we are all (which we could see on FOSDEM this week!) so close, we treat each other in the community like family (: 

I totally see the problem and without being bias, I fully agree with Björn points. Thanks for those Björn! Maybe we should start some vote if we can add those to our documentation etc if that makes sense. I mean the points below:

> I don't think we should have a strict rule to not accept any code
changes in Prometheus that are not directly required by it. The
maintainers of the affected area should be able to use their best

judgement along the following criteria:

> - The change is limited in scope and doesn't negatively affect any
  existing functionality.

> - The change helps projects that are part of the wider ecosystem (so
  that ultimately Prometheus will benefit indirectly).

> - It would be much harder to implement the same change outside of the
  Prometheus code.

>- Maintenance is assured.

In the certain problem that Goutham mentioned regarding write series limits, we should follow exactly those rules: We should first start to brainstorm if there are any alternatives. If there are none, good ones, we should maybe consider 
embedding this in TSDB as per the rules above.

Idea from my side: I think we can avoid adding this to TSDB and have just a wrapper for Appender and maybe something around Cut head block, so we can count series and act upon when the limit is exceeded (error append etc). So maybe there is no problem particularly for this logic ((((: However we should be very clear and decide if such rules that Björn proposed makes sense for future cases like this.

Kind Regards,
Bartek

Krasimir Georgiev

unread,
Feb 3, 2020, 11:41:40 AM2/3/20
to Bartłomiej Płotka, Brian Brazil, Goutham Veeramachaneni, Julien Pivotto, Bjoern Rabenstein, Prometheus Developers
Nothing to add , just to say that I fully agree with Bratek's opinion.

Krasi Georgiev
Senior Software Engineer
Monitoring Team

Bjoern Rabenstein

unread,
Feb 7, 2020, 12:32:28 PM2/7/20
to Krasimir Georgiev, Bartłomiej Płotka, Brian Brazil, Goutham Veeramachaneni, Julien Pivotto, Prometheus Developers
We effectively had two discussions in one here:

First, about the concrete request for the limit feature. And second,
if we should or should not have a strict rule to accept features that
are needed by 3rd parties but not directly by Prometheus itself.

I got from the discussion that we already had precedents, some worked
out well, some didn't. The latter can inform us how to set the bar so
that future features will not bite us in the same way.

The draft criteria I listed (none of which will be black/white in
practice but they will be scales along which we can judge a proposal)
are perhaps a good first step.

But first we need to assess consensus that we will not strictly rule
out features that are needed by 3rd parties but not directly by
Prometheus itself. Many seem to agree, but Brian's mail could be read
that he doesn't.

The discussion changes considerably in character and welcomingness if
we say "we cannot accept this feature because it comes with a
performance penalty for the Prometheus server without any benefit in
return" instead of "we have a strict rule to not accept features that
only help 3rd parties".

Brian Brazil

unread,
Feb 7, 2020, 3:27:16 PM2/7/20
to Bjoern Rabenstein, Krasimir Georgiev, Bartłomiej Płotka, Goutham Veeramachaneni, Julien Pivotto, Prometheus Developers


On Fri 7 Feb 2020, 18:32 Bjoern Rabenstein, <bjo...@rabenste.in> wrote:
We effectively had two discussions in one here:

First, about the concrete request for the limit feature.  And second,
if we should or should not have a strict rule to accept features that
are needed by 3rd parties but not directly by Prometheus itself.

I got from the discussion that we already had precedents, some worked
out well, some didn't. The latter can inform us how to set the bar so
that future features will not bite us in the same way.

The draft criteria I listed (none of which will be black/white in
practice but they will be scales along which we can judge a proposal)
are perhaps a good first step.

But first we need to assess consensus that we will not strictly rule
out features that are needed by 3rd parties but not directly by
Prometheus itself. Many seem to agree, but Brian's mail could be read
that he doesn't.

I'm against things that can be described as features. What has worked out okay is tweaking how our code is factored to better suit other users. What has not worked is explicitly adding code purely for a 3rd party. I don't see a way that you can realistically say both that we accept such things but that they have no stability gaurantee.

Brian

Bjoern Rabenstein

unread,
Feb 10, 2020, 8:29:35 AM2/10/20
to Brian Brazil, Krasimir Georgiev, Bartłomiej Płotka, Goutham Veeramachaneni, Julien Pivotto, Prometheus Developers
On 07.02.20 21:27, Brian Brazil wrote:
>
> But first we need to assess consensus that we will not strictly rule
> out features that are needed by 3rd parties but not directly by
> Prometheus itself. Many seem to agree, but Brian's mail could be read
> that he doesn't.
>
> I'm against things that can be described as features. What has worked out okay
> is tweaking how our code is factored to better suit other users. What has not
> worked is explicitly adding code purely for a 3rd party. I don't see a way that
> you can realistically say both that we accept such things but that they have no
> stability gaurantee.

That means we do not have consensus. Neither on _not_ applying a strict
rule nor on _applying_ a strict rule.

Thinking about it, as long as we don't vote about it, not having a
consensus of applying a strict rule is effectively the same as having
a consensus to not apply a strict rule. Both alternatives result in
not applying a strict rule.

IIUC, the concrete case at hand seems to be solvable without adding a
feature to Prometheus after all. I'd say let's keep dealing with
future requests on a case-by-case basis and see if the various
opinions will converge naturally or if we need a more explicit ruling.

Brian Brazil

unread,
Feb 10, 2020, 12:08:03 PM2/10/20
to Bjoern Rabenstein, Krasimir Georgiev, Bartłomiej Płotka, Goutham Veeramachaneni, Julien Pivotto, Prometheus Developers
Yeah, thus far things have been reasonably obvious so an explicit ruling doesn't seem necessary and may end up being over-constraining.

If it turns out to be needed, I wouldn't be completely against some form of simple hook for this particular case - though given tsdb internals (rollbacks in particular) I don't think such a hook could be simple.

--
Reply all
Reply to author
Forward
0 new messages