rethinking metrics 2.0


Dieter Plaetinck

Oct 20, 2014, 9:59:58 AM
to metr...@googlegroups.com
Hi everybody,
After getting a bunch of feedback over the last 6 months (and after coming back to m2.0 with a fresher, more critical pair of eyes, reviewing the spec/docs I've written, and re-reading the threads here), I've realized that some of the criticism is valid and a few things need rethinking.

While the tagging and standardisation ideas enable a lot of useful features, the questions of how we define the tags, how we format our metric identifiers, etc. need to be re-evaluated. Some of the ideas (metric id = set of unordered key-value pairs only, unit tag, what tag, etc.) haven't really changed since the beginning, and:
* they are suboptimal: awkward (especially to newcomers; too much of a paradigm shift compared to graphite/opentsdb/etc., where you have a "key"), sometimes too verbose, and in need of more support for things that are not key=value pairs, such as regular words and ordering (for natural language, or to describe a logical ordering)
* I've also noticed that much of this syntax doesn't really matter towards the end goal: as long as some key tags/values ultimately get assigned so you can leverage them, it doesn't matter much how the metric was emitted and made its way into the system.

When defining how metrics should look, we should also differentiate between the ingest phase (metrics coming out of your apps and agents; we could potentially even support multiple formats/protocols here) and how they should look in the index (coming out of carbon-tagger, structured-metrics, etc., i.e. the metrics with metadata to be used for querying).

When working all this out, we often like to use a few examples in the discussions that demonstrate a certain idea, but whatever formats we come up with *have* to apply to a broad range of metrics. So I want to start a body of known, different things we want to measure, describe each, and then see how they would look according to different formats, and how well each format fares across the various cases.


So let's take a step back and first identify which properties of *indexed* metrics are useful; we can then have another look at how we format metrics on ingest and process them to enable this.

1 Interoperability: the ability to switch out different agents, aggregators or dashboards with no or minimal need to change metric names or graph definitions.
2 Self-describing: the ability to see a metric and understand exactly what it means, and for aggregation/render code to leverage metadata when consolidating or rendering. (Technically, the metric id could also be a short key that corresponds to an external piece of information describing it more fully.)
3 The ability to search/filter metrics by any of these words/tags (greatly reinforced by point 1).
4 Tag keys (server=... etc.), i.e. a named dimension for words, so that you can do 'sum by', 'group by', 'avg by', etc. (Note that in many cases auto-assigned keys such as n1, n2, n3 work well too, as long as the assignments are reliable.)
5 Automatic unit/scale conversion in visualisations: convert scale prefixes (like G to k to T), but also automatically derive/integrate (like MB to MB/day or Mbps to Mb), and both combined.
6 Dashboards knowing automatically what to put in legends (the properties that the rendered things don't have in common) vs. the y-axis label and graph title (the stuff they have in common; splitting what goes on the y-axis vs. in the title can sometimes be a bit tricky, though. For now it's unit, type and target_type on the y-axis label, and the rest in the title).
7 Finding duplicate metrics (different things sending the same metric shows that something might be wrong in your setup, or that metrics are not descriptive enough to differentiate themselves from each other). This is all about the fine art of choosing which words/tags are intrinsic, which are metadata, whether order matters for the storage key, etc.
8 Validating correctness of keys, where "validate" means more like "a UI that shows you which are non-standard" (which can be totally fine, but may give a clue about typos, bad formats, or things you should try to add to the spec). For some keys (like unit, with the current spec) we can do the same for values.
9 The ability to graph/aggregate/correlate very different metrics together by getting them from different areas of the metric space (unlike the tree model, where you're limited). Whatever syntax is used, it can be hard to pin down the exact metrics you want based on search terms, so strong what/unit/key information is useful here.
10 Metric types, although the main useful thing is that something with type=counter can be shown derived by default; that's about it. Aggregators like statsd use types to instruct which operations/statistical summaries to perform, but that's something else.
11 Expressing equivalence (the metric for all cores is equivalent to the sum of the metrics for each core).
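To make point 4 concrete, here is a minimal sketch (tag names and values are invented for illustration, not from any spec) of tagged metrics as plain key-value dicts, and the kind of 'sum by' operation that named dimensions enable:

```python
from collections import defaultdict

# Hypothetical indexed metrics: (tags, value) pairs with named dimensions.
metrics = [
    ({"what": "cpu_usage", "unit": "jiff/s", "server": "web1", "core": "0"}, 40.0),
    ({"what": "cpu_usage", "unit": "jiff/s", "server": "web1", "core": "1"}, 35.0),
    ({"what": "cpu_usage", "unit": "jiff/s", "server": "web2", "core": "0"}, 12.0),
]

def sum_by(metrics, key):
    """Aggregate metric values grouped by one tag key."""
    out = defaultdict(float)
    for tags, value in metrics:
        out[tags[key]] += value
    return dict(out)

print(sum_by(metrics, "server"))  # {'web1': 75.0, 'web2': 12.0}
```

With ordered, unnamed path segments (graphite-style), the same operation needs positional knowledge of each path; with tag keys it falls out of the data model.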


There are a lot of possible implementations for a format/protocol/spec, and a bunch of subtle implementation details.
I have some fresh ideas, and some people have suggested good stuff via earlier threads or in person.
I'll try to gather them in a follow-up post. In the meantime, feel free to let me know your thoughts.

Mark Kegel

Oct 26, 2014, 11:08:42 PM
to metr...@googlegroups.com
I've been working in metrics and monitoring for a while now, and a good solution to this problem is desperately needed. I'm using at least 5 different tools for monitoring at work right now, and none of them can talk to each other in anything more than an ad hoc fashion.

One distinction that I've found useful to make is between "about-ness" and "provenance". That is, there is a distinction between what the metrics state they are about, and where the data is actually coming from.

In the simplest cases you end up having lots of overlap. Take for example the metrics produced by an agent-based system like collectd. Those metrics are almost always about the host the agent is running on and the provenance of the data is some plugin running in that agent on that same host. 

In the general case you end up with no overlap. The metrics are about something totally independent of the host or script doing the collection.

Now this may seem a trivial point, but none of the metrics systems I've interacted with so far have this distinction encoded as a first-class abstraction in the model. Whatever system we end up with, I would want to be able to encode my "about" data independently of my "provenance".

This would allow, I think, for a much cleaner integration of agent-based and agent-less monitoring platforms. And that's a worthy goal if you want a pluggable visualization front-end to the data, because at that point you can speak about a generic data flow, knowing that any utility can integrate.
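A minimal sketch of what this distinction could look like as a first-class abstraction (the field and tag names here are hypothetical, chosen only to illustrate the idea):

```python
# Hypothetical metric model with "about" and "provenance" as separate
# first-class tag namespaces, rather than one flat bag of tags.
metric = {
    "about": {"what": "disk_used", "unit": "B", "host": "db7", "mount": "/var"},
    "provenance": {"agent": "collectd", "plugin": "df", "collector": "mon1"},
    "value": 1.2e9,
}

def same_subject(m1, m2):
    """Two metrics describe the same thing iff their 'about' tags match,
    regardless of where (or how) the data was collected."""
    return m1["about"] == m2["about"]
```

With this split, switching from an agent-based to an agent-less collector changes only the provenance namespace; dashboards and dedup logic keyed on "about" keep working unchanged.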

A second distinction that I've found useful is that between the levels of abstraction between "data model" and "protocol". What I've read of metrics 2.0 so far makes me think "data model". I think it might help to jump up a level abstraction to "protocol", so that we ask "what do we need to be able to say about metrics?", rather than "how do we say the perfect thing in the simplest way about metrics?".

Current systems like collectd/graphite put no lifecycle around metrics. You just POST data to some server and it's written. There's maybe a limited notion of metadata, and no notion of the data producer being able to say "hey, these metrics won't be showing up again for a while" or "I'm shutting down; these metrics will be coming from this other source in the future".
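As a sketch of what "protocol" rather than "data model" could mean, here is a hypothetical set of lifecycle messages alongside plain data points (all message names are invented for illustration; no existing system defines these):

```python
import json

# Hypothetical wire messages: a metric series has a lifecycle, not just points.

def datapoint(metric_id, ts, value):
    """A plain data point, as in today's systems."""
    return json.dumps({"op": "point", "id": metric_id, "ts": ts, "value": value})

def suspend(metric_id, resume_hint=None):
    """Producer says: this series will go quiet for a while (not a failure)."""
    return json.dumps({"op": "suspend", "id": metric_id, "resume_hint": resume_hint})

def handoff(metric_id, new_source):
    """Producer is shutting down; another source takes over this series."""
    return json.dumps({"op": "handoff", "id": metric_id, "new_source": new_source})
```

A receiver that understands `suspend` can distinguish "intentionally quiet" from "broken", which is exactly the gap in today's fire-and-forget POST model.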

Food for thought at any rate.

Sam Zaydel

Apr 28, 2015, 5:50:41 PM
to metr...@googlegroups.com
I think there are a few things that benefit from indexing.
  • Tying metrics together by some common tags or identifiers could support proper correlation across systems; I think the need for this is greater today than ever, and not going away any time soon.
  • They should be translatable to representations like JSON, BSON, etc. with a minimum of effort, because more and more we are seeing databases use these formats for data storage and acquisition.
  • Some way of representing common data structures, like an array of tags, is quite convenient and again supports representations like JSON, BSON, etc.
  • Incorporate the unit of scale into the metadata in some way, e.g. latency#ms=50 to mean a latency of 50ms.
  • The ability to represent a hash, something like buffers:buf[4096]=10, buffers:buf[1024]=30, where buf is effectively a mapping. This is common for representing data with discrete intervals, such as 1MB, 10MB, or for multiple instances of something, like CPU cores for example.
  • Incorporation of type, e.g. to suggest that datapoints cannot be negative numbers, or to give a hint about proper representation, like int vs. double.
  • If data points are already summary statistics produced at their source, making this information known seems useful.
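The latency#ms=50 notation above is a proposal, not part of any current spec, but a minimal parser for it is easy to sketch (field names in the result dict are assumptions):

```python
# Parse the proposed "name#unit=value" notation, e.g. "latency#ms=50".
# The '#unit' part is optional; without it, unit is reported as None.
def parse_metric(s):
    name_unit, _, value = s.partition("=")
    name, _, unit = name_unit.partition("#")
    return {"what": name, "unit": unit or None, "value": float(value)}

print(parse_metric("latency#ms=50"))
# {'what': 'latency', 'unit': 'ms', 'value': 50.0}
```

Carrying the unit in the identifier like this would feed directly into the automatic unit/scale conversion Dieter lists as property 5 above.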

Dieter Plaetinck

Aug 24, 2015, 4:38:14 AM
to metrics2.0
> Ability to represent a hash, something like buffers:buf[4096]=10, buffers:buf[1024]=30, where buf is effectively a mapping. This is common for representing data with discrete intervals, such as 1MB, 10MB, or for multiple instances of something, like CPU cores for example.

Can you give an example of this?
I've been using core=1, core=2, etc. tags for this.

Sam Zaydel

Aug 24, 2015, 8:38:02 AM
to Dieter Plaetinck, metrics2.0
Think about things like IO size: systems typically issue IOs in some range of sizes. So if you are trying, for example, to summarize the last second as a single key-value pair, where the value is, say, latency, you have a number of IOs that were perhaps 128k, 4k, 16k, etc. It is useful, in my view, to be able to communicate a breakdown of the count of IOs by size for each metric sample. This gives you context about how many IOs of what size occurred that resulted in the given latency value.
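A minimal sketch of what such a sample could look like as a JSON-friendly structure (all field names here are hypothetical; this is the idea, not a spec):

```python
# One sample: a latency value plus a per-IO-size count breakdown,
# so the consumer knows which IO mix produced that latency.
sample = {
    "what": "io_latency",
    "unit": "ms",
    "value": 4.2,              # latency over the last second
    "breakdown": {             # count of IOs, keyed by IO size in bytes
        "4096": 130,
        "16384": 22,
        "131072": 3,
    },
}

total_ios = sum(sample["breakdown"].values())
print(total_ios)  # 155
```

The alternative Dieter mentions (a separate series per size, tagged size=4096 etc.) carries the same information but moves the grouping from the sample into the index.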

Does this make sense? I feel like I am maybe digging a hole. ;)

Thank you, Sam.


--
Join the geek side, we have π!

Please feel free to connect with me on LinkedIn. http://www.linkedin.com/in/samzaydel

Dieter Plaetinck

Aug 24, 2015, 11:10:16 AM
to Sam Zaydel, metrics2.0
I don't understand this.. I tried reading it 4 times, sorry.
What do you mean by "last second as a single key value pair"? Key and value being what? A metric tag? Why a single pair?