Feedback requested on Services Metrics proposal

Rob Miller

Oct 13, 2011, 2:47:14 PM
to servic...@mozilla.org, Michael Hanson, Ben Adida
Hi everyone,

For the last couple of weeks I've been working on both a proposed spec
and some prototype code for the Services Metrics infrastructure. I
spoke to a number of you ahead of time to get feedback and to get an
idea about the use cases that are out there, and I *think* we've got a
proposed setup that is flexible enough to meet all of the needs I've
encountered (and plenty that I haven't yet), but which will still be
efficient and not be an undue burden on the poor app developers who
will want to use it.

The spec lives here:

https://wiki.mozilla.org/Services/Metrics

There's also some (nascent, rapidly changing) prototype code
up. I've been working on the Python client:

https://github.com/mozilla-services/metlog

And Victor Ng has been working on a logstash module that will receive
the messages that the client sends out, decide what back end(s) they
should arrive at, and deliver them as appropriate:

https://github.com/crankycoder/logstash-metlog

The code isn't so important right now; I just posted it in case you
wanted to peek. But I *would* be very grateful if you could take a
minute to look over the proposed spec and architecture, to see if it
would seem to meet the metrics and reporting needs that you might have.
Also, if you have any specific use cases that are worth mentioning,
please let me know; I'd love to add a few more to the examples that are
already listed.

Feedback would ideally go to services-dev, but sending it directly to me
is also acceptable.

Thanks!

-r
_______________________________________________
Services-dev mailing list
Servic...@mozilla.org
https://mail.mozilla.org/listinfo/services-dev

JR Conlin

Oct 13, 2011, 4:03:05 PM
to servic...@mozilla.org
Looks good. Here's my feedback:

1. In a way the BrowserID stuff is a blessing because it means we can't
presume a single platform or core. For instance, I'm now working on a
component that is SMTP (and can't really use pyramid). Having alternate
methods to submit data is key. In effect: don't have a library, have an API.

2. I'm curious about what metadata actually contains. I thought it was
additional elements of data associated with the error (say stack trace
or user token or something). Does set_message_flavor() simply set the
core dictionary to use with later calls merging data in, or is it
something simpler like just the values to use for subsequent calls (with
later metadata overwriting without a merge)?

I don't want to seem dense about that, but I got burned by making the
wrong guess in the past.

Gregory Szorc

Oct 13, 2011, 4:17:02 PM
to servic...@mozilla.org, mha...@mozilla.com
Things I like:

* You are transporting JSON. No burden on parties to parse/decode each
message (the embedded string "message" field is an exception - see below).
* Metadata map/dict is structured and easily machine readable
* Built-in support for counter increment as metrics
* The timer decorator is pretty sweet

Things I would like to see:

* More details on the ZeroMQ message "protocol." It appears you are
using PUB-SUB and transferring JSON in sends consisting of single
messages. Please document.
* What about ZeroMQ delivery reliability? HWM on sender? How does it
handle durable subscribers (socket identities), where 0MQ indefinitely
buffers messages on the sender, leading to memory bloat? (Hint: you
probably want to explicitly disallow durable sockets, which
unfortunately isn't supported by the 0MQ API right now.)
* Does the metadata map support typing, or is everything normalized to a
string? (Certainly the Python and JSON support typing, but syslog, which
it is based upon, only supports string values, I /think/.)
* In addition to counter increment, what about an API to support gauges
(absolute numeric values)?
* Encouraging people to use structured metadata over string messages.
Personally, I favor this approach so downstream systems don't need to
roll their own text parsers. (Why convert structured data to
unstructured text and back again if your main consumer is machines?) One
downside is you lose some human readability. Though, I suppose you could
put data in both a human-readable string and metadata, at a (hopefully
marginal) cost of encoding size, overhead for producing, transporting,
and decoding. If you wanted to be clever, each message "type" could have
a printf-like formatter to turn the metadata into a human-readable string.

Things to consider:

* What about making a message type/label required? By having a required
label, you remove guesswork to infer a message type from other fields.
Downstream systems could key/index off of it easily, without having to
iterate through metadata. You could even (later) maintain a catalog of
the defined message types and their fields (sources of truth are nice
and allow validation, etc). The label could also be sent as a 0MQ
"envelope" message, allowing SUB sockets to subscribe to only the events
they care about. This would save intermediary agents from having to
deserialize the JSON payload to determine if they even care about the
message, a nice performance win.
* Regarding timers, I really like the concept but wonder if it is ideal
to release one timer event instead of 2 distinct "start" and "end"
events, as they are produced. The reason I like distinct "point" events
is that they are emitted immediately and thus can be consumed by near
real-time monitors for more rapid reaction, potentially while an event
is in-flight. They also have benefits for replaying a system's behavior
directly from the log stream, without needing to muck with the message
timeline. You also don't have to maintain as much state in the producer,
just an ID so a downstream system can correlate the start and end events
for a particular pair. The downside is downstream systems need to pair
up the events and there is overhead of an extra event emitted. But, I
think the benefits are compelling.

Overall, this is looking pretty nifty!

JR Conlin

Oct 13, 2011, 4:56:40 PM
to Gregory Szorc, mha...@mozilla.com, servic...@mozilla.org

* At Netflix they sent both the start time and the stop + clock time. Why both?
The problem they encountered was that threads would be killed mid-process, leading some reporting tools to report some "infinitely" long cycles. Having the reporter calc the time difference also saved a few cycles during reporting. I had guessed that was what was being done here, but I'm usually wrong.

* Using structured metadata is good, the problem is that there can be significant "drift" about what goes into that data and how it's used. Granted, it's worse with plain text, but (sadly) that is often "worked around" by teams injecting things like control character delimiters.

Still, +1

Rob Miller

Oct 13, 2011, 5:09:00 PM
to servic...@mozilla.org
On 10/13/11 1:03 PM, JR Conlin wrote:
> Looks good. Here's my feedback:
>
> 1. In a way the BrowserID stuff is a blessing because it means we can't
> presume a single platform or core. For instance, I'm now working on a
> component that is SMTP (and can't really use pyramid). Having alternate
> methods to submit data is key. In effect: don't have a library, have an
> API.

Yes, that's the idea. Except that, when it comes to the volume of
messages that we're likely to generate, HTTP isn't really a great
transport. So we went with something more well suited: JSON messages
sent over ZeroMQ. We will provide some libraries to make it easier for
folks to package up their messages and send them along, but if you want
to do it yourself you can; it's very easy.
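
(For illustration, a minimal sketch of what "doing it yourself" might look
like in Python with pyzmq; the endpoint address and all of the field values
below are placeholders, not part of the spec.)

import json
import time

import zmq

ENDPOINT = "tcp://127.0.0.1:5565"  # hypothetical listener address

context = zmq.Context()
sock = context.socket(zmq.PUB)
sock.connect(ENDPOINT)
time.sleep(0.1)  # give the PUB/SUB connection a moment to establish

# Package up a message and send it as a single JSON string.
message = {
    "logger": "myapp",
    "severity": 6,
    "metadata": {"type": "counter", "name": "logins"},
    "message": "1",
}
sock.send_string(json.dumps(message))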

> 2. I'm curious about what metadata actually contains. I thought it was
> additional elements of data associated with the error (say stack trace
> or user token or something). Does set_message_flavor() simply set the
> core dictionary to use with later calls merging data in, or is it
> something simpler like just the values to use for subsequent calls (with
> later metadata overwriting without a merge)?

Actually, in my mind, the "additional elements" you mention would
probably be part of the message itself; I see those as parts of the
primary message payload.

Metadata, on the other hand, is information about the message meant to
be used by the message routing system. The main thing that will be in
there at first will be message "type", a string token that tells the
system whether this is a timer, a counter, an error, or some other
message type. The message routing piece (logstash is what we're
currently exploring) will examine this and decide which back end(s) the
message should be delivered to, w/o having to dig into the main message
payload.
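
(A rough sketch of that routing step, written in Python rather than
logstash's own configuration; the back-end names and the exact field layout
are placeholders.)

import json

ROUTES = {
    "timer": ["statsd"],
    "counter": ["statsd"],
    "error": ["sentry", "hadoop"],
}
CATCH_ALL = ["hadoop"]

def back_ends_for(raw):
    # Deserialize the message and pick back end(s) from its 'type' metadata.
    msg = json.loads(raw)
    msg_type = msg.get("metadata", {}).get("type")
    return ROUTES.get(msg_type, CATCH_ALL)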

"set_message_flavor" is just shorthand for setting specific sets of
metadata values that you're going to use again and again, so you don't
have to do it by hand. Say you have a particular type of error message
that you want to be delivered to both a sentry back end AND the Metrics
team's hadoop rig. Your metadata might look like this:

{'type': 'error',
'hadoop': 'true'}

You could register this as a flavor using
"set_message_flavor('hadoop_error', metadata)" where 'metadata' is the
above dictionary. Then later calls to metlog(msg,
flavors=['hadoop_error']) would do this for you.
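
(To make the mechanics concrete, here is a hedged sketch of one way a client
could implement flavors. Whether per-call metadata merges with or overwrites
the flavor values, as JR asked, is a design choice; the real metlog code may
behave differently.)

class FlavorRegistry:
    # Illustrative only; not the actual metlog implementation.

    def __init__(self):
        self._flavors = {}

    def set_message_flavor(self, name, metadata):
        # Remember a reusable bundle of metadata values under a name.
        self._flavors[name] = dict(metadata)

    def metadata_for(self, flavors=(), extra=None):
        # Merge the named flavors in order, then per-call metadata on top.
        merged = {}
        for name in flavors:
            merged.update(self._flavors[name])
        merged.update(extra or {})
        return merged

registry = FlavorRegistry()
registry.set_message_flavor("hadoop_error", {"type": "error", "hadoop": "true"})
registry.metadata_for(flavors=["hadoop_error"])
# -> {"type": "error", "hadoop": "true"}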

> I don't want to seem dense about that, but I got burned by making the
> wrong guess in the past.

Hope this helps.

-r

JR Conlin

Oct 13, 2011, 5:32:14 PM
to servic...@mozilla.org
On 10/13/2011 2:09 PM, Rob Miller wrote:
> On 10/13/11 1:03 PM, JR Conlin wrote:
>> Looks good. Here's my feedback:
>>
>> 1. In a way the BrowserID stuff is a blessing because it means we can't
>> presume a single platform or core. For instance, I'm now working on a
>> component that is SMTP (and can't really use pyramid). Having alternate
>> methods to submit data is key. In effect: don't have a library, have an
>> API.
>
> Yes, that's the idea. Except that, when it comes to the volume of
> messages that we're likely to generate, HTTP isn't really a great
> transport. So we went with something more well suited: JSON messages
> sent over ZeroMQ. We will provide some libraries to make it easier
> for folks to package up their messages and send it along, but if you
> want to do it yourself you can, it's very easy.

Quite. API != HTTP. I'm happy to use 0mq, named sockets, or any other
transport mechanism, so long as the interface is predictable and
reliable. HTTP is just another protocol.

Ah, I see. That does clarify things, and it also really drives home
Greg's point about having the message be structured data itself.

Rob Miller

Oct 13, 2011, 5:52:20 PM
to servic...@mozilla.org
Thanks for the feedback! More below..

On 10/13/11 1:17 PM, Gregory Szorc wrote:
> Things I like:
>
> * You are transporting JSON. No burden on parties to parse/decode each
> message (the embedded string "message" field is an exception - see below).
> * Metadata map/dict is structured and easily machine readable
> * Built-in support for counter increment as metrics
> * The timer decorator is pretty sweet
>
> Things I would like to see:
>
> * More details on the ZeroMQ message "protocol." It appears you are
> using PUB-SUB and transferring JSON in sends consisting of single
> messages. Please document.
> * What about ZeroMQ delivery reliability? HWM on sender? How does it
> handle durable subscribers (socket identities), where 0MQ indefinitely
> buffers messages on the sender, leading to memory bloat? (Hint: you
> probably want to explicitly disallow durable sockets, which
> unfortunately isn't supported by the 0MQ API right now.)

Both of these are a function of the fact that we just decided on Tuesday
to use ZeroMQ, and this is the first time I've ever used it, so we're
still figuring out exactly how we want to use it and what the best
choices are w.r.t. fine tuning, etc. Your tips here are useful, and
once we've hammered out further details I'll be sure to update the spec
to include those.

> * Does the metadata map support typing, or is everything normalized to a
> string? (Certainly the Python and JSON support typing, but syslog, which
> it is based upon, only supports string values, I /think/.)

syslog is an inspiration, but it's not what underlies what we're doing,
so we're not limited to that. Even so, I haven't been inclined to use
rich typing here, b/c we do want to speak to the lowest common
denominator, and it's not clear that typing support will gain us that
much. If we end up w/ some concrete use cases where typed metadata is
useful, I'm not categorically against it, but I've only been thinking in
strings thus far.

> * In addition to counter increment, what about an API to support gauges
> (absolute numeric values)?

This could be done. Setting an absolute numeric value isn't anything
special, though; it doesn't require any cooperation from the back end in
the same way that a counter does. So I'd just call that a regular
message of a specific type.
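
(As a sketch of that, a gauge could be sent as an ordinary message whose
type is, say, "gauge"; the field names and values here are illustrative
only.)

import json
import time

def gauge_message(name, value, logger="myapp"):
    # A gauge is just a regular message with an absolute value as the payload.
    return json.dumps({
        "logger": logger,
        "severity": 6,
        "metadata": {"type": "gauge", "name": name},
        "message": str(value),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
    })

gauge_message("open_connections", 42)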

> * Encouraging people to use structured metadata over string messages.
> Personally, I favor this approach so downstream systems don't need to
> roll their own text parsers. (Why convert structured data to
> unstructured text and back again if your main consumer is machines?) One
> downside is you lose some human readability. Though, I suppose you could
> put data in both a human-readable string and metadata, at a (hopefully
> marginal) cost of encoding size, overhead for producing, transporting,
> and decoding. If you wanted to be clever, each message "type" could have
> a printf-like formatter to turn the metadata into a human-readable string.

Most of the messages we generate will be fairly structured. I'm not
sure about "encouraging" people to do one thing or another, though. The
idea here is that the message transport doesn't care, it sends the
messages through to a back end appropriate to the message type. The
metadata that our transport relies upon WILL be structured, that's the
'metadata' piece in the metlog calls. But maybe I'm missing something
in your suggestion.

> Things to consider:
>
> * What about making a message type/label required? By having a required
> label, you remove guesswork to infer a message type from other fields.
> Downstream systems could key/index off of it easily, without having to
> iterate through metadata.

I've considered this, and I could yet be convinced. 'type' is
definitely the main thing landing in the metadata right now. I've
thought that maybe arbitrary metadata, and the idea of "flavors" encoded
therein, is maybe too much. Perhaps we should just scrap that
altogether, add a 'type' field, and be done with it. But I keep having
this nagging feeling in the back of my mind that a single message "type"
might not be rich enough to categorize all of the messages that we're
going to deal with, and an open (but still structured) way to decorate
these messages ensures we haven't painted ourselves into any corners.

This is part of why I want to gather more use cases, to see if this
flexibility will actually be useful, or if it's just cognitive burden
that we're not going to need.

> You could even (later) maintain a catalog of
> the defined message types and their fields (sources of truth are nice
> and allow validation, etc). The label could also be sent as a 0MQ
> "envelope" message, allowing SUB sockets to subscribe to only the events
> they care about. This would save intermediary agents from having to
> deserialize the JSON payload to determine if they even care about the
> message, a nice performance win.

The building of a corpus of message types is definitely part of the
plan. And stuffing the metadata into a ZeroMQ message envelope is a
good idea; we'll look into that.
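
(A sketch of that envelope idea with pyzmq: send the type as the first frame
of a multipart message so SUB sockets can filter on it without touching the
JSON. The address and field names are placeholders.)

import json
import zmq

context = zmq.Context()
pub = context.socket(zmq.PUB)
pub.connect("tcp://127.0.0.1:5565")  # hypothetical router address

# Frame 1 is the message type, frame 2 is the JSON payload.
payload = json.dumps({"logger": "myapp", "metadata": {"name": "db_query"}})
pub.send_multipart([b"timer", payload.encode("utf-8")])

# A subscriber that only cares about timers would do:
#   sub.setsockopt(zmq.SUBSCRIBE, b"timer")
# and messages whose first frame doesn't start with that prefix are
# filtered out without the JSON ever being deserialized.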

> * Regarding timers, I really like the concept but wonder if it is ideal
> to release one timer event instead of 2 distinct "start" and "end"
> events, as they are produced. The reason I like distinct "point" events
> is that they are emitted immediately and thus can be consumed by near
> real-time monitors for more rapid reaction, potentially while an event
> is in-flight. They also have benefits for replaying a system's behavior
> directly from the log stream, without needing to muck with the message
> timeline. You also don't have to maintain as much state in the producer,
> just an ID so a downstream system can correlate the start and end events
> for a particular pair. The downside is downstream systems need to pair
> up the events and there is overhead of an extra event emitted. But, I
> think the benefits are compelling.

The timer and counter APIs are actually informed directly by statsd, and
in fact most of the Timer "contextdecorator" code is stolen directly
from jsocol's pystatsd client. We want to keep the current semantics so
we can feed this data into a statsd back end w/o having to do any
massaging along the way.
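
(A hedged illustration of the kind of decorator being described; this is not
the actual metlog/pystatsd code, and `emit` stands in for whatever call ends
up sending the timer message.)

import functools
import time

def timed(name, emit):
    # Wrap a function, measure elapsed wall-clock time in ms, and emit it.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = int((time.time() - start) * 1000)
                emit({"type": "timer", "name": name, "elapsed_ms": elapsed_ms})
        return wrapper
    return decorator

@timed("db_query", emit=print)
def run_query():
    time.sleep(0.01)

run_query()  # emits something like {'type': 'timer', 'name': 'db_query', 'elapsed_ms': 10}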

That being said, there's nothing at all preventing you from generating
timer stop and start events like you've described. It'd just be a
matter of defining a message type (or possibly two) and sending it
through w/ the right timestamps. We could provide some helper code for
this, but, again, I want actual usage patterns to drive what sorts of
tweaks and helpers we add. I'll keep this one in mind, though, in case
it becomes clear it'd be well used.

> Overall, this is looking pretty nifty!

Thanks! And thanks for the feedback...

-r

Rob Miller

Oct 13, 2011, 5:58:30 PM
to servic...@mozilla.org
On 10/13/11 2:32 PM, JR Conlin wrote:
> On 10/13/2011 2:09 PM, Rob Miller wrote:
>> On 10/13/11 1:03 PM, JR Conlin wrote:
>>> Looks good. Here's my feedback:
>>>
>>> 1. In a way the BrowserID stuff is a blessing because it means we can't
>>> presume a single platform or core. For instance, I'm now working on a
>>> component that is SMTP (and can't really use pyramid). Having alternate
>>> methods to submit data is key. In effect: don't have a library, have an
>>> API.
>>
>> Yes, that's the idea. Except that, when it comes to the volume of
>> messages that we're likely to generate, HTTP isn't really a great
>> transport. So we went with something more well suited: JSON messages
>> sent over ZeroMQ. We will provide some libraries to make it easier for
>> folks to package up their messages and send it along, but if you want
>> to do it yourself you can, it's very easy.
>
> Quite. API != HTTP. I'm happy to use 0mq, named sockets, or any other
> transport mechanism, so long as the interface is predictable and
> reliable. HTTP is just another protocol.

Yup, it will be. It's not well specified yet b/c we're still working
out the details, but eventually it will be documented well enough that
anyone w/ reasonable skills should be able to write either a client
(i.e. a message publisher) or a listener (i.e. a message subscriber).

Indeed. Most of our messages will be. And as we work through this some
more, if specifying a message structure makes sense, we'll definitely do it.

Gregory Szorc

Oct 13, 2011, 10:23:18 PM
to servic...@mozilla.org
On 10/13/11 2:52 PM, Rob Miller wrote:
>> Things to consider:
>>
>> * What about making a message type/label required? By having a required
>> label, you remove guesswork to infer a message type from other fields.
>> Downstream systems could key/index off of it easily, without having to
>> iterate through metadata.
>
> I've considered this, and I could yet be convinced. 'type' is
> definitely the main thing landing in the metadata right now. I've
> thought that maybe arbitrary metadata, and the idea of "flavors"
> encoded therein, is maybe too much. Perhaps we should just scrap that
> altogether, add a 'type' field, and be done with it. But I keep
> having this nagging feeling in the back of my mind that a single
> message "type" might not be rich enough to categorize all of the
> messages that we're going to deal with, and an open (but still
> structured) way to decorate these messages ensures we haven't painted
> ourselves into any corners.
Since it sounds like the structure is still up for debate, let me put
something out there. My opinion is an emitted message should consist of
"envelope" and "payload" parts. The envelope contains metadata (some may
be optional) that applies to most, if not all messages (time, logger,
severity, type, labels, maybe arbitrary key-value pairs, etc). The
payload can be a string, a set of key-value pairs, etc. The envelope and
payload fields could all be encoded together, interleaved inside a
single JSON object. The important part is they don't overlap. This is
slightly different from the existing implementation, where the
"metadata" map blurs the distinction between the two.

When you look at the structure this way, "type" and a payload for a
specific type would agree. If "type" itself is not sufficient for your
expressive/routing/transport/client needs, then some tuple of *envelope*
fields should be (e.g. (logger, type), (appID, type), (type, labels), etc).

Richard Newman

Oct 13, 2011, 11:13:48 PM
to servic...@mozilla.org, Mike Hanson
(Randomly diving in to parts of this. Sorry!)

> * Regarding timers, I really like the concept but wonder if it is ideal to release one timer event instead of 2 distinct "start" and "end" events, as they are produced. The reason I like distinct "point" events is that they are emitted immediately and thus can be consumed by near real-time monitors for more rapid reaction, potentially while an event is in-flight. They also have benefits for replaying a system's behavior directly from the log stream, without needing to muck with the message timeline. You also don't have to maintain as much state in the producer, just an ID so a downstream system can correlate the start and end events for a particular pair. The downside is downstream systems need to pair up the events and there is overhead of an extra event emitted. But, I think the benefits are compelling.

I concur. To add to Greg's thought:

If the timer's enclosed code doesn't terminate within a reasonable amount of time, logging explicit start and end pairs (or just the start!) is waaay more useful than waiting to log until the code finishes!

The term we used at Tellme for aggregating raw log events into more structured data was "sessionizing". Start and end correlation was one part of this. I presume that our metrics infrastructure has some similar capability for arbitrary stream processing of events. As you approach this problem of generating, delivering, and storing these raw log messages, it's worth thinking about the inevitable analysis layer that goes on top.

The concept of a start and end is much more general than "time this block of code". What about logging the various events submitted during the J-PAKE flow, for example? No single block of code to time, and each part might want to be analyzed separately and in aggregate. From this I infer that analysis will end up having to do some kind of event interpretation regardless, so granular events emitted from `timer` aren't really much more costly than a single event.
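
(A sketch of that paired-event idea: emit "start" and "finish" as separate
point events sharing a correlation ID so a downstream system can pair them
up later. The field names are illustrative, not part of the proposal.)

import json
import time
import uuid

def point_event(stage, flow_id, emit):
    # One self-contained event; downstream systems pair them up by flow_id.
    emit(json.dumps({
        "metadata": {"type": stage, "flow_id": flow_id},
        "timestamp": time.time(),
    }))

flow_id = str(uuid.uuid4())
point_event("start", flow_id, emit=print)
# ... the multi-step flow (e.g. J-PAKE) runs here ...
point_event("finish", flow_id, emit=print)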

Richard Newman

Oct 13, 2011, 11:30:49 PM
to servic...@mozilla.org
> Since it sounds like the structure is still up for debate, let me put something out there. My opinion is an emitted message should consist of "envelope" and "payload" parts. The envelope contains metadata (some may be optional) that applies to most, if not all messages (time, logger, severity, type, labels, maybe arbitrary key-value pairs, etc). The payload can be a string, a set of key-value pairs, etc. The envelope and payload fields could all be encoded together, interleaved inside a single JSON object. The important part is they don't overlap. This is slightly different from the existing implementation, where the "metadata" map blurs the distinction between the two.

Again with the concurring!

The doc already mentions routing:

It will deserialize these messages and examine the metadata to determine the appropriate back end(s) to which each message should be delivered.

so I think it's broadly worthwhile to put a fixed set of routing information — including type — in the envelope. If that means some payloads will be empty, then great! Having concrete examples of data will make determining the envelope contents much easier, of course.


While I'm on the topic: at some point in the future, the format of these log messages, or the required arguments to various bits, is going to change. That might well be an incompatible change. This doesn't have to be directly addressed right now, but it's worth thinking about or mentioning how you'll do protocol or payload versioning. For example, will you simply make sure that an incompatible change involves logging to a different listener, with each listener understanding just its own format? What about API changes?


This might have some impact to the API design. E.g., when I see

MetlogClient(host, port, logger="", severity=6)

I think "how am I going to write a client that streams to a file rather than a host and port?", or "where are we going to put the argument for the separate control socket port?".

At that point I wonder why this stuff is in the constructor anyway, rather than coming from a config or being directly set as attributes:

let logClient = new MetlogClient(); // Does message processing.
logClient.processor = new Metlog0MQProcessor(host, port); // Handles delivery.

// Set up the client.
logClient.severity = 6;
logClient.flavor["vanilla"] = {
horse: "cart"
};

Anyway, just something to think about, and an example JS API ;)

-R

Rob Miller

Oct 14, 2011, 2:41:29 PM
to servic...@mozilla.org

Actually, what you've described is almost precisely what I intended with
the current proposal. What I call "metadata" is what you called
"arbitrary key-value pairs". The "message" part is the payload, and it
can be a string, JSON, or anything that can be serialized and sent over
the wire.

The only difference between what you described and what I've proposed,
AFAICT, is that you pulled "type" and "labels" out into top level pieces
of the envelope, while I imagined them as embedded within the arbitrary
key-value pairs.

Maybe the issue here is just the naming. For some reason everybody
seems thrown off by "metadata" as the field name, thinking it means
something other than it does. Anyone have any better suggestions?

> When you look at the structure this way, "type" and a payload for a
> specific type would agree. If "type" itself is not sufficient for your
> expressive/routing/transport/client needs, then some tuple of *envelope*
> fields should be (e.g. (logger, type), (appID, type), (type, labels), etc).

Again, this is exactly what I had in mind, except that any key-value
pair in the metadata could also be used as part of the equation.

-r

Rob Miller

Oct 14, 2011, 3:04:07 PM
to servic...@mozilla.org
On 10/13/11 8:13 PM, Richard Newman wrote:
> (Randomly diving in to parts of this. Sorry!)
>
>> * Regarding timers, I really like the concept but wonder if it is ideal to release one timer event instead of 2 distinct "start" and "end" events, as they are produced. The reason I like distinct "point" events is that they are emitted immediately and thus can be consumed by near real-time monitors for more rapid reaction, potentially while an event is in-flight. They also have benefits for replaying a system's behavior directly from the log stream, without needing to muck with the message timeline. You also don't have to maintain as much state in the producer, just an ID so a downstream system can correlate the start and end events for a particular pair. The downside is downstream systems need to pair up the events and there is overhead of an extra event emitted. But, I think the benefits are compelling.
>
> I concur. To add to Greg's thought:
>
> If the timer's enclosed code doesn't terminate within a reasonable amount of time, logging explicit start and end pairs (or just the start!) is waaay more useful than waiting to log until the code finishes!
>
> The term we used at Tellme for aggregating raw log events into more structured data was "sessionizing". Start and end correlation was one part of this. I presume that our metrics infrastructure has some similar capability for arbitrary stream processing of events. As you approach this problem of generating, delivering, and storing these raw log messages, it's worth thinking about the inevitable analysis layer that goes on top.

This is a great point. As described in the proposal, so far we've
identified 3 concrete back ends:

- statsd (for counter and timer events)
- sentry (for errors)
- bagheera / hadoop (for everything else)

The idea was that hadoop gives us the ability to write arbitrary
map-reduce jobs or Hive queries to do any sort of analysis that we want.
And, as real world usage informs our needs, we might add more back
ends, or we might decide to structure some of our messages in a way that
makes the after-the-fact analysis we want to do a bit easier.

Any ideas anyone has for additional back ends that we know we're likely
to want, or for additional specific message types that might inform how
we structure things, please let me know.

> The concept of a start and end is much more general than "time this block of code". What about logging the various events submitted during the J-PAKE flow, for example? No single block of code to time, and each part might want to be analyzed separately and in aggregate. From this I infer that analysis will end up having to do some kind of event interpretation regardless, so granular events emitted from `timer` aren't really much more costly than a single event.

Right. And there's absolutely nothing preventing us from sending
"start" and "end" events, and using them to do any kind of
post-processing we want to do. It's just that there's nothing
particularly noteworthy about these events... they'd just be regular
"metlog" calls, with type values of "start" and "finish", or similar. I
don't see a need for syntactic sugar.

"Time this piece of code" IS given some syntactic sugar, however,
because a) it's something that is often requested, b) setting up
threadsafe timers is non-trivial, and c) statsd explicitly supports
messages that say "this activity took N ms to complete", and from those
messages it will automatically generate 90th percentile, average, and
lower and upper bounds for the duration value.
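
(Roughly the kind of summary statsd derives from those "took N ms" messages;
this is an illustration of the arithmetic, not statsd's actual code.)

def summarize(durations_ms):
    values = sorted(durations_ms)
    # index of the sample below which ~90% of the values fall
    idx = max(int(round(len(values) * 0.9)) - 1, 0)
    return {
        "lower": values[0],
        "upper": values[-1],
        "mean": sum(values) / float(len(values)),
        "upper_90": values[idx],
    }

summarize([12, 15, 9, 31, 22, 14, 11, 17, 28, 19])
# -> {'lower': 9, 'upper': 31, 'mean': 17.8, 'upper_90': 28}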

All in all, I'd say we're in violent agreement. :)

-r

Mike Connor

Oct 14, 2011, 3:14:07 PM
to Rob Miller, servic...@mozilla.org

On 2011-10-14, at 3:04 PM, Rob Miller wrote:

> On 10/13/11 8:13 PM, Richard Newman wrote:
>> (Randomly diving in to parts of this. Sorry!)
>>
>>> * Regarding timers, I really like the concept but wonder if it is ideal to release one timer event instead of 2 distinct "start" and "end" events, as they are produced. The reason I like distinct "point" events is that they are emitted immediately and thus can be consumed by near real-time monitors for more rapid reaction, potentially while an event is in-flight. They also have benefits for replaying a system's behavior directly from the log stream, without needing to muck with the message timeline. You also don't have to maintain as much state in the producer, just an ID so a downstream system can correlate the start and end events for a particular pair. The downside is downstream systems need to pair up the events and there is overhead of an extra event emitted. But, I think the benefits are compelling.
>>
>> I concur. To add to Greg's thought:
>>
>> If the timer's enclosed code doesn't terminate within a reasonable amount of time, logging explicit start and end pairs (or just the start!) is waaay more useful than waiting to log until the code finishes!
>>
>> The term we used at Tellme for aggregating raw log events into more structured data was "sessionizing". Start and end correlation was one part of this. I presume that our metrics infrastructure has some similar capability for arbitrary stream processing of events. As you approach this problem of generating, delivering, and storing these raw log messages, it's worth thinking about the inevitable analysis layer that goes on top.
>
> This is a great point. As described in the propoal, so far we've identified 3 concrete back ends:
>
> - statsd (for counter and timer events)
> - sentry (for errors)
> - bagheera / hadoop (for everything else)

- ArcSight (for security-relevant events) via CEF logs.

-- Mike

Rob Miller

Oct 14, 2011, 3:21:06 PM
to servic...@mozilla.org

Noted, thanks. I'll include mention of this in the next round of spec
revisions that I'm about to make. :)

-r

Gregory Szorc

Oct 14, 2011, 3:21:15 PM
to servic...@mozilla.org
On 10/14/11 12:04 PM, Rob Miller wrote:
> This is a great point. As described in the propoal, so far we've
> identified 3 concrete back ends:
>
> - statsd (for counter and timer events)
> - sentry (for errors)
> - bagheera / hadoop (for everything else)
>
> The idea was that hadoop gives us the ability to right arbitrary
> map-reduce jobs or Hive queries to do any sort of analysis that we
> want. And, as real world usage informs our needs, we might add more
> back ends, or we might decide to structure some of our messages in a
> way that makes the after-the-fact analysis we want to do a bit easier.
>
> Any ideas anyone has for additional back ends that we know we're
> likely to want, or for additional specific message types that might
> inform how we structure thing, please let me know.
I'd add near real-time monitoring to the list. Someone will inevitably
want something faster (lower latency) than Hadoop that is more powerful
than statsd: something like Esper.

Rob Miller

Oct 14, 2011, 3:23:37 PM
to servic...@mozilla.org
On 10/14/11 12:21 PM, Gregory Szorc wrote:
> On 10/14/11 12:04 PM, Rob Miller wrote:
>> This is a great point. As described in the propoal, so far we've
>> identified 3 concrete back ends:
>>
>> - statsd (for counter and timer events)
>> - sentry (for errors)
>> - bagheera / hadoop (for everything else)
>>
>> The idea was that hadoop gives us the ability to right arbitrary
>> map-reduce jobs or Hive queries to do any sort of analysis that we
>> want. And, as real world usage informs our needs, we might add more
>> back ends, or we might decide to structure some of our messages in a
>> way that makes the after-the-fact analysis we want to do a bit easier.
>>
>> Any ideas anyone has for additional back ends that we know we're
>> likely to want, or for additional specific message types that might
>> inform how we structure thing, please let me know.
> I'd add near real-time monitoring to the list. Someone will inevitably
> want something faster (lower latency) than Hadoop that is more powerful
> than statsd: something like Esper.

Great! I'll add it to the list.

-r

Ben Bangert

Oct 14, 2011, 3:31:30 PM
to Rob Miller, servic...@mozilla.org
On Oct 14, 2011, at 11:41 AM, Rob Miller wrote:
> Actually, what you've described is almost precisely what I intended with the current proposal. What I call "metadata" is what you called "arbitrary key-value pairs". The "message" part is the payload, and it can be a string, JSON, or anything that can be serialized and sent over the wire.
>
> The only difference between what you described and what I've proposed, AFAICT, is that you pulled "type" and "labels" out into top level pieces of the envelope, while I imagined them as embedded within the arbitrary key-value pairs.
>
> Maybe the issue here is just the naming. For some reason everybody seems thrown off by "metadata" as the field name, thinking it means something other than it does. Anyone have any better suggestions?

In zilch and sentry, they're called "tags".

- Ben

Rob Miller

Oct 14, 2011, 3:46:35 PM
to servic...@mozilla.org
On 10/14/11 12:31 PM, Ben Bangert wrote:
> On Oct 14, 2011, at 11:41 AM, Rob Miller wrote:
>> Actually, what you've described is almost precisely what I intended with the current proposal. What I call "metadata" is what you called "arbitrary key-value pairs". The "message" part is the payload, and it can be a string, JSON, or anything that can be serialized and sent over the wire.
>>
>> The only difference between what you described and what I've proposed, AFAICT, is that you pulled "type" and "labels" out into top level pieces of the envelope, while I imagined them as embedded within the arbitrary key-value pairs.
>>
>> Maybe the issue here is just the naming. For some reason everybody seems thrown off by "metadata" as the field name, thinking it means something other than it does. Anyone have any better suggestions?
>
> In zilch and sentry, they're called "tags".

Yeah, that was my first thought, too, but "tags" makes me think of a set
of string tokens, rather than key-value pairs, so I didn't use it. Tags
might be a less loaded term, though, so unless I hear any complaints
about it or any better ideas, I'll switch to that.

-r

Rob Miller

Oct 14, 2011, 3:58:41 PM
to servic...@mozilla.org
On 10/13/11 8:30 PM, Richard Newman wrote:
>> Since it sounds like the structure is still up for debate, let me put something out there. My opinion is an emitted message should consist of "envelope" and "payload" parts. The envelope contains metadata (some may be optional) that applies to most, if not all messages (time, logger, severity, type, labels, maybe arbitrary key-value pairs, etc). The payload can be a string, a set of key-value pairs, etc. The envelope and payload fields could all be encoded together, interleaved inside a single JSON object. The important part is they don't overlap. This is slightly different from the existing implementation, where the "metadata" map blurs the distinction between the two.
>
> Again with the concurring!
>
> The doc already mentions routing:
>
> It will deserialize these messages and examine the metadata to determine the appropriate back end(s) to which each message should be delivered.
>
> so I think it's broadly worthwhile to put a fixed set of routing information — including type — in the envelope. If that means some payloads will be empty, then great! Having concrete examples of data will make determining the envelope contents much easier, of course.

Yep. And, since the metadata is part of the envelope, we were
already planning on doing this. But I'm definitely convinced now that
'type' should be a top-level field, and I'll update the spec accordingly.

> While I'm on the topic: at some point in the future, the format of these log messages, or the required arguments to various bits, is going to change. That might well be an incompatible change. This doesn't have to be directly addressed right now, but it's worth thinking about or mentioning how you'll do protocol or payload versioning. For example, will you simply make sure that an incompatible change involves logging to a different listener, with each listener understanding just its own format? What about API changes?

This is a great point. We could, of course, put a 'format_version' in
the message metadata, but this is probably another one that is important
enough to become a top level field.

> This might have some impact to the API design. E.g., when I see
>
> MetlogClient(host, port, logger="", severity=6)
>
> I think "how am I going to write a client that streams to a file rather than a host and port?", or "where are we going to put the argument for the separate control socket port?".

Actually, the spec here is slightly out of date; I'll update it now. We're
no longer passing in 'host' and 'port', but instead a sequence of ZeroMQ
bind strings, which will specify both the transport protocol and the
location of the listener(s). Writing a client that streams to a file,
then, would be as easy as writing one that looks for a bind string such
as "file:///path/to/destination.txt" and writing to the specified location.

> At that point I wonder why this stuff is in the constructor anyway, rather than coming from a config or being directly set as attributes:
>
> let logClient = new MetlogClient(); // Does message processing.
> logClient.processor = new Metlog0MQProcessor(host, port); // Handles delivery.
>
> // Set up the client.
> logClient.severity = 6;
> logClient.flavor["vanilla"] = {
> horse: "cart"
> };

The metlog python library is just that, a library, so it doesn't have
any idea of "config". When we roll this into Demoapp, it will
definitely read the values out of our config file and handle this detail
for you.

As for the constructor part, the values ARE simply stored as attributes
on the client object. It's just a typical python idiom that
constructors will do this sort of thing for you. If you want to change
the values after construction time, setting attributes is how you do it.

The idea of a separate "processor" is something I don't have, though.
With the current implementation you'd need to subclass the client and
override the _send_message method, which is teh suck compared to the
composition pattern you propose, so I'll revise my library implementation.
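
(A sketch of that composition shape in Python, mirroring Richard's JS
example; the class and method names here are hypothetical, not the current
metlog API.)

import json

class ZmqSender:
    # Hypothetical delivery component: owns all of the transport details.
    def __init__(self, bind_strings):
        import zmq
        self.sock = zmq.Context.instance().socket(zmq.PUB)
        for bind_string in bind_strings:
            self.sock.connect(bind_string)

    def send(self, message):
        self.sock.send_string(json.dumps(message))

class ComposedClient:
    # Hypothetical client: builds messages, delegates delivery to a sender.
    def __init__(self, sender, logger="", severity=6):
        self.sender = sender      # swap this attribute to change transports
        self.logger = logger
        self.severity = severity

    def metlog(self, msg, **metadata):
        self.sender.send({
            "logger": self.logger,
            "severity": self.severity,
            "metadata": metadata,
            "message": msg,
        })

client = ComposedClient(ZmqSender(["tcp://127.0.0.1:5565"]), logger="myapp")
client.metlog("login succeeded", type="counter")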

> Anyway, just something to think about, and an example JS API ;)

Already planned on stealing it. ;)

-r

Gregory Szorc

Oct 14, 2011, 4:11:41 PM
to servic...@mozilla.org
On 10/14/11 11:41 AM, Rob Miller wrote:
> Actually, what you've described is almost precisely what I intended
> with the current proposal. What I call "metadata" is what you called
> "arbitrary key-value pairs". The "message" part is the payload, and
> it can be a string, JSON, or anything that can be serialized and sent
> over the wire.

Ahhh. I was thinking the payload field(s) would be typed (i.e. no need
to look at the type of a field, just the presence). Maybe have payload
exist as a separate field altogether, enforcing the separation of
envelope from payload. e.g.

{
  logger: "foo",
  metadata: { ... },
  severity: 3,
  type: "request.finish",
  payload: {
    string: "Request finished in 0.002 seconds",
    fields: {
      duration: 0.002
    }
  }
}

> The only difference between what you described and what I've proposed,
> AFAICT, is that you pulled "type" and "labels" out into top level
> pieces of the envelope, while I imagined them as embedded within the
> arbitrary key-value pairs.

That is what I proposed for consideration in my initial email, yes. The
argument I am throwing out there is that the stronger you define your
messages, the easier it becomes for downstream systems to process and
consume said messages. Without a required message "type" (presence in
metadata/tags/labels/whatever would imply it being optional as I am
reading things), you would have 1 class of messages with types that are
easily consumable (effectively a switch statement in code) and those
that aren't (requiring inference of message type from other fields or
payload). You quickly realize inference sucks (redundant and/or
inconsistent logic across downstream agents, imperfect detection,
sensitive to seemingly innocuous schema changes, less overhead to
calculate, etc), so, you end up adding "type" on all messages to avoid
it. At this point, you wonder why you didn't avoid this mess altogether
by making "type" required.

To address a point JR made, "type" isn't perfect (it is also vulnerable
to schema changes over time). But, these can be combated with a good
review process, a central catalog to define types (or tuples that
constitute a unique type), and versioning types or payloads when they
inevitably change (either version the type name - add V1 - or add a
"version" field to the payload for that specific type).

Rob Miller

Oct 14, 2011, 4:48:38 PM
to servic...@mozilla.org
On 10/14/11 1:11 PM, Gregory Szorc wrote:
> On 10/14/11 11:41 AM, Rob Miller wrote:
>> Actually, what you've described is almost precisely what I intended
>> with the current proposal. What I call "metadata" is what you called
>> "arbitrary key-value pairs". The "message" part is the payload, and it
>> can be a string, JSON, or anything that can be serialized and sent
>> over the wire.
>
> Ahhh. I was thinking the payload field(s) would be typed (i.e. no need
> to look at the type of a field, just the presence). Maybe have payload
> exist as a separate field altogether, enforcing the separation of
> envelope from payload. e.g.
>
> {
>   logger: "foo",
>   metadata: { ... },
>   severity: 3,
>   type: "request.finish",
>   payload: {
>     string: "Request finished in 0.002 seconds",
>     fields: {
>       duration: 0.002
>     }
>   }
> }

Again, we're in almost complete agreement. Here's the structure as I
currently imagine it:

{
  logger: "foo",
  tags: { ... },
  severity: 6,
  type: "...",
  env_version: 0.1,
  message: "...",
  timestamp: "2011-10-14T20:36:54.609806"
}

There are some naming differences ("payload" may be a better choice than
"message", but their semantic meaning in the structure is identical),
but the only real difference is that I don't imagine defining the
structure of the payload at all; we just say that it's string data.
That string may (and often will) be a nested JSON object, with its own
rich structure, but I don't see that as a concern of the spec itself,
but rather as something that will flesh itself out as we build a corpus
of message types.
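
(For concreteness, a small helper that assembles a message using the fields
sketched above; only the field names come from the example, the construction
details are a guess.)

import datetime
import json

def build_message(msg_type, payload, logger="", severity=6, tags=None):
    return json.dumps({
        "logger": logger,
        "tags": tags or {},
        "severity": severity,
        "type": msg_type,
        "env_version": 0.1,
        "message": payload,
        "timestamp": datetime.datetime.utcnow().isoformat(),
    })

build_message("request.finish", "Request finished in 0.002 seconds")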

>> The only difference between what you described and what I've proposed,
>> AFAICT, is that you pulled "type" and "labels" out into top level
>> pieces of the envelope, while I imagined them as embedded within the
>> arbitrary key-value pairs.
>
> That is what I proposed for consideration in my initial email, yes. The
> argument I am throwing out there is that the stronger you define your
> messages, the easier it becomes for downstream systems to process and
> consume said messages. Without a required message "type" (presence in
> metadata/tags/labels/whatever would imply it being optional as I am
> reading things), you would have 1 class of messages with types that are
> easily consumable (effectively a switch statement in code) and those
> that aren't (requiring inference of message type from other fields or
> payload). You quickly realize inference sucks (redundant and/or
> inconsistent logic across downstream agents, imperfect detection,
> sensitive to seemingly innocuous schema changes, less overhead to
> calculate, etc), so, you end up adding "type" on all messages to avoid
> it. At this point, you wonder why you didn't avoid this mess altogether
> by making "type" required.

Yup, I've already agreed that type makes sense as a top level field
rather than something living in the metadata.

> To address a point JR made, "type" isn't perfect (it is also vulnerable
> to schema changes over time). But, these can be combated with a good
> review process, a central catalog to define types (or tuples that
> constitute a unique type), and versioning types or payloads when they
> inevitably change (either version the type name - add V1 - or add a
> "version" field to the payload for that specific type).

An "envelope version" value is specified in my data structure, above,
which will allow us to demarcate changes in the envelope structure. The
structure of specific types of messages can and should be versioned, as
well. I'm inclined to let that be optional, i.e. embedded in the
message tags, but could possibly be convinced that a "msg_version" field
belongs at the top level.

-r

Pete Fritchman

Oct 17, 2011, 11:46:54 AM
to Gregory Szorc, servic...@mozilla.org
----- Original Message -----
> On 10/14/11 12:04 PM, Rob Miller wrote:
> I'd add near real-time monitoring to the list. Someone will inevitably
> want something faster (lower latency) than Hadoop that is more powerful
> than statsd: something like Esper.

If you log metrics via statsd (in our environment), you can get at an
Esper engine by writing cepmon rules: https://github.com/fetep/cepmon
(rolling out in our environment soon). Right now, cepmon doesn't
support sending output back into the metrics stream, but that's easy
to add if desired.

--
petef

Mike Hanson

Oct 24, 2011, 1:00:06 PM
to Rob Miller, Ben Adida, servic...@mozilla.org

1. Checking my understanding - the operational model of this would be:

* I add generator calls to my service; I debug and test locally with some sort of stub/debugger version of logstash.  I get the root of my string label from some well-understood central registry of names.

* I do something to get my messages understood by the router - is this self-service?  do I have to open a bug?  something in-between?

* I watch my events start to arrive in a self-service backend, or I engage with the teams running Esper, ArcSight, etc. to define my event stream

2. The distinction between payload and fields feels like it's based on an underlying system requirement/capability.  Surfacing that buried assumption, if it's important, would be helpful.

3. As currently written, I am encouraged to submit many events to the metlog system and discover anomalies "later".  That might mean that the system doesn't solve the problem of fraud/attack/intrusion detection.  Perhaps that's okay - but dynamically noticing threats and cranking up the logging/paranoia at runtime can sometimes be helpful.  Not necessarily saying this is a requirement for you - but if you don't do it, applications will.  (The trivial example I can think of is counting login failures to BrowserID).

4. Timers in JS could be handled through a wrapped function, which would also be cool.

5. Can we scale the router horizontally?  It seems so, but just checking.

m

Ben Adida

Oct 21, 2011, 8:38:46 PM
to Rob Miller, Michael Hanson, servic...@mozilla.org

Hi Rob,

Thanks for sending this.

At a high level, this looks good. I like that you're focusing on the
performance side of it. I don't know zeroMQ but that's why I'm leaving
that to you :)

I would suggest that you consider a simpler message to start, maybe just
the key-value pairs, with eventually some required fields? Just a
suggestion, I'm happy to try your current approach, too.

Have a great weekend,

-Ben

Rob Miller

Oct 24, 2011, 2:28:25 PM
to Mike Hanson, Ben Adida, servic...@mozilla.org
On 10/24/11 10:00 AM, Mike Hanson wrote:
>
> 1. Checking my understanding - the operational model of this would be:
>
> * I add generator calls to my service; I debug and test locally with
> some sort of stub/debugger version of logstash. I get the root of my
> string label from some well-understood central registry of names.

Yup.

> * I do something to get my messages understood by the router - is this
> self-service? do I have to open a bug? something in-between?

We'll build a corpus of understood message types, many of which will be
generally useful. You can add new message types at any time, and w/o
intervention they will be delivered to the "catch-all" back end, which
is likely to be the Metrics team's Hadoop infrastructure. If you want to
get your messages routed in a more specific manner, then you'd open a ticket.

> * I watch my events start to arrive in a self-service backend, or I
> engage with the teams running Esper, ArcSight, etc. to define my event
> stream

Specific message types would automatically be routed to the appropriate
back ends, which should already be set up to handle those types of
messages from multiple sources. The ultimate goal is that a new service
can start using this system and immediately get a set of useful reports
and outputs "for free". If there are needs that arise that aren't being
handled by the particular configurations that are available, then yes,
you'd need to engage w/ the owner of a particular back end to see if
changes can be made to support your cases.

> 2. The distinction between /payload/ and /fields/ feels like it's based
> on an underlying system requirement/capability. Surfacing that buried
> assumption, if it's important, would be helpful.

"payload" is the contents of your message. It's a blob as far as the
message processing and routing is concerned. You can put structured
data in there, or simply string messages, whatever you want to be able
to access on the other side.

"fields" is metadata *about* the message, i.e. part of the envelope, not
part of the payload. The fields *will* be parsed by logstash, and
routing decisions might be made based on the contents of those fields,
in concert w/ other "top level" pieces of metadata such as the message type.

Does this clarify? If so, great, I'll tweak the language in the spec to
include a variation on this theme. If not, please inquire further so I
can try to make sure your questions are answered and then weave the
answer into the document as appropriate.

> 3. As currently written, I am encouraged to submit many events to the
> metlog system and discover anomalies "later". That might mean that the
> system doesn't solve the problem of fraud/attack/intrusion detection.
> Perhaps that's okay - but dynamically noticing threats and cranking up
> the logging/paranoia at runtime can sometimes be helpful. Not
> necessarily saying this is a requirement for you - but if you don't do
> it, applications will. (The trivial example I can think of is counting
> login failures to BrowserID).

This seems to me more a function of the back ends that are in play, and
it also depends somewhat on your definition of "real time". Certain
back ends have specific support for event (or anomaly) detection, and
notification based on the same. These features will be as available to
you when the messages are routed through the metlog system as they would
be if you were sending messages to them directly. You may need to
engage w/ the owner of a particular back end to make sure that the
notification triggers you care about are in place and working as
desired, but the hope again is that we'll quickly build a set of
patterns that are generally useful, so subsequent services that start
using this infrastructure will be able to tap into a set of useful
notification strategies w/o much effort.

As for the "real time" nature, we'll be pushing messages through to the
back ends as quickly as possible, and the goal will be to have a minimal
amount of latency, but under times of heavy load there may be a bit of
message queuing that would slow down delivery of the messages from the
router to the back ends. I suspect it will all be fine, but if there's
a notification need for which the latency is sometimes too high, we'll
deal w/ that on a case-by-case basis.

> 4. Timers in JS could be handled through a wrapped function, which would
> also be cool.

Indeed, we've already provided both a decorator and a context manager
(i.e. "with metlog.timer as result:") in the Python library, will try to
provide analogous structures in JS. Anyone interested in helping w/ the
development of the JS library is welcome to contact me off-list.

> 5. Can we scale the router horizontally? It seems so, but just checking.

Yup. Services-Ops is already planning on having an instance of logstash
running on every single services machine, and we plan on using ZeroMQ's
"IPC" inter-process transport rather than TCP to deliver the messages in
most cases. Adding add'l logstash instances as needed should be a
pretty simple process.
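
(In 0MQ terms, switching between the two is just a different bind string;
the addresses below are placeholders, and the ipc:// transport only works
between processes on the same box.)

import zmq

sock = zmq.Context().socket(zmq.PUB)

# Over the network (placeholder hostname):
#   sock.connect("tcp://logstash.internal.example.com:5565")
# Local logstash on the same machine, via the IPC transport (placeholder path):
sock.connect("ipc:///var/run/metlog.sock")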

Hope this answers most of your questions, thanks for taking a look!

Mike Connor

Oct 24, 2011, 3:01:46 PM
to Mike Hanson, Ben Adida, servic...@mozilla.org

On 2011-10-24, at 1:00 PM, Mike Hanson wrote:

> 3. As currently written, I am encouraged to submit many events to the metlog system and discover anomalies "later". That might mean that the system doesn't solve the problem of fraud/attack/intrusion detection. Perhaps that's okay - but dynamically noticing threats and cranking up the logging/paranoia at runtime can sometimes be helpful. Not necessarily saying this is a requirement for you - but if you don't do it, applications will. (The trivial example I can think of is counting login failures to BrowserID).

Something we've been working on with infrasec is a generic abuse detection/prevention solution that works across all of our apps. The basic idea is to report events to ArcSight (via the API), do correlation and detection centrally in ArcSight, and then use a plugin to pass blocked IPs to the Zeus load balancers in front of our prod machines. This is sort of assumed in the design, but isn't called out well for external consumers. We'll get that fixed, as abuse detection/prevention is a common requirement across all apps, so we expect and intend to solve it here.

-- Mike
