Keeping "most interesting" traces

Rodrigo Fonseca

Apr 6, 2018, 6:44:25 PM

to Distributed Tracing Workgroup

Hi all,

We have a project here at Brown that may be of interest to people, and we need help from this community to validate the ideas.

The problem is this: if you have a fixed tracing budget, random sampling is the usual strategy. However, it does poorly at capturing infrequent (but perhaps important) anomalous executions. Suppose 99% of your executions are "normal", or boring, and the remaining 1% is made up of 4 types of anomalous executions. With random sampling, you end up dedicating 99% of your tracing budget to the boring stuff, and risk losing the interesting traces you should be debugging with. We are working on a project that measures how similar a trace is to previous traces and, roughly, stores the trace with a probability proportional to how different it is. In the example above, you could end up dedicating 20% of your tracing budget to each of the 5 types (assuming you have enough of each type).
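To make the budget arithmetic above concrete, here is a minimal simulation sketch. It is only an illustration, not the project's method: it assumes every trace already carries a known type label (a real system would have to infer the types), and the names `simulate`, `n_traces`, and `budget` are hypothetical.

```python
import random
from collections import Counter


def simulate(n_traces=100_000, budget=1_000, seed=0):
    """Compare uniform sampling with type-balanced sampling.

    Assumption: each trace has a known type label. The workload is
    99% "normal" and 1% split evenly across 4 anomaly types.
    """
    rng = random.Random(seed)
    types = ["normal"] + [f"anomaly{i}" for i in range(1, 5)]
    weights = [0.99] + [0.0025] * 4
    traces = rng.choices(types, weights=weights, k=n_traces)
    counts = Counter(traces)

    # Uniform random sampling: every trace is kept with the same probability.
    p_uniform = budget / n_traces
    uniform = Counter(t for t in traces if rng.random() < p_uniform)

    # Type-balanced sampling: the keep probability is inversely proportional
    # to how common the type is, so each type gets about budget / 5 traces.
    per_type = budget / len(counts)
    balanced = Counter(
        t for t in traces if rng.random() < min(1.0, per_type / counts[t])
    )
    return counts, uniform, balanced


if __name__ == "__main__":
    counts, uniform, balanced = simulate()
    for name, kept in (("uniform", uniform), ("balanced", balanced)):
        total = sum(kept.values())
        shares = {t: round(kept[t] / total, 3) for t in sorted(counts)}
        print(name, shares)
```

Under uniform sampling nearly all of the kept traces are "normal", while the balanced variant spends roughly a fifth of the budget on each of the 5 types, provided each type has at least `budget / 5` traces to draw from.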

Even if you are doing 100% tracing, our approach can be useful to separate the different types of executions you see and let you focus on the interesting ones, or allow you to "compress" your trace storage by only keeping around representative traces.

We do a type of dynamic clustering of the traces (I'll be happy to describe the approach in more detail if you are interested). In tests here with a version of Spark that we instrumented with X-Trace, we can separate executions in which a node fails, executions in which we stress one of the disks, executions with different input sizes, etc.
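The clustering approach itself is not described in this message, so the following is only a rough sketch of what clustering traces on the fly could look like, not the Brown project's algorithm. It swaps in a simple leader-style online clustering and assumes each trace has already been reduced to a numeric feature vector (for example, span counts or durations); `OnlineTraceSampler`, `radius`, and `target_per_cluster` are hypothetical names.

```python
import math
import random


class OnlineTraceSampler:
    """Leader-style online clustering with size-dependent sampling.

    Assumption: each trace arrives as a numeric feature vector.
    """

    def __init__(self, radius, target_per_cluster, seed=0):
        self.radius = radius              # max distance to join a cluster
        self.target = target_per_cluster  # traces always kept per cluster
        self.centroids = []               # running mean of each cluster
        self.sizes = []                   # traces seen per cluster
        self.rng = random.Random(seed)

    def _nearest(self, x):
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def offer(self, x):
        """Assign x to a cluster and return (cluster_id, keep)."""
        i, d = self._nearest(x)
        if i is None or d > self.radius:
            # Unlike anything seen so far: start a new cluster and keep it.
            self.centroids.append(list(x))
            self.sizes.append(1)
            return len(self.centroids) - 1, True
        self.sizes[i] += 1
        n = self.sizes[i]
        # Incrementally update the centroid as a running mean.
        c = self.centroids[i]
        for j, v in enumerate(x):
            c[j] += (v - c[j]) / n
        # The keep probability shrinks as the cluster grows, so common
        # trace types are thinned heavily and rare ones are mostly kept.
        keep = self.rng.random() < min(1.0, self.target / n)
        return i, keep
```

Each call to `offer` returns the cluster the trace landed in and whether to store it. With this rule the first `target_per_cluster` traces of any cluster are always kept, and the number kept from a large cluster grows only slowly (roughly logarithmically) with its size.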

This is where you come in: what we lack here are diverse enough traces, with real anomalies that happen only in production. I would love to know if you can share anonymized trace logs with us, so that we can test and validate our tools. Ideally, we would get a large enough set of traces that have different "types" of executions, with natural variations that occur in production.

We intend to publish the results and the tools (not the traces :), and hope that this will solve a real problem that people have. One of the motivations for this problem came from conversations with the folks doing tracing at Google about 5 years ago!

If any of you can share a dataset of traces with us, please let me know, and we can find a way to proceed.

Lastly, if anyone is interested, we could briefly talk about this at the tracing workshop in May.

Thank you for reading :)

Rodrigo Fonseca

Brown University

Adrian Cole

Apr 7, 2018, 1:12:27 AM

to Rodrigo Fonseca, Distributed Tracing Workgroup

Hi, Rodrigo.

Thanks for pitching this in. I also hope folks will share trace data
with you, at the very least to help in their own pursuit of better
usage of their systems. Incidentally, last meeting included "firehose
mode", which was a means to at least locally capture more data. What
you mention here is the perfect follow-on. Mind signing yourself up
for a slot convenient to you in May? If nothing else, it could be a
good chance to recruit a captive audience to share some data :P

-A
