Hi all,
We have a project here at Brown that may be of interest to people, and we need help from this community to validate the ideas.
The problem is this: if you have a fixed tracing budget, random sampling is the usual strategy. However, it works poorly for infrequent (but perhaps important) anomalous executions. Suppose 99% of your executions are "normal", or boring, and the remaining 1% consists of 4 types of anomalous executions. With random sampling, you end up dedicating 99% of your tracing budget to the boring stuff, and risk losing the interesting traces you should be using for debugging. We are working on a project that detects how similar a trace is to previous traces and, roughly, decides to store the trace with a probability proportional to how different it is. In the example above, you could end up dedicating 20% of your tracing budget to each of the 5 types (assuming you have enough of each type).
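To make the arithmetic concrete, here is a minimal Python sketch of the two strategies on a synthetic stream. It is only an illustration, not our actual tool: it assumes each trace already carries a known type label (in reality the types have to be discovered from the traces themselves), and the names `make_stream`, `uniform_sample`, and `rarity_sample`, as well as the specific keep-probability formula, are hypothetical choices made for this example.

```python
import random
from collections import Counter


def make_stream(n, rng):
    """Hypothetical workload: 99% 'normal' traces, 1% split over 4 anomaly types."""
    types = ["normal", "anomaly_a", "anomaly_b", "anomaly_c", "anomaly_d"]
    weights = [0.99, 0.0025, 0.0025, 0.0025, 0.0025]
    return rng.choices(types, weights=weights, k=n)


def uniform_sample(stream, rate, rng):
    """Plain random sampling: every trace is kept with the same probability."""
    return Counter(t for t in stream if rng.random() < rate)


def rarity_sample(stream, rate, rng):
    """Keep a trace with probability inversely proportional to how common
    its type has been so far (an assumed stand-in for 'how different it is')."""
    seen = Counter()
    kept = Counter()
    for i, t in enumerate(stream, start=1):
        seen[t] += 1
        frac = seen[t] / i  # running fraction of the stream with this type
        # Aim to spend the budget evenly over the types seen so far.
        p = min(1.0, rate / (len(seen) * frac))
        if rng.random() < p:
            kept[t] += 1
    return kept


def shares(kept):
    """Fraction of the stored traces that belongs to each type."""
    total = sum(kept.values())
    return {t: round(c / total, 3) for t, c in kept.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable
stream = make_stream(100_000, rng)
uniform = uniform_sample(stream, 0.01, rng)
rarity = rarity_sample(stream, 0.01, rng)

print("uniform:", shares(uniform))
print("rarity: ", shares(rarity))
```

With a 1% budget, the uniform sampler spends almost everything on "normal" traces, while the rarity-weighted sampler ends up with roughly a fifth of its stored traces in each of the 5 types, because each anomaly type has enough occurrences (about 250 here) to fill its share.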
Even if you are doing 100% tracing, our approach can be useful to separate the different types of executions you see and let you focus on the interesting ones, or allow you to "compress" your trace storage by only keeping around representative traces.
We do a type of dynamic clustering of the traces (I'll be happy to describe the approach in more detail if you are interested). In some tests that we ran here with a version of Spark that we instrumented with X-Trace, we can separate executions in which a node fails, executions in which we stress one of the disks, executions with different input sizes, etc.
This is where you come in: what we lack here are diverse enough traces, with real anomalies that happen only in production. I would love to know if you can share anonymized trace logs with us, so that we can test and validate our tools. Ideally, we would get a large enough set of traces that have different "types" of executions, with natural variations that occur in production.
We intend to publish the results and the tools (not the traces :), and hope that this will solve a real problem that people have. One of the motivations for this work came from conversations with the folks doing tracing at Google about 5 years ago!
If any of you can share a dataset of traces with us, please let me know, and we can find a way to proceed.
Lastly, if anyone is interested, we could talk about this briefly at the tracing workshop in May.
Thank you for reading :)
Rodrigo Fonseca
Brown University