X-Trace: A Pervasive Network Tracing Framework
Authors
Rodrigo Fonseca, George Porter, Randy Katz, Scott Shenker, Ion Stoica
Date
NSDI'07 - Networked Systems Design and Implementation, April 2007
Novel Idea
Describe and implement a tracing framework that works across the
different layers of network protocols, providing holistic tracing
information.
Main Results
The authors describe and implement X-Trace, a tracing framework that
inserts metadata at several layers of the communication path and is
thus able to provide an integrated trace of events. The framework
respects administrative domain restrictions when generating reports.
Impact
The X-Trace framework provides the ability to tackle multiple network
layers and multiple applications in a single "tracing context".
Evidence
The authors start by describing the framework's design principles,
justifying some architectural features. The architecture itself,
including aspects such as metadata propagation and issues involving
report generation, is described with sound arguments.
The paper provides some microbenchmarks on the previously mentioned
aspects and an analysis of some usage scenarios, with reasonable results.
The usage scenario section provides three examples, and the paper
actually discusses how the system "fits" in the sense of whether it
would indeed provide relevant information in each case (including
information that reveals failures). The "fitting" of a multi-layer and
integrated approach to tracing is particularly well discussed in the
third usage scenario (see Questions + Criticism section below).
Prior Work
They mention network tools and instrumentation protocols such as
traceroute and SNMP. They also cite work by Hussain et al., focused
on network tracing, and work by Kompella et al., focused on
tracing state changes (is it?) in different network layers rather than
the data flow, as X-Trace does.
Competitive Work
They mention Splunk, but argue that its log-based approach may fail
to reveal how events are correlated. They also cite Pinpoint,
but say that this system focuses on inferring fault causality by
analyzing a J2EE-based data flow.
Pip allows its users to express how the system should behave, and
compares that expectation to its actual behavior. Magpie correlates
information obtained at various levels and infers event causality. The paper
argues that Magpie is mostly focused on a single system or distributed
systems that are instrumented in a particular manner.
Finally, they mention two projects, AND and Constellation, that use
inference techniques to produce data flow diagrams.
Reproducibility
The microbenchmarks on metadata propagation are reproducible, as is
the testing of the report infrastructure, since the system is available
online.
The usage scenarios require more work if one decides to reproduce
them. I think some details are left out, but the purpose of
the section is to analyze the "fitness" of the framework.
Questions + Criticism
[Criticism] It is a really nice paper (honestly). I very much liked
the fact that the usage scenarios section analyzes the "fitting" of
the system, and how relevant information could be generated under
particular situations (particularly, in the third example, the process
vs host failure cases). When proposing a framework or a technique, this
is indeed the most important metric.
I have, though, a bunch of questions and some comments:
(1) [Question] In sizable traces, how big is the effect of
collisions in the unique() function? Is it appropriate for these
cases?
(2) [Criticism] I think the packet sniffing application that sends
reports is actually something very interesting, and I think it would
deserve more discussion in the paper. I believe so because it directly
affects the feasibility of applying the system in bigger cases.
[Question] Are other protocols implemented besides IP and TCP?
(3) [Questions + Criticisms] How feasible is it to use X-Trace in a
reasonably big distributed system, changing multiple applications? Is
there any usage scenario evaluation suggesting the framework's
"scalability"?
There are more [Questions] in the following section (they make more
sense there).
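On question (1), a back-of-the-envelope estimate is possible with the standard birthday approximation. The sketch below assumes that unique() draws identifiers uniformly at random and that identifiers are 32 bits wide; both are assumptions made for illustration, not details taken from the review. The function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(n_ids: int, id_bits: int = 32) -> float:
    """Birthday approximation: chance that at least two of n_ids
    uniformly random identifiers of id_bits bits are equal."""
    space = 2 ** id_bits  # number of distinct identifiers (assumed width)
    # 1 - exp(-n(n-1) / (2N)), written with expm1 for small-value accuracy
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


# Identifiers only have to be unique within one trace, so n_ids is the
# number of operations recorded for a single traced request.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} operations: p(collision) ~ {collision_probability(n):.4f}")
```

Under these assumptions, collisions only become likely once a single trace holds tens of thousands of operations; wider identifiers push that point out much further.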
Ideas for Further Work
Do something analogous to the packet sniffing application for
report generation, but this time to generate metadata.
The idea is to intercept socket data in the kernel and try to
identify the application-level protocol. If it is a widely known one, say
HTTP, modify the data flow to include X-Trace metadata in the HTTP
header. Also, generate the metadata in the lower levels of the network stack.
[Question] Does it appear viable? (application throughput and latency,
protocol processing inside the kernel, etc.)
It is crazy, but it could be awesome for a big system that
communicates using standard protocols and runs on a platform where we
can modify the kernel/runtime system (like the BSDs and Linux).
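The header-injection step of this idea can be sketched in user space. The code below is only an illustration of rewriting a buffered HTTP request; it does not attempt the in-kernel interception, and the function name `inject_trace_header`, the default header name "X-Trace", and the metadata value format are all assumptions.

```python
def inject_trace_header(raw: bytes, value: str, name: str = "X-Trace") -> bytes:
    """Return the HTTP request with one extra header carrying trace metadata.

    The request is returned unchanged if its header section is incomplete
    or if it already carries a header with the same name.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return raw  # headers not fully buffered yet; do not touch
    lines = head.split(b"\r\n")
    prefix = name.lower().encode("ascii") + b":"
    if any(line.lower().startswith(prefix) for line in lines[1:]):
        return raw  # already traced; keep the existing metadata
    lines.append(f"{name}: {value}".encode("ascii"))
    return b"\r\n".join(lines) + sep + body


request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
traced = inject_trace_header(request, "0123abcd")
print(traced.decode("ascii"))
```

Even this simple version hints at the cost the question above raises: the interceptor has to buffer data until the header section is complete before it can forward anything.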
Paper Title | X-Trace: A Pervasive Network Tracing Framework |
Author(s) | Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica |
Date | 2007 |
Novel Idea | Multi-level, cross-AD tracing via in-band metadata inserted into communication protocols. |
Main Result(s) | If all relevant layers and nodes of the network implement X-Trace, it can provide a full, multilayered trace of any request that asks to be traced. Each node and protocol must be extended to recognize incoming messages with piggybacked X-Trace metadata, add to that metadata, write local reports, and send them to a database. An X-Trace implementation for a node provides the primitives pushNext and pushDown, which propagate the X-Trace metadata to the next node in this node's layer and to the next node in a lower layer (if the outgoing request is built on such a layer and it is also X-Trace-enabled), respectively. Anyone on an X-Trace-enabled network can initiate a trace request by inserting X-Trace metadata into some message on the network. However, the reports generated thereby are not necessarily returned to the person who initiated the trace. In fact, if the request crosses ADs (and all nodes still implement X-Trace), the owners of all involved ADs' X-Trace implementations will receive separate batches of reports. |
Impact | Dunno really, but possibly inspired Tracelytics, a local company where I might possibly get an internship this winter maybe? |
Evidence | Tests of the reporting framework with ab (ApacheBench) showed a 15% decrease in system throughput. They give a few simple examples of what an X-Trace tree looks like under various failure conditions, and how it can be used to detect the source of issues. |
Reproducibility | Not much. No code, bare-bones algorithms. They mention the challenges associated with implementing X-Trace for protocols with complex message causality semantics, but don't describe in much detail the strategies they used to overcome them. |
Question | XTrace metadata contains an extensible options field. Does this mean it only works with protocols that themselves contain such a field? What about protocols with maximum message lengths? What about protocols with no free space at all for XTrace to piggyback on? |
Criticism | Especially given the cross-AD nature of X-Trace, I feel that the concern over malicious injection of packets with X-Trace metadata is understated. The performance section indicates a slowdown of 15% while processing a lot of X-Traced messages; if the owner of one AD decides to enable X-Trace on machines that communicate with another AD, what is to prevent the other from causing a similar performance hit on the first at will? |
Ideas for further work | My instinct is that one of the highest-priority extensions to X-Trace is to make it work with a wider variety of non-tree call graph structures. To do this, the modification to a node must be more significant. For instance, to provide X-Tracing capabilities for a server adjudicating a quorum, the X-Trace code must be able to detect that an incoming vote is part of a quorum, generate a report about the outcome of the vote, and attach X-Trace metadata to the outgoing message(s) reporting the result: perhaps just the one heading back to the node that originally voted, or perhaps a message reporting the outcome to some other specific node. It's not always clear what semantics are desired. If the X-Traced node is on the losing side of a quorum vote, is it causally connected to the outcome of that vote? Do we want to X-Trace the responses to *every* node that voted? If the vote has many inputs but only one output, what happens when two inputs are both X-Traced? There is a lot of work to be done here, and doing it well could potentially lead to tools that help diagnose problems in the confusing call graphs where they are most needed. |
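The pushNext/pushDown propagation summarized in the Main Result(s) row can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Metadata` fields, the identifier widths, the edge labels, and the helper names (`start_trace`, `push_next`, `push_down`, `report`) are all assumptions chosen for readability.

```python
from dataclasses import dataclass
import secrets


@dataclass(frozen=True)
class Metadata:
    """Hypothetical, simplified stand-in for X-Trace metadata."""
    task_id: str    # shared by every operation of one traced request
    parent_id: str  # op_id of the operation that caused this one
    op_id: str      # identifier of this operation
    edge: str       # "ROOT", "NEXT" (same layer) or "DOWN" (lower layer)


def new_op_id() -> str:
    # Assumption: a random 4-byte identifier, hex-encoded.
    return secrets.token_hex(4)


def start_trace() -> Metadata:
    """Create metadata for the first operation of a new traced request."""
    return Metadata(secrets.token_hex(8), "", new_op_id(), "ROOT")


def push_next(md: Metadata) -> Metadata:
    """Metadata for the next node in the same layer."""
    return Metadata(md.task_id, md.op_id, new_op_id(), "NEXT")


def push_down(md: Metadata) -> Metadata:
    """Metadata for the layer below, on the same logical hop."""
    return Metadata(md.task_id, md.op_id, new_op_id(), "DOWN")


def report(md: Metadata, node: str, reports: list) -> None:
    """Record a local report; a real node would send it to a collector."""
    reports.append({"task": md.task_id, "parent": md.parent_id,
                    "op": md.op_id, "edge": md.edge, "node": node})


reports = []
http_client = start_trace()
report(http_client, "http-client", reports)
tcp_client = push_down(http_client)    # HTTP request carried over TCP
report(tcp_client, "tcp-client", reports)
http_server = push_next(http_client)   # next HTTP-level hop
report(http_server, "http-server", reports)
```

Because every report names its parent operation and shares the task identifier, whoever collects the reports can rebuild the tree offline, even when the reports arrive out of order or from different collectors.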
Authors: Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica
Date: NSDI 2007
Novel idea: X-Trace is a network tracing framework which traces causal paths between different network layers and administrative domains.
Main results: The authors present instrumentation libraries for applications in C, C++, Ruby, and Java. There is also a daemon process which collects and stores reports, reconstructs the causal paths, and displays visualizations.
Impact: X-Trace might enable sysadmins to find problems faster than they would using ordinary tools.
Evidence: The authors deployed X-Trace in a hosted web service and an overlay network, then injected six faults which would normally be difficult to diagnose. They quickly identified each fault using X-Trace.
Prior work: X-Trace builds on prior distributed tracing tools such as Pinpoint and Magpie.
Competitive work: Unlike its predecessors, X-Trace focuses on tracing causality between different network layers and administrative domains. X-Trace is cited as an inspiration for Google's Dapper tracing infrastructure.
Reproducibility: The BSD-licensed source code is available on GitHub.
Question: Section 3.2 mentions a packet sniffing application which sends reports on behalf of services and applications that cannot be modified to include libxtrreport. What are some examples of these services?
Criticism: The authors of Google's Dapper framework mention one inefficiency in X-Trace: "traces are collected not only at node boundaries but also whenever control is passed between different software layers within a node."
Ideas for further work: Almost every web application or distributed system has a basic logging framework baked in, but very few have X-Trace baked in. Lowering the barrier to entry would be helpful to many developers who might benefit from causal tracing.