Data Center TCP (DCTCP)
Authors
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
Date
SIGCOMM'10 - August 2010
Novel Idea
Return ECN congestion information accumulated over many packets to the
sending host, which then adapts its congestion window incrementally
(less abruptly).
Main Results
The paper presents measurements and analysis of real data center
traffic and, based on the traffic patterns it identifies (one being
the need for low latency), proposes changes to TCP for this particular
context. The changes give the sender better congestion notification
and allow the congestion window to adapt more gradually.
Impact
The paper analyzes real data-center traffic and observes that
low-latency queries are in high demand in such systems (largely
because of SLA constraints and the aggregation pattern of data
processing), which clearly raises awareness of this issue. Moreover,
the proposed TCP changes for the data center environment build upon
those observations and actually show improved performance with
respect to those concerns.
Evidence
They begin their argument by identifying issues in data center
communication using real data. Among the issues, they describe incast
(synchronized short-lived flows), queue buildup (if the queue is long,
the latency of short queries suffers), and buffer "pressure" (long
flows take up more and more buffer space, leaving little for short
queries).
The algorithm is described in detail, and they evaluate their solution
using the issues above as metrics, also measuring throughput, queue
length, and convergence time to fairness. They run both isolated tests
and tests that simulate real environments.
Prior Work + Competitive Work
They borrow ideas from other approaches that perform active queue
management, such as RED and PI (they even use RED mechanisms to
implement DCTCP).
They discuss the approach of jittering (perturbing) flows, which
increases response time. They also discuss the possibility of
drastically reducing RTOmin combined with fine-grained
retransmissions, which can relieve incast but does not affect the
queue-length problem.
They also mention other TCP variants (Vegas, CTCP).
Reproducibility
I think their evaluation section is reproducible: the paper provides
enough technical detail to permit it. Of course, publicly available
code would make it easier to do so.
The measurements stand on different ground in this regard. They were
presumably taken in a business context, and it is understandable that
the data is not widely available. Even so, if the patterns are truly
characteristic, they should show up in other people's measurements.
Questions + Criticism
Their tests used TCP SACK, which seems important to their results (in
a single SACK packet, the receiver can effectively echo the congestion
state for many segments at once, right?). Is TCP SACK something common
nowadays? (I think I'm out of date on this.)
They mention that RED+ECN leads to long queues because it is
supposedly too slow to react to bursts of traffic. But if the switches
could infer a probability of congestion, they could start marking
packets based on that probability, and the end hosts could apply even
finer-grained control over the congestion window. And if reaction is
too slow, perhaps the two thresholds could still be used, just set to
lower values, so it would actually react more quickly.
Ideas for Further Work
I think it makes sense to use both the low and high thresholds of
RED-enabled switches to infer congestion indications "in shades of
gray", and then mark packets according to the inferred probability,
instead of collapsing both thresholds into a single K. Probabilistic
marking could have an interesting effect on the oscillation of the
congestion window.
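A minimal sketch of what this could look like on the switch, assuming a hypothetical pair of thresholds K_min and K_max (both names are mine, not the paper's):

```python
import random

def mark_probability(queue_len, k_min, k_max):
    """RED-style "shade of gray": probability 0 below k_min, 1 above
    k_max, linearly interpolated in between."""
    if queue_len <= k_min:
        return 0.0
    if queue_len >= k_max:
        return 1.0
    return (queue_len - k_min) / (k_max - k_min)

def should_mark(queue_len, k_min=20, k_max=60):
    """Decide whether to set the CE bit on an arriving packet."""
    return random.random() < mark_probability(queue_len, k_min, k_max)
```

Compared to DCTCP's single threshold K, the mark stream itself would then carry an estimate of congestion severity, at the cost of a noisier signal.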
Paper Title |
Data Center TCP (DCTCP) |
Author(s) |
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan |
Date | 2010 |
Novel Idea |
Common cluster workloads cause three main performance impairments in commodity switches running standard TCP. The first, known as incast, occurs when a large number of simultaneous responses flood a switch, causing some of them to be dropped and incurring a TCP retransmission timeout (RTO), due to the small amount of buffer space these switches allocate for packets. The second is that long flows (typically cluster-maintenance software keeping nodes in sync, etc.) cause short flows (typically processing related to a user query that must be answered very quickly) to suffer increased latency due to queueing. The last they call "buffer pressure", and it occurs for the same reason as the second: the queues formed by the long flows can fill up the space that would be used for buffering short-flow packets, causing those packets to be dropped. They propose to solve all three problems by modifying TCP's ECN logic to react not merely (and quite drastically) to the presence of congestion, but to its extent. |
Main Result(s) |
They describe DCTCP, an extension to TCP with the following properties: 1) Packets simply have the congestion bit set if they arrive at a queue of length K or greater, in contrast to RED's more complex marking semantics. 2) Receivers echo the exact ECN information in acks. To preserve the benefits of delayed acks, they only send an ack either when the usual m packets have been received, or immediately whenever a packet arrives with a different ECN state from the last one. 3) Senders keep a running estimate alpha of the fraction of packets marked, based on the information in those acks. On congestion they cut the send window by a factor of (1 - alpha/2), a multiplicative decrease ranging from almost nothing up to TCP's usual halving, causing a gentler drop-off than TCP when congestion is estimated to be low. |
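For concreteness, here is a minimal sketch of the sender-side logic in point 3, with the EWMA gain g = 1/16 as in the paper (the class and method names are my own, not the paper's):

```python
class DctcpSender:
    """Sketch of the DCTCP sender-side estimator and window update."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd   # congestion window, in packets
        self.alpha = 0.0   # running estimate of the fraction marked
        self.g = g         # EWMA gain (the paper suggests 1/16)

    def on_window_of_acks(self, acked, marked):
        """Called once per window of data, with the number of acked
        packets and how many of them carried an ECN echo."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # gentle multiplicative decrease: cut by alpha/2, not 1/2
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1  # standard additive increase
        return self.cwnd
```

For example, a fully marked window drives alpha toward 1, at which point the cut approaches TCP's usual halving; a lightly marked window barely shrinks cwnd.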
Impact | Ran out of time for this again... |
Evidence | They verify that DCTCP results in smaller and less variable queues on a real, actual cluster running what seems to have been real, actual cluster-style code. This is remarkable! They also show that it converges to fairness faster, and compares well in multi-hop network tests. |
Prior Work | Their work is largely a response to prior work that tried to avoid incast by artificially adding jitter to the system, which does solve the problem but significantly increases median latency. |
Reproducibility | They describe their algorithm really well, it seems to me. |
Question | They use shared-memory switches, which they claim are "like most commodity switches." What other kinds are available? Do they solve any of these challenges, and/or add different challenges of their own? |
Criticism | This paper honestly seems really good. They lay out some very precise problems, propose a solution, and show convincingly that it works. They compare their solution to other work in the field and explain why those preexisting technologies are unsatisfactory. I'm sure they are doing *something* wrong, but I don't see it. |
Ideas for further work | It seems like in very large clusters (or perhaps just poorly organized ones?) the problem they outline with incast resulting from a single packet from each machine could start to be more of an issue than they claim it is. When it is, what can be done? Perhaps there is a way to dynamically turn on a small amount of jitter when the system notices that's happening? Perhaps the jitter can be done at the application level for applications that are likely to cause that pattern? |
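The application-level variant of that last idea could be as simple as the following sketch, where send stands in for whatever the application already uses to transmit its answer (all names here are hypothetical):

```python
import random
import time

def send_with_jitter(payload, send, max_jitter_ms=5):
    """Application-level incast avoidance: each worker delays its
    response by a small random amount so the fan-in replies to one
    query do not all hit the aggregator's switch port at once."""
    time.sleep(random.uniform(0, max_jitter_ms) / 1000.0)
    send(payload)
```

Enabling this only when the application notices repeated timeouts would keep the median-latency cost mentioned under Prior Work to a minimum.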