Google's buffer management system


Mahdi

Oct 20, 2022, 12:46:26 AM
to BBR Development
Hi BBR team,
I recently studied some papers and learned that, to avoid hardware cost and complexity, Google's data centres do not use switches with deep buffers [1]. Does that mean tail drop is used to control congestion? Since BBR's performance is closely related to buffer/queue length, I am curious to know which queue management schemes Google's switches are equipped with and where I can read about them.
Thanks in advance for your help.
Best Regards,
Mahdi

[1]. Jain, Sushant, et al. "B4: Experience with a globally-deployed software-defined WAN." ACM SIGCOMM Computer Communication Review 43.4 (2013): 3-14.

Neal Cardwell

Oct 20, 2022, 8:53:29 AM
to Mahdi, BBR Development
Hi Mahdi,

You can read about this in some of the Sigcomm papers from Google over the years:

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network
SIGCOMM '15
https://research.google/pubs/pub43837/

"6.1 Fabric Congestion
Despite the capacity in our fabrics, our networks experienced high congestion drops as utilization approached 25%. We found several factors contributed to congestion: i) inherent burstiness of flows led to inadmissible traffic in short time intervals typically seen as incast [8] or outcast [21]; ii) our commodity switches possessed limited buffering, which was sub optimal for our server TCP stack; iii) certain parts of the network were intentionally kept oversubscribed to save cost, e.g., the uplinks of a ToR; and iv) imperfect flow hashing especially during failures and in presence of variation in flow volume.

We used several techniques to alleviate the congestion in our fabrics. First, we configured our switch hardware schedulers to drop packets based on QoS. Thus, on congestion we would discard lower priority traffic. Second, we tuned the hosts to bound their TCP congestion window for intra-cluster traffic to not overrun the small buffers in our switch chips. Third, for our early fabrics, we employed link-level pause at ToRs to keep servers from over-running oversubscribed uplinks. Fourth, we enabled Explicit Congestion Notification (ECN) on our switches and optimized the host stack response to ECN signals [3]. Fifth, we monitored application bandwidth requirements in the face of oversubscription ratios and could provision bandwidth by deploying Pluto ToRs with four or eight uplinks as required. Similarly, we could repopulate links to the spine if the depop mode of a fabric was causing congestion. Sixth, the merchant silicon had shared memory buffers used by all ports, and we tuned the buffer sharing scheme on these chips so as to dynamically allocate a disproportionate fraction of total chip buffer space to absorb temporary traffic bursts. Finally, we carefully configured switch hashing functionality to support good ECMP load balancing across multiple fabric paths.

Our congestion mitigation techniques delivered substantial improvements. We reduced the packet discard rate in a typical Clos fabric at 25% average utilization from 1% to < 0.01%."
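
As an aside on technique six: the "buffer sharing scheme" on shared-memory merchant silicon is typically some variant of dynamic per-port thresholds, where a queue may keep growing only while it stays below a configurable multiple (often called alpha) of the currently free buffer. Below is a minimal Python sketch of that general idea; the buffer size, alpha value, and port name are made up for illustration and are not Google's actual switch configuration.

class SharedBufferModel:
    """Toy model of dynamic-threshold sharing in a shared-memory switch chip:
    a port may accept a packet only while its queue length stays below
    alpha * (free buffer cells). A larger alpha lets a few congested ports
    temporarily take a disproportionate share of the chip buffer to absorb
    bursts, while still reserving headroom for the other ports."""

    def __init__(self, total_cells, alpha):
        self.total = total_cells
        self.alpha = alpha
        self.used = 0
        self.queues = {}          # port -> cells currently queued

    def try_enqueue(self, port, cells):
        free = self.total - self.used
        if self.queues.get(port, 0) + cells > self.alpha * free:
            return False          # drop: port is over its dynamic threshold
        self.queues[port] = self.queues.get(port, 0) + cells
        self.used += cells
        return True

    def dequeue(self, port, cells):
        taken = min(cells, self.queues.get(port, 0))
        self.queues[port] = self.queues.get(port, 0) - taken
        self.used -= taken
        return taken

# Illustrative numbers only: one bursty port can grow well past an equal
# per-port split, but converges to alpha/(1+alpha) of the chip, not all of it.
buf = SharedBufferModel(total_cells=12000, alpha=2.0)
accepted = sum(buf.try_enqueue("port0", 100) for _ in range(100))
print(f"port0 accepted {accepted * 100} of 10000 requested cells")

In this toy run a single congested port levels off at alpha / (1 + alpha) of the chip (8,000 of 12,000 cells), which is how bursts get extra headroom without one port monopolizing the shared buffer.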

---

Aequitas: admission control for performance-critical RPCs in datacenters
SIGCOMM '22
https://dl.acm.org/doi/abs/10.1145/3544216.3544271
"Weighted fair queuing (WFQ) is available in commodity switches/NICs and its bandwidth sharing property enables the mapping from traffic priority classes to desired QoS levels.We currently take this approach to map RPC priorities to network QoS levels in our datacenters."

---

neal

