Mateos Weinzapfel

Aug 2, 2024, 5:26:49 AM8/2/24

Service A always fails a little bit; it never recovers. Service B occasionally fails cataclysmically. It recovers quickly but still experiences a near 100% outage during that period. Finally, service C rarely fails, but when it does fail, it fails for a long time.

The following graphic shows that each service has an equal number of nines. However, how you solve these three failure modes is drastically different! The first requires request hedging or retries, the second needs load shedding or backpressure, and the third needs faster detection and failover.
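The first of those mitigations, request hedging, can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the idea is to wait roughly a tail-latency deadline for the first attempt, then launch a duplicate request and take whichever response arrives first.

```python
import concurrent.futures
import time

def hedged_call(request_fn, hedge_after_s, pool):
    """Send a request; if no response arrives within `hedge_after_s`
    (e.g. the observed p95 latency), send an identical hedge request
    and return whichever answer completes first."""
    first = pool.submit(request_fn)
    done, _ = concurrent.futures.wait([first], timeout=hedge_after_s)
    if done:
        return first.result()
    second = pool.submit(request_fn)  # the hedge
    done, _ = concurrent.futures.wait(
        [first, second],
        return_when=concurrent.futures.FIRST_COMPLETED)
    return done.pop().result()

# Demo: the first attempt is slow; the hedge answers quickly.
calls = []
def flaky_request():
    calls.append(1)
    time.sleep(0.5 if len(calls) == 1 else 0.01)
    return "ok"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    start = time.monotonic()
    result = hedged_call(flaky_request, hedge_after_s=0.05, pool=pool)
    elapsed = time.monotonic() - start

print(result)
```

Note that hedging trades extra load for lower tail latency, which is why it suits service A's "always fails a little" mode but not service B's overload mode, where it would make the cataclysm worse.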

We use these techniques at Netflix to run online stateful services at very high scale. We have near caches, which live on service hosts and handle billions of requests per second at sub-100-microsecond latency. We have remote caches based on Memcached that handle tens of millions of requests per second against 100-microsecond latency targets while storing petabytes of data. Finally, we have stateful databases running Apache Cassandra in four-region full-active mode, providing in-region read-your-writes consistency at single-digit-millisecond latency.

With an understanding of our software and the underlying hardware, we program workload capacity models, which consider several parameters about the workload to model the possible CPU, memory, network, and disk requirements. Given these workload details, the model then outputs the specifications for a cluster of computers that we call the least regretful choice. More information about how this capacity modeling works is in the AWS re:Invent talk from 2022.
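A capacity model like the one described can be sketched as "size each resource independently, let the tightest one bind, and pick the cheapest shape that fits." The instance catalog, shape names, numbers, and costs below are entirely illustrative, not Netflix's real model:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Workload:
    reads_per_s: float
    writes_per_s: float
    item_bytes: int
    working_set_gib: float

# Hypothetical instance catalog (illustrative numbers only).
CATALOG = {
    "small":  {"cpu_ops_s": 20_000, "mem_gib": 16, "net_mib_s": 150, "cost": 1.0},
    "medium": {"cpu_ops_s": 40_000, "mem_gib": 32, "net_mib_s": 300, "cost": 2.0},
    "large":  {"cpu_ops_s": 80_000, "mem_gib": 64, "net_mib_s": 600, "cost": 4.0},
}

def nodes_needed(w, shape, headroom=0.4):
    """Node count so that CPU, memory, and network each stay below
    (1 - headroom) utilization; whichever resource binds first wins."""
    spec = CATALOG[shape]
    ops = w.reads_per_s + w.writes_per_s
    net_mib_s = ops * w.item_bytes / 2**20
    budget = 1 - headroom
    return max(
        ceil(ops / (spec["cpu_ops_s"] * budget)),
        ceil(w.working_set_gib / (spec["mem_gib"] * budget)),
        ceil(net_mib_s / (spec["net_mib_s"] * budget)),
    )

w = Workload(reads_per_s=50_000, writes_per_s=10_000,
             item_bytes=1024, working_set_gib=200)
# "Least regretful" here means: the cheapest shape that satisfies the model.
best = min(CATALOG, key=lambda s: nodes_needed(w, s) * CATALOG[s]["cost"])
print(best, nodes_needed(w, best))
```

For this memory-bound workload the model lands on many small nodes; a CPU-bound workload with the same code would flip toward fewer large ones, which is the point of modeling every resource rather than just one.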

We replicate our clusters to 12 Amazon availability zones spread across four regions because we want to ensure that all of our microservices have local zone access to their data. Network communication is an opportunity for failure, so we try to keep as much within a zone as we can, and if we have to cross zones, we try to keep it in the region. But, sometimes, we do have to go across regions. This replication technique allows us to have highly reliable write and read operations because we can use quorums to accept writes in any region. By having three copies in every region, we can provide a very high level of reliability while maintaining strong consistency.
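The quorum arithmetic behind "three copies in every region" is worth making explicit. A minimal sketch (standard quorum math, not Netflix-specific code):

```python
def quorum(replication_factor):
    """Majority of replicas: for RF=3 this is 2."""
    return replication_factor // 2 + 1

def can_serve(replicas_up, replication_factor):
    """A quorum read or write succeeds as long as a majority of the
    in-region replicas respond, so one zone can be down."""
    return replicas_up >= quorum(replication_factor)

def read_your_writes(n, w, r):
    """Write and read quorums overlap whenever W + R > N, so every
    read sees at least one replica that accepted the latest write."""
    return w + r > n

# RF=3 per region: losing one zone's copy still leaves a quorum,
# and quorum writes plus quorum reads give read-your-writes.
print(can_serve(2, 3), read_your_writes(3, quorum(3), quorum(3)))
```

This is why three copies per region is the sweet spot: two copies cannot tolerate any zone loss at quorum, while four copies cost more without raising the failure tolerance at majority quorum.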

Sometimes, we have to evacuate traffic from a degraded region. This illustrates the overall capacity usage of the running Netflix system. Most of our money is spent on our stateless services, typically provisioned to handle around one-fourth of global traffic.

The alternative is a more traditional sharded approach for our databases, where we run two copies of state instead of four. For example, imagine we had two replication groups: one in "America" between us-west-2 and us-east-2, and one for "Europe" between us-east-1 and eu-west-1. In this traditional approach, we would have to reserve a lot more headroom for traffic during failover: (100% of the shard's traffic on the surviving region) / (50% per region in steady state) = 2x, i.e. 100% more.
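The headroom comparison between the two topologies is just arithmetic on how load redistributes when one region is evacuated:

```python
def failover_headroom(regions_sharing_load):
    """Extra capacity each surviving region must absorb when one of
    `regions_sharing_load` equal peers is evacuated."""
    steady_share = 1 / regions_sharing_load
    failover_share = 1 / (regions_sharing_load - 1)
    return failover_share / steady_share - 1

# Four-region full-active: each region goes from 1/4 to 1/3 of traffic.
print(f"{failover_headroom(4):.0%}")
# Two-region sharded pairs: the survivor goes from 1/2 to all of it.
print(f"{failover_headroom(2):.0%}")
```

So full-active across four regions needs roughly 33% failover headroom per region, versus 100% for sharded pairs, which is the cost argument the next paragraph makes.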

Regions are impacted constantly, both for hardware and software reasons, so having this fast evacuation capability allows Netflix to recover extremely quickly during a failure. However, this is only cost-effective if you can spread the load between all other regions, which having a full-active data topology allows.

We couple the stateful process and the OS Kernel together because if we have to upgrade, for example, the Linux OS, we will have to bring down the data store process. Every time the primary process is down, we risk failing quorums, so we want to do this as fast as possible.

As hardware failure is so risky, when we launch new instances, we first ensure they can handle the load we will put on them. We review a list of checks, including that the network functions and the disks perform as expected, before starting the real workload - we call these pre-flight checks. Then, we continuously monitor errors and latency for both hardware and software and preemptively eject hardware that is starting to fail from the fleet before it becomes a source of failure.
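The pre-flight gate can be sketched as a list of predicates that an instance must pass before taking real traffic. The specific checks and thresholds here are hypothetical stand-ins; real checks would benchmark the actual NICs and disks:

```python
def preflight(instance, checks):
    """Run every check before starting the real workload; any single
    failure rejects the instance."""
    return all(check(instance) for check in checks)

# Hypothetical checks with illustrative thresholds.
def network_ok(inst):
    return inst["net_gbps"] >= 10

def disk_ok(inst):
    return inst["disk_iops"] >= 50_000

good = {"net_gbps": 25, "disk_iops": 80_000}
bad = {"net_gbps": 25, "disk_iops": 4_000}  # e.g. a degraded SSD
print(preflight(good, [network_ok, disk_ok]),
      preflight(bad, [network_ok, disk_ok]))
```

The same predicate list can then run continuously against the live fleet, turning the one-time launch gate into the ongoing eject-before-failure monitoring the paragraph describes.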

You can do this with your software as well! For example, at Netflix, we use jvmquake on our stateful services written in Java because it detects GC death spirals early and prevents concurrent-mode-failure-related gray failures via a token bucket algorithm.
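The token-bucket idea can be modeled in a few lines. This is a simplified sketch of the concept only; the real jvmquake is a JVM agent that instruments actual GC and safepoint time, and its parameters and semantics differ:

```python
class GcWatchdog:
    """Simplified token-bucket GC death-spiral detector in the spirit
    of jvmquake: healthy runtime drains the bucket, GC pauses fill it,
    and sustained GC with little forward progress trips the watchdog."""

    def __init__(self, threshold_s=10.0, runtime_weight=5.0):
        self.debt = 0.0                  # seconds of "excess" GC
        self.threshold_s = threshold_s
        self.runtime_weight = runtime_weight

    def record(self, running_s, gc_pause_s):
        """Account one interval; returns True when the process should
        be killed rather than left to gray-fail."""
        self.debt += gc_pause_s - running_s / self.runtime_weight
        self.debt = max(self.debt, 0.0)  # runtime can only drain to zero
        return self.debt > self.threshold_s

w = GcWatchdog(threshold_s=10.0)
# Normal operation: short pauses, plenty of runtime between them.
healthy = w.record(running_s=60.0, gc_pause_s=0.5)
# Death spiral: back-to-back full GCs with almost no progress.
tripped = any(w.record(running_s=0.1, gc_pause_s=2.0) for _ in range(10))
print(healthy, tripped)
```

The key property is that occasional long pauses are forgiven as long as real work happens in between, while a concurrent-mode-failure loop that never makes progress drives the bucket over the threshold and gets the process killed quickly instead of limping along.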

Continuous monitoring is vital for reliability because when you have 2000-plus clusters over a year, bad stuff happens to both your hardware and your software. If you are proactive about monitoring your components, you can detect failures quickly, remediate them, and recover before the failure propagates to the customer of your stateful service.

At Netflix, we treat caches like materialized view engines, caching complex business logic that runs on data rather than caching the underlying data. Most operations to services hit that cached view rather than the service. Whenever the underlying data of that service changes, the service re-calculates the cache value and fills that cache with the new view.

A cache in front of the service protects the service, which is, at least for us, the component that fails first. The cache is cheap relative to the service; stateless apps running actual business logic are quite expensive to operate. This technique can help us improve reliability by decreasing the amount of load that the services and data stores have to handle. It does shift load to caches, but those are easier to make reliable.

In this architecture, all operations go against the local cache populated by eventually consistent pub-sub, and this is extremely reliable because there are no database calls in the hot path of service requests.
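A minimal sketch of that pattern, with hypothetical names (the real system fills caches via a pub-sub pipeline across zones; this just shows the shape of the contract):

```python
class ViewCache:
    """The service re-materializes a view whenever the underlying data
    changes and publishes it; readers only ever touch the local cache."""

    def __init__(self, compute_view):
        self.compute_view = compute_view
        self.cache = {}

    def on_data_changed(self, key, data):
        """Pub-sub handler: recompute the view and fill the cache."""
        self.cache[key] = self.compute_view(data)

    def get(self, key):
        # Hot path: no database call, just a local lookup.
        return self.cache.get(key)

# Hypothetical "business logic" that is expensive to run per request.
def profile_view(data):
    return {"name": data["name"].title(), "rows": sorted(data["rows"])}

cache = ViewCache(profile_view)
cache.on_data_changed("user:1", {"name": "ada", "rows": [3, 1, 2]})
print(cache.get("user:1"))
```

The design choice is that the expensive computation happens once per data change rather than once per read, and the read path has no dependency that can fail other than the cache itself.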

With weighted-choice-of-n, we exploit prior knowledge about networks in the cloud. Because we keep a replica of the data in every zone, we get a faster response by sending the request to the same zone, so all we have to do is weight requests toward the local-zone replica. We also want this to degrade naturally when there are only two copies or when the client's zone is overloaded, so instead of a strict routing rule we take concurrency into account as well. Rather than simply picking the two replicas with the least concurrency, we weight each replica's concurrency by penalty factors: not being in the client's availability zone, not holding a replica, or being in an unhealthy state. This technique reduces latency by up to 40% and improves reliability by keeping traffic in the same zone!
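A sketch of that weighting, with illustrative penalty constants (the actual weights and health signals are not published here):

```python
# Illustrative penalty multipliers, not Netflix's actual weights.
CROSS_ZONE_PENALTY = 4.0
NO_REPLICA_PENALTY = 16.0
UNHEALTHY_PENALTY = 64.0

def score(replica, client_zone):
    """Lower is better: in-flight concurrency, inflated by penalties."""
    s = replica["concurrency"] + 1.0
    if replica["zone"] != client_zone:
        s *= CROSS_ZONE_PENALTY
    if not replica["has_replica"]:
        s *= NO_REPLICA_PENALTY
    if not replica["healthy"]:
        s *= UNHEALTHY_PENALTY
    return s

def pick(replicas, client_zone):
    """With comparable health and load this prefers the local zone,
    but it degrades naturally when the local replica is overloaded."""
    return min(replicas, key=lambda r: score(r, client_zone))

replicas = [
    {"zone": "us-east-1a", "concurrency": 3, "has_replica": True, "healthy": True},
    {"zone": "us-east-1b", "concurrency": 1, "has_replica": True, "healthy": True},
    {"zone": "us-east-1c", "concurrency": 0, "has_replica": True, "healthy": False},
]
local = pick(replicas, "us-east-1a")["zone"]
replicas[0]["concurrency"] = 50  # local zone becomes overloaded
fallback = pick(replicas, "us-east-1a")["zone"]
print(local, fallback)
```

Note how the local replica wins even with somewhat higher concurrency, yet once it is badly overloaded the cross-zone penalty is outweighed and traffic spills to the next-best healthy zone, which is exactly the "degrade naturally" behavior described above.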

The only remaining problem is the transient latency impact: when the client load doubled, we went from 40% utilization to 80% utilization in 10 seconds, so latency degraded. Finally, our server autoscaling system restores our latency SLO by injecting capacity. This whole process (from the client load doubling, to our server limiters tripping, to our client hedges and exponential retries mitigating, to the server autoscaling finally restoring the SLO) takes about five minutes with no human involvement.

Putting it all together: We run with enough buffer to prevent significant outages; when there is an impact, the systems automatically mitigate it, and finally, we recover quickly without human intervention.

Assume you must retry; how do you make your mutable API safe? The answer is idempotency tokens. In concept, this is pretty simple. You generate a token and send it to your endpoint. Then, if something bad happens, you can just retry using the same idempotency token. You have to make sure that the backing storage engines implement this idempotency.

At Netflix, an idempotency token is generally a tuple of a timestamp and some form of token. We can use client monotonic timestamps (in this case, a microsecond timestamp with a little randomness mixed into the low-order bits to reduce duplication) plus a random nonce. If we have a more consistent store, we might need a monotonic timestamp that is global within the region. Finally, the most consistent option is to take a global transaction ID.
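The client-timestamp-plus-nonce variant, and the storage-side contract it relies on, look roughly like this (a conceptual sketch; the in-memory store stands in for a real storage engine that persists applied tokens):

```python
import secrets
import time

def make_idempotency_token():
    """Client-side token: a microsecond wall-clock timestamp plus a
    random nonce. The timestamp orders retries; the nonce guards
    against two clients colliding on the same microsecond."""
    return (int(time.time() * 1_000_000), secrets.token_hex(8))

class IdempotentStore:
    """The backing storage engine must honor the token: replaying the
    same token applies the write exactly once."""

    def __init__(self):
        self.applied = {}

    def write(self, token, value):
        if token in self.applied:  # retried request: no-op
            return self.applied[token]
        self.applied[token] = value
        return value

store = IdempotentStore()
token = make_idempotency_token()
store.write(token, "v1")
# A retry after a timeout reuses the same token and is deduplicated.
result = store.write(token, "v1")
print(result, len(store.applied))
```

The crucial detail is that the token is generated once, before the first attempt, and reused verbatim on every retry; generating a fresh token per retry would silently defeat the deduplication.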

TimeSeries Abstraction stores all of our trace information and playback logs that tell us somebody hit play. This service handles tens of millions of events per second, fully durably with retention for multiple weeks or, in some cases, years.

The idempotency token in this service is built into the TimeSeries contract: if you provide an event with the same event time and unique event ID, the backend storage deduplicates that operation. On the read side, the service breaks a potentially huge response into multiple pages that can be consumed progressively.
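Both halves of that contract fit in a small sketch (the class and method names are illustrative, not the real TimeSeries Abstraction API):

```python
class TimeSeries:
    """Sketch of the contract: events with the same (event_time,
    event_id) pair are stored once; reads come back in fixed-size
    pages that a client can consume progressively."""

    def __init__(self):
        self.events = {}

    def write(self, event_time, event_id, payload):
        # setdefault keeps the first write; retries are no-ops.
        self.events.setdefault((event_time, event_id), payload)

    def read_pages(self, page_size):
        keys = sorted(self.events)
        for i in range(0, len(keys), page_size):
            yield [(k, self.events[k]) for k in keys[i:i + page_size]]

ts = TimeSeries()
ts.write(100, "play-1", {"title": "A"})
ts.write(100, "play-1", {"title": "A"})  # retried write: deduplicated
ts.write(101, "play-2", {"title": "B"})
ts.write(102, "play-3", {"title": "C"})
pages = list(ts.read_pages(page_size=2))
print(len(ts.events), len(pages))
```

Here four writes net three stored events, and reading at a page size of two yields two pages, mirroring the deduplicating write side and the paginated read side described above.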

