darrfyn fabyanah cassidee


Tabita Knezevic

Aug 1, 2024, 11:29:05 PM
to restbalichun

You can browse through the latest Netflix facts and statistics on JustWatch. Our guide is updated monthly to provide the most up-to-date information for Netflix's annual revenue, subscriber growth, market share and more. All of these statistics are available to download and share online.

What does a properly executed design thinking process look like? Examining real-world examples is an effective way to answer that question. Here are five examples of well-known brands that have leveraged design thinking to solve business problems.

Their first recommendation was to make the toothbrush easier to charge, especially while users were on the road. Another was making it more convenient for users to order replacement heads by allowing toothbrushes to connect to phones and send reminder notifications. Both proposals were successful because they focused on what users wanted rather than what the company wanted to roll out.

While these examples illustrate the kind of success design thinking can yield, you need to learn how to practice and use it before implementing it into your business model. Here are several ways to do so:

This is a useful exercise you can do with the examples above. Consider the problem each company faced and think through alternative solutions each could have tried. This can enable you to practice both empathy and ideation.

Our platform features short, highly produced videos of HBS faculty and guest business experts, interactive graphs and exercises, cold calls to keep you engaged, and opportunities to contribute to a vibrant online community.

All course content is delivered in written English. Closed captioning in English is available for all videos. There are no live interactions during the course that require the learner to speak English. Coursework must be completed in English.

All programs require the completion of a brief online enrollment form before payment. If you are new to HBS Online, you will be required to set up an account before enrolling in the program of your choice.

Our easy online enrollment form is free, and no special documentation is required. All participants must be at least 18 years of age, proficient in English, and committed to learning and engaging with fellow participants throughout the program.

HBS Online's CORe and CLIMB programs require the completion of a brief application. The applications vary slightly, but all ask for some personal background information. You can apply for and enroll in programs here. If you are new to HBS Online, you will be required to set up an account before starting an application for the program of your choice.

Our easy online application is free, and no special documentation is required. All participants must be at least 18 years of age, proficient in English, and committed to learning and engaging with fellow participants throughout the program.

Updates to your application and enrollment status will be shown on your account page. We confirm enrollment eligibility within one week of your application for CORe and three weeks for CLIMB. HBS Online does not use race, gender, ethnicity, or any protected class as criteria for admissions for any HBS Online program.

We accept payments via credit card, wire transfer, Western Union, and (when available) bank loan. Some candidates may qualify for scholarships or financial aid, which will be credited against the Program Fee once eligibility is determined. Please refer to the Payment & Financial Aid page for further information.

We also allow you to split your payment across 2 separate credit card transactions or send a payment link email to another person on your behalf. If splitting your payment into 2 transactions, a minimum payment of $350 is required for the first transaction.

After enrolling in a program, you may request a withdrawal with refund (minus a $100 nonrefundable enrollment fee) up until 24 hours after the start of your program. Please review the Program Policies page for more details on refunds and deferrals. If your employer has contracted with HBS Online for participation in a program, or if you elect to enroll in the undergraduate credit option of the Credential of Readiness (CORe) program, note that policies for these options may differ.

Service A always fails a little bit; it never recovers. Service B occasionally fails cataclysmically. It recovers quickly but still experiences a near 100% outage during that period. Finally, service C rarely fails, but when it does fail, it fails for a long time.

The following graphic shows that each service has an equal number of nines. However, how you solve these three failure modes is drastically different! The first requires request hedging or retries, the second needs load shedding or backpressure, and the third needs faster detection and failover.
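To make the "equal number of nines" point concrete, here is a minimal sketch with made-up numbers. The 30-day window, the 0.1% background error rate, and the outage counts and durations are all assumptions picked so the three services line up; none of them come from Netflix. Traffic is assumed to be uniform over the window.

```python
WINDOW_SECONDS = 30 * 24 * 60 * 60  # a 30-day window, in seconds (assumption)


def availability(failed_seconds: float) -> float:
    """Fraction of the window in which requests succeeded.

    failed_seconds is the equivalent amount of fully-failed time,
    assuming traffic is spread evenly over the window.
    """
    return 1 - failed_seconds / WINDOW_SECONDS


# Service A: 0.1% of requests fail, all the time.
service_a = availability(0.001 * WINDOW_SECONDS)

# Service B: 36 short but total outages of 72 seconds each.
service_b = availability(36 * 72)

# Service C: a single total outage of 2592 seconds (about 43 minutes).
service_c = availability(1 * 2592)

for name, value in [("A", service_a), ("B", service_b), ("C", service_c)]:
    print(f"service {name}: {value:.4%}")
```

All three come out at roughly 99.9%, even though the failure shapes, and therefore the fixes, have nothing in common.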

We use these techniques at Netflix to reach very high scales of online stateful services. We have near caches, which live on service hosts and handle billions of requests per second with sub-100-microsecond latency. We have remote caches based on Memcached that handle tens of millions of requests per second within 100-microsecond latency targets, storing petabytes of data. Finally, we have stateful databases with Apache Cassandra running in 4-region full-active mode, providing in-region read-your-write consistency with single-digit millisecond latency.

With an understanding of our software and the underlying hardware, we program workload capacity models, which consider several parameters about the workload to model the possible CPU, memory, network, and disk requirements. Given these workload details, the model then outputs the specifications for a cluster of computers that we call the least regretful choice. More information about how this capacity modeling works can be found here: AWS re:Invent talk from 2022.

We replicate our clusters to 12 Amazon availability zones spread across four regions because we want to ensure that all of our microservices have local zone access to their data. Network communication is an opportunity for failure, so we try to keep as much within a zone as we can, and if we have to cross zones, we try to keep it in the region. But, sometimes, we do have to go across regions. This replication technique allows us to have highly reliable write and read operations because we can use quorums to accept writes in any region. By having three copies in every region, we can provide a very high level of reliability while maintaining strong consistency.
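As a rough illustration of why three copies per region work well with quorums, the sketch below uses plain majority-quorum arithmetic. The function names are hypothetical, and it assumes a simple majority quorum inside one region; the text does not say exactly which consistency levels are used.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica set."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return replicas - quorum_size(replicas)


def reads_see_latest_write(replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return write_acks + read_acks > replicas


copies_per_region = 3  # from the text: three copies in every region
print(quorum_size(copies_per_region))         # majority needed for a quorum
print(tolerated_failures(copies_per_region))  # replicas that may be down
print(reads_see_latest_write(copies_per_region, write_acks=2, read_acks=2))
```

With three copies, a quorum needs two of them, so one replica (for example, one zone) can be unavailable while quorum reads and writes in that region still succeed and still overlap.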

Sometimes, we have to evacuate traffic from a degraded region. This illustrates the overall capacity usage of the running Netflix system. Most of our money is spent on our stateless services, typically provisioned to handle around one-fourth of global traffic.

The alternative is a more traditional sharded approach for our databases, where we run two copies of state instead of four. For example, imagine we had two replication groups: one for "America" between us-west-2 and us-east-2, and one for "Europe" between us-east-1 and eu-west-1. In this traditional approach, we would have to reserve a lot more headroom for traffic during failover: when one region of a pair fails, its partner has to absorb all of that region's traffic, so each region needs 100% more capacity.
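The headroom arithmetic can be sketched as follows. This assumes traffic is split evenly across the regions that share the same data and that a failed region's traffic is spread evenly over the survivors; `failover_headroom` is a hypothetical helper, not something from the original talk.

```python
from fractions import Fraction


def failover_headroom(regions_sharing_load: int) -> Fraction:
    """Extra capacity each region needs to survive the loss of one peer.

    Assumes an even traffic split before the failure and an even
    redistribution across the remaining regions afterwards.
    """
    if regions_sharing_load < 2:
        raise ValueError("need at least two regions to fail over")
    normal_share = Fraction(1, regions_sharing_load)
    share_after_failure = Fraction(1, regions_sharing_load - 1)
    return share_after_failure / normal_share - 1


full_active = failover_headroom(4)   # four regions, any can take any traffic
paired_shards = failover_headroom(2)  # two-region replication groups
print(f"4-region full-active: {float(full_active):.0%} extra per region")
print(f"2-region pairs:       {float(paired_shards):.0%} extra per region")
```

Under these assumptions, the two-region pairing needs 100% more capacity per region, while spreading the load over four full-active regions needs about a third more.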

Regions are impacted constantly, both for hardware and software reasons, so having this fast evacuation capability allows Netflix to recover extremely quickly during a failure. However, this is only cost-effective if you can spread the load between all other regions, which having a full-active data topology allows.

We couple the stateful process and the OS kernel together because if we have to upgrade, for example, the Linux OS, we will have to bring down the data store process. Every time the primary process is down, we risk failing quorums, so we want to do this as fast as possible.

As hardware failure is so risky, when we launch new instances, we first ensure they can handle the load we will put on them. We review a list of checks, including that the network functions and the disks perform as expected, before starting the real workload - we call these pre-flight checks. Then, we continuously monitor errors and latency for both hardware and software and preemptively eject hardware that is starting to fail from the fleet before it becomes a source of failure.

You can do this with your software as well! For example, at Netflix, we use jvmquake on our stateful services written in Java because it detects GC death spirals early and prevents concurrent-mode-failure-related gray failures via a token bucket algorithm:
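The general idea of a token bucket over GC time can be sketched like this. It is an illustration of the technique only, not jvmquake's actual implementation: the class name, the 30-second threshold, and the run-time weight are all assumptions. Time spent paused in GC adds "debt" to the bucket, time spent running application code pays it down, and the process is flagged once the debt crosses the threshold.

```python
class GcDebtBucket:
    """Tracks GC pause time as debt that normal running time pays off."""

    def __init__(self, threshold_s: float = 30.0, run_weight: float = 1.0):
        self.threshold_s = threshold_s  # hypothetical default
        self.run_weight = run_weight    # how fast running time repays debt
        self.debt_s = 0.0

    def record(self, run_s: float, gc_pause_s: float) -> bool:
        """Record one cycle: the app ran for run_s, then paused for gc_pause_s.

        Returns True when the accumulated debt exceeds the threshold,
        i.e. the process looks like it is in a GC death spiral.
        """
        self.debt_s = max(0.0, self.debt_s - run_s * self.run_weight)
        self.debt_s += gc_pause_s
        return self.debt_s > self.threshold_s


# A healthy process: long runs, short pauses. Debt never builds up.
healthy = GcDebtBucket()
healthy_tripped = any(healthy.record(run_s=1.0, gc_pause_s=0.05) for _ in range(100))

# A death spiral: almost no useful work between long pauses.
spiral = GcDebtBucket()
cycles = 0
spiral_tripped = False
while not spiral_tripped and cycles < 1000:
    cycles += 1
    spiral_tripped = spiral.record(run_s=0.1, gc_pause_s=2.0)

print(healthy_tripped, spiral_tripped, cycles)
```

The point of the bucket is that occasional long pauses are forgiven as long as the process does real work in between, while back-to-back pauses trip the threshold quickly.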

Continuous monitoring is vital for reliability because when you have 2000-plus clusters over a year, bad stuff happens to both your hardware and your software. If you are proactive about monitoring your components, you can detect failures quickly, remediate them, and recover before the failure propagates to the customer of your stateful service.

At Netflix, we treat caches like materialized view engines, caching complex business logic that runs on data rather than caching the underlying data. Most operations to services hit that cached view rather than the service. Whenever the underlying data of that service changes, the service re-calculates the cache value and fills that cache with the new view.
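A minimal sketch of the "cache as materialized view" pattern follows. Everything here is hypothetical (the `ViewCache` class, the in-process dict standing in for a real cache, the example view function); it only shows the shape: the expensive business logic runs when the data changes, and reads just fetch the precomputed result.

```python
from typing import Any, Callable, Dict, Optional


class ViewCache:
    """Stores the computed view, not the raw data behind it."""

    def __init__(self, compute_view: Callable[[Any], Any]):
        self._compute_view = compute_view
        self._views: Dict[str, Any] = {}  # stand-in for a remote cache
        self.compute_calls = 0

    def on_data_changed(self, key: str, raw_data: Any) -> None:
        """Write path: recompute the view once and fill the cache."""
        self.compute_calls += 1
        self._views[key] = self._compute_view(raw_data)

    def get(self, key: str) -> Optional[Any]:
        """Read path: no business logic, just a cache lookup."""
        return self._views.get(key)


# Hypothetical "business logic": summarise a member's viewing history.
def top_genre(history):
    return max(set(history), key=history.count)


cache = ViewCache(top_genre)
cache.on_data_changed("member-1", ["drama", "comedy", "drama"])
reads = [cache.get("member-1") for _ in range(1000)]
print(reads[0], cache.compute_calls)
```

A thousand reads cost one computation here, which is the load reduction the paragraph describes.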

A cache in front of the service protects the service, which is, at least for us, the component that fails first. The cache is cheap relative to the service; stateless apps running actual business logic are quite expensive to operate. This technique can help us improve reliability by decreasing the amount of load that the services and data stores have to handle. It does shift load to caches, but those are easier to make reliable.

In this architecture, all operations go against the local cache populated by eventually consistent pub-sub, and this is extremely reliable because there are no database calls in the hot path of service requests.

With weighted-choice-of-n, we exploit prior knowledge about networks in the cloud. We know that because we have a replica of data in every zone, we will get a faster response if we send the request to the same zone. All we have to do is weight requests toward our local zone replica. We want this to degrade naturally, in case we only have two copies or if the client is in an overloaded zone. Instead of just having a strict routing rule, we take concurrency into account as well. Instead of picking the two with the least concurrency, we weigh the concurrency by the following factors: not being in the same availability zone, not having a replica, or being in an unhealthy state. This technique reduces latency by up to 40% and improves reliability by keeping traffic in the same zone!
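One way to sketch weighted-choice-of-n is shown below. The penalty values, the `Replica` fields, and the scoring formula are assumptions made for illustration; the text only says that concurrency is weighted by zone locality, replica ownership, and health.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical penalty multipliers; real values would need tuning.
REMOTE_ZONE_PENALTY = 2.0
NO_REPLICA_PENALTY = 4.0
UNHEALTHY_PENALTY = 10.0


@dataclass
class Replica:
    name: str
    zone: str
    in_flight: int = 0      # current concurrent requests
    has_data: bool = True   # holds a replica of the requested data
    healthy: bool = True


def score(replica: Replica, client_zone: str) -> float:
    """Lower is better: concurrency scaled up by each penalty that applies."""
    weight = 1.0
    if replica.zone != client_zone:
        weight *= REMOTE_ZONE_PENALTY
    if not replica.has_data:
        weight *= NO_REPLICA_PENALTY
    if not replica.healthy:
        weight *= UNHEALTHY_PENALTY
    return (replica.in_flight + 1) * weight


def weighted_choice_of_n(replicas: List[Replica], client_zone: str,
                         n: int = 2, rng=random) -> Replica:
    """Sample n candidates at random and pick the one with the lowest score."""
    candidates = rng.sample(replicas, min(n, len(replicas)))
    return min(candidates, key=lambda r: score(r, client_zone))


fleet = [
    Replica("a", zone="zone-1"),
    Replica("b", zone="zone-2"),
    Replica("c", zone="zone-3"),
]
print(weighted_choice_of_n(fleet, client_zone="zone-1", n=3).name)
```

Because locality is a weight rather than a strict rule, a heavily loaded local replica loses to an idle remote one, which is the natural degradation the paragraph asks for.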
