Netflix
Netflix has been moving huge portions of its streaming operation to Amazon Web Services (AWS) for years now, and it says it has finally completed its giant shift to the cloud. “We are happy to report that in early January of 2016, after seven years of diligent effort, we have finally completed our cloud migration and shut down the last remaining data center bits used by our streaming service,” Netflix said in a blog post that it plans to publish at noon Eastern today. (The blog should go up at this link.)
Netflix operates “many tens of thousands of servers and many tens of petabytes of storage” in the Amazon cloud, Netflix VP of cloud and platform engineering Yury Izrailevsky told Ars in an interview.
Netflix had earlier planned to complete the shift by the end of last summer.
“Billing and payments was the last remaining piece. We wanted to make sure we do it right; obviously, there is a lot of privacy concerns around customer data,” Izrailevsky said. Previously, the applications and data related to billing and payments were in a cage Netflix rented at a colocation facility.
With this last piece finished, Netflix’s streaming business no longer operates any of its own data center space. But not everything is in Amazon.
Netflix operates its own content delivery network (CDN) called Open Connect. Netflix manages Open Connect from Amazon, but the storage boxes holding videos that stream to your house or mobile device are all located in data centers within Internet service providers' networks or at Internet exchange points, facilities where major network operators exchange traffic. Netflix distributes traffic directly to Comcast, Verizon, AT&T, and other big network operators at these exchange points.
Once a customer hits the “play” button, video is delivered from one of those sites. But all the applications and data needed to manage everything a customer does before clicking “play” (such as signing up for the service or searching for videos) are running in the Amazon cloud. All the customer-facing systems for the streaming business are thus in Amazon or the Open Connect storage boxes. “All the search, personalization, all the business logic, all the data processing that enables the streaming experience, the 100 different applications and services that make up the streaming application, they live in AWS,” Izrailevsky said.
Most of the technology needed to manage employees of the streaming business is also in Amazon, though the company also uses some software-as-a-service applications such as Workday, Izrailevsky said.
There’s one other exception to Netflix’s shift to the cloud. While the streaming business has gone all cloud, the old DVD mailing business has not. “Our DVD business is still relying on the data center [colocation facility] for all of their operations,” Izrailevsky said. The DVD and streaming businesses are run separately, with their own systems and processes. The DVD business is “stable” and well served by its current setup, Izrailevsky said.
In other words, the DVD business isn’t experiencing the massive growth that requires the ability to scale up as needed. Netflix streaming, meanwhile, just keeps growing and accounts for more than one-third of all North American fixed Internet traffic during peak viewing hours, according to the latest Sandvine Global Internet Phenomena Report. “Supporting such rapid growth would have been extremely difficult out of our own data centers; we simply could not have racked the servers fast enough,” Netflix’s blog post says. “Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible.”
Even though the DVD business remains in a traditional data center, it was actually an outage in the DVD operation that spurred Netflix’s shift to the cloud for its streaming service. For three days in August 2008, Netflix couldn’t ship DVDs to customers because of a major database corruption. As annoying as that was, Netflix knew it would be even worse if something like that happened to the streaming product. Customers could still watch the DVDs they had during that outage. But with streaming, a three-day outage would mean no video watching, period. Netflix had launched its streaming service in 2007 and knew there was potential for growth, “so we wanted to get ahead of that,” Izrailevsky said. Besides improved availability, Netflix says using Amazon allowed it to meet increasing demand at a lower price than it would have paid if it still operated its own data centers.
Netflix declined to say how much it pays Amazon, but says it expects to "spend over $800 million on technology and development in 2016," up from $651 million in 2015. Netflix spends less on technology than it does on marketing, according to its latest earnings report.
The big question on your mind might be this: What happens if the Amazon cloud fails?
That's one reason it took Netflix seven years to make the shift to Amazon. Instead of moving existing systems intact to the cloud, Netflix rebuilt nearly all of its software to take advantage of a cloud network that "allows one to build highly reliable services out of fundamentally unreliable but redundant components," the company says. To minimize the risk of disruption, Netflix has built a series of tools with names like “Chaos Monkey,” which randomly takes virtual machines offline to make sure Netflix can survive failures without harming customers. Netflix’s “Simian Army” ramped up with Chaos Gorilla (which disables an entire Amazon availability zone) and Chaos Kong (which simulates an outage affecting an entire Amazon region and shifts workloads to other regions).
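The idea behind Chaos Monkey can be illustrated with a toy simulation. The sketch below is not Netflix's tool: the function name `chaos_round`, the instance names, and the `min_healthy` threshold are all assumptions made up for illustration. It removes one randomly chosen instance from a fleet and checks whether enough instances remain to keep serving.

```python
import random


def chaos_round(instances, min_healthy, rng):
    """Take one random instance offline and report whether the service survives.

    instances   -- set of instance names currently running (hypothetical names)
    min_healthy -- assumed minimum number of instances needed to serve traffic
    rng         -- a random.Random object, passed in so runs can be reproduced
    """
    victim = rng.choice(sorted(instances))   # sort first: sets have no stable order
    survivors = instances - {victim}         # the original set is left untouched
    return victim, survivors, len(survivors) >= min_healthy


rng = random.Random(42)
fleet = {"vm-a", "vm-b", "vm-c", "vm-d"}

# A fleet with spare capacity survives the loss of any single machine.
victim, survivors, ok = chaos_round(fleet, min_healthy=3, rng=rng)

# A fleet with no redundancy does not.
_, _, fragile_ok = chaos_round({"vm-a"}, min_healthy=1, rng=rng)
```

Running rounds like this regularly in production, rather than waiting for a real failure, is what turns "we think we have enough redundancy" into something that is tested continuously.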
Amazon’s cloud network is spread across 12 regions worldwide, each of which has availability zones consisting of one or more data centers. Netflix operates primarily in the Northern Virginia, Oregon, and Dublin regions, but if an entire region goes down, “we can instantaneously redirect the traffic to the other available ones,” Izrailevsky said. "It's not that uncommon for us to fail over across regions for various reasons."
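As a rough illustration of what redirecting traffic between regions means, the sketch below routes a user to a home region and falls back to another one when the home region is unhealthy. The region codes are the AWS identifiers for Northern Virginia, Oregon, and Ireland (where Dublin is located); the `route` function and the fallback ordering are hypothetical and do not describe Netflix's actual traffic management.

```python
# Assumed fallback order for each home region (illustrative only).
FAILOVER_ORDER = {
    "us-east-1": ["us-west-2", "eu-west-1"],   # Northern Virginia
    "us-west-2": ["us-east-1", "eu-west-1"],   # Oregon
    "eu-west-1": ["us-east-1", "us-west-2"],   # Ireland (Dublin)
}


def route(home_region, healthy_regions):
    """Return the region that should serve a user whose home is home_region."""
    if home_region in healthy_regions:
        return home_region
    for candidate in FAILOVER_ORDER[home_region]:
        if candidate in healthy_regions:
            return candidate
    raise RuntimeError("no healthy region available")


all_up = {"us-east-1", "us-west-2", "eu-west-1"}
normal = route("us-east-1", all_up)

# Simulate losing Northern Virginia: its users are sent to the next region in line.
during_outage = route("us-east-1", all_up - {"us-east-1"})
```

The hard part in practice is not the lookup but making sure the surviving regions hold the data and spare capacity to absorb the extra load, which is what allows a service to fail over across regions at all.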
Years ago, Netflix wasn't able to do that, and the company suffered a streaming failure on Christmas Eve in 2012, when it was operating in just one Amazon region. “We've invested a lot of effort in disaster recovery and making sure no matter how big a failure that we're able to bring things back from backups,” he said.
Netflix has multiple backups of all data within Amazon.
“Customer data or production data of any sort, we put it in distributed databases such as Cassandra, where each data element is replicated multiple times in production, and then we generate primary backups of all the data into S3 [Amazon’s Simple Storage Service],” he said. “All the logical errors, operator errors, or software bugs, many kinds of corruptions—we would be able to deal with them just from those S3 backups.”
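The distinction Izrailevsky draws between replication and backups can be shown with a toy model. Replication protects against losing a machine, but a bad write caused by a bug is copied to every replica just as faithfully as a good one; only a separate backup lets you go back. The `ReplicatedStore` class below is a hypothetical in-memory stand-in written for this illustration. It does not use the Cassandra or S3 APIs.

```python
import copy


class ReplicatedStore:
    """Toy key-value store that writes every value to several replicas."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, key, value):
        # Every write, good or bad, is applied to all replicas.
        for replica in self.replicas:
            replica[key] = value

    def get(self, key):
        return self.replicas[0][key]

    def snapshot(self):
        # Stand-in for a backup kept outside the live database.
        return copy.deepcopy(self.replicas[0])

    def restore(self, snap):
        self.replicas = [copy.deepcopy(snap) for _ in self.replicas]


store = ReplicatedStore(replicas=3)
store.put("user:1:plan", "standard")
backup = store.snapshot()

# A simulated software bug overwrites good data; replication spreads the damage.
store.put("user:1:plan", None)
corrupted_everywhere = all(r["user:1:plan"] is None for r in store.replicas)

# The backup is what makes recovery possible.
store.restore(backup)
recovered = store.get("user:1:plan")
```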
What if all of Netflix's systems in Amazon went down? Netflix keeps backups of everything in Google Cloud Storage in case of a natural disaster, a self-inflicted failure that somehow takes all of Netflix's systems down, or a “catastrophic security breach that might affect our entire AWS deployment,” Izrailevsky said. “We've never seen a situation like this and we hope we never will.”
But Netflix would be ready in part thanks to a system it calls “Armageddon Monkey,” which simulates failure of all of Netflix’s systems on Amazon. It could take hours or even a few days to recover from an Amazon-wide failure, but Netflix says it can do it. Netflix pointed out that Amazon isolates its regions from each other, making it difficult for all of them to go out simultaneously.
“So that's not the scenario we're planning for. Rather it's a catastrophic bug or data corruption that would cause us to wipe the slate clean and start fresh from the latest good back-up,” a Netflix spokesperson said. “We hope we will never need to rely on Armageddon Monkey in real life, but going through the drill helps us ensure we back up all of our production data, manage dependencies properly, and have a clean, modular architecture; all this puts us in a better position to deal with smaller outages as well.”
Netflix declined to say where it would operate its systems during an emergency that forced it to move off Amazon. "From a security perspective, it'd be better not to say," a spokesperson said.
Netflix has released a lot of its software as open source, saying it prefers to collaborate with other companies rather than keep secret its methods for making cloud networks more reliable. “While of course cloud is important for us, we're not very protective of the technology and the best practices, we really hope to build the community,” Izrailevsky said.
17 March 2016
Tomorrow we'll release Season 2 of Marvel's Daredevil to 190 countries simultaneously. Netflix members all over the planet will instantly be able to stream the show on any internet-connected device. Even though millions of people around the world will be watching, there will be very little additional traffic on the “internet” because of a decision we made in 2011 to build our own content delivery network, or CDN.
Since we went global in January, we’ve had increased interest in how we deliver a great Netflix viewing experience to 190 countries simultaneously. We achieve that with Netflix Open Connect, our globally distributed CDN. This map of our network gives you a sense for how much this effort has scaled in the last five years.
[Map of Netflix Open Connect locations. Legend: ISP locations; Internet exchange points (circles are sized by volume).]
Netflix Open Connect delivers 100% of our video traffic, currently over 125 million hours of viewing per day. This amounts to tens of terabits per second of simultaneous peak traffic, making Netflix Open Connect one of the highest-volume networks in the world.
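A back-of-the-envelope calculation shows how 125 million hours of viewing per day translates into tens of terabits per second. The average bitrate per stream and the peak-to-average ratio below are illustrative assumptions, not figures from this post; only the hours-per-day number comes from the text.

```python
HOURS_VIEWED_PER_DAY = 125e6       # from the post
ASSUMED_AVG_MBPS = 3.0             # assumption: average bitrate of one stream
ASSUMED_PEAK_TO_AVG = 2.0          # assumption: evening peak vs. daily average

# Hours watched per day divided by hours in a day gives the average
# number of streams playing at any one moment.
avg_concurrent_streams = HOURS_VIEWED_PER_DAY / 24

# Mbps -> Tbps is a factor of one million.
avg_tbps = avg_concurrent_streams * ASSUMED_AVG_MBPS / 1e6
peak_tbps = avg_tbps * ASSUMED_PEAK_TO_AVG

print(f"average concurrent streams: {avg_concurrent_streams:,.0f}")
print(f"average traffic: {avg_tbps:.1f} Tbps, assumed peak: {peak_tbps:.1f} Tbps")
```

Under these assumptions the average works out to roughly 5.2 million simultaneous streams and about 15.6 Tbps, with a peak around 31 Tbps, which is consistent with "tens of terabits per second."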
Globally, close to 90% of our traffic is delivered via direct connections between Open Connect and the residential Internet Service Providers (ISPs) our members use to access the internet. Most of these connections are localized to the regional point of interconnection that’s geographically closest to the member who’s watching. Because connections to the Netflix Open Connect network are always free and our traffic delivery is highly localized, thousands of ISPs around the world enthusiastically participate.
We also give qualifying ISPs the same Open Connect Appliances (OCAs) that we use in our internet interconnection locations. After these appliances are installed in an ISP’s data center, almost all Netflix content is served from the local OCAs rather than “upstream” from the internet. Many ISPs take advantage of this option, in addition to local network interconnection, because it reduces the amount of capacity they need to build to the rest of the internet since Netflix is no longer a significant factor in that capacity. This has the dual benefit of reducing the ISP’s cost of operation and ensuring the best possible Netflix experience for their subscribers.
We now have Open Connect Appliances in close to 1,000 separate locations around the world, on every continent except Antarctica. They sit in big cities like New York, Paris, London, Hong Kong, and Tokyo, as well as in more remote locations: as far north as Greenland and Tromsø, Norway, and as far south as Puerto Montt, Chile, and Hobart, Tasmania. ISPs have even placed OCAs in Macapá and Manaus in the Amazon rainforest, and on many islands such as Jamaica, Malta, Guam, and Okinawa. This means that most of our members are getting their Netflix audio and video bits from a server that’s either inside of, or directly connected to, their ISP’s network within their local region.
As our service continues to grow in all of the new global locations we’re reaching, so will our Netflix Open Connect footprint, as ISPs take advantage of the cost savings available to them by participating in our Netflix Open Connect program. That means Netflix quality in places like India, the Middle East, Africa and Asia will continue to improve.
We shared in a recent blog post that Netflix uses Amazon’s AWS “cloud” for generic, scalable computing. Essentially everything before you hit “play” happens in AWS, including all of the logic of the application interface, the content discovery and selection experience, recommendation algorithms, transcoding, etc.; we use AWS for these applications because the need for this type of computing is not unique to Netflix and we can take advantage of the ease of use and growing commoditization of the “cloud” market.
Everything after you hit “play” is unique to Netflix, and our growing need for scale in this area presented the opportunity to create greater efficiency for our content delivery and for the internet in general.
To understand how all of this happens, let’s look a little more deeply at how Open Connect came about, and how it works:
Netflix Open Connect was originally developed in 2011 (and announced in 2012) as a response to the ever-increasing scale of Netflix streaming. Since the launch of the streaming service in 2007, Netflix had proved to be a significant and increasingly large share of internet traffic in every market in which we operated. Although third-party content delivery networks were doing a great job delivering Netflix content (as well as all kinds of other content on the internet), we realized we could be much more efficient based on our knowledge of how our members use Netflix. Although the number and size of the files that make up our content library can be staggering, we are able to use sophisticated popularity models to make sure the right file is on the right server at the right time. These advanced algorithms share some common approaches, and sometimes common inputs, with our industry-leading content recommendation systems.
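To make the idea of popularity-driven placement concrete, here is a minimal sketch of one simple approach: rank titles by predicted plays and fill a server's storage greedily. This is an illustration only and not the actual Open Connect algorithm, which this post does not describe. The function name `choose_files`, the titles, the play counts, and the sizes are all hypothetical.

```python
def choose_files(predicted_plays, sizes_gb, capacity_gb):
    """Pick titles for one server, most popular first, until storage runs out.

    predicted_plays -- dict of title -> predicted plays in the next period
    sizes_gb        -- dict of title -> storage needed in GB
    capacity_gb     -- storage available on the server in GB
    """
    chosen, used = [], 0.0
    ranked = sorted(predicted_plays, key=predicted_plays.get, reverse=True)
    for title in ranked:
        if used + sizes_gb[title] <= capacity_gb:
            chosen.append(title)
            used += sizes_gb[title]
        # A title that does not fit is skipped; smaller ones may still fit.
    return chosen


plays = {"show-a": 900, "show-b": 500, "film-c": 50, "film-d": 400}
sizes = {"show-a": 40, "show-b": 30, "film-c": 20, "film-d": 50}

placement = choose_files(plays, sizes, capacity_gb=100)
print(placement)  # ['show-a', 'show-b', 'film-c']
```

In this example `film-d` is more popular than `film-c` but does not fit in the remaining 30 GB, so the smaller title takes the slot. A production system would weigh many more inputs, but the core trade-off between predicted demand and limited storage is the same.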
As we touched on above, pre-positioning content in this way allows us to avoid any significant utilization of internet “backbone” capacity. Take the continent of Australia, for example. All access to internet content that does not originate in Australia comes via a number of undersea cables. Rather than using this expensive undersea capacity to serve Netflix traffic, we copy each file once from our US-based transcoding repository to the storage locations within Australia. This is done during off-peak hours, when we’re not competing with other internet traffic. After each file is on the continent, it is then replicated to dozens of Open Connect servers within each ISP network.
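The saving from pre-positioning can be sketched with simple arithmetic: the file crosses the undersea link once, and all further copies and all viewing happen over in-country links. The file size, ISP names, server counts, and view count below are made-up numbers chosen for illustration.

```python
def prepositioned_transfer_gb(file_gb, servers_per_isp):
    """Return (undersea_gb, in_country_gb) for distributing one file."""
    undersea_gb = file_gb                                   # crosses the ocean exactly once
    in_country_gb = file_gb * sum(servers_per_isp.values()) # one copy per local server
    return undersea_gb, in_country_gb


def on_demand_undersea_gb(file_gb, views):
    """Undersea traffic if every view were instead served from overseas."""
    return file_gb * views


# Hypothetical Australian ISPs, each hosting dozens of servers.
servers = {"isp-one": 24, "isp-two": 36}

undersea, in_country = prepositioned_transfer_gb(5.0, servers)
naive = on_demand_undersea_gb(5.0, views=100_000)

print(f"pre-positioned: {undersea:.0f} GB undersea, {in_country:.0f} GB in-country")
print(f"served from overseas: {naive:,.0f} GB undersea")
```

With these numbers, 5 GB crosses the ocean instead of 500,000 GB, and the 300 GB of replication traffic stays on local links where capacity is far cheaper.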
Beyond the basic concept of pre-positioning content, we were also able to focus on creating a highly efficient combination of hardware and software for our Open Connect Appliances. This specialization and focus on optimization has allowed us to improve OCA efficiency by an order of magnitude since the start of the program. We went from delivering 8 Gbps of throughput from a single server in 2012 to over 90 Gbps from a single server in 2016.
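The "order of magnitude" claim is easy to check, and it has a direct effect on how many machines a site needs. The per-server throughput figures below come from the text; the 1 Tbps site load is an assumption picked only to make the arithmetic concrete.

```python
import math

GBPS_PER_SERVER_2012 = 8       # from the post
GBPS_PER_SERVER_2016 = 90      # from the post ("over 90 Gbps", so a lower bound)
ASSUMED_SITE_LOAD_GBPS = 1000  # assumption: a site that must serve 1 Tbps

improvement = GBPS_PER_SERVER_2016 / GBPS_PER_SERVER_2012

# Round up: a fraction of a server still means one more physical box.
servers_2012 = math.ceil(ASSUMED_SITE_LOAD_GBPS / GBPS_PER_SERVER_2012)
servers_2016 = math.ceil(ASSUMED_SITE_LOAD_GBPS / GBPS_PER_SERVER_2016)

print(f"improvement: {improvement:.2f}x")                   # 11.25x
print(f"servers needed: {servers_2012} -> {servers_2016}")  # 125 -> 12
```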
At the same time, Open Connect Appliances have become smaller and more power efficient. This means each TV show or movie that is watched by a Netflix subscriber requires less energy to power and cool a server that fits into a smaller space. In fact, our entire content serving footprint is carbon neutral, as we recently pointed out in this blog.
This year, we’ve extended our service everywhere in the world, with the exception of China. We’re excited about the role Netflix Open Connect can play in bringing enjoyment to people all over the planet. It feels like the adventure is just beginning!
-Ken
Ken Florance is Vice President, Content Delivery at Netflix