Netflix is moving more towards the cloud

eran.s...@gmail.com

Oct 24, 2010, 5:45:23 PM
to iltec...@googlegroups.com
https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxwcmFjdGljYWxjbG91ZGNvbXB1dGluZ3xneDo0OWM2N2YwM2Q2OTAxMDZm&pli=1

Just thought some of you might be interested in this document written by Sid Anand, one of Netflix's architects, about why they are going to move more to the cloud instead of building their own data centers.

And this is a company with a market cap of ~$8 billion.

Eran

Ori Lahav

Oct 24, 2010, 6:34:15 PM
to iltec...@googlegroups.com
Very interesting.
With so many challenges and so much investment in data conversion layers, he could easily build another DC.

And... I wouldn't want to be the Netflix CTO 3 months from now when he finds out:
1. How much it cost them. It is amazing how costly AWS is vs. efficient DC management.
2. What happens when it fails and they depend on Amazon to fix it.

Every minute of downtime for an $8B company costs too much to afford waiting on Amazon's support line with no real answers.
Shit happens to everyone, even to Amazon (believe me, I've been there). When it happens, the minimum you want is control over the situation.

Bottom line: everyone has their own considerations, especially those who still pay premium licenses to Oracle in 2010 when there are so many open source options.

But... for that purpose, I offer my talk on DC management and why it is important to know how to do it.

Ori

eran.s...@gmail.com

Oct 25, 2010, 6:02:45 AM
to iltec...@googlegroups.com
Using the cloud correctly requires understanding how to structure your architecture to avoid problems which can and will occur in any DC.

That's why Amazon has multiple geographical regions (US West Coast, US East Coast, Ireland and Singapore).
In each of them there are multiple availability zones (AZs), which are basically separate data centers within the same region.

The odds of Amazon losing multiple AZs within the same region, or multiple regions, are very slim, though it is possible in some strange scenarios. A correct architecture should handle these problems.
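
To put rough numbers on "very slim", here is a minimal sketch. It assumes that zone outages are independent and that each zone has the same outage probability; both are simplifications, and the 0.001 used in the usage note is a made-up figure, not an Amazon statistic. The "strange scenarios" are exactly the cases where independence breaks down, so real odds are higher than this model suggests.

```python
def joint_outage_probability(p_single: float, zones: int) -> float:
    """Probability that all `zones` are down at once.

    Assumes independent failures with the same per-zone probability
    `p_single` (a hypothetical input, not a measured AWS value).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return p_single ** zones
```

For example, `joint_outage_probability(0.001, 2)` is about one in a million, which is why spreading a system across AZs and regions pays off as long as the architecture can actually fail over.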

Regarding your comment about Amazon support, this is unfortunately mostly FUD. From my experience and that of others, their support is VERY responsive and helpful, even for customers who don't pay for premium support.

Again, you wouldn't need Amazon support too much if you correctly architect your system to withstand the loss of an AZ or a region.

So, I think we can respectfully agree to disagree :-)

Eran

Arik Fraimovich

Oct 25, 2010, 6:24:08 AM
to iltec...@googlegroups.com
When I saw Eran's original message I knew that Ori's reply would soon follow.

I know that Ori's reply is based on his personal experience and not just FUD, but I think that things have changed since his experience with Amazon's cloud. Still, the cloud isn't a silver bullet and you have to plan carefully when migrating to it. And if you're the size of Netflix, you have to consider the possibility that the cloud will die completely and work out how you would deal with it.

And for a company of Netflix's size the question of support is irrelevant -- I'm sure Netflix's CTO has Werner Vogels's number... :-)

-- 
Arik Fraimovich

Ori Lahav

Oct 25, 2010, 7:12:12 AM
to iltec...@googlegroups.com
I just looove this discussion :)

I agree that the Netflix case is not relevant for all of you (start-ups), and that this document alone is worth more to Amazon than what Netflix is going to pay, so they probably discounted tons of the fees.

Let's remember they have only moved their data over there and are not yet serving it from there; they have a very long way to go until they have a working setup on the cloud.

What matters to me is not Netflix making wrong decisions.
I think of you start-ups that don't understand that growing on the cloud is mostly inefficient (at least with Amazon's price list).
Every application has its own growth vs. cost graph. Some are flatter, some are steeper, but with cloud mathematics they are pretty much linear: the more you grow, the more you pay, proportionally.
When developing your own datacenters (and it's pure engineering), you reach efficiencies that make these graphs sub-linear. It is less efficient when you are small but gets much more efficient as you grow.
So, if you are a start-up which is planning to grow (if you are not, then you are not interesting), it is OK to start in the cloud, but:
1. Be aware that at some point in your growth you will be throwing $$$ at the cloud that could be saved by smart DC engineering.
2. Don't get too attached to cloud solutions, as it will be harder to get out when you need to.
3. Gradually start building the knowledge of running datacenters so it will not seem frightening to move out.

Here is the Outbrain story.
We started on the cloud using S3; that's what made the most sense for us. We saw start-ups near us doing the same application get bad buzz for hanging blogs with an unresponsive service, and swore this was not going to happen to us. And then Amazon screwed up with bad availability and service for weeks. Complaints in their user forum were not getting any response (for weeks). We understood that, and also understood that if this is the support culture, then even paying for support would only get us the right to shout at somebody on the phone but would probably not fix issues. So we ejected.

In the meantime we have built a datacenter of a few cabinets in New York. When we decided that we needed higher availability and disaster recovery mechanisms we opened another one in LA, and now, to reduce costs even more, we are opening a 3rd one in Chicago. (Availability zones are covered for us by the way we architect DCs.)
I can talk a lot about building datacenters efficiently; that's one of the reasons we created ILTechTalks :) But the truth is that it is not that frightening, and at the bottom line, for a company that succeeds and grows, it is by far more efficient than the cloud. You will be amazed to hear the cost comparison. I had to recheck my calculations a few times as I did not believe them myself, but they are apparently true.

Bottom line: don't throw away your investors' $$$, and start being smarter and more efficient.


Lior Sion

Oct 25, 2010, 11:02:50 AM
to iltec...@googlegroups.com
Ori,

This sounds very interesting. Are you planning to have a talk about your DC availability?

What interests me, at least, are the details about the technology used for live failover, and the costs of those solutions.

About the original text, btw, I think that what you said about Netflix pretty much sums it up: I'm guessing they are using Amazon mainly for data (which is considered much more stable these days) and will take time to move the rest. Also, they probably got wonderful prices. At large scale, there's a lot to be said for putting your talent and attention where you need them (for Netflix it's not the cloud technology) and not worrying about the rest. The balance is sometimes more delicate than price and linear growth, IMO.
--
thanks,
Lior Sion

Skype: sionlior | GTalk: lior.sion

Ori Lahav

Oct 25, 2010, 11:22:42 AM
to iltec...@googlegroups.com
Hi Lior
I have my talk about datacenters, which is pretty general, but if someone wants me to dive into the details of one aspect or another I'm glad to do so. I did it at PicScout per their specific requests and it went pretty well.

Feel free to use me for this.

Ori

eran.s...@gmail.com

Oct 25, 2010, 2:18:29 PM
to iltec...@googlegroups.com
I don't have a lot of time just now to reply to this (but I will). However, from a startup perspective, Zynga, Playfish, Foursquare, SimpleGeo and others are startups using AWS, and the bottom-line price is not always the right thing to look at.

If it slows down getting your product to market because 10-20% of your workforce is doing things it shouldn't be doing instead of handling what you need done, then you are wasting investors' money :-)

but that's just a tease for now.

I will get back tomorrow with a proper reply.


yiftach shoolman

Oct 25, 2010, 3:12:12 PM
to iltec...@googlegroups.com
I totally agree with Eran. Ori, I think you missed the automation that you can easily do in the cloud and that saves a lot of $$$.
To share some knowledge, Zynga "added 500 web in 30 minutes" to deal with a 50% spike in FarmVille. I don't know of any other way to do that with traditional hosting.
Zynga, BTW, also manages its own DC (I guess hosted somewhere) --> using this hybrid approach to deal with spikes and DR can be very lucrative.

Yiftach
--
Yiftach Shoolman
+972-54-7634621

Ori Lahav

Oct 25, 2010, 3:45:12 PM
to iltec...@googlegroups.com
Great example, Yiftach.
Each application has its own behavior. Some are flat; these are not interesting. Some have expected growth (that's the case for Outbrain) and some are spiky (maybe like Zynga). One thing that usually happens as you grow is that the spikes become less and less significant and get absorbed by your reasonable headroom.
Pay attention to the example: they use the cloud when they need fast growth and a DC for steady load. It's OK to use the cloud when you need it, but paying 7X for machines over the long term... makes no sense.

Eran, to your point about people's effort: I measured that too.
How much of the Ops team's time do you think is devoted to hardware and DC infrastructure, and how much to configuring and maintaining the software infrastructure that you will also need in the cloud (DB, cache, application servers, queues, monitoring, etc.)? You will be amazed to find that in a DC you don't spend that much time on the physical aspects, and you really gain on hardware price (3X to 7X) if you do it efficiently. The myth that in the cloud you don't need sysadmins is... a myth. It only wastes your developers' time on the same job the sysadmins usually do, as opposed to developing the real company product.

Here is a question to the forum, and I ask each one of you who replies to this thread to answer it:

Why are you developing your code in-house? (I assume you are)
There are a lot of coding contractors that can do the job as well as your development team.
Just hire a good product manager to spec the product and an outsourcing company will do it for you. Isn't it better?

Arik Fraimovich

Oct 25, 2010, 4:13:24 PM
to iltec...@googlegroups.com
> Why are you developing your code in-house? (I assume you are)
> There are a lot of coding contractors that can do the job as well as your development team.
> Just hire a good product manager to spec the product and an outsourcing company will do it for you. Isn't it better?

I have a counter example: do you develop *everything* in house? 

-- 
Arik Fraimovich

Arik Fraimovich

Oct 25, 2010, 4:13:45 PM
to iltec...@googlegroups.com
Of course I meant counter question...

-- 
Arik Fraimovich

Ori Lahav

Oct 25, 2010, 4:17:16 PM
to iltec...@googlegroups.com
Fair question. Forum?

Yaniv Golan

Oct 25, 2010, 5:41:34 PM
to iltec...@googlegroups.com
My answer to Arik's question is - we do what it takes. My answer to DC vs Cloud - it depends. 

Personally I try to shy away from Big Universal Truths. Startups are given money by investors to generate a sizable return, not to prove a point.

Sometimes it makes sense to develop externally. Sometimes it doesn't. Personally I lean toward developing as much as possible in-house, but it's not a Big Universal Truth - I re-examine this with each and every project. 

Sometimes it makes more sense to host on the cloud and sometimes hosting your own DC is the best answer. 

Instead of taking a position one way or another, I think it is more useful to explore these "sometimes" - 

Based on our accumulated experience, when does it make sense to develop internally or to develop externally? 

My short answer - if it's core, or if it requires agile iterative development - in house; otherwise, outsourcing makes sense. If choosing to outsource, it needs to be a well-defined, relatively short task - if not, make sure you have the extra managerial resources to manage it long term. My test for core or not - if the code gets deleted and the folks who wrote it disappear from the face of the earth - can I recover in a few weeks?

What's your answer?

Based on our accumulated experience, when does it make sense to host on the cloud and when does it make sense to manage your own DC? Or, going into a more granular level, which parts of your architecture does it make sense to have on the cloud and in which part of your company lifecycle does this make sense? Also, which cloud service do you choose and why?

There are so many possible combinations, each one with its own pros and cons. 

My own short answer - in most cases I'd start on the cloud, but only if I am sure my app is architected in a way that will allow me to migrate to a DC if I need to. If I choose AppEngine it'll probably be harder to migrate away; if I choose AWS it'll probably be easier.

Data is actually more complex IMO - putting my data on the cloud usually makes tons of sense, but if I will end up having to move around huge chunks of data either for processing or even as a one time migration off the cloud, it might not make sense. 

When traffic goes up (yay), before deciding to migrate off the cloud, I'd look at a few other solutions first - Varnish-like services can reduce the load on your app server considerably and improve response time. Using a CDN can do wonders to your performance AND availability, even for non-static data. If your architecture is solid, there is a lot you can do in a DC to keep it efficient, OR in a cloud to keep it cheap.

While, like most engineering-related tasks, building and running an efficient DC might be a fascinating intellectual challenge, usually in the wider context of the company and the project this is simply not interesting - the hosting platform, be it DC or cloud, just needs to work and not get in the way - operationally and financially. 

Disclosure: Eran and I work for the same company and in the same team. The architecture we built was able to cope with 10x and 100x spikes in traffic. We're running in our own DC with 4-6 web servers and a single fairly standard DB machine - though we're getting lots of help from friends like memcached, Varnish and Akamai. "Our own DC" made sense to us back in 2005 when we started building this - if we were to do it again today, we most likely would have chosen AWS.

--
Yaniv Golan

Ori Lahav

Oct 25, 2010, 6:55:56 PM
to iltec...@googlegroups.com
Yaniv,
Great share.
You can definitely guess that I am taking my point to the extreme for the sake of argument, and I pretty much agree that for every company and product need there are options that suit it best.
Choosing the platform to run on is a decision you make at some point based on your knowledge and personal experience.

What really bothers me is the lack of knowledge about building and maintaining datacenters. It is perceived as something very costly, high maintenance and not scalable. The fear factor plays a big part in the psychology of the decision making and makes the decision irrational. Cloud vendors are riding this wave of fear and lack of knowledge to push their pitch, and the rest is known.

I've seen great architectures that make lots of sense to host in the cloud. However, there are many others that do not.
What I'm trying to encourage and promote is the knowledge of what it takes to run a datacenter efficiently, where you can be (a lot) more efficient than the cloud, and how real people can do it. It is not only Google, Amazon, eBay, Facebook (and most of the internet companies) that can do it.

At the end of the day, when you need to decide between cloud and DC (a question you should ask yourself from time to time even if you have already made a decision), make sure you hear people from both sides (not only AWS salesmen). Hear from companies with an architecture similar to yours and from those who have already taken major scaling steps (because this is where you want to be).

Note about the development:
My perception is that companies have a mission. Each of your employees comes to work to promote the company mission. You hire developers with a skill set that can promote the company mission. Everybody in your organization should be aligned with this mission.

So... how come your servers are "general purpose"? Are they the best fit to complete the company mission at minimum cost? Why is their IO bus currently loaded by some other user's video rendering task? When Werner Vogels decides on a change, does he take your mission or your users into consideration? The support engineer on the other side of the line: does he care that your user's blog is hanging?

I know these are loftier, more spiritual words, but they are worth mentioning. I believe in companies that stick to their missions.

Ori

Ori Lahav

Oct 28, 2010, 6:03:07 AM
to ILTechTalks
Yiftach wrote a question for this thread on another thread so I'm
putting it below:

Returning to Cloud Vs. DC

Can someone please explain to me the magic here, i.e. why "cost is
linear with the growth" applies only to the cloud? Come on, when you
need to grow, you need to buy more servers, storage and licenses
(I guess some of this team do pay for licenses), am I wrong? So unlike
the cloud, where you can rent just an instance (the AWS case) or a GB of
storage, here you have to buy a server, which holds multiple instances,
or a disk, which comes with a lot of GB. That means you don't grow
linearly but rather in "big steps", which are usually more (even
much more) than your current need...
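
The "big steps" point can be sketched in a few lines. This is only an illustration: the function names and the figure of 8 instances per server are hypothetical, chosen to contrast renting capacity one unit at a time with buying it in whole-server increments.

```python
def rented_capacity(demand: int) -> int:
    """Cloud-style model: pay for exactly the instances you need."""
    return demand


def purchased_capacity(demand: int, instances_per_server: int) -> int:
    """DC-style model: capacity is bought in whole servers.

    `instances_per_server` is a hypothetical figure; real servers vary.
    """
    if instances_per_server < 1:
        raise ValueError("instances_per_server must be at least 1")
    servers = -(-demand // instances_per_server)  # integer ceiling division
    return servers * instances_per_server


def idle_capacity(demand: int, instances_per_server: int) -> int:
    """Capacity paid for but not currently needed under the DC model."""
    return purchased_capacity(demand, instances_per_server) - demand
```

With 8 instances per server, a demand of 9 instances means buying 2 servers (16 instances), so 7 sit idle until demand catches up. That idle gap is the step Yiftach is describing.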

Yiftach

Ori Lahav

Oct 28, 2010, 6:25:49 AM
to ILTechTalks
Yiftach,
That's a good question whose answer is, again, very dependent on your
application requirements.
In general, when you start building your app you need the hardware to
be redundant and safe, so that if your only server loses a disk, for
example, you will not lose the entire service. So you buy a powerful
server with disks in RAID 1+0 and you make sure you can grow as much
as you can on this server.
However, when you start growing and you now have 20 DB slaves, then
losing one of them is not such a big deal, and the system can absorb
this failure and keep running. So you lower the requirements for each
machine, you start striping disks (RAID 0), and now you use much
cheaper machines.
To put it in formulas: if on day 1 your DB layer costs X for 1 server,
on day 650 it can cost 20*(0.6X) and not 20X. This is what I mean by
sub-linear growth.
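
The formula above can be written out as a small sketch. The 0.6 factor and the 20-server example come from the paragraph above; the function names and the point at which cheaper machines become acceptable (`cheap_from`) are assumptions made only for illustration.

```python
def linear_cost(servers: int, unit_cost: float) -> float:
    """Linear model: every server costs the full price X."""
    return servers * unit_cost


def dc_cost(servers: int, unit_cost: float,
            cheap_factor: float = 0.6, cheap_from: int = 2) -> float:
    """Sub-linear model from the post.

    A lone server must be fully redundant, so it costs the full X.
    Once there are at least `cheap_from` servers (a hypothetical
    threshold), losing one is tolerable and each machine can be a
    cheaper build costing `cheap_factor` * X.
    """
    if servers < cheap_from:
        return servers * unit_cost
    return servers * unit_cost * cheap_factor
```

With X normalized to 1, `linear_cost(20, 1.0)` gives 20 while `dc_cost(20, 1.0)` gives about 12, matching the 20*(0.6X) vs 20X comparison.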

Keep that in mind with everything you do in the datacenter: server
costs, network gear costs, cabinet space efficiency, efficient power
usage, etc. You save hundreds of thousands of dollars per year (I'm
not kidding) that you can invest in your business development or
product development.

One might say that you are already spending this on Ops team costs,
but from our experience the effort you put into this, when calculated
over time, is not a big deal compared to the savings.

Regarding step functions: you are right, the traditional way of DC
management is indeed very much based on these step functions, but as we
knew this at Outbrain, we managed to architect the infrastructure
in such a way that we minimize the step functions a
lot. More on that -> in my Tech Talk :)

Eishay Smith

Oct 28, 2010, 2:08:01 PM
to iltec...@googlegroups.com
Having control over the machines enables you to optimize your setup.
I have many types of machines, such as lots of RAM & low-power CPU, strong CPU & low RAM, and ones with both SSD to put MySQL files on and magnetic disk for large local backups/executables. It lets me fit way more application requirements in less space and cost, and have the monitoring system provide me reports on how I can optimize more. I would waste a lot if I used three off-the-shelf types of machines.
Again, we outsource the expertise of actually plugging the hardware into place; we trust them to do a better job as it's their core business (and we verify it).

--es

Lior Sion

Nov 9, 2010, 6:04:12 PM
to iltec...@googlegroups.com

Yaron Galai

Nov 24, 2010, 1:24:23 PM
to ILTechTalks
Another interesting post about Netflix moving to AWS:
http://www.readwriteweb.com/cloud/2010/11/why-netflix-switched-its-api-a.php



On Nov 9, 6:04 pm, Lior Sion <lior.s...@gmail.com> wrote:
> Thought you might want to read some more notes on the subject:
>
> http://highscalability.com/blog/2010/10/22/paper-netflixs-transition-...
>
> On Sun, Oct 24, 2010 at 2:45 PM, eran.sand...@gmail.com <
>
> eran.sand...@gmail.com> wrote:
>
> >https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbn...

Eishay

Dec 16, 2010, 6:47:58 PM
to ILTechTalks
Interesting sequel:
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

"One of the first systems our engineers built in AWS is called the
Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and
services within our architecture. If we aren’t constantly testing our
ability to succeed despite failure, then it isn’t likely to work when
it matters most – in the event of an unexpected outage."

Healthy approach in any system that wants to ensure availability.
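
The idea in the quote can be sketched as a tiny simulation. This is not Netflix's implementation and it calls no AWS API; the `Service` class, `chaos_round` function and the minimum-healthy threshold are all hypothetical names used to show the principle of killing a random instance and then checking that the service still has enough capacity.

```python
import random


class Service:
    """A toy service that stays available while enough instances are healthy."""

    def __init__(self, instance_ids, min_healthy):
        self.healthy = set(instance_ids)
        self.min_healthy = min_healthy

    def kill(self, instance_id):
        # Simulate an instance failure; unknown ids are ignored.
        self.healthy.discard(instance_id)

    def is_available(self):
        return len(self.healthy) >= self.min_healthy


def chaos_round(service, rng):
    """Kill one randomly chosen healthy instance and return its id.

    Returns None if there is nothing left to kill.
    """
    if not service.healthy:
        return None
    victim = rng.choice(sorted(service.healthy))
    service.kill(victim)
    return victim
```

A drill would call `chaos_round(service, random.Random())` on a schedule and alert whenever `service.is_available()` turns false; passing a seeded `random.Random` makes a drill reproducible.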

On Nov 24, 10:24 am, Yaron Galai <ga...@outbrain.com> wrote:
> Another interesting post about Netflix moving to AWS:http://www.readwriteweb.com/cloud/2010/11/why-netflix-switched-its-ap...

Lior Sion

Feb 23, 2011, 1:17:20 AM
to iltec...@googlegroups.com

Eishay Smith

Jul 20, 2011, 5:17:07 PM
to ILTechTalks
The sequel: "The Netflix Simian Army" http://www.businessinsider.com/netflix-unveils-its-monkey-army-2011-7
Is it as simple to create an automated "Simian Army" in a standard DC?
I.e. hide/cripple/kill a machine, play with its configs (network/other) and do real-life drills going from local to mass outages, as when a full region goes down.

The challenge is to enable fast recovery if the drill goes bad, without having a person involved in calling for the drill to start and end.

Actually, I guess it's not that hard to do, but being in the 'cloud' makes you think about it a bit more, and it's easy to actually implement.

$ pwd
$ /netflix/eishay