Cockcroft describes his role as Cloud Architect at Netflix not in terms of controlling the architecture, but as discovering and formalizing the architecture that emerged as the Netflix engineers built it. The Netflix development team established several best practices for designing and implementing a microservices architecture.
"This blog post may reference products that are no longer available and/or no longer supported. For the most current information about available F5 NGINX products and solutions, explore our NGINX product family. NGINX is now part of F5. All previous NGINX.com links will redirect to similar NGINX content on F5.com."
With over 1B active users, Facebook has one of the largest data warehouses in the world, storing more than 300 petabytes. The data is used for a wide range of applications, from traditional batch processing to graph analytics, machine learning, and real-time interactive analytics.
Airbnb supports over 100M users browsing over 2M listings, and their ability to intelligently make new travel suggestions to those users is critical to their growth. Their team runs an amazing blog, AirbnbEng, where they wrote about Data Infrastructure at Airbnb last year.
Keen.io is an event data platform that my team built. It provides big data infrastructure as a service to thousands of companies. With APIs for streaming, storing, querying, and presenting event data, we make it relatively easy for any developer to run a world-class event data architecture, without having to staff a huge team and build a bunch of infrastructure. Our customers capture billions of events and query trillions of data points daily.
For the HTTP Stream API, load balancers can handle billions of incoming POST requests a week as events stream in from apps, websites, connected devices, servers, billing systems, etc. Events are validated, queued, and optionally enriched with additional metadata like IP-to-geo lookups. This all happens within seconds.
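As a rough illustration of that ingestion path, here is a minimal sketch of a client streaming a single event over HTTP. The endpoint URL, project ID, key, and event fields are hypothetical placeholders for illustration, not Keen's actual API.

```python
# Minimal sketch: streaming one event to an HTTP collection endpoint.
# The URL, project ID, and keys below are hypothetical placeholders.
import requests

EVENT_URL = "https://api.example-eventplatform.io/v1/projects/PROJECT_ID/events/pageviews"
WRITE_KEY = "WRITE_KEY"  # write-scoped API key (placeholder)

event = {
    "user_id": "12345",
    "page": "/pricing",
    "referrer": "https://www.google.com",
    # Enrichment such as IP-to-geo lookup would be applied server-side
    # by the platform, not supplied by the client.
}

resp = requests.post(EVENT_URL, json=event,
                     headers={"Authorization": WRITE_KEY}, timeout=5)
resp.raise_for_status()
```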
Alternatively, for teams that already have a Kafka-based event pipeline or systems that are compatible with open-source Kafka connectors, it will be easier to stream data to Keen via our Kafka Inbound Cluster. This data is also processed and stored securely within seconds.
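For the Kafka inbound path, producing events could look something like the following sketch using the kafka-python client. The broker address, topic name, and credentials are assumptions for illustration, not Keen's actual configuration.

```python
# Minimal sketch: producing events to an inbound Kafka topic.
# Broker address, topic name, and credentials are hypothetical placeholders.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="inbound.example-eventplatform.io:9092",
    security_protocol="SASL_SSL",        # assumed: hosted clusters typically require TLS + auth
    sasl_mechanism="PLAIN",
    sasl_plain_username="PROJECT_ID",
    sasl_plain_password="WRITE_KEY",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("pageviews", {"user_id": "12345", "page": "/pricing"})
producer.flush()
```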
After data is processed and enriched by Keen, it can be streamed to any external system in real time via the Kafka Outbound Cluster and a standard Kafka Consumer. This is a great way to build alerting functionality, power event-driven architectures, spin up an integration, or simply back up your data. Examples of popular services that consume from Kafka include ksqlDB, Materialize.io, and any system capable of consuming from Kafka with open-source Kafka connectors.
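A standard Kafka consumer on the outbound cluster might look like this sketch (again with a hypothetical broker, topic, and consumer group), which is the general shape of a simple alerting or backup service:

```python
# Minimal sketch: consuming enriched events from an outbound Kafka topic
# with a standard Kafka consumer. Broker, topic, and group are placeholders.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "enriched-pageviews",
    bootstrap_servers="outbound.example-eventplatform.io:9092",
    group_id="alerting-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # e.g. trigger an alert, forward to another system, or archive for backup
    if event.get("country") == "ZZ":
        print("suspicious geo:", event)
```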
Once safely stored in Apache Cassandra, event data is available for querying via a REST API. Our architecture (built on technologies like Apache Storm, DynamoDB, Redis, and AWS Lambda) supports a range of querying needs, from real-time exploration of the raw incoming data to cached queries that can be loaded instantly in applications and customer-facing reports.
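For example, a count query against such a REST API could look roughly like this sketch; the endpoint path, parameters, and read key are illustrative assumptions rather than the actual query API.

```python
# Minimal sketch: running a grouped count query against a REST query API.
# The endpoint path, parameters, and key are hypothetical placeholders.
import requests

QUERY_URL = "https://api.example-eventplatform.io/v1/projects/PROJECT_ID/queries/count"

params = {
    "event_collection": "pageviews",
    "timeframe": "this_7_days",
    "group_by": "page",
}

resp = requests.get(QUERY_URL, params=params,
                    headers={"Authorization": "READ_KEY"}, timeout=30)
resp.raise_for_status()
print(resp.json())
```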
Pinterest serves over 100M MAU generating more than 10B pageviews per month. As of 2015, they had scaled their data team to over 250 engineers. Their infrastructure relies heavily on Apache Kafka, Storm, Hadoop, HBase, and Redshift.
Special thanks to the authors and architects of the posts mentioned above: Steven Wu at Netflix, Martin Traverso at Facebook Presto, AirbnbEng, Pinterest Engineering, and Ed Solovey at Crashlytics Answers.
I've now worked at Netflix for over three years. Time flies! I previously wrote about Netflix in 2015 and 2016, and if you are interested in what it's like to work here, I already covered much in those posts. As before, no one at Netflix has asked me to write this, and this is my personal blog and not a company post.
When I joined Netflix in April 2014, we had over 40 million subscribers in 41 countries. We are now in 190 countries and just crossed 100 million subscribers! It's been thrilling to be part of this and help Netflix scale.
You might imagine that at some point we had a major scaling crisis, where it looked like we'd fail due to an architectural bottleneck, and engineers worked long nights and weekends to save Netflix from certain disaster. That'd make a great story, but it didn't happen. We're on the EC2 cloud, which has great scalability, and our own cloud architecture of microservices is also designed for scalability. During this time we did do plenty of hard work, rolling out new technologies and major microservice versions, and fixed many problems big and small. But there was no single crisis point. Instead, it has been a process of continual improvements, by many engineers across the company.
Most of my day is a 50/50 mixture of proactive projects, and reactive performance analysis. The proactive projects usually take weeks or months, and are where I'm developing a new technology or helping other teams with performance analysis or evaluations. Most of these projects aren't public yet, and some of them involve working with other companies on unreleased products. My work with Linux is different in that it is mostly public, and includes my perf-tools and bcc/eBPF tracing tools.
It's a good balance. Too much reactive work and you don't have time to build better tools and general fireproofing. Too much proactive work and you can become disconnected from the current company pain points, and start building solutions to the problems of yesteryear.
About one hour on average each day is meetings. Some of these are regular meetings: we have a team meeting once every two weeks where everyone discusses what they are working on, and I have a one-on-one with my manager once every two weeks. At a lower frequency, I have scheduled meetings with my manager's manager, and their manager. All these manager meetings keep me informed of the current company needs, and help connect me to the right people and projects at Netflix.
Then there's some random events that happen during the year. We have offsites, where we plan what to work on each quarter, and team building events. There's also unofficial recreational groups at Netflix, including movie clubs (for good movies, and for bad ones), a karaoke group (which includes some Hamilton fans), and various sports teams. I'm on the Netflix cricket team (if you're at Netflix and didn't know we exist, join the cricket chatroom). I also usually speak at some conferences each year.
The biggest difference I've found working here is still the culture. We are empowered to do the right thing, and believe in "freedom and responsibility". This is documented in the Netflix culture deck, and after three years I still find it true.
Before joining Netflix, you're told to read it and see if this company is right for you. Then while working here, staff cite the culture deck in meetings for decision making advice. It's not nice-sounding values that are printed in the lobby and people forget about. It's an ongoing influence in the day to day running of Netflix. Having it online also beats learning the culture through word of mouth or trial and error.
I know people in tech who are burned out but stay in lousy jobs, assuming every workplace is just as terrible. Jobs where there is little to no freedom, no responsibility or accountability, and where dumb office politics is the norm. I wish everyone could have a chance to work at a company like Netflix. Little to no bureaucracy. You can focus on engineering and getting stuff done, with awesome staff who will help you.
I've noticed a widespread cynicism about successful companies, especially US corporates, where it's assumed that they must be doing something shady to be really competitive. Like selling customer data, or making it difficult to terminate membership. It's been amazing and inspiring to see how Netflix operates, contrary to this belief. We don't do anything shady, and we're proud of that. We're an honest company.
SRE: Last year I talked about my site reliability engineering (SRE) work. Since then, our CORE SRE team has grown and I'm no longer needed on the on-call rotation, so I'm back to focusing on performance work. My 18 months of SRE on-call provided many memories and valuable experiences, as well as a deeper understanding of SRE. I talked about what I learned in my SREcon 2016 keynote, and how the aims and tools differ between performance engineering and SRE performance analysis.
I miss the thrill of being paged and knowing I'm going to work with other awesome engineers and fix something important in the next five minutes... or at least try to! If I miss this thrill too much, I can always jump into the CORE chatroom and help with production issues when they happen.
Linux: I've been contributing to profilers and tracers, and it's been satisfying to help fix these areas that I really care about. In the last three years I developed the ftrace-based perf-tools and used them to solve many problems, which I wrote about on lwn.net and spoke about at LISA 2014. I also worked with Alexei Starovoitov (now at Facebook) on enhanced BPF for tracing, and developed many bcc tools that use BPF. I spoke about these at Facebook's Performance@Scale event and other conferences. We're rolling out newer kernels now, and it's pretty exciting to use my bcc tools in production.
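For a sense of what a bcc/eBPF tracing tool looks like, here is a minimal sketch in the style of bcc's hello-world example (not one of the tools mentioned above): it attaches a kprobe to the clone() syscall and prints a line each time a process calls it.

```python
#!/usr/bin/env python
# Minimal bcc/eBPF tracing sketch (in the style of bcc's hello_world example):
# attach a kprobe to the clone() syscall and print a line per call.
# Requires the bcc package and a BPF-capable kernel; run as root.
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()  # streams lines from the kernel trace pipe
```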
PMCs:When I considered joining Netflix three years ago, I had two technical concerns: 1. No advanced Linux tracer, and 2. No PMC access in EC2. How am I going to do advanced analysis without these? The more I thought about it, the more I became interested in the challenge, which would be the biggest of my career. Three years later, I've helped solve both of these (as well as devise some workarounds along the way). Now we have Linux 4.9 eBPF and The PMCs of EC2. Thanks to everyone who helped.
Team Changes: Our team has grown a little, and we have a new manager, Ed Hunter, who I worked for before at Sun Microsystems. It's great to be working with Ed again. Our prior manager, Coburn, was promoted.