Shikhare: GraphQL is an alternative communication protocol for APIs between the client and the server. In GraphQL, we have a schema that describes the data graph. This data is all the data that you can fetch from the schema. The schema is formed with types, which contain fields. These fields reference other types. This is what a schema looks like in the GraphQL schema definition language. We also have a set of root types that we call Query, Mutation, and Subscription. These are the entry points to the graph. With these entry points, we can construct a query. This query is a tree-based structure. The query can be as big as you want, or as small as you want. In this particular query, we want a list of movies. For each movie, we want a title. Now we can take this query, and it's typically packaged into an HTTP POST request and sent to the GraphQL server. The GraphQL server processes the query and then returns a response, typically a JSON object. Very simply put, GraphQL gives you the ability to fetch exactly the data you want from the server, not more, not less. That's it. That's GraphQL in a nutshell.
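As a rough illustration of the flow just described, here is a minimal Python sketch. It packages the movies query into an HTTP POST request without sending it, and uses a toy `select` function as a stand-in for server-side execution to show that only the requested fields come back. The endpoint URL, the `build_request` and `select` helpers, and the sample data are assumptions made for illustration, not anything shown in the talk.

```python
import json
import urllib.request

# The query from the talk: a list of movies, and a title for each movie.
QUERY = """
query {
  movies {
    title
  }
}
"""

# Hypothetical endpoint, used only for illustration.
GRAPHQL_URL = "https://example.com/graphql"


def build_request(query, url=GRAPHQL_URL):
    """Package a GraphQL query into an HTTP POST request (built, not sent)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def select(data, selection):
    """Toy stand-in for query execution: keep only the requested fields.

    `selection` maps each field name to a nested selection, or to None
    for a leaf field.
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    return {
        field: select(data[field], sub) if sub else data[field]
        for field, sub in selection.items()
    }


# Illustrative server-side data; the client asks only for titles.
SERVER_DATA = {
    "movies": [
        {"title": "Example Movie A", "year": 2020},
        {"title": "Example Movie B", "year": 2021},
    ]
}

request = build_request(QUERY)
response = {"data": select(SERVER_DATA, {"movies": {"title": None}})}
```

The `year` field exists on the server but is absent from `response`, because the query never asked for it.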
What's all the hype about? Why is everyone so excited about GraphQL? Let's talk about some of the benefits. One of the big benefits of GraphQL is to minimize roundtrips with aggregation. Since the query can be as big or as small as we want, we can fetch all the needed data in a single roundtrip. We can take a look at this quick example. These are the movie recommendations for me on the Netflix UI. You can imagine, we might have two APIs to support this UI. We have a movie recommendation API, and for each of the movie IDs that the recommendation API recommends, we can have an image API. Let's say they are deployed in U.S.-West in California, and I'm visiting my parents in Singapore, which is approximately 8600 miles from the server. When I open the Netflix app, I have to make two sequential requests to be able to render this UI. Now there are many ways to solve this problem. You could imagine, we can use something like GraphQL to aggregate so that the client can write a single query for the recommended movies and an image for each of them. Maybe in the future, we add badges. Then we add a badge API. Then we just update the query. It's still a single query. GraphQL is not the only way to solve this problem, this can be a REST API as well. Then, you get this complex BFF architecture, and the REST API is not reusable. GraphQL provides a more reusable pattern for this aggregation or orchestration. Hence, GraphQL is a really good fit for consumer applications like Netflix.
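The roundtrip argument can be sketched in a few lines of Python. Everything here is a hypothetical in-memory stand-in: `recommendation_api`, `image_api`, `aggregated_api`, and the `Client` counter are names invented for illustration, and the counter models only client-to-server hops, not real network latency.

```python
# Hypothetical in-memory stand-ins for the two backend APIs.
RECOMMENDATIONS = {"user-1": ["m1", "m2", "m3"]}
IMAGES = {"m1": "m1.jpg", "m2": "m2.jpg", "m3": "m3.jpg"}


class Client:
    """Counts long-distance roundtrips between the client and the servers."""

    def __init__(self):
        self.roundtrips = 0

    def call(self, handler, *args):
        self.roundtrips += 1  # every client-to-server call is one roundtrip
        return handler(*args)


def recommendation_api(user_id):
    return RECOMMENDATIONS[user_id]


def image_api(movie_ids):
    return {movie_id: IMAGES[movie_id] for movie_id in movie_ids}


def aggregated_api(user_id):
    """Server-side aggregation: both backend calls happen close to the data."""
    movie_ids = recommendation_api(user_id)
    images = image_api(movie_ids)
    return [{"id": m, "image": images[m]} for m in movie_ids]


def render_without_aggregation(client, user_id):
    # The second request cannot start until the first one has returned.
    movie_ids = client.call(recommendation_api, user_id)
    images = client.call(image_api, movie_ids)
    return [{"id": m, "image": images[m]} for m in movie_ids]


def render_with_aggregation(client, user_id):
    # One request from the client; the fan-out happens on the server side.
    return client.call(aggregated_api, user_id)


sequential_client = Client()
aggregated_client = Client()
rows_sequential = render_without_aggregation(sequential_client, "user-1")
rows_aggregated = render_with_aggregation(aggregated_client, "user-1")
```

Both paths render the same rows, but the sequential client pays for two long-distance roundtrips while the aggregated client pays for one. Adding a badge API would add a third sequential hop in the first design and none in the second.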
This is a sample GraphQL schema. It might look similar to model classes in Java, because we can actually generate model classes, both on the server side and the client side. This makes it much easier to write the code that sends data to our API and back. There's also a clear indication of what's nullable and what's not. It's built into the language, so you can add an exclamation point to mark a field non-nullable. This reduces the churn caused by bugs in loosely typed APIs. It also forces collaboration between the client and the server teams. The strong typing is not just great as a contract. You can also build developer tools and power them using it. One other big benefit of GraphQL is that it shines when it's implemented as a single graph for your organization. First, it becomes a visual aid for all the data in your organization. Then it also connects the dots between all the different domains. Then you can write a query that crosses these domains. This is a really powerful paradigm.
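To make the nullability point concrete, here is a small Python sketch of what a generated model class might look like. The `Movie` type, its fields, and the `validate` helper are hypothetical examples, not the schema from the slides; the schema text is included as a string purely for comparison.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Illustrative schema, shown as a string for reference only.
# The exclamation point marks a field as non-nullable.
MOVIE_SDL = """
type Movie {
  id: ID!
  title: String!
  synopsis: String
}
"""


@dataclass
class Movie:
    """Hand-written stand-in for a model class generated from the schema."""

    id: str
    title: str
    synopsis: Optional[str] = None  # nullable in the schema


# Fields marked with "!" in the schema above.
NON_NULL_FIELDS = {"id", "title"}


def validate(movie):
    """Reject a Movie whose non-nullable fields are missing."""
    for f in fields(movie):
        if f.name in NON_NULL_FIELDS and getattr(movie, f.name) is None:
            raise ValueError(f"Movie.{f.name} is non-nullable but was None")
    return movie


ok = validate(Movie(id="m1", title="Example Movie"))
```

A missing `synopsis` passes validation, while a missing `title` fails early instead of surfacing later as a bug in the UI.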
My name is Tejas Shikhare. I'm a senior software engineer at Netflix. For the past three years, I've been blessed to be part of this amazing team. I've been working on our Federated GraphQL platform. My focus has been on GraphQL and distributed systems. Most recently, I've also been working on developer tools and developer education. I'm a big fan of API stewardship. For our talk, we're going to start by cataloging two of the common architecture patterns for GraphQL in the industry. We're going to dive deep into the Federated architecture, which is what we're doing at Netflix. Then we'll jump into some of the migration challenges, and some strategy recommendations for you.
GraphQL was open sourced by Facebook in 2015. Since then, two core patterns have emerged across the industry. The most common way to implement GraphQL is in the monolithic architecture. Why? Because we want the one graph that we saw earlier. In small companies, the GraphQL service is just part of your core monolith. It is built within it. In some bigger companies you can have the GraphQL layer separate, talking to the monolithic layer, or it could be talking to your microservice architecture. We've also seen that the GraphQL service can be a BFF, a Backend for Frontend, owned by the UI teams. Or it could be a backend service, an aggregation service. Really, it's always owned by an API or a GraphQL team. This is how we started at Netflix too. This is an oversimplified view of the Netflix architecture. After we adopted microservices to scale our teams, we quickly discovered the need for an API layer to bring together and orchestrate everything for the UIs. We created this service called DNA API. Except, GraphQL was not invented yet. Facebook was still working on it internally, and it was not open source. We developed a similar technology called Falcor, which is actually open source. It works much like GraphQL, but it just didn't take off like GraphQL did. Both Falcor and GraphQL actually came from the same problem space. At Facebook, it was the newsfeed team trying to orchestrate data from multiple sources. At Netflix, it was the TV UI team trying to lay out the TV UI.
Then, over the years, this monolith kept growing as we added more features. Along the way, we started seeing some problems. First, for every new feature, we needed a code change both in the service layer and in the API layer. This was often done by different teams. Because of this, the API team had to become experts in many domains. They were also the first line of support because it's a single runtime and handles all the requests. There were frequent code changes, and as we added more backend services, we needed to connect to them, so more dependencies. This resulted in slow build times. Oftentimes, when you have a single runtime, a memory leak in one area could cause problems in completely unrelated areas. We saw these cascading failures. These are some common problems of a monolithic architecture. This is what we saw with the API layer. To fix this, you can imagine, let's say we have this API. It's owned by the same API team, but aggregates across many domains. What if we could still have this one graph, but then split the implementation of all of these subgraphs across different teams?
This is where we entered Federated GraphQL. What's the simplest way to explain this concept? Let's say we have this type Movie in the monolithic GraphQL API. It has three different fields, fulfilled by three different services. The monolith API team would go and implement resolvers to resolve these fields and aggregate data from multiple sources. What if we could break this type apart and let the type be extended across service boundaries, so that each team can implement their own part of the API? That's exactly what Federation is. Using this idea, we envisioned an architecture. There are three main components to this architecture. The first is a DGS, or a Domain Graph Service. It just implements the subgraph that we saw. The Domain Graph Service can be a separate service that calls into the microservice, or it could be the microservice itself. All it does is implement the GraphQL API pertaining to that team's subgraph.
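The idea of splitting one type across services can be sketched as follows. This is a toy model in Python, not the DGS framework or any real Federation library: the three service names, their fields, and `resolve_movie` are invented for illustration, and each "service" is just a function backed by a dictionary.

```python
# Each hypothetical Domain Graph Service owns its own slice of the Movie type.
def catalog_dgs(movie_id):
    titles = {"m1": "Example Movie"}
    return {"title": titles[movie_id]}


def ratings_dgs(movie_id):
    ratings = {"m1": 4.5}
    return {"rating": ratings[movie_id]}


def artwork_dgs(movie_id):
    box_art = {"m1": "m1.jpg"}
    return {"boxArt": box_art[movie_id]}


# In the monolith, one team would write all three resolvers.
# Here each team contributes only the fields it owns.
SUBGRAPHS = [catalog_dgs, ratings_dgs, artwork_dgs]


def resolve_movie(movie_id):
    """Assemble one Movie from the fields contributed by every subgraph.

    The movie id acts as the shared key that lets the parts be joined.
    """
    movie = {"id": movie_id}
    for dgs in SUBGRAPHS:
        movie.update(dgs(movie_id))
    return movie


movie = resolve_movie("m1")
```

From the client's point of view there is still a single `Movie` with all its fields; only the ownership of the implementation has moved to the domain teams.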
Next, we have the schema registry. The schema registry is responsible for validating that each of these individual subgraphs is valid, and then merging and composing them into a supergraph, which is then exposed back to the clients by this highly available service, the GraphQL gateway. The clients write queries against the gateway, and the gateway is responsible for breaking these queries apart into subqueries that are sent to the Domain Graph Services. My coworkers Stephen and Jennifer gave an amazing talk at QCon Plus about two years ago, explaining Federation and the architecture in great detail. You can learn about query planning and query execution. I definitely urge you to check that talk out if you haven't already.
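A heavily simplified Python sketch of those two responsibilities follows. The `compose` and `plan` functions, the subgraph names, and the rule that every field has exactly one owner are assumptions chosen to keep the example short; real composition and query planning handle far more cases than this.

```python
from collections import defaultdict


def compose(subgraphs):
    """Toy schema registry: validate subgraphs and merge them into one map.

    `subgraphs` maps a subgraph name to {type name: [field names]}.
    Simplifying assumption: each field is owned by exactly one subgraph.
    """
    owners = {}
    for name, types in subgraphs.items():
        for type_name, field_names in types.items():
            for field in field_names:
                key = (type_name, field)
                if key in owners:
                    raise ValueError(
                        f"{type_name}.{field} is defined by both "
                        f"{owners[key]} and {name}"
                    )
                owners[key] = name
    return owners


def plan(owners, type_name, requested_fields):
    """Toy gateway: split one query's fields into per-subgraph subqueries."""
    subqueries = defaultdict(list)
    for field in requested_fields:
        key = (type_name, field)
        if key not in owners:
            raise KeyError(f"{type_name}.{field} is not in the composed graph")
        subqueries[owners[key]].append(field)
    return dict(subqueries)


# Hypothetical subgraphs, each contributing fields to Movie.
supergraph = compose(
    {
        "catalog": {"Movie": ["title"]},
        "ratings": {"Movie": ["rating"]},
        "artwork": {"Movie": ["boxArt"]},
    }
)
query_plan = plan(supergraph, "Movie", ["title", "boxArt"])
```

Here a client query for `title` and `boxArt` becomes two subqueries, one for the catalog subgraph and one for the artwork subgraph, and the ratings subgraph is never contacted.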
Where are we today? GraphQL is used widely across the company. If you pull out your phones today and open the Netflix app, it's powered by GraphQL. It's using our member and gaming graph. On the production and the studio side, we have a lot of people working on different parts of the production process, such as pre-production, post-production, and on the set, and we build a lot of apps for them. These apps are also powered by GraphQL, through our Studio Graph. Then most recently, we have also started building an internal tools graph, which is for apps that are workforce facing, and we build them with GraphQL as well. We're dealing with multiple dimensions of scale here: over a billion requests per day, tens of thousands of types and fields, and 500-plus active developers. It's been over two years since Jennifer and Stephen presented. We've been operating and scaling Federation. Did we solve all our problems? Not quite. I think using Federation has just introduced some new ones. What I've learned from this experience is that software engineering is largely about understanding the benefits and the tradeoffs, and then applying them to the situation at your company. No technology is a silver bullet. I want to take quite some time to share with you the challenges we are facing with Federation.
In the monolithic GraphQL world, when you have this monolith API layer, only the API team needs to be GraphQL experts. In the Federated world, the domain teams also need to learn GraphQL. The initial barrier to entry is just too high. Imagine one day going to your backend teams, who are implementing their APIs in REST or gRPC. You tell them, start implementing your APIs with GraphQL and make sure they also merge into this unified graph. This is really hard. To address this, we leaned heavily into developer education. We created bootcamps, example code, and lots of documentation for people to get started. Then we also provided first-class Slack support and weekly office hours. I think what really helped with the initial barrier to entry is that we actually embedded with the domain teams. My team knew how to do GraphQL, so we worked with the other teams to help them spin up these services. Then they became the champions of the architecture. Federation sounds cool. You can just decentralize ownership. Actually, driving adoption is pretty hard. Over time you overcome the developer education problems, and the developers start to get the hang of it. Then you start getting subgraphs in your ecosystem. Then more developers come to the party. In the studio ecosystem, we have 159 subgraphs, so that many Domain Graph Services in that ecosystem.