wansava justea gillean


Ling Kliment

Aug 2, 2024, 11:04:45 AM
to fioscarechmo

Every piece of functionality and data is owned by a microservice and there are thousands of microservices. Also, multiple microservices communicate with each other to realize some of the more complex functionalities.

For example, when you open the Netflix application, you see the LOLOMO screen. LOLOMO stands for list-of-list-of-movies, and the screen is built by fetching data from many microservices.

To avoid these issues, Netflix used a single front door for the various APIs. The device makes a call to this front door that performs the fanout to all the different microservices. The front door acts as a gateway and Netflix used Zuul for this purpose.
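The fanout performed by the front door can be sketched in plain Java. This is only an illustration of the pattern, not Zuul's API: the class name FrontDoorFanout, the fanOut method, and the three backend names are hypothetical, and each backend is modeled as a simple Supplier rather than a real network call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch of a "front door" that fans out one device request
// to several backend services and merges the results. Not Zuul's API.
final class FrontDoorFanout {

    // Starts every backend call concurrently, then collects results by name.
    static Map<String, String> fanOut(Map<String, Supplier<String>> backends) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        backends.forEach((name, call) ->
                pending.put(name, CompletableFuture.supplyAsync(call)));

        Map<String, String> merged = new LinkedHashMap<>();
        // join() blocks until each call has finished; order follows the input map.
        pending.forEach((name, future) -> merged.put(name, future.join()));
        return merged;
    }

    public static void main(String[] args) {
        // Assumed backend names, standing in for real microservices.
        Map<String, Supplier<String>> backends = new LinkedHashMap<>();
        backends.put("recommendations", () -> "top-picks");
        backends.put("artwork", () -> "poster.jpg");
        backends.put("watchHistory", () -> "continue-watching");
        System.out.println(fanOut(backends));
    }
}
```

The device makes a single call and the server-side fanout absorbs the many backend calls, which is the point of putting a gateway in front of the APIs.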

The devices people use to access Netflix differ in subtle ways in what they require. While Netflix tried to keep a consistent look and feel for the UI and its behavior on every device, each device still has different memory and network bandwidth limitations and therefore loads data in slightly different ways.

The scripts were written by UI developers since they knew exactly what data they needed to render a particular screen. Once written, the scripts were deployed on an API server and performed the fanout to all the different microservices by calling the appropriate Java client libraries. These client libraries were wrappers for either a gRPC service or a REST client.

On top of RxJava, Netflix created a fault-tolerant library named Hystrix that took care of failover and bulkheading. Even though reactive programming was complicated, it made a lot of sense for the time and the architecture allowed them to serve most of the traffic needs of Netflix.
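To make failover and bulkheading concrete, here is a minimal plain-JDK sketch. It is not the Hystrix API: the class BulkheadSketch and its call method are hypothetical names, and a Semaphore is assumed as the simplest way to cap concurrent calls to a single dependency.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Plain-JDK illustration of bulkheading with a fallback; NOT the Hystrix API.
final class BulkheadSketch<T> {
    private final Semaphore permits;    // caps concurrent calls to one dependency
    private final Supplier<T> fallback; // value served when the call cannot proceed

    BulkheadSketch(int maxConcurrent, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.fallback = fallback;
    }

    T call(Supplier<T> primary) {
        // Bulkhead: reject immediately when the compartment is full, no queueing.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Failover: a failing dependency degrades to the fallback value.
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        BulkheadSketch<String> ratings = new BulkheadSketch<>(2, () -> "no-rating");
        System.out.println(ratings.call(() -> "4.5 stars"));
        System.out.println(ratings.call(() -> { throw new IllegalStateException("down"); }));
    }
}
```

Because each dependency gets its own permit pool, one slow service can exhaust only its own compartment instead of every thread in the process.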

Netflix recently migrated from Java 8 to Java 17. After the migration, they saw about 20% better CPU usage on Java 17 than on Java 8 without any code changes, thanks to improvements in the G1 garbage collector. At the scale of Netflix, a 20% improvement in CPU utilization is a big deal in terms of cost.

Overall, Netflix has around 2800 Java applications that are mostly microservices of varying sizes. Also, they have around 1500 internal libraries. Some of them are actual libraries while many of them are just client libraries sitting in front of a gRPC or REST service.

For the build system, Netflix relies on Gradle. On top of Gradle, they use Nebula, a set of open-source Gradle plugins. The most important aspect of Nebula is how it resolves libraries: it provides version locking, which makes builds reproducible.

Virtual threads allow server-side applications written in a thread-per-request style to scale at optimal hardware utilization. In a thread-per-request style, a request comes in and the server assigns a thread to it. All of the work for the request happens in this thread.
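A minimal sketch of the thread-per-request style on virtual threads follows, assuming Java 21 or later. The class VirtualThreadPerRequest and the handle and serve methods are hypothetical, and the sleep merely stands in for a blocking downstream call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: every simulated request runs on its own virtual thread.
final class VirtualThreadPerRequest {

    // Simulated blocking handler; the sleep stands in for a downstream call.
    static String handle(int requestId) throws InterruptedException {
        Thread.sleep(10);
        return "response-" + requestId;
    }

    static List<String> serve(int requests) {
        List<Future<String>> futures = new ArrayList<>();
        // Creates one new virtual thread per submitted task (Java 21+).
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                final int id = i;
                futures.add(executor.submit(() -> handle(id)));
            }
        } // close() waits for all submitted tasks to finish

        List<String> responses = new ArrayList<>();
        for (Future<String> future : futures) {
            try {
                responses.add(future.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return responses;
    }

    public static void main(String[] args) {
        System.out.println(serve(1_000).size());
    }
}
```

The handler is ordinary blocking code. While a virtual thread sleeps or waits on I/O, the underlying platform thread is free to run other virtual threads, which is why this style can scale without reactive plumbing.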

Netflix uses the latest version of OSS Spring Boot and their goal is to stay as close as possible to the open source community. However, to integrate closely with the Netflix ecosystem and infrastructure, they have also created Spring Boot Netflix which is a bunch of modules built on top of Spring Boot.

All the changes have been made to solve problems from the previous approach. For example, the move to RxJava was to handle fanouts in a better way and the move to GraphQL Federation was to solve the issues of complexity due to RxJava.

Along with these changes, there has also been a parallel evolution in terms of Java language versions from Java 8 to 17 and now 21+. A lot of it has also been prompted by Spring Boot version 3 finally moving beyond Java 8 and forcing the entire ecosystem to upgrade.

Overall, the theme has been towards standardization of the approach in building microservices across the organization. However, considering the constant challenges faced in operating at their scale while staying ahead of the competition, the evolution will continue.


Netflix has always been at the cutting edge of high performance streaming. Curious if other streaming platforms like Disney+ or HBO have made the same R&D investments as Netflix to stay performant. Netflix just seems so much further ahead.


However, calling so many services from your device (such as the television) or mobile app is typically inefficient. Making 10 network calls doesn’t scale and results in a poor customer experience. Many streaming apps suffer from such performance issues.

However, things got complicated quickly because of fault tolerance. When dealing with multiple services, one of them may fail or not respond quickly enough, and then you have to clean up threads and make sure everything still works properly.

UI developers had to create all the mini backends and they didn’t like working in the Groovy Java space with RxJava. It’s not the primary language they use on a daily basis, which makes things difficult.

With GraphQL, the client has to be explicit about the field selection. You can’t just ask for shows and get all the data from shows. Instead, you have to specifically mention that you want to get the title of the show and the score of various reviews. If you don’t ask for a field, you won’t get the field.

While it’s more work for the client to specify the query in GraphQL, it solves the whole problem around over-fetching where you get a lot more data than you might actually need. This paves the way to create one API that can serve all the different UIs.
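The effect of explicit field selection can be illustrated with a toy Java sketch. This is not a GraphQL engine: the class FieldSelectionSketch, the select method, and the sample show record are all hypothetical, and a flat Map stands in for a real schema type.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of explicit field selection; not a GraphQL engine.
final class FieldSelectionSketch {

    // Returns only the fields the client asked for, in the order requested.
    static Map<String, Object> select(Map<String, Object> source, List<String> requested) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : requested) {
            if (!source.containsKey(field)) {
                throw new IllegalArgumentException("Unknown field: " + field);
            }
            result.put(field, source.get(field));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical show record held by the backend.
        Map<String, Object> show = new LinkedHashMap<>();
        show.put("title", "Example Show");
        show.put("reviewScore", 87);
        show.put("artworkUrl", "https://example.com/art.jpg");
        show.put("synopsis", "A long description this screen never renders.");

        // Similar in spirit to the query: { shows { title reviewScore } }
        System.out.println(select(show, List.of("title", "reviewScore")));
    }
}
```

The artwork URL and synopsis never leave the server because the client did not ask for them, so a constrained device and a richer one can share the same API and each pay only for what it renders.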

DGS is an in-house framework developed by Netflix to build GraphQL services. When they started moving to GraphQL and GraphQL Federation, there wasn’t any Java framework that was mature enough to use at the Netflix scale. Therefore, they built on top of the low-level GraphQL Java framework and augmented it with features like code generation for schema types and support for federation.

While there are multiple DGSs, there’s just one big GraphQL schema from the perspective of a device such as the TV. This schema contains all the possible data that can be rendered. The device doesn’t need to worry about all the different microservices that are part of the schema in the backend.

For example, the LOLOMO DGS can define a type show with just the title. Then, the images DGS can extend that type show and add an artwork URL to it. The two different DGSs don’t know anything about each other. All they need to do is publish their schema to the federated gateway. The federated gateway knows how to talk to a DGS because all of them have a GraphQL endpoint.
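The composition step in that example can be sketched as a toy registry. This is a heavily simplified model, not the DGS framework or any federation library: the class FederatedSchemaSketch, its publish and ownerOf methods, and the DGS names are hypothetical, and real federation has richer rules about shared fields and entity keys.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a federated gateway that composes one schema from many DGSs.
final class FederatedSchemaSketch {
    // type name -> (field name -> owning DGS)
    private final Map<String, Map<String, String>> types = new LinkedHashMap<>();

    // A DGS publishes the fields it contributes to a type.
    void publish(String dgs, String type, List<String> fields) {
        Map<String, String> owners = types.computeIfAbsent(type, t -> new LinkedHashMap<>());
        for (String field : fields) {
            String previous = owners.putIfAbsent(field, dgs);
            if (previous != null && !previous.equals(dgs)) {
                throw new IllegalStateException(
                        type + "." + field + " is already owned by " + previous);
            }
        }
    }

    // Which DGS the gateway would call to resolve a field, or null if unknown.
    String ownerOf(String type, String field) {
        return types.getOrDefault(type, Map.of()).get(field);
    }

    public static void main(String[] args) {
        FederatedSchemaSketch gateway = new FederatedSchemaSketch();
        gateway.publish("lolomo-dgs", "Show", List.of("title"));
        gateway.publish("images-dgs", "Show", List.of("artworkUrl"));
        System.out.println(gateway.ownerOf("Show", "title"));
        System.out.println(gateway.ownerOf("Show", "artworkUrl"));
    }
}
```

Neither DGS refers to the other. Only the gateway holds the combined view, which is what lets teams extend a shared type independently.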

More recently, Netflix has been actively testing and rolling out changes with Java 21. Compared with the move from Java 8 to Java 17, going from Java 17 to 21 is significantly easier. Java 21 also provides a few important features.

Netflix found a lot of benefits in leveraging the huge open-source community of the Spring framework, existing documentation, and training opportunities that are easily available. The evolution of Spring and its features aligns very well with the core Netflix principle of “highly aligned, loosely coupled”.
