breell willabel latashah

Ling Kliment

Aug 2, 2024, 11:42:51 AM

to gauropanfto

Recently, DEV was struggling with CI build times. It got to the point where Travis builds were taking up to 30 mins to complete. On top of that, we had a few flaky specs in our build and the result was some very frustrated developers.

When I began trying to figure out how to solve this issue, the first strategy that came to mind was parallelizing the build process. This means splitting up our test suite into chunks and running each of those chunks at the same time. The challenge with this is figuring out how to split the chunks up so that each chunk runs in the same amount of time.

For example, if you have a 30 min build and you want to split it into 3 parallel builds, ideally you want each build to run for 10 min. In order to do this, you have to figure out how long each of your tests takes to run. While grappling with this problem, one of my coworkers suggested I check out KnapsackPro.

KnapsackPro lets you split your tests evenly between parallel CI builds so that the whole suite finishes as quickly as possible. It does this by recording the time each test takes to run. It then uses that timing data to split your tests into however many groups you choose, with each group taking roughly the same time to run. This sounded like a great solution, so I started digging into the docs to figure out how to get it all set up.
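To make the idea of a timing-based split concrete, here is a minimal sketch in Python. It is not KnapsackPro's actual algorithm, which the post does not describe. It uses a common greedy heuristic (slowest file first, always onto the least-loaded node), and the function name, file paths, and timings are all made up for illustration.

```python
import heapq


def split_by_runtime(timings, node_count):
    """Greedy split: place each test file, slowest first, on the node
    that currently has the smallest total runtime.

    timings: dict mapping a test file path to its recorded seconds.
    Returns (groups, totals), both indexed by node.
    """
    # Min-heap of (total seconds assigned so far, node index).
    heap = [(0.0, i) for i in range(node_count)]
    heapq.heapify(heap)
    groups = [[] for _ in range(node_count)]

    slowest_first = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    for path, seconds in slowest_first:
        total, idx = heapq.heappop(heap)  # least-loaded node
        groups[idx].append(path)
        heapq.heappush(heap, (total + seconds, idx))

    totals = [0.0] * node_count
    for total, idx in heap:
        totals[idx] = total
    return groups, totals


if __name__ == "__main__":
    # Hypothetical timings adding up to a 30 min (1800 s) suite.
    recorded = {
        "spec/models/user_spec.rb": 300,
        "spec/requests/articles_spec.rb": 240,
        "spec/system/editor_spec.rb": 420,
        "spec/models/article_spec.rb": 180,
        "spec/services/notifier_spec.rb": 360,
        "spec/requests/comments_spec.rb": 300,
    }
    groups, totals = split_by_runtime(recorded, node_count=3)
    for node, (files, total) in enumerate(zip(groups, totals)):
        print(f"node {node}: {total / 60:.0f} min -> {files}")
```

With these made-up numbers, each of the 3 nodes ends up with 600 seconds of tests, which is the 10 min per build from the example above. Real timing data rarely divides this cleanly, so a greedy split is only approximately even.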

Before I go any further, I have to say that the KnapsackPro docs are some of the best I have worked with. They offer a thorough step-by-step setup plan for whatever language or testing framework you are using. In addition, their FAQ docs cover just about every possible buggy scenario you might run into. All of this made setting up KnapsackPro a straightforward process.

Once you have the gem installed, it's time to set up your configuration based on the testing framework you use. To figure out how to configure the gem, KnapsackPro gives you this handy installation guide.

Simply select your testing frameworks and gems and it will tell you what to add to your configuration. In our case, we use RSpec, WebMock, and Travis CI, which gave us these additional steps to perform.

If you happen to see your tests failing due to WebMock not allowing requests to the Knapsack Pro API, it means you probably reconfigure WebMock in some of your tests. For instance, you may use WebMock.reset!, or it's called automatically in the after(:each) block if you require 'webmock/rspec'. These setups will remove api.knapsackpro.com from allowed domains. Please try below to fix this issue:

This PR introduces a service called Knapsack to help us parallelize our spec suite as evenly as possible. The first time I ran Knapsack in our build it ran every test separately and recorded the time it took to run the test. Using this information Knapsack then splits the tests up for us into 3 equally timed groups to run in parallel each time we run our test suite. This is how regular mode works.

The changes in this PR are introducing the gem and setting it up. All of them were made with the help of the Knapsack installation guide which walks you through all the changes you should make to get it working properly.

NOTE - There is currently a bug with the parallelization in Travis that causes the --local flag for our bundler command to be ignored. This means on your first build, since there is no Travis cache, the jobs will likely take 13 min. I am in contact with Travis support to get this resolved.

Why aren't we using Queue Mode? Ideally, we want to use Queue Mode. In Queue Mode, Knapsack sends us groups of 3-5 specs at once and then, when they finish, sends another group of specs. It keeps doing this until all specs have been run. This is obviously the fastest approach, but we ran into some errors with the jobs hanging. My goal is to get the regular version out, then try to debug that hanging issue so we can use Queue Mode.
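The Queue Mode behaviour described here can be sketched as a small simulation. This is an illustration only, not the knapsack_pro implementation. The function name is hypothetical, spec durations are plain numbers, and the default batch size of 3 is an assumption taken from the "3-5 specs" figure above.

```python
import heapq
from collections import deque


def simulate_queue_mode(durations, node_count, batch_size=3):
    """Simulate CI nodes pulling batches of specs from one shared queue.

    durations: spec runtimes in the order the queue hands them out.
    Returns the time at which each node finishes its last batch.
    """
    queue = deque(durations)
    # Min-heap of (time the node becomes free, node index).
    nodes = [(0.0, i) for i in range(node_count)]
    heapq.heapify(nodes)
    finish = [0.0] * node_count

    while queue:
        free_at, idx = heapq.heappop(nodes)  # next node to ask for work
        take = min(batch_size, len(queue))
        batch = [queue.popleft() for _ in range(take)]
        free_at += sum(batch)
        finish[idx] = free_at
        heapq.heappush(nodes, (free_at, idx))
    return finish


if __name__ == "__main__":
    specs = [4, 4, 4, 1, 1, 1, 1, 1, 1]  # hypothetical runtimes in minutes
    print(simulate_queue_mode(specs, node_count=2, batch_size=3))
    print(simulate_queue_mode(specs, node_count=2, batch_size=1))
```

In this toy run, batches of 3 leave the two nodes finishing at 12 and 6 minutes, while batches of 1 let both finish at 9. Smaller batches balance better because a node that gets slow specs simply asks for fewer of them, at the cost of more requests to the queue.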

If you click through to the pull request, you will notice we also made some additional changes to ensure our code coverage checks and other CI steps ran efficiently and correctly with our new parallel builds.

With all of those changes in place, you will then want to push your branch up and let the KnapsackPro API do its thing. Keep in mind, the first run will NOT be optimal because the knapsack_pro gem will record the execution time of every one of your tests.

In the KnapsackPro dashboard you can find everything from node build times to a breakdown of how long each test took to run. Once KnapsackPro has that data, it can strategically split your tests up as evenly as possible for all future builds. Your second test suite run on your CI provider will be parallelized with the optimal test suite split if the first run was recorded correctly.

The new script tries to hit the KnapsackPro API, but without the token, it fails. Upon failure, it will fall back on grouping tests by directory names and you will see an output that looks like this:

I said it above and I will say it again: KnapsackPro is extremely well documented, which makes getting started with it very straightforward. There is literally a doc for just about every question or scenario you can run into.

One of the big benefits of KnapsackPro is that they give you the option to make your dashboard and test stats public. This is a huge deal for us at DEV because we have so many external contributors. It is amazing when those contributors can access the same data that the core team can.

KnapsackPro is a small company, which means when you send them a support email it goes straight to a real person! No automated response, no bouncing around between different support people with canned responses. You go straight to someone who will be able to help you.

At the start of this, DEV's test suite was in pretty rough shape in terms of flakiness and reliability. It's also worth mentioning that I am a Site Reliability Engineer, not a QA engineer, so I struggled quite a bit getting everything set up. What got me through was the support and help I received from KnapsackPro along the way. Email responses were quick (within 24 hrs), and not only would they answer my questions, but they also offered me tips on how to set things up even more efficiently than I had.

The end result of all this work is that we now have a test suite that runs in about 10 min! In addition, when we come across a new flaky spec, we can simply retry the one job that failed, instead of having to run the entire suite.

The move also forced us to separate our testing process and our deploy process. Now when a deploy fails for some external reason, we can simply retry the deploy step. Before, we would have to restart the entire build and run the whole test suite again before we could deploy. It was not fun.

I was not enticed or asked to write this blog post by anyone from KnapsackPro. I know many companies struggle with slow test suites and I wanted to share how we tackled that problem at DEV so hopefully, others might be able to do what we did to solve their own challenges.

If anyone is interested in learning how a dynamic test suite split works in Knapsack Pro Queue Mode, I've written an article comparing Regular Mode and Queue Mode: docs.knapsackpro.com/2020/how-to-s...

Queue Mode can be especially useful when you run dozens of parallel jobs, or even 100+, and your tests have random execution times or some of the parallel jobs start late. In those cases, Queue Mode auto-balances the test split so that all parallel CI nodes finish at the same time :) As a result, you get a fast CI build with no lagging parallel job acting as a bottleneck.

Jonathan Friedland, the new vice president of global corporate communications who had joined Netflix just a few months earlier, asked whether customers on tight incomes might object to the price hike, according to people at Hastings' meeting. Hastings argued that Netflix was a great bargain. He said he knew that some customers would complain but that the number would be small and the anger would quickly fade.

Hastings was wrong. The price hike and the later, aborted attempt to spin off the company's DVD operations enraged Netflix customers. The company lost 800,000 subscribers, its stock price dropped 77 percent in four months, and management's reputation was battered. Hastings went from Fortune magazine's Businessperson of the Year to the target of Saturday Night Live satire.

To Hastings' credit, what he wanted to do made sense. The DVD's best days are behind it. Video streamed via the Internet is slowly replacing the physical disc, and betting a business on a dying product is never a great idea. So Hastings wanted to get ahead of the curve and focus on streaming, to disrupt his own business before someone else did it for him. It was aggressive, far-sighted, and very much in character.

Hastings is someone who knows a thing or two about disrupting businesses. Netflix, after all, is the company that drove the giants of video rental out of the sector with a simple premise: A simple-to-use Web site that delivers DVDs right to your doorstep. Best of all: No late fees. He became one of those executives with the "visionary" label, who can predict where a market is going before it happens, and was asked to join the board of directors of two of the most important companies in tech, Microsoft and Facebook.

Leading up to the first anniversary of the Netflix meltdown, CNET interviewed former and current Netflix employees to find out how a series of missteps turned into a lost year, and whether the company has rebounded from those self-inflicted wounds. Most asked to remain anonymous. Netflix declined to comment for this story.

So how did Hastings stumble? Just prior to the attempt to remake Netflix into a streaming-video distributor, there was turmoil in the company's executive offices. Several of Hastings' most trusted lieutenants were no longer as influential with the CEO. Others had left and their replacements did not yet have the clout to convince Hastings he was being too aggressive for a customer base that by 2011 could hardly have been considered on the bleeding edge of consumer tech.