This document is intended to help those with a basic knowledge of machine learning get the benefit of Google's best practices in machine learning. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.
Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:
Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.
The Google Plus team measures expands per read, reshares per read, plus-ones per read, comments/read, comments per user, reshares per user, etc., which they use in computing the goodness of a post at serving time. Also, note that an experiment framework, in which you can group users into buckets and aggregate statistics by experiment, is important. See Rule #12.
By being more liberal about gathering metrics, you can gain a broader picture of your system. Notice a problem? Add a metric to track it! Excited about some quantitative change on the last release? Add a metric to track it!
A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).
The first model provides the biggest boost to your product, so it doesn't need to be fancy. But you will run into many more infrastructure issues than you expect. Before anyone can use your fancy new machine learning system, you have to determine: how to get examples to your learning algorithm, a first cut as to what "good" and "bad" mean to your system, and how to integrate your model into your application.
Once you have a system that does these three things reliably, you have done most of the work. Your simple model provides you with baseline metrics and a baseline behavior that you can use to test more complex models. Some teams aim for a "neutral" first launch: a first launch that explicitly deprioritizes machine learning gains, to avoid getting distracted.
Machine learning has an element of unpredictability, so make sure that you have tests for the code for creating examples in training and serving, and that you can load and use a fixed model during serving. Also, it is important to understand your data: see Practical Advice for Analysis of Large, Complex Data Sets.
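One such test can be sketched as a training/serving parity check: the same raw input should produce identical feature vectors through both code paths. The two extraction functions below are hypothetical stand-ins for your own pipeline; in a real system they would ideally share one implementation.

```python
# A minimal sketch of a training/serving parity test. The functions
# are hypothetical stand-ins for your own training-time and
# serving-time feature-extraction code paths.

def make_features_training(raw):
    # Training-side feature extraction (e.g., from a logs pipeline).
    return {"country": raw["country"].lower(), "num_clicks": int(raw["clicks"])}

def make_features_serving(raw):
    # Serving-side feature extraction (e.g., in the request handler).
    return {"country": raw["country"].lower(), "num_clicks": int(raw["clicks"])}

def test_training_serving_parity():
    raw = {"country": "CA", "clicks": "3"}
    assert make_features_training(raw) == make_features_serving(raw)

test_training_serving_parity()
print("parity ok")
```

When the two paths drift apart (a classic training/serving skew bug), a test like this fails before the model quietly degrades in production.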
Do be mindful of the added complexity when using heuristics in an ML system. Using old heuristics in your new machine learning algorithm can help to create a smooth transition, but think about whether there is a simpler way to accomplish the same effect.
This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes you find tables that are months out of date, and a simple refresh improves performance more than any other launch that quarter! The coverage of a feature may change due to implementation changes: for example, a feature column could be populated in 90% of the examples, and suddenly drop to 60% of the examples. Play once had a table that was stale for 6 months, and refreshing the table alone gave a boost of 2% in install rate. If you track statistics of the data, as well as manually inspect the data on occasion, you can reduce these kinds of failures.
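Tracking coverage can be sketched as follows, assuming examples are dicts of feature values with `None` for missing data; the 90% alert threshold is illustrative, not from the source.

```python
# A minimal sketch of feature-coverage monitoring. Examples are
# assumed to be dicts with None marking an unpopulated feature;
# the alert threshold below is illustrative.

def coverage(examples, feature):
    """Fraction of examples in which `feature` is populated."""
    filled = sum(1 for ex in examples if ex.get(feature) is not None)
    return filled / len(examples)

examples = [
    {"country": "US", "age": 31},
    {"country": "CA", "age": None},  # feature present but unpopulated
    {"country": None, "age": 25},
    {"country": "MX", "age": 40},
]

print(coverage(examples, "country"))  # 0.75
if coverage(examples, "country") < 0.9:
    print("ALERT: country coverage below 90%")
```

Run over each day's training data, a check like this surfaces the 90%-to-60% coverage drops described above before they silently degrade the model.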
If the system is large, and there are many feature columns, know who created or is maintaining each feature column. If you find that the person who understands a feature column is leaving, make sure that someone has the information. Although many feature columns have descriptive names, it's good to have a more detailed description of what the feature is, where it came from, and how it is expected to help.
You have many metrics, or measurements about the system that you care about, but your machine learning algorithm will often require a single objective, a number that your algorithm is "trying" to optimize. I distinguish here between objectives and metrics: a metric is any number that your system reports, which may or may not be important. See also Rule #2.
You want to make money, make your users happy, and make the world a better place. There are tons of metrics that you care about, and you should measure them all (see Rule #2). However, early in the machine learning process, you will notice them all going up, even those that you do not directly optimize. For instance, suppose you care about number of clicks and time spent on the site. If you optimize for number of clicks, you are likely to see the time spent increase.
Often you don't know what the true objective is. You think you do, but then, as you stare at the data and side-by-side analysis of your old system and new ML system, you realize you want to tweak the objective. Further, different team members often can't agree on the true objective. The ML objective should be something that is easy to measure and is a proxy for the "true" objective. In fact, there is often no "true" objective (see Rule #39). So train on the simple ML objective, and consider having a "policy layer" on top that allows you to add additional logic (hopefully very simple logic) to do the final ranking.
These are all important, but also incredibly hard to measure. Instead, use proxies: if the user is happy, they will stay on the site longer. If the user is satisfied, they will visit again tomorrow. Insofar as well-being and company health is concerned, human judgement is required to connect any machine-learned objective to the nature of the product you are selling and your business plan.
Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value. This makes them easier to debug than models that use objectives (zero-one loss, various hinge losses, and so on) that try to directly optimize classification accuracy or ranking performance. For example, if probabilities in training deviate from probabilities predicted in side-by-sides or by inspecting the production system, this deviation could reveal a problem.
For example, in linear, logistic, or Poisson regression, there are subsets of the data where the average predicted expectation equals the average label (1-moment calibrated, or just calibrated). This is true assuming that you have no regularization and that your algorithm has converged, and it is approximately true in general. If you have a feature which is either 1 or 0 for each example, then the set of examples where that feature is 1 is calibrated. Also, if you have a feature that is 1 for every example, then the set of all examples is calibrated.
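The calibration check above can be sketched in a few lines: on the slice of examples where a binary feature is 1, compare the mean prediction to the mean label. The data here is illustrative.

```python
# A minimal sketch of a per-slice calibration check: the average
# prediction on a slice should roughly equal the average label.
# The example data is illustrative.

def slice_calibration(examples, feature):
    """Return (mean prediction, mean label) on the slice where feature == 1."""
    sl = [ex for ex in examples if ex[feature] == 1]
    mean_pred = sum(ex["prediction"] for ex in sl) / len(sl)
    mean_label = sum(ex["label"] for ex in sl) / len(sl)
    return mean_pred, mean_label

examples = [
    {"is_mobile": 1, "prediction": 0.7, "label": 1},
    {"is_mobile": 1, "prediction": 0.3, "label": 0},
    {"is_mobile": 0, "prediction": 0.9, "label": 1},
]

mean_pred, mean_label = slice_calibration(examples, "is_mobile")
print(mean_pred, mean_label)  # 0.5 0.5
```

A large gap between the two means on some slice is exactly the kind of training-versus-production deviation that reveals a problem.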
At some level, the output of these two systems will have to be integrated. Keep in mind, filtering spam in search results should probably be more aggressive than filtering spam in email messages. Also, it is a standard practice to remove spam from the training data for the quality classifier.
In the first phase of the lifecycle of a machine learning system, the important issues are to get the training data into the learning system, get any metrics of interest instrumented, and create a serving infrastructure. After you have a working end-to-end system with unit and system tests instrumented, Phase II begins.
In the second phase, there is a lot of low-hanging fruit. There are a variety of obvious features that could be pulled into the system. Thus, the second phase of machine learning involves pulling in as many features as possible and combining them in intuitive ways. During this phase, all of the metrics should still be rising. There will be lots of launches, and it is a great time to pull in lots of engineers who can join up all the data that you need to create a truly awesome learning system.
If you use an external system to create a feature, remember that the external system has its own objective. The external system's objective may be only weakly correlated with your current objective. If you grab a snapshot of the external system, then it can become out of date. If you update the features from the external system, then the meanings may change. If you use an external system to provide a feature, be aware that this approach requires a great deal of care.
The primary issue with factored models and deep models is that they are nonconvex. Thus, there is no guarantee that an optimal solution can be approximated or found, and the local minima found on each iteration can be different. This variation makes it hard to judge whether the impact of a change to your system is meaningful or random. By creating a model without deep features, you can get an excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.
There are a variety of ways to combine and modify features. Machine learning systems such as TensorFlow allow you to pre-process your data through transformations. The two most standard approaches are "discretizations" and "crosses".
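Discretization can be sketched as bucketing a continuous feature so that each bucket becomes a separate feature the model can weight independently; the age boundaries below are illustrative, not from the source.

```python
# A minimal sketch of discretization: a continuous feature (age) is
# mapped to a bucket index given sorted boundaries, and each bucket
# can then act as a boolean feature. Boundaries are illustrative.
import bisect

BOUNDARIES = [18, 35, 50, 65]

def bucketize(age):
    """Return the bucket index for `age` (0 .. len(BOUNDARIES))."""
    return bisect.bisect_right(BOUNDARIES, age)

print(bucketize(12))  # 0 -> under 18
print(bucketize(42))  # 2 -> 35..49
print(bucketize(70))  # 4 -> 65 and over
```

This lets a linear model learn a different weight per age range instead of a single slope over all ages.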
Crosses combine two or more feature columns. A feature column, in TensorFlow's terminology, is a set of homogeneous features (e.g., {male, female} or {US, Canada, Mexico}). A cross is a new feature column with features in, for example, {male, female} × {US, Canada, Mexico}. This new feature column will contain the feature (male, Canada). If you are using TensorFlow and you tell TensorFlow to create this cross for you, this (male, Canada) feature will be present in examples representing male Canadians. Note that it takes massive amounts of data to learn models with crosses of three, four, or more base feature columns.
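The cross above can be sketched in plain Python (rather than TensorFlow's own API) as a column whose vocabulary is the Cartesian product of the two base columns; the field names are illustrative.

```python
# A minimal sketch of a feature cross: two categorical feature
# columns combined into one column whose vocabulary is their
# Cartesian product. Field names are illustrative.
from itertools import product

genders = ["male", "female"]
countries = ["US", "Canada", "Mexico"]

# The crossed column's vocabulary: every (gender, country) pair.
cross_vocab = list(product(genders, countries))

def cross_feature(example):
    """Map one example to its crossed feature value."""
    return (example["gender"], example["country"])

example = {"gender": "male", "country": "Canada"}
print(cross_feature(example))  # ('male', 'Canada')
print(len(cross_vocab))        # 6
```

Note how the vocabulary size is the product of the base column sizes (2 × 3 = 6 here), which is why crosses of many columns need massive amounts of data.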