Z Score Prediction

Nikita Desjardins

Aug 3, 2024, 1:38:04 PM

Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.
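As a minimal sketch of this (the dataset and estimator are illustrative choices, not named above), a metric name such as "accuracy" can be passed as the scoring parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# scoring="accuracy" selects the named accuracy scorer for every CV fold
X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(random_state=0), X, y, cv=5, scoring="accuracy")
# scores holds one accuracy value per fold
```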

Metric functions: The sklearn.metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls what metric they apply to the estimators evaluated.

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric.
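A small sketch of this sign convention, using an illustrative synthetic regression problem:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, noise=1.0, random_state=0)
# the scorer negates mean_squared_error so that "higher is better" still holds
neg_mse = cross_val_score(LinearRegression(), X, y, cv=3,
                          scoring="neg_mean_squared_error")
# every fold score is -MSE, hence at most 0
```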

The following metric functions are not implemented as named scorers, sometimes because they require additional parameters, such as fbeta_score. They cannot be passed to the scoring parameter; instead their callable needs to be passed to make_scorer together with the value of the user-settable parameters.

Functions ending with _error, _loss, or _deviance return a value to minimize, the lower the better. When converting one into a scorer object using make_scorer, set the greater_is_better parameter to False (True by default; see the parameter description below).
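A minimal sketch of both cases (the tiny dataset and logistic-regression estimator are illustrative, not prescribed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer, mean_squared_error

# fbeta_score has no named scorer because beta is user-settable:
ftwo_scorer = make_scorer(fbeta_score, beta=2)
# an _error function must be negated to follow the "higher is better" rule:
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)
f2 = ftwo_scorer(model, X, y)          # F-beta score, higher is better
neg_mse = neg_mse_scorer(model, X, y)  # negated error, at most 0
```

Either scorer object can then be passed as the scoring argument of GridSearchCV or cross_val_score.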

The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics might require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score, through the sample_weight parameter.

Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though this may be configurable through the pos_label parameter).
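For instance (with illustrative labels), switching pos_label changes which class's precision and recall enter the F-score:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]
f1_pos1 = f1_score(y_true, y_pred)               # label 1 is positive (default)
f1_pos0 = f1_score(y_true, y_pred, pos_label=0)  # evaluate label 0 instead
```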

In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.

"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.

"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.

"samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculating the metric over the true and predicted classes for each sample in the evaluation data, and returning their (sample_weight-weighted) average.
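The macro/micro distinction can be sketched on a small multiclass example (the labels are illustrative):

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
# macro: unweighted mean of the three per-class precisions
macro = precision_score(y_true, y_pred, average="macro")
# micro: pooled true positives divided by pooled predictions
micro = precision_score(y_true, y_pred, average="micro")
```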

While multiclass data is provided to the metric, like binary targets, as an array of class labels, multilabel data is specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.

In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.
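A small sketch using indicator matrices (the labels are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# indicator matrices: cell [i, j] is 1 if sample i has label j
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [1, 1, 0]])
# only the first sample's label set matches exactly, so subset accuracy is 0.5
subset_acc = accuracy_score(y_true, y_pred)
```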

The top_k_accuracy_score function is a generalization of accuracy_score. The difference is that a prediction is considered correct as long as the true label is associated with one of the k highest predicted scores. accuracy_score is the special case of k = 1.
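An illustrative sketch with four samples and three classes:

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2],   # class 0 ranked first: correct
                    [0.3, 0.4, 0.2],   # class 1 ranked first: correct
                    [0.2, 0.4, 0.3],   # class 2 ranked second: correct for k=2
                    [0.7, 0.2, 0.1]])  # class 2 ranked last: incorrect
top2 = top_k_accuracy_score(y_true, y_score, k=2)  # 3 of 4 samples correct
```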

The balanced_accuracy_score function computes the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. It is the macro-average of recall scores per class or, equivalently, raw accuracy where each sample is weighted according to the inverse prevalence of its true class. Thus for balanced datasets, the score is equal to accuracy.

In the binary case, balanced accuracy is equal to the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate), or the area under the ROC curve with binary predictions rather than scores:

\(\texttt{balanced-accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)\)

In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to \(\frac{1}{n\_classes}\).
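The equivalence with macro-averaged recall can be checked directly on an illustrative imbalanced sample:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
# mean of per-class recalls: (3/4 + 1/2) / 2 = 0.625
bal_acc = balanced_accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
```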

Our definition: [Mosley2013], [Kelleher2015] and [Guyon2015], where [Guyon2015] adopts the adjusted version to ensure that random predictions have a score of \(0\) and perfect predictions have a score of \(1\).

Class balanced accuracy as described in [Mosley2013]: the minimum between the precision and the recall for each class is computed. Those values are then averaged over the total number of classes to get the balanced accuracy.

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix with each row corresponding to the true class (Wikipedia and other references may use different conventions for the axes).

The normalize parameter allows reporting ratios instead of counts. The confusion matrix can be normalized in 3 different ways: 'pred', 'true', and 'all', which will divide the counts by the sum of each column, each row, or the entire matrix, respectively.
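A small sketch of counts versus row-normalized ratios (the labels are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
counts = confusion_matrix(y_true, y_pred)  # rows correspond to true classes
# normalize="true" divides each row by its sum, so rows sum to 1
ratios = confusion_matrix(y_true, y_pred, normalize="true")
```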

In multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and y_pred, which is similar to the zero-one loss function. However, while zero-one loss penalizes prediction sets that do not strictly match true sets, the Hamming loss penalizes individual labels. Thus the Hamming loss, upper bounded by the zero-one loss, is always between zero and one, inclusive; and predicting a proper subset or superset of the true labels will give a Hamming loss between zero and one, exclusive.
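The bound can be sketched on an illustrative multilabel example, where one of four labels is wrong:

```python
import numpy as np
from sklearn.metrics import hamming_loss, zero_one_loss

y_true = np.array([[1, 1], [1, 1]])
y_pred = np.array([[1, 0], [1, 1]])
h = hamming_loss(y_true, y_pred)   # one wrong label out of four: 0.25
z = zero_one_loss(y_true, y_pred)  # one mismatched label set out of two: 0.5
```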

References [Manning2008] and [Everingham2010] present alternative variants of AP that interpolate the precision-recall curve. Currently, average_precision_score does not implement any interpolated variant. References [Davis2006] and [Flach2015] describe why a linear interpolation of points on the precision-recall curve provides an overly-optimistic measure of classifier performance. This linear interpolation is used when computing area under the curve with the trapezoidal rule in auc.

Note that the precision_recall_curve function is restricted to the binary case. The average_precision_score function supports multiclass and multilabel formats by computing each class score in a One-vs-the-rest (OvR) fashion and averaging them or not depending on the value of its average argument.
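A small binary sketch of the uninterpolated AP (the scores are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
# sum over thresholds of (recall step) * precision: 0.5*1 + 0.5*(2/3) = 5/6
ap = average_precision_score(y_true, y_scores)
```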

Note that this formula is still undefined when there are no true positives, false positives, or false negatives. By default, F-1 for a set of exclusively true negatives is calculated as 0; however, this behavior can be changed using the zero_division parameter. Here are some small examples in binary classification:
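For instance, on all-negative data (no TP, FP, or FN, so F-1 is undefined):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0]
y_pred = [0, 0, 0, 0]
# zero_division chooses the value returned for the undefined 0/0 case
f1_default = f1_score(y_true, y_pred, zero_division=0)  # the default value
f1_one = f1_score(y_true, y_pred, zero_division=1)      # return 1 instead
```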

In a multiclass and multilabel classification task, the notions of precision, recall, and F-measures can be applied to each label independently. There are a few ways to combine results across labels, specified by the average argument to the average_precision_score, f1_score, fbeta_score, precision_recall_fscore_support, precision_score and recall_score functions, as described above.

The jaccard_score (like precision_recall_fscore_support) applies natively to binary targets. By computing it set-wise it can be extended to apply to multilabel and multiclass through the use of average (see above).
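A multilabel sketch of the set-wise (per-sample) extension, with illustrative indicator matrices:

```python
import numpy as np
from sklearn.metrics import jaccard_score

y_true = np.array([[1, 0, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [1, 0, 0]])
# per-sample intersection over union, then averaged: (2/3 + 1/2) / 2 = 7/12
j = jaccard_score(y_true, y_pred, average="samples")
```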

The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support vector machines.)
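The one-sidedness can be sketched with hand-picked decision values (illustrative, not produced by a fitted model): correct predictions with margin at least 1 contribute nothing.

```python
from sklearn.metrics import hinge_loss

y_true = [-1, 1, 1]
pred_decision = [-2.0, 3.0, 0.5]  # illustrative decision-function values
# per sample: max(0, 1 - y * decision) -> [0, 0, 0.5]; mean = 0.5 / 3
loss = hinge_loss(y_true, pred_decision)
```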

Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization, and can be used to evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions.
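A two-sample sketch (the probabilities are illustrative, standing in for predict_proba output):

```python
import math
from sklearn.metrics import log_loss

y_true = [0, 1]
# predicted probabilities for classes [0, 1]
y_proba = [[0.9, 0.1],
           [0.2, 0.8]]
# mean negative log of the probability assigned to each true class
ll = log_loss(y_true, y_proba)  # -(log(0.9) + log(0.8)) / 2
```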

In the multiclass case, the Matthews correlation coefficient can be defined in terms of a confusion_matrix \(C\) for \(K\) classes. To simplify the definition consider the following intermediate variables:

When there are more than two labels, the value of the MCC will no longer range between -1 and +1. Instead the minimum value will be somewhere between -1 and 0 depending on the number and distribution of ground truth labels. The maximum value is always +1. For additional information, see [WikipediaMCC2021].
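A multiclass sketch (the labels are illustrative; these particular predictions happen to sit exactly at chance level):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)      # chance-level agreement here: 0
perfect = matthews_corrcoef(y_true, y_true)  # exact predictions always give +1
```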

The multilabel_confusion_matrix function computes class-wise (default) or sample-wise (samplewise=True) multilabel confusion matrices to evaluate the accuracy of a classification. multilabel_confusion_matrix also treats multiclass data as if it were multilabel, as this is a transformation commonly applied to evaluate multiclass problems with binary classification metrics (such as precision, recall, etc.).

Here are some examples demonstrating the use of the multilabel_confusion_matrix function to calculate recall (or sensitivity), specificity, fall-out and miss rate for each class in a problem with multilabel indicator matrix input.
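One such example, sketched with an illustrative indicator-matrix input:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])
# one 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(y_true, y_pred)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]
recall = tp / (tp + fn)          # sensitivity per class
specificity = tn / (tn + fp)
fall_out = fp / (fp + tn)
miss_rate = fn / (fn + tp)
```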
