It's true that binary:logistic is the default objective for XGBClassifier, but I don't see any reason why you couldn't use the other objectives offered by the XGBoost package. For example, you can see in the sklearn.py source code that multi:softprob is used explicitly in the multiclass case.
The default objective for XGBClassifier is binary:logistic; however, there are other options as well:
binary:logistic - returns predicted probabilities for the predicted class
multi:softmax - returns a hard class label for multiclass classification
multi:softprob - returns predicted probabilities for multiclass classification
As for the objective, you cannot optimize MAPE directly; instead, you use objective='reg:squarederror' which optimizes the L2 loss. You can use early stopping with the MAPE metric on a validation data set in order to choose the best model.
use_rmm: Whether to use RAPIDS Memory Manager (RMM) to allocate GPU memory. This option is only applicable when XGBoost is built (compiled) with the RMM plugin enabled. Valid values are true and false.
For more information about GPU acceleration, see XGBoost GPU Support. In distributed environments, ordinal selection is handled by distributed frameworks instead of XGBoost. As a result, using cuda:<ordinal> will result in an error. Use cuda instead.
Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Note that a tree where no splits were made might still contain a single terminal node with a non-zero score.
Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. exact tree method requires non-zero value.
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be.
Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.
gradient_based: the selection probability for each training instance is proportional to the regularized absolute value of gradients (more specifically, \(\sqrt{g^2 + \lambda h^2}\)). subsample may be set to as low as 0.1 without loss of model accuracy. Note that this sampling method is only supported when tree_method is set to hist and the device is cuda; other tree methods only support uniform sampling.
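A hypothetical parameter set enabling gradient-based sampling, reflecting the constraints above (hist tree method, CUDA device):

```python
# Illustrative configuration only; requires an XGBoost build with CUDA support.
params = {
    "tree_method": "hist",
    "device": "cuda",
    "sampling_method": "gradient_based",
    "subsample": 0.1,  # may be this low without loss of accuracy
}
```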
colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. This is not supported by the exact tree method.
Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). See Parameters Tuning for more discussion. Also, see Higgs Kaggle competition demo for examples: R, py1, py2, py3.
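The typical value above can be computed directly from the labels; a small sketch with made-up class counts:

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)

# sum(negative instances) / sum(positive instances)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # 9.0
```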
A comma separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it could be also set explicitly by a user. The following updaters exist:
update: Starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updaters is run for that tree, and a modified tree is added to the new model. The new model would have either the same or smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updaters could be meaningfully used with this process type: refresh, prune. With process_type=update, one cannot use updaters that create new trees.
Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See Feature Interaction Constraints for more information.
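Using the example constraint from above in a parameter dict (illustrative fragment only):

```python
# Features 0 and 1 may interact with each other; features 2, 3 and 4 form a
# second group; no split path may combine features across the two groups.
params = {
    "interaction_constraints": [[0, 1], [2, 3, 4]],
    "tree_method": "hist",
}
```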
A threshold for deciding whether XGBoost should use a one-hot-encoding-based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes.
If the booster object is DART type, predict() will perform dropouts, i.e. only some of the trees will be evaluated. This will produce incorrect results if the data is not the training data. To obtain correct results on test sets, set iteration_range to a nonzero value.
greedy: Select coordinate with the greatest gradient magnitude. It has O(num_feature^2) complexity. It is fully deterministic. It allows restricting the selection to top_k features per group with the largest magnitude of univariate weight change, by setting the top_k parameter. Doing so would reduce the complexity to O(num_feature*top_k).
thrifty: Thrifty, approximately-greedy feature selector. Prior to cyclic updates, reorders features in descending magnitude of their univariate weight changes. This operation is multithreaded and is a linear complexity approximation of the quadratic greedy selection. It allows restricting the selection to top_k features per group with the largest magnitude of univariate weight change, by setting the top_k parameter.
reg:squaredlogerror: regression with squared log loss \(\frac{1}{2}[\log(pred + 1) - \log(label + 1)]^2\). All input labels are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.
reg:absoluteerror: Regression with L1 error. When tree model is used, leaf value is refreshed after tree construction. If used in distributed training, the leaf value is calculated as the mean value from all workers, which is not guaranteed to be optimal.
multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
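To illustrate the flat layout described above, a small numpy sketch (the probability values are made up, standing in for raw multi:softprob output):

```python
import numpy as np

ndata, nclass = 4, 3
# Flat vector of ndata * nclass values, one probability per (row, class) pair.
flat = np.array([0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.3, 0.3, 0.4,
                 0.5, 0.25, 0.25])

# Reshape to an ndata x nclass matrix: one row of class probabilities per point.
probs = flat.reshape(ndata, nclass)
```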
reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
rmsle: root mean square log error: \(\sqrt{\frac{1}{N}[\log(pred + 1) - \log(label + 1)]^2}\). Default metric of reg:squaredlogerror objective. This metric reduces errors generated by outliers in the dataset. But because the log function is employed, rmsle might output nan when a prediction value is less than -1. See reg:squaredlogerror for other requirements.
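The metric can be sketched in numpy, using log1p for log(x + 1):

```python
import numpy as np

def rmsle(pred, label):
    # log1p(x) = log(x + 1); a prediction below -1 makes this NaN,
    # matching the caveat above.
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(label)) ** 2))
```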
error: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
When used with the LTR task, the AUC is computed by comparing pairs of documents to count correctly sorted pairs. This corresponds to pairwise learning to rank. The implementation has some issues: the average AUC over groups and over distributed workers is not well-defined.
On a single machine the AUC calculation is exact. In a distributed environment the AUC is a weighted average over the AUC of training rows on each node - therefore, distributed AUC is an approximation sensitive to the distribution of data across workers. Use another metric in distributed environments if precision and reproducibility are important.
After XGBoost 1.6, both the requirements and restrictions for using aucpr in classification problems are similar to auc. For the ranking task, only binary relevance label \(y \in [0, 1]\) is supported. Different from map (mean average precision), aucpr calculates the interpolated area under the precision-recall curve using continuous interpolation.
interval-regression-accuracy: Fraction of data points whose predicted labels fall in the interval-censored labels. Only applicable for interval-censored data. See Survival Analysis with Accelerated Failure Time for details.
It specifies the number of pairs sampled for each document when pair method is mean, or the truncation level for queries when the pair method is topk. For example, to train with ndcg@6, set lambdarank_num_pair_per_sample to \(6\) and lambdarank_pair_method to topk.
Whether we should use the exponential gain function for NDCG. There are two forms of gain function for NDCG: one uses the relevance value directly, while the other uses \(2^{rel} - 1\) to emphasize retrieving relevant documents. When ndcg_exp_gain is true (the default), the relevance degree cannot be greater than 31.
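A hypothetical parameter set combining the ranking options discussed above (all values illustrative, matching the ndcg@6 example):

```python
# Optimize NDCG truncated at 6, forming pairs from the top 6 documents
# per query rather than sampling pairs per document.
params = {
    "objective": "rank:ndcg",
    "lambdarank_pair_method": "topk",
    "lambdarank_num_pair_per_sample": 6,
    "eval_metric": "ndcg@6",
}
```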
My differentiation knowledge is pretty rusty, so I've created a custom objective function with a gradient and hessian that models the mean squared error function that is run as the default objective in XGBRegressor, to make sure that I am doing all of this correctly. The problem is, the results of the model (the error outputs) are close but not identical for the most part (and way off for some points). I don't know what I'm doing wrong, or how that could be possible if I am computing things correctly. If you all could look at this and maybe provide insight into where I am wrong, that would be awesome!
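For reference, the squared-error objective XGBoost uses internally comes out to a very simple gradient/hessian pair; a minimal sketch with the callback signature xgb.train expects for obj (preds array plus a DMatrix carrying the labels):

```python
import numpy as np

# reg:squarederror uses loss = 1/2 * (pred - label)^2 per instance,
# so grad = pred - label and hess = 1 for every instance.
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels
    hess = np.ones_like(preds)
    return grad, hess
```

Even with a correct gradient and hessian, one common source of close-but-not-identical results is the intercept: recent XGBoost versions estimate base_score from the labels for the built-in objective but fall back to a constant for custom objectives, so pinning base_score to the same value in both runs is worth trying before suspecting the math.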