Hi Josh,
>
> How can I use PMML to create dummy columns to suit an XGBoost model?
> I was thinking a combination of IsMissing and maybe MapValues?
>
You should keep things as simple as possible, and actually train the
XGBoost model using "DayOfWeek" as a categorical variable. Yes, it can
be done, and it's much easier than messing with PMML transformations
afterwards.
The idea is to expand the single categorical column "DayOfWeek" into a
sequence of seven indicator (0/1) columns "isSunday", "isMonday", ...,
"isSaturday" using the so-called OneHotEncoder transformation. The
actual mechanics of doing so depend on your programming
language/platform.
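To make the expansion concrete, here is a minimal sketch in base R. It
is not how JPMML-XGBoost does it internally; the toy data and the
"isSunday"-style column names are assumptions made purely for
illustration.

```r
# Toy data; only the column name "DayOfWeek" comes from the example above.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
df_X <- data.frame(
  DayOfWeek = factor(c("Monday", "Sunday", "Saturday", "Monday"), levels = days)
)

# "~ DayOfWeek - 1" drops the intercept, so every factor level
# gets its own 0/1 indicator column (seven columns in total).
onehot <- model.matrix(~ DayOfWeek - 1, data = df_X)

# Rename the columns to the illustrative "is<Day>" style.
colnames(onehot) <- paste0("is", levels(df_X$DayOfWeek))

print(onehot)
```

Each row has exactly one column set to 1, which is the property that
lets the indicator columns be folded back into one categorical field
later on.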
The JPMML-XGBoost project provides helper functions genDMatrix(df_y,
df_X, file) and genFMap(df_X, file) that should help get you started
in R:
https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/R/util.R
First, prepare one-hot-encoded data files "onehotencoded.svm" (LibSVM
data format) and "onehotencoded.fmap" (XGBoost feature map data
format) based on a data.frame:
df_X = data.frame("DayOfWeek" = ..., ...)
df_y = ...
my_dmatrix = genDMatrix(df_y, df_X, "onehotencoded.svm")
my_fmap = genFMap(df_X, "onehotencoded.fmap")
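For orientation, an XGBoost feature map lists one feature per line as
an index, a name and a type ("i" for indicator, "q" for quantitative).
The sketch below builds such a table in base R. The helper
sketch_fmap() is hypothetical and is not part of JPMML-XGBoost; the
"column=level" naming for indicator features is my assumption about
the convention, so check util.R for what genFMap() actually writes.

```r
# Hypothetical helper (not the real genFMap): one feature map row per
# one-hot-encoded column of the data.frame.
sketch_fmap <- function(df_X) {
  rows <- lapply(names(df_X), function(col) {
    x <- df_X[[col]]
    if (is.factor(x)) {
      # One indicator ("i") feature per factor level; the
      # "column=level" naming is an assumed convention.
      data.frame(name = paste(col, levels(x), sep = "="), type = "i",
                 stringsAsFactors = FALSE)
    } else {
      # A non-factor column stays a single quantitative ("q") feature.
      data.frame(name = col, type = "q", stringsAsFactors = FALSE)
    }
  })
  fmap <- do.call(rbind, rows)
  # XGBoost feature indices are zero-based.
  data.frame(id = seq_len(nrow(fmap)) - 1L, fmap, stringsAsFactors = FALSE)
}

# Toy input; "Temperature" is an invented column for illustration.
df_X <- data.frame(
  DayOfWeek = factor(c("Mon", "Sun"), levels = c("Sun", "Mon")),
  Temperature = c(21.5, 18.0)
)
fmap <- sketch_fmap(df_X)

# Write the tab-separated feature map to a temporary file.
fmap_file <- tempfile(fileext = ".fmap")
write.table(fmap, fmap_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

In practice you would call genFMap() from util.R instead; the sketch
only shows the shape of the information that the converter needs in
order to map indicator columns back to their source column.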
Then, train the XGBoost model as usual:
my_xgb = xgboost(data = my_dmatrix, objective = "reg:linear", nrounds = 15)
Finally, use the r2pmml package to convert the XGBoost model to a PMML
file. The function r2pmml.xgb.Booster takes an argument "fmap", which
can be the name of the XGBoost feature map file, or the data.frame
representation of the XGBoost feature map as returned by the genFMap()
utility function:
r2pmml(my_xgb, "onehotencoded.fmap", "xgb.pmml")
r2pmml(my_xgb, my_fmap, "xgb2.pmml")
If you open one of the resulting PMML documents, you will see that
the JPMML-XGBoost library was smart enough to aggregate those seven
indicator columns back into a single categorical field:
<DataField name="DayOfWeek">
  <Value value="Sunday"/>
  ...
  <Value value="Saturday"/>
</DataField>
The JPMML-XGBoost project includes some more R examples. For example,
the following script generates unit testing resources based on the
popular Auto and Audit datasets. The Audit dataset contains five
categorical variables, two of which have more than 10 levels each:
https://github.com/jpmml/jpmml-xgboost/blob/master/src/test/R/xgboost.R
VR