JPMML Pre-Processing


Diogo Falcão

Feb 17, 2016, 6:43:20 AM
to Java PMML API
Hey Guys,

I am trying to find some Pre-Processing examples using JPMML, but I couldn't find any.

Can someone give me a very simple example showing how to use PMML pre-processing with JPMML?

Thank You Very Much!

Villu Ruusmann

Feb 17, 2016, 9:03:26 AM
to Java PMML API
Hi Diogo,

>
> I am trying to find some Pre-Processing examples using JPMML, but I couldn't find any.
>

What exactly do you have in mind when asking about "pre-processing examples"?

Most PMML documents contain transformation elements
(http://dmg.org/pmml/v4-2-1/Transformations.html) that map input field
values from the user's (i.e. "external") value space to the model's
(i.e. "internal") value space.

Some examples:
1) GeneralRegressionXformAuto.pmml - lines 57 to 80 - selected numeric
fields are centered and scaled:
https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-rattle/src/test/resources/pmml/GeneralRegressionXformAuto.pmml
2) KernlabSVMAudit.pmml - lines 91 to 250 - all categorical fields are
converted to numeric fields:
https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-rattle/src/test/resources/pmml/KernlabSVMAudit.pmml

>
> Can someone give me a very simple example showing how to use PMML pre-processing with JPMML?
>

The JPMML-Evaluator API is designed to work with PMML documents of
arbitrary complexity.

Data pre-processing, if necessary, is triggered automatically during
model evaluation. You cannot, and must not, interfere with it in any
way.


VR

Diogo Falcão

Feb 19, 2016, 3:06:43 PM
to Villu Ruusmann, Java PMML API
Thank you for the reply,

What I mean is: is it possible to deal with missing values in the data? For example, can I create a PMML document that replaces a missing value with a value of my choice?
And the examples only relate to transformations, right? Is it possible to merge them with a neural network model? For example, I want a single PMML document that pre-processes the data (transforms, fills in missing values, binarizes and so on) and then passes the data to a neural network model for evaluation.

Thank you

Villu Ruusmann

Feb 19, 2016, 4:26:50 PM
to Java PMML API
Hi Diogo,

>
> What I mean is: is it possible to deal with missing values in the data? For example,
> can I create a PMML document that replaces a missing value with a value of my choice?

Missing field values are a special case, because they can be handled
in two ways.

First, you can construct a DerivedField element that uses the
"isMissing" built-in function
(http://dmg.org/pmml/v4-2-1/BuiltinFunctions.html#boolean1) for its
business logic:

<DerivedField name="ensureXIsNotNull">
  <Apply function="if">
    <!-- condition -->
    <Apply function="isMissing">
      <FieldRef field="x"/>
    </Apply>
    <!-- condition evaluated to true ("x" is missing), return a replacement value -->
    <Constant>default_x</Constant>
    <!-- condition evaluated to false, return "x" as is -->
    <FieldRef field="x"/>
  </Apply>
</DerivedField>

Second, the same can be expressed using the "missingValueReplacement"
attribute of the MiningField element:

<MiningSchema>
  <MiningField name="x" missingValueReplacement="default_x"/>
</MiningSchema>
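
Both variants express the same rule: use the supplied value if there is one, otherwise fall back to a replacement. As a conceptual illustration only, here is that rule in plain Java. This is not the JPMML-Evaluator API; the class name MissingValueDemo and the method valueOrDefault are made up for this sketch, and a "missing" value is assumed to be a null or absent map entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that mimics the effect of missingValueReplacement.
// It is not part of any JPMML library.
class MissingValueDemo {

    // Returns the supplied value of the field, or the replacement
    // if the value is missing (null or not present in the map).
    static Object valueOrDefault(Map<String, Object> row, String field, Object replacement) {
        Object value = row.get(field);
        return (value != null) ? value : replacement;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("y", 2.5); // "x" is deliberately left out

        System.out.println(valueOrDefault(row, "x", "default_x"));
        System.out.println(valueOrDefault(row, "y", "default_y"));
    }
}
```

With a real PMML document you would not write this yourself; the evaluator applies the replacement for you when the model is evaluated.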

> I want a single PMML document that pre-processes the data (transforms, fills in
> missing values, binarizes and so on) and then passes the data to a neural
> network model for evaluation.
>

Basically, you want to build (manually?) a library of most common
transformations, and then "inject" those into machine generated PMML
documents?

If I recall correctly, then KNIME (https://www.knime.org/) should
provide such capabilities. Under "KNIME Labs" there is a section "PMML
/ Modular PMML", which contains nodes such as "PMML Model Appender",
"PMML Transformation Appender" etc. You would load a PMML document
with transformations and then append a newly created NeuralNetwork
model to it.

Of course, I would recommend that you check out the JPMML-Model
library and build such a merge tool yourself.

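// loadPMML() and savePMML() stand for your own helper methods for reading and writing PMML files
PMML templatePMML = loadPMML("template.pmml");
PMML modelPMML = loadPMML("nnet.pmml");
// Append all models of the second document to the first document
templatePMML.getModels().addAll(modelPMML.getModels());
savePMML(templatePMML, "template_with_nnet.pmml");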
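
If you want to try the idea without any extra dependencies, here is a rough sketch of the same merge using only the XML DOM classes that ship with the JDK. It deliberately swaps JPMML-Model for plain DOM, so it knows nothing about PMML semantics. The class name PmmlDomMerge and its methods are made up for this sketch, and it assumes that the model elements are not namespace-prefixed and are not nested inside other models.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical DOM-based merge helper; not part of JPMML-Model.
class PmmlDomMerge {

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    // Deep-copies every element named modelTag from the source document
    // and appends the copy as the last child of the target's root element.
    // Returns the number of copied elements.
    static int appendModels(Document target, Document source, String modelTag) {
        NodeList models = source.getElementsByTagName(modelTag);
        for (int i = 0; i < models.getLength(); i++) {
            Node copy = target.importNode(models.item(i), true);
            target.getDocumentElement().appendChild(copy);
        }
        return models.getLength();
    }

    public static void main(String[] args) {
        Document template = parse("<PMML><DataDictionary/><TransformationDictionary/></PMML>");
        Document model = parse("<PMML><DataDictionary/><NeuralNetwork/></PMML>");
        System.out.println(appendModels(template, model, "NeuralNetwork"));
    }
}
```

Keep in mind that copying the model element is only half of the job: the fields that the model refers to must also be declared in the merged document, which is the kind of bookkeeping that a proper object model makes easier.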


VR

Josh Izzard

Mar 1, 2016, 3:44:37 PM
to Java PMML API
Villu/Diogo,

I have a similar question about using PMML to create dummy variables based on incoming text data. For example, if I have a model that predicts Revenue as a function of DayOfWeek, and it is built off of data that looks like

| DayOfWeek_Sunday | DayOfWeek_Monday | .... | DayOfWeek_Saturday |
-------------------------------------------------------------------------
| 0 | 1 | ... | 0 |
-------------------------------------------------------------------------

and then receives requests with data of the form

DayOfWeek
------------
"Monday"


How can I use PMML to create dummy columns to suit an XGBoost model? I was thinking a combination of IsMissing and maybe MapValues?

Villu Ruusmann

Mar 1, 2016, 4:53:36 PM
to Java PMML API
Hi Josh,

>
> How can I use PMML to create dummy columns to suit an XGBoost model?
> I was thinking a combination of IsMissing and maybe MapValues?
>

You should keep things as simple as possible, and actually train the
XGBoost model using "DayOfWeek" as a categorical variable. Yes, it can
be done, and it's much easier than messing with PMML transformations
afterwards.

The idea is to expand the single categorical column "DayOfWeek" into a
sequence of seven indicator (0/1) columns "isSunday", "isMonday", ...,
"isSaturday" using the so-called OneHotEncoder transformation. The
actual mechanics of doing so depend on your programming
language/platform.
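
To make the expansion concrete, here is a minimal, platform-independent sketch in plain Java. The class name OneHotDemo and the fixed Sunday-first level order are assumptions made for this illustration; they do not come from any JPMML or XGBoost API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of one-hot encoding a single categorical value.
class OneHotDemo {

    // Category levels in the order of the indicator columns.
    static final List<String> DAYS = Arrays.asList(
        "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday");

    // Expands one categorical value into a 0/1 indicator vector
    // with one slot per level.
    static int[] encode(String day) {
        int index = DAYS.indexOf(day);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + day);
        }
        int[] indicators = new int[DAYS.size()];
        indicators[index] = 1;
        return indicators;
    }

    public static void main(String[] args) {
        // prints [0, 1, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(encode("Monday")));
    }
}
```

The R helper functions mentioned next do this kind of expansion for a whole data.frame at once.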

The JPMML-XGBoost project provides helper functions genDMatrix(df_y,
df_X, file) and genFMap(df_X, file) that should help get you started
in R:
https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/R/util.R

First, prepare one-hot-encoded data files "onehotencoded.svm" (LibSVM
data format) and "onehotencoded.fmap" (XGBoost feature map data
format) based on a data.frame:
df_X = data.frame("DayOfWeek" = ..., ...)
df_y = ...
my_dmatrix = genDMatrix(df_y, df_X, "onehotencoded.svm")
my_fmap = genFMap(df_X, "onehotencoded.fmap")

Then, train the XGBoost model as usual:
my_xgb = xgboost(data = my_dmatrix, objective = "reg:linear", nrounds = 15)

Finally, use the r2pmml package to convert the XGBoost model to a PMML
file. The function r2pmml.xgb.Booster takes an argument "fmap", which
can be the name of the XGBoost feature map file, or the data.frame
representation of the XGBoost feature map as returned by the genFMap()
utility function:
r2pmml(my_xgb, "onehotencoded.fmap", "xgb.pmml")
r2pmml(my_xgb, my_fmap, "xgb2.pmml")

If you open one of the resulting PMML documents, then you will see
that the JPMML-XGBoost library was smart enough to aggregate those seven
indicator columns back into a single categorical column:
<DataField name="DayOfWeek">
<Value value="Sunday"/>
..
<Value value="Saturday"/>
</DataField>

The JPMML-XGBoost project includes some more R examples. For example,
the following script generates unit testing resources for use with the
popular Auto and Audit datasets. The Audit dataset contains five
categorical variables, two of which have more than 10 levels.
https://github.com/jpmml/jpmml-xgboost/blob/master/src/test/R/xgboost.R


VR

Josh Izzard

Mar 2, 2016, 2:35:10 PM
to Java PMML API
Thanks very much Villu - really appreciate the help. On the mtcars dataset, it looks like this solution will work for me - I will let you know if it does indeed solve my problem. Thanks!