How to prepare derived field inputs for evaluation

Rikus Combrinck

Nov 20, 2016, 4:00:04 AM
to Java PMML API
I am new to JPMML and am using it to evaluate a model created in KNIME.

The model is a neural network whose inputs include numerical fields and a number of one-hot encoded categorical fields. I've included the transformation from categorical to one-hot values in the PMML document created by KNIME.

The top-level view of the PMML file looks like this:

<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
<Header copyright="rikus">...</Header>
<DataDictionary numberOfFields="14">...</DataDictionary>
<TransformationDictionary>...</TransformationDictionary>
<NeuralNetwork functionName="regression" ...>...</NeuralNetwork>
</PMML>

The TransformationDictionary contains the categorical to one-hot conversions, like this:

<TransformationDictionary>
  <DerivedField dataType="integer" name="A_CAT1" optype="ordinal">
    <NormDiscrete field="CAT1" mapMissingTo="0.0" value="A"/>
  </DerivedField>
  <DerivedField dataType="integer" name="B_CAT1" optype="ordinal">
    <NormDiscrete field="CAT1" mapMissingTo="0.0" value="B"/>
  </DerivedField>
  <DerivedField dataType="integer" name="A_CAT2" optype="ordinal">
    <NormDiscrete field="CAT2" mapMissingTo="0.0" value="A"/>
  </DerivedField>
  <DerivedField dataType="integer" name="B_CAT2" optype="ordinal">
    <NormDiscrete field="CAT2" mapMissingTo="0.0" value="B"/>
  </DerivedField>
  ...
</TransformationDictionary>

Now, my understanding is that the one-hot transformation will be done automatically when I evaluate the model. Is this correct?

The problem occurs when I try to create an input map to supply to the evaluator in this manner:

Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();

List<InputField> inputFields = evaluator.getInputFields();
for(InputField inputField : inputFields){
    FieldName inputFieldName = inputField.getName();

    // The raw (i.e. user-supplied) value could be any Java primitive value
    Object rawValue = ...;

    // The raw value is passed through: 1) outlier treatment, 2) missing value treatment, 3) invalid value treatment and 4) type conversion
    FieldValue inputFieldValue = inputField.prepare(rawValue);

    arguments.put(inputFieldName, inputFieldValue);
}

The call to evaluator.getInputFields() returns all the numerical input fields to the model, but neither the one-hot fields (A_CAT1, B_CAT1, A_CAT2, ...) nor, as I had expected, the fields that should be transformed (CAT1, CAT2, ...).

Why are the one-hot fields before transformation (CAT1, CAT2, ...) not part of the evaluator.getInputFields() collection, and how do I access them (to be able to use their prepare() methods) while populating the input map for evaluator.evaluate(arguments)?

Your help would be much appreciated.

Villu Ruusmann

Nov 21, 2016, 2:01:44 PM
to Java PMML API
Hi Rikus,

> The TransformationDictionary contains the categorical to one-hot conversions, like this:
>
> <TransformationDictionary>
> <DerivedField dataType="integer" name="A_CAT1" optype="ordinal">
> <NormDiscrete field="CAT1" mapMissingTo="0.0" value="A"/>
> </DerivedField>
> <DerivedField dataType="integer" name="B_CAT1" optype="ordinal">
> <NormDiscrete field="CAT1" mapMissingTo="0.0" value="B"/>
> </DerivedField>
> </TransformationDictionary>
>
> Now, my understanding is that the one-hot transformation will be done automatically
> when I evaluate the model. Is this correct?

Correct - DerivedField elements, which implement the one-hot-encoding
transformation, will be evaluated automatically if they are needed by
the neural network model.

It may happen that KNIME generated a number of DerivedField elements
that are not needed by any model (i.e. they just sit there and waste
space and human attention). It's possible to remove unused
DerivedField elements from the TransformationDictionary (and
LocalTransformations) elements by applying the
org.jpmml.model.visitors.TransformationDictionaryCleaner visitor class
to the PMML document:

org.dmg.pmml.PMML pmml = ...;
org.jpmml.model.visitors.TransformationDictionaryCleaner dictionaryCleaner =
    new org.jpmml.model.visitors.TransformationDictionaryCleaner();
dictionaryCleaner.applyTo(pmml);

>
> The call to evaluator.getInputFields() returns all the numerical input fields
> to the model, but neither the one-hot fields (A_CAT1, B_CAT1, A_CAT2, ...),
> nor as I expected, the fields that should be transformed (CAT1, CAT2, ...).
>

Evaluator#getInputFields() should give you the names of all continuous
and categorical fields that are used by the model directly or
indirectly (ie. after applying some sort of transformation to them).
This list of fields is based on the MiningSchema element of the
top-level Model element.

In your case, the PMML document should have the following structure:
<PMML>
  <DataDictionary>
    <DataField name="CAT1" optype="categorical" dataType="string">
      <Value value="A"/>
      <Value value="B"/>
    </DataField>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="A_CAT1">
      <NormDiscrete field="CAT1" value="A" mapMissingTo="0.0"/>
    </DerivedField>
  </TransformationDictionary>
  <NeuralNetwork>
    <MiningSchema>
      <MiningField name="CAT1"/>
    </MiningSchema>
  </NeuralNetwork>
</PMML>

You should read the above something like this:
1) The NN model needs input field "CAT1" (Select all nodes
/PMML/NeuralNetwork/MiningSchema/MiningField[@usageType = 'active']).
2) The field "CAT1" is a string field, whose valid value space is
either "A" or "B" (Select all nodes
/PMML/DataDictionary/DataField[@name = 'CAT1']/Value[@property =
'valid']).
3) The NN model may operate with the "CAT1" field directly, or operate
with any derived fields (Select nodes from
/PMML/TransformationDictionary) that depend on the "CAT1" field.
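
The first two selections can be tried out with the XPath support that ships with the JDK. What follows is only a sketch under stated assumptions, not how JPMML-Evaluator works internally: the class and method names are made up for illustration, the demo document carries no XML namespace (a real PMML file declares one, which this sketch does not handle), and an absent usageType or property attribute is treated as the PMML default ("active" and "valid" respectively).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of JPMML
class MiningSchemaSketch {

    // Namespace-free copy of the demo document from this thread
    static final String DEMO_PMML =
        "<PMML>"
        + "<DataDictionary>"
        + "<DataField name=\"CAT1\" optype=\"categorical\" dataType=\"string\">"
        + "<Value value=\"A\"/><Value value=\"B\"/>"
        + "</DataField>"
        + "</DataDictionary>"
        + "<TransformationDictionary>"
        + "<DerivedField name=\"A_CAT1\">"
        + "<NormDiscrete field=\"CAT1\" value=\"A\" mapMissingTo=\"0.0\"/>"
        + "</DerivedField>"
        + "</TransformationDictionary>"
        + "<NeuralNetwork>"
        + "<MiningSchema><MiningField name=\"CAT1\"/></MiningSchema>"
        + "</NeuralNetwork>"
        + "</PMML>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression that selects attributes and collects their values
    static List<String> select(Document document, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, document, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 1: names of the active fields in the model's MiningSchema
    static List<String> activeFields(Document document) {
        return select(document,
            "/PMML/NeuralNetwork/MiningSchema/MiningField[not(@usageType) or @usageType='active']/@name");
    }

    // Step 2: valid values of a data field (the field name is assumed to contain no quotes)
    static List<String> validValues(Document document, String field) {
        return select(document,
            "/PMML/DataDictionary/DataField[@name='" + field + "']/Value[not(@property) or @property='valid']/@value");
    }

    public static void main(String[] args) {
        Document document = parse(DEMO_PMML);
        for (String field : activeFields(document)) {
            System.out.println(field + " " + validValues(document, field)); // prints "CAT1 [A, B]"
        }
    }
}
```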

In this demo case, Evaluator#getInputFields() should return a
singleton list, which corresponds to the "CAT1" field. It would be an
error to return an empty list, or a list that additionally contains
the "A_CAT1" derived field.

> Why are the one-hot fields before transformation (CAT1, CAT2, ...)
> not part of the evaluator.getInputFields() collection,

These two fields "CAT1" and "CAT2" must be available there.
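
One way to check a document for this without reading it by eye is to compare the fields that the NormDiscrete elements read from against the fields that the MiningSchema lists. The following is a rough heuristic using only the JDK's DOM parser, not a JPMML feature. The class name, the method name and the "NUM1" field are hypothetical, and the check assumes a single model per document. It will also flag NormDiscrete elements that no model actually uses.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical diagnostic; not part of JPMML
class MiningSchemaCheck {

    // Returns the NormDiscrete source fields that no MiningField element lists
    static Set<String> missingSourceFields(String pmmlXml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)));

            // Collect the names of all MiningField elements
            Set<String> declared = new TreeSet<>();
            NodeList miningFields = document.getElementsByTagName("MiningField");
            for (int i = 0; i < miningFields.getLength(); i++) {
                declared.add(((Element) miningFields.item(i)).getAttribute("name"));
            }

            // Collect the NormDiscrete source fields that are not among them
            Set<String> missing = new TreeSet<>();
            NodeList normDiscretes = document.getElementsByTagName("NormDiscrete");
            for (int i = 0; i < normDiscretes.getLength(); i++) {
                String field = ((Element) normDiscretes.item(i)).getAttribute("field");
                if (!declared.contains(field)) {
                    missing.add(field);
                }
            }
            return missing;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up document whose MiningSchema lists only a numerical field
        String suspect =
            "<PMML>"
            + "<TransformationDictionary>"
            + "<DerivedField name=\"A_CAT1\"><NormDiscrete field=\"CAT1\" value=\"A\"/></DerivedField>"
            + "<DerivedField name=\"A_CAT2\"><NormDiscrete field=\"CAT2\" value=\"A\"/></DerivedField>"
            + "</TransformationDictionary>"
            + "<NeuralNetwork>"
            + "<MiningSchema><MiningField name=\"NUM1\"/></MiningSchema>"
            + "</NeuralNetwork>"
            + "</PMML>";
        System.out.println(missingSourceFields(suspect)); // prints "[CAT1, CAT2]"
    }
}
```

An empty result does not prove the document is valid; it only rules out this one kind of mismatch.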

KNIME is known to produce incorrect/invalid PMML documents in some
cases. I'm fairly sure that the problem is related to the KNIME
generated PMML document, not the JPMML-Evaluator library. So, you
should kick off your debugging efforts by opening the neural network
model in a text editor, and checking if its structure is correct
(specifically, is the MiningSchema element correctly populated with
"CAT1" and "CAT2" MiningField elements?).

> and how do I access them (to be able to use their prepare() methods)
> while populating the input map for evaluator.evaluate(arguments)?
>

If you don't prepare the values of "CAT1" and "CAT2" input fields,
then they will be treated as missing values during model evaluation.

Please note that your NormDiscrete elements contain a special handler
for that situation - a missing value will be converted to 0.0:
<NormDiscrete field="CAT1" mapMissingTo="0.0" value="A"/>
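
To make the consequence concrete, here is a plain-Java sketch of what a NormDiscrete element computes according to the PMML specification: 1.0 when the input equals the configured value, 0.0 when it does not, and the mapMissingTo value when the input is missing. The class and method names are made up for illustration and this is not JPMML-Evaluator code.

```java
// Hypothetical illustration of NormDiscrete semantics; not part of JPMML
class NormDiscreteSketch {

    // input == null stands for a missing value; mapMissingTo == null means the attribute is absent
    static Double normDiscrete(String input, String value, Double mapMissingTo) {
        if (input == null) {
            // Without mapMissingTo the result is itself a missing value (null here)
            return mapMissingTo;
        }
        return input.equals(value) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(normDiscrete("A", "A", 0.0));  // prints "1.0"
        System.out.println(normDiscrete("B", "A", 0.0));  // prints "0.0"
        System.out.println(normDiscrete(null, "A", 0.0)); // prints "0.0"
    }
}
```

So if "CAT1" never reaches the model, both A_CAT1 and B_CAT1 quietly evaluate to 0.0 and the evaluation proceeds without an error, which makes the problem easy to miss.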


VR

Rikus Combrinck

Nov 23, 2016, 8:54:29 AM
to Java PMML API
Hi Villu,

> KNIME is known to produce incorrect/invalid PMML documents in some
> cases. I'm fairly sure that the problem is related to the KNIME
> generated PMML document, not the JPMML-Evaluator library. So, you
> should kick off your debugging efforts by opening the neural network
> model in text editor, and checking if its structure is correct
> (specifically, is the MiningSchema element correctly populated with
> "CAT1" and "CAT2" MiningField elements?).

Thanks for your detailed and very helpful response. I really appreciate it.

Indeed, you are correct. The problem turned out to be the PMML that KNIME generated.

I'm still exploring the issue, but at its core seems to be an ambiguity in how KNIME represents transformed variables. For some transformations the variables are changed in place (i.e. they keep their original names). That means that after a transformation there is conceptually a new set of variables which hides the old ones. The PMML that KNIME produces contains both the original and the transformed variables (the latter indicated by an asterisk appended to the name), but mixes up the references. So, in some cases the PMML refers to the original variables where it should be referring to the transformed ones.

Understandably, this leads to problems when JPMML tries to interpret the PMML file. I will take this up with the KNIME developers.

Thanks also for JPMML. It's really useful.

Regards,
Rikus
