The model is a neural network whose inputs include both numerical fields and a number of one-hot encoded categorical fields. I've included the transformation from categorical to one-hot values in the PMML document created by KNIME.
The top-level view of the PMML file looks like this:
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header copyright="rikus">...</Header>
  <DataDictionary numberOfFields="14">...</DataDictionary>
  <TransformationDictionary>...</TransformationDictionary>
  <NeuralNetwork functionName="regression" ...>...</NeuralNetwork>
</PMML>
The TransformationDictionary contains the categorical to one-hot conversions, like this:
<TransformationDictionary>
  <DerivedField dataType="integer" name="A_CAT1" optype="ordinal">
    <NormDiscrete field="CAT1" mapMissingTo="0.0" value="A"/>
  </DerivedField>
  <DerivedField dataType="integer" name="B_CAT1" optype="ordinal">
    <NormDiscrete field="CAT1" mapMissingTo="0.0" value="B"/>
  </DerivedField>
  <DerivedField dataType="integer" name="A_CAT2" optype="ordinal">
    <NormDiscrete field="CAT2" mapMissingTo="0.0" value="A"/>
  </DerivedField>
  <DerivedField dataType="integer" name="B_CAT2" optype="ordinal">
    <NormDiscrete field="CAT2" mapMissingTo="0.0" value="B"/>
  </DerivedField>
  ...
</TransformationDictionary>
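For reference, the PMML specification defines NormDiscrete as an indicator: it yields 1.0 when the input field equals the given value, 0.0 otherwise, and the mapMissingTo value when the input is missing. The following is only a minimal sketch of that behaviour in plain Java, to make clear what each DerivedField above is expected to compute. The class and method names are hypothetical and are not part of JPMML or KNIME, and a missing input is modelled here as null.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of JPMML-Evaluator or KNIME.
class NormDiscreteSketch {

    // Mirrors <NormDiscrete field="..." value="..." mapMissingTo="..."/>:
    // 1.0 if the input equals the target category, 0.0 otherwise,
    // and mapMissingTo if the input is missing (modelled as null).
    static double normDiscrete(String input, String category, double mapMissingTo) {
        if (input == null) {
            return mapMissingTo;
        }
        return input.equals(category) ? 1.0 : 0.0;
    }

    // Expands one raw categorical value into indicator columns named
    // like the DerivedField elements above (A_CAT1, B_CAT1, ...).
    static Map<String, Double> oneHot(String field, String input, String... categories) {
        Map<String, Double> derived = new LinkedHashMap<>();
        for (String category : categories) {
            derived.put(category + "_" + field, normDiscrete(input, category, 0.0));
        }
        return derived;
    }

    public static void main(String[] args) {
        // CAT1 = "B" should switch on B_CAT1 only.
        System.out.println(oneHot("CAT1", "B", "A", "B"));
        // A missing CAT1 falls back to mapMissingTo (0.0) for every column.
        System.out.println(oneHot("CAT1", null, "A", "B"));
    }
}
```

In other words, the caller would only ever supply the raw CAT1 and CAT2 values; the indicator columns are derived from them.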
Now, my understanding is that the one-hot transformation will be done automatically when I evaluate the model. Is this correct?
The problem occurs when I try to create an input map to supply to the evaluator in this manner:
Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
List<InputField> inputFields = evaluator.getInputFields();
for (InputField inputField : inputFields) {
    FieldName inputFieldName = inputField.getName();
    // The raw (i.e. user-supplied) value could be any Java primitive value
    Object rawValue = ...;
    // The raw value is passed through: 1) outlier treatment, 2) missing value treatment, 3) invalid value treatment and 4) type conversion
    FieldValue inputFieldValue = inputField.prepare(rawValue);
    arguments.put(inputFieldName, inputFieldValue);
}
The call to evaluator.getInputFields() returns all the numerical input fields of the model, but neither the one-hot fields (A_CAT1, B_CAT1, A_CAT2, ...) nor, as I had expected, the fields that should be transformed (CAT1, CAT2, ...).
Why are the original categorical fields (CAT1, CAT2, ...), i.e. the ones that exist before the one-hot transformation, not part of the evaluator.getInputFields() collection, and how do I access them (to be able to use their prepare() methods) while populating the input map for evaluator.evaluate(arguments)?
Your help would be much appreciated.
> KNIME is known to produce incorrect/invalid PMML documents in some
> cases. I'm fairly sure that the problem is related to the KNIME
> generated PMML document, not the JPMML-Evaluator library. So, you
> should kick off your debugging efforts by opening the neural network
> model in text editor, and checking if its structure is correct
> (specifically, is the MiningSchema element correctly populated with
> "CAT1" and "CAT2" MiningField elements?).
Thanks for your detailed and very helpful response. I really appreciate it.
Indeed, you are correct. The problem turned out to be the PMML that KNIME generated.
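The check suggested above (is the MiningSchema populated with "CAT1" and "CAT2" MiningField elements?) can also be done programmatically instead of by eye. What follows is a minimal sketch using only the JDK's XML parser, not JPMML itself. The class name, the method name and the trimmed sample document (including the field name NUM1) are hypothetical; a real KNIME export contains far more.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; uses the JDK DOM parser only, not JPMML-Evaluator.
class MiningSchemaCheck {

    // Collects the name attribute of every MiningField element in the document.
    static Set<String> miningFieldNames(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PMML elements live in a namespace
            Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            Set<String> names = new LinkedHashSet<>();
            // "*" matches any namespace, so this works across PMML versions.
            NodeList fields = document.getElementsByTagNameNS("*", "MiningField");
            for (int i = 0; i < fields.getLength(); i++) {
                names.add(((Element) fields.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the PMML document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed document in which CAT2 is missing.
        String pmml = "<PMML xmlns=\"http://www.dmg.org/PMML-4_2\" version=\"4.2\">"
            + "<NeuralNetwork functionName=\"regression\">"
            + "<MiningSchema>"
            + "<MiningField name=\"NUM1\"/>"
            + "<MiningField name=\"CAT1\"/>"
            + "</MiningSchema>"
            + "</NeuralNetwork>"
            + "</PMML>";
        Set<String> declared = miningFieldNames(pmml);
        for (String expected : new String[] {"CAT1", "CAT2"}) {
            System.out.println(expected + " declared: " + declared.contains(expected));
        }
    }
}
```

Note that this collects MiningField elements from the whole document, so for a file with several models or nested segments the lookup would need to be restricted to the MiningSchema of the model in question.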
I'm still exploring the issue, but it seems that the core of the problem is ambiguity about how transformed variables are represented in KNIME. For some transformations the variables are changed in place (i.e. they keep their original names). That means that after a transformation there is conceptually a new set of variables which hides the old ones. The PMML that KNIME produces recognises both the original and the transformed variables (the latter indicated with an asterisk appended to the name), but mixes the references up. So, in some cases the PMML refers to the original variables where it should be referring to the transformed ones.
Understandably, this leads to problems when JPMML tries to interpret the PMML file. I will take this up with the KNIME developers.
Thanks also for JPMML. It's really useful.
Regards,
Rikus