Missing Values and JPMML

peterot

Jul 18, 2014, 11:12:47 AM
to jp...@googlegroups.com
Hi,

I was just experimenting with the JPMML evaluator and I've been having some issues with missing values.

It certainly seems that the PMML spec allows missing values and has many options for dealing with them when they are encountered during model evaluation. However, I couldn't find a way to represent missing values for use in the JPMML evaluator.

The FieldValue class seems to be pretty locked down in terms of null values and will throw an exception if null is used as the value. However, when fields are left out of the context instead (as in the CsvEvaluationExample), MissingFieldExceptions are thrown.

Is there a different approach that should be used for missing field values or are they not currently supported in the JPMML evaluator?

Thanks for any help,

Pete

Villu Ruusmann

Jul 18, 2014, 3:27:20 PM
to jp...@googlegroups.com
Hi Peter,

>
> It certainly seems that the PMML spec allows missing values and has many options
> for dealing with them when they are encountered during model evaluation.
> However, I couldn't find a way to represent missing values for use in the JPMML evaluator.
>

The JPMML-Evaluator does not cut corners here. Everything is
implemented according to the PMML specification.

There are two scenarios:
1) "The field does not exist". That is, java.util.Map#containsKey(K)
returns false;
2) "The field exists, but is mapped to the null value". That is,
java.util.Map#containsKey(K) returns true, but java.util.Map#get(K)
returns a null reference.

If you want to pass a null reference as a field value, simply do the following:
Map<FieldName, Object> arguments = ...
arguments.put(new FieldName("optional_field"), null);

Let's assume that you are scoring data records from a CSV file and
encounter a cell whose value is "N/A". In that case you should still
insert a null reference as a value to the arguments map.
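
A minimal sketch of this advice, under stated assumptions: plain String keys stand in for FieldName so that the example runs without the JPMML libraries, and the list of "missing" tokens is hypothetical. It shows a missing CSV cell being stored as an explicit null value, and how that differs from a key that is absent altogether.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: plain String keys are used instead of org.dmg.pmml.FieldName
// so that the example runs without the JPMML libraries on the classpath.
final class MissingCellExample {

    // Hypothetical set of tokens that the CSV file uses to mark a missing cell.
    static final List<String> MISSING_TOKENS = Arrays.asList("", "N/A", "NA");

    // Builds the arguments map for one data record. Every column gets a key;
    // a missing cell is stored as an explicit null value rather than being skipped.
    static Map<String, Object> toArguments(List<String> header, List<String> cells) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        for (int i = 0; i < header.size(); i++) {
            String cell = cells.get(i);
            arguments.put(header.get(i), MISSING_TOKENS.contains(cell) ? null : cell);
        }
        return arguments;
    }

    public static void main(String[] args) {
        Map<String, Object> arguments = toArguments(
            Arrays.asList("Age", "Occupation"),
            Arrays.asList("38", "N/A"));

        // Scenario 2: the key exists, but is mapped to a null reference.
        System.out.println(arguments.containsKey("Occupation")); // true
        System.out.println(arguments.get("Occupation"));         // null
        // Scenario 1: the key does not exist at all.
        System.out.println(arguments.containsKey("Income"));     // false
    }
}
```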

> The FieldValue class seems to be pretty locked down in terms of null values and will throw
> an exception if null is used as the value. However, when fields are left out of the context
> instead (as in the CsvEvaluationExample), MissingFieldExceptions are thrown.
>

Class org.jpmml.evaluator.FieldValue (not to be confused with
org.dmg.pmml.FieldValue) is a wrapper around a user-provided Java
primitive value. This kind of wrapper is necessary, because a PMML
value has a data type and an operational type (i.e. continuous,
categorical, ordinal). For example, an ordinal String could be created
in Java application code as follows:
OrdinalValue ordinalString = new OrdinalValue(DataType.STRING, "medium");
ordinalString.setOrdering(Arrays.asList("low", "medium", "high"));

However, you should never instantiate classes ContinuousValue,
CategoricalValue or OrdinalValue directly. Please use appropriate
methods from the utility class org.jpmml.evaluator.FieldValueUtil.
Actually, this utility class contains two types of methods. First
there are ordinary object creation methods (#create(...)) and then
there are kind of object casting methods (#refine(...)). For example,
you can cast the above ordinal String to a categorical String as
follows:
CategoricalValue categoricalString =
(CategoricalValue)FieldValueUtil.refine(DataType.STRING,
OpType.CATEGORICAL, ordinalString);

A null FieldValue is simply represented by a null reference. For
example, when you invoke FieldValueUtil#create(DataType.STRING,
OpType.CATEGORICAL, null) you will get back a null reference. As of
PMML schema versions 3.X and 4.X, this is an optimal solution. Maybe
PMML schema version 5.0 will introduce operations on null values
(let's hope not!), and in that case it will be necessary to devise a
new solution.

Anyway, in a typical use scenario, there is no need to engage with
FieldValues in your Java application code. In the JPMML-Evaluator
1.1.X series the whole interaction with the PMML scoring engine can be
fit into a single line of Java source code (see
http://openscoring.io/blog/2014/05/15/jpmml_evaluator_api_prepare_evaluate/
chapter "Option 2: lazy preparation"). You simply pass your arguments
as a Map<FieldName, String>, and you'll get back a Map<FieldName, ?>
(where the value type is either an instance of
org.jpmml.evaluator.Computable or a Java primitive type).

Did that answer your question? If you have a PMML model that throws an
MissingFieldException, then it simply "declares" that it requires that
all fields are mapped to non-missing values. I could probably give you
a more detailed answer if you sent me (privately) the PMML file
together with a few problematic lines from the CSV file.


VR

Peter Ottery

Jul 19, 2014, 7:00:11 AM
to Villu Ruusmann, jp...@googlegroups.com
Hi Villu,

Thanks for looking into this.

Actually the examples I was using are all freely available online so I'm happy to share links to them. I found the issue with the DMG Audit tree example (http://www.dmg.org/pmml_examples/rattle_pmml_examples/AuditTree.xml) which uses the Audit dataset (http://www.dmg.org/pmml_examples/#Audit).

As a quick test I tried evaluating this model using the JPMML evaluator's CSV evaluation example:

java -cp example-1.1-SNAPSHOT.jar org.jpmml.evaluator.CsvEvaluationExample --model c:\temp\JPMML\fxcopreport.xml --input c:\temp\JPMML\Audit.csv --output output.csv

The CsvEvaluationExample builds up the context, leaving FieldNames out of the arguments map completely if their values don't exist. This then results in the following exception during evaluation:

Exception in thread "main" org.jpmml.evaluator.MissingFieldException (at or around line 133): Occupation
        at org.jpmml.evaluator.PredicateUtil.evaluateSimpleSetPredicate(PredicateUtil.java:194)
        at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:62)
        at org.jpmml.evaluator.PredicateUtil.evaluateCompoundPredicate(PredicateUtil.java:145)
        at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:58)
        at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:223)
        at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:185)
        at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:195)
        at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:134)
        at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:106)
        at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:81)
        at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:68)
        at org.jpmml.evaluator.CsvEvaluationExample.evaluateAll(CsvEvaluationExample.java:226)
        at org.jpmml.evaluator.CsvEvaluationExample.execute(CsvEvaluationExample.java:97)
        at org.jpmml.evaluator.Example.execute(Example.java:45)
        at org.jpmml.evaluator.CsvEvaluationExample.main(CsvEvaluationExample.java:72)

The issue is that the FieldValue is retrieved from the map, and if it is null, either because the FieldName is missing or because the FieldName is associated with a null reference, the exception is thrown.

Just to check, I updated the CSV example to add the missing fields as (FieldName, null) pairs. However, the exception is the same.

The model has missingValueStrategy="defaultChild" so I would expect it to allow null values.

What do you think? Is there some other issue?

Thanks,

Pete

Villu Ruusmann

Jul 19, 2014, 5:07:36 PM
to jp...@googlegroups.com
Hi Peter,


> As a quick test I tried evaluating this model using the JPMML evaluator's
> CSV evaluation example:
>
> java -cp example-1.1-SNAPSHOT.jar org.jpmml.evaluator.CsvEvaluationExample
> --model c:\temp\JPMML\fxcopreport.xml --input c:\temp\JPMML\Audit.csv
> --output output.csv
>

I analyzed this issue and fixed two things.

First, contrary to the advice that I gave you yesterday, the application
class CsvEvaluationExample was skipping cells with null values. The
sample data file Audit.csv contains a total of 2000 data records, and 141
of them specify "NA" values. Previously, the size of the arguments map
was less than 9 when a data record contained missing field values.
Now, the size of the arguments map is always 9 (viz. the MiningSchema
element contains nine MiningField elements whose usageType attribute
is "active"):
https://github.com/jpmml/jpmml-evaluator/commit/5feb0e4825b6b3ffac8ab1eb6f896e929c279ca5

Second, it was incorrect behavior that the method
org.jpmml.evaluator.PredicateUtil#evaluateSimpleSetPredicate(SimpleSetPredicate,
EvaluationContext) threw an instance of MissingFieldException if the
field value was missing. Just like all other predicate evaluation
methods, it should have returned a null reference, which corresponds
to the "unknown" state:
https://github.com/jpmml/jpmml-evaluator/commit/5583bd0a6ce9dff1d80967d4022052d0c2da4247
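
The idea of an "unknown" predicate result can be sketched as follows. This is a stand-in and not the PredicateUtil source code: the class and method names are hypothetical, String keys replace FieldName, and the occupation values are only illustrative. A null Boolean plays the role of the unknown state.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in sketch of three-valued predicate evaluation; this is not the
// PredicateUtil source code, and the names below are hypothetical.
final class SetPredicateSketch {

    // Returns TRUE or FALSE when the field has a value, and null ("unknown")
    // when the field is absent or mapped to a null reference.
    static Boolean isIn(Map<String, Object> arguments, String field, Set<String> values) {
        Object value = arguments.get(field);
        if (value == null) {
            return null; // unknown, not an error
        }
        return values.contains(value.toString());
    }

    public static void main(String[] args) {
        Set<String> occupations = new HashSet<>(Arrays.asList("Executive", "Professional"));
        Map<String, Object> arguments = new HashMap<>();

        arguments.put("Occupation", null);
        System.out.println(isIn(arguments, "Occupation", occupations)); // null

        arguments.put("Occupation", "Executive");
        System.out.println(isIn(arguments, "Occupation", occupations)); // true
    }
}
```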

If you run your test(s) again with the latest GitHub checkout, then
the evaluation completes successfully.

>
> The model has missingValueStrategy="defaultChild" so I would expect it to
> allow null values.
>

Actually, the missingValueStrategy clause is never triggered with the
sample data, because SimpleSetPredicate elements are nested inside
CompoundPredicate elements whose boolean operator is "surrogate". For
example, let's consider the CompoundPredicate element on line 132 of
the sample model file AuditTree.xml. If the "Occupation" field is
missing, then the SimpleSetPredicate element on line 133 evaluates to
unknown. However, the evaluation does not terminate with an error
here, but continues with the next predicate. If the "Education" field
exists, then the SimpleSetPredicate element on line 136 returns a
non-null value, which terminates the traversal.
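
The surrogate behavior described above can be sketched roughly as follows. This is an illustration with hypothetical names, not the JPMML-Evaluator code: each child predicate is modeled as a Supplier that returns TRUE, FALSE, or null for unknown, and the first result that is not unknown wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Stand-in sketch of a CompoundPredicate with the "surrogate" operator.
final class SurrogateSketch {

    // Evaluates the child predicates in order and returns the first result
    // that is not unknown (null). Returns null if every child is unknown.
    static Boolean surrogate(List<Supplier<Boolean>> predicates) {
        for (Supplier<Boolean> predicate : predicates) {
            Boolean result = predicate.get();
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Supplier<Boolean>> predicates = new ArrayList<>();
        predicates.add(() -> null);          // "Occupation" is missing: unknown
        predicates.add(() -> Boolean.TRUE);  // "Education" is present and matches
        System.out.println(surrogate(predicates)); // true
    }
}
```

Only when every child predicate is unknown does the compound predicate itself come out as unknown, and that is the point at which a missing value strategy would come into play.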

The missingValueStrategy is probably triggered if you invoke the
evaluator with an empty arguments map. Currently, the JPMML-Evaluator
library would fail with an UnsupportedFeatureException, because the
"defaultChild" strategy is not implemented. Basic support for the
TreeModel element was introduced in one of the earliest
JPMML-Evaluator versions, but it was never completed (because there
was so much other critical work to do). I believe that now it's time
to go back and fix it.

If you want to learn more about JPMML-Evaluator, then I would
definitely suggest that you write your own version of the
CsvEvaluationExample application. My code is rather complex and
inelegant, because it is ready to perform the grouping of data rows
(this functionality is only needed when evaluating association rules
models). If you skip all this complexity, then you can perform the
evaluation line by line, which gives you better diagnostics about where
and for what reason failures occur.
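
A line-by-line evaluation loop could look roughly like this. It is only a sketch: the scorer function is a hypothetical stand-in for whatever parses one CSV line and invokes the evaluator, and the report format is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Stand-in sketch of per-line scoring with per-line diagnostics.
final class LineByLineSketch {

    // Scores each line independently. A failure is recorded together with its
    // 1-based line number and does not stop the remaining lines from being scored.
    static List<String> scoreAll(List<String> lines, Function<String, String> scorer) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            try {
                report.add("line " + (i + 1) + ": " + scorer.apply(lines.get(i)));
            } catch (RuntimeException e) {
                report.add("line " + (i + 1) + ": FAILED (" + e.getMessage() + ")");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Hypothetical scorer that rejects any line containing an "NA" cell.
        Function<String, String> scorer = line -> {
            if (line.contains("NA")) {
                throw new IllegalStateException("missing value");
            }
            return "ok";
        };
        for (String entry : scoreAll(Arrays.asList("38,Private", "NA,Private"), scorer)) {
            System.out.println(entry);
        }
    }
}
```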


VR

Peter Ottery

Jul 20, 2014, 4:01:41 PM
to Villu Ruusmann, jpmml
Hi Villu,

Thanks for fixing those issues. Is there anywhere where there is a list of all the PMML features which are supported/unsupported in the JPMML evaluator?

Thanks,

Pete

Villu Ruusmann

Jul 20, 2014, 5:33:49 PM
to jpmml
Hi Peter,

>
> Thanks for fixing those issues.

I went on with my work and implemented full support for the
'defaultChild' missing value strategy type as well:
https://github.com/jpmml/jpmml-evaluator/commit/d6ae68d2db1725f322172449b0959c693550a28b
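
For readers unfamiliar with this strategy, here is a simplified sketch of the general idea. It is not the code from the commit above. The Node type, the helper methods and the example tree are all hypothetical, and returning the current node's score when no child matches is a simplification made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in for a PMML TreeModel; none of these types come from JPMML.
final class DefaultChildSketch {

    static final class Node {
        final String id;
        final String score;
        // Returns TRUE, FALSE, or null when the outcome is unknown.
        final Function<Map<String, Object>, Boolean> predicate;
        final List<Node> children = new ArrayList<>();
        String defaultChild; // id of the child to follow on an unknown outcome

        Node(String id, String score, Function<Map<String, Object>, Boolean> predicate) {
            this.id = id;
            this.score = score;
            this.predicate = predicate;
        }
    }

    // Predicate "field < threshold" that is unknown when the field is missing.
    static Function<Map<String, Object>, Boolean> lessThan(String field, int threshold) {
        return arguments -> {
            Object value = arguments.get(field);
            if (value == null) {
                return null;
            }
            return ((Integer) value) < threshold;
        };
    }

    static String score(Node node, Map<String, Object> arguments) {
        for (Node child : node.children) {
            Boolean result = child.predicate.apply(arguments);
            if (result == null) {
                // Unknown outcome: jump to the default child without testing its predicate.
                return score(findChild(node, node.defaultChild), arguments);
            }
            if (result) {
                return score(child, arguments);
            }
        }
        // Leaf node, or no child matched (simplification: report this node's score).
        return node.score;
    }

    static Node findChild(Node parent, String id) {
        for (Node child : parent.children) {
            if (child.id.equals(id)) {
                return child;
            }
        }
        throw new IllegalStateException("No child node with id " + id);
    }

    // Hypothetical two-leaf tree: Age < 30 scores "low", everything else "high".
    static Node exampleTree() {
        Node root = new Node("root", null, arguments -> Boolean.TRUE);
        root.children.add(new Node("young", "low", lessThan("Age", 30)));
        root.children.add(new Node("other", "high", arguments -> Boolean.TRUE));
        root.defaultChild = "young";
        return root;
    }

    public static void main(String[] args) {
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("Age", null);
        System.out.println(score(exampleTree(), arguments)); // low (default child)
        arguments.put("Age", 45);
        System.out.println(score(exampleTree(), arguments)); // high
    }
}
```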

That leaves two more missing value strategies to do -
`weightedConfidence` and `aggregateNodes`. They seem like marginal
features and I will probably ignore them for some more time.

> Is there anywhere where there is a list of
> all the PMML features which are supported/unsupported in the JPMML
> evaluator?
>

The README.md file in the root directory is pretty much up to date.
When something is declared to be fully supported, then it means that
at least 90-95% of the functionality is implemented. Indeed, you
managed to prove that this percentage was somewhat lower for
TreeModel, but it really was an exception :-) I would say that
TreeModel is one of the most complex model types in the PMML
specification, plus it was implemented in the very beginning and not
revisited at a later time.

Anyway, when speaking about support, then we need to consider two levels:
1) Model type. The latest version of the PMML specification declares
15 top-level model elements. The JPMML-Evaluator library implements 11
of them. Currently, JPMML-Evaluator will throw an
UnsupportedFeatureException if you try to evaluate a PMML document that
contains a 1) BaselineModel, 2) SequenceModel, 3) TextModel or 4)
TimeSeriesModel element. Also, the CoxRegression subtype of the
GeneralRegressionModel element is not supported.
2) Model feature. A model element may contain additional markup that
specifies additional functionality. For example, every missing value
strategy type is a feature of the TreeModel element. The
JPMML-Evaluator library is very "defensive" in this regard. During the
evaluation of a data record, it throws an UnsupportedFeatureException
whenever or wherever it sees an unsupported piece of PMML markup.

You may search around the codebase and see where
org.jpmml.manager.UnsupportedFeatureException is thrown. You will see
that it mostly happens in the default case of switch statements, or
the else clause of if statements. It doesn't mean that all control
structures are deficient. It's simply a convenient way of being
future-compatible. For example, if the PMML schema version 5.0
introduces another missing value strategy type then it will be
automatically trapped in the default case.
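
The pattern can be sketched as follows. This is an illustration only: the enum is a hypothetical stand-in that lists the six missingValueStrategy values of the PMML TreeModel element, and the JDK's UnsupportedOperationException is used in place of org.jpmml.manager.UnsupportedFeatureException so that the example runs without the JPMML libraries.

```java
// Stand-in sketch of the "default case throws" pattern; not JPMML code.
final class DefensiveSwitchSketch {

    // The six values of the PMML TreeModel missingValueStrategy attribute.
    enum MissingValueStrategy {
        NONE, LAST_PREDICTION, NULL_PREDICTION, DEFAULT_CHILD, WEIGHTED_CONFIDENCE, AGGREGATE_NODES
    }

    static String handle(MissingValueStrategy strategy) {
        switch (strategy) {
            case NONE:
            case LAST_PREDICTION:
            case NULL_PREDICTION:
            case DEFAULT_CHILD:
                return "handled " + strategy;
            default:
                // Anything not listed above lands here, including values
                // that a future schema version might add to the enum.
                throw new UnsupportedOperationException("Unsupported strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(MissingValueStrategy.DEFAULT_CHILD)); // handled DEFAULT_CHILD
        try {
            handle(MissingValueStrategy.AGGREGATE_NODES);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage()); // Unsupported strategy: AGGREGATE_NODES
        }
    }
}
```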

I have a TODO list of a few unsupported features that should be
implemented. However, I haven't bothered to import them into the GitHub
issue tracker - it typically takes less time to code the solution
than to describe textually what exactly is missing and how to fix it.

Anyway, I intend to release JPMML-Evaluator version 1.1.7 at the end
of this week (i.e. 27 July, 2014). It will include all TreeModel
improvements, and I hope to add something more to it - I will probably
implement support for the CoxRegression subtype of the
GeneralRegressionModel element.


VR