pmml extensions

Benjamin Auffarth

Jan 12, 2016, 6:55:28 AM
to Java PMML API
Hi Villu,
I was reading your blog post about PMML extensions (http://openscoring.io/blog/2015/05/15/jpmml_model_api_vendor_extensions/). I think extensions are very useful for defining pre-processing steps. However, I have to admit that I find the blog post hard to understand without a working example. Looking around the JPMML codebase for similar extension functionality, I found several relevant pieces of code, particularly the definition of the mean function (https://github.com/jpmml/jpmml-evaluator/blob/9fdbf0718be4fb7652f7f9294b4ceb57a38d55cf/pmml-extension/src/main/java/org/jpmml/evaluator/functions/MeanFunction.java). Could you provide an example of how to use the mean function within a PMML file, please?
Cheers!

Villu Ruusmann

Jan 12, 2016, 8:55:08 AM
to Java PMML API
Hi Ben,

> I was reading up on your blog post about pmml extensions. I have
> to admit that I find it hard to understand your blog post without a
> working example.

This blog post is about mixing PMML markup with some other XML markup.
A typical use case is "annotating" existing PMML documents with
company-specific metainformation. Suppose you want to map a DataField
element to a specific database column in your company database.

The DataField element (http://dmg.org/pmml/v4-2-1/DataDictionary.html)
has attributes such as "name" and "displayName", but they are not
particularly suited for the job. The solution would be to define your
own XML dialect for representing such database schema binding
metadata, and embed these elements inside the DataField element.

A DataField element for customer age:
<DataDictionary>
  <DataField name="age" displayName="The age of customer"
      dataType="integer" optype="categorical"/>
</DataDictionary>

The same, when enhanced with database schema binding metadata:
<DataDictionary>
  <DataField name="age" displayName="The age of customer"
      dataType="integer" optype="categorical">
    <Extension>
      <dbs:DatabaseBinding xmlns:dbs="http://mycompany.com/dbs/v1">
        <dbs:Table name="customers"/>
        <dbs:Column name="age"/>
      </dbs:DatabaseBinding>
    </Extension>
  </DataField>
</DataDictionary>

With a little bit of engineering effort, such metadata could be used
to create an auto-assembly workflow, as every model is able to report
its "incoming dependencies".
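
A rough sketch of the consuming side, assuming the binding markup uses the hypothetical http://mycompany.com/dbs/v1 namespace from the example above. The class and method names (DatabaseBindingReader, findBinding) are made up for illustration, and only the JDK's own DOM parser is used here, not the JPMML API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DatabaseBindingReader {

    // Hypothetical namespace URI, copied from the example markup
    static final String DBS_NS = "http://mycompany.com/dbs/v1";

    /**
     * Returns "table.column" for the DataField with the given name,
     * or null if the field is missing or carries no database binding.
     */
    static String findBinding(String pmml, String fieldName) {
        Document document;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is needed to tell dbs:* elements apart from PMML ones
            factory.setNamespaceAware(true);
            document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable XML document", e);
        }

        // "*" matches DataField elements regardless of the PMML namespace version
        NodeList dataFields = document.getElementsByTagNameNS("*", "DataField");
        for (int i = 0; i < dataFields.getLength(); i++) {
            Element dataField = (Element) dataFields.item(i);
            if (!fieldName.equals(dataField.getAttribute("name"))) {
                continue;
            }
            NodeList tables = dataField.getElementsByTagNameNS(DBS_NS, "Table");
            NodeList columns = dataField.getElementsByTagNameNS(DBS_NS, "Column");
            if (tables.getLength() == 0 || columns.getLength() == 0) {
                return null;
            }
            return ((Element) tables.item(0)).getAttribute("name")
                + "." + ((Element) columns.item(0)).getAttribute("name");
        }
        return null;
    }
}
```

Calling findBinding with the enhanced DataDictionary fragment and the field name "age" would yield "customers.age", which a workflow tool could use to decide which table and column to read.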

> I was looking around in the jpmml codebase for a similar extension
> functionality, and found several relevant pieces of code, particularly
> the definition of the mean function. Could you provide an example
> of how to use the mean function within a PMML file, please?

This is a mechanism for working with Java user-defined functions
(UDFs). You would define a Java UDF if you need to carry out a rather
complex data pre-processing operation, which may be impossible or
inefficient to do in terms of PMML built-in functions
(http://dmg.org/pmml/v4-2-1/BuiltinFunctions.html). Suppose you want
to find out the zodiac sign of the customer based on her date of
birth.

A Java UDF is invoked just like any other function, using the Apply
element (http://dmg.org/pmml/v4-2-1/Functions.html#xsdElement_Apply).
<Apply function="com.mycompany.util.ZodiacSignFunction">
  <FieldRef field="dateOfBirth"/>
</Apply>
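
To give an idea of the kind of logic that is awkward to express with built-in functions, here is a minimal sketch of the date-to-sign lookup that such a UDF might wrap. The class name ZodiacSigns is hypothetical, the boundary dates are one common convention (sources differ by a day or so), and the JPMML function interface that the real UDF class would implement is deliberately not shown:

```java
import java.time.LocalDate;
import java.time.MonthDay;

final class ZodiacSigns {

    // Sign names in calendar order, starting with the sign that covers January 1
    private static final String[] NAMES = {
        "Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini",
        "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius"
    };

    // Last day (inclusive) of each sign above; assumed conventional boundaries
    private static final MonthDay[] LAST_DAYS = {
        MonthDay.of(1, 19), MonthDay.of(2, 18), MonthDay.of(3, 20),
        MonthDay.of(4, 19), MonthDay.of(5, 20), MonthDay.of(6, 20),
        MonthDay.of(7, 22), MonthDay.of(8, 22), MonthDay.of(9, 22),
        MonthDay.of(10, 22), MonthDay.of(11, 21), MonthDay.of(12, 21)
    };

    /** Maps a date of birth to a zodiac sign name; the year is ignored. */
    static String of(LocalDate dateOfBirth) {
        MonthDay birthday = MonthDay.from(dateOfBirth);
        for (int i = 0; i < LAST_DAYS.length; i++) {
            if (!birthday.isAfter(LAST_DAYS[i])) {
                return NAMES[i];
            }
        }
        // December 22 to December 31 wraps around to Capricorn
        return NAMES[0];
    }
}
```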

Package "org.jpmml.evaluator.functions" contains a couple of Java UDF
classes for working with collections of numeric values. PMML does not
support looping over the elements of a collection, so these
aggregations had to be implemented as Java UDFs.

Please note that PMML mostly deals with scalar-type values, not
collection-type values. The only place where you can explicitly
request a collection-type value is the Segmentation element, by
setting its "multipleModelMethod" attribute to "selectAll"
(http://dmg.org/pmml/v4-2-1/MultipleModels.html).

For example, if you train a Random Forest (RF) model, then this
attribute is typically set to "average". By changing it to
"selectAll", you will be able to extract more information about the
prediction result:

<MiningModel>
  <Segmentation multipleModelMethod="selectAll">
    <!-- 500 member tree models omitted for brevity -->
  </Segmentation>
  <Output>
    <!-- First, assign the implicit result to a named variable so that
         it can be referenced -->
    <OutputField name="treeScores" feature="predictedValue"/>
    <!-- This corresponds to the default "average" multipleModelMethod
         attribute value -->
    <OutputField name="averageScore" feature="transformedValue">
      <Apply function="org.jpmml.evaluator.functions.MeanFunction">
        <FieldRef field="treeScores"/>
      </Apply>
    </OutputField>
    <OutputField name="medianScore" feature="transformedValue">
      <Apply function="org.jpmml.evaluator.functions.PercentileFunction">
        <FieldRef field="treeScores"/>
        <Constant>50</Constant>
      </Apply>
    </OutputField>
  </Output>
</MiningModel>
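
For intuition, the two derived output fields above amount to roughly the following plain Java over the list of member tree scores. This is only a sketch: TreeScoreStats is a hypothetical class, and the actual PercentileFunction may use a different percentile estimation method than the simple median shown here, so results could differ slightly for even-sized collections:

```java
import java.util.List;

final class TreeScoreStats {

    /** Arithmetic mean of the member scores, as in the "averageScore" field. */
    static double mean(List<Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("No scores");
        }
        double sum = 0d;
        for (double score : scores) {
            sum += score;
        }
        return sum / scores.size();
    }

    /** Simple median (50th percentile), as in the "medianScore" field. */
    static double median(List<Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("No scores");
        }
        double[] sorted = scores.stream()
            .mapToDouble(Double::doubleValue)
            .sorted()
            .toArray();
        int n = sorted.length;
        // Odd count: middle element; even count: average of the two middle elements
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2d;
    }
}
```

Comparing the mean against the median is a quick way to see whether a few member trees are pulling the averaged prediction away from the bulk of the ensemble.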

The JPMML-Evaluator library will be able to pick up Java UDF classes
automatically if their JAR files have been appended to the application
classpath.


VR

Benjamin Auffarth

Jan 13, 2016, 7:35:22 AM
to Villu Ruusmann, Java PMML API
Hi Villu,
thanks a lot for your explanations. The suggested extension for DB fields sounds really good. I'll have a look into that.
Meanwhile, with your explanations, I have been able to extend the pre-processing functionality, so I am happy for now.
Cheers!

