Using JPMML to produce PMML models

jayati tiwari

Jun 12, 2014, 8:36:16 AM
to jp...@googlegroups.com
Hello,

I have been using JPMML to run algorithms on distributed platforms such as Storm and Spark, but I haven't been able to figure out how to use JPMML to produce my own PMML model files for data sets other than the ones in the jpmml-rattle project (the audit, iris and ozone data sets).

Let's say I have an entirely different data set of the following form:

"Plane","XCoordinate","YCoordinate"
0.0,0.7800144346305873,1.6512542456242612
1.0,3.3192955924982677,4.664828345688715
0.0,-0.9059493298933676,-0.42207747354389447
1.0,3.1776956110847916,1.1393123509452483
0.0,-0.5246202787832872,1.0246845701853746

and so on. How can I generate a PMML model that runs a Naive Bayes classifier on this data set?

I think I am missing something. Can somebody give me some pointers on this?

Also, does anyone know which tools, apart from Augustus, support generating PMML model files?

Regards,
Jayati

Villu Ruusmann

Jun 12, 2014, 11:50:17 AM
to jp...@googlegroups.com
Hi Jayati,

>
> I have been using JPMML to run algorithms on distributed platforms such as Storm and Spark,
> but I haven't been able to figure out how to use JPMML to produce my own PMML model files for
> data sets other than the ones in the jpmml-rattle project (the audit, iris and ozone data sets).
>

PMML production can be handled using the JPMML-Model library
(https://github.com/jpmml/jpmml-model), which provides a so-called
class model for representing PMML schema version 3.0 through 4.2
documents. Please note that JPMML-Model does not contain any machine
learning functionality. You still need external software for that.

For example, a Naive Bayes model can be trained using Apache
Spark/MLlib. The trained model is represented as an instance of class
org.apache.spark.mllib.classification.NaiveBayesModel
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel).
As of today, MLlib does not provide built-in tools for converting
from this model representation to the PMML representation. You would
have to write the converter yourself.

The benefits of using the JPMML-Model library over traditional DOM
manipulation are the following:
1) Statically typed class model. You can't add elements/attributes
where they do not belong. No room for spelling errors.
2) Value constructors. You can't miss required elements/attributes.
3) Fluent API. The code is relatively readable and maintainable.
Full IDE autocomplete support.
4) Automated translation between PMML schema versions. The class model
corresponds to the latest PMML 4.2 specification. If there is a need
to back-port a document to an earlier schema version, this can be
done automatically (or, when a feature is not back-portable, an
appropriate exception is thrown).
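
For contrast, here is a rough sketch of what the "traditional DOM manipulation" route looks like when it uses nothing but the JDK's XML classes. The class and method names are made up for illustration, and the three fields are taken from the data set in the question. Every element and attribute name is a plain string, so nothing at compile time would catch a typo such as "DataFeild" or a forgotten "optype" attribute.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class: a PMML skeleton built with plain DOM calls from the JDK.
final class DomPmmlSketch {

    static final String PMML_NS = "http://www.dmg.org/PMML-4_2";

    // Builds PMML > (Header, DataDictionary > DataField x 3). No model element yet.
    static Document buildSkeleton() {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create an empty DOM document", e);
        }
        Element pmml = doc.createElementNS(PMML_NS, "PMML");
        pmml.setAttribute("version", "4.2");
        doc.appendChild(pmml);
        pmml.appendChild(doc.createElementNS(PMML_NS, "Header"));

        Element dictionary = doc.createElementNS(PMML_NS, "DataDictionary");
        pmml.appendChild(dictionary);
        dictionary.appendChild(dataField(doc, "Plane", "categorical", "string"));
        dictionary.appendChild(dataField(doc, "XCoordinate", "continuous", "double"));
        dictionary.appendChild(dataField(doc, "YCoordinate", "continuous", "double"));
        return doc;
    }

    // All names and values are untyped strings; the compiler cannot validate them.
    static Element dataField(Document doc, String name, String optype, String dataType) {
        Element field = doc.createElementNS(PMML_NS, "DataField");
        field.setAttribute("name", name);
        field.setAttribute("optype", optype);
        field.setAttribute("dataType", dataType);
        return field;
    }

    // Serializes the DOM tree to an indented XML string.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(buildSkeleton()));
    }
}
```

With a typed class model, the same three fields would be objects whose required properties are constructor arguments, which is the point of items 1) and 2) above.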

> Let's say I have an entirely different data set of the following form:
>
> "Plane","XCoordinate","YCoordinate"
> 0.0,0.7800144346305873,1.6512542456242612
> 1.0,3.3192955924982677,4.664828345688715
> 0.0,-0.9059493298933676,-0.42207747354389447
> 1.0,3.1776956110847916,1.1393123509452483
> 0.0,-0.5246202787832872,1.0246845701853746
>
> and so on. How can I generate a PMML model that runs a Naive Bayes classifier on this data set?
>

The following R code trains a Naive Bayes classification model
(treating "Plane" as the dependent variable and "XCoordinate" and
"YCoordinate" as the independent variables) and exports it to PMML:

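library("e1071") # provides naiveBayes()
library("pmml") # provides pmml(); saveXML() is from the XML package that it depends on

data = read.table(file = "nb.csv", header = TRUE, sep = ",")

# convert the target from num to factor so that it is treated as a class label
data[, "Plane"] = as.factor(data[, "Plane"])

# train the model; the variable is called nb so that it does not shadow naiveBayes()
nb = naiveBayes(Plane ~ XCoordinate + YCoordinate, data)

# export to PMML; predictedField names the target field
saveXML(pmml(nb, predictedField = "Plane"), "nb.pmml")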
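
To give an idea of what ends up inside such a model: for continuous inputs like "XCoordinate", a Gaussian Naive Bayes model essentially stores a mean and a variance per class (as far as I know, PMML 4.2 records these as Gaussian distribution statistics per target value; check the NaiveBayes section of the specification). The Java sketch below computes those numbers for the five sample rows. The class and method names are made up, the n - 1 variance denominator is an assumption chosen to match R's sd(), and this is not the code that e1071 or the "pmml" package actually runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: per-class mean and sample variance of one numeric column.
final class GaussianStatsSketch {

    // The five sample rows from the question: Plane, XCoordinate, YCoordinate.
    static final double[][] ROWS = {
        {0.0, 0.7800144346305873, 1.6512542456242612},
        {1.0, 3.3192955924982677, 4.664828345688715},
        {0.0, -0.9059493298933676, -0.42207747354389447},
        {1.0, 3.1776956110847916, 1.1393123509452483},
        {0.0, -0.5246202787832872, 1.0246845701853746},
    };

    // Returns class label -> {mean, variance} for the given column.
    // Column 0 is assumed to hold the class label.
    static Map<Double, double[]> statsByClass(double[][] rows, int column) {
        // Group the column values by class label.
        Map<Double, List<Double>> groups = new LinkedHashMap<>();
        for (double[] row : rows) {
            groups.computeIfAbsent(row[0], label -> new ArrayList<>()).add(row[column]);
        }
        Map<Double, double[]> stats = new LinkedHashMap<>();
        for (Map.Entry<Double, List<Double>> entry : groups.entrySet()) {
            List<Double> values = entry.getValue();
            double mean = 0.0;
            for (double value : values) {
                mean += value;
            }
            mean /= values.size();
            double squares = 0.0;
            for (double value : values) {
                squares += (value - mean) * (value - mean);
            }
            // Sample variance (n - 1); undefined for a single observation.
            double variance = values.size() > 1 ? squares / (values.size() - 1) : Double.NaN;
            stats.put(entry.getKey(), new double[] {mean, variance});
        }
        return stats;
    }

    public static void main(String[] args) {
        String[] names = {"Plane", "XCoordinate", "YCoordinate"};
        for (int column = 1; column <= 2; column++) {
            for (Map.Entry<Double, double[]> entry : statsByClass(ROWS, column).entrySet()) {
                System.out.println(names[column] + " | Plane=" + entry.getKey()
                    + " mean=" + entry.getValue()[0] + " variance=" + entry.getValue()[1]);
            }
        }
    }
}
```

A converter would then write these per-class statistics into the PMML document instead of printing them.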

>
> Also, does anyone know which tools, apart from Augustus, support generating PMML model files?
>

Any decent statistics/machine learning software should possess this
capability nowadays. The list of PMML-capable software (column "PMML
producer") is given at DMG.org's website at:
http://www.dmg.org/products.html


VR

jayati tiwari

Jun 17, 2014, 2:32:51 AM
to jp...@googlegroups.com
Hi VR,

Thanks so much for the detailed reply. It really helped.

I wanted to know how good an idea it would be to have a library of:

1. Converters from the library-specific, non-PMML models produced by various ML libraries (Mahout, Weka, Python libraries such as Milk, PyBrain and MLPY, etc.) to PMML models.

2. Converters in the other direction, from PMML models to library-specific models (e.g. an XYZ.model file for Mahout).

What would you suggest?

Regards,
Jayati

Villu Ruusmann

Jun 17, 2014, 9:39:53 AM
to jp...@googlegroups.com
Hi Jayati,

>
> I wanted to know how good an idea it would be to have a library of:
>

That sounds like an excellent idea :-) I bet several people have had
this idea before, but so far nobody has bothered to implement it in
code.

> 1. Converters from the library-specific, non-PMML models produced by various ML libraries
> (Mahout, Weka, Python libraries such as Milk, PyBrain and MLPY, etc.) to PMML models.
>

The JPMML family of libraries comes in handy when you are dealing with
the Java language and other JVM languages such as Scala and Clojure.
It is probably not much help when you are dealing with non-JVM
languages such as Python (you may, however, see if/how it works in
Jython).

If you search GitHub, you will find several software projects that
already use the JPMML-Model library for converting from internal model
representations to the PMML representation.

For example, Cloudera Oryx (similar to Mahout) can export several of
its model types to PMML:
https://github.com/cloudera/oryx

Also, FeedZai Open Scoring Server (FOS) appears to perform two-way
conversion between the classification-type Weka RandomForest model
type and PMML:
https://github.com/feedzai/fos-weka

Specifically, check out the following two files:
https://github.com/feedzai/fos-weka/blob/master/src/main/java/weka/classifiers/trees/RandomForestPMMLProducer.java
https://github.com/feedzai/fos-weka/blob/master/src/main/java/weka/classifiers/trees/RandomForestPMMLConsumer.java

I have thought about leveraging my PMML experience by creating PMML
producers for popular Java machine learning software. The prime
targets appear to be Spark, Mahout and Weka. However, I would be
interested in finding some financial support for this work. This way
the new code could be released under the Apache License, version 2.0
(or similar).

> 2. Converters in the other direction, from PMML models to library-specific models (e.g. an XYZ.model file for Mahout).
>

The conversion from the PMML representation to some internal model
representation is much more difficult, because a PMML document may
contain unsupported "vocabulary". For example, Spark/MLlib supports
logistic regression models that implement only(?) binary
classification. However, PMML supports logistic regression models that
implement both binary and multi-class classification. So, if you have
a PMML document that contains a model that implements multi-class
classification, you have no way of "explaining" that to Spark/MLlib.
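
A consumer-side converter would therefore need some kind of "can I represent this?" check before attempting the conversion. Below is a minimal sketch in plain Java, assuming a binary-only target library and relying on the fact that, as far as I know, a classification-type RegressionModel carries one RegressionTable element per target category. The class and method names are hypothetical, and a real check would have to deal with many more cases (multiple models, ensembles, transformations).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical guard for a consumer that can only represent binary classifiers.
final class BinaryOnlyCheckSketch {

    // Counts RegressionTable elements in any namespace, so that documents of
    // different PMML schema versions are treated alike.
    static int countTargetCategories(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)));
            return doc.getElementsByTagNameNS("*", "RegressionTable").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable XML document", e);
        }
    }

    // Rejects documents that the (assumed) binary-only target cannot express.
    static void requireBinary(String pmmlXml) {
        int categories = countTargetCategories(pmmlXml);
        if (categories != 2) {
            throw new UnsupportedOperationException(
                "Expected 2 target categories, found " + categories);
        }
    }

    public static void main(String[] args) {
        // Illustrative fragment only; it is not a schema-valid PMML document.
        String threeClasses =
            "<PMML xmlns=\"http://www.dmg.org/PMML-4_2\" version=\"4.2\">"
            + "<RegressionModel functionName=\"classification\">"
            + "<RegressionTable intercept=\"0.1\" targetCategory=\"a\"/>"
            + "<RegressionTable intercept=\"0.2\" targetCategory=\"b\"/>"
            + "<RegressionTable intercept=\"0.0\" targetCategory=\"c\"/>"
            + "</RegressionModel></PMML>";
        try {
            requireBinary(threeClasses);
            System.out.println("Convertible");
        } catch (UnsupportedOperationException e) {
            System.out.println("Not convertible: " + e.getMessage());
        }
    }
}
```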

The first goal should be to pursue simple conversion workflows such as
MLlib -> PMML -> MLlib. Further goals may include more complex
inter-software conversion workflows such as (R | Weka) -> PMML ->
MLlib.


VR

jayati tiwari

Jun 18, 2014, 2:41:01 AM
to jp...@googlegroups.com
Hi VR,

Thanks a ton. I really appreciate your prompt replies.

Having PMML producers for Spark and Mahout sounds very interesting. I guess Weka already has the support.

I will start researching along these lines. I have done a bit of work on Spark, so I might start with that.

Beyond that, I will also check whether Jython works for the Python ML libraries. If it does, all of them could be extended with a PMML producer utility.

Thanks again.

Regards,
Jayati

Vamshi

Jan 5, 2015, 5:19:00 AM
to jp...@googlegroups.com
Hi VR and Jayati,
Your discussion is very relevant to me. I want to generate PMML for any standard (or customized) model in Java using JPMML, so I looked at
https://github.com/jpmml/jpmml-model/blob/master/pmml-model-example/src/main/java/org/jpmml/model/GolfingTreeModelExample.java
Here the produce() method is responsible for generating the PMML file. I see that the SimplePredicate values for the nodes are hardcoded, and then the 'root' is passed to TreeModel. But ideally, how should this be done, and where do all these values come from? Should they be obtained from some 'TreeModel' object that we pass to the produce() method?
Please correct me if I have misunderstood.

Villu Ruusmann

Jan 5, 2015, 6:17:10 AM
to jpmml
Hi Vamshi,

> https://github.com/jpmml/jpmml-model/blob/master/pmml-model-example/src/main/java/org/jpmml/model/GolfingTreeModelExample.java

The structure of this TreeModel comes from the PMML specification. Go
to http://www.dmg.org/v4-2-1/TreeModel.html and search for a paragraph
called "Example TreeModel" (located in the middle of the document).

The idea of the GolfingTreeModelExample class is to demonstrate how to
use the JPMML-Model class model API. Basically, you should be using the
"Fluent API" approach wherever possible.

> But ideally, how should this be done, and where do all these values come from?
> Should they be obtained from some 'TreeModel' object that we pass to the
> produce() method?

In your case there should be some kind of external data structure that
you will be mapping to PMML elements and attributes. There is some
generic content, such as the DataDictionary element and the
MiningSchema/Output elements, and then there is model type-specific
content.
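
As a rough illustration of the model type-specific part, the sketch below walks a made-up in-memory tree and emits the matching PMML Node and predicate elements. To keep it dependency-free it uses the JDK's DOM API rather than the JPMML-Model classes; with JPMML-Model the recursion would look the same, only with typed objects instead of strings. SplitNode, its fields and the 2.0 threshold are all hypothetical stand-ins for whatever your training library produces.

```java
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical mapping from an in-memory decision tree to PMML Node elements.
final class TreeToPmmlSketch {

    static final String PMML_NS = "http://www.dmg.org/PMML-4_2";

    // Stand-in for the external data structure; not part of any real library.
    static final class SplitNode {
        final String field;    // null means "always true" (used for the root)
        final String operator; // a PMML SimplePredicate operator, e.g. "lessOrEqual"
        final String value;
        final String score;
        final List<SplitNode> children;

        SplitNode(String field, String operator, String value, String score,
                  SplitNode... children) {
            this.field = field;
            this.operator = operator;
            this.value = value;
            this.score = score;
            this.children = Arrays.asList(children);
        }
    }

    // Made-up tree for the "Plane" data; the 2.0 threshold is purely illustrative.
    static SplitNode sampleTree() {
        return new SplitNode(null, null, null, "0",
            new SplitNode("XCoordinate", "lessOrEqual", "2.0", "0"),
            new SplitNode("XCoordinate", "greaterThan", "2.0", "1"));
    }

    static Document newDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create an empty DOM document", e);
        }
    }

    // Recursively maps one SplitNode to a PMML Node: predicate first, then children.
    static Element toPmmlNode(Document doc, SplitNode node) {
        Element element = doc.createElementNS(PMML_NS, "Node");
        element.setAttribute("score", node.score);
        Element predicate;
        if (node.field == null) {
            predicate = doc.createElementNS(PMML_NS, "True");
        } else {
            predicate = doc.createElementNS(PMML_NS, "SimplePredicate");
            predicate.setAttribute("field", node.field);
            predicate.setAttribute("operator", node.operator);
            predicate.setAttribute("value", node.value);
        }
        element.appendChild(predicate);
        for (SplitNode child : node.children) {
            element.appendChild(toPmmlNode(doc, child));
        }
        return element;
    }

    public static void main(String[] args) {
        Element root = toPmmlNode(newDocument(), sampleTree());
        NodeList predicates = root.getElementsByTagNameNS(PMML_NS, "SimplePredicate");
        for (int i = 0; i < predicates.getLength(); i++) {
            Element predicate = (Element) predicates.item(i);
            System.out.println(predicate.getAttribute("field") + " "
                + predicate.getAttribute("operator") + " " + predicate.getAttribute("value"));
        }
    }
}
```

The returned root element would then be placed inside a TreeModel element, next to the generic MiningSchema content.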

I hope to publish my PMML production library for R later this month,
which is targeted at the conversion of extremely large R model data
structures to PMML documents. For example, it can export a ~5 GB
Random Forest PMML file in one minute. The standard "pmml" library
would take several days to do the same.


VR

Vamshi

Jan 7, 2015, 4:54:31 AM
to jp...@googlegroups.com
Thanks VR for the clear explanation.
It's good to know that you're publishing such an efficient PMML library.
I'm eager to know one thing, though: will your PMML library cover time series models and other models that are not currently supported in R?

Villu Ruusmann

Jan 7, 2015, 5:21:37 AM
to jpmml
Hi Vamshi,

> It's good to know that you're publishing such an efficient PMML library.
> I'm eager to know one thing, though: will your PMML library cover time series models
> and other models that are not currently supported in R?
>

The first version will only support Random Forest models
(classification and regression). Random Forest is one of the most
popular algorithms out there and, quite honestly, the current PMML
exporter implementation of the "pmml" package (both functionality- and
performance-wise) does not impress me at all. After that I will
probably tackle other bagging and boosting algorithms (e.g. gbm).

The good news is that my library will be released under the BSD
3-Clause License, which should encourage collaboration. So, if you
feel like it, you could contribute a time series exporter yourself.


VR