R vs Python for PMML Workflow


Andrew Orso
Jul 11, 2016, 4:27:11 PM
to Java PMML API
I was wondering what your opinion is on using R vs. Python to generate PMML, including the necessary transformations, from a workflow-efficiency point of view? Our team currently uses R, but some of the R models in the CRAN pmml package seem unstable, and it's also tough to handle the local transformations. My understanding is that with Python it's perhaps a bit easier to use sklearn.preprocessing, and the workflow to get to PMML doesn't require as many extra steps? Hope my question is clear.

Thanks!

Villu Ruusmann
Jul 12, 2016, 4:22:03 AM
to Java PMML API
Hi Andrew,

> I was wondering what your opinion is on using R vs Python to
> generate a PMML with the necessary transformations from an
> efficient workflow POV?

The conversion to PMML is the "finishing touch" of a ML workflow.
Ideally, it's a matter of inserting a single r2pmml() or
sklearn2pmml() function call into your R or Python script.

The real value is generated by script lines that precede that function
call. Therefore, you should always go with the ML platform that
maximizes the productivity of people in your data science team, and
not the one that claims to provide the best PMML support today. It's
always possible to improve PMML support if that should prove to be a
limiting factor.

> My understanding is with Python it's maybe a bit easier to use
> the sklearn.preprocessing and the workflow to get to pmml doesn't
> require as many extra steps? Hope my question is clear.
>

I have built PMML conversion libraries for R, Scikit-Learn, XGBoost
and Apache Spark ML. I wouldn't say that any of them is inherently
more "PMML friendly" (or "PMML hostile") than the others. In fact, I'm
currently building a PMML conversion helper library JPMML-Converter
(https://github.com/jpmml/jpmml-converter) that provides common
abstractions and functionality to all JPMML-R, JPMML-SkLearn,
JPMML-XGBoost and JPMML-SparkML libraries. It should serve as
further proof that the playing field is level.

Major differences at the ML platform level:
*) Scikit-Learn, XGBoost and Spark ML algorithms operate only on
numerical data, whereas R algorithms operate on mixed data (eg.
factors). A good deal of a typical Scikit-Learn or Spark ML workflow
is about "expanding" string columns to sets of numerical columns.
However, this activity is completely pointless from the PMML
perspective. JPMML-SkLearn and JPMML-SparkML libraries contain special
logic that "collapses" sets of numerical features back to
original-like string features. It's a lot of extra work, but it allows
for more compact representation of model schemas. In the JPMML-SkLearn
world, this compaction functionality became available in sklearn2pmml
package version 0.9.0. If you're using an older version, then please
upgrade!
*) Python and Scala are OOP languages, whereas R is a functional
language. If you're looking to implement a custom feature transformer
in Scikit-Learn or Spark ML, then you would 1) extend a specific base
class and 2) plug it into the pipeline at a specific location. R
doesn't provide/need such rigid formalizations. In R, you can append a
new column to data.frame by simply applying a function to an existing
data.frame column.
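The "expand/collapse" round trip described in the first point can be illustrated with a toy sketch in plain Python (this is an illustration only, not JPMML library code):

```python
# Toy illustration of the round trip: "expanding" a string value into
# one-hot indicator columns, and "collapsing" the indicators back into
# the original string value.
def expand(value, levels):
    """One-hot encode a string value against a fixed list of levels."""
    return [1 if value == level else 0 for level in levels]

def collapse(indicators, levels):
    """Recover the original string value from its one-hot indicators."""
    return levels[indicators.index(1)]

levels = ["setosa", "versicolor", "virginica"]
indicators = expand("versicolor", levels)
print(indicators)                    # -> [0, 1, 0]
print(collapse(indicators, levels))  # -> versicolor
```

The "collapse" direction is what makes the resulting PMML schema compact: the model schema can keep one categorical field instead of three binary indicator fields.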

Obviously, it's technically much easier to build PMML converters for a
fixed data structure (ie. a Scikit-Learn or SparkML transformer class)
than a variable data structure (ie. an R function). It's difficult to
make meaningful progress in R space if there's so much freedom around.
The "pmmlTransformations" package aims to introduce some structured
thinking, but its user interface (eg. the argument syntax) is rather
dreadful.

Do you have any ideas how to bring more structure to R scripts? The
"caret" package provides the preProcess() function
(http://topepo.github.io/caret/preprocess.html) that seems on par with
built-in Scikit-Learn and Spark ML feature transformer classes (eg.
imputation, scaling, PCA). I'm likely to take a shot at it sometime.


VR

Villu Ruusmann
Jul 18, 2016, 1:40:31 PM
to Java PMML API
Hi Andrew,

I have released r2pmml version 0.8.1, which introduces support for
caret-style data pre-processing (see "preProcess" in
https://cran.r-project.org/web/packages/caret/caret.pdf).

The mapping between SkLearn and Caret transformations:
*) MinMaxScaler -> preProcess(method = "range")
*) StandardScaler(with_mean = TRUE, with_std = TRUE) ->
preProcess(method = c("center", "scale"))
*) Imputer(strategy = "median") -> preProcess(method = "medianImpute")

You would train a processor object, and pass it to the r2pmml()
function as the "preProcess" argument:

library("caret")
library("randomForest")
library("r2pmml")

data(iris)
iris.preProcess = preProcess(iris, method = c("range"))
iris.transformed = predict(iris.preProcess, newdata = iris)
iris.rf = randomForest(Species ~., data = iris.transformed, ntree = 7)
r2pmml(iris.rf, preProcess = iris.preProcess, "iris_rf.pmml")


VR

Andrew Orso
Jul 25, 2016, 9:59:48 PM
to Java PMML API

Hey Villu,

As always, thank you for the thoughtful response. If I understand correctly, the issue with R is that, unlike sklearn, where you rely on the supported preprocessors, you can accomplish the same thing a million different ways just by writing a function. Is that correct? Now, accounting for the fact that preProcess requires numeric columns, I ran the following code:

data(iris)
iris.preProcess = preProcess(iris[,1:4], method = c("range"))
iris.transformed = predict(iris.preProcess, newdata = iris[,1:4])
iris.transformed = cbind(iris.transformed, Species = iris$Species)


iris.rf = randomForest(Species ~., data = iris.transformed, ntree = 7)
r2pmml(iris.rf, preProcess = iris.preProcess, "iris_rf.pmml")


and got the following exception:

Exception in thread "main" java.lang.ClassCastException: org.jpmml.rexp.RStringVector cannot be cast to org.jpmml.rexp.RGenericVector
at org.jpmml.rexp.PreProcessFeatureMapper.<init>(PreProcessFeatureMapper.java:51)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:68)
at org.jpmml.rexp.Main.run(Main.java:149)
at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, ...) : 1.

Did I do something wrong here? Either way, I'm super excited to get this working because this is just what I was hoping for!

I wonder how difficult it would be to take low-volume levels of a factor and map them to something like "Other" based on some sort of threshold. This is something I currently do with MapXform by just creating the mapping myself but would be cool to integrate into the workflow. Would it be difficult to go and create my own preProcessor if I agree on a standardization for something like the above? I don't have java experience but I have plenty of coding experience in general.

Best,

Andrew

Villu Ruusmann
Jul 26, 2016, 8:36:38 AM
to Java PMML API
Hi Andrew,

>
> If I understand correctly, the issue with R is that, unlike sklearn where
> you use on the supported preProcessors, you can accomplish the same
> thing a million different ways by just writing a function. Is that correct?

Correct.

If the model depends on pre-processed data, then you need to inform
the conversion engine about it. The conversion engine only knows about
the serialized data structure that is passed to it. It has no way of
knowing what happened in other parts of your R script.

The goal is to perform data pre-processing using an R library that
"encapsulates" the description of all activities to a serializable
data structure.

The caret library provides the "preProcess" class exactly for that
purpose. You're supposed to train a preProcess object and persist it
using saveRDS(). The next time you want to transform new data, you
would restore the preProcess object using readRDS() and apply it via
predict(preProcess, newdata = ...). All the transformation logic is
contained in this preProcess object. There is no need to keep around
a single line of supporting R code.
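For comparison, the same persist-and-reapply pattern can be sketched in plain Python (a hypothetical stdlib-only stand-in for preProcess(method = "range"), with pickle playing the role of saveRDS()/readRDS()):

```python
import pickle

# Hypothetical stand-in for caret's preProcess(method = "range"):
# learn min/max once at training time, then reapply the same
# rescaling anywhere the serialized object is restored.
class RangeScaler:

    def fit(self, values):
        # Training time: capture the value range of the training data
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        # Deployment time: rescale new data to the [0, 1] range
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in values]

scaler = RangeScaler().fit([2.0, 4.0, 6.0])
blob = pickle.dumps(scaler)        # persist, like saveRDS()
restored = pickle.loads(blob)      # restore, like readRDS()
print(restored.transform([4.0]))   # -> [0.5]
```

All the transformation logic travels inside the serialized object, which is exactly what makes it convertible to PMML.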

> Now, accounting for the fact that preProcess requires numeric columns, I ran the following code:
>
> data(iris)
> iris.preProcess = preProcess(iris[,1:4], method = c("range"))
> iris.transformed = predict(iris.preProcess, newdata = iris[,1:4])
> iris.transformed = cbind(iris.transformed, Species = iris$Species)
> iris.rf = randomForest(Species ~., data = iris.transformed, ntree = 7)
> r2pmml(iris.rf, preProcess = iris.preProcess, "iris_rf.pmml")
>
>
> and got the following exception:
> Exception in thread "main" java.lang.ClassCastException: org.jpmml.rexp.RStringVector cannot be cast to org.jpmml.rexp.RGenericVector
> at org.jpmml.rexp.PreProcessFeatureMapper.<init>(PreProcessFeatureMapper.java:51)
>

What is your caret package version?
library("caret")
packageVersion("caret")

I'm using caret 6.0.70, and the above R script works without problems.

>
> I wonder how difficult it would be to take low-volume levels of a factor
> and map them to something like "Other" based on some sort of threshold.
> This is something I currently do with MapXform by just creating the
> mapping myself but would be cool to integrate into the workflow.
>

Check out the "defaultValue" attribute of the MapValues transformation:
http://dmg.org/pmml/v4-2-1/Transformations.html#xsdElement_MapValues

For example, if you have a 10-level factor, then you should provide
explicit mappings only for the most frequent factor levels, and leave
all the rest to the implicit "defaultValue" mapping. The MapXForm
function appears to support this kind of encoding.

<MapValues outputColumn="fruitCode" defaultValue="0">
  <FieldColumnPair field="fruit" column="fruitName"/>
  <InlineTable>
    <row><fruitName>apple</fruitName><fruitCode>1</fruitCode></row>
    <row><fruitName>orange</fruitName><fruitCode>2</fruitCode></row>
  </InlineTable>
</MapValues>

Here, fruit names "apple" and "orange" would be mapped to 1 and 2,
respectively, whereas all others would be mapped to 0.
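The runtime semantics of MapValues with a defaultValue can be mimicked in a few lines of plain Python (an illustration only, not PMML evaluator code):

```python
# Mimic PMML MapValues semantics: explicit mappings for the frequent
# levels, the defaultValue for everything else.
def map_values(value, mapping, default_value):
    """Look up the value in the inline table; fall back to defaultValue."""
    return mapping.get(value, default_value)

fruit_code = {"apple": 1, "orange": 2}
print(map_values("apple", fruit_code, 0))   # -> 1
print(map_values("banana", fruit_code, 0))  # -> 0
```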


VR

Andrew Orso
Aug 7, 2016, 9:01:09 PM
to Java PMML API

Hi Villu,

I have had a chance to play around with the new version of r2pmml and I'm really enjoying the caret integration. In fact, I hadn't used caret much before this and the functionality it provides, along with your code, is a game-changer.

> Check out the "defaultValue" attribute of the MapValues transformation:
> http://dmg.org/pmml/v4-2-1/Transformations.html#xsdElement_MapValues
>
> For example, if you have a 10-level factor, then you should provide
> explicit mappings only for most frequent factor levels, and leave all
> the rest for the implicit "defaultValue" mapping. The MapXForm
> function appears to support this kind of encoding.
>
> <MapValues outputColumn="fruitCode" defaultValue="0">
>   <FieldColumnPair field="fruit" column="fruitName"/>
>   <InlineTable>
>     <row><fruitName>apple</fruitName><fruitCode>1</fruitCode></row>
>     <row><fruitName>orange</fruitName><fruitCode>2</fruitCode></row>
>   </InlineTable>
> </MapValues>
>
> Here, fruit names "apple" and "orange" would be mapped to 1 and 2,
> respectively, whereas all others would be mapped to 0.

This is what I actually do today to handle this situation. I was more wondering how difficult it would be to write a custom transformer class that would do something of the sort. Of course, this operation is not standardized like many of the caret preprocessors, but if I can make the transformation "self-contained", is it possible to create a transformer? I have a similar issue with mapping missing values to something for factors. These are things I can do in the pmml/pmmlTransformations packages, but as you say, the syntax is extremely clunky and the packages are lacking in support.

On a completely separate note, I'm interested to hear your vision for the future of PMML. You maintain pretty much every major active PMML package for R and Python, so what is the dream?

Best,
Andrew

Josh Izzard
Aug 7, 2016, 11:59:31 PM
to Java PMML API
Hi Villu,

To chime in on the discussion of preprocessing I'd like to hear your thoughts on the following:

1) It seems that we want to standardize on a common preprocessing R library. We have some standard models/modeling packages in R: glm, randomForest, knn, xgboost, etc., so the question of "How do I go from a model to a PMML file?" translates for you into coding against one of these standard modeling libraries. However, since preprocessing is so much more open-ended in R, it has historically been difficult to identify a useful area of opportunity where, if you were to code against a particular library, there would be some guarantee of use and adoption.

Caret seems like a good point of convergence for us - it is actively developed, and the maintainer seems interested in adding new functionality. One thought I had, to answer Andrew's question above and let you add a useful feature to the r2pmml package, is the following: I could submit a pull request to the caret package that contains a "categoricalImpute" method for preProcess, which maps missing factor or character data to a given value - either the most common value column-wise, or a separate "Unknown" default value. With this method added to the caret package, it seems like it would be reasonable to add it as a supported method to the r2pmml conversion engine, yes?

My second question regards the preprocessing necessary to use the xgboost package with the JPMML converter. You wrote some helpful functions for me a few months ago: one that takes a data frame and converts it to an xgb.DMatrix, and one that creates the feature map ("fmap"). How would you think about working these into the caret package? It would be nice to have a similar preprocessing "object" that converts an R data frame (and subsets of that data frame) into DMatrices that can be scored by an xgboost model without much finagling. Currently, if I take a subset of a data frame, or a data frame of similar but not identical data, and I try to go df -> DMatrix -> score with an xgboost model built off a different df, I get back scores that don't make a lot of sense... however, this could very well be user error.

Interested in your thoughts. Thanks for the consistent high quality of responses in this Google group.

Josh

Villu Ruusmann
Aug 8, 2016, 1:31:31 PM
to Java PMML API
Hi Andrew,

>
> I have had a chance to play around with the new version of r2pmml
> and I'm really enjoying the caret integration.
>

The Caret package provides a unified interface for working with the most
popular R packages. So, instead of learning the names/parametrizations
of many R model training functions, you'd only learn the
caret::train() function. Model tuning and (cross-)validation are "on"
by default. It can easily happen that Caret application code is more
compact/robust than functionally equivalent Scikit-Learn application
code.

Hadn't thought about it this way, but perhaps it would be right to
make the interoperability with the Caret package the main objective
for r2pmml.

>> Check out the "defaultValue" attribute of the MapValues transformation:
>> http://dmg.org/pmml/v4-2-1/Transformations.html#xsdElement_MapValues
>
> This is what I actually do today to handle this situation. I was more
> wondering how difficult it would be to write a custom transformer class
> that would do something of the sort.
> I have a similar issue with mapping missing values to something for factors.
>

The JPMML-Converter library (https://github.com/jpmml/jpmml-converter)
contains shared functionality between JPMML-R, JPMML-SkLearn,
JPMML-SparkML and JPMML-XGBoost libraries.

I introduced a FieldDecorator framework to it not long ago, which
is suitable for enhancing DataField/MiningField pairs with extra
information. This framework is used by the sklearn2pmml package to
record valid value space information, and to encode the
sklearn.preprocessing.Imputer transformation type "the PMML way" if
possible.

If you hadn't noticed this new sklearn2pmml functionality before, then
you could check out the sklearn2pmml.decoration.ContinuousDomain (and
CategoricalDomain) example:

import sklearn_pandas

from sklearn2pmml.decoration import ContinuousDomain

iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain()]),
    ("Species", None)
])

There are two parts to this approach.

First, there's a Python component that captures/holds all the relevant
information:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py

Second, there's a Java component that can translate the above to PMML:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/ContinuousDomain.java

The FieldDecorator framework could be used by the r2pmml package in a
similar fashion. The difficult part is designing an R component that
would be easy and intuitive to use. There's always a possibility to
define special-purpose data structures inside r2pmml package, but I'd
prefer something more generic, possibly data.frame based.

How about the following approach, where DataField/MiningField
decorations are simply declared as column attributes?
data(iris)
attr(iris$Sepal.Length, "invalidValueTreatment") = "asMissing"
attr(iris$Sepal.Length, "missingValueReplacement") = 3.5

Such decorated data.frame could be passed to the r2pmml() function:
r2pmml(.., dataset = iris, ..)
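The same attribute-decoration idea can be sketched in plain Python (all names here are hypothetical): record per-column decorations in a side table, which a converter could later translate to DataField/MiningField attributes.

```python
# Hypothetical sketch of the attribute-decoration idea: a side table of
# per-column metadata, analogous to attr() annotations on an R data.frame.
decorations = {}

def decorate(column, **attrs):
    """Attach converter-visible metadata to a column, like attr() in R."""
    decorations.setdefault(column, {}).update(attrs)

decorate("Sepal.Length", invalidValueTreatment="asMissing",
         missingValueReplacement=3.5)
print(decorations["Sepal.Length"])
```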


VR

Villu Ruusmann
Aug 8, 2016, 2:40:41 PM
to Java PMML API
Hi Josh,

>
> I could submit a pull request to the caret package that contains a
> "categoricalImpute" method for preProcess that maps missing factor or
> character data to a given value - either the most common value column-
> wise or a separate "Unknown" default case value.
> With this method added to the caret package it seems like it would be
> reasonable to add it as a supported method to the r2pmml?
>

You should definitely give it a try. There are more issues open in
this area (eg. https://github.com/topepo/caret/issues/417) that could
be handled while you're at it.

The r2pmml implementation of the "categoricalImpute" method doesn't
need to wait so long. The idea is to manually enhance the trained
preProcess object with this information:

# Default
pp = preProcess(myData)

# Manual enhancement
# Append a new method "categoricalImpute" to the methods list
pp$method = c(pp$method, "categoricalImpute" = "myCatCol")
# Provide column name/default value mappings
pp$categoricalImpute = c("myCatCol" = "someValue")

Then again, the imputation of categorical fields also fits the
data.frame pattern that was proposed in my previous e-mail:
attr(myData$myCatCol, "missingValueReplacement") = "someValue"
r2pmml(.., dataset = myData, ..)

> You wrote some helpful functions for me a few months ago that takes
> a data frame and converts it to an xgb.DMatrix, and one that creates
> the feature map ("fmap"). How would you think about working these into
> the caret package?

You mean the functions in util.R file:
https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/R/util.R

They were created as a temporary workaround for my R integration
testing needs. They do the work, but they're neither elegant nor
particularly efficient.

The genFMap() function should support more R datatypes. For example,
the boolean datatype should be mapped to two "i" columns ("value=true"
and "value=false"). The genDMatrix() function needs more
"vectorization", and the conversion should happen in-memory, without
using the temporary file. It's all about producing a sparse LibSVM
matrix as efficiently as possible. Perhaps the genDMatrix() function
has been made obsolete by some new fancy R package already (that uses
C/C++).
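For reference, the sparse LibSVM text format that genDMatrix() ultimately has to produce is simple enough to sketch in a few lines of plain Python (illustration only):

```python
# The sparse LibSVM text format: "label index:value index:value ...",
# where zero-valued features are simply omitted from the row.
def to_libsvm_row(label, values):
    """Format one dense row of feature values as a sparse LibSVM line."""
    pairs = ["{}:{}".format(i, v) for i, v in enumerate(values) if v != 0]
    return " ".join([str(label)] + pairs)

print(to_libsvm_row(1, [0.0, 2.5, 0.0, 3.0]))  # -> 1 1:2.5 3:3.0
```

The efficiency question is entirely about producing such rows (or the equivalent in-memory sparse matrix) without materializing the dense representation first.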

Again, if you have time for all this, then go ahead. You may consider
the xgboost package (based on
https://github.com/dmlc/xgboost/tree/master/R-package) as an
alternative destination.


VR

Josh Izzard
Aug 10, 2016, 2:55:42 PM
to Java PMML API
Villu,

> Then again, the imputation of categorical fields also fits the
> data.frame pattern that was proposed in my previous e-mail:
> attr(myData$myCatCol, "missingValueReplacement") = "someValue"
> r2pmml(.., dataset = myData, ..)

This method of adding attributes to the data frames, which are then passed to the `r2pmml()` function, fits well with what I'm thinking. I like this as a fix because it seems flexible, but it can also conform to the existing structure of the r2pmml package.

> You mean the functions in util.R file:
> https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/R/util.R
>
> They were created as a temporary workaround for my R integration
> testing needs. They do the work, but they're not elegant nor
> particularly efficient.

> ....


> Perhaps the genDMatrix() function
> has been made obsolete by some new fancy R package already (that uses
> C/C++).

I will investigate this. I use your DMatrix and Fmap functions quite a lot because I don't know another way of creating the fmaps necessary to pass to the jpmml-xgboost command line converter.

I'll provide an update in a few weeks as to the caret pull request and the DMatrix and Fmap investigations.

Thanks,
Josh

Harshit Karnatak
Dec 6, 2017, 6:37:52 AM
to Java PMML API

Hi VR,
I have a doubt about how to add derived features to the PMML. For example, if I created a feature by subtracting two columns, how do I add that column to the PMML file?
Thanks in advance

Villu Ruusmann
Dec 6, 2017, 8:21:41 AM
to Java PMML API
Hi Harshit,

> I have a doubt about how to add derived features to
> the PMML. For example, if I created a feature by subtracting
> two columns, how do I add that column to the PMML file?
>

What is your ML framework - R or Python?

In R, you can do rich feature engineering work inside model formulas.
See slides 20 and 21 in the following presentation:
https://www.slideshare.net/VilluRuusmann/converting-r-to-pmml-82182483

For example:
model.glm = glm(y ~ I(x1 - x2), data = ...)

In Python/Scikit-Learn, you can do arithmetics using the
sklearn2pmml.preprocessing.ExpressionTransformer transformation type.
See slides 20 and 21 in the following presentation:
https://www.slideshare.net/VilluRuusmann/converting-scikitlearn-to-pmml

For example:
subtractor = ExpressionTransformer("X[:, 0] - X[:, 1]")
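The derived feature computed by that expression can be illustrated with a plain-Python stand-in (illustration only, not sklearn2pmml code):

```python
# Row-wise equivalent of the NumPy expression X[:, i] - X[:, j]:
# subtract one feature column from another to form a derived feature.
def subtract_columns(X, i, j):
    """Compute column i minus column j for every row of X."""
    return [row[i] - row[j] for row in X]

print(subtract_columns([[5, 3], [10, 4]], 0, 1))  # -> [2, 6]
```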


VR
Message has been deleted

Villu Ruusmann
Dec 8, 2017, 5:24:01 PM
to Java PMML API
Hi Gaurav,

For starters - your question is completely unrelated to the topic of
"R vs. Python for PMML Workflow". You should consider opening a new
topic in the future. Something like "Advanced in-formula feature
engineering in R" would have been a more fitting topic/subject line.

Also, please do not delete messages after they have been posted to the
mailing list (even if they landed in the wrong topic), as that messes
up the history.

>
> data1$c<-as.factor(data1$c)
> data1.rf<-randomForest(ifelse((as.integer(strptime(a,format="%d-%m-%y")-strptime(b,format="%d-%m-%y")))>0,1,0) ~ c,data=data1)
>
> r2pmml(data1.rf,"xyz.pmml")
>

Thanks - that's a very interesting example. In principle, it should be
completely doable in (R2)PMML, but at the moment there are several
obstacles on the way:

1) The feature engineering happens on the "label side" of the formula
(ie. func(y) ~ x), not on the "features' side" (ie. y ~ func(x)):
https://github.com/jpmml/jpmml-r/issues/7

2) Inside the formula, you're performing an explicit type cast using
the 'as.integer' function, which is currently not supported:
https://github.com/jpmml/jpmml-r/issues/8

3) Inside the formula, you're parsing strings into date objects using
the 'strptime' function, which is also currently not supported:
https://github.com/jpmml/jpmml-r/issues/9

Please subscribe to these issues to track my progress.


VR

Harshit Karnatak
Dec 11, 2017, 1:20:40 PM
to Java PMML API
Hi VR,
The slides are really helpful and descriptive, and they helped me solve my problem to a great extent.
Thanks for the prompt replies and awesome answers. You are always a great help.

Harshit Karnatak
Dec 12, 2017, 6:59:45 AM
to Java PMML API
Hi VR,
I am trying to use TF-IDF in a PMML pipeline:

tfidfVectorizer1 = TfidfVectorizer(analyzer='word', preprocessor=None, min_df=100, stop_words='english', vocabulary={'totals': 0, 'grand': 1}, tokenizer=Splitter(), token_pattern=None, norm=None)

But it throws an error; when I remove the vocabulary parameter, it works fine.
I am wondering if using a vocabulary is supported or not. If not, is there a workaround to include a vocabulary in TF-IDF?
The error thrown is as follows:

java.lang.ClassCastException: java.lang.Integer cannot be cast to numpy.core.Scalar
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:99)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:76)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)

Preserved joblib dump file(s): C:\Users\HARSHI~1.000\AppData\Local\Temp\pipeline-3wg7fyiw.pkl.z
Traceback (most recent call last):
File "C:\Users\harshit.karnata.NOTEBOOK436.000\AppData\Roaming\Python\Python36\site-packages\sklearn2pmml\__init__.py", line 216, in sklearn2pmml
subprocess.check_call(cmd)
File "E:\ANACONDA\lib\subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', 'C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\guava-20.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\istack-commons-runtime-3.0.5.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\jaxb-core-2.3.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\slf4j-jdk14-1.7.25.jar;C:\\ai_datasciences_python\\sklearn2pmml-plugin-1.0-SNAPSHOT.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', 'C:\\Users\\HARSHI~1.000\\AppData\\Local\\Temp\\pipeline-3wg7fyiw.pkl.z', '--pmml-output',

Thanks in advance.

Harshit Karnatak
Dec 19, 2017, 4:20:29 AM
to Java PMML API

Hi VR,
As I was going further with r2pmml and feature engineering using model formulae, I came across a requirement where I have to do a multi-line, complex transformation to create a derived field, using multiple columns at once. So far I am familiar only with the single-line transformations given in your slides.
The following is the transformation that I have to use.
The code is in Python, because I am trying to convert a Python model to R.
Here X is a data frame, and "TRANSFORMED_COLUMN" is the final column that I want to form.

def Transforming_function(X):
    y = []
    data = pd.DataFrame(X, columns=['col1', 'col2', 'col3', 'col4', 'col5'])
    data['TRANSFORMED_COLUMN'] = 1
    for i in data['col1'].unique():
        for j in data[data['col1'] == i]['col2'].unique():
            temp = pd.DataFrame()
            temp = data[(data['col1'] == i) & (data['col2'] == j)]
            temp.reset_index(drop=True, inplace=True)
            first_row = 1
            last_row = temp.at[temp.shape[0] - 1, 'col3']
            df = pd.DataFrame()
            df = temp[temp['col4'] == 1]
            df.sort_values('col3', inplace=True)
            df = df.reset_index(drop=True)
            df2 = pd.DataFrame()
            df2 = temp[temp['col5'] == 1]
            df2.sort_values('col3', inplace=True)
            df2 = df2.reset_index(drop=True)
            if ((df.empty) & (df2.empty)):
                continue
            if df.empty:
                total_row_number = last_row
            else:
                total_row_number = df.reset_index(drop=True).at[df.shape[0] - 1, 'col3']
            data.loc[(data['col1'] == i) & (data['col2'] == j) & (
                data['col3'] == total_row_number), 'TRANSFORMED_COLUMN'] = 1
            if df2.empty:
                heading_row_number = first_row
            else:
                heading_row_number = df2.reset_index(drop=True).at[0, 'col3']
            data.loc[(data['col1'] == i) & (data['col2'] == j) & (
                data['col3'] <= heading_row_number), 'TRANSFORMED_COLUMN'] = 0
            data.loc[(data['col1'] == i) & (data['col2'] == j) & (
                (data['col3'] > heading_row_number) & (
                data['col3'] < total_row_number)), 'TRANSFORMED_COLUMN'] = 1
            data.loc[(data['col1'] == i) & (data['col2'] == j) & (
                data['col3'] >= total_row_number), 'TRANSFORMED_COLUMN'] = 0

    y = np.array(data['TRANSFORMED_COLUMN'])
    return y

Could you help me with how to do such multi-line transformations in r2pmml?
Thanks in advance.

Villu Ruusmann
Dec 19, 2017, 5:18:37 PM
to Java PMML API
Hi Harshit,

> I came across a requirement where I have to do the
> multiline and complex transformation to create a derived
> field and also use multiple columns for the same.
>

The JPMML-R library doesn't care if you write the model formula as a
single (long) line, or break it down into multiple (shorter) lines.

It's also possible to extend the formula parser component with new functions:
1) Create a R function (eg. "myfunction"), and put it into some R
package (eg. "mycompany") to make it easily identifiable.
2) Create a JPMML-R handler for it. You would need to fork the JPMML-R
library, because there's no proper plugin framework (as is the case
with SkLearn2PMML/JPMML-SkLearn) available at the moment.
3) Use this R function in a model formula using its
fully-qualified name: "as.formula(y ~ mycompany::myfunction(col1,
col2, col3, col4, col5))"

However, the above approach is really not suitable for solving your
'Transforming_function()' case, because this function is
table-oriented, not row-oriented. For example, it is scanning through
'col1' and 'col2' columns in order to identify unique values. This can
be done during model training, when the complete dataset is available.
This cannot be done during model deployment, when the data comes in
one row at a time.

To make progress, you'd need to split/refactor your
'Transforming_function()' into multiple parts:
1) Extract the training-time logic (everything that must be done
table/column-wise). For example, identifying unique category values.
2) Extract the deployment-time logic (everything that can be done
row-wise). For example, checking if the value is equal to the
specified category value.
3) Connect parts #1 and #2 using some "parameter transfer object". For
example, a map for carrying around the category values of each column.

The PMML conversion logic only needs to deal with the second part.
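The proposed split can be sketched in plain Python (all names here are hypothetical):

```python
# Sketch of the proposed split: a table-wise training-time part, a
# row-wise deployment-time part, and a plain data structure (the
# "parameter transfer object") connecting the two.
def fit_category_values(rows, column):
    """Training time: scan the whole column to collect its unique values."""
    return sorted({row[column] for row in rows})

def transform_row(row, column, category_values):
    """Deployment time: a pure row-wise check against the learned state."""
    return 1 if row[column] in category_values else 0

# The returned list is the "parameter transfer object"
state = fit_category_values([{"col1": "a"}, {"col1": "b"}], "col1")
print(transform_row({"col1": "a"}, "col1", state))  # -> 1
print(transform_row({"col1": "z"}, "col1", state))  # -> 0
```

Only transform_row() (plus the learned state) would need a PMML representation; fit_category_values() runs once, before conversion.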


VR