Thanks!
Hey Villu,
As always, thank you for the thoughtful response. If I understand correctly, the issue with R is that, unlike sklearn, where you use only the supported preprocessors, you can accomplish the same thing a million different ways by just writing a function. Is that correct? Now, accounting for the fact that preProcess requires numeric columns, I ran the following code:
library(caret)
library(randomForest)
library(r2pmml)
data(iris)
# Scale the four numeric columns to the [0, 1] range
iris.preProcess = preProcess(iris[,1:4], method = c("range"))
iris.transformed = predict(iris.preProcess, newdata = iris[,1:4])
iris.transformed = cbind(iris.transformed, Species = iris$Species)
iris.rf = randomForest(Species ~., data = iris.transformed, ntree = 7)
r2pmml(iris.rf, preProcess = iris.preProcess, "iris_rf.pmml")
and got the following exception:
Exception in thread "main" java.lang.ClassCastException: org.jpmml.rexp.RStringVector cannot be cast to org.jpmml.rexp.RGenericVector
at org.jpmml.rexp.PreProcessFeatureMapper.<init>(PreProcessFeatureMapper.java:51)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:68)
at org.jpmml.rexp.Main.run(Main.java:149)
at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, ...) : 1.
Did I do something wrong here? Either way, I'm super excited to get this working because this is just what I was hoping for!
I wonder how difficult it would be to take low-volume levels of a factor and map them to something like "Other" based on some sort of threshold. This is something I currently do with MapXform by creating the mapping myself, but it would be cool to integrate it into the workflow. Would it be difficult to create my own preprocessor if we agree on a standardization for something like the above? I don't have Java experience, but I have plenty of coding experience in general.
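To make the thresholding idea concrete, here is a minimal sketch in plain Python (for illustration only, not R or r2pmml code). The function names and the `min_count` threshold are hypothetical; the point is just that frequent levels get explicit mappings and everything else falls through to a single default.

```python
from collections import Counter

def build_level_mapping(values, min_count, other_label="Other"):
    """Keep levels seen at least `min_count` times; the rest share `other_label`."""
    counts = Counter(values)
    explicit = {level: level for level, n in counts.items() if n >= min_count}
    return explicit, other_label

def apply_level_mapping(values, explicit, default):
    # Rare or previously unseen levels are not in `explicit`, so they get the default
    return [explicit.get(v, default) for v in values]

fruits = ["apple"] * 5 + ["orange"] * 4 + ["kiwi", "fig"]
explicit, default = build_level_mapping(fruits, min_count=3)
collapsed = apply_level_mapping(fruits + ["durian"], explicit, default)
```

Because unseen levels also take the default, the same mapping can be applied to new data without refitting.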
Best,
Andrew
Hi Villu,
I have had a chance to play around with the new version of r2pmml and I'm really enjoying the caret integration. In fact, I hadn't used caret much before this and the functionality it provides, along with your code, is a game-changer.
> Check out the "defaultValue" attribute of the MapValues transformation:
> http://dmg.org/pmml/v4-2-1/Transformations.html#xsdElement_MapValues
>
> For example, if you have a 10-level factor, then you should provide
> explicit mappings only for most frequent factor levels, and leave all
> the rest for the implicit "defaultValue" mapping. The MapXForm
> function appears to support this kind of encoding.
>
> <MapValues outputColumn="fruitCode" defaultValue="0">
> <FieldColumnPair field="fruit" column="fruitName"/>
> <InlineTable>
> <row><fruitName>apple</fruitName><fruitCode>1</fruitCode></row>
> <row><fruitName>orange</fruitName><fruitCode>2</fruitCode></row>
> </InlineTable>
> </MapValues>
>
> Here, fruit names "apple" and "orange" would be mapped to 1 and 2,
> respectively, whereas all others would be mapped to 0.
This is what I actually do today to handle this situation. I was more wondering how difficult it would be to write a custom transformer class that would do something of the sort. Of course this operation is not standardized like many of the caret preprocessors, but if I can make the transformation "self-contained", is it possible to create a transformer? I have a similar issue with mapping missing values to something for factors. These are things I can do in the pmml/pmmlTransformations package but as you say, the syntax is extremely clunky and the package is lacking in support.
On a completely separate note, I'm interested to hear what your vision is for the future of PMML. You own pretty much every major active PMML package for R and Python, so what is the dream?
Best,
Andrew
To chime in on the discussion of preprocessing I'd like to hear your thoughts on the following:
1) It seems that we want to stabilize on a standard preprocessing R library. We have some standard models/modeling packages in R: glm, randomForest, knn, xgboost, etc, so the question of "How do I go from a model to PMML file" translates for you into coding against one of these standard modeling libraries. However, since preprocessing is so much more open-ended in R, it has historically been difficult to identify a useful area of opportunity where if you were to code against a particular library there would be some guarantee of use and adoption.
Caret seems like a good point of convergence for us - it is actively developed and the maintainer seems interested in adding new functionality. One thought that I had to answer Andrew's question above and allow you to add a useful feature to the r2pmml package is the following: I could submit a pull request to the caret package that contains a "categoricalImpute" method for preProcess that maps missing factor or character data to a given value - either the most common value column-wise or a separate "Unknown" default case value. With this method added to the caret package it seems like it would be reasonable to add it as a supported method to the r2pmml conversion engine, yes?
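As a rough sketch of what such a "categoricalImpute" step could compute (plain Python for illustration; the function names and the `strategy` argument are hypothetical and not part of caret or r2pmml), the fit step picks one replacement value per column and the apply step substitutes it for missing entries:

```python
from collections import Counter

def fit_categorical_impute(column, strategy="most_frequent", fill_value="Unknown"):
    """Choose the replacement for missing (None) entries in one column."""
    observed = [v for v in column if v is not None]
    if strategy == "most_frequent" and observed:
        # Most common observed level in this column
        return Counter(observed).most_common(1)[0][0]
    # Otherwise fall back to a fixed default level
    return fill_value

def apply_categorical_impute(column, replacement):
    return [replacement if v is None else v for v in column]

train = ["red", None, "blue", "red", None]
replacement = fit_categorical_impute(train)
imputed = apply_categorical_impute(train, replacement)
constant = fit_categorical_impute(train, strategy="constant")
```

Storing only the chosen replacement value keeps the fitted step small, which should make it straightforward to carry into a converter as a single per-column setting.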
My second question regards the preprocessing necessary for use of the xgboost package with the JPMML converter. You wrote some helpful functions for me a few months ago: one that takes a data frame and converts it to an xgb.DMatrix, and one that creates the feature map ("fmap"). How would you think about working these into the caret package? It would be nice to have a similar preprocessing "object" that converts an R data frame and a subset of that data frame into DMatrices that can be scored by an xgboost model without much finagling. Currently, if I take a subset of a data frame, or a data frame of similar but not identical data, and I try to go df -> DMatrix -> score with an xgboost model built off a different df, I get back scores that don't make a lot of sense. However, this could very well be user error.
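One possible cause of the odd scores (an assumption, not a diagnosis) is that the indicator columns are regenerated from whatever levels happen to be present in the subset, so the column positions no longer match the ones the model was trained on. The sketch below, in plain Python with hypothetical helper names, fixes the column layout once from the training rows and reuses it for any later data:

```python
def fit_columns(rows, categorical):
    """Record a fixed output column order from the training rows."""
    columns = []
    for name in rows[0]:
        if name in categorical:
            levels = sorted({row[name] for row in rows})
            columns.extend(f"{name}={level}" for level in levels)
        else:
            columns.append(name)
    return columns

def encode(rows, columns):
    """Encode rows against the training layout; unseen levels become all zeros."""
    matrix = []
    for row in rows:
        present = {}
        for name, value in row.items():
            if name in columns:
                present[name] = float(value)        # numeric column
            else:
                present[f"{name}={value}"] = 1.0    # indicator column
        matrix.append([present.get(col, 0.0) for col in columns])
    return matrix

train = [{"x": 1.0, "color": "red"}, {"x": 2.0, "color": "blue"}, {"x": 3.0, "color": "green"}]
subset = [{"x": 5.0, "color": "red"}]

columns = fit_columns(train, {"color"})
aligned = encode(subset, columns)
refit = fit_columns(subset, {"color"})  # layout if the subset were encoded on its own
```

Here `aligned` keeps the four training columns, whereas `refit` has only two, which is the kind of mismatch that would shift feature indices under a trained model.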
Interested in your thoughts. Thanks for the consistent high quality of responses in this Google group.
Josh
> Then again, the imputation of categorical fields also fits the
> data.frame pattern that was proposed in my previous e-mail:
> attr(myData$myCatCol, "missingValueReplacement") = "someValue"
> r2pmml(.., dataset = myData, ..)
This method of adding attributes to the data frame, which is then passed to the `r2pmml` function, fits well with what I'm thinking. I like this as a fix because it seems flexible but can also conform to the existing structure of the r2pmml package.
> You mean the functions in util.R file:
> https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/R/util.R
>
> They were created as a temporary workaround for my R integration
> testing needs. They do the work, but they're not elegant nor
> particularly efficient.
> ....
> Perhaps the genDMatrix() function
> has been made obsolete by some new fancy R package already (that uses
> C/C++).
I will investigate this. I use your DMatrix and Fmap functions quite a lot because I don't know another way of creating the fmaps that need to be passed to the jpmml-xgboost command-line converter.
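For reference, an XGBoost feature map is a text file with one line per feature: a zero-based index, a feature name, and a type code (`q` for quantitative, `i` for binary indicator, `int` for integer). The sketch below generates such lines in plain Python. The `make_fmap` helper, the tab separator, and the `name=level` naming for indicator features are assumptions made for illustration and should be checked against what the converter actually expects.

```python
def make_fmap(columns):
    """columns: list of (name, kind, levels), kind in {"numeric", "integer", "factor"}."""
    entries = []
    for name, kind, levels in columns:
        if kind == "factor":
            # One binary indicator feature per factor level
            entries.extend((f"{name}={level}", "i") for level in levels)
        elif kind == "integer":
            entries.append((name, "int"))
        else:
            entries.append((name, "q"))
    # Feature ids are consecutive and start at 0
    return [f"{idx}\t{name}\t{ftype}" for idx, (name, ftype) in enumerate(entries)]

fmap = make_fmap([
    ("Sepal.Length", "numeric", None),
    ("Species", "factor", ["setosa", "versicolor", "virginica"]),
])
fmap_text = "\n".join(fmap) + "\n"
```

The feature order in the map has to match the column order of the matrix the model was trained on, so it makes sense to generate both from the same column description.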
I'll provide an update in a few weeks as to the caret pull request and the DMatrix and Fmap investigations.
Thanks,
Josh
Hi VR,
I have a question about how to add derived features to the PMML. For example, if I create a feature by subtracting one column from another, how do I add that column to the PMML file?
Thanks in advance
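For orientation, a column difference is represented in PMML as a DerivedField whose body is an Apply element with function="-" and two FieldRef children. The sketch below only builds that fragment with Python's standard library to show the target structure; the field names "diff", "col_a" and "col_b" are placeholders, and this is not how r2pmml or sklearn2pmml generate it internally.

```python
import xml.etree.ElementTree as ET

def derived_difference(name, minuend, subtrahend):
    """Build <DerivedField> computing minuend - subtrahend."""
    field = ET.Element("DerivedField", name=name, optype="continuous", dataType="double")
    apply = ET.SubElement(field, "Apply", function="-")
    # Argument order matters: the first FieldRef is the value subtracted from
    ET.SubElement(apply, "FieldRef", field=minuend)
    ET.SubElement(apply, "FieldRef", field=subtrahend)
    return field

element = derived_difference("diff", "col_a", "col_b")
xml_text = ET.tostring(element, encoding="unicode")
```

In a complete document such a field sits inside a TransformationDictionary or a model's LocalTransformations element, and the model then refers to "diff" like any other input field.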
However, it throws an error. When I remove the vocabulary parameter, it works fine.
I am wondering whether using vocabulary is supported. If not, is there a workaround to include a vocabulary in tfidf?
The error thrown is as follows:
java.lang.ClassCastException: java.lang.Integer cannot be cast to numpy.core.Scalar
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:99)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:76)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
Preserved joblib dump file(s): C:\Users\HARSHI~1.000\AppData\Local\Temp\pipeline-3wg7fyiw.pkl.z
Traceback (most recent call last):
File "C:\Users\harshit.karnata.NOTEBOOK436.000\AppData\Roaming\Python\Python36\site-packages\sklearn2pmml\__init__.py", line 216, in sklearn2pmml
subprocess.check_call(cmd)
File "E:\ANACONDA\lib\subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', 'C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\guava-20.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\istack-commons-runtime-3.0.5.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\jaxb-core-2.3.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\slf4j-jdk14-1.7.25.jar;C:\\ai_datasciences_python\\sklearn2pmml-plugin-1.0-SNAPSHOT.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', 'C:\\Users\\HARSHI~1.000\\AppData\\Local\\Temp\\pipeline-3wg7fyiw.pkl.z', '--pmml-output',
Thanks in advance.
Hi VR,
As I went further with r2pmml and feature engineering using model formulae, I came across a requirement where I have to apply a complex, multi-line transformation that uses multiple columns to create a derived field. So far I am only familiar with the single-line transformations shown in your presentation.
The following is the transformation that I have to use.
The code is in Python because I am trying to convert a Python model to R.
Here X is a dataframe and "TRANSFORMED_COLUMN" is the final column that I want to create.
import numpy as np
import pandas as pd

def Transforming_function(X):
    data = pd.DataFrame(X, columns=['col1', 'col2', 'col3', 'col4', 'col5'])
    data['TRANSFORMED_COLUMN'] = 1
    for i in data['col1'].unique():
        for j in data[data['col1'] == i]['col2'].unique():
            # Rows belonging to the current (col1, col2) group
            group = (data['col1'] == i) & (data['col2'] == j)
            temp = data[group].reset_index(drop=True)
            first_row = 1
            last_row = temp.at[temp.shape[0] - 1, 'col3']
            # Rows flagged by col4 and by col5, each sorted by col3
            df = temp[temp['col4'] == 1].sort_values('col3').reset_index(drop=True)
            df2 = temp[temp['col5'] == 1].sort_values('col3').reset_index(drop=True)
            if df.empty and df2.empty:
                continue
            # Upper boundary: last col4-flagged row, else the last row of the group
            if df.empty:
                total_row_number = last_row
            else:
                total_row_number = df.at[df.shape[0] - 1, 'col3']
            data.loc[group & (data['col3'] == total_row_number), 'TRANSFORMED_COLUMN'] = 1
            # Lower boundary: first col5-flagged row, else the first row
            if df2.empty:
                heading_row_number = first_row
            else:
                heading_row_number = df2.at[0, 'col3']
            data.loc[group & (data['col3'] <= heading_row_number), 'TRANSFORMED_COLUMN'] = 0
            # Rows strictly between the boundaries get 1; rows at or past the upper boundary get 0
            data.loc[group & (data['col3'] > heading_row_number)
                     & (data['col3'] < total_row_number), 'TRANSFORMED_COLUMN'] = 1
            data.loc[group & (data['col3'] >= total_row_number), 'TRANSFORMED_COLUMN'] = 0
    y = np.array(data['TRANSFORMED_COLUMN'])
    return y
Could you help me work out how to do such multi-line transformations in r2pmml?
Thanks in advance.