I have an existing LogisticRegression model which was trained on features that were transformed using a DictVectorizer.
I want to create a PMMLPipeline object from this trained model.
I could refactor my code and fit the DataFrameMapper on the original data (and thus create the PMMLPipeline from the beginning), but this would take too much time. I already have the LogisticRegression and DictVectorizer objects in hand.
I found this issue (https://github.com/jpmml/sklearn2pmml/issues/27), but I was wondering if you have any ideas specific to the problem of fitting a DataFrameMapper from an existing DictVectorizer with as little pain as possible?
Do I need to parse out the features from the DictVectorizer's feature_names_ list and somehow reconstruct the data for fitting the mapper?
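If parsing is indeed the way to go, the parsing step itself seems manageable; here's a minimal pure-Python sketch of what I have in mind (the feature_names_ contents are made up for illustration):

```python
# Hypothetical contents of a fitted DictVectorizer's feature_names_ list.
# "column=value" entries come from string (categorical) features;
# bare names come from numeric features, which DictVectorizer passes through.
feature_names = ["f1=v1", "f1=v2", "f2=v3", "age"]

categories = {}    # column name -> list of observed values
passthrough = []   # numeric columns left as-is

for name in feature_names:
    column, sep, value = name.partition("=")
    if sep:
        categories.setdefault(column, []).append(value)
    else:
        passthrough.append(column)

print(categories)   # {'f1': ['v1', 'v2'], 'f2': ['v3']}
print(passthrough)  # ['age']
```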
Thanks!
Sorry, but I'm a little confused from your answer.
I forgot to mention that I would like to retain the feature names in the exported PMML (instead of x1,x2,x3, etc.)
>
> transformer = loadPKL("transformer.pkl")
> estimator = loadPKL("estimator.pkl")
> pipeline = PMMLPipeline(
> ("transformer", transformer),
> ("estimator", estimator)
> )
>
This is what I tried, but DictVectorizer does not work in PMMLPipeline.
I'm not sure what "loadPKL" does exactly. And what type is the "transformer" object in your example? Not a DictVectorizer? Also, my DictVectorizer object is not pickled to a file first; I want to export the PMML as part of the training, so there is no intermediate pickling step.
>
> In the meantime, you could manually translate the DictVectorizer
> object to a list of LabelBinarizer objects.
>
So you're saying if I have a DictVectorizer whose feature_names_ = ["f1=v1", "f1=v2", "f2=v3", ...], I would parse these values and create a LabelBinarizer for each:
```python
lb_f1 = LabelBinarizer()
lb_f1.fit(["v1", "v2"])
lb_f2 = LabelBinarizer()
lb_f2.fit(["v3"])
```
Then would I construct a DataFrameMapper from this and use it in the PMMLPipeline?
```python
mapper = DataFrameMapper([
    ("f1", [CategoricalDomain(), lb_f1]),
    ("f2", [CategoricalDomain(), lb_f2])
])
```
Sorry for the confusion. :-)
My problem is that my input contains both multi-value (string) categorical and binary (integer) categorical variables. The DictVectorizer doesn't one-hot encode the binary categorical vars, because it treats integer values as continuous.
Therefore in the exported PMML, a binary categorical var appears as continuous:
```xml
<DataField name="binary_feature" optype="continuous" dataType="double"/>
```
When it should be:
```xml
<DataField name="binary_feature" optype="categorical" dataType="integer">
    <Value value="0"/>
    <Value value="1"/>
</DataField>
```
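This behaviour is easy to reproduce with a toy DictVectorizer (a small sketch; the feature names are made up):

```python
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
X = dv.fit_transform([
    {"categorical_feature": "a", "binary_feature": 1},
    {"categorical_feature": "b", "binary_feature": 0},
])

# String values are one-hot encoded under "column=value" names, while the
# integer-valued binary_feature is kept as a single continuous column:
print(dv.feature_names_)  # ['binary_feature', 'categorical_feature=a', 'categorical_feature=b']
```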
I tried inserting a DataFrameMapper into the pipeline, but the input to the mapper is a matrix whereas the input to the DictVectorizer is a list of dicts, so I don't know how to fit the whole pipeline.
I assume I need to use a LabelEncoder for the binary categorical vars, but again, its input format is different from that of DictVectorizer, so I'm stuck on how to construct the pipeline.
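For comparison, LabelEncoder consumes a single 1-d column of values rather than a list of dicts; a minimal sketch of what I mean:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-d array-like holding one column's values,
# unlike DictVectorizer, which consumes a list of feature dicts.
le = LabelEncoder()
le.fit([0, 1, 1, 0])

print(le.classes_.tolist())              # [0, 1]
print(le.transform([1, 0, 1]).tolist())  # [1, 0, 1]
```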
You have an example of fitting a PMMLPipeline using DictVectorizer (https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py), but not combined with other transformers or data frame mappers.
Any ideas?
Thanks!
In my first attempt, I tried using just a mapper and the classifier (without DictVectorizer):
> missing_invalid_params = {'invalid_value_treatment': 'as_missing', 'missing_value_treatment': 'as_value', 'missing_value_replacement': 'DEFAULT'}
>
> mapper = DataFrameMapper([
> ('categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('binary_categorical_feature1', [CategoricalDomain(**default_value_params), LabelEncoder()]),
> ('binary_categorical_feature2', [CategoricalDomain(**default_value_params), LabelEncoder()]),
> ('continuous_feature', ContinuousDomain())
> ])
>
> df = ... #my raw input features dataframe
> pipeline = PMMLPipeline([
> ('mapper', mapper),
> ('classifier', LogisticRegression())
> ])
> pipeline.fit(df[df.columns.difference(['LABEL'])].to_dict('records'), df['LABEL'])
> sklearn2pmml(pipeline, 'model.pmml', debug=True)
Which gave me this error:
> SEVERE: Failed to convert
> java.lang.IllegalArgumentException
> at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:61)
> at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:95)
> at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:120)
> at org.jpmml.sklearn.Main.run(Main.java:146)
> at org.jpmml.sklearn.Main.main(Main.java:93)
>
> Exception in thread "main" java.lang.IllegalArgumentException
> at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:61)
> at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:95)
> at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:120)
> at org.jpmml.sklearn.Main.run(Main.java:146)
> at org.jpmml.sklearn.Main.main(Main.java:93)
> ('Preserved joblib dump file(s): ', 'c:\\users\\dan\\appdata\\local\\temp\\pipeline-sxinwc.pkl.z')
> sklearn2pmml(pipeline, 'model.pmml', debug=verbose)
> File "C:\Users\dan\AppData\Roaming\Python\Python27\site-packages\sklearn2pmml\__init__.py", line 142, in sklearn2pmml
> raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
> RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams
The problematic line is here: https://github.com/jpmml/jpmml-sklearn/blob/8e06f844883fddec8ce19a2a5e940d177cc0aee5/src/main/java/sklearn_pandas/DataFrameMapper.java#L61
Could this be a bug?
In the meantime, I tried implementing it using FeatureUnion (with DictVectorizer), but I still need the mapper for indicating the missing value treatment for both categorical and binary categorical features:
> pmml_pipeline = PMMLPipeline([
> ('union', FeatureUnion([
> ('categorical', DataFrameMapper([
> ('categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('vectorizer', DictVectorizer(sparse=False))
> ])),
> ### HOW TO TRANSFORM DATAFRAME BACK INTO MATRIX HERE? ###
> ('binary_categorical', DataFrameMapper([
> ('binary_categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelEncoder()]),
> ('binary_categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelEncoder()])
> ])),
> ('continuous', DataFrameMapper([
> ('continuous_feature', ContinuousDomain())
> ]))
> ])),
> ('classifier', LogisticRegression())
> ])
This won't work for all the branches in the union, because the fit input format is a list of dicts, not a matrix:
> pmml_pipeline.fit(df[df.columns.difference(['LABEL'])].to_dict('records'), df['LABEL'])
> #Here: pickle and use the Java converter until sklearn2pmml also supports DictVectorizer
> So, you could translate a dict-DataFrame back to a matrix-DataFrame, and go with DataFrameMapper as usual. JPMML-SkLearn/SkLearn2PMML supports the FeatureUnion meta-transformer (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) that lets you introduce "branching" into your pipeline. In one branch, you could do the pre-processing of continuous features (eg. scaling), and in the other branch you could do the pre-processing of categorical features (eg. binarization).
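As I understand the quoted suggestion, translating the list of dicts back into a matrix-style DataFrame would look something like this (a sketch, assuming pandas; the column names are made up):

```python
import pandas as pd

# Hypothetical raw records, as would be fed to a DictVectorizer
records = [
    {"f1": "v1", "binary_feature": 1, "continuous_feature": 0.5},
    {"f1": "v2", "binary_feature": 0, "continuous_feature": 1.5},
]

# pandas turns the list of dicts into a column-per-key DataFrame,
# which is the input format that DataFrameMapper expects
df = pd.DataFrame.from_records(records)
print(sorted(df.columns))  # ['binary_feature', 'continuous_feature', 'f1']
```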
Can you please show me how to transform the data for the different pipelines in the union? Will sklearn2pmml support exporting such a transformation?
Thanks!
Thank you so much for the elaborate response. I understand my problem with DictVectorizer much more clearly now.
However, can you please comment on why my first solution (mapper and estimator only, without DictVectorizer) generated that Java IllegalArgumentException?
Thanks!