I have an existing LogisticRegression model which was trained on features that were transformed using a DictVectorizer.
I want to create a PMMLPipeline object from this trained model.
I could refactor my code and fit the DataFrameMapper on the original data (and thus create the PMMLPipeline from the beginning), but this would take too much time. I already have the LogisticRegression and DictVectorizer objects in hand.
I found this issue (https://github.com/jpmml/sklearn2pmml/issues/27), but I was wondering if you have any ideas specific to the problem of fitting a DataFrameMapper from an existing DictVectorizer with as little pain as possible?
Do I need to parse out the features from the DictVectorizer's feature_names_ list and somehow reconstruct the data for fitting the mapper?
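If parsing is indeed the way to go, the parsing step itself seems manageable; here's a minimal pure-Python sketch of what I have in mind (the feature_names_ contents are made up for illustration):

```python
# Hypothetical contents of a fitted DictVectorizer's feature_names_ list.
# "column=value" entries come from string (categorical) features;
# bare names come from numeric features, which DictVectorizer passes through.
feature_names = ["f1=v1", "f1=v2", "f2=v3", "age"]

categories = {}    # column name -> list of observed values
passthrough = []   # numeric columns left as-is

for name in feature_names:
    column, sep, value = name.partition("=")
    if sep:
        categories.setdefault(column, []).append(value)
    else:
        passthrough.append(column)

print(categories)   # {'f1': ['v1', 'v2'], 'f2': ['v3']}
print(passthrough)  # ['age']
```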
Thanks!
Sorry, but I'm a little confused from your answer.
I forgot to mention that I would like to retain the feature names in the exported PMML (instead of x1,x2,x3, etc.)
>
> transformer = loadPKL("transformer.pkl")
> estimator = loadPKL("estimator.pkl")
> pipeline = PMMLPipeline(
> ("transformer", transformer),
> ("estimator", estimator)
> )
>
This is what I tried, but DictVectorizer does not work in PMMLPipeline.
I'm not sure what "loadPKL" does exactly. And what type is the "transformer" object in your example? Not a DictVectorizer? Also, my DictVectorizer object is not pickled to a file first; I want to export the PMML as part of the training, so there is no intermediate pickling step.
>
> In the meantime, you could manually translate the DictVectorizer
> object to a list of LabelBinarizer objects.
>
So you're saying if I have a DictVectorizer whose feature_names_ = ["f1=v1", "f1=v2", "f2=v3", ...], I would parse these values and create a LabelBinarizer for each:
```python
lb_f1 = LabelBinarizer()
lb_f1.fit(["v1", "v2"])
lb_f2 = LabelBinarizer()
lb_f2.fit(["v3"])
```
Then would I construct a DataFrameMapper from this and use it in the PMMLPipeline?
```python
mapper = DataFrameMapper([
    ("f1", [CategoricalDomain(), lb_f1]),
    ("f2", [CategoricalDomain(), lb_f2])
])
```
Sorry for the confusion. :-)
My problem is that my input contains both multi-value (string) categorical and binary (integer) categorical variables. The DictVectorizer doesn't one-hot encode the binary categorical vars, because it treats integer values as continuous.
Therefore in the exported PMML, a binary categorical var appears as continuous:
```xml
<DataField name="binary_feature" optype="continuous" dataType="double"/>
```
When it should be:
```xml
<DataField name="binary_feature" optype="categorical" dataType="integer">
    <Value value="0"/>
    <Value value="1"/>
</DataField>
```
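This behaviour is easy to reproduce with a toy DictVectorizer (a small sketch; the feature names are made up):

```python
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
X = dv.fit_transform([
    {"categorical_feature": "a", "binary_feature": 1},
    {"categorical_feature": "b", "binary_feature": 0},
])

# String values are one-hot encoded under "column=value" names, while the
# integer-valued binary_feature is kept as a single continuous column:
print(dv.feature_names_)  # ['binary_feature', 'categorical_feature=a', 'categorical_feature=b']
```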
I tried inserting a DataFrameMapper into the pipeline, but the input to the mapper is a matrix whereas the input to the DictVectorizer is a list of dicts, so I don't know how to fit the whole pipeline.
I assume I need to use a LabelEncoder for the binary categorical vars, but again, its input format is different from that of DictVectorizer, so I'm stuck on how to construct the pipeline.
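For comparison, LabelEncoder consumes a single 1-d column of values rather than a list of dicts; a minimal sketch of what I mean:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-d array-like holding one column's values,
# unlike DictVectorizer, which consumes a list of feature dicts.
le = LabelEncoder()
le.fit([0, 1, 1, 0])

print(le.classes_.tolist())              # [0, 1]
print(le.transform([1, 0, 1]).tolist())  # [1, 0, 1]
```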
You have an example of fitting a PMMLPipeline using DictVectorizer (https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py), but not combined with other transformers or data frame mappers.
Any ideas?
Thanks!
In my first attempt, I tried using just a mapper and the classifier (without DictVectorizer):
> missing_invalid_params = {'invalid_value_treatment': 'as_missing', 'missing_value_treatment': 'as_value', 'missing_value_replacement': 'DEFAULT'}
>
> mapper = DataFrameMapper([
> ('categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('binary_categorical_feature1', [CategoricalDomain(**default_value_params), LabelEncoder()]),
> ('binary_categorical_feature2', [CategoricalDomain(**default_value_params), LabelEncoder()]),
> ('continuous_feature', ContinuousDomain())
> ])
>
> df = ... #my raw input features dataframe
> pipeline = PMMLPipeline([
> ('mapper', mapper),
> ('classifier', LogisticRegression())
> ])
> pipeline.fit(df[df.columns.difference(['LABEL'])].to_dict('records'), df['LABEL'])
> sklearn2pmml(pipeline, 'model.pmml', debug=True)
Which gave me this error:
> SEVERE: Failed to convert
> java.lang.IllegalArgumentException
> at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:61)
> at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:95)
> at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:120)
> at org.jpmml.sklearn.Main.run(Main.java:146)
> at org.jpmml.sklearn.Main.main(Main.java:93)
>
> Exception in thread "main" java.lang.IllegalArgumentException
> at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:61)
> at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:95)
> at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:120)
> at org.jpmml.sklearn.Main.run(Main.java:146)
> at org.jpmml.sklearn.Main.main(Main.java:93)
> ('Preserved joblib dump file(s): ', 'c:\\users\\dan\\appdata\\local\\temp\\pipeline-sxinwc.pkl.z')
> sklearn2pmml(pipeline, 'model.pmml', debug=verbose)
> File "C:\Users\dan\AppData\Roaming\Python\Python27\site-packages\sklearn2pmml\__init__.py", line 142, in sklearn2pmml
> raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
> RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams
The problematic line is here: https://github.com/jpmml/jpmml-sklearn/blob/8e06f844883fddec8ce19a2a5e940d177cc0aee5/src/main/java/sklearn_pandas/DataFrameMapper.java#L61
Could this be a bug?
In the meantime, I tried implementing it using FeatureUnion (with DictVectorizer), but I still need the mapper for indicating the missing value treatment for both categorical and binary categorical features:
> pmml_pipeline = PMMLPipeline([
> ('union', FeatureUnion([
> ('categorical', DataFrameMapper([
> ('categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelBinarizer()]),
> ('vectorizer', DictVectorizer(sparse=False))
> ])),
> ### HOW TO TRANSFORM DATAFRAME BACK INTO MATRIX HERE? ###
> ('binary_categorical', DataFrameMapper([
> ('binary_categorical_feature1', [CategoricalDomain(**missing_invalid_params), LabelEncoder()]),
> ('binary_categorical_feature2', [CategoricalDomain(**missing_invalid_params), LabelEncoder()])
> ])),
> ('continuous', DataFrameMapper([
> ('continuous_feature', ContinuousDomain())
> ]))
> ])),
> ('classifier', LogisticRegression())
> ])
This won't work for all the branches in the union, because the fit input format is a list of dicts, not a matrix:
> pmml_pipeline.fit(df[df.columns.difference(['LABEL'])].to_dict('records'), df['LABEL'])
> #Here: pickle and use the Java converter until sklearn2pmml also supports DictVectorizer
> So, you could translate a dict-DataFrame back to a matrix-DataFrame, and go with DataFrameMapper as usual. JPMML-SkLearn/SkLearn2PMML supports the FeatureUnion meta-transformer (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) that lets you introduce "branching" into your pipeline. In one branch, you could do the pre-processing of continuous features (eg. scaling), and in the other branch you could do the pre-processing of categorical features (eg. binarization).
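As I understand the quoted suggestion, translating the list of dicts back into a matrix-style DataFrame would look something like this (a sketch, assuming pandas; the column names are made up):

```python
import pandas as pd

# Hypothetical raw records, as would be fed to a DictVectorizer
records = [
    {"f1": "v1", "binary_feature": 1, "continuous_feature": 0.5},
    {"f1": "v2", "binary_feature": 0, "continuous_feature": 1.5},
]

# pandas turns the list of dicts into a column-per-key DataFrame,
# which is the input format that DataFrameMapper expects
df = pd.DataFrame.from_records(records)
print(sorted(df.columns))  # ['binary_feature', 'continuous_feature', 'f1']
```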
Can you please show me how to transform the data for the different pipelines in the union? Will sklearn2pmml support exporting such a transformation?
Thanks!
Thank you so much for the elaborate response. I understand my problem with DictVectorizer much more clearly now.
However, can you please comment on why my first solution (mapper and estimator only, without DictVectorizer) generated that Java IllegalArgumentException?
Thanks!