I am trying to export a PMML file from Python using sklearn2pmml, to be imported by JPMML in Java.
My goal: whenever the input contains a missing or invalid value, ignore it by applying a coefficient of 0.0 for that categorical variable (the model is a logistic regression).
So far, as shown below, I have been editing the PMML file by hand: I insert an artificial default value, set its coefficient to 0, and then map every missing/invalid value to this default value.
<DataDictionary>
  <DataField name="first_token" optype="categorical" dataType="string">
    <Value value="token1"/>
    <Value value="token2"/>
    <Value value="token3"/>
    <Value value="DEFAULT"/>
  </DataField>
</DataDictionary>
<RegressionModel functionName="classification" normalizationMethod="softmax">
  <MiningSchema>
    <MiningField name="first_token" invalidValueTreatment="asMissing" missingValueReplacement="DEFAULT"/>
  </MiningSchema>
  <RegressionTable intercept="-0.16989516996537837" targetCategory="1">
    <CategoricalPredictor name="first_token" value="token1" coefficient="-0.24977558505178424"/>
    <CategoricalPredictor name="first_token" value="token2" coefficient="-0.2936699428921362"/>
    <CategoricalPredictor name="first_token" value="token3" coefficient="0.3735503579785421"/>
    <CategoricalPredictor name="first_token" value="DEFAULT" coefficient="0"/>
  </RegressionTable>
</RegressionModel>
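The manual edit above could also be scripted as a post-processing step on the exported file. The following is only a minimal sketch using the standard library's ElementTree: the helper name `add_default_category` and the shortened `SAMPLE` document are made up for illustration, and it assumes the file has the same shape as the fragment above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def add_default_category(pmml_text, field, default="DEFAULT"):
    """Hypothetical helper: map missing/invalid values of `field` to
    `default` and give that category a zero coefficient."""
    root = ET.fromstring(pmml_text)
    # Snapshot the elements first, because children are added while looping.
    for elem in list(root.iter()):
        name = local_name(elem.tag)
        ns = elem.tag[: -len(name)]  # '{uri}' or '' when there is no namespace
        if name == "DataField" and elem.get("name") == field:
            values = [c.get("value") for c in elem if local_name(c.tag) == "Value"]
            if default not in values:
                ET.SubElement(elem, ns + "Value", {"value": default})
        elif name == "MiningField" and elem.get("name") == field:
            elem.set("invalidValueTreatment", "asMissing")
            elem.set("missingValueReplacement", default)
        elif name == "RegressionTable":
            values = [c.get("value") for c in elem
                      if local_name(c.tag) == "CategoricalPredictor"
                      and c.get("name") == field]
            # Only touch tables that already use this field.
            if values and default not in values:
                ET.SubElement(elem, ns + "CategoricalPredictor",
                              {"name": field, "value": default, "coefficient": "0"})
    return ET.tostring(root, encoding="unicode")


# Shortened stand-in for an exported file (no XML namespace, for brevity).
SAMPLE = """<PMML>
<DataDictionary>
<DataField name="first_token" optype="categorical" dataType="string">
<Value value="token1"/>
<Value value="token2"/>
</DataField>
</DataDictionary>
<RegressionModel functionName="classification" normalizationMethod="softmax">
<MiningSchema>
<MiningField name="first_token"/>
</MiningSchema>
<RegressionTable intercept="-0.16989516996537837" targetCategory="1">
<CategoricalPredictor name="first_token" value="token1" coefficient="-0.24977558505178424"/>
<CategoricalPredictor name="first_token" value="token2" coefficient="-0.2936699428921362"/>
</RegressionTable>
</RegressionModel>
</PMML>"""

patched = add_default_category(SAMPLE, "first_token")
```

A real export declares a PMML namespace on the root element; ElementTree then writes `ns0:` prefixes unless that namespace is registered with `ET.register_namespace` beforehand.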
Here is my PMML export code:
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import PMMLPipeline, sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression

# Decorate the column with a categorical domain, then binarize it
mapper = DataFrameMapper([('first_token', [CategoricalDomain(), LabelBinarizer()])])
model = LogisticRegression()
p = PMMLPipeline([('mapper', mapper), ('classifier', model)])
p.fit(df, df['LABEL'])
sklearn2pmml(p, 'my_pmml.pmml')
My first thought was to use the Imputer, but it only allows choosing "mean", "median", or "most_frequent", not a constant value.
My second thought was to write my own Transformer and add the "DEFAULT" value:
import numpy as np

class MyCategoricalDomain(CategoricalDomain):
    def __init__(self, **kwargs):
        CategoricalDomain.__init__(self, **kwargs)

    def fit(self, X, y=None):
        CategoricalDomain.fit(self, X, y)
        # Append the artificial category to the values observed during fit
        self.data_ = np.append(self.data_, 'DEFAULT')
        return self
But then I got an error that "MyCategoricalDomain" is not a subclass of Transformer.
I found these issues:
https://github.com/jpmml/sklearn2pmml/issues/20
https://github.com/jpmml/jpmml-sklearn/issues/30
And this:
https://github.com/jpmml/sklearn2pmml-plugin
Before I dive into writing my own Java wrapper for my custom Transformer, is there an easier way to accomplish missing/invalid treatment with a constant coefficient value replacement?
Thanks!
Ahh, my original question was supposed to include the specification of the missing value treatment, but the point you made is that it is not necessary to also add the "DEFAULT" value to the DataDictionary or to add a CategoricalPredictor entry for it. It will still work because 0 is used as the default coefficient!
Thanks Villu, you are the king of fast responses!
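The point about the default coefficient can be illustrated with a small sketch of how one regression table contributes to the score. This is not JPMML code, only a plain-Python imitation using the coefficients from the question; the function name `linear_predictor` is made up, and it assumes the evaluator lets the unknown value through (for example via the invalid value treatment on the MiningField) instead of rejecting it.

```python
# Intercept and coefficients copied from the RegressionTable in the question.
INTERCEPT = -0.16989516996537837
COEFFICIENTS = {
    "token1": -0.24977558505178424,
    "token2": -0.2936699428921362,
    "token3": 0.3735503579785421,
}


def linear_predictor(value):
    """Sum the intercept and the matching categorical term.

    A value without a matching CategoricalPredictor (or a missing value,
    passed here as None) contributes 0.0, so no DEFAULT entry is needed.
    """
    return INTERCEPT + COEFFICIENTS.get(value, 0.0)


score_known = linear_predictor("token3")
score_unknown = linear_predictor("never_seen_token")
score_missing = linear_predictor(None)
```

Both the unknown and the missing value end up with just the intercept, which is the same result the artificial DEFAULT category with coefficient 0 produced.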
Just a follow-up question in the same spirit. I would like to change the valid range of a continuous variable (the "leftMargin" and "rightMargin" attributes in the PMML). Suppose that in training I only observe values in the range [0.2, 0.8], but at inference time I may encounter anything in the range [0, 1].
Original:
<DataField name="ILR" optype="continuous" dataType="double">
  <Interval closure="closedClosed" leftMargin="0.2" rightMargin="0.8"/>
</DataField>
My manual change:
<DataField name="ILR" optype="continuous" dataType="double">
  <Interval closure="closedClosed" leftMargin="0.0" rightMargin="1.0"/>
</DataField>
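The manual change above can also be scripted as a post-processing step on the exported file. This is a rough sketch with the standard library's ElementTree, not a sklearn2pmml feature; the helper name `set_interval` and the `SAMPLE` wrapper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def set_interval(pmml_text, field, left, right):
    """Hypothetical helper: overwrite the Interval margins of one DataField."""
    root = ET.fromstring(pmml_text)
    for data_field in root.iter():
        # Compare local names so the code also works with a PMML namespace.
        if data_field.tag.rsplit("}", 1)[-1] != "DataField":
            continue
        if data_field.get("name") != field:
            continue
        for child in data_field:
            if child.tag.rsplit("}", 1)[-1] == "Interval":
                child.set("leftMargin", str(left))
                child.set("rightMargin", str(right))
    return ET.tostring(root, encoding="unicode")


# The DataField from above, wrapped in its DataDictionary parent.
SAMPLE = """<DataDictionary>
<DataField name="ILR" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0.2" rightMargin="0.8"/>
</DataField>
</DataDictionary>"""

widened = set_interval(SAMPLE, "ILR", 0.0, 1.0)
```

Fields with a different name are left untouched, so the same call can be repeated per variable that needs a wider range.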
I would like to call:
ContinuousDomain(min=0.0, max=1.0) but there is no such option.
If I were to extend the transformer myself:
class MyContinuousDomain(ContinuousDomain):
    def __init__(self, **kwargs):
        ContinuousDomain.__init__(self, **kwargs)

    def fit(self, X, y=None):
        # Override the observed range with the full valid range
        self.data_min_ = 0.0
        self.data_max_ = 1.0
        return self
Again, is there any way to do this in native sklearn2pmml without implementing and registering the Java class?