sklearn2pmml: Assign a default coefficient to missing/invalid values of a categorical variable

Dan

Apr 20, 2017, 1:11:42 AM
to Java PMML API
(Using Python 2.7, sklearn2pmml 0.19.0, sklearn 0.18.1, sklearn_pandas 1.3.0)

I am trying to export a PMML file in Python using sklearn2pmml for import by JPMML in Java.

My goal: For any inputted missing/invalid value, ignore the value by using coefficient equal to 0.0 for that categorical variable (logistic regression).

Up until now, as shown below, I have been editing the PMML file manually: I insert an artificial default value and set its coefficient to 0, then map any missing/invalid value to this default value.

<DataDictionary>
    <DataField name="first_token" optype="categorical" dataType="string">
        <Value value="token1"/>
        <Value value="token2"/>
        <Value value="token3"/>
        <Value value="DEFAULT"/>
    </DataField>
</DataDictionary>
<RegressionModel functionName="classification" normalizationMethod="softmax">
    <MiningSchema>
        <MiningField name="first_token" invalidValueTreatment="asMissing" missingValueReplacement="DEFAULT"/>
    </MiningSchema>
    <RegressionTable intercept="-0.16989516996537837" targetCategory="1">
        <CategoricalPredictor name="first_token" value="token1" coefficient="-0.24977558505178424"/>
        <CategoricalPredictor name="first_token" value="token2" coefficient="-0.2936699428921362"/>
        <CategoricalPredictor name="first_token" value="token3" coefficient="0.3735503579785421"/>
        <CategoricalPredictor name="first_token" value="DEFAULT" coefficient="0"/>
    </RegressionTable>
</RegressionModel>


Here is my PMML export code:
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import PMMLPipeline, sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model.logistic import LogisticRegression

# df is a pandas DataFrame with a 'first_token' column and a 'LABEL' column
mapper = DataFrameMapper([('first_token', [CategoricalDomain(), LabelBinarizer()])])
model = LogisticRegression()
p = PMMLPipeline([('mapper', mapper), ('classifier', model)])
p.fit(df, df['LABEL'])
sklearn2pmml(p, 'my_pmml.pmml')


My first thought was to use the Imputer, but it only supports the "mean", "median", and "most_frequent" strategies, not a constant value.

My second thought was to write my own Transformer and add the "DEFAULT" value:
class MyCategoricalDomain(CategoricalDomain):
    def __init__(self, **kwargs):
        CategoricalDomain.__init__(self, **kwargs)

    def fit(self, X, y=None):
        CategoricalDomain.fit(self, X, y)
        # np is numpy (import numpy as np)
        self.data_ = np.append(self.data_, 'DEFAULT')
        return self

But then I got an error that "MyCategoricalDomain" is not a subclass of Transformer.

I found these issues:
https://github.com/jpmml/sklearn2pmml/issues/20
https://github.com/jpmml/jpmml-sklearn/issues/30

And this:
https://github.com/jpmml/sklearn2pmml-plugin

Before I dive into writing my own Java wrapper for my custom Transformer, is there an easier way to accomplish missing/invalid treatment with a constant coefficient value replacement?

Thanks!

Villu Ruusmann

Apr 20, 2017, 4:06:35 AM
to Java PMML API
Hi Dan,

>
> My goal: For any inputted missing/invalid value,
> ignore the value by using coefficient equal to 0.0 for
> that categorical variable (logistic regression).
>

You can map missing/invalid input values to a default value using
MiningField "decoration" attributes @missingValueTreatment,
@invalidValueTreatment and @missingValueReplacement:

<MiningSchema>
    <MiningField name="first_token"
        missingValueTreatment="asValue"
        invalidValueTreatment="asMissing"
        missingValueReplacement="DEFAULT"
    />
</MiningSchema>

In the current PMML schema version the
MiningField@missingValueTreatment attribute is effectively no-op, as
it's only supposed to indicate how the value of the
MiningField@missingValueReplacement attribute was "derived". So, in my
example, I'm setting it to "asValue" to indicate that this value was
chosen pretty much arbitrarily. The "asIs" would be functionally
identical.
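
The effect of these attributes on a single input value can be sketched in plain Python. This is only an illustration of the semantics described above, not JPMML code; the function name prepare_input and the set of valid values (taken from the DataField in the question, without "DEFAULT") are assumptions.

```python
def prepare_input(value, valid_values, missing_value_replacement):
    """Mimic invalidValueTreatment="asMissing" followed by missingValueReplacement."""
    # Step 1: a value that is not a declared category is treated as missing.
    if value is not None and value not in valid_values:
        value = None
    # Step 2: a missing value is replaced with the configured constant.
    if value is None:
        value = missing_value_replacement
    return value


# Categories as declared in the DataField element of the question.
VALID_TOKENS = {"token1", "token2", "token3"}

for raw in ("token2", "never_seen_before", None):
    print(repr(raw), "->", prepare_input(raw, VALID_TOKENS, "DEFAULT"))
```

Both the unseen category and the missing value end up as "DEFAULT", while declared categories pass through unchanged.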

The PMML specification doesn't clarify if the replacement value
"DEFAULT" must be listed in the corresponding DataField element (as a
valid value option) or not. If you do have an easy mechanism for
updating the contents of the DataDictionary element, then you may do
so, but it's not required.

The same goes for editing the contents of the RegressionTable element.
There is no need to insert a no-op CategoricalPredictor element for
the replacement value "DEFAULT":
<RegressionTable>
    <CategoricalPredictor name="first_token" value="DEFAULT" coefficient="0"/>
</RegressionTable>

For example, if you apply Scikit-Learn's feature selection method to a
categorical column (eg. sequence [CategoricalDomain(),
LabelBinarizer(), SelectKBest(k = 3)]), then the regression table
contains CategoricalPredictor elements only for the "chosen" category
levels.
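
To see why the no-op entry is redundant, here is a small sketch of how the linear part of the regression table is evaluated, using the intercept and coefficients from the question. The function name linear_score is hypothetical, and the sketch stops before any normalization step.

```python
# Intercept and coefficients copied from the RegressionTable in the question.
INTERCEPT = -0.16989516996537837
COEFFICIENTS = {
    "token1": -0.24977558505178424,
    "token2": -0.2936699428921362,
    "token3": 0.3735503579785421,
}


def linear_score(first_token):
    """Intercept plus the coefficient of the matching category, if there is one."""
    # A category without a CategoricalPredictor entry contributes nothing,
    # which is the same as an explicit entry with coefficient="0".
    return INTERCEPT + COEFFICIENTS.get(first_token, 0.0)


for token in ("token1", "token3", "DEFAULT"):
    print(token, linear_score(token))
```

For "DEFAULT" the score is just the intercept, which is the behaviour the manually inserted CategoricalPredictor was meant to achieve.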

>
> Here is my PMML export code:
> from sklearn_pandas import DataFrameMapper
> from sklearn2pmml import PMMLPipeline
> from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
> from sklearn.preprocessing import LabelBinarizer
> from sklearn.linear_model.logistic import LogisticRegression
> mapper = DataFrameMapper([('first_token',[CategoricalDomain(),LabelBinarizer()])])
> model = LogisticRegression()
> p = PMMLPipeline([('mapper',mapper), ('classifier',model)])
> p.fit(df,df['LABEL'])
> sklearn2pmml(p, 'my_pmml.pmml')
>

Classes CategoricalDomain and ContinuousDomain inherit from
sklearn2pmml.decoration.Domain, which provides means for customizing
MiningField decoration attributes.

Your use case would be handled like this:
mapper = DataFrameMapper([
('first_token', [CategoricalDomain(missing_value_treatment =
"as_value", invalid_value_treatment = "as_missing",
missing_value_replacement = "DEFAULT"), LabelBinarizer()])
])

Please note that attribute values must be specified in Python style
("as_value"), not in PMML style ("asValue"). This looks like a bad
decision, and I'm going to enable PMML-style aliases in the next
version of sklearn2pmml.
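
Until such aliases exist, the conversion between the two spellings is mechanical. The helper below is a hypothetical convenience function, not part of sklearn2pmml; it only assumes that the Python-style value is the snake_case form of the PMML camelCase value, as in the "asValue"/"as_value" example above.

```python
import re


def to_python_style(pmml_value):
    """Convert a PMML-style camelCase value to snake_case, e.g. for Domain arguments."""
    # Insert an underscore at every lowercase-to-uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", pmml_value).lower()


for pmml_value in ("asValue", "asMissing", "asIs"):
    print(pmml_value, "->", to_python_style(pmml_value))
```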

>
> My first thought was to use the Imputer, but it only allows to
> choose "mean", "median", and "most_frequent", not a constant value.
>

The sklearn2pmml.decoration.Domain implements missing value
replacement logic in its #transform(X) method:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L26-L32

So, if you apply CategoricalDomain(missing_value_replacement =
"DEFAULT") to a data column that contains missing values, they should
be replaced with "DEFAULT" values. There's no need for missing value
replacement using Imputer anymore.

> My second thought was to write my own Transformer and add the "DEFAULT" value:
> class MyCategoricalDomain(CategoricalDomain):
> def __init__(self, **kwargs):
> CategoricalDomain.__init__(self, **kwargs)
> def fit(self, X, y=None):
> CategoricalDomain.fit(self, X, y)
> self.data_ = np.append(self.data_, 'DEFAULT')
> return self
>
> But then I got an error that "MyCategoricalDomain" is not a subclass of Transformer.
>

The Python class hierarchy (ie. class MyCategoricalDomain is a
subclass of CategoricalDomain) is not properly reflected on the Java
side (this is a fundamental limitation, which doesn't have an easy
fix/workaround). So, you would need to instruct
JPMML-SkLearn/SkLearn2PMML specifically about the fact that class
MyCategoricalDomain is a subclass of sklearn.Transformer:

1) Create Java class: class MyCategoricalDomain extends
sklearn2pmml.decoration.CategoricalDomain { .. }
2) Register this newly created Java class with JPMML-SkLearn runtime
by mentioning it in the META-INF/sklearn2pmml.properties resource
file.

This is exactly the same procedure as developing and deploying custom
transformers with the SkLearn2PMML-Plugin project.


VR

Dan

Apr 20, 2017, 5:42:21 AM
to Java PMML API
> Your use case would be handled like this:
> mapper = DataFrameMapper([
> ('first_token', [CategoricalDomain(missing_value_treatment =
> "as_value", invalid_value_treatment = "as_missing",
> missing_value_replacement = "DEFAULT"), LabelBinarizer()])
> ])

Ahh, my original question was supposed to include the specification of the missing value treatment, but the point you made is that there is no need to add the "DEFAULT" value to the DataDictionary or to add a CategoricalPredictor entry for it. And it will still work because 0 is used as the default coefficient!

Thanks Villu, you are the king of fast responses!

Just a follow-up question in the same spirit. I would like to change the valid range for a continuous variable (called "leftMargin" and "rightMargin" attributes in the PMML). In training, suppose I only observe values in the range [0.2,0.8], but in inference I may encounter anything in range [0,1].

Original:
<DataField name="ILR" optype="continuous" dataType="double">
    <Interval closure="closedClosed" leftMargin="0.2" rightMargin="0.8"/>
</DataField>

My manual change:
<DataField name="ILR" optype="continuous" dataType="double">
    <Interval closure="closedClosed" leftMargin="0.0" rightMargin="1.0"/>
</DataField>

I would like to call:
ContinuousDomain(min=0.0, max=1.0) but there is no such option.

If I were to extend the transformer myself:
class MyContinuousDomain(ContinuousDomain):
    def __init__(self, **kwargs):
        ContinuousDomain.__init__(self, **kwargs)

    def fit(self, X, y = None):
        self.data_min_ = 0.0
        self.data_max_ = 1.0
        return self

Again, is there any way to do this in native sklearn2pmml without implementing and registering the Java class?

Villu Ruusmann

Apr 20, 2017, 6:13:01 AM
to Java PMML API
Hi Dan,

>
> I would like to change the valid range for a continuous variable
> (called "leftMargin" and "rightMargin" attributes in the PMML).
> In training, suppose I only observe values in the range [0.2,0.8],
> but in inference I may encounter anything in range [0,1].
>
> I would like to call:
> ContinuousDomain(min=0.0, max=1.0) but there is no such option.
>

That's an interesting use case, and it should be fairly common in
real-life work.

The behaviour of ContinuousDomain should be modified so that if the
user has specified explicit "min" and/or "max" arguments, then they
should not be overridden with data-based values. Furthermore, the
ContinuousDomain#transform(X) method should apply this [min, max]
restriction to incoming data frames/data matrices as well, so that
Scikit-Learn and PMML behaviours would be identical.
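
A minimal sketch of such a [min, max] check, in plain Python. The helper name within_bounds and the decision to return one flag per value are assumptions for illustration; this is not how sklearn2pmml implements (or will implement) the restriction.

```python
def within_bounds(values, data_min, data_max):
    """Return one flag per value: True when data_min <= value <= data_max."""
    # Both ends are inclusive, matching closure="closedClosed" in the PMML Interval.
    return [data_min <= value <= data_max for value in values]


# With the widened range [0, 1], only the last value falls outside.
flags = within_bounds([0.0, 0.5, 0.8, 1.0, 1.2], data_min=0.0, data_max=1.0)
print(flags)
```

Values flagged False would be the ones that the PMML side treats as invalid.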

As a temporary workaround, you can manually modify #data_min_ and
#data_max_ attributes:

# THIS: keep a reference to the ContinuousDomain instance that you
# want to manipulate
cont_domain = ContinuousDomain()

mapper = DataFrameMapper([
    ("cont_column", cont_domain)
])
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", ..)
])
pipeline.fit(X, y)

# Pipeline#fit() initializes cont_domain#data_min_ and #data_max_
# attributes. However, we will override those values with more
# appropriate values
cont_domain.data_min_ = 0
cont_domain.data_max_ = 1

# Now, the sklearn2pmml() method will see modified bounds:
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
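
If the PMML file has already been generated, the manual edit from the question can also be scripted with the Python standard library. This is a rough sketch, not part of sklearn2pmml: the function name widen_interval is hypothetical, and the sample document (including its PMML 4.3 namespace) is a cut-down stand-in for a real export.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def widen_interval(pmml_text, field_name, left, right):
    """Overwrite leftMargin/rightMargin of the Interval inside the named DataField."""
    root = ET.fromstring(pmml_text)
    for element in root.iter():
        if local_name(element.tag) != "DataField" or element.get("name") != field_name:
            continue
        for child in element:
            if local_name(child.tag) == "Interval":
                child.set("leftMargin", str(left))
                child.set("rightMargin", str(right))
    return ET.tostring(root, encoding="unicode")


# Cut-down sample document; a real export contains much more.
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
<DataDictionary>
<DataField name="ILR" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0.2" rightMargin="0.8"/>
</DataField>
</DataDictionary>
</PMML>"""

patched = widen_interval(sample, "ILR", 0.0, 1.0)
print(patched)
```

Overriding data_min_ and data_max_ before calling sklearn2pmml() remains the simpler route, since it avoids touching the generated file at all.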


VR

Dan

Apr 23, 2017, 3:35:56 AM
to Java PMML API
Thanks Villu, that worked perfectly!