Manual feature selection support in SKLearn2PMML

Akshay Tilekar

Jul 15, 2020, 10:50:15 AM
to Java PMML API
I am using SKLearn2PMML to generate PMML for an XGBoost model, and I want to implement a use case like the one below:

1. Pass numerical and categorical features to the pipeline.
2. Perform binning on the categorical features using ExpressionTransformer.
3. Train the model using only the binned features along with the other numerical features, excluding the old categorical features that were used for binning.

In short, I want to explicitly specify which features are passed to the classifier (i.e. remove the old categorical features and pass only the numerical ones).

Is there a way to do this?

Villu Ruusmann

Jul 15, 2020, 3:34:18 PM
to Java PMML API
Hi Akshay,

> 1. Pass numerical and categorical features to the pipeline.

You'd need to start your pipeline with a meta-transformer (aka
"mapper"), which selects and transforms columns one by one.

Latest Scikit-Learn versions include sklearn.compose.ColumnTransformer for this:
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

However, for simpler workflows, I'd suggest using
sklearn_pandas.DataFrameMapper:
https://github.com/scikit-learn-contrib/sklearn-pandas/blob/1.1.0/sklearn_pandas/dataframe_mapper.py#L30-L132

Please consult their API docs and examples.
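
For illustration, a minimal sketch of the mapper approach (the column
names and transformers here are hypothetical):

    from sklearn_pandas import DataFrameMapper
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Select and transform columns one by one
    mapper = DataFrameMapper([
        (["num_col"], StandardScaler()),
        (["cat_col"], OneHotEncoder())
    ])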

> 2. Perform binning on the categorical features using ExpressionTransformer.
>

The sklearn2pmml.preprocessing.ExpressionTransformer lets you perform
binning manually, using nested if-else expressions:

manual_binner = ExpressionTransformer("'cat_A' if X[0] < 0 else 'cat_B'")
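
A nested variant with three output categories might look like this (the
thresholds are made up):

    nested_binner = ExpressionTransformer("'cat_A' if X[0] < 0 else ('cat_B' if X[0] < 10 else 'cat_C')")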

The PMML specification provides a Discretize element for expressing
the binning business logic:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_Discretize

If your binning needs are non-trivial, then you should consider
switching from ExpressionTransformer to
sklearn2pmml.preprocessing.CutTransformer, which creates
fully-featured Discretize elements.

The CutTransformer API is identical to the pandas.cut() function:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

For example, splitting at zero into two categories:

pandas_cut_binner = CutTransformer(bins = [float("-inf"), 0, float("+inf")], labels = ["cat_A", "cat_B"])

With CutTransformer you can precompute bin thresholds using arbitrary
Numpy functions. For example, using numpy.nanquantile,
numpy.nanpercentile, or any other NaN-aware utility function.
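
For example, a sketch that derives tercile edges from a hypothetical
training column df["cont_column"] and feeds them to CutTransformer:

    import numpy

    from sklearn2pmml.preprocessing import CutTransformer

    # Compute bin edges from the training data, ignoring missing values
    edges = numpy.nanquantile(df["cont_column"], [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
    quantile_binner = CutTransformer(bins = edges, labels = ["low", "medium", "high"])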

> 3. Train the model using only the binned features along
> with the other numerical features, excluding the old categorical
> features that were used for binning.
>

ColumnTransformer and DataFrameMapper allow you to pass columns through
as-is (untransformed), transformed, or both.

mapper = DataFrameMapper([
    ("cont_column_asis", None),
    ("cont_column_binned", CutTransformer(...))
])

If you remove the first mapping ("cont_column_asis"), then your model
will only see the transformed continuous feature
("cont_column_binned").

I suggest you read more about those mapper meta-transformers. All your
requested functionality is already available there, and the
SkLearn2PMML package can convert those fitted mappers into the PMML
representation really easily.
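
For reference, once the pipeline is fitted, the export itself is a
one-liner (the file name here is arbitrary):

    from sklearn2pmml import sklearn2pmml

    # X and y stand for your training data and labels
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, "pipeline.pmml")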


VR

Akshay Tilekar

Jul 20, 2020, 12:19:49 PM
to Java PMML API
Thanks for the reply and clarification, Villu.

I have already implemented all the necessary steps. The problem I am facing is with filtering the final set of features.
My scenario is as follows:

Let's say I have features A (numeric), B (categoric) and C (numeric). In the first step I define them and perform missing value replacement for A, B and C.
In the next step I use B to create a new binned (binary numeric) variable with the alias D, and I am able to do that.
When I attach the XGBoost classifier at the end of the pipeline, all the features automatically get passed to it (including B, which is available in the pipeline as a feature because we already defined it for missing value replacement initially). I want this set to be filtered so that only A, C and D get passed to the final classifier. Is there any way to do so?

Villu Ruusmann

Jul 20, 2020, 5:42:37 PM
to Java PMML API
Hi Akshay,
>
> Let's say I have features A (numeric), B (categoric) and C (numeric).
> In the first step I define them and perform missing value replacement for A, B and C.
> In the next step I use B to create a new binned (binary numeric)
> variable with the alias D, and I am able to do that.

What is your functional definition of "step"? Is it a Scikit-Learn
pipeline step?

If so, then perhaps you're using too many steps already. Both
DataFrameMapper and ColumnTransformer allow you to apply any number of
decorators/transformers/selectors to a single column, so your pipeline
should be reduced to a two-step one:

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("A", ContinuousDomain(missing_value_replacement = 0)),
        ("B", [ContinuousDomain(missing_value_replacement = 0),
            ExpressionTransformer("numpy.log(X[0])"), CutTransformer(..)]), # THIS
        ("C", [CategoricalDomain(missing_value_replacement = "(unknown)"),
            OneHotEncoder()])
    ])),
    ("classifier", XGBClassifier())
])

> When I attach the XGBoost classifier at the end of the
> pipeline, all the features automatically get passed to it
> (including B, which is available in the pipeline as a feature
> because we already defined it for missing value replacement initially).

In the above pipeline, after doing DataFrameMapper.fit_transform(X)
(as part of PMMLPipeline.fit_transform(X)), you would be getting a new
dataframe, which contains column A, but does not contain original
columns B or C.

The reason is that column A stays as it was - it's only filtered
through ContinuousDomain, which is a no-op transformer. Column B gets
transformed to a single new column "cut(numpy.log(B))", and column C
gets transformed to a list of binary indicator columns (one for each
category level).

To repeat, the XGBClassifier does not see original columns B or C in
the above case, but their transformation results.
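
You can verify this by running the mapper step in isolation (X stands
for your training data frame):

    # Fit-transform only the mapper step, and inspect the resulting matrix
    Xt = pipeline.named_steps["mapper"].fit_transform(X)
    print(Xt.shape)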

> I want this set to be filtered so that only A, C and D
> get passed to the final classifier. Is there any way to do so?
>

You can filter out arbitrary columns by inserting an extra
ColumnTransformer into the pipeline.

You need to know the name(s) or the index(es) of the column(s) that
you wish to filter out. In multi-step pipelines column names often get
lost, so I'd suggest you experiment with index-based filtering
first. To make things simpler for yourself, try to rearrange columns
in the initial DataFrameMapper step so that there is some
"predictability" to the structure of the transformed data frame (that
is produced by DataFrameMapper.fit_transform(X)).

For example, suppose I want to exclude the second column
(corresponding to "cut(numpy.log(B))") from the dataset:

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper(..)),
    ("feature_selector", ColumnTransformer(transformers =
        [("drop_second_col", "drop", [1])], remainder = "passthrough")), # THIS
    ("classifier", XGBClassifier(..))
])

I haven't tested if the above ColumnTransformer is actually runnable.
It's supposed to demonstrate the main idea of using ColumnTransformer
for data frame reshaping - specify "drop" for those columns that you
want to filter out, and "passthrough" for those that you want to keep.
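
For a quick sanity check of the idea on a toy matrix (independent of
the pipeline above):

    import numpy

    from sklearn.compose import ColumnTransformer

    X = numpy.asarray([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    selector = ColumnTransformer(transformers = [("drop_second_col", "drop", [1])], remainder = "passthrough")
    # Prints [[1. 3.] [4. 6.]] - the second column has been dropped
    print(selector.fit_transform(X))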


VR

Akshay Tilekar

Jul 23, 2020, 1:32:51 AM
to Java PMML API
Thanks Villu.
I guess dropping by column names isn't supported, as the data frame implicitly becomes a numpy array. Index-based dropping worked for me:

[("drop_cols", "drop", [1])], remainder = "passthrough")) 


Thanks again.
 
