Hi Akshay,
>
> Let's say I am having features A - Numeric, B- Categoric, C-Numeric.
> I am defining and doing missing value replacement for A, B & C in first step.
> In next step I am using B to create a new binned (binary numeric)
> variable with alias D, and I am able to do that.
What is your functional definition of "step"? Is it a Scikit-Learn
pipeline step?
If so, then perhaps you're using too many steps already. Both
DataFrameMapper and ColumnTransformer allow you to apply any number of
decorators/transformers/selectors to a single column, so your pipeline
should be reduced to a two-step one:
pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    ("A", ContinuousDomain(missing_value_replacement = 0)),
    ("B", [ContinuousDomain(missing_value_replacement = 0),
      ExpressionTransformer("numpy.log(X[0])"), CutTransformer(..)]), # THIS
    ("C", [CategoricalDomain(missing_value_replacement = "(unknown)"),
      OneHotEncoder()])
  ])),
  ("classifier", XGBClassifier())
])
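If you want to play with the same idea without the sklearn2pmml/sklearn_pandas layer, here is a rough pure scikit-learn analogue. SimpleImputer stands in for the Continuous/CategoricalDomain decorators, and FunctionTransformer plus KBinsDiscretizer stand in for ExpressionTransformer plus CutTransformer; the column contents are made up for illustration:

```python
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder

# Toy data frame; values are made up for illustration
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [1.0, 10.0, 100.0, 1000.0],
    "C": ["x", "y", "x", np.nan],
})

# Any number of transformers can be chained per column
mapper = ColumnTransformer(transformers = [
    ("A", SimpleImputer(strategy = "constant", fill_value = 0), ["A"]),
    ("B", make_pipeline(
        SimpleImputer(strategy = "constant", fill_value = 1),
        FunctionTransformer(np.log),
        KBinsDiscretizer(n_bins = 2, encode = "ordinal", strategy = "uniform"),
    ), ["B"]),
    ("C", make_pipeline(
        SimpleImputer(strategy = "constant", fill_value = "(unknown)"),
        OneHotEncoder(),
    ), ["C"]),
], sparse_threshold = 0)

Xt = mapper.fit_transform(df)
# One column from A, one binned column from B,
# and one indicator column per category level of C
```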
> When I am attaching the XGBoost Classifier at the
> end of pipeline, all the features (along with the B, as we
> have already defined it for missing value replacement initially,
> it is available in pipeline as a feature too) automatically get pass to it.
In the above pipeline, after doing DataFrameMapper.fit_transform(X)
(as part of PMMLPipeline.fit_transform(X)), you would be getting a new
dataframe, which contains column A, but does not contain original
columns B or C.
The reason is that column A stays as it is - apart from the missing
value replacement, ContinuousDomain is a pass-through transformer.
Column B gets transformed into a single new column
"cut(numpy.log(B))", and column C gets expanded into a list of binary
indicator columns (one for each category level).
To repeat, the XGBClassifier does not see original columns B or C in
the above case, but their transformation results.
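The "replacement, not appending" behaviour can be seen in isolation with plain scikit-learn (the category values here are made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with two category levels
C = np.array([["red"], ["green"], ["red"]])

enc = OneHotEncoder()
Ct = enc.fit_transform(C).toarray()
# The original column is gone; in its place there is one 0/1 indicator
# column per category level ("green" and "red")
```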
> I want it to be filtered and only A, C & D should
> get pass to the final classifier. Any ways to do so ?
>
You can filter out arbitrary columns by inserting an extra
ColumnTransformer into the pipeline.
You need to know the name(s) or index(es) of the column(s) that you
wish to filter out. In multi-step pipelines column names often get
lost, so I'd suggest experimenting with index-based filtering first.
To make things simpler for yourself, try rearranging the columns in
the initial DataFrameMapper step so that there is some
"predictability" to the structure of the transformed data frame (as
produced by DataFrameMapper.fit_transform(X)).
For example, suppose I want to exclude the second column
(corresponding to "cut(numpy.log(B))") from the dataset:
pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper(..)),
  ("feature_selector", ColumnTransformer(transformers =
    [("drop_second_col", "drop", [1])], remainder = "passthrough")), # THIS
  ("classifier", XGBClassifier(..))
])
I haven't tested if the above ColumnTransformer is actually runnable.
It's supposed to demonstrate the main idea of using ColumnTransformer
for data frame reshaping - specify "drop" for those columns that you
want to filter out, and "passthrough" for those that you want to keep.
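For what it's worth, the drop/passthrough mechanics can be checked in isolation on a toy array (made up for illustration):

```python
import numpy as np
from sklearn.compose import ColumnTransformer

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0]])

# Drop the second column (index 1); pass everything else through unchanged
ct = ColumnTransformer(transformers = [("drop_second_col", "drop", [1])],
    remainder = "passthrough")
Xt = ct.fit_transform(X)
# Xt contains the first and third columns only, in their original order
```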
VR