Standard scaler for features in evaluate


Dave Hagler

Jul 17, 2023, 6:23:00 PM
to Java PMML API
I have a PMML file that was generated with sklearn2pmml, and I can load and evaluate the model in Java. The original model pipeline used StandardScaler and LogisticRegression. I can see in the XML file that there are DerivedFields with names like "standardScaler". My question is: do the input fields need to be transformed with the same standard scaling, or does evaluate take care of it automatically?

Villu Ruusmann

Jul 18, 2023, 1:00:08 AM
to Java PMML API
Hi Dave,

> My question is do the input fields need to be
> transformed with the same standard scaling or
> does evaluate automatically take care of it?
>

A PMML model expects untransformed input.

In Scikit-Learn terms, the input arguments map for the
Evaluator#evaluate(Map) method should be populated with values
according to the "real-life" data schema - the same level of data that
would normally be passed to the Pipeline.fit(X, y) or
Pipeline.predict(X) methods.
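To make this concrete, here is a minimal sketch of building such an arguments map with raw, unscaled values. The field names ("Sepal.Length", etc.) are hypothetical examples - use whatever names your model's #getInputFields() reports; the actual Evaluator#evaluate call from jpmml-evaluator is only referenced in a comment so the sketch stays self-contained:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawArguments {

    // Populate the input arguments map with raw, untransformed values -
    // the same level of data that Pipeline.predict(X) would receive.
    // Field names here are hypothetical; query Evaluator#getInputFields()
    // for the names that your own PMML model expects.
    public static Map<String, Object> buildArguments() {
        Map<String, Object> arguments = new LinkedHashMap<>();
        arguments.put("Sepal.Length", 5.1d); // raw measurement, NOT standard-scaled
        arguments.put("Sepal.Width", 3.5d);
        arguments.put("Petal.Length", 1.4d);
        arguments.put("Petal.Width", 0.2d);
        return arguments;
    }
}
// This map is what you would pass to Evaluator#evaluate(arguments);
// the standardScaler DerivedFields are applied internally.
```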

PMML handles any and all data transformations for you automatically.
They are considered an internal implementation detail.
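For intuition, what a "standardScaler" DerivedField computes internally is equivalent to z = (x - mean) / std. A minimal sketch, with hypothetical per-feature statistics (in a real PMML file these constants are baked into the DerivedField expression by sklearn2pmml at export time):

```java
public class StandardScalerDemo {

    // Hypothetical statistics learned at Pipeline.fit(X, y) time;
    // the real values live inside the PMML DerivedField expression.
    static final double MEAN = 5.843;
    static final double STD = 0.828;

    // Equivalent of what the "standardScaler(...)" DerivedField computes.
    static double scale(double x) {
        return (x - MEAN) / STD;
    }
}
```

You never call this yourself - the point is that the evaluator derives the scaled value from your raw input on every evaluate() call.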

You can query the PMML model schema using Evaluator#getInputFields()
and #getTargetFields()/#getOutputFields() methods.

In your case, a query to #getInputFields() will return the names of
the untransformed input fields (e.g. "Sepal.Length"), which means that
this is exactly what you should be feeding to the model during its
deployment.

Now, thinking about it, the Evaluator object does not have any public
API methods that would expose the identities of internal transformed
fields (i.e. there is no Evaluator#getDerivedFields() method). Perhaps
there should be... A potential use case could be distinguishing
between models that belong to the "uses direct data" vs. "uses
pre-processed data" categories.


VR