Data Preparation in PMML Pipeline

Ivan Nardini

Feb 2, 2019, 4:38:43 AM
to Java PMML API
Hi PMML guys,

I'm writing because I am testing the data preprocessing and feature engineering capabilities.

And I have a question:

Is the pipeline able to handle date fields? And if so, is it possible to engineer date differences in days?

Best,

Ivan

Villu Ruusmann

Feb 2, 2019, 5:47:26 AM
to Java PMML API
Hi Ivan,

>
> Is the pipeline able to handle date fields?
>

First of all, what's your target ML framework? The term "pipeline"
suggests that it's either Scikit-Learn or Apache Spark.

The capabilities of JPMML converter libraries are listed in their
repository README files:
1) https://github.com/jpmml/jpmml-sklearn#features
2) https://github.com/jpmml/jpmml-sparkml#features

> And if so, is it possible to engineer date differences in days?
>

Scikit-Learn doesn't provide built-in date/datetime functionality. But
the latest version of the SkLearn2PMML package
(https://github.com/jpmml/sklearn2pmml) does provide the following
decorator and transformer classes:

1) sklearn2pmml.decoration.DateDomain
2) sklearn2pmml.decoration.DateTimeDomain
3) sklearn2pmml.preprocessing.DaysSinceYearTransformer
4) sklearn2pmml.preprocessing.SecondsSinceYearTransformer

Putting them all together:
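from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import Alias, DateDomain
from sklearn2pmml.preprocessing import DaysSinceYearTransformer, ExpressionTransformer

# Parse both columns as dates, convert each to days since 2019,
# then subtract to get the duration in days
mapper = DataFrameMapper([
    (["begin_date", "end_date"], [
        DateDomain(),
        DaysSinceYearTransformer(year = 2019),
        Alias(ExpressionTransformer("X[1] - X[0]"), "duration_in_days", prefit = True)
    ])
])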
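
To make the arithmetic behind the mapper concrete, here is a minimal standard-library sketch. It assumes that DaysSinceYearTransformer counts whole days from January 1 of the given year; the helper days_since_year and the sample dates are hypothetical and only illustrate the idea, they are not part of SkLearn2PMML.

```python
from datetime import date

def days_since_year(iso_date: str, year: int) -> int:
    """Whole days between January 1 of `year` and the given ISO 8601 date."""
    return (date.fromisoformat(iso_date) - date(year, 1, 1)).days

# Hypothetical sample values for the two columns
begin = days_since_year("2019-01-15", 2019)  # 14
end = days_since_year("2019-02-02", 2019)    # 32

# Equivalent of the "X[1] - X[0]" expression in the mapper
duration_in_days = end - begin               # 18
```

Because both columns share the same reference year, the reference cancels out in the subtraction, so any year would give the same duration.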


VR

Ivan Nardini

Feb 2, 2019, 7:21:38 AM
to Java PMML API

Thanks Villu.

Just for feedback:

a) DateDomain() sorted the date field by time. Is that correct?

and

b) In your opinion, is it right to map all variables first and then implement the preprocessing and feature engineering steps, such as date differences, labelling, and missing-value imputation (for example, replacing NaNs with an "Others" category, or replacing NaN (not a number) and infinity values with zero)?

Best,

Ivan

Villu Ruusmann

Feb 2, 2019, 7:47:53 AM
to Java PMML API
Hi Ivan,

>
> a) DateDomain() sorted the date field by time. Is that correct?
>

The DateDomain decorator class takes a 'string' input column and
converts it to a 'datetime' column using the 'pandas.to_datetime()'
function. So, you should make sure that your string input values
roughly follow the ISO 8601 standard (eg. "2019-02-02").

The column is not sorted or rearranged in any other way. You should
view DateDomain and DateTimeDomain as dummy parser components.
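
As a rough illustration of that parsing step, the sketch below calls pandas.to_datetime() directly on a few hypothetical ISO 8601 strings. It is not the DateDomain implementation, only the pandas call mentioned above; it shows that values are converted to datetimes while the row order stays as it was.

```python
import pandas as pd

# Hypothetical, deliberately unordered ISO 8601 date strings
raw = pd.Series(["2019-02-02", "2019-01-15", "2019-03-01"])

# Convert strings to datetime values; rows are not reordered
parsed = pd.to_datetime(raw)

is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)
order_kept = parsed.dt.strftime("%Y-%m-%d").tolist() == raw.tolist()
```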

>
> b) In your opinion, is it right to map all variables first
> and then implement the preprocessing and feature engineering steps?
>

I'm not a data scientist, so my opinion doesn't count all that much.

The main point is doing everything inside the (PMML)Pipeline class.
Depending on the complexity of your workflow, it may be necessary to
involve multiple DataFrameMapper steps (combined using the
FeatureUnion transformer).

Of course, the logical sequence of column processing steps is
1) Decoration (eg. DateDomain)
2) Imputation (eg. SimpleImputer)
3) Transformation (eg. DaysSinceYearTransformer followed by
ExpressionTransformer)
4) Selection (eg. GenericUnivariateSelect)
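
The ordering can be sketched with plain Scikit-Learn classes. This is only an illustration under assumptions: the toy data, the step names, the StandardScaler stand-in for the transformation step, and the LogisticRegression estimator are all hypothetical, and the decoration step is left out because it is specific to SkLearn2PMML.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three numeric features, one missing value
X = np.array([
    [1.0, 10.0, 0.5],
    [2.0, np.nan, 0.4],
    [3.0, 30.0, 0.6],
    [4.0, 40.0, 0.5],
    [5.0, 50.0, 0.4],
    [6.0, 60.0, 0.6],
])
y = np.array([0, 0, 0, 1, 1, 1])

pipeline = Pipeline([
    # 2) Imputation: fill the missing value with the column median
    ("impute", SimpleImputer(strategy="median")),
    # 3) Transformation: stand-in for any feature engineering step
    ("transform", StandardScaler()),
    # 4) Selection: keep the two features with the highest F-score
    ("select", GenericUnivariateSelect(f_classif, mode="k_best", param=2)),
    # Final estimator
    ("classify", LogisticRegression()),
])
pipeline.fit(X, y)

# Number of columns that survive the selection step
n_selected = pipeline[:-1].transform(X).shape[1]
```

Keeping every step inside one pipeline object means the same sequence is applied at fit time and at prediction time, which is the point made above about doing everything inside the (PMML)Pipeline class.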


VR