how to do missing value treatment and label encoding together for categorical variable in sklearn2pm


Jiby Babu

Oct 21, 2016, 11:53:37 AM
to Java PMML API
Hi,
I would like to do 2 sklearn transformation functions while converting to pmml.
For example,
for feature in features:
    if df[feature].dtype in ['int', 'float64']:
        feature_type_tuple_list.append(([feature], sklearn.preprocessing.Imputer(strategy='median')))
    else:
        feature_type_tuple_list.append((feature, [sklearn.preprocessing.Imputer(strategy='most_frequent'), sklearn.preprocessing.LabelEncoder()]))

# Appending the response variable to the mapper
feature_type_tuple_list.append(('target', None))

# Passing the feature and the associated transformations to the dataframe mapper
jobs_mapper = DataFrameMapper(feature_type_tuple_list)

# Fit the mapper
jobs_mapper.fit_transform(df[features+['target']])


It's returning the following error:
Traceback (most recent call last):
File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-63-39ac36018d2e>", line 1, in <module>
jobs_mapper.fit_transform(df[features+['target']])
File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/local/lib/python2.7/site-packages/sklearn_pandas/dataframe_mapper.py", line 97, in fit
transformers.fit(self._get_col_subset(X, columns))
File "/usr/local/lib/python2.7/site-packages/sklearn_pandas/pipeline.py", line 49, in fit
Xt, fit_params = self._pre_transform(X, **fit_params)
File "/usr/local/lib/python2.7/site-packages/sklearn_pandas/pipeline.py", line 42, in _pre_transform
Xt = transform.fit_transform(Xt, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/Library/Python/2.7/site-packages/sklearn/preprocessing/imputation.py", line 156, in fit
force_all_finite=False)
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: california

Any idea what is the right way to do it?

Thanks,
Jiby

Jiby Babu

Oct 21, 2016, 11:59:54 AM
to Java PMML API

More specifically: for categorical variables, I want to apply a label encoder and impute with the mode if the value is missing.

Villu Ruusmann

Oct 21, 2016, 2:25:15 PM
to Java PMML API
Hi Jiby,

> feature_type_tuple_list.append((feature, [sklearn.preprocessing.Imputer(strategy='most_frequent'), sklearn.preprocessing.LabelEncoder()]))
>
> It's returning the following error:
> Traceback (most recent call last):
> File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
> return self.fit(X, **fit_params).transform(X)
> File "/Library/Python/2.7/site-packages/sklearn/preprocessing/imputation.py", line 156, in fit
> force_all_finite=False)
> File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
> array = np.array(array, dtype=dtype, order=order, copy=copy)
> ValueError: could not convert string to float: california
>

GitHub is unavailable, so I'm unable to check the source code of the
Imputer class.

My hypothesis is that Imputer only works with numeric values. In your
workflow, you should apply LabelEncoder first and Imputer(strategy =
"most_frequent") second. However, quick googling shows that
LabelEncoder doesn't support missing values.

A possible workaround:
1) Replace missing values in matrix with "null" string constant.
2) Apply LabelEncoder.
3) Drop the "null" label from the fitted LabelEncoder, and replace
corresponding encoded values in matrix with NaN values.
4) Apply Imputer.

The above is a terrible hack, but it should get you past the error. Someone
else must have solved such a problem already, so try more googling.
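A minimal sketch of steps 1 through 4 above. It uses SimpleImputer, which replaced the since-removed sklearn.preprocessing.Imputer in scikit-learn 0.20; the "null" sentinel string and the example state names are placeholders, and the sketch masks the sentinel's code as NaN rather than literally dropping the label from the fitted LabelEncoder's classes_ array:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replacement for the old Imputer
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["california", "texas", None, "california"])

# 1) Replace missing values with a "null" string constant.
filled = s.fillna("null")

# 2) Apply LabelEncoder (float dtype so the array can hold NaN later).
le = LabelEncoder()
encoded = le.fit_transform(filled).astype(float)

# 3) Turn the sentinel's encoded value back into NaN.
null_code = le.transform(["null"])[0]
encoded[encoded == null_code] = np.nan

# 4) Apply the imputer with the "most_frequent" strategy.
imputer = SimpleImputer(strategy="most_frequent")
result = imputer.fit_transform(encoded.reshape(-1, 1))
# result is [[0.], [2.], [0.], [0.]] -- the NaN became the modal code 0
```

Note that the encoded-to-label mapping still contains the "null" class, so any decoding via le.inverse_transform() must account for it.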


VR

Jiby Babu

Oct 21, 2016, 2:49:29 PM
to Villu Ruusmann, Java PMML API
Thanks Villu; let me try that.

Also, right now, is it possible to pass custom transformers to DataFrameMapper and then fit it later, if we implement the respective one in sklearn.preprocessing?

Thanks,
Jiby

Villu Ruusmann

Oct 21, 2016, 5:23:45 PM
to Java PMML API
Hi Jiby,

>
> Also, right now, is it possible to pass custom transformers to
> DataFrameMapper and then fit it later, if we implement the respective one
> in sklearn.preprocessing?
>

The purpose of the DataFrameMapper object is to "explain" how to
compute the X data matrix (that is passed to estimator.fit(X, y) or
estimator.predict(X)) based on some real-world data source such as a
CSV file.

There are two possible workflows:
1) The "real thing" approach. Construct a DataFrameMapper object, and
invoke its fit_transform() method.
2) The "fake" approach. Construct a DataFrameMapper object based on
PRE-fitted transformer objects. Do not call its fit() or
fit_transform() methods after that, as doing so would overwrite the
existing state.

The sklearn2pmml package doesn't care if you give it a "real thing" or
a "fake" DFM object. If you want to experiment with custom transformer
classes (that are not yet recognized/supported by the JPMML-SkLearn
library), then you need to take the "fake" approach.
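A minimal sketch of the "fake" approach, assuming a hypothetical "state" column and category list; the DataFrameMapper construction itself is shown only in a comment, since it comes from the sklearn_pandas package:

```python
from sklearn.preprocessing import LabelEncoder

# Pre-fit the transformer up front, on the full set of known categories.
# This is the "fake" approach: fitting never happens via the mapper.
state_encoder = LabelEncoder()
state_encoder.fit(["california", "new york", "texas"])

# Hand the PRE-fitted object to DataFrameMapper, e.g.:
#     mapper = DataFrameMapper([("state", state_encoder)])
# and then only ever call mapper.transform(df). Calling fit() or
# fit_transform() on the mapper would re-fit the encoder and wipe
# the prepared state.
codes = state_encoder.transform(["texas", "california"])
```

The same pattern applies to any custom transformer class: as long as it is fitted before being handed to the mapper, sklearn2pmml sees a ready-to-use object.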

There's some background in these two GitHub issues:
https://github.com/jpmml/jpmml-sklearn/issues/3
https://github.com/jpmml/jpmml-sklearn/issues/12


VR