PMML Label Encoding With some function


Saad Syed

Aug 28, 2017, 6:27:30 AM
to Java PMML API
Hi Villu,

I have a field which I am preprocessing (converting to lower case) before passing it to LabelEncoder. I created an extension project for custom transformations, along the lines of your example.

My mapper looks like this:

mapper = DataFrameMapper([
    ('Column_A', [
        CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown"),
        CustomTransformFunctionGenerator(function='lowercase'),
        LabelEncoder()
    ]),
])

However, it throws an exception:

java.lang.IllegalArgumentException: lowercase(Column_A)

I looked around and figured that this might be because LabelEncoder will look for a data field named "lowercase(Column_A)", but the DataField element is created with the name "Column_A", hence the error.


To overcome this, I changed the order:

mapper = DataFrameMapper([
    ('Column_A', [
        CustomTransformFunctionGenerator(function='lowercase'),
        CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown"),
        LabelEncoder()
    ]),
])

so that the DataField markup is created with the correct name. But in this case I get the following exception:

org.jpmml.converter.ContinuousFeature cannot be cast to org.jpmml.converter.WildcardFeature
at sklearn2pmml.decoration.CategoricalDomain.encodeFeatures(CategoricalDomain.java:65)

My end objective is to apply some function to a categorical field and then pass it to Label Encoder.

I've only come here after exhausting all my ideas; hope it doesn't come across as spam.
Thanks.

Villu Ruusmann

Aug 28, 2017, 6:50:21 PM
to Java PMML API
Hi Saad,

>
> I have a field which I am preprocessing(converting to lower case ) before passing to Label Encoder
> mapper = DataFrameMapper([ ('Column_A', [
> CategoricalDomain(invalid_value_treatment="as_missing",missing_value_replacement="Unknown"),
> CustomTransformFunctionGenerator(function='lowercase'),
> LabelEncoder()]),
>

Let me explain how the JPMML-SkLearn library parses this construct.
PMML needs detailed feature information (eg. field name, data type,
operational type, valid and invalid value spaces etc.), which is not
available in Scikit-Learn pipelines. The only solution is to traverse
the list of pre-processing transformations, and extract/infer feature
information bit by bit.

The initial state of a feature is org.jpmml.converter.WildcardFeature.
The only bit of information that is known about a feature is its name:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn_pandas/DataFrameMapper.java#L68

The first step is to determine the operational type (ie. continuous or
categorical) of the feature. The easiest way to cast from the wildcard
optype to a specific optype is by applying a ContinuousDomain or
CategoricalDomain transformer.

CategoricalDomain requires the input feature to be
o.j.c.WildcardFeature, and it "casts" it to o.j.c.CategoricalFeature
by invoking WildcardFeature#toCategoricalFeature():
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/CategoricalDomain.java#L65
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/CategoricalDomain.java#L77

When the CategoricalDomain transformation completes, the PMML document
will contain a DataField element with name "Column_A".

Most of the time, the next step would be LabelEncoder, which
translates feature values from one value space (typically string) to
another (typically integer). The current LabelEncoder
converter is rather naive, and assumes that the input feature is
backed by a DataField element:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/preprocessing/LabelEncoder.java#L97
https://github.com/jpmml/jpmml-converter/blob/master/src/main/java/org/jpmml/converter/PMMLEncoder.java#L141-L163
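In plain Python, the value-space translation that LabelEncoder performs could be sketched like this (a conceptual illustration only, not the scikit-learn class itself):

```python
# Conceptual sketch of LabelEncoder's fitting step: each distinct
# string value is mapped to an integer index.
def fit_label_encoding(values):
    # scikit-learn sorts the distinct values before assigning indices
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}

encoding = fit_label_encoding(["Unknown", "abc", "abc", "xyz"])
# encoding == {"Unknown": 0, "abc": 1, "xyz": 2}
```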

However, this assumption is violated in your case, because the
intermediate CustomTransformFunction(function = "lowercase")
transformer has created an extra DerivedField element with name
"lowercase(Column_A)":
https://github.com/jpmml/jpmml-converter/blob/master/src/main/java/org/jpmml/converter/PMMLEncoder.java#L145

>
> However it throws me an exception :
> java.lang.IllegalArgumentException: lowercase(Column_A) ,
>
> I looked around and figured that this might be because of
> the transformation LabelEncoder will look for a data field of
> the name lowercase(Column_A), but the datafield tag is
> created with the name : Column_A , hence the error.
>

Exactly. The method o.j.c.PMMLEncoder#toCategorical(FieldName,
List<String>) should account for the possibility that the field name
refers to a DerivedField element, not a DataField element. Since the
DerivedField element has no place for keeping valid values
information, the processing should simply return the field as-is:

public Field toCategorical(FieldName name, List<String> values){
    Field field = getField(name);

    if(field instanceof DerivedField){
        // No-op
    } else

    if(field instanceof DataField){
        // Proceed as usual
    }

    return field;
}

Just opened a GitHub issue about it:
https://github.com/jpmml/jpmml-converter/issues/7

>
> To overcome this , I changed the order :
> mapper = DataFrameMapper([ ('Column_A', [
> CustomTransformFunctionGenerator(function='lowercase'),
> CategoricalDomain(invalid_value_treatment="as_missing",missing_value_replacement="Unknown"),
> LabelEncoder()]),
>
> org.jpmml.converter.ContinuousFeature cannot be cast to org.jpmml.converter.WildcardFeature
> at sklearn2pmml.decoration.CategoricalDomain.encodeFeatures(CategoricalDomain.java:65)
>

CategoricalDomain expects to be in the first position of the list of
transformers. In your example it is in the second position, so it is
being passed an incompatible feature type.

> My end objective is to apply some function to a
> categorical field and then pass it to Label Encoder.
>

Your best option is to wait until the issue jpmml/jpmml-converter#7
gets fixed, which could/should happen in the second half of this week.

In the meantime, you could create a subclass of SkLearnEncoder that
overrides the #toCategorical(FieldName, List<String>) method and
suppresses the resulting IllegalArgumentException:

public class FixedSkLearnEncoder extends SkLearnEncoder {

    @Override
    public DataField toCategorical(FieldName name, List<String> values){
        try {
            return super.toCategorical(name, values);
        } catch(IllegalArgumentException iae){
            return null;
        }
    }
}

Unfortunately, it's rather challenging to put this FixedSkLearnEncoder
class in action, because you would need to build your own
JPMML-SkLearn library and sklearn2pmml package versions:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/PMMLPipeline.java#L77


VR

Saad Syed

Aug 29, 2017, 1:04:00 AM
to Java PMML API

Hi Villu,
Thanks for your reply. I'll try and see if I can get the SkLearnEncoder subclass solution working.

Saad Syed

Sep 13, 2017, 5:11:44 AM
to Java PMML API

Hey Villu,
Didn't have much luck with the subclass solution. Would it be possible for you to fix this in the near future?
https://github.com/jpmml/jpmml-converter/issues/7

Villu Ruusmann

Sep 13, 2017, 7:09:15 AM
to Java PMML API
Hi Saad,

> Didn't have much luck with subclass solution,would
> it possible for you to fix this in the near future?
> https://github.com/jpmml/jpmml-converter/issues/7
>

If you'd like to push some specific GitHub issue forward, then you
should simply comment (eg. "+1") on it.

I'll take your request into consideration, and try to release an
updated version by the end of this week. The JPMML-Converter library
is included in most top-level JPMML converter libraries, so its
version number needs to be bumped in several places (eg. in addition
to JPMML-SkLearn, also in JPMML-LightGBM and JPMML-XGBoost, which are
transitive dependencies of it) in order to avoid classpath conflicts.


VR

Villu Ruusmann

Sep 14, 2017, 4:33:51 PM
to Java PMML API
Hi Saad,

I have released SkLearn2PMML version 0.24.0, which you've been waiting for.

Your requirement was about being able to do string manipulation
between CategoricalDomain and LabelEncoder/LabelBinarizer steps -
accept "fuzzy" category labels, and transform them to "standardized"
category labels inside the model.

The first part of the workflow - "accept fuzzy strings" - could not be
done earlier, because the CategoricalDomain decorator was capturing
and storing all unique string values. Now it's possible to turn off
this behaviour by specifying the "with_data = False" attribute:
https://github.com/jpmml/sklearn2pmml/commit/92d941e5be5d4a4a877773007479e3269b644623

The second part of the workflow - "normalize strings" - requires a
custom transformer. I have added a sample implementation into the
SkLearn2PMML-Plugin project:
https://github.com/jpmml/sklearn2pmml-plugin/commit/232af8eee5fb3ffe10e735e6f78805c2a228af44

Attached is a Python script Audit.py that puts those two new
developments together. I strongly advise you to play with it.

The associated data file Audit.csv is available here:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/csv/Audit.csv

The instructions for integrating the com.mycompany.StringNormalizer
transformer into SkLearn2PMML/JPMML-SkLearn runtime are given in the
README file of the SkLearn2PMML-Plugin project:
https://github.com/jpmml/sklearn2pmml-plugin/blob/master/README.md

In brief, do the following:
1) Check out SkLearn2PMML-Plugin, and build it locally using Apache Maven
2) Append the path to the EGG file to your PYTHONPATH environment variable.
3) Append the path to the JAR file to sklearn2pmml() function call as
a "user_classpath" argument.

The magic happens on line 20 of the Audit.py script:
[CategoricalDomain(with_data = False, with_statistics = False),
StringNormalizer(function = "uppercase"), LabelBinarizer()]
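The actual StringNormalizer is implemented on the Java side of the SkLearn2PMML-Plugin project; conceptually, though, the "normalize fuzzy strings" step amounts to something like this plain-Python sketch (illustrative names and behaviour only, not the plugin's exact semantics):

```python
# Illustrative stand-in for the string normalization step: trim
# whitespace, then fold all "fuzzy" spellings into one canonical case.
def normalize_strings(values, function="uppercase"):
    func = str.upper if function == "uppercase" else str.lower
    return [func(value.strip()) for value in values]

normalize_strings([" private ", "Private", "PRIVATE"])
# -> ["PRIVATE", "PRIVATE", "PRIVATE"]
```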


VR
Audit.py

Saad Syed

Dec 20, 2017, 9:53:42 AM
to Java PMML API

Hi Villu,

When using the CategoricalDomain(with_data=False) option, the resulting PMML declares that field as continuous double, even for string fields.

This is the mapper:

('Column', [CategoricalDomain(with_data=False), CustomTransformer(function="somefun")])


And this is how the DataField is created:

<DataField name="row_string" optype="continuous" dataType="double"/>


Manually changing the optype to categorical and the dataType to string works fine.


Villu Ruusmann

Dec 20, 2017, 10:52:26 AM
to Java PMML API
Hi Saad,

>
> When using the CategoricalDomain(with_data=False) option
> the resulting PMML creates that field as Continuous and double
> even for String fields.
>

It could be that CategoricalDomain is generating a "categorical
string" type definition in the beginning, but some other
transformation step (or the final estimator step) is overriding this
type definition later on.

Please note that sklearn.Transformer subclasses should indicate the
expected data type and operational type of their inputs by overriding
#getDataType() and #getOpType() methods:

public class MyTransformer extends sklearn.Transformer {

    @Override
    public DataType getDataType(){
        return DataType.STRING;
    }

    @Override
    public OpType getOpType(){
        return OpType.CATEGORICAL;
    }
}

Have you done this in your CustomTransformer class? If not, then this
could be one way the "continuous double" type definition is leaking
into the pipeline.

Also, you can force-cast the type definition of features inside the
Transformer#encodeFeatures(List<Feature>, SkLearnEncoder) method:

public class MyTransformer extends sklearn.Transformer {

    @Override
    public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder){
        Feature feature = features.get(0);

        encoder.updateType(feature.getName(), OpType.CATEGORICAL, DataType.STRING); // THIS!

        return features;
    }
}

Such force-casting is not recommended (as it indicates a bug in some
earlier steps), but it's still much better than having to edit the
generated PMML documents manually.

> This is the mapper :
>
> ('Column',[CategoricalDomain(with_data=False),CustomTransformer(function="somefun")])
>

If you're dealing with "open vocabularies", then you might want to
skip the CategoricalDomain(with_data = False) step altogether:
("Column", [CustomTransformer(function="somefun")])

The type definition of the field will be based on what
CustomTransformer#getDataType() and #getOpType() methods report.

Another trick is that your CustomTransformer class may specify a
different data type depending on the value of the "function"
attribute:

public class MyTransformer extends sklearn.Transformer {

    @Override
    public DataType getDataType(){
        String function = getFunction();

        switch(function){
            case "integer_function":
                return DataType.INTEGER;
            case "floating-point_function":
                return DataType.DOUBLE;
            default:
                return DataType.STRING;
        }
    }
}


VR

Saad Syed

Dec 29, 2017, 12:43:28 AM
to Java PMML API
Hi Villu,


> Please note that sklearn.Transformer subclasses should indicate the
> expected data type and operational type of their inputs by overriding
> #getDataType() and #getOpType() methods:
>
> class MyTransformer extends sklearn.Transformer {
>
> @Override
> public DataType getDataType(){
> return DataType.STRING;
> }
>
> @Override
> public OpType getOpType(){
> return OpType.CATEGORICAL;
> }
> }
>

This method works and the generated DataFields have proper data types. However, when working with more than one field, this fix gives each of them the same type, i.e. the one reported by the getDataType and getOpType methods. But it's very much possible to have some transformation operate on two fields of different types. How do we deal with this?


Also, is there a way to achieve proper dataTypes and opTypes for both the derived fields and data fields when using custom transformers, without having to hardcode the types?

Thanks.

Villu Ruusmann

Dec 29, 2017, 4:21:45 PM
to Java PMML API
Hi Saad,

> Also, is there a way to achieve proper DataTypes and Optypes
> for both the derived and datafields when using CustomTransformers
> without having to hardcode the types?

There are around ten DataTypes and three OpTypes in the PMML specification:
http://dmg.org/pmml/v4-3/DataDictionary.html#xsdElement_DataField

However, most use cases are limited to the following:
*) Categorical string, boolean, integer
*) Continuous double, float, integer

It might be worthwhile to subclass your CustomTransformer class for
each of those combinations. For example, a subclass for categorical
integers:

public class CatIntCustomTransformer extends CustomTransformer {

    @Override
    public DataType getDataType(){ return DataType.INTEGER; }

    @Override
    public OpType getOpType(){ return OpType.CATEGORICAL; }
}

Alternatively, you might introduce "pmml_data_type" and "pmml_op_type"
attributes to your CustomTransformer class:

public class CustomTransformer extends Transformer {

    @Override
    public DataType getDataType(){
        String dataType = (String)get("pmml_data_type");

        return DataType.valueOf(dataType.toUpperCase()); // Look up the enum value by its name
    }

    @Override
    public OpType getOpType(){
        String opType = (String)get("pmml_op_type");

        return OpType.valueOf(opType.toUpperCase()); // Same as above
    }
}

>
> However its very much possible to have some transformation
> on two fields of different types. How do we deal with this?
>

Is there a "super" datatype that can represent the values of all input
fields correctly?

For numeric fields (double + float; double + integer) it's typically
the DOUBLE datatype. For mixed fields, it's typically the STRING
datatype.
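The rule of thumb above could be sketched as a small helper (illustrative plain Python, not part of any JPMML API):

```python
# Pick a "super" datatype that can represent all input fields:
# identical types stay as-is, mixed numeric types widen to double,
# and anything else falls back to string.
def super_datatype(datatypes):
    numeric = {"integer", "float", "double"}
    if len(set(datatypes)) == 1:
        return datatypes[0]
    if all(dt in numeric for dt in datatypes):
        return "double"
    return "string"

super_datatype(["double", "integer"])  # -> "double"
super_datatype(["string", "double"])   # -> "string"
```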


VR

Saad Syed

Dec 30, 2017, 11:53:21 AM
to Java PMML API
Hi Villu,

> Is there a "super" datatype that can represent the values of all input
> fields correctly?
>
> For numeric fields (double + float; double + integer) it's typically
> the DOUBLE datatype. For mixed fields, it's typically the STRING
> datatype.
>

To make it clearer, say I have the following mapper:

(['Column1', 'Column2'], [CustomTransformer(function="somefun")])

where Column1 is categorical string and Column2 is continuous double.
> @Override
> public DataType getDataType(){
>     return DataType.STRING;
> }
>
> @Override
> public OpType getOpType(){
>     return OpType.CATEGORICAL;
> }

If I adopt this method, both fields will be declared as categorical string, which is not correct.

How do we get correct datatypes for both these fields?


Villu Ruusmann

Dec 30, 2017, 5:28:45 PM
to Java PMML API
Hi Saad,

>
> To make it clearer, say I have the following mapper :
> (['Column1','Column2'],[CustomTransformer(function="somefun")])
>
> How do we get correct datatypes for both these fields?
>

The correct answer would be that the JPMML-SkLearn library isn't quite
ready for your use case (as you're clearly on the edge, developing
custom transformers for mixed feature types).

Anyway, below are some ideas that should help you to move on.

First and foremost, methods sklearn.Transformer#getDataType and
#getOpType are only invoked if the type of some feature(s) is not
known by the time they are put into actual use. In your example, the
converter is about to invoke
CustomTransformer#encodeFeatures(List<Feature>, SkLearnEncoder), but
it has no idea what the "Column1" and "Column2" types are. Therefore, the
converter invokes CustomTransformer#getDataType and #getOpType, which
should be interpreted as "hey, CustomTransformer, I'm about to pass
some data to you - what would be your preferred data representation?".

Our objective is to help the converter figure out the "Column1" and
"Column2" types proactively, so that it wouldn't need to ask for
CustomTransformer's preference. Basically, there should be a special
"type definition" step right in front of CustomTransformer:

(["Column1", "Column2"], [MixedDomain(), CustomTransformer(function = "somefun")])

The class sklearn.decoration.MixedDomain doesn't exist at the moment.
However, it seems like a very useful addition, so I've opened the
following GitHub issue, and intend to get an early prototype out in
the coming days: https://github.com/jpmml/sklearn2pmml/issues/73

In the meantime, you might experiment with "branched" pipelines. The
idea is to introduce a temporary split into the workflow so that
"Column1" and "Column2" can be defined independently of one another
(using the existing CategoricalDomain() and ContinuousDomain()
classes, respectively). After that, merge these branches back into
one, and feed the resulting two-column data matrix to your
CustomTransformer class. Since everything is well-defined, the
converter doesn't need to invoke CustomTransformer#getDataType() and
#getOpType() anymore.

I've attached a sample archive Audit.zip that demonstrates this workaround:
1) The "define_features" step sees the original data matrix. The split
is introduced using the FeatureUnion transformer; the first branch
defines "Age", and the second branch defines "Occupation" (and also
binarizes it to make it conform to Scikit-Learn conventions).
2) The "compute_feature_interaction" step sees the 2-column data
matrix that is coming out of the "define_features" step.
3) The converter should be able to handle any nested Pipeline and/or
FeatureUnion configuration (eg. unlimited nesting depth, unlimited
number of branches).
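Stripped of all library machinery, the branch-then-merge idea looks like this (a library-free sketch with hypothetical helper names, not the actual FeatureUnion mechanics):

```python
# Each branch prepares its own column independently; the merge step
# zips the branch outputs back into one multi-column data matrix.
def branch_and_merge(rows, branch_fns):
    branches = [[fn(row[i]) for row in rows] for i, fn in enumerate(branch_fns)]
    return list(zip(*branches))

rows = [("Repair", 38), ("Service", 45)]
branch_and_merge(rows, [str.upper, float])
# -> [("REPAIR", 38.0), ("SERVICE", 45.0)]
```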


VR
Audit.zip

Villu Ruusmann

Jan 7, 2018, 4:11:32 PM
to Java PMML API
Hi Saad,

>
> To make it clearer, say I have the following mapper :
> (['Column1','Column2'],[CustomTransformer(function="somefun")])
>
> How do we get correct datatypes for both these fields?
>

I've released SkLearn2PMML version 0.29.0 (based on JPMML-SkLearn
version 1.4.5), which introduces two new multi-column transformations:
1) sklearn2pmml.decoration.MultiDomain - Define multiple columns in one go
2) sklearn2pmml.preprocessing.MultiLookupTransformer - Map multiple input columns to one output column

Here's an example about implementing a lookup transformation based on
two categorical columns:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L370
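Conceptually, the multi-column lookup amounts to keying a mapping with a tuple of input values; the plain-Python class below is an illustrative stand-in, not the sklearn2pmml implementation:

```python
# A tuple of input values keys into a mapping; unseen combinations
# fall back to a default value.
class MultiLookup:
    def __init__(self, mapping, default=None):
        self.mapping = mapping
        self.default = default

    def transform(self, rows):
        return [self.mapping.get(tuple(row), self.default) for row in rows]

lookup = MultiLookup({("married", "male"): 1, ("married", "female"): 2}, default=0)
lookup.transform([("married", "male"), ("single", "male")])
# -> [1, 0]
```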

After the upgrade, it will be possible to handle your use case like this:
(['Column1', 'Column2'], [MultiDomain([CategoricalDomain(), ContinuousDomain()]), CustomTransformer(function="somefun")])


VR