I have a field which I am preprocessing (converting to lower case) before passing to LabelEncoder.
I created an extension project for Custom transformations along the lines of your example.
My Mapper looks like this
DataFrameMapper([
    ('Column_A', [
        CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown"),
        CustomTransformFunctionGenerator(function='lowercase'),
        LabelEncoder()
    ]),
])
However it throws me an exception:
java.lang.IllegalArgumentException: lowercase(Column_A)
I looked around and figured that this might be because the LabelEncoder transformation looks for a data field named lowercase(Column_A), while the DataField tag is created with the name Column_A, hence the error.
To overcome this, I changed the order:
mapper = DataFrameMapper([
    ('Column_A', [
        CustomTransformFunctionGenerator(function='lowercase'),
        CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown"),
        LabelEncoder()
    ]),
])
so that the DataField markup is created with the correct name, but in this case I get the following exception:
org.jpmml.converter.ContinuousFeature cannot be cast to org.jpmml.converter.WildcardFeature
at sklearn2pmml.decoration.CategoricalDomain.encodeFeatures(CategoricalDomain.java:65)
My end objective is to apply some function to a categorical field and then pass it to Label Encoder.
I've only come here after exhausting all my ideas, so I hope it doesn't come across as spam.
Thanks.
Hi Villu,
Thanks for your reply. I'll try and see if I can get the SkLearnEncoder subclass solution working.
Hey Villu,
Didn't have much luck with the subclass solution; would it be possible for you to fix this in the near future?
https://github.com/jpmml/jpmml-converter/issues/7
Hi Villu,
When using the CategoricalDomain(with_data=False) option, the resulting PMML declares the field as continuous and double, even for string fields.
This is the mapper :
('Column', [CategoricalDomain(with_data=False), CustomTransformer(function="somefun")])
And this is how the data field is created.
<DataField name="row_string" optype="continuous" dataType="double"/>
Manually changing the optype to categorical and the dataType to string makes it work fine.
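For reference, the manual fix can be scripted as a post-processing step on the exported PMML. This is a minimal sketch: the single-element document below stands in for a real PMML file (which would also carry the PMML XML namespace), and the field name is just the one from my example.

```python
# Hypothetical post-processing workaround: patch the wrongly declared
# DataField attributes after the PMML file has been generated.
import xml.etree.ElementTree as ET

# Minimal stand-in for the generated PMML document
pmml = '<DataDictionary><DataField name="row_string" optype="continuous" dataType="double"/></DataDictionary>'
root = ET.fromstring(pmml)

# Flip the field from continuous/double to categorical/string
for data_field in root.iter("DataField"):
    if data_field.get("name") == "row_string":
        data_field.set("optype", "categorical")
        data_field.set("dataType", "string")

print(ET.tostring(root, encoding="unicode"))
```

A real PMML file would be loaded with ET.parse() and the namespace passed to the lookups, but the idea is the same.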
> Please note that sklearn.Transformer subclasses should indicate the
> expected data type and operational type of their inputs by overriding
> #getDataType() and #getOpType() methods:
>
> class MyTransformer extends sklearn.Transformer {
>
>     @Override
>     public DataType getDataType(){
>         return DataType.STRING;
>     }
>
>     @Override
>     public OpType getOpType(){
>         return OpType.CATEGORICAL;
>     }
> }
>
This method works and the generated DataFields have proper dataTypes. However, when working with more than one field, this fix gives each of them the same type, i.e. the one reported by the getDataType() and getOpType() methods. Yet it is very much possible to apply the same transformation to two fields of different types. How do we deal with this?
Also, is there a way to achieve proper dataTypes and opTypes for both the DerivedFields and DataFields when using custom transformers, without having to hardcode the types?
Thanks.
> Is there a "super" datatype that can represent the values of all input
> fields correctly?
>
> For numeric fields (double + float; double + integer) it's typically
> the DOUBLE datatype. For mixed fields, it's typically the STRING
> datatype.
>
To make it clearer, say I have the following mapper :
(['Column1','Column2'],[CustomTransformer(function="somefun")])
where Column1 is string and categorical, and Column2 is double and continuous.
> @Override
> public DataType getDataType(){
>     return DataType.STRING;
> }
>
> @Override
> public OpType getOpType(){
>     return OpType.CATEGORICAL;
> }
> }
>
If I adopt this method, both fields will be declared as categorical strings, which is not correct.
How do we get the correct dataTypes for both of these fields?
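For what it's worth, the workaround I'm currently leaning towards is giving each column its own mapper entry with its own transformer class, so that the matching Java-side converter class can report a per-column type. This is purely a hypothetical sketch: StringColumnTransformer and NumericColumnTransformer are made-up names, not part of sklearn2pmml, and each would need a Java counterpart overriding getDataType()/getOpType() as in your example.

```python
# Hypothetical sketch: one sklearn-style transformer class per column type,
# so each Java-side converter class can report its own DataType/OpType.
# These class names are illustrative, not part of sklearn2pmml.

class StringColumnTransformer:
    """Java counterpart would report DataType.STRING / OpType.CATEGORICAL."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X  # "somefun" applied to string values would go here

class NumericColumnTransformer:
    """Java counterpart would report DataType.DOUBLE / OpType.CONTINUOUS."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X  # "somefun" applied to numeric values would go here

# Separate mapper entries instead of one shared entry, e.g.:
# mapper = DataFrameMapper([
#     ('Column1', [StringColumnTransformer()]),
#     ('Column2', [NumericColumnTransformer()]),
# ])
```

The downside is one Python/Java class pair per type combination, but it would avoid declaring both columns with the same type.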