Change the probability class name


Mayuri

Jan 2, 2023, 7:55:14 AM
to Java PMML API
Hi,

I am using sklearn2pmml to export 3 logistic regression models as PMML files, but I got an error: "Exception in thread "main" org.jpmml.evaluator.DuplicateValueException: The value for field "probability(0)" has already been defined". I manually changed the "probability(0)" and "probability(1)" class names to "class_0_seg_i" and "class_1_seg_i", where i ∈ {1,2,3}, and that worked. Is there any other way to do it, rather than manually changing the class names?

Thanks,
Mayuri

Villu Ruusmann

Jan 2, 2023, 2:41:52 PM
to Java PMML API
Hi Mayuri,

> I am using sklearn2pmml to export 3 logistic models
> as PMML files but I got an error: "Exception in thread "main"
> org.jpmml.evaluator.DuplicateValueException:
> The value for field "probability(0)" has already been defined".
>

In short, you have an ensemble of three logistic regression models,
which define output fields that have identical names.

This is allowed with some ensembling approaches, and not with others.
Looks like your case belongs to the second category.

The problem lies with the PMML conversion software, which should be
able to detect and prevent such ensembles from being generated. The
PMML evaluator engine is not to blame here. It was simply following
(incorrect) orders.

Can you tell me what the Python ensemble model type was? You should
open a JPMML-SkLearn issue, and submit a minimal reproducible
example there: https://github.com/jpmml/jpmml-sklearn/issues

> I manually changed the probability (0) and probability (1)
> class to "class_0_seg_i" and "class_1_seg_i" where i ∈ {1,2,3}
> and that worked.

Yes, the fix to this problem is to ensure that all logistic regression
models have unique output field names.

This "limitation" is actually very reasonable. If you were exporting
PMML evaluation results into a CSV file, then you also wouldn't want
to have three "probability(0)" or "probability(1)" columns in there.
All these output field names need some kind of prefix or suffix to
indicate the associated segment/model.
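
To make the collision concrete, here is a minimal Python sketch. It is
not the JPMML code, only an illustration under assumptions: it merges
per-segment results into one flat record (think: one CSV row), refuses
duplicate keys in the same spirit as DuplicateValueException, and then
shows one possible naming convention that puts the segment identifier
inside the parentheses. The function names `qualify` and `merge_outputs`
and the probability values are made up for this example.

```python
def qualify(name, segment_id):
    """Turn "probability(0)" into "probability(0, A)"."""
    if not name.endswith(")"):
        raise ValueError("Expected a function-style field name: " + name)
    return name[:-1] + ", " + segment_id + ")"


def merge_outputs(per_segment_outputs):
    """Merge {segment_id: {field: value}} into one flat dict, rejecting duplicates."""
    merged = {}
    for segment_id, outputs in per_segment_outputs.items():
        for name, value in outputs.items():
            if name in merged:
                raise KeyError('The value for field "%s" has already been defined' % name)
            merged[name] = value
    return merged


# Made-up results of three segments that all use the same field names
raw = {
    "A": {"probability(0)": 0.7, "probability(1)": 0.3},
    "B": {"probability(0)": 0.4, "probability(1)": 0.6},
    "C": {"probability(0)": 0.1, "probability(1)": 0.9},
}

# Qualify every field name with its segment identifier before merging
qualified = {
    segment_id: {qualify(name, segment_id): value for name, value in outputs.items()}
    for segment_id, outputs in raw.items()
}
flat = merge_outputs(qualified)
```

Calling merge_outputs(raw) directly raises the KeyError, because the
second segment tries to define "probability(0)" again.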

The JPMML-SkLearn/SkLearn2PMML software stack would be disambiguating
output fields using the following pattern: "probability(<category>,
<segment_id>)". So, if your ensemble model has three segments A, B and
C, you would be getting "probability(0, A)", "probability(0, B)" and
"probability(0, C)".

> Is there any other way to do it rather than manually
> changing the class names.
>

You could write a small helper application to do the output field
renaming for you :-)
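
Such a helper could look roughly like the following Python sketch, which
uses only the standard library. The assumptions are mine: the ensemble
is flat (no Segment nested inside another Segment), every Segment
element has an "id" attribute, and no other element refers to the
renamed output fields by name. The function name
`rename_segment_outputs` and the stripped-down sample document (not
valid PMML) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stripped-down fragment for illustration only; a real PMML file has much more
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel>
    <Segmentation>
      <Segment id="1">
        <RegressionModel>
          <Output>
            <OutputField name="probability(0)"/>
            <OutputField name="probability(1)"/>
          </Output>
        </RegressionModel>
      </Segment>
      <Segment id="2">
        <RegressionModel>
          <Output>
            <OutputField name="probability(0)"/>
            <OutputField name="probability(1)"/>
          </Output>
        </RegressionModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(element):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def rename_segment_outputs(root):
    """Append each Segment's id to the probability OutputField names inside it."""
    renamed = []
    for segment in root.iter():
        if local_name(segment) != "Segment" or segment.get("id") is None:
            continue
        segment_id = segment.get("id")
        for field in segment.iter():
            name = field.get("name", "")
            if (local_name(field) == "OutputField"
                    and name.startswith("probability(") and name.endswith(")")):
                new_name = name[:-1] + ", " + segment_id + ")"
                field.set("name", new_name)
                renamed.append(new_name)
    return renamed


root = ET.fromstring(SAMPLE)
renamed = rename_segment_outputs(root)
```

For a file on disk, ET.parse() and ElementTree.write() would take the
place of ET.fromstring(). Calling ET.register_namespace("", <namespace
URI of your document>) before writing keeps the default namespace
instead of generated "ns0:" prefixes.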

On a more serious note, there is an open issue about it:
https://github.com/jpmml/sklearn2pmml/issues/361

The proposed approach
(https://github.com/jpmml/sklearn2pmml/issues/361#issue-1435626832)
where the mapping is defined at the (PMML)Pipeline level works if
there is only one "probability(0)" output field in the entire PMML
document.

However, in your case there are three output fields in the
(PMML)Pipeline scope, so the above wouldn't work. The rename mapping
would need to be defined at the Estimator level.

If you have any ideas about which Python syntax would be simplest/most
effective, please add your comments there.

Final note - the SkLearn2PMML package provides
sklearn2pmml.decoration.Alias and s.d.MultiAlias decorator classes for
renaming DerivedField elements. However, they cannot be used for
renaming OutputField elements.

Or, perhaps you don't need so many output fields at all? Perhaps they
should not be generated at all? See
https://github.com/jpmml/jpmml-sklearn/issues/180


VR

Villu Ruusmann

Jan 2, 2023, 2:49:40 PM
to Java PMML API
Hi Mayuri,

>
> > Is there any other way to do it rather than manually
> > changing the class names.
> >
>
> The JPMML-SkLearn/SkLearn2PMML software stack would be disambiguating
> output fields using the following pattern: "probability(<category>,
> <segment_id>)". So, if your ensemble model has three segments A, B and
> C, you would be getting "probability(0, A)", "probability(0, B)" and
> "probability(0, C)".
>

Sorry, forgot to post the following idea earlier: "can you rename your
target categories, in order to incorporate the segment identifier in
them?"

Right now you have logistic regression models trained with 0/1 target
categories. But if you renamed them to 0A/1A, 0B/1B and 0C/1C target
categories, respectively, then you'd be getting unique output field
names automatically.
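
A minimal sketch of that relabeling in plain Python, assuming each
segment has its own list of 0/1 labels. The helper name `tag_labels`
and the sample labels are made up for illustration. Model fitting and
PMML export are not shown here.

```python
def tag_labels(labels, segment_id):
    """Map 0/1 style labels to "0A"/"1A" style strings."""
    return [str(label) + segment_id for label in labels]


y = [0, 1, 1, 0]  # made-up target values

# One relabeled copy of the target per segment
y_by_segment = {segment_id: tag_labels(y, segment_id) for segment_id in ("A", "B", "C")}
```

Each relabeled list would then be passed as the target when fitting the
corresponding logistic regression model, in place of the original 0/1
labels.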


VR