Hi Mayuri,
> I am using sklearn2pmml to export 3 logistic models
> as PMML files but I got an error: "Exception in thread "main"
> org.jpmml.evaluator.DuplicateValueException:
> The value for field "probability(0)" has already been defined".
>
In short, you have an ensemble of three logistic regression models,
which define output fields that have identical names.
This is allowed with some ensembling approaches, and not with others.
Looks like your case belongs to the second category.
The problem lies with the PMML conversion software, which should be
able to detect and prevent such ensembles from being generated. The
PMML evaluator engine is not to blame here - it was simply following
(incorrect) orders.
Can you tell me what the Python ensemble model type was? You should
open a JPMML-SkLearn issue, and submit a minimal reproducible
example there:
https://github.com/jpmml/jpmml-sklearn/issues
> I manually changed the probability (0) and probability (1)
> class to "class_0_seg_i" and "class_1_seg_i" where i ∈ {1,2,3}
> and that worked.
Yes, the fix to this problem is to ensure that all logistic regression
models have unique output field names.
This "limitation" is actually very reasonable. If you were exporting
PMML evaluation results into a CSV file, then you also wouldn't want
to have three "probability(0)" or "probability(1)" columns in there.
All these output field names need some kind of prefix or suffix to
indicate the associated segment/model.
The JPMML-SkLearn/SkLearn2PMML software stack would disambiguate
output fields using the following pattern: "probability(<category>,
<segment_id>)". So, if your ensemble model has three segments A, B and
C, you would be getting "probability(0, A)", "probability(0, B)" and
"probability(0, C)".
> Is there any other way to do it rather than manually
> changing the class names.
>
You could write a small helper application to do the output field
renaming for you :-)
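To give an idea of what such a helper could look like, here is a minimal
sketch in plain Python (standard library only). It is not part of
SkLearn2PMML; the function name rename_segment_outputs is made up for
illustration. It assumes that each Segment element has an "id" attribute
and holds its model as a direct child, and it only rewrites OutputField
names. If something else in the document refers to those fields by name,
those references would need to be updated separately.

```python
import re
import xml.etree.ElementTree as ET

# Matches un-suffixed names such as "probability(0)". Names that already
# contain a comma (e.g. "probability(0, A)") are left alone.
PROBABILITY = re.compile(r"^probability\(([^,()]+)\)$")


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def rename_segment_outputs(pmml_text):
    """Hypothetical helper: append the segment id to probability fields."""
    root = ET.fromstring(pmml_text)
    # Keep the document's own namespace as the default one when serializing.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    for segment in root.iter():
        if local_name(segment.tag) != "Segment" or segment.get("id") is None:
            continue
        segment_id = segment.get("id")
        for model in segment:        # direct children: predicate and model
            for output in model:     # direct children of the model element
                if local_name(output.tag) != "Output":
                    continue
                for field in output:
                    if local_name(field.tag) != "OutputField":
                        continue
                    match = PROBABILITY.match(field.get("name", ""))
                    if match:
                        new_name = "probability({}, {})".format(
                            match.group(1), segment_id)
                        field.set("name", new_name)
    return ET.tostring(root, encoding="unicode")
```

You would read the PMML file into a string, pass it to the function, and
write the returned string back to a new file.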
On a more serious note, there is an open issue about it:
https://github.com/jpmml/sklearn2pmml/issues/361
The proposed approach
(https://github.com/jpmml/sklearn2pmml/issues/361#issue-1435626832),
where the mapping is defined at the (PMML)Pipeline level, works if
there is only one "probability(0)" output field in the entire PMML
document.
However, in your case there are three output fields in the
(PMML)Pipeline scope, so the above wouldn't work. The rename mapping
would need to be defined at the Estimator level.
If you have any ideas about which Python syntax would be simplest/most
effective, please add your comments there.
Final note - the SkLearn2PMML package provides
sklearn2pmml.decoration.Alias and sklearn2pmml.decoration.MultiAlias
decorator classes for renaming DerivedField elements. However, they
cannot be used for renaming OutputField elements.
Or, perhaps you don't need so many output fields at all? Perhaps they
should not be generated at all? See
https://github.com/jpmml/jpmml-sklearn/issues/180
VR