Newcomer to JPMML with some questions.


Ian Utley

Dec 14, 2017, 7:17:46 AM
to Java PMML API
I am using LightGBM to produce a tree model, converting the LightGBM model to a PMML model using the jpmml-lightgbm converter. Then using the pmml-evaluator to evaluate the model.

Initially I had some issues converting LGBM models with large numeric values in the cat_threshold section of the model: the thresholds are unsigned ints in C, so they can't fit into a signed Java int. I amended the converter to parse cat_threshold values as longs instead, after which the conversion succeeded and I ended up with a JPMML model.
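To illustrate the overflow described above, here is a minimal JDK-only sketch. It is not the actual JPMML-LightGBM patch; the class and method names are hypothetical, and the token value is simply the largest 32-bit unsigned integer.

```java
class UnsignedThresholds {

    // Hypothetical helper: parse a uint32 decimal token into a long,
    // which can hold the full range 0 .. 4294967295.
    public static long parseThreshold(String token) {
        return Long.parseLong(token.trim());
    }

    // Alternative: keep the 32-bit pattern in an int, then widen it
    // with unsigned semantics when the value is needed.
    public static long parseViaUnsignedInt(String token) {
        int bits = Integer.parseUnsignedInt(token.trim());
        return Integer.toUnsignedLong(bits);
    }

    public static void main(String[] args) {
        String token = "4294967295"; // 2^32 - 1, does not fit a signed int
        try {
            Integer.parseInt(token);
        } catch (NumberFormatException nfe) {
            System.out.println("Integer.parseInt rejects " + token);
        }
        System.out.println(parseThreshold(token));
        System.out.println(parseViaUnsignedInt(token));
    }
}
```

Either approach keeps the value intact; parsing to long is the simpler change when the downstream code only needs to test individual bits.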

However, I am having issues using the evaluator (getting an InvalidResultException) which I have tracked down to my arguments being outside the range determined during training.

E.g., I'm using the Adult training set and LightGBM determines that fnlwgt is in the range [12285:1484705].
This feature is represented in the JPMML model as:
<DataField name="fnlwgt" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="12285.0" rightMargin="1484705.0"/>
</DataField>

However, my non-training data has a value outside this range. How do I handle this such that the model is still evaluated?
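As a rough illustration of why such a value is rejected, the sketch below models the membership test implied by an Interval element. It is a simplification under stated assumptions, not the evaluator's implementation: the class and enum names are hypothetical, and unbounded intervals (a missing leftMargin or rightMargin) are not modelled.

```java
class IntervalCheck {

    // Mirrors the four PMML closure values: openOpen, openClosed, closedOpen, closedClosed.
    enum Closure { OPEN_OPEN, OPEN_CLOSED, CLOSED_OPEN, CLOSED_CLOSED }

    // Returns true if value lies inside the interval under the given closure.
    public static boolean contains(Closure closure, double left, double right, double value) {
        boolean leftClosed = (closure == Closure.CLOSED_OPEN || closure == Closure.CLOSED_CLOSED);
        boolean rightClosed = (closure == Closure.OPEN_CLOSED || closure == Closure.CLOSED_CLOSED);
        boolean leftOk = leftClosed ? (value >= left) : (value > left);
        boolean rightOk = rightClosed ? (value <= right) : (value < right);
        return leftOk && rightOk;
    }

    public static void main(String[] args) {
        // The fnlwgt interval from the DataField above
        System.out.println(contains(Closure.CLOSED_CLOSED, 12285.0, 1484705.0, 1484705.0));
        // A value beyond the training range falls outside the interval
        System.out.println(contains(Closure.CLOSED_CLOSED, 12285.0, 1484705.0, 1500000.0));
    }
}
```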

Regards
Ian.

Ian Utley

Dec 14, 2017, 10:04:39 AM
to Java PMML API
Update: I preprocessed the JPMML model to remove intervals from the DataDictionary before evaluating and my results now match LightGBM's predictions. Not sure if this is the correct way of doing it...
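A preprocessing step of this kind could be sketched with nothing but the JDK's DOM API, as below. This is an assumption about how one might do it, not the code actually used in this thread and not the JPMML-Model approach; the class and method names are hypothetical. It only removes Interval elements that are direct children of DataField, so Interval elements elsewhere in the document are left alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class IntervalStripper {

    // Removes every Interval child of every DataField; returns the number removed.
    public static int stripIntervals(Document document) {
        int removed = 0;
        NodeList dataFields = document.getElementsByTagName("DataField");
        for (int i = 0; i < dataFields.getLength(); i++) {
            Node dataField = dataFields.item(i);
            // Remember the next sibling before removing, so iteration stays valid
            Node child = dataField.getFirstChild();
            while (child != null) {
                Node next = child.getNextSibling();
                if (child instanceof Element && "Interval".equals(child.getNodeName())) {
                    dataField.removeChild(child);
                    removed++;
                }
                child = next;
            }
        }
        return removed;
    }

    // Parses an XML string; checked exceptions are wrapped for brevity.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PMML><DataDictionary>"
            + "<DataField name=\"fnlwgt\" optype=\"continuous\" dataType=\"double\">"
            + "<Interval closure=\"closedClosed\" leftMargin=\"12285.0\" rightMargin=\"1484705.0\"/>"
            + "</DataField></DataDictionary></PMML>";
        Document document = parse(xml);
        System.out.println("Removed: " + stripIntervals(document));
    }
}
```

Writing the modified document back to a file (for example with javax.xml.transform) is left out to keep the sketch short.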

Villu Ruusmann

Dec 14, 2017, 11:30:19 AM
to Java PMML API
Hi Ian,

> I am using LightGBM to produce a tree model, converting
> the LightGBM model to a PMML model using the jpmml-lightgbm
> converter. Then using the pmml-evaluator to evaluate the model.
>

A general warning - LightGBM is evolving quite rapidly, and the
JPMML-LightGBM library might be more or less outdated. The current
version of JPMML-LightGBM is targeting LightGBM v2.0.7.

> Initially I had some issues converting LGBM models with
> large numeric values in the cat_threshold section.
>

Well spotted!

This is a major issue, and I've propagated it to GitHub:
https://github.com/jpmml/jpmml-lightgbm/issues/9

Would you mind attaching your patch there?

> However, I am having issues using the evaluator
> (getting an InvalidResultException) which I have tracked
> down to my arguments being outside the range
> determined during training.
>

Possible solutions:
1) Add more (dummy-) data records to your dataset so that the complete
"applicability domain" would be covered.
2) Manually edit the "feature_infos" attribute (in the header section)
in LightGBM model text file.
3) You're probably using LightGBM standalone, and converting models
with JPMML-LightGBM command-line application. However, if you switched
to Scikit-Learn framework, then you'd be able to customize feature
bounds with the help of sklearn2pmml.decoration.ContinuousDomain
meta-transformation.
4) Post-process PMML documents using the JPMML-Model library. If you
remove DataField/Interval elements, then all input values will be
considered to be valid.
5) Enhancing the JPMML-LightGBM command-line application with a
"--no-domain" command-line switch, which would apply solution #4
automatically.

Here's an example about exporting "domain-less" LightGBM models using
Scikit-Learn:
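from lightgbm import LGBMRegressor
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("x", ContinuousDomain(with_data = False)) # THIS!
    ])),
    ("estimator", LGBMRegressor())
])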

Here's an example using the Visitor API to get rid of all
DataField/Interval elements:
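import java.util.List;

import org.dmg.pmml.DataField;
import org.dmg.pmml.Interval;
import org.dmg.pmml.PMML;
import org.dmg.pmml.Visitor;
import org.dmg.pmml.VisitorAction;
import org.jpmml.model.visitors.AbstractVisitor;

class NoDomainVisitor extends AbstractVisitor {

    @Override
    public VisitorAction visit(DataField dataField){
        // Clear the list returned by getIntervals() to drop all Interval child elements
        if(dataField.hasIntervals()){
            List<Interval> intervals = dataField.getIntervals();
            intervals.clear();
        }
        return super.visit(dataField);
    }
}

// loadPMML() is a placeholder for your own PMML loading code
PMML pmml = loadPMML();
Visitor visitor = new NoDomainVisitor();
visitor.applyTo(pmml);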


VR

Jaja w

Sep 17, 2018, 1:53:02 AM
to Java PMML API
On Thursday, December 14, 2017 at 11:04:39 PM UTC+8, Ian Utley wrote:
> Update: I preprocessed the JPMML model to remove intervals from the DataDictionary before evaluating and my results now match LightGBM's predictions. Not sure if this is the correct way of doing it...

I ran into the same problem as you did, and I preprocessed the JPMML model the same way. But I'd like to know whether you have found a better way to solve this problem. Thank you.

dan x

Sep 17, 2018, 5:13:38 AM
to Java PMML API
I'm afraid that removing the <Interval> elements directly from the converted PMML file will have some side effects on the model.
I ran two predictions on the same data set: one with the original LightGBM model file in Python, and the other with the filtered PMML file (Interval elements removed after conversion to PMML) evaluated in Java. I got slightly different results (original LightGBM file + Python: 0.651 F1; filtered PMML file + Java: 0.648 F1).

Villu Ruusmann

Sep 17, 2018, 8:41:36 AM
to Java PMML API
Hello,

>
> And I preprocessed the JPMML model as you did. But I'm not
> sure if you have a better way to solve this problem.
>

If you train models using Scikit-Learn, then it's possible to turn off
the generation of Interval elements by "decorating" the problematic
feature columns using the sklearn2pmml.decoration.ContinuousDomain or
sklearn2pmml.decoration.CategoricalDomain pseudo-transformers.

You need to set the 'Domain.with_data' attribute to False (defaults to
True, because in my opinion, all models should aim to keep track of
their applicability domains):
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L56
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L140-L142

For example, when working with the Iris dataset (see the attached
LightGBMIris.zip archive file for a complete example), then it's
possible to generate Interval elements selectively like this:
mapper = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width"], ContinuousDomain(with_data = False)),
    (["Petal.Length", "Petal.Width"], ContinuousDomain(with_data = True))
])

The above would cause "Sepal.Length" and "Sepal.Width" fields to
accept any double value.


VR
LightGBMIris.zip