How to get model information from PMML file? (From specific tags)


Aayush Shah

Jun 16, 2022, 8:21:48 AM
to Java PMML API
Here is the scenario that I am in: I am working on a project in which I am using `java-spark` to perform the predictions.

The PMML can come from any platform (Python, KNIME, R, etc.), but the main objective is to fetch model properties, such as coefficients, from the PMML file.

My program should be automated enough to look for model coefficients only if the model type is linear regression (say). For a tree model, I might want to fetch the feature importances instead.

Clearly, we also need to look through the PMML file and fetch the model type and name, so this also requires extracting specific data from the file.

Question:

So, is there a Java library that can extract the information we require, on the fly, from any PMML file?

Language Background:

Spark: 2.4.3
java: 1.8
jpmml-evaluator-spark: 1.3.0 (this is what reads the PMML file)

If there is anything that helps to get model information from the file, so that we can then use it in our own way, please point me in that direction.

Thanking you 😀

Villu Ruusmann

Jun 16, 2022, 8:46:00 AM
to Java PMML API, Aayush Shah
Hi AS,

First, one copy of your question is enough. And the less text styling,
the better.

>
> Here is the scenario that I am in: I am working on a project
> in which I am using `java-spark` to perform the predictions.
>

Are you using JPMML-Evaluator-Spark for making predictions, or are you
using it to extract linear model coefficients and then make
predictions using your own code?

If it's the latter, then you could work with the base JPMML-Model library:
https://github.com/jpmml/jpmml-model

> And my program has to be such automated that —
> it should be able to look for model coefficients ***only***
> if the model type is linear regression *(suppose)*.
>

There is a special-purpose Visitor API inside the JPMML-Model library.
It's designed for traversing PMML data structures of arbitrary
complexity, and for querying and/or modifying them as needed.

Simply create a subclass of org.jpmml.model.visitors.AbstractVisitor,
and override appropriate AbstractVisitor#visit(<PMML element>)
methods.

For example, looking up coefficients for continuous features
(NumericPredictor element), and category level contributions for
categorical features:

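import org.dmg.pmml.PMML;
import org.dmg.pmml.Visitor;
import org.dmg.pmml.VisitorAction;
import org.dmg.pmml.regression.CategoricalPredictor;
import org.dmg.pmml.regression.NumericPredictor;
import org.jpmml.model.visitors.AbstractVisitor;

Visitor coefficientPrinter = new AbstractVisitor(){

    // Continuous feature: prints "field -> coefficient"
    @Override
    public VisitorAction visit(NumericPredictor numericPredictor){
        System.out.println(numericPredictor.getField() + " -> " +
            numericPredictor.getCoefficient());
        return super.visit(numericPredictor);
    }

    // Categorical feature: prints "field/category -> coefficient"
    @Override
    public VisitorAction visit(CategoricalPredictor categoricalPredictor){
        System.out.println(categoricalPredictor.getField() + "/" +
            categoricalPredictor.getValue() + " -> " +
            categoricalPredictor.getCoefficient());
        return super.visit(categoricalPredictor);
    }
};
PMML pmml = loadPMML(..)
pmml.applyTo(coefficientPrinter);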

> And in case of tree model I might want to fetch the feature importance etc.

Feature importances are available as
MiningSchema/MiningField@importance attributes:
https://dmg.org/pmml/v4-4-1/MiningSchema.html

Create an AbstractVisitor subclass that visits MiningField elements,
and then prints out the result of MiningField#getImportance() method.

> **Clearly**, here we also need to look through the
> PMML file and fetch the model type and name.

The model type is reflected in the name of the top-level Model element.

For example, all linear regression models are represented using
the RegressionModel element (irrespective of their native ML framework
representation):
https://dmg.org/pmml/v4-4-1/Regression.html

During Visitor API traversal, you can distinguish between top-level
and nested models by checking the status of the current element stack,
as available via the (Abstract)Visitor#getParents() method.

By definition, for a top-level model, the stack of parent elements
contains a single PMML element.

> So, do we have any library in java by which we can
> get the information that we require on the fly on any pmml file?
>

TLDR: See the JPMML-Model library:
https://github.com/jpmml/jpmml-model

For Visitor API code example, please use GitHub search.


VR

Aayush Shah

Jun 17, 2022, 6:06:03 AM
to Java PMML API
Hei VR,

Thank you very much for your comprehensive reply! I know I have been using a lot of formatting, but I will
keep it plain this time. I actually deleted the first question, but it seems to have appeared twice, which I didn't
intend.

I have followed the approach, but because of my lack of knowledge I have not quite reached the destination
I was aiming for.

I tried the following code to create a PMML object, as you described, with the following workflow:

// Read the PMML file, to later convert it to a PMML object
Evaluator evaluator = new LoadingModelEvaluatorBuilder()
    .load(new File("path\\file.pmml"))
    .build();

// Type conversion, so that a PMML object can be obtained
// I found this workaround using org.dmg.pmml.PMML
HasPMML hasPMML = (HasPMML)evaluator;
PMML pmml = hasPMML.getPMML();

// From this object, I was able to get fields like
// `getBaseVersion`, `getHeader` etc.

// This one looked promising: `model` might include the coefficients
List<Model> modelList = pmml.getModels();
System.out.println("Total: " + modelList.size());  // Only one model was there

Model first = modelList.get(0);
System.out.println("Name of Algo: " + first.getAlgorithmName()); // sklearn.linear_model._base.LinearRegression

// I have tried `.getLocalTransformations`, `.getDerivedFields` etc. but they didn't include the NumericPredictor element
LocalTransformations transformations = first.getLocalTransformations();
List<DerivedField> fields = transformations.getDerivedFields();

// The NumericPredictor element is inside the `RegressionTable` tag, but I am unsure how to reach that tag

// I have tried to override the method in a subclass, but that needs a `numericPredictor`
// Also, this pmml object doesn't have the .applyTo method that you mentioned
Visitor coefficientPrinter = new AbstractVisitor() {
    @Override
    public VisitorAction visit(NumericPredictor numericPredictor) {
        System.out.println(numericPredictor.getField() + " -> " +
            numericPredictor.getCoefficient());
        return super.visit(numericPredictor);
    }
};

Once I can access each NumericPredictor object, I can fetch the coefficients. I apologize for this very basic query, but
I am not sure how to figure it out.

Again, thank you for your detailed support, VR

AS

Villu Ruusmann

Jun 17, 2022, 6:34:07 AM
to Java PMML API, Aayush Shah
Hi AS,

>
> // To read the pmml file and later convert it as PMML object
>

You don't need to go through ModelEvaluatorBuilder to obtain an
org.dmg.pmml.PMML object!

Simply use the org.jpmml.model.PMMLUtil#unmarshal(InputStream) utility method.

> // From this file, I was able to get the fields like
> // `getBaseVersion`, `getHeader` etc.
>

You should be obtaining this information using Visitor API.

>
> // The numeric predictor field is in `RegressionTable` tag - but am unsure how to reach to that tag
>

The Visitor API will find all the occurrences of the NumericPredictor
element for you automatically.

Simply point it to the root object in the PMML class model object
hierarchy, and it will traverse it fully and efficiently without any
additional guidance.

> // And also this pmml object doesn't have .applyTo method - as you stated
>
> Visitor coefficientPrinter = new AbstractVisitor() {
> ...
> };
>

Looks like I suggested the PMML#applyTo(Visitor) method.

The correct one is the Visitor#applyTo(org.dmg.pmml.PMMLObject) method:

PMML pmml = PMMLUtil.unmarshal(...);
coefficientPrinter.applyTo(pmml);

Any decent IDE should be able to assist you in locating these methods.


VR

Aayush Shah

Jun 24, 2022, 7:55:55 AM
to Java PMML API
Hei VR,

I am amazed by your reply, which has really helped me to navigate through the PMML
file and get things from specific tags. I really appreciate the quick and comprehensive
responses you have been giving to the community.

Now, after reading the information from the file, the PMML sometimes contains specific
data that we need to check and change programmatically, so that it is handled automatically.

So, my question is about changing specific content from old to new. For example, we have the
<Output> element:
<Output>
    <OutputField name="probability(0.0)" optype="continuous" dataType="double" feature="probability" value="0.0"/> 
    <OutputField name="probability(1.0)" optype="continuous" dataType="double" feature="probability" value="1.0"/> 
</Output>

Now, the period (.) causes problems in Spark, as you have discussed in this thread: https://github.com/jpmml/jpmml-evaluator-spark/issues/6
so I am trying to solve it by replacing the period with "" in Java. Could you please help me solve this problem, i.e. replace the value in the PMML file
programmatically and then use the updated file to run the model?

Secondly, there is an attribute in the <PMML> tag, <PMML xmlns="http://www.dmg.org/PMML-4_2" ...>, where we have the URI in the xmlns attribute.
Same thing: sometimes there is "https://" in the URL, but Java fails to read it as per its regex, so I want to convert it to "http://" and then run the PMML from there. The thing
is that I don't know how to access the "xmlns" attribute or how to update it.

Huge thanks and regards.
Aayush Shah

Aayush Shah

Jun 24, 2022, 9:01:21 AM
to Java PMML API
I am sorry for making a separate post, since I can't edit the question once it is posted...
If you could advise: I just want to check whether an element exists in the PMML file
before performing a certain operation. How can I do that?

For example, I want to check whether the file contains a <RegressionTable> element or not.
If it does, then do something; otherwise I would check other tags to satisfy my requirement.

So, is there a way to get a Boolean result for the existence of a tag in PMML, sir?

Thanks,
Aayush Shah

Villu Ruusmann

Jun 24, 2022, 1:22:55 PM
to Java PMML API, Aayush Shah
Hi AS,

>
> Like, I want to check that whether the file contains
> <RegressionTable> element or not.
> And if it does then do something otherwise I would check
> other tags to satisfy my requirement.
>

My answer to all PMML structural querying questions is "use the Visitor API".

I would advise collecting all activities related to one element type
into one (Abstract)Visitor subclass. The visitor application (via the
Visitor#applyTo(PMMLObject) entry method) is extremely fast. If I
recall correctly, it takes a few milliseconds to traverse a 1GB
PMML class model object from end to end.

So, instead of a two-step workflow - first Visitor checks if element
exists or not, then the second Visitor does something with this info -
use a single do-everything Visitor. An empty traversal is very cheap,
and is not worth optimizing away.

> So, is there a way to get Boolean result of existance of the tag in PMML, sir?
>

// Finder class
public class CheckIfElementExists extends AbstractVisitor {

    public boolean objectFound = false;

    @Override
    public VisitorAction visit(PMMLObject object){
        // isThisWhatImLookingFor(..) is a placeholder for your own check
        if(isThisWhatImLookingFor(object)){
            this.objectFound = true;
            return VisitorAction.TERMINATE;
        }
        return VisitorAction.CONTINUE;
    }
}

// Finder application
CheckIfElementExists ciee = new CheckIfElementExists();
ciee.applyTo(pmml);
System.out.println("Found what I was looking for: " + ciee.objectFound);

Please note that in the finder business logic I'm returning
VisitorAction.TERMINATE after I find the match. This is like a "quick
exit" instruction, which stops the Visitor from performing any
additional checks.

Also, please note that the Visitor API traverses the PMML class
model object in "depth first" fashion (and not in "breadth first"
fashion!).
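
To illustrate what depth-first order combined with early termination means in practice, here is a toy sketch that does not use any JPMML classes. ToyNode, DepthFirstDemo and the sample tree are hypothetical stand-ins for a PMML class model object; the real child order is determined by the JPMML-Model class model, not by this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a PMML element (NOT a JPMML class)
class ToyNode {
    final String name;
    final List<ToyNode> children;

    ToyNode(String name, ToyNode... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

class DepthFirstDemo {

    // Visits a node, then each of its children in turn (depth first).
    // Returns true as soon as a node named `target` is found, which
    // plays the role of VisitorAction.TERMINATE in this toy.
    static boolean find(ToyNode node, String target, List<String> visited) {
        visited.add(node.name);
        if (node.name.equals(target)) {
            return true;
        }
        for (ToyNode child : node.children) {
            if (find(child, target, visited)) {
                return true;
            }
        }
        return false;
    }

    // Returns the names of all nodes visited before the traversal stopped
    public static List<String> visitOrder(ToyNode root, String target) {
        List<String> visited = new ArrayList<>();
        find(root, target, visited);
        return visited;
    }

    // Hypothetical, heavily simplified document structure
    public static ToyNode sampleTree() {
        return new ToyNode("PMML",
            new ToyNode("DataDictionary", new ToyNode("DataField")),
            new ToyNode("RegressionModel",
                new ToyNode("MiningSchema"),
                new ToyNode("RegressionTable")));
    }

    public static void main(String[] args) {
        System.out.println(visitOrder(sampleTree(), "RegressionTable"));
    }
}
```

In this toy, DataField is visited before RegressionModel, because the DataDictionary subtree is exhausted before the traversal moves on to its sibling. A breadth-first traversal would have reached RegressionModel first.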


VR

Villu Ruusmann

Jun 24, 2022, 1:52:06 PM
to Java PMML API, Aayush Shah
Hi AS,

>
> So, my question is about changing some specific
> content from old to new. Like we have the <output> tags:
>

In case of probability-type OutputField elements, the name of the
field is derived based on target category values.

Therefore, the output field name "probability(1.0)" indicates that the
data type of the target has been "categorical + double". In case of
the binary target, you should consider converting it to "categorical +
integer" type, in order to have this output field renamed to
"probability(1)" automatically.

Casting the target type is applicable if you can re-run the original
data science pipeline. If you only have a PMML file, then you must
resort to manual error correction in the form of renaming fields using
the Visitor API.

Luckily enough, there is a special-purpose
org.jpmml.model.visitors.FieldRenamer class available for this job!

Sample usage:
Map<String, String> mappings = new HashMap<>();
mappings.put("probability(1.0)", "probability(1)");
mappings.put("probability(0.0)", "probability(0)");

FieldRenamer fieldRenamer = new FieldRenamer(mappings);
fieldRenamer.applyTo(pmml);

If you want to perform really custom renames, then consider
subclassing org.jpmml.model.visitors.FieldNameFilterer, and overriding
its #filter(String) method.
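
If many output fields are affected, the mappings for FieldRenamer could be generated rather than typed by hand. The following is a minimal sketch using only java.util classes. The RenameMappings class and its methods are hypothetical helpers (not part of JPMML-Model), and it assumes that the field names have already been collected elsewhere, for example with a Visitor.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of JPMML-Model)
class RenameMappings {

    // Rewrites a trailing "(1.0)" to "(1)"; any other name is returned unchanged
    public static String stripTrailingZeroFraction(String name) {
        return name.replaceAll("\\((-?\\d+)\\.0+\\)$", "($1)");
    }

    // Builds an old-name -> new-name map, skipping names that need no change
    public static Map<String, String> build(List<String> names) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String name : names) {
            String renamed = stripTrailingZeroFraction(name);
            if (!renamed.equals(name)) {
                mappings.put(name, renamed);
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("probability(0.0)", "probability(1.0)", "y");
        // The resulting map could then be passed to new FieldRenamer(mappings)
        System.out.println(build(names));
    }
}
```

Names such as "probability(0.5)" are deliberately left alone, because dropping the fraction there would change the category value.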

> And, secondly - there is one attribute in the <PMML> tag
> <PMML xmlns="http://www.dmg.org/PMML-4_2" ...> where
> we have the URI in the xmlns attribute.
>

Interesting - I hadn't noticed that DMG.org has changed all their XML
namespace URI protocols from "http" to "https".

For over 15 years, everything was "http://"-prefixed. And now it has been
quietly replaced with the "https://" prefix. I should check with the
URI specification if this kind of replacement is actually permitted -
perhaps these XMLNS URIs should be interpreted as two different
resources.

> Same thing, sometimes there is "https://" in the URL but java fails read ...

The org.dmg.pmml.Version enum uses hard-coded "http" protocol prefixes
currently:
https://github.com/jpmml/jpmml-model/blob/1.6.3/pmml-model/src/main/java/org/dmg/pmml/Version.java#L7-L14

> ... as per its regex so I want to convert into
> "http://" and then run the PMML from there. But the thing
>

You should implement a SAX filter for performing XMLNS replacements
(instead of regular expressions).

This looks like a very fundamental issue, and such a SAX filter should
be part of the JPMML-Model library (and applied by default).

I've opened a new JPMML-Model issue about this "surprise move":
https://github.com/jpmml/jpmml-model/issues/38
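
Until such a filter ships with the library, a minimal sketch of the idea using only JDK SAX classes might look like the following. The PmmlNamespaceFilter class, its fix(..) and rootNamespace(..) helpers, and the PMML-4_4 sample document are assumptions made for illustration; this is not the JPMML-Model implementation. The sketch rewrites element namespace URIs and prefix mappings only, on the assumption that PMML attributes are unqualified.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: maps "https://www.dmg.org/PMML-x_y" to "http://www.dmg.org/PMML-x_y"
class PmmlNamespaceFilter extends XMLFilterImpl {

    private static final String HTTPS_PREFIX = "https://www.dmg.org/PMML-";
    private static final String HTTP_PREFIX = "http://www.dmg.org/PMML-";

    PmmlNamespaceFilter(XMLReader parent) {
        super(parent);
    }

    // Plain prefix comparison; no regular expressions involved
    public static String fix(String uri) {
        if (uri != null && uri.startsWith(HTTPS_PREFIX)) {
            return HTTP_PREFIX + uri.substring(HTTPS_PREFIX.length());
        }
        return uri;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        super.startPrefixMapping(prefix, fix(uri));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        super.startElement(fix(uri), localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(fix(uri), localName, qName);
    }

    // Demo helper: parses the XML through the filter and returns the
    // namespace URI of the root element as seen downstream of the filter
    public static String rootNamespace(String xml) {
        final String[] result = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filter = new PmmlNamespaceFilter(factory.newSAXParser().getXMLReader());
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (result[0] == null) {
                        result[0] = uri;
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<PMML xmlns=\"https://www.dmg.org/PMML-4_4\" version=\"4.4\"/>";
        System.out.println(rootNamespace(xml));
    }
}
```

Since the filter is itself an XMLReader, it could be wrapped in a javax.xml.transform.sax.SAXSource together with the InputSource and handed to JAXB-based unmarshalling code, so that the rest of the pipeline only ever sees "http://" namespace URIs.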


VR

Aayush Shah

Sep 5, 2022, 4:17:41 AM
to Java PMML API
Hello VR,
It has been an amazing experience with the JPMML libraries, about which I initially knew nothing, but with your
to-the-point responses and guidance, I can now implement my project pretty well. Many thanks for it 🙏

I have tried searching the existing issues and also the blog on openscoring.io for my query, but wasn't able
to find anything. I want to include the "multivariate statistics" from Python in the PMML.

I have seen issue #47, which discusses including feature statistics in PMML, and that was pretty
clear; I was able to achieve that with the Domains. I am referring to MultivariateStats such as standard error,
t-value, p-value, ANOVA, etc.

According to my understanding, unlike statsmodels, sklearn doesn't provide such tests and values, so that might
be one of the reasons. But if there is a way to include them in the PMML, will you please guide me on how to achieve that?

I also couldn't find example PMML files online that could help me see how all of them are stored and how they are
interconnected. The DMG has pretty neat examples and explanations, but it would be great to have a full PMML
file.

Secondly,
I was also surprised by this http to https conversion, and I have been following issue #38 that you opened. I would like to
"replace" that https with http, but I am not sure how to work with the SAX filter in code. I think you've already implemented it, and I think
it should be fine to release it as an optional fix, since they aren't responding to that issue. Who knows
when they will!

If possible, could you please also give a code example for that replacement?

Thank you so much,
Aayush Shah