Get fields of a PMML model without loading the entire model into memory


Ron Gonzalez

May 2, 2016, 1:51:17 PM
to Java PMML API
I'm on the latest version of JPMML, and I need to get the signature of the PMML model without loading the entire XML file into memory since the PMML file is a whopping 1 GB.

I see a SAXSource and InputSource that takes in an InputStream instance so I'm hoping that it's possible.

Can I create a PMML instance without having to load all the trees for the random forest for example since I'm not interested in using the API to retrieve that information?

Thanks,
Ron

Villu Ruusmann

May 2, 2016, 4:51:29 PM
to Java PMML API
Hi Ron,

> I'm on the latest version of JPMML, and I need to get the signature
> of the PMML model without loading the entire XML file into memory
> since the PMML file is a whopping 1 GB.
>

The JDK's default JAXB unmarshalling engine (GlassFish Metro) is
surprisingly good. It should be able to parse a 1 GB PMML file into an
in-memory PMML class model object in 10 seconds or less (at least I
can parse a 5 GB PMML file in a minute on my laptop).

There are several tricks that offer considerable performance improvements:
1) Increase the JVM heap size by setting the "-Xms" and "-Xmx" JVM options.
As a rule, if you don't want to stall on garbage collection, then the JVM
heap size should be at least three times the size of your largest PMML file.
2) Disable SAX Locator information. One option is telling the JAXB
runtime not to initialize the value of the PMMLObject#locator field,
the other option is deleting this field altogether (by redefining the
class org.dmg.pmml.PMMLObject using JPMML-Model agent technology
before it's loaded by the class loader), which would also reduce the
size of the resulting class model object (as that field is present in
every class model class instance).
3) Redefine more heavily used class model classes. There are several
class file transformers in the latest versions (1.2.13 and newer) of
the org.jpmml.agent package:
https://github.com/jpmml/jpmml-model/tree/1.2.X/pmml-agent/src/main/java/org/jpmml/agent

Does the performance improve if you start your JVM like this:
$ java -Xms6G -Xmx12G \
  -javaagent:"pmml-agent-1.2.14.jar=locator=false;extensions=false;node=simple,anonymous" \
  -cp myapplication.jar com.mycompany.MyApplication

> Can I create a PMML instance without having to load all the trees for
> the random forest for example since I'm not interested in using the
> API to retrieve that information?
>

Basically, you're interested in retrieving the MiningSchema element of
the top-level MiningModel element (XPath
"/PMML/MiningModel/MiningSchema"). Every member TreeModel element
(there's probably 500 of them?) has its own MiningSchema element as
well (XPath "/PMML/MiningModel/Segmentation/Segment/TreeModel/MiningSchema"),
but these can be safely ignored, because their fields are a subset of
the fields declared in the top-level MiningSchema element.

The recommended way of collecting all MiningField elements is using
the Visitor API.

First, define a field name collector:
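public class MiningFieldCollector extends AbstractVisitor {

    // Names of all MiningField elements encountered during the traversal
    public Set<FieldName> names = new HashSet<>();

    @Override
    public VisitorAction visit(MiningField miningField){
        this.names.add(miningField.getName());

        // Delegate to the superclass so that the traversal continues
        return super.visit(miningField);
    }
}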

Then, apply it to the PMML fragment of interest:
PMML pmml = ...;
MiningFieldCollector nameCollector = new MiningFieldCollector();
nameCollector.applyTo(pmml);
System.out.println(nameCollector.names);

> I see a SAXSource and InputSource that takes in an InputStream
> instance so I'm hoping that it's possible.
>

Indeed, as a last resort, you could "simplify" the XML event stream
that is consumed by the JAXB engine. There are no ready-to-use code
examples for that, so you need to do the research.

For random forests, the goal should be to build a SAX filter that
skips the top-level Segmentation element and all its contents
(XPath "/PMML/MiningModel/Segmentation").


VR

Villu Ruusmann

May 3, 2016, 5:11:00 AM
to Java PMML API
Hi Ron,

>
>> I see a SAXSource and InputSource that takes in an InputStream
>> instance so I'm hoping that it's possible.
>>
>
> Indeed, as a last resort, you could "simplify" the XML event stream
> that is consumed by the JAXB engine. There are no ready to use code
> examples for that, so you need to do the research.
>
> When dealing with random forests, then the goal should be to build a
> SAX filter that skips the top-level Segmentation element and all its
> contents (XPath "/PMML/MiningModel/Segmentation").
>

I have just introduced a SAX filter class org.jpmml.model.SkipFilter
for skipping PMML elements:
https://github.com/jpmml/jpmml-model/commit/52674a3060bd14267f58c796102aad95b81d9430

The usage is straightforward:
InputStream is = ...;
Source source = SkipFilter.apply(new InputSource(is), "Segmentation");
PMML pmml = JAXBUtil.unmarshalPMML(source);

This filter should "reduce" your 1 GB random forest file down to mere
kilobytes (and thus increase JAXB unmarshalling speeds by several
orders of magnitude).
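
To make the mechanics concrete, here is a minimal sketch of how an element-skipping SAX filter can work, using JDK classes only. It is not the source of org.jpmml.model.SkipFilter; the class names SkipElementFilter and SkipElementFilterDemo, and the helper elementNames, are made up for illustration. The filter keeps a depth counter and swallows all events while it is inside a skipped element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class (not part of JPMML): lists the elements that survive filtering.
class SkipElementFilterDemo {

    static List<String> elementNames(String xml, String skippedName) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        XMLReader filter = new SkipElementFilter(reader, skippedName);

        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(localName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PMML><MiningModel>"
            + "<MiningSchema><MiningField name=\"x\"/></MiningSchema>"
            + "<Segmentation><Segment><TreeModel><MiningSchema/></TreeModel></Segment></Segmentation>"
            + "</MiningModel></PMML>";
        // Prints [PMML, MiningModel, MiningSchema, MiningField]
        System.out.println(elementNames(xml, "Segmentation"));
    }
}

// Hypothetical filter: drops every element with the given local name, plus its contents.
class SkipElementFilter extends XMLFilterImpl {

    private final String skippedName;
    private int depth = 0; // greater than zero while inside a skipped element

    SkipElementFilter(XMLReader parent, String skippedName) {
        super(parent);
        this.skippedName = skippedName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (depth > 0 || skippedName.equals(localName)) {
            depth++; // entering a skipped element, or nesting deeper inside one
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--; // leaving an element that was never forwarded
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}
```

Because the skipped events never reach the downstream content handler, a JAXB unmarshaller sitting behind such a filter would never allocate objects for the dropped subtree, which is where the memory and time savings come from.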

However, I'm unsure if it was a good move to make this SAX filter class
a subclass of org.xml.sax.helpers.XMLFilterImpl, because it is
rather complicated to chain it together with other SAX filter classes
such as org.jpmml.model.ImportFilter.

If you can suggest ways to improve the "chainability" of SAX filters,
then please let me know. There will be an opportunity to introduce
breaking API changes when switching from JPMML-Model version 1.2.X to
1.3.X.


VR

Ron Gonzalez

May 4, 2016, 10:16:29 PM
to Villu Ruusmann, Java PMML API
Great thanks Villu! I'll give it a whirl.

--Ron

Ron Gonzalez

May 4, 2016, 11:00:25 PM
to Villu Ruusmann, Java PMML API
I did this and it seems to have succeeded, but I'm not sure how to check in the created PMML if it really skipped the segmentation section. When I ran it in debug mode, the isSkipping() method was getting invoked.

        InputSource source = new InputSource(pmmlStream);
       
        XMLReader reader = XMLReaderFactory.createXMLReader();
        XMLFilter importFilter = new ImportFilter(reader);
        XMLFilter skipFilter = new SkipFilter(reader, "Segmentation");
       
        skipFilter.setParent(importFilter);
       
        SAXSource transformedSource = new SAXSource(skipFilter, source);
       
        PMML pmml = JAXBUtil.unmarshalPMML(transformedSource);

Villu Ruusmann

May 5, 2016, 9:28:36 AM
to Java PMML API
Hi Ron,

Thanks for suggesting XMLFilter#setParent(XMLReader) - this was the
missing piece in the SAX toolbox.

I have generalized your solution to any number of XML filters:
https://github.com/jpmml/jpmml-model/commit/56af1e42a0cb9d79b4d30b305f879ba723ce9277

For example, the following command-line application will copy a PMML
file from one location to another, while deleting all Segmentation and
Extension elements in it:

public static void main(String... args) throws Exception {
    PMML pmml;

    try(InputStream is = new FileInputStream(args[0])){
        InputSource source = new InputSource(is);
        Source filteredSource = JAXBUtil.createFilteredSource(source,
            new ImportFilter(), new SkipFilter("Segmentation"), new SkipFilter("Extension"));
        pmml = JAXBUtil.unmarshalPMML(filteredSource);
    }

    try(OutputStream os = new FileOutputStream(args[1])){
        JAXBUtil.marshalPMML(pmml, new StreamResult(os));
    }
}
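
For readers curious how chaining via setParent can be generalized, the following is a rough JDK-only sketch. It is not the JAXBUtil.createFilteredSource implementation; the names FilterChainDemo, chain, survivingElements and DropFilter are hypothetical. The idea is simply that each filter becomes the parent of the next one, so SAX events pass through the filters in the order they were listed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class (not part of JPMML).
class FilterChainDemo {

    // Each filter becomes the parent of the next one, so events flow
    // reader -> filters[0] -> filters[1] -> ... -> content handler.
    static XMLReader chain(XMLReader reader, XMLFilter... filters) {
        XMLReader parent = reader;
        for (XMLFilter filter : filters) {
            filter.setParent(parent);
            parent = filter;
        }
        return parent;
    }

    // Parses the XML through one DropFilter per name and lists the surviving elements.
    static List<String> survivingElements(String xml, String... dropped) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        XMLFilter[] filters = new XMLFilter[dropped.length];
        for (int i = 0; i < dropped.length; i++) {
            filters[i] = new DropFilter(dropped[i]);
        }
        XMLReader chained = chain(factory.newSAXParser().getXMLReader(), filters);

        List<String> names = new ArrayList<>();
        chained.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(localName);
            }
        });
        chained.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PMML><Header><Extension name=\"a\"/></Header><MiningModel>"
            + "<MiningSchema><MiningField name=\"x\"/></MiningSchema>"
            + "<Segmentation><Segment><Extension/></Segment></Segmentation>"
            + "</MiningModel></PMML>";
        // Prints [PMML, Header, MiningModel, MiningSchema, MiningField]
        System.out.println(survivingElements(xml, "Segmentation", "Extension"));
    }
}

// Hypothetical filter with no parent at construction time; the parent is set by chain().
class DropFilter extends XMLFilterImpl {

    private final String name;
    private int depth = 0; // greater than zero while inside a dropped element

    DropFilter(String name) {
        this.name = name;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (depth > 0 || name.equals(localName)) {
            depth++;
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}
```

The outermost reader returned by chain() could equally be wrapped in a javax.xml.transform.sax.SAXSource and handed to an unmarshaller, which is the role the filtered Source plays in the application above.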

You can expect a new release of the JPMML-Evaluator library by the end
of next week. Until then you should copy-paste the XML filtering code
into your application.


VR

Ron Gonzalez

May 5, 2016, 10:35:04 AM
to Villu Ruusmann, Java PMML API
Great, will look forward to that...