Hi Ron,
> I'm on the latest version of JPMML, and I need to get the signature
> of the PMML model without loading the entire XML file into memory
> since the PMML file is a whopping 1 GB.
>
The JDK's default JAXB unmarshalling engine (GlassFish Metro) is
surprisingly good. It should be able to parse a 1 GB PMML file into an
in-memory PMML class model object in 10 seconds or less (at least I
can parse a 5 GB PMML file in a minute on my laptop).
There are several tricks that offer considerable performance improvements:
1) Increase JVM heap size by setting "-Xms" and "-Xmx" JVM options. As
a rule, if you don't want to wait on garbage collection pauses, then
the JVM heap size should be at least three times the size of your
largest PMML file.
2) Disable SAX Locator information. One option is telling the JAXB
runtime not to initialize the value of the PMMLObject#locator field,
the other option is deleting this field altogether (by redefining the
class org.dmg.pmml.PMMLObject using JPMML-Model agent technology
before it's loaded by the class loader), which would also reduce the
size of the resulting class model object (as that field is present in
every class model class instance).
3) Redefine more heavily used class model classes. There are several
class file transformers in the latest versions (1.2.13 and newer) of
the org.jpmml.agent package:
https://github.com/jpmml/jpmml-model/tree/1.2.X/pmml-agent/src/main/java/org/jpmml/agent
Does the performance improve if you start your JVM like this:
$ java -Xms6G -Xmx12G \
  -javaagent:"pmml-agent-1.2.14.jar=locator=false;extensions=false;node=simple,anonymous" \
  -cp myapplication.jar com.mycompany.MyApplication
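To sanity-check the heap rule of thumb from item 1 before parsing, something along these lines might help. It is only a sketch: the class name HeapRuleOfThumb and the method heapSufficient are made up for illustration, and the factor of three is simply the rule stated above, not a measured threshold.

```java
import java.io.File;

public class HeapRuleOfThumb {

    /** Returns true if the max heap is at least `factor` times the file size. */
    public static boolean heapSufficient(long fileSizeBytes, long maxHeapBytes, long factor){
        return maxHeapBytes >= fileSizeBytes * factor;
    }

    public static void main(String[] args){
        if(args.length == 0){
            System.out.println("Usage: java HeapRuleOfThumb <path to PMML file>");
            return;
        }

        File pmmlFile = new File(args[0]);

        // Upper bound of the heap, as configured by -Xmx (or the JVM default)
        long maxHeap = Runtime.getRuntime().maxMemory();

        if(!heapSufficient(pmmlFile.length(), maxHeap, 3L)){
            System.out.println("Heap is smaller than 3x the PMML file size; consider raising -Xmx");
        }
    }
}
```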
> Can I create a PMML instance without having to load all the trees for
> the random forest for example since I'm not interested in using the
> API to retrieve that information?
>
Basically, you're interested in retrieving the MiningSchema element of
the top-level MiningModel element (XPath
"/PMML/MiningModel/MiningSchema"). Every member TreeModel element
(there's probably 500 of them?) has its own MiningSchema element as
well (XPath "/PMML/MiningModel/Segmentation/Segment/TreeModel/MiningSchema"),
but they can be safely ignored, because their fields are a subset of
the fields of the top-level MiningSchema element.
The recommended way of collecting all MiningField elements is using
the Visitor API.
First, define a field name collector:
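// Package locations as of JPMML-Model 1.2.X; adjust if your version differs
import java.util.HashSet;
import java.util.Set;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.MiningField;
import org.dmg.pmml.VisitorAction;
import org.jpmml.model.visitors.AbstractVisitor;

public class MiningFieldCollector extends AbstractVisitor {
    public Set<FieldName> names = new HashSet<>();

    @Override
    public VisitorAction visit(MiningField miningField){
        // Record the field name, then let the default traversal continue
        this.names.add(miningField.getName());
        return super.visit(miningField);
    }
}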
Then, apply it to the PMML fragment of interest:
PMML pmml = ...;
MiningFieldCollector nameCollector = new MiningFieldCollector();
nameCollector.applyTo(pmml);
System.out.println(nameCollector.names);
> I see a SAXSource and InputSource that takes in an InputStream
> instance so I'm hoping that it's possible.
>
Indeed, as a last resort, you could "simplify" the XML event stream
that is consumed by the JAXB engine. There are no ready-to-use code
examples for that, so you would need to do some research yourself.
When dealing with random forests, the goal should be to build a
SAX filter that skips the top-level Segmentation element and all its
contents (XPath "/PMML/MiningModel/Segmentation").
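A minimal sketch of such a filter, using only JDK classes, could look like the following. The names SegmentationSkipDemo, SegmentationSkipper and survivingElements are made up for illustration. Note the simplifying assumption: it drops every Segmentation element at any depth, not only the one at "/PMML/MiningModel/Segmentation", which should be equivalent for a plain random forest but may not be for nested ensembles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class SegmentationSkipDemo {

    /** Drops every Segmentation element, together with its contents, from the SAX event stream. */
    static class SegmentationSkipper extends XMLFilterImpl {

        // Nesting depth inside a skipped Segmentation element; 0 means "not skipping"
        private int depth = 0;

        SegmentationSkipper(XMLReader parent){
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            if(depth > 0 || "Segmentation".equals(localName)){
                depth++;
                return;
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if(depth > 0){
                depth--;
                return;
            }
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if(depth == 0){
                super.characters(ch, start, length);
            }
        }
    }

    /** Parses the XML through the filter and returns the local names of the elements that got through. */
    public static List<String> survivingElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness is needed for localName to be populated
        factory.setNamespaceAware(true);

        XMLReader reader = factory.newSAXParser().getXMLReader();
        SegmentationSkipper filter = new SegmentationSkipper(reader);

        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler(){
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts){
                names.add(localName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));

        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PMML xmlns=\"http://www.dmg.org/PMML-4_2\"><MiningModel>"
            + "<MiningSchema><MiningField name=\"x\"/></MiningSchema>"
            + "<Segmentation><Segment><TreeModel><MiningSchema><MiningField name=\"x\"/></MiningSchema></TreeModel></Segment></Segmentation>"
            + "</MiningModel></PMML>";

        // prints [PMML, MiningModel, MiningSchema, MiningField]
        System.out.println(survivingElements(xml));
    }
}
```

To feed the filtered stream to JAXB, the filter can be wrapped as "new SAXSource(filter, new InputSource(inputStream))" and passed to the unmarshaller in place of the raw stream. I have not verified this against a real 1 GB file, and the resulting MiningModel will lack its Segmentation element, so it is only good for inspecting the schema, not for scoring.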
VR