JPMML vs Scikit performance


Meir Maor

Nov 27, 2017, 7:57:00 AM
to Java PMML API
I'm evaluating JPMML, and tried measuring performance.
I trained a scikit learn random forest model with 700 trees on 320K rows by 300 features.
The PMML came out at 917 MB but compressed to 96 MB, which is similar to the pkl file (which did not compress measurably).

When running scikit-learn on a large batch I got 0.5 milliseconds per row (single thread).
When running JPMML after applying the set of optimizers from the example I got 4 milliseconds per row (after warm-up).

Tested on my laptop (i7-7700HQ, Oracle Java 1.8.0_131-b11, giving Java 15 GB of heap out of 64 GB RAM).

Are these the numbers I should expect? Are there other benchmarks to look at for reasonably sized models? (I picked what I thought was a reasonable but not extreme size.)
Is there other stuff I should be using? A different setup for batch operations?

Some code snippets (Scala) are attached; almost all of the time is spent inside evaluator.evaluate.
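For reference, the warm-up and per-row timing described above can be sketched with a plain-JDK harness like this (a sketch, not the attached benchmark code: the Supplier stands in for a single-row call to evaluator.evaluate, and the iteration counts are arbitrary):

```java
import java.util.function.Supplier;

public class Bench {

    // Measures the average per-call time in microseconds, after a JIT
    // warm-up phase, for any single-row scoring function.
    public static double perCallMicros(Supplier<?> scoreOneRow, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            scoreOneRow.get();
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            scoreOneRow.get();
        }
        return (System.nanoTime() - t0) / 1_000.0 / measured;
    }

    public static void main(String[] args) {
        // A trivial stand-in workload; substitute the real evaluator call here.
        double micros = perCallMicros(() -> Math.log(42.0), 10_000, 100_000);
        System.out.printf("%.3f us/call%n", micros);
    }
}
```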

Thanks
Meir

jpmml.txt

Villu Ruusmann

Nov 27, 2017, 10:28:14 PM
to Java PMML API
Hi Meir,

> The PMML came out at 917 MB but compressed to 96 MB, which
> is similar to the pkl file, which did not compress measurably.
>

What is your current SkLearn2PMML package version? If you upgrade to
the latest 0.26.0 version (released on 18th of October), then you will
be able to take advantage of "decision tree compaction" functionality,
which reduces the size/complexity of decision tree-based Scikit-Learn
models (including ensemble models such as RandomForestX and
GradientBoostingX) roughly two times.

Attached is a demo archive Audit.zip, which builds a random forest
classifier for the Audit dataset.

The compaction of decision trees is activated on line 27, by setting a
"compact = True" attribute on a fitted estimator object. If you run
the script, then you will get two random forest PMML files - the
default one ("rf.pmml") is 1.9 MB in size, whereas the compacted one
("rf-compact.pmml") is 1.1 MB in size. During scoring, you should see
a comparable performance improvement (time roughly in proportion to
the 1.1/1.9 size ratio).

> When running scikit-learn on a large batch I got 0.5 milliseconds
> per row (single thread). When running JPMML after applying the
> set of optimizers from the example I got 4 milliseconds per row.
>

Your script looks fine.

After decision tree compaction, the JPMML side should improve from 4
ms to somewhere around 2 .. 2.5 ms.

This five-fold performance difference has several causes:
1) In SkLearn feature values are retrieved by array access, whereas in
JPMML they are retrieved by Map lookup. Also, there's a nasty float ->
double -> float conversion overhead for numeric features.
2) In SkLearn the predictions of member decision tree models are
lightweight arrays, whereas in JPMML they are java.util.LinkedHashMap
objects, which are rather costly to instantiate and initialize (eg.
see this https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/tree/TreeModelEvaluator.java#L327-L398).
3) Give or take, the JVM is still slower than Python/NumPy for numeric work.
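Points 1 and 2 can be illustrated outside of JPMML with a plain-JDK sketch. This is not the actual JPMML or Scikit-Learn code; the array vs. Map access patterns and the per-row LinkedHashMap allocation below only mimic the two code paths:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LookupOverhead {

    static final int N = 1_000_000;

    // Feature access by array index (the Scikit-Learn style).
    static double sumArray(double[] features) {
        double sum = 0;
        for (int i = 0; i < N; i++) {
            sum += features[i % features.length];
        }
        return sum;
    }

    // Feature access by Map lookup (the JPMML style), plus a fresh
    // LinkedHashMap per iteration to mimic per-row result objects.
    static double sumMap(Map<String, Double> features, String[] names) {
        double sum = 0;
        for (int i = 0; i < N; i++) {
            sum += features.get(names[i % names.length]);
            Map<String, Object> result = new LinkedHashMap<>(); // per-row allocation
            result.put("score", sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        String[] names = {"x1", "x2", "x3"};
        Map<String, Double> map = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            map.put(names[i], values[i]);
        }

        long t0 = System.nanoTime();
        double a = sumArray(values);
        long t1 = System.nanoTime();
        double b = sumMap(map, names);
        long t2 = System.nanoTime();

        // Both paths compute the same sum; only the access style differs.
        System.out.printf("equal=%b, array: %d ms, map: %d ms%n",
            a == b, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

On a typical JVM the Map-plus-allocation path is noticeably slower than the array path, which is the flavor of overhead points 1 and 2 describe.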

>
> Are these the numbers I should expect?
>

You should benchmark classification- and regression-type random forest
models separately. The latter should perform relatively better,
because their result type is more lightweight
(org.jpmml.evaluator.tree.NodeScore vs
org.jpmml.evaluator.tree.NodeScoreDistribution).

Also, changing the "composition" of feature space may change the
balance. I would argue that (J)PMML is relatively better at handling
categorical features than continuous features (eg. in the demo
dataset, string columns "Education", "Occupation" etc.) because it
operates directly on input values, without doing any OneHotEncoding or
LabelBinarizer transformations on them.

> Is there other stuff I should be using? A different setup for batch operations?
>

What is your performance requirement? Coming as close as possible to
Scikit-Learn performance, or some specific threshold (eg. "under
1ms")?

There are many undeployed hacks and tricks available, such as caching
org.jpmml.evaluator.tree.NodeScoreDistribution instances. However,
the long-term solution will be provided in the form of a
JPMML-Transpiler library layer on top of the JPMML-Evaluator
library layer.
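As a generic illustration of the caching trick (this is not the JPMML internal API; the class and key below are hypothetical), repeated identical result objects can be interned behind a computeIfAbsent cache, so that scoring the same leaf twice reuses one shared object instead of allocating a fresh one:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Interns result objects by key: the factory runs at most once per key,
// and all subsequent lookups return the same shared instance.
public class ResultCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();

    private final Function<K, V> factory;

    public ResultCache(Function<K, V> factory) {
        this.factory = factory;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, factory);
    }
}
```

Keyed by, say, a decision tree leaf identifier, this turns the per-row allocation of score objects into a one-time cost per distinct leaf.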


VR
Audit.zip

Meir Maor

Nov 29, 2017, 4:12:31 AM
to Java PMML API
Currently I'm converting to PMML using the Java library (which uses Python underneath?):
def pickledModelToPMML(pickledModel: File) = {
  val obj = using(PickleUtil.createStorage(pickledModel)) { storage =>
    PickleUtil.unpickle(storage)
  }
  val pMMLPipeline = obj match {
    case p: PMMLPipeline => p
    case p: Pipeline => new PMMLPipeline().setSteps(p.getSteps)
    case e: Estimator => new PMMLPipeline().setSteps(Collections.singletonList(Array[AnyRef]("estimator", e)))
  }
  val pmml = pMMLPipeline.encodePMML()
  ...

I also did not change the Python training code; I was happy that I could train and pickle models as before (without using PMMLPipeline), just fitting with sklearn.
I gather that to use the new compact feature I will have to change the training code and convert to PMML from Python? Can this be done with an already pickled Python model? Can it be used from Java/Scala with jpmml-sklearn?

Villu Ruusmann

Nov 29, 2017, 10:02:30 PM
to Java PMML API
Hi Meir,

> Currently I'm converting to PMML using the Java
> library (which uses Python underneath?)

The JPMML-SkLearn library is 100% pure Java, and can be used in
systems that don't have Python executable/runtime installed.

Data transfer between Python and Java systems happens via the pickle
file. The Python side writes it out, and the Java side reads it in
using the Pyrolite library (https://github.com/irmen/Pyrolite).

In Pyrolite, all Python classes inherit from
net.razorvine.pickle.objects.ClassDict, which in turn inherits from
java.util.HashMap. Therefore, you can set new attributes and delete
existing attributes using methods Map#put(String, Object) and
Map#remove(Object), respectively.

>
> Can this be done with an already pickled python model?
> can it be used from Java/Scala with jpmml-sklearn ?
>

The Python estimator classes that support decision tree compaction
are "marked" with the sklearn.tree.HasTreeOptions marker
interface.

In your Java/Scala code, you could activate the compaction flag like this:

if(obj instanceof Estimator){
    Estimator estimator = (Estimator)obj;

    if(estimator instanceof sklearn.tree.HasTreeOptions){
        estimator.put(sklearn.tree.HasTreeOptions.OPTION_COMPACT, Boolean.TRUE); // THIS!
    }
}

Please use the Java constant HasTreeOptions.OPTION_COMPACT (instead of
hard-coding the string literal "compact"). It is possible that the
naming of options will change at some point in the future (e.g. by
introducing a prefix to their names, such as "_option_compact"), and
hard-coded option names are likely to break then.


VR