Hi Meir,
> The PMML came out 917MB but compressed to 96MB which
> is similar to the pkl file which did not compress measurably.
>
What is your current SkLearn2PMML package version? If you upgrade to
the latest 0.26.0 version (released on the 18th of October), then you
will be able to take advantage of the "decision tree compaction"
functionality, which reduces the size/complexity of decision
tree-based Scikit-Learn models (including ensemble models such as
RandomForestX and GradientBoostingX) by roughly a factor of two.
Attached is a demo archive Audit.zip, which builds a random forest
classifier for the Audit dataset.
The compaction of decision trees is activated on line 27, by setting a
"compact = True" attribute on the fitted estimator object. If you run
the script, then you will get two random forest PMML files - the
default one ("rf.pmml") is 1.9 MB in size, whereas the compacted one
("rf-compact.pmml") is 1.1 MB in size. During scoring, you should see
a performance improvement roughly proportional to the 1.1/1.9 size
ratio.
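For intuition only, here is a toy sketch of the kind of simplification
that compaction performs (plain Python, with a hypothetical nested-dict
tree representation - this is NOT the actual SkLearn2PMML algorithm):
a split node whose branches all lead to the same score is redundant,
and can be collapsed into a single leaf.

```python
# Toy illustration of decision tree "compaction" (hypothetical tree
# representation, not the real SkLearn2PMML implementation).

def compact(node):
    """Recursively collapse split nodes whose children agree on the score."""
    if "score" in node:  # leaf node
        return node
    left = compact(node["left"])
    right = compact(node["right"])
    if "score" in left and "score" in right and left["score"] == right["score"]:
        return {"score": left["score"]}  # the split is redundant, drop it
    return {"split": node["split"], "left": left, "right": right}

tree = {
    "split": ("Age", 35),
    "left": {"split": ("Income", 50000),
             "left": {"score": "yes"},
             "right": {"score": "yes"}},  # both branches agree
    "right": {"score": "no"},
}
print(compact(tree))
# -> {'split': ('Age', 35), 'left': {'score': 'yes'}, 'right': {'score': 'no'}}
```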
> When running scikit learn on large batch I got 0.5 milliseconds
> per row(single thread). When running jpmml after applying the
> set of optimizer from the example I got 4 milliseconds per row.
>
Your script looks fine.
After decision tree compaction, the JPMML side should improve from 4
ms to somewhere around 2 .. 2.5 ms per row.
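When comparing the two sides, it helps to measure per-row latency the
same way in both environments. A minimal sketch of such a batch
benchmark (plain Python, with a dummy predict function standing in for
the actual model evaluation call):

```python
import time

def predict(row):
    # Placeholder for the actual model evaluation call
    # (eg. pipeline.predict() on the SkLearn side, or a
    # JPMML-Evaluator invocation on the Java side).
    return sum(row)

rows = [[float(i), float(i) * 2.0] for i in range(100_000)]

start = time.perf_counter()
for row in rows:
    predict(row)
elapsed = time.perf_counter() - start

# Report per-row latency in milliseconds, averaged over the whole batch.
ms_per_row = (elapsed / len(rows)) * 1000.0
print(f"{ms_per_row:.4f} ms/row")
```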
This roughly five-fold performance difference has several contributing parts:
1) In SkLearn feature values are retrieved by array access, whereas in
JPMML they are retrieved by Map lookup. Also, there's a nasty float ->
double -> float conversion overhead for numeric features.
2) In SkLearn the predictions of member decision tree models are
lightweight arrays, whereas in JPMML they are java.util.LinkedHashMap
objects, which are rather costly to instantiate and initialize (e.g. see
https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/tree/TreeModelEvaluator.java#L327-L398).
3) Give or take, the Java/JVM platform is still slower than
Python/NumPy for this kind of numeric work.
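Point 1 can be demonstrated in any language; here is a plain-Python
analogy (not JPMML code - and note that Python dicts are heavily
optimized, so the gap is smaller than on the JVM, where the Map lookup
also involves key hashing and value boxing):

```python
import timeit

# The same "row" represented two ways: positional array vs. keyed map.
row_as_list = [1.0, 2.0, 3.0, 4.0, 5.0]
row_as_dict = {"f1": 1.0, "f2": 2.0, "f3": 3.0, "f4": 4.0, "f5": 5.0}

# Array access goes straight to an offset; map lookup must hash the key
# and probe the table before the value is found.
list_time = timeit.timeit(lambda: row_as_list[3], number=1_000_000)
dict_time = timeit.timeit(lambda: row_as_dict["f4"], number=1_000_000)

print(f"array access: {list_time:.3f} s, map lookup: {dict_time:.3f} s")
```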
>
> Are these the numbers I should expect?
>
You should benchmark classification- and regression-type random forest
models separately. The latter should perform relatively better,
because their result type is more lightweight
(org.jpmml.evaluator.tree.NodeScore vs
org.jpmml.evaluator.tree.NodeScoreDistribution).
Also, changing the "composition" of the feature space may change the
balance. I would argue that (J)PMML is relatively better at handling
categorical features than continuous features (e.g. in the demo
dataset, the string columns "Education", "Occupation" etc.), because
it operates directly on input values, without applying any
OneHotEncoder or LabelBinarizer transformations to them.
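To make that concrete: a one-hot encoding step turns a single string
column into as many binary columns as there are distinct values,
whereas PMML can compare the string value directly. A small sketch
(plain Python, illustrative category values only):

```python
def one_hot(values):
    """Expand a single categorical column into binary indicator columns."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories

# One string column becomes three binary columns after encoding.
education = ["Bachelor", "Master", "HSgrad", "Bachelor"]
encoded, categories = one_hot(education)
print(categories)   # ['Bachelor', 'HSgrad', 'Master']
print(encoded[0])   # [1, 0, 0]
```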
> Is there other stuff I should be using? A different setup for batch operations?
>
What is your performance requirement? Coming as close as possible to
Scikit-Learn performance, or some specific threshold (eg. "under
1ms")?
There are many undeployed hacks and tricks available, such as caching
org.jpmml.evaluator.tree.NodeScoreDistribution instances. However,
the long-term solution will be provided in the form of a
JPMML-Transpiler library layer on top of the JPMML-Evaluator library
layer.
VR