Hi Jiby,
>
> I am using random Forest.
>
That's exactly what I suspected.
For Scikit-Learn decision tree-based models, two new
org.dmg.pmml.SimplePredicate object instances are allocated for each node
split. This is extremely wasteful, because most of those newly
allocated SimplePredicate object instances are equal to some existing
SimplePredicate object instance.
For example, if your model contains a boolean field, then only two
SimplePredicate objects need to exist - one for the "false" value,
and the other for the "true" value:
<SimplePredicate field="myIndicatorVar" operator="equal" value="false"/>
<SimplePredicate field="myIndicatorVar" operator="equal" value="true"/>
sklearn2pmml/JPMML-SkLearn does not check whether an identical
SimplePredicate object has already been created, and will happily
create one million new <SimplePredicate field="myIndicatorVar"
operator="equal" value="false"/> objects. r2pmml/JPMML-R includes this
predicate caching/reuse logic, and is able to hold a roughly ten times
bigger PMML document in the same amount of RAM.
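To illustrate the caching/reuse idea, here is a minimal sketch in Python. It is not the actual JPMML-R or JPMML-SkLearn code: the SimplePredicate class below is a hypothetical stand-in for org.dmg.pmml.SimplePredicate, and PredicateCache is a name made up for this example. The only point is that equal predicates are looked up by their (field, operator, value) key and shared instead of being allocated again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimplePredicate:
    # Hypothetical stand-in for org.dmg.pmml.SimplePredicate (illustration only)
    field: str
    operator: str
    value: str


class PredicateCache:
    # Hypothetical helper: hands out one shared instance per distinct predicate
    def __init__(self):
        self._cache = {}

    def get(self, field, operator, value):
        key = (field, operator, value)
        predicate = self._cache.get(key)
        if predicate is None:
            # First time this combination is seen: allocate and remember it
            predicate = SimplePredicate(field, operator, value)
            self._cache[key] = predicate
        return predicate

    def __len__(self):
        return len(self._cache)


cache = PredicateCache()

# Simulate many node splits that all test the same boolean field
predicates = [
    cache.get("myIndicatorVar", "equal", "true" if i % 2 else "false")
    for i in range(100_000)
]
```

With the cache in place, the 100,000 simulated splits end up referencing just two SimplePredicate instances, which is where the memory saving comes from.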
Just opened a new GitHub issue about it:
https://github.com/jpmml/jpmml-sklearn/issues/34
The fix is really trivial; probably fewer than ten lines of code need
to be added/modified. Unfortunately, I cannot give you any time
estimate for when it will become available.
VR