is there any way to increase the heap size for sklearn2pmml version 0.12.0?


Jiby Babu

Mar 1, 2017, 5:39:54 PM
to Java PMML API
Hi,

I am writing a big PMML file and getting a java.lang.OutOfMemoryError: GC overhead limit exceeded error.
My sklearn2pmml version is 0.12.0.
Is there any workaround for increasing the heap size while the Java process is running?

Any help is much appreciated!

Thanks,
Jiby

Villu Ruusmann

Mar 1, 2017, 5:54:48 PM
to Java PMML API
Hi Jiby,

>
> I am writing a big PMML file and getting java.lang.OutOfMemoryError: GC
> overhead limit exceeded issue;
>
> is there any work around for increasing the heapsize while the java process
> is running?
>

You can't increase the heapsize of an already running JVM process.

I can think of the following options:

1) Export JAVA_OPTS environment variable with custom heap size settings:
http://stackoverflow.com/questions/417152/how-do-i-set-javas-min-and-max-heap-size-through-environment-variables
I don't know if this environment variable can be exported from within
a running Python script, but it should work if you export it using
regular shell tools (e.g. "export JAVA_OPTS=...") before starting the
Python process.
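It may also be possible to set the variable from within the Python script itself, since child processes inherit the parent's environment. Note that the JVM itself only reads the standard JAVA_TOOL_OPTIONS variable; whether JAVA_OPTS is honored depends on the launcher script, so this is a sketch under that assumption:

```python
import os
import subprocess
import sys

# Set the variable before sklearn2pmml() spawns its "java" subprocess.
# JAVA_TOOL_OPTIONS is read by the JVM itself; JAVA_OPTS is only a
# convention honored by some wrapper scripts.
os.environ["JAVA_TOOL_OPTIONS"] = "-Xms4G -Xmx16G"

# Demonstrate that a child process sees the setting.
out = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['JAVA_TOOL_OPTIONS'])"]
)
print(out.decode().strip())  # -Xms4G -Xmx16G
```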

2) Clone the sklearn2pmml package, open the definition of
sklearn2pmml() function, and insert "-Xms" and "-Xmx" options there:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L130

Actually, tweaking JVM options seems like a useful feature, so I've
opened an issue about it:
https://github.com/jpmml/sklearn2pmml/issues/28

3) Save the PMMLPipeline object in Pickle data format to a file in
local filesystem, and perform the conversion using JPMML-SkLearn
command-line application: https://github.com/jpmml/jpmml-sklearn
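The third option can be sketched as follows; the pipeline stand-in, the JAR name, and the --pkl-input/--pmml-output flags here are assumptions based on the JPMML-SkLearn README, so verify the exact invocation against the project page:

```python
import pickle
import pprint

# Stand-in for the fitted PMMLPipeline; in real use this would be a
# sklearn2pmml.pipeline.PMMLPipeline that has been fit on your data.
pipeline = {"note": "placeholder for a fitted PMMLPipeline"}

# Step 1: save the pipeline to the local filesystem in Pickle data format.
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Step 2: convert with the JPMML-SkLearn command-line application.
# JAR name and flags are taken from the JPMML-SkLearn README of the
# time; check the current documentation. Heap options can be passed
# directly on the java command line.
cmd = [
    "java", "-Xms4G", "-Xmx16G",
    "-jar", "jpmml-sklearn-executable.jar",
    "--pkl-input", "pipeline.pkl",
    "--pmml-output", "pipeline.pmml",
]
pprint.pprint(cmd)  # pass this list to subprocess.check_call(cmd)
```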


VR

Jiby Babu

Mar 1, 2017, 8:15:11 PM
to Villu Ruusmann, Java PMML API
Hi Villu,

You are simply awesome! The first small hack worked right away!
Thanks a lot for your quick response.

Cheers,
Jiby 

Jiby Babu

Mar 9, 2017, 3:08:21 PM
to Villu Ruusmann, Java PMML API
Hi Villu,

The first hack you suggested was working, but unfortunately it isn't anymore.
I was trying out your second tweak, i.e. inserting the "-Xms4G" and "-Xmx16G" options there:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L130

I just need to replace the following 

cmd = ["java", "-cp", os.pathsep.join(_package_classpath() + user_classpath), "org.jpmml.sklearn.Main"]
with
cmd = ["java -Xms4G -Xmx16G ", "-cp", os.pathsep.join(_package_classpath() + user_classpath), "org.jpmml.sklearn.Main"]

in the file /Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py right?

But the above change is returning an error.
Is there anything wrong in the way I am doing it?

Thanks,
Jiby

Villu Ruusmann

Mar 9, 2017, 3:50:34 PM
to Java PMML API
Hi Jiby,

> cmd = ["java -Xms4G -Xmx16G ", "-cp", os.pathsep.join(_package_classpath() +
> user_classpath), "org.jpmml.sklearn.Main"]
>

You should break "java -Xms4G -Xmx16G" into three tokens: "java",
"-Xms4G" and "-Xmx16G". subprocess.check_call() will assemble them
as appropriate.

If you mix "clean" tokens with tokens that contain embedded
whitespace, then the subprocess module is likely to get confused.
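The difference between the two argument lists can be checked with the standard library's shlex module, which splits a command string the same way a shell would:

```python
import shlex

# Wrong: the first element fuses the program name and two options into
# a single token, so the OS would look for an executable literally
# named "java -Xms4G -Xmx16G ".
bad_cmd = ["java -Xms4G -Xmx16G ", "-cp", "classpath", "org.jpmml.sklearn.Main"]

# Right: one list element per token; subprocess.check_call(cmd) passes
# each element to the program unchanged, with no shell re-splitting.
good_cmd = ["java", "-Xms4G", "-Xmx16G", "-cp", "classpath", "org.jpmml.sklearn.Main"]

# shlex.split shows how a shell would have tokenized the string form.
assert shlex.split("java -Xms4G -Xmx16G") == ["java", "-Xms4G", "-Xmx16G"]
```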

Anyway, what's this model object that requires so much memory? Some
decision tree ensemble (random forest, gbm)? The
sklearn2pmml/JPMML-SkLearn stack is much less memory optimized than
the r2pmml/JPMML-R stack; the difference is something like ten-fold.

If you give me more technical information about your setup (ideally,
the PKL file), then I could take directed action to narrow this gap.


VR

Jiby Babu

Mar 9, 2017, 4:14:59 PM
to Villu Ruusmann, Java PMML API
Hi Villu,
Thanks for the recommendation.
That makes sense, and I will try it out.

I am using random forest; what other technical info do you need?

Thanks,
Jiby

Villu Ruusmann

Mar 9, 2017, 4:38:34 PM
to Java PMML API
Hi Jiby,

>
> I am using random Forest.
>

That's exactly what I suspected.

For Scikit-Learn decision tree-based models, two new
org.dmg.pmml.SimplePredicate object instances are allocated for each
node split. This is extremely wasteful, because most of those newly
allocated SimplePredicate object instances are equal to some existing
SimplePredicate object instances.

For example, if your model contains a boolean field, then there only
need to exist two SimplePredicate objects - one for the "false" value,
and the other for the "true" value:
<SimplePredicate field="myIndicatorVar" operator="equal" value="false"/>
<SimplePredicate field="myIndicatorVar" operator="equal" value="true"/>

sklearn2pmml/JPMML-SkLearn does not check if an identical
SimplePredicate object has already been created, and will happily
create one million new <SimplePredicate field="myIndicatorVar"
operator="equal" value="false"/> objects. r2pmml/JPMML-R includes this
predicate caching/reuse logic, and is able to hold a roughly ten times
bigger PMML document in the same amount of RAM.

I've just opened a new GitHub issue about it:
https://github.com/jpmml/jpmml-sklearn/issues/34

The fix is really trivial; probably fewer than ten lines of code need
to be added/modified. Unfortunately, I cannot give you a time estimate
for when it will become available.


VR

Jiby Babu

unread,
Mar 9, 2017, 7:40:58 PM3/9/17
to Villu Ruusmann, Java PMML API
Thanks Villu! Hope we will have a fix soon :)
By the way, your suggested temporary workaround worked! Thanks a ton!

Cheers,
Jiby