is there any way to increase the heap size for sklearn2pmml version 0.12.0?


Jiby Babu

Mar 1, 2017, 5:39:54 PM
to Java PMML API
Hi,

I am writing a big PMML file and getting a java.lang.OutOfMemoryError: GC overhead limit exceeded error.
My sklearn2pmml version is 0.12.0.
Is there any workaround for increasing the heap size while the Java process is running?

Any help is much appreciated!

Thanks,
Jiby

Villu Ruusmann

Mar 1, 2017, 5:54:48 PM
to Java PMML API
Hi Jiby,

>
> I am writing a big PMML file and getting java.lang.OutOfMemoryError: GC
> overhead limit exceeded issue;
>
> is there any work around for increasing the heapsize while the java process
> is running?
>

You can't increase the heapsize of an already running JVM process.

I can think of the following options:

1) Export JAVA_OPTS environment variable with custom heap size settings:
http://stackoverflow.com/questions/417152/how-do-i-set-javas-min-and-max-heap-size-through-environment-variables
I don't know if this environment variable can be exported from within
a running Python script, but it should work if you export it using
regular shell tools (e.g. "export JAVA_OPTS=...") before starting the
Python process.
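It may also be possible to set the variable from within the Python script itself, since child processes inherit the parent's environment. Note that the JVM itself only reads the standard JAVA_TOOL_OPTIONS variable; whether JAVA_OPTS is honored depends on the launcher script, so this is a sketch under that assumption:

```python
import os
import subprocess
import sys

# Set the variable before sklearn2pmml() spawns its "java" subprocess.
# JAVA_TOOL_OPTIONS is read by the JVM itself; JAVA_OPTS is only a
# convention honored by some wrapper scripts.
os.environ["JAVA_TOOL_OPTIONS"] = "-Xms4G -Xmx16G"

# Demonstrate that a child process sees the setting.
out = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['JAVA_TOOL_OPTIONS'])"]
)
print(out.decode().strip())  # -Xms4G -Xmx16G
```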

2) Clone the sklearn2pmml package, open the definition of
sklearn2pmml() function, and insert "-Xms" and "-Xmx" options there:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L130

Actually, tweaking JVM options seems like a useful feature, so I've
opened an issue about it:
https://github.com/jpmml/sklearn2pmml/issues/28

3) Save the PMMLPipeline object in Pickle data format to a file in
local filesystem, and perform the conversion using JPMML-SkLearn
command-line application: https://github.com/jpmml/jpmml-sklearn
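The third option can be sketched as follows; the pipeline stand-in, the JAR name, and the --pkl-input/--pmml-output flags here are assumptions based on the JPMML-SkLearn README, so verify the exact invocation against the project page:

```python
import pickle
import pprint

# Stand-in for the fitted PMMLPipeline; in real use this would be a
# sklearn2pmml.pipeline.PMMLPipeline that has been fit on your data.
pipeline = {"note": "placeholder for a fitted PMMLPipeline"}

# Step 1: save the pipeline to the local filesystem in Pickle data format.
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Step 2: convert with the JPMML-SkLearn command-line application.
# JAR name and flags are taken from the JPMML-SkLearn README of the
# time; check the current documentation. Heap options can be passed
# directly on the java command line.
cmd = [
    "java", "-Xms4G", "-Xmx16G",
    "-jar", "jpmml-sklearn-executable.jar",
    "--pkl-input", "pipeline.pkl",
    "--pmml-output", "pipeline.pmml",
]
pprint.pprint(cmd)  # pass this list to subprocess.check_call(cmd)
```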


VR

Jiby Babu

Mar 1, 2017, 8:15:11 PM
to Villu Ruusmann, Java PMML API
Hi Villu,

You are simply awesome! The first small hack worked right away!
Thanks a lot for your quick response.

Cheers,
Jiby 

Jiby Babu

Mar 9, 2017, 3:08:21 PM
to Villu Ruusmann, Java PMML API
Hi Villu,

The first hack you suggested was working, but unfortunately it isn't anymore.
I was trying out your second tweak, i.e. inserting the "-Xms4G" and "-Xmx16G" options there:
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/__init__.py#L130

I just need to replace the following 

cmd = ["java", "-cp", os.pathsep.join(_package_classpath() + user_classpath), "org.jpmml.sklearn.Main"]
with
cmd = ["java -Xms4G -Xmx16G ", "-cp", os.pathsep.join(_package_classpath() + user_classpath), "org.jpmml.sklearn.Main"]

in the file /Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py right?

But the above change is returning an error.
Is there anything wrong in the way I am doing it?

Thanks,
Jiby

Villu Ruusmann

Mar 9, 2017, 3:50:34 PM
to Java PMML API
Hi Jiby,

> cmd = ["java -Xms4G -Xmx16G ", "-cp", os.pathsep.join(_package_classpath() +
> user_classpath), "org.jpmml.sklearn.Main"]
>

You should break "java -Xms4G -Xmx16G" into three tokens: "java",
"-Xms4G" and "-Xmx16G". subprocess.check_call() will assemble them
as appropriate.

If you mix "clean" tokens with tokens that contain embedded
whitespace, then the subprocess module is likely to get confused.
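The difference between the two argument lists can be checked with the standard library's shlex module, which splits a command string the same way a shell would:

```python
import shlex

# Wrong: the first element fuses the program name and two options into
# a single token, so the OS would look for an executable literally
# named "java -Xms4G -Xmx16G ".
bad_cmd = ["java -Xms4G -Xmx16G ", "-cp", "classpath", "org.jpmml.sklearn.Main"]

# Right: one list element per token; subprocess.check_call(cmd) passes
# each element to the program unchanged, with no shell re-splitting.
good_cmd = ["java", "-Xms4G", "-Xmx16G", "-cp", "classpath", "org.jpmml.sklearn.Main"]

# shlex.split shows how a shell would have tokenized the string form.
assert shlex.split("java -Xms4G -Xmx16G") == ["java", "-Xms4G", "-Xmx16G"]
```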

Anyway, what's this model object that requires so much memory? Some
decision tree ensemble (random forest, gbm)? The
sklearn2pmml/JPMML-SkLearn stack is much less memory optimized than
the r2pmml/JPMML-R stack; the difference is something like ten-fold.

If you give me more technical information about your setup (ideally,
the PKL file), then I could take directed action to narrow this gap.


VR

Jiby Babu

Mar 9, 2017, 4:14:59 PM
to Villu Ruusmann, Java PMML API
Hi Villu,
Thanks for the recommendation.
That makes sense, and I will try it out.

I am using random forest; what other technical info do you need?

Thanks,
Jiby

Villu Ruusmann

Mar 9, 2017, 4:38:34 PM
to Java PMML API
Hi Jiby,

>
> I am using random Forest.
>

That's exactly what I suspected.

For Scikit-Learn decision tree-based models, two new
org.dmg.pmml.SimplePredicate object instances are allocated for each
node split. This is extremely wasteful, because most of those newly
allocated SimplePredicate object instances are equal to some existing
SimplePredicate object instances.

For example, if your model contains a boolean field, then there only
need to exist two SimplePredicate objects - one for the "false" value,
and the other for the "true" value:
<SimplePredicate field="myIndicatorVar" operator="equal" value="false"/>
<SimplePredicate field="myIndicatorVar" operator="equal" value="true"/>

sklearn2pmml/JPMML-SkLearn does not check if an identical
SimplePredicate object has already been created, and will happily
create one million new <SimplePredicate field="myIndicatorVar"
operator="equal" value="false"/> objects. r2pmml/JPMML-R includes this
predicate caching/reuse logic, and is able to hold a roughly ten times
bigger PMML document in the same amount of RAM.

I've just opened a new GitHub issue about it:
https://github.com/jpmml/jpmml-sklearn/issues/34

The fix is really trivial; probably fewer than ten lines of code need
to be added/modified. Unfortunately, I cannot give you a time estimate
for when it will become available.


VR

Jiby Babu

unread,
Mar 9, 2017, 7:40:58 PM3/9/17
to Villu Ruusmann, Java PMML API
Thanks Villu! Hope we will have a fix soon :)
By the way, your suggested temporary workaround worked! Thanks a ton!

Cheers,
Jiby