Performance Improvements (initial heap space, autosave, row operations)

Felix Lohmeier

Jun 20, 2017, 8:46:52 PM

to OpenRefine

I noticed that the dev team is going to focus on performance issues in 2017 (cf. https://github.com/OpenRefine/OpenRefine/projects). Here are some of my personal observations:

Java heap space settings: OpenRefine defaults to -Xms256M (initial heap space) and -Xmx1024M (max heap space). The command line argument "-m" overrides the max heap space setting (-Xmx), but the initial heap space remains unchanged. Some people recommend setting -Xms equal to -Xmx, which should prevent the JVM from wasteful upsizing steps. I have tried to measure the time savings with Owen Stephens's openrefine-timer (see the attached timings-xms-xmx.xls) and observed 3% time savings with -Xms8g -Xmx8g instead of -Xms256M -Xmx8g. Are there any known drawbacks? I propose changing line 915 in the refine shell script from add_option "-Xms256M -Xmx$REFINE_MEMORY -Drefine.memory=$REFINE_MEMORY" to add_option "-Xms$REFINE_MEMORY -Xmx$REFINE_MEMORY -Drefine.memory=$REFINE_MEMORY". If there is no consensus on this, maybe we can add another variable for the initial heap space?
Autosave period: There is a fixed period of 5 minutes in line 88 of main/src/com/google/refine/RefineServlet.java. Is there any way to override this setting (e.g. in refine.ini)? This is a significant performance issue for long-running transformations.
Row operations (split/join multi-valued cells, delete rows): These operations are very expensive. Because the cross function only works on cells and therefore requires splitting multi-valued cells first, these operations often amount to more than 75% of total run time in my projects. Either extending the cross function to accept values as input (which would allow combining the split, cross and join in one expression) or improving the general execution time of row operations (parallelization?) would have a high impact on total run time.
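To make the "accept values as input" idea concrete, here is a minimal sketch in plain Java. It is not OpenRefine code: the class ValueLookupSketch, its methods, and the sample data are all hypothetical. It only shows how a multi-valued cell could be split, looked up against an index, and joined again in one pass, without creating and deleting rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in OpenRefine.
public class ValueLookupSketch {

    // Build a key -> values index from {key, value} pairs, standing in for
    // the lookup table that a cross-like function would consult.
    static Map<String, List<String>> buildIndex(String[][] keyValuePairs) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] pair : keyValuePairs) {
            index.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return index;
    }

    // Split the cell on the separator, look up each part, and join the
    // matches again. Parts without a match contribute nothing.
    static String lookupMultiValued(String cell, String separator,
                                    Map<String, List<String>> index) {
        List<String> results = new ArrayList<>();
        for (String part : cell.split(Pattern.quote(separator))) {
            results.addAll(index.getOrDefault(part.trim(), Collections.emptyList()));
        }
        return String.join(separator, results);
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = buildIndex(new String[][] {
            {"a", "Alpha"}, {"b", "Beta"}, {"b", "Bravo"}
        });
        System.out.println(lookupMultiValued("a|b|x", "|", index));
    }
}
```

Whether this would actually be faster inside OpenRefine depends on how the real cross function builds its index, which this sketch does not model.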

timings-xms-xmx.xls

Thad Guidry

Jun 20, 2017, 10:15:54 PM

to openr...@googlegroups.com

On Tue, Jun 20, 2017 at 7:46 PM Felix Lohmeier <felix.l...@opencultureconsulting.de> wrote:

I noticed that the dev team is going to focus on performance issues in 2017 (cf. https://github.com/OpenRefine/OpenRefine/projects). Here are some of my personal observations:
Java heap space settings: OpenRefine defaults to -Xms256M (initial heap space) and -Xmx1024M (max heap space). The command line argument "-m" overrides the max heap space setting (-Xmx), but the initial heap space remains unchanged. Some people recommend setting -Xms equal to -Xmx, which should prevent the JVM from wasteful upsizing steps. I have tried to measure the time savings with Owen Stephens's openrefine-timer (see the attached timings-xms-xmx.xls) and observed 3% time savings with -Xms8g -Xmx8g instead of -Xms256M -Xmx8g. Are there any known drawbacks? I propose changing line 915 in the refine shell script from add_option "-Xms256M -Xmx$REFINE_MEMORY -Drefine.memory=$REFINE_MEMORY" to add_option "-Xms$REFINE_MEMORY -Xmx$REFINE_MEMORY -Drefine.memory=$REFINE_MEMORY". If there is no consensus on this, maybe we can add another variable for the initial heap space?

Yes, the drawback shows up if you don't have 8 GB of free memory that your OS doesn't already have its fingers on :)

Sure, we can make it slightly easier to pass this as a command line parameter. But for most folks using OpenRefine we opt for set-it-and-forget-it, i.e. setting this preference in the refine.ini file, which we already support and document on our Wiki.
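For anyone who wants to check what the JVM actually got, here is a small standalone sketch. It is not part of OpenRefine, and the class name HeapReport is made up. It prints the currently committed heap (Runtime.totalMemory(), which at startup is usually close to -Xms, though the JVM may round it) and the maximum heap (Runtime.maxMemory(), which roughly corresponds to -Xmx).

```java
// Hypothetical helper, not part of OpenRefine.
public class HeapReport {

    // Convert a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    // Format the two heap figures for display.
    static String describe(long committedBytes, long maxBytes) {
        return "committed heap: " + toMiB(committedBytes)
                + " MiB, max heap: " + toMiB(maxBytes) + " MiB";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(describe(rt.totalMemory(), rt.maxMemory()));
    }
}
```

Running it once as java -Xms256M -Xmx1024M HeapReport and once as java -Xms1024M -Xmx1024M HeapReport should show whether the initial heap setting took effect on your machine.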

Autosave period: There is a fixed period of 5 minutes in line 88 of main/src/com/google/refine/RefineServlet.java. Is there any way to override this setting (e.g. in refine.ini)? This is a significant performance issue for long-running transformations.

There is currently no way to override the autosave period.

This is one of those things that I myself asked previous Devs to add to the preferences to be able to override, just as we allow for the workspace folder and other settings https://github.com/OpenRefine/OpenRefine/blob/master/main/webapp/modules/core/preferences.vt
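As a rough idea of what such an override could look like, here is a minimal sketch that reads the period from a JVM system property and falls back to the current 5 minutes. The class AutosaveSetting and the property name refine.autosave are assumptions for illustration; this is not how RefineServlet.java is written today.

```java
// Hypothetical sketch; the property name "refine.autosave" is an assumption.
public class AutosaveSetting {

    // The fixed period mentioned in this thread.
    static final long DEFAULT_MINUTES = 5;

    // Return the configured autosave period in minutes, or the default when
    // the property is missing, not a number, or not positive.
    static long autosavePeriodMinutes() {
        Long configured = Long.getLong("refine.autosave");
        if (configured == null || configured <= 0) {
            return DEFAULT_MINUTES;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println("autosave every " + autosavePeriodMinutes() + " minutes");
    }
}
```

With something like this in place, starting the JVM with -Drefine.autosave=60 would stretch the period to an hour, and a launcher script could pass that property along from a refine.ini setting.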

Row operations (split/join multi-valued cells, delete rows): These operations are very expensive. Because the cross function only works on cells and therefore requires splitting multi-valued cells first, these operations often amount to more than 75% of total run time in my projects. Either extending the cross function to accept values as input (which would allow combining the split, cross and join in one expression) or improving the general execution time of row operations (parallelization?) would have a high impact on total run time.

That's one of the use cases we have to improve.

It's a two-phase approach to improve that use case, however. Several things would have to change in OpenRefine, and a new "batch mode" would let you choose between the way OpenRefine works today and an alternate mode with some expected latency.

The changes would be in our project manager, the storage layer, and the operations matrix methods, which would need to use the new storage and compute layers. We are leaning toward having an option for a storage layer built around Apache Spark / Hadoop and would trial this out with the community. I've also already reached out to the Apache Flink team to see if they want to have a go at helping as well, and a member of their team said he'd take a look (no guarantees).

I've captured just the high-level TODOs for some of it here: https://github.com/OpenRefine/OpenRefine/projects/1

Anyone can add more links to gists around that research, and other items, to the TODOs on that page.

Felix, I hope this gives you the high level ideas we have so far.

What we really need are hackers to hack on some of these ideas and see what's feasible.

Jacky and I are more than willing to help guide folks. Our codebase is not the easiest to navigate for the uninitiated.

-Thad

+ThadGuidry

Thad Guidry

Jun 20, 2017, 10:24:01 PM

to openr...@googlegroups.com

Oh, and Felix, for what it's worth...

We are grateful and do acknowledge how far you have pushed OpenRefine boundaries with your scripts, Python work, and Docker guide. https://github.com/felixlohmeier/openrefine-batch

We appreciate it all.

We just want OpenRefine to handle larger datasets natively, without workarounds such as yours, while still retaining the flexible UI that so many folks love.

-Thad

+ThadGuidry

Felix Lohmeier

Jun 22, 2017, 6:46:16 AM

to OpenRefine

Thank you, Thad!

I am looking forward to the new storage and compute layers; that sounds great! Unfortunately, I lack the skills needed to hack along.

But adding options for the autosave period and the initial Java heap space seems feasible to me. I created a pull request: "added options for initial java heap space and autosave period" #1202. It works for me, and I hope it works for everyone.
