Measuring scale/limits of OpenRefine


Owen Stephens

Feb 18, 2017, 5:39:53 AM
to OpenRefine
When I'm teaching OpenRefine, one of the questions that comes up regularly is "How much data can OpenRefine deal with?". My reply is usually along the lines of:

There are lots of variables, but as a rough guide: thousands of rows is fine, hundreds of thousands of rows is probably OK, and a million-plus rows is probably not going to work so well, or at all.

I thought it was about time I tried to get a bit more precise about this, so I've written a test script which creates CSVs of fake data, uploads the data into OpenRefine, carries out a few simple transformations, and exports the data as a TSV. The scripts and details are available at https://github.com/ostephens/openrefine-timer
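
For anyone who wants a feel for the shape of the test without cloning the repo, here's a minimal sketch in Python of the data-generation step. The file names, column contents and sizes here are made up for illustration - the real script in the repository differs, and it also drives OpenRefine itself, which this sketch doesn't:

import csv
import random
import string

def fake_value(length=12):
    # Random alphanumeric string standing in for a real cell value
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def write_fake_csv(path, rows, cols=4):
    # Write a CSV of `rows` rows and `cols` columns of fake data
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(cols)])
        for _ in range(rows):
            writer.writerow([fake_value() for _ in range(cols)])

# Generate test files from 100k to 1 million rows in 100k steps
for n in range(100_000, 1_000_001, 100_000):
    write_fake_csv(f"test_{n}.csv", n)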

I thought I'd post some of the initial findings here.

I ran the tests using a 4-column CSV with different amounts of memory assigned to OpenRefine - 1GB (the default), 2GB and 4GB. In each case the thing that failed first was carrying out the operations/transformations.

With 1GB memory, it started to struggle around 700k rows, and failed to do the operations (although the data loaded successfully) at 800k rows.

With 2GB memory, it started to struggle around 1.35 million rows, and failed to do the operations (although the data loaded successfully) at 1.55 million rows.


With 4GB memory, there is no sudden uptick in the time taken to do the operations - it just gradually increases until it failed to do the operations at 2.625 million rows (n.b. the script allows you to do repeated loads of the same file size and take an average - I've not done that here, which is probably why those spikes appear in the charts).

The script is very basic, and ideally I'd like to try some other types of test - like increasing the number of columns as well as rows, and loading the same data in different formats (e.g. xls vs csv) - feel free to take the script and adapt it if you are interested. For example, I've just done a test run with an 8-column file (1GB memory), and in that case it failed slightly earlier, at 600k rows. Also, at the request of Scott Carlson on Twitter (https://twitter.com/scottythered), I ran a test where I loaded and exported the data but didn't carry out any operations. I only did this with 1GB memory assigned and found that it would keep loading the data up to around 1.5 million rows (but good luck doing anything with the data once it is in OpenRefine!).

If anyone has done anything similar I'd be interested to know if you saw similar results. If you have a particular aspect of OpenRefine you'd like to see tested in terms of data volume, let me know or add an issue on the script's GitHub repository. Also feel free to take and develop the script if you want, or ask questions/make requests.


Best wishes


Owen 

Thad Guidry

Feb 18, 2017, 7:37:49 AM
to OpenRefine
Thanks Owen,

Most of us already know this.  We've done benchmarking internally ourselves, but it's a rather moot point since it's all in-memory storage.  I did some performance benchmarking a long time ago, but it was manual.  Here's a bit of history on my 1GB file size testing long, long ago... https://groups.google.com/forum/#!searchin/openrefine-dev/thad|sort:relevance/openrefine-dev/66qwZIJqe18/2HBTv0Y-JQgJ

OpenRefine streams data into memory (StringBuilder) and doesn't use compression or any other optimizations in its in-memory storage backend, other than a few array optimizations for faceted browsing, etc.  Per Stefano: "The most surprising architectural feature for many is the fact that OpenRefine has no database... or, more precisely, it runs off of its own in-memory data-store that is built up-front to be optimized for the operations required by faceted browsing and infinite undo."

So you're actually working with your datasets against the Java heap.  And it's doing object diffing and storing that in memory so that you can easily undo at any point in time.  Folks forget about that part the moment they apply a transform :)  Using in-memory storage is also what brings the rapid, fluid faceting and other real-time interaction that OpenRefine users love.  There are always tradeoffs.  Sure, we could read and write to disk and put our maximum memory requirements at only 200MB to operate within.  Others have proposed and wired up OrientDB as a storage backend, but that degrades the real-time interaction performance quite a bit, depending on what you're doing. I tried this myself long ago.

At this point in time, and with fairly modern processors these days, the quick win would be to implement a startup parameter that enables LZ4 compression on the data (it's near real-time), with OpenRefine reading and writing to the project storage backend as it compresses the data around 50% - basically doubling OpenRefine's current handling capacity.  But that would be about the most gain available without losing real-time performance.  And this is something that I am considering working on this year, or getting someone to volunteer and hack on!
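
To make the compression idea concrete, here's a rough sketch (in Python, purely for illustration - the real work would live in OpenRefine's Java code, e.g. via the lz4-java library) showing how cheaply LZ4 round-trips a chunk of tabular text; the sample row and sizes are made up:

import lz4.frame  # pip install lz4

# Simulate a chunk of tabular text like OpenRefine holds in memory
row = "123456,Some Repeated Value,2017-02-18,another field\n"
raw = (row * 100_000).encode("utf-8")

compressed = lz4.frame.compress(raw)
print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes "
      f"({len(compressed) / len(raw):.0%} of original)")

# Decompression is fast enough that LZ4 is often described as near real-time
assert lz4.frame.decompress(compressed) == raw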

When folks need to work with larger datasets, we encourage them to use larger tools designed for bigger data: Apache Spark, Pentaho, Talend, etc.

qi cui

Feb 18, 2017, 10:03:33 PM
to OpenRefine
I agree that in-memory storage is a double-edged sword: users love it, but it has its tradeoffs.

But I am not sure that applying compression will enable OR to process more data. My impression is that the in-memory data still needs to be uncompressed into a 2-dimensional array to apply the actual operations. So compression would only allow OR to load bigger data with a smaller memory footprint; it won't improve processing efficiency. BTW, I did some research on Apache Parquet, and that seems to be a good fit for OR's data modeling, but it still doesn't look like a drop-in replacement for the current store, because all the current operations would need to be rewritten.

Thad, you mentioned you tried OpenRefine wired up with OrientDB. What did you do, and what were your findings?

Owen Stephens

Feb 19, 2017, 9:02:53 AM
to openr...@googlegroups.com
Thanks both

I just want to be clear - my intention was simply to produce some figures to give some broad guidance on how much data OpenRefine can comfortably work with, given different amounts of memory allocated.

I did not intend this as a criticism of OpenRefine or a request to improve performance or change how it works - I think the tradeoffs made in OpenRefine are well chosen, and they're why I like it so much.

Owen

Thad Guidry

Feb 19, 2017, 3:33:05 PM
to openr...@googlegroups.com
Hi Jacky,

OpenRefine with OrientDB storing rows was about two times slower, just as Luca saw as well.
Here is the old message thread...

My idea about compression was to utilize it at the same time as off-heap storage - sorry if I was unclear about that (I was typing fast; the below should give you more technical details of my idea).  Some of the functions that OpenRefine currently does in heap would be allowed to be done off heap - basically allowing batching processes when you're utilizing facets, clustering, etc., to handle large aggregates of data.  Yes, this would slow things down for near-real-time feedback, but only if the data was very large; it would also speed things up for those with big-data cases, and only when the user has chosen to use off-heap Apache Spark processing rather than the existing in-heap storage engine from Stefano.

Apache Spark gives us the best of ALL worlds, with compression, SQL, dataframes, JDBC/ODBC, and it can already interface with Parquet.
Apache Spark can run standalone, or on Hadoop, Mesos, or in the cloud.

We just need to give users an option in OpenRefine to switch between the in-heap row storage and Apache Spark processing with a configured datasource (and in doing so it allows batching processes through Apache Spark).

Technical details:
The high-level design has not been thoroughly investigated, but work on it is ongoing.
The idea is just to expose OpenRefine's load/save using Apache Spark's dataframe http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
The default datasource for Apache Spark is already Parquet :) http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources
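
As a rough illustration of that load/save idea (a PySpark sketch just to keep it short - the column names and paths are made up, and this is not OpenRefine's actual code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, upper

spark = SparkSession.builder.appName("openrefine-spark-sketch").getOrCreate()

# Load: the dataframe plays the role of an OpenRefine project's rows
df = spark.read.option("header", "true").csv("input.csv")

# An OpenRefine-style transform: trim whitespace and uppercase a column
df = df.withColumn("name", upper(trim(col("name"))))

# Save: Parquet is Spark's default, compressed, columnar format
df.write.mode("overwrite").parquet("project.parquet")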

I'd love for someone to start hacking on using Spark and OpenRefine.

-Thad

qi cui

Feb 19, 2017, 5:37:39 PM
to OpenRefine
Actually, I have been thinking about Spark on OpenRefine for a while and have some high-level design in mind. If someone really thinks extending OpenRefine with a Spark option is necessary - and that it doesn't reinvent the wheel, considering there are other Big Data tools like Pentaho, Talend, etc. in the market - please let me know so we can work together.

The main parts would be:
  • Data loading, to spread the data set to HDFS
  • Data processing using Spark, which means swapping out the operations currently based on heap memory
  • Front-end changes to use partial data for the facets instead of the whole data set
  • A mechanism to use sample data to design the operations before submitting to the Spark cluster
  • Handling the redo/undo, since the diff will not be stored locally
  • Handling the fact that the data distribution (the number of worker nodes, the partitioning of the data set, etc.) is different each time. This creates a challenge for persisting operations which are only deterministic when the data set is loaded wholly in memory - for example, "change the cell (50, 2) value from abc to cba" (see the sketch below).
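
To make that last point concrete, here is a rough PySpark sketch (column names assumed) of re-expressing a positional edit against a stable key, since "row 50" is not well defined once the data is partitioned across workers:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("deterministic-edit").getOrCreate()
df = spark.read.option("header", "true").csv("input.csv")

# Positional edits like "cell (50, 2)" depend on row order, which Spark
# does not guarantee. Address the row by a stable key column instead:
df = df.withColumn(
    "value",
    when(col("record_id") == "50", "cba").otherwise(col("value")),
)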

Thad Guidry

Feb 19, 2017, 9:52:29 PM
to openr...@googlegroups.com
Jacky,

Some of the plumbing and pipelining itself can be done with Spring Cloud Data Flow.

-Thad

Patrick Maroney

Feb 20, 2017, 4:55:37 PM
to OpenRefine
I want to highlight the positive experiences I've had using "on demand" instantiations of very large AWS EC2 Ubuntu instances to run OpenRefine on data sets with millions of rows. The only issue I've had with this approach is that the ingest path is through the browser on my local host, which means the data sets stored on EBS, S3, or EFS volumes have to transit much slower network paths (vs. direct AWS 10Gb paths).

In one case I was able to complete numerous transformations (including custom Python scripts) in hours vs. the days it was taking me on my local 32GB Mac OS X system. This time saving paid for whatever AWS charges I incurred by a couple of orders of magnitude. Plus I didn't have to wait 30 minutes for each transform on a column to complete, which really slows down the iterative, interactive nature of developing OpenRefine transformation/enrichment processes. I have a snapshot of my Data Science Ubuntu image (with all of my tools and frameworks installed) that I use to launch instantiations scaled to the task at hand. You can use CloudFormation templates, Vagrant, etc. to deploy your on-demand Data Science systems if you don't want to pay for the snapshots.

Just keep your data sets on separate volumes and mount your EFS/EBS volumes, or restore a snapshot of the data sets. When done, delete the AWS instance (lesson learned: don't forget to delete the instance!!! - large-scale systems can cost upwards of $300/day).

Owen Stephens

Feb 21, 2017, 5:40:24 AM
to OpenRefine
Thanks for sharing this experience Patrick - really good to hear.

I wonder, if you wanted to speed up the ingest, whether you could use one of the OpenRefine client libraries to do the load directly from the command line on the AWS host? Just an idea.
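
For example, something along these lines against OpenRefine's HTTP API, run directly on the AWS host, should skip the browser-upload path entirely (a sketch from memory - check the endpoint and field names against your version's documentation; recent OpenRefine versions also require the CSRF token shown here):

import requests

REFINE = "http://127.0.0.1:3333"  # OpenRefine running on the AWS host

# Recent OpenRefine versions require a CSRF token on mutating calls
token = requests.get(f"{REFINE}/command/core/get-csrf-token").json()["token"]

# Create a project straight from a file on the host's own disk
with open("bigdata.csv", "rb") as f:
    resp = requests.post(
        f"{REFINE}/command/core/create-project-from-upload",
        params={"csrf_token": token},
        files={"project-file": f},
        data={"project-name": "bigdata"},
        allow_redirects=False,
    )
print(resp.status_code, resp.headers.get("Location"))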

Owen

Bertram Ludaescher

Oct 17, 2018, 10:37:44 PM
to OpenRefine
Hi Patrick and everyone else,

The experience that Patrick shares seems to indicate that OpenRefine running on an EC2 instance gets an "automatic speed-up" if you just give it more CPU resources (and RAM). Is that so?

I might be misunderstanding, but in that case, why not just keep requesting more CPUs and more RAM and forget about other parallelization approaches? 

Are OpenRefine workflows (the JSON recipes) automatically "embarrassingly parallel"? And does the OR architecture take care of this parallelism automatically? 

I'm getting mixed messages here.. ;-)

Can someone shed some more light on this?

Thanks, cheers,

Bertram

Antonin Delpeuch (lists)

Oct 18, 2018, 7:30:26 AM
to openr...@googlegroups.com
Hi Bertram,

At the moment, no parallelism is used at all.
Scaling OR to handle larger projects is in our plans.

Antonin

Patrick Maroney

Oct 27, 2018, 11:14:48 AM
to openr...@googlegroups.com
Bertram et al,

TLDR -- Give 'on demand' AWS EC2 OpenRefine instances a shot with your data/use cases.

Thank you for commenting on the post.  The purpose was to share a simple and highly effective method for increasing productivity with OpenRefine.  In my specific set of use cases, the improvement to my iterative process of "taming" a given data set with OpenRefine was orders of magnitude.

I think it's safe to assert that all analysts want to form hypotheses, act on them, and validate their assumptions at the speed of their curiosity and intuition.   So from that perspective, the decision to establish a model where I can deploy responsive "on demand" AWS EC2 OpenRefine instances scaled to the specific use case/data set has paid for itself many times over.

  • You need to preposition your data sets.
  • You need to get adept at firing up/tearing down/resizing AMI instances.
  • Start with your best estimate of how much memory you need to load your data into OpenRefine.
  • Fire the instances up as soon as you are ready to work.
  • You can dynamically switch AWS instance types to scale up/down.
  • Remember to power them down when you're done*.
*If you inadvertently leave a p3.16xlarge running over a 3-day weekend, you'll have a roughly $2,000 surprise on Tuesday morning (72 hours × $24.48/hr ≈ $1,763). For a month: $17,870.
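
If you want to script the fire-up/tear-down cycle, a minimal boto3 sketch looks like this (the AMI ID is a placeholder for your own snapshot, and the region and instance type are just examples):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Fire up an on-demand instance from your prebuilt data-science AMI
run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: your own AMI/snapshot
    InstanceType="r4.16xlarge",       # scale the type to the data set
    MinCount=1,
    MaxCount=1,
)
instance_id = run["Instances"][0]["InstanceId"]

# ...do your OpenRefine work, then ALWAYS tear it down:
ec2.terminate_instances(InstanceIds=[instance_id])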

If you look at the AWS cost/hour and compare it to your cost/value/hour, it should be easy to justify these expenses.  Note this suggestion focuses on the highly interactive phases: learning your data sets, writing/testing transformations, faceting, etc. Once you get your OpenRefine model, transforms, etc. defined, there are different methods for establishing effective batch processing of data sets.  If you can reduce the time spent waiting after you click a button in OpenRefine by just an order of magnitude, what is that worth to you?  Or if you want to iterate through development and debugging of your custom Python/Jython scripts?


Class                                   Instance Type   vCPU   ECU     Memory (GiB)   Instance Storage   Cost/Mo (avg)   Cost/Day   Cost/Hr
GPU Instances - Current Generation      p3.16xlarge     64     188     488            EBS Only           $17,870         $588       $24.48
Memory Optimized - Current Generation   x1.32xlarge     128    349     1952           2 x 1920 SSD       $9,737          $320       $13.34
Memory Optimized - Current Generation   r4.16xlarge     64     195     488            EBS Only           $3,107          $102       $4.26
General Purpose - Current Generation    m4.16xlarge     64     188     256            EBS Only           $2,336          $77        $3.20
General Purpose - Current Generation    m4.10xlarge     40     124.5   160            EBS Only           $1,460          $48        $2.00


Caveat Emptor:  Of course, in general, there are scenarios where "more" actually results in diminishing returns (e.g., Java garbage collection issues).


I would argue that all Analysts and Data Scientists should add on-demand "cloud provisioning" of data analytics tools/frameworks (e.g., OpenRefine, Jupyter Lab, Apache Zeppelin, Anaconda) to their core suite of tools and capabilities.

Patrick Maroney
Principal Engineer - Data Science & Analytics
Wapack Labs LLC


Jennifer Newcomer

Jan 31, 2022, 8:33:44 PM
to OpenRefine
Hi All,

I found this old post and hope my reply will be found and answered by one of you - perhaps Owen or Thad. I have been banging my head against the OR wall. I've used it for a couple of years and had run what I now understand was a pretty large project on a much slower laptop. Now I am running an update to that original project, I have a new machine (32GB RAM), and the file I am trying to run a text facet on is only 23MB (~248k records - intentionally reduced), yet the process keeps crashing. I have the latest version (3.5.2), upped the memory in the configuration file (all the way to 4096M), and made sure my Java is running 64-bit, and nothing seems to budge. I'm sure there is something critical I missed. Would someone be able to point me to my blind spot?

Thanks in advance!

Jennifer

Thad Guidry

Jan 31, 2022, 8:55:32 PM
to openr...@googlegroups.com
Hi Jennifer, (long lost! :-)

That 4096M is megabytes, so it's about 4GB in actuality - and probably not quite enough?
Try increasing your setting for max memory for Refine in the refine.ini file (or openrefine.l4j.ini) to
  REFINE_MEMORY=8G
in order to give OpenRefine 8 gigs of RAM to allocate for Java heap memory.

That should help for most of your projects, I'm sure, and if it does not, try giving it 16G or 24G if really necessary (shutting down other applications, of course, so that Windows can still function just fine within your 32GB RAM).




Owen Stephens

Feb 1, 2022, 5:32:11 AM
to OpenRefine
Hi Jennifer,

There are a couple of things I'd suggest checking to start with:

Firstly, double-check that OpenRefine is actually picking up the memory configuration. You don't say which operating system you are using, but on Windows or Linux you should be able to see, in the console/terminal window, a message when OpenRefine starts up that reads something like:
You have XXXM of free memory.
Your current configuration is set to use XXXXM of memory.


If the second line doesn't say "4096M of memory" then the configuration isn't being picked up correctly.
(If you are using macOS, use the instructions for Mac on this page https://docs.openrefine.org/manual/running to "run OpenRefine using Terminal" to see these messages.)

If it isn't picking up the config, let us know so we can work with you to understand why.

Secondly, when you are working with Facets there are two aspects to the performance - one is OpenRefine creating the facet, and the other is the display of that facet in the browser interface - perhaps surprisingly we've found that it is often the latter that causes performance issues. Essentially if you are dealing with a facet that has many values in it, you may find that the issue is with the display of the facet in the browser - and allocating more memory to OpenRefine won't affect this at all. There's a discussion of this in this Github issue https://github.com/OpenRefine/OpenRefine/issues/2032 - and I even identified a partial fix in https://github.com/OpenRefine/OpenRefine/issues/672 but unfortunately it remains an open issue.

If the issue you are seeing is caused by this second factor, then you could try:
  • Changing browser
  • Making sure the browser isn't already running memory-heavy operations (e.g. many tabs open)
  • Implementing the fix I describe in https://github.com/OpenRefine/OpenRefine/issues/672 on your local install (which means changing some code, but I'm happy to support you through trying that if you want to give it a go)

I'd suggest those as starting points. Of course you may be encountering some different type of problem here, or, as Thad suggests, may just need to give OR some additional memory. I tend to run at 4GB by default and increase as needed for larger files. I've never noticed any bad side effects from increasing the amount of memory OpenRefine has allocated, so I think you can be quite bold in allocating memory to it - especially if you close down other applications first.

Best wishes and good luck,

Owen

Jennifer Newcomer

Feb 1, 2022, 10:15:04 AM
to openr...@googlegroups.com
Good morning Thad and Owen,

First, thank you so much for your immediate responses. I don't have an IT background, so I admit this got me into territory I am not fluent in. I have some work to do to confirm/test your suggestions and will report back to the group.

Again, my sincere thanks for your insights.
Jennifer



--
Jennifer Newcomer
Research Director
PhD Candidate
University of Colorado, Denver
Urban & Regional Planning | Geography


Jennifer Newcomer

Feb 1, 2022, 6:34:08 PM
to openr...@googlegroups.com
Hi Owen,

To your point, I'm not seeing anything in the console/terminal (I'm running Windows 10) telling me:
You have XXXM of free memory.
Your current configuration is set to use XXXXM of memory.

This is what I see when I start OR:
[screenshot: OpenRefine startup console output]
Is this what you were referring to? 

Thanks,
Jennifer


Owen Stephens

Feb 2, 2022, 4:29:10 AM
to OpenRefine
Hi Jennifer

That's exactly what I meant by the console/terminal. Apologies - I'm not usually a Windows user, so I assumed it would display this message, but it looks like that doesn't happen.

Looking at the instructions for Windows on https://docs.openrefine.org/manual/running, there are two ways of starting OpenRefine on Windows, and according to the documentation, the file you need to edit to change the amount of memory depends on which one you use.

If you are running OpenRefine by clicking the OpenRefine.exe, the file you need is openrefine.l4j.ini
If you are running OpenRefine by running refine.bat, the file you need is refine.ini

Can you confirm which one you are using, and check you edited the appropriate configuration file for the increased memory?
Also, if you are using one way, possibly try switching to the other (I think that using refine.bat will give you a bit more information in the console about what memory is being used, but I'm not 100% sure on that).

Best wishes

Owen

Jennifer Newcomer

Feb 2, 2022, 9:58:05 AM
to openr...@googlegroups.com
Hi again,
I've tried launching OR by running refine.bat and have yet to be successful. This is what I receive:

[screenshot: command prompt error when running refine.bat]

Admittedly, I could have mistyped, as I'm not much of a command prompt user. I did edit the refine.ini file as such:

[screenshot: refine.ini memory configuration]

Any other thoughts?

Thanks!
Jennifer






Jennifer Newcomer

Feb 2, 2022, 9:58:07 AM
to openr...@googlegroups.com
Hi Owen,

I start OpenRefine by clicking the OpenRefine.exe, and have used openrefine.l4j.ini to change the memory. See below for how I have it configured.

[screenshot: openrefine.l4j.ini memory configuration]

I'll try your recommendation and start OpenRefine by running refine.bat and configure the refine.ini file.

Thanks again!
Jennifer


Owen Stephens

Feb 2, 2022, 10:49:57 AM
to OpenRefine
Hi Jennifer

You should be able to run refine.bat by just double-clicking it in the folder (i.e. in the same way you usually launch OpenRefine, but clicking refine.bat rather than openrefine.exe). You shouldn't need to type a command just to run it.

However, if you do want to run it from the console, then based on what is in your screenshot I think you could use something like:

cd c:\openrefine-win-with-java-3.5.2\openrefine-3.5.2\

then

refine.bat

Best wishes

Owen

Jennifer Newcomer

Feb 2, 2022, 9:22:44 PM
to openr...@googlegroups.com
Hi Owen,

Well, I was able to get OR running from the bat file, but still to no avail - my facet process keeps failing. And I bumped the memory up to 16GB. Perhaps my next step is to look into your point regarding the facet display in the browser. I've tried both Chrome and Microsoft Edge. Any other suggestions at this point? I'm also going to dig up my old laptop and see if I can get it to run.

Thanks again - I really, really appreciate your help!
Jennifer

Owen Stephens

Feb 3, 2022, 4:23:46 AM
to OpenRefine
Hi Jennifer

Do you see any additional information in the console when you run OpenRefine from the bat file?
Do you know, approximately, how many values would be in the facet - to get an idea of whether this should be causing you any issues?
Also, could you confirm how you "bumped up the memory to 16GB", and how you then started OpenRefine to try this? These two things (where you set the memory and how you started OpenRefine) have to align for the memory setting to be picked up.

Thanks

Owen

Thad Guidry

Feb 3, 2022, 8:54:35 AM
to openr...@googlegroups.com
Owen,

Also, when starting with refine.bat it will log a file called support.log, which they can open and copy-paste the lines from to us in a reply.

Thad


Jennifer Newcomer

Feb 3, 2022, 11:14:14 AM
to openr...@googlegroups.com
Hi Owen and Thad,

I'm adding clips of the console from when I started OpenRefine from the bat file: one with the base memory and one after I bumped the memory up to 8GB. I did the memory bump through the refine.ini file. Also, I'm running it in Microsoft Edge, and I've included screenshots of what the file looks like before I run the text facet (on ~176k records) on the address field, plus the error message when it ran out of memory. Is there anything I missed?

[screenshot: console output with default memory]
[screenshot: console output with memory raised to 8GB]
[screenshot: the project before running the text facet on the address field]
[screenshot: the out-of-memory error message]


Owen Stephens

Feb 3, 2022, 11:43:19 AM
to OpenRefine
Thanks Jennifer - this is really helpful.

So the screenshots of the console confirm that with this configuration, running with refine.bat, the additional memory is being allocated without an issue.
The screenshot of the error screen strongly suggests to me that the issue is in the browser, not with the "backend" OpenRefine process - which means increasing the memory allocated to OpenRefine won't solve your problem.

Perhaps we should take a step back and ask: what is it you are trying to do with a facet on the "own_address" column? What questions are you trying to answer, or what changes are you trying to make to the data?

Best wishes

Owen

Jennifer Newcomer

Feb 3, 2022, 12:18:35 PM
to openr...@googlegroups.com
Hi Owen,

Oh good - I'm glad I got you the information to help guide your suggestions further. In terms of the goal of my project, I need to run the cluster functions - both the key collision and nearest neighbour methods and their various functions - to identify and merge what are effectively the same address. That way I can run group queries in my db to identify proxy common-ownership portfolios. I've not found another open source tool that accomplishes this task the way OpenRefine does, particularly the merge function. I know R has a refine package, but it's not clear to me whether it's as user-friendly as OpenRefine's GUI when it comes to selecting values to merge.

Does that make sense?
Thanks!
Jennifer

Owen Stephens

Feb 3, 2022, 12:55:10 PM
to OpenRefine
Hi Jennifer. The first thing I'd suggest trying is running the Cluster without creating a facet (because I think the problem here is the browser running out of memory when it tries to display the facet values). 

You can do this by using the dropdown menu on the own_address column and selecting Edit cells -> Cluster and edit.
That should bring up the Cluster screen with the Key Collision / fingerprint method selected.

Can you see if that works?

Owen

Jennifer Newcomer

Feb 3, 2022, 1:33:19 PM
to openr...@googlegroups.com
Hi Owen,

I think that worked! And it's lightning fast. I clearly overlooked that option. I believe my issue is resolved.

My sincere thanks to you and Thad for following my posts and giving your time to help me unwind my issue. If our paths ever cross in the future, I at least owe you a HH beverage.

Cheers!
Jennifer

Owen Stephens

Feb 4, 2022, 5:27:56 AM
to OpenRefine
That's great news - glad it's working for you!

Best wishes

Owen

