wrong memory usage creating project - refine does not use all available RAM


expa...@gmail.com

Jun 27, 2020, 8:57:14 AM
to OpenRefine

Hi 
I am trying to create a project from a huge csv dataset. The file size is about 3 GB.

Before launching OpenRefine I edited refine.ini and changed these two lines:

# Memory and max form size allocations
#REFINE_MAX_FORM_CONTENT_SIZE=1048576
REFINE_MEMORY=32000M

# Set initial java heap space (default: 256M) for better performance with large datasets
REFINE_MIN_MEMORY=16000M

So I would expect OpenRefine to be allowed to use between 16 and 32 GB of memory.
But when I launch it and try to create the project, this is what I see:

-------------------
Reading datafile.csv:
[a progress bar here]
["Cancel" button] 160 minutes remaining. Memory usage: 100% (1073/1073 MB)
-------------------

So, I have several questions:

(1) It looks like refine is not able to use all the memory I assigned for the job.
Is there something else I should change in refine.ini or somewhere else?

(2) When refine is already running, is there any command to check how much memory is available to it?
I mean, other than looking at refine.ini.

(3) I am using an account on a shared machine. It is a Linux computing cloud where I can reserve the memory needed for my job.
The more memory I request, the longer I have to wait.
When enough resources are available, I get an email and can connect to a remote desktop where refine is running.

So, how much memory/CPU would be needed, depending on the size of my dataset?
Is there any documentation about how to do these calculations?

Thanks a lot in advance

expa...@gmail.com

Jun 27, 2020, 11:39:35 AM
to OpenRefine
I forgot to say that I launch OpenRefine by calling the refine shell script
(so in theory, the refine.ini config should be applied).

Thanks

Thad Guidry

Jun 27, 2020, 11:43:20 AM
to openr...@googlegroups.com
Does a "support.log" file get created when starting OpenRefine?  Can you check the OpenRefine folder for that file and attach it in a reply?



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/42237636-bc4e-458e-9785-69acc32ea0d6o%40googlegroups.com.

expa...@gmail.com

Jun 27, 2020, 12:29:17 PM
to OpenRefine
Thanks for your answer Thad

I am a bit lost because this is a shared server.
I am not sure where to look for the support.log file.

This is what my OpenRefine installation path looks like:
/home/my-user-path/downloads/openrefine

Can you tell me where to look for "support.log", relative to that path?




Thad Guidry

Jun 27, 2020, 12:52:22 PM
to openr...@googlegroups.com
Same folder where you would see refine.ini, refine.bat, refine, OpenRefine.exe, etc.
If you downloaded OpenRefine and extracted it to a folder, the "support.log" is generated into that folder whenever "refine" or "refine.bat" is executed.




expa...@gmail.com

Jun 27, 2020, 12:53:24 PM
to OpenRefine
Hold on ... I cancelled and relaunched the job, and now the memory numbers look much better:

Reading datafile.csv
125 minutes remaining Memory usage: 52% (17490/33554MB)

I guess my launching script was somehow cached and not reserving as much RAM as I had requested. Sorry.

But my other questions are still valid: 

OpenRefine is now using 16 GB of RAM (instead of 1 GB), but the remaining time has not decreased that much.
Why does it only use the minimum RAM value (REFINE_MIN_MEMORY=16GB), and not REFINE_MEMORY (32 GB)?

What can I do to speed up things? 
Actually, this is just a test. I plan to process much bigger csv files, with 20 million rows or so.
That would be about 20 GB in size.

What CPU/RAM would you recommend for OpenRefine to deal with this?

BTW, I am still interested in finding out where the support.log file is located on my system.

Thanks a lot in advance 

Thad Guidry

Jun 27, 2020, 12:59:43 PM
to openr...@googlegroups.com
Our instructions for managing memory for OpenRefine are detailed in our FAQ (pay attention to the first 3 sentences; in your case, it is important that 64-bit Java is installed):





Thad Guidry

Jun 27, 2020, 1:12:35 PM
to openr...@googlegroups.com
Be aware that OpenRefine might not be the best tool for the purpose that you are trying to perform.

OpenRefine is an interactive ETL tool (Extract, Transform, Load).
You might be better served with a Batch Processing ETL tool like:
Pentaho Data Integration
Talend

We do plan to support Batch Processing for large datasets in the future via Apache Spark.

If you are not looking to transform your large datasets, but instead to explore them, then you might be better served with other data analysis tools:
Apache Hadoop and its many related data-exploration tools,
etc.

We have a small list of related tools (not exhaustive) that might help you more.


Tom Morris

Jun 27, 2020, 3:10:55 PM
to openr...@googlegroups.com
On Sat, Jun 27, 2020 at 12:53 PM <expa...@gmail.com> wrote:
Reading datafile.csv
125 minutes remaining Memory usage: 52% (17490/33554MB)

Is the "time remaining" staying constant, going up, or going down? How many rows in your test file? 3M?
 
OpenRefine is now using 16 GB RAM (instead of 1GB), but the remaining time has not decreased that much.
Why does it only use the min RAM value (REFINE_MIN_MEMORY=16GB), and not the REFINE_MEMORY (32)

It will use up to 32 GB when it needs it (assuming you've got everything set correctly). The fact that it's not doing so yet means that it's not memory bound.
 
What can I do to speed up things? 

If you've got "guess data types" turned on, turn it off and do the data type conversion separately after the file is loaded. Other than that, there's probably not much you can do. OpenRefine is single threaded.
 
Actually this is just a test. I plan to process much bigger csv files, like 20 million rows or so.
That would be about 20 GB size.

3M rows will be a little sluggish, assuming you get it loaded. 20M rows is about an order of magnitude more than OpenRefine is designed for.

Tom

expa...@gmail.com

Jun 27, 2020, 6:19:12 PM
to OpenRefine

Thanks Thad & Tom.

Regarding my memory/speed questions, I looked at the FAQ link you sent me:

- My Java version is 64-bit.
- "Parse cell text into numbers, dates, ..." was always turned OFF
- "Time remaining" is not constant. It changes up and down a little, and in the long term it decreases as expected.
But I was expecting to be able to decrease this time by using a machine with a huge amount of RAM.

From your comments, it looks like I should forget about using OpenRefine, which is a shame.
I was especially interested in faceting and clustering geographical names in my very large datasets.

Although these datasets are very big (20-30 million rows), the really interesting stuff for me is just in a couple of columns.
I mean I could keep just the "row id" and "geographical name" columns, so I can facet and cluster them.
Then I would use the "row id" for joining and updating my whole dataset with the clean geographical names.
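Something like this is what I have in mind, as a rough pandas sketch (the column names "row_id" and "country" are just made up for illustration):

```python
import pandas as pd

# Tiny stand-in for the big dataset; in reality I would read the csv
# in chunks (pd.read_csv(..., usecols=..., chunksize=...)) so the
# full file never has to fit in memory at once.
full = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "country": ["es", "Espagne", "Spanien", "FR"],
    "other_data": ["a", "b", "c", "d"],
})

# 1) Keep just the two columns to load into OpenRefine.
slim = full[["row_id", "country"]]

# 2) Pretend this is the cleaned export from OpenRefine's
#    facet/cluster step (row 4 was already fine, so it is absent).
cleaned = pd.DataFrame({"row_id": [1, 2, 3], "country": ["ES", "ES", "ES"]})

# 3) Join the cleaned values back by row_id; rows without a
#    cleaned value keep their original name.
lookup = cleaned.set_index("row_id")["country"]
full["country"] = full["row_id"].map(lookup).fillna(full["country"])

print(full["country"].tolist())  # ['ES', 'ES', 'ES', 'FR']
```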

Do you think this would make a big difference?

Another option: what if I use a database instead of a csv file?
I guess this wouldn't make a difference, because OpenRefine would need to load everything into memory anyway.
But I prefer to ask first.

Thad Guidry

Jun 27, 2020, 6:38:02 PM
to openr...@googlegroups.com
If you are interested in data analysis (faceting/clustering) of large data... then I'd highly suggest using another data analysis tool.

OpenRefine is not a data analysis tool (though it can be used on small datasets for some exploration).

Large data imported into OpenRefine (no matter the importer: SQL, CSV, JSON) will still use memory.

I would highly suggest taking a look at KNIME first, which has an open source version.

Best of luck!




expa...@gmail.com

Jun 27, 2020, 7:15:51 PM
to OpenRefine
Thanks Thad

Not sure if I explained myself well. By "faceting / clustering" I did not mean "data analysis".
I meant "data cleaning", like in these videos:


I am just interested in OpenRefine for data cleaning of a couple of columns in a large dataset. Nothing else.
By this, I mean changing names in a "country" column, using the "facet text" option and then the "cluster" option in the facet tab.
That way I can easily homogenize to "ES" all these different values: es, ES, Es, España, sppain, Spain, Espagna, Espagne, Spanien, ...
(some variants are just names in different languages, some others are ISO codes, some others are typos, ... but all of them refer to the same country).
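As I understand it, the default "fingerprint" clustering method behind that "cluster" option boils down to something like this simplified sketch (not OpenRefine's actual code):

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Simplified key-collision fingerprint: trim, lowercase,
    strip accents and punctuation, then sort and de-duplicate
    the remaining tokens."""
    s = value.strip().lower()
    # Fold accents so "españa" and "espana" collide on the same key.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(sorted(set(s.split())))

values = ["Spain", " spain", "SPAIN.", "España", "Espana"]
clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

print(dict(clusters))
# {'spain': ['Spain', ' spain', 'SPAIN.'], 'espana': ['España', 'Espana']}
```

Note that variants in other languages (Espagne, Spanien) produce different keys, so those would still need manual merging or one of the other clustering methods.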

The same thing happens with some other columns (state, city).

After doing that data cleaning, I just plan to import it into a database and use it from there.


Isn't OpenRefine the best tool for this?


Thad Guidry

Jun 27, 2020, 7:34:27 PM
to openr...@googlegroups.com
No, OpenRefine is actually not the best tool for that with large data!
I would say Informatica's tools would be the best for that.  But you probably don't have $250,000 :-)

However, there are many other open source tools that can perform "aggregations of text values" on a single column, to which you can then apply a "transform" for various matching string patterns.
You can just click and group the patterns together in other tools to form "sets" or clusters, and then transform the entire set into whatever other value you want, like "ES" if you need.

Even Hadoop and Pig could do those same functions, although they don't have as nice a UI for it.
One that does have a better UI and expression-editing UX is Apache Spark.

With Apache Spark you can build those facets and see the clusters of patterns.  There are even other UI tools that work with Apache Spark, build those facets, and look and work very similarly to OpenRefine!
And if you know a bit of Python or SQL, you can easily apply your string-matching patterns to change all the values you wish into "ES".
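For example, in plain Python (the same logic could be wrapped in a Spark UDF and applied to the country column; the patterns below are just illustrative guesses, not a complete list):

```python
import re

# Hypothetical variant patterns for one country; extend as needed.
ES_PATTERNS = [r"es", r"espa\w+", r"spp?ain", r"spanien"]

def normalize_country(value: str) -> str:
    """Map any known Spain variant to the ISO code "ES"."""
    v = value.strip().lower()
    if any(re.fullmatch(p, v) for p in ES_PATTERNS):
        return "ES"
    return value

print([normalize_country(v) for v in ["Es", "España", "sppain", "Espagne", "France"]])
# ['ES', 'ES', 'ES', 'ES', 'France']
```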

Good luck in your explorations!

We hope to provide an Apache Spark backend in the future.  Stay tuned!




expa...@gmail.com

Jun 27, 2020, 11:23:23 PM
to OpenRefine


On Sunday, June 28, 2020 at 1:34:27 AM UTC+2, Thad Guidry wrote:
With Apache Spark you can build those facets and see the clusters of patterns.  There are even other UI tools that work with Apache Spark that build those Facets for ui and work and look very similar to OpenRefine!

Could you name or provide a link to any of these Spark facet building tools?
I would be so grateful if you know about something that resembles the OpenRefine facet/cluster interface.
 
We hope to provide an Apache Spark backend in the future.  Stay tuned!


Sure. Thanks a lot for your support!! 

Thad Guidry

Jun 27, 2020, 11:36:40 PM
to openr...@googlegroups.com
You can ask all about Spark on their own community areas.





expa...@gmail.com

Jun 28, 2020, 4:49:53 AM
to OpenRefine
Of course, but it looks like you already know about some specific tools which resemble this feature of OpenRefine.
If you could just give me their names, I could ask about them with less chance of being misunderstood.

Probably my fault, but you had also misunderstood me when I asked about this data cleaning feature of OpenRefine.
The word "cluster" in Spark google searches leads me to a different sense of the word (related to a cluster of machines, not to clusters of text facets).

Thanks a lot again

Thad Guidry

Jun 28, 2020, 11:08:55 AM
to openr...@googlegroups.com
As I said, I know of a few Hadoop/Spark projects that provided interactive UIs for building facets, but some have been shut down. So it's best to ask on the Spark user mailing list.

One free tool that I know of, used in collaborative environments, is Apache Zeppelin, with its Helium plugin system and online registry.
Because Zeppelin is essentially a notebook-type system, you can get collaborative help from the community and perhaps have someone help you code the text clustering facet (if one doesn't already exist), which helps if you don't have much time to learn.

A non-free tool would be something like https://cloud.google.com/dataprep, which does give you substantial free credits for exploration work such as what you are looking for. (The wrangling part is based on Trifacta, which has columnar text faceting features similar to OpenRefine's, letting you easily discover and replace strings based on patterns.)

Sorry I cannot be of more help, because I have been away from some of that ecosystem for the past 3 years, working primarily in IoT and Linked Data.
But as I said, the best way to start is to ask the folks that really know: the Spark user mailing list (which is searchable in the top right corner).

All the best,

