cluster - large data


Prakash R

Sep 21, 2020, 10:39:38 AM
to OpenRefine

Hi,

I downloaded OpenRefine 3.4 to correct some very large data (around 1,500,000 rows). I am processing it on a desktop with an i5 processor (10th gen), 32 GB RAM, and a 500 GB SSD. I increased the OpenRefine memory to 2 GB. The file has 12 columns, and when I run a text facet on a particular column that has 900,000 entries, I am not able to get the cluster. What happens is it goes back to the loading page, where there is no display of the text facet. I need your help to resolve this issue.

Prakash

Thad Guidry

Sep 21, 2020, 11:42:00 AM
to openr...@googlegroups.com
If, within those 12 columns, you have strings that average over 50 characters, then you will likely need to allocate 3x or 4x as much memory to OpenRefine.
Try using the following to start OpenRefine on Linux:

./refine -m 16G

or, on Windows, uncomment and set the line
REFINE_MEMORY=16G
in the refine.ini file, then start OpenRefine with refine.bat (you can make a shortcut for it later)
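
For reference, the edit in refine.ini would look something like this (a sketch; the exact default value and surrounding comments in your copy of the file may differ):

    # before (shipped commented out, with a low default):
    #REFINE_MEMORY=1400M
    # after (uncommented and raised):
    REFINE_MEMORY=16G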

Let me know if this helps or not,



Prakash R

Sep 22, 2020, 3:25:47 AM
to openr...@googlegroups.com
Thank you, Guidry.

The problem is still the same after setting REFINE_MEMORY=16G.

Does this program work well on Linux?

Prakash



--
R. Prakash
Roja Muthiah Research Library
Chennai
M| 9600041245

Isao Matsunami

Sep 22, 2020, 4:55:16 AM
to openr...@googlegroups.com
How many distinct strings do you estimate are in your dataset? The number of pairwise comparisons may explode.
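
(A rough back-of-envelope, assuming a naive all-pairs comparison; OpenRefine's key-collision methods avoid this, and its nearest-neighbour methods use blocking to reduce it, but the worst case is quadratic. The n below is the 900,000 entries from the original post:)

    import math

    n = 900_000  # distinct strings, per the original post
    print(f"{math.comb(n, 2):,} candidate pairs")  # 404,999,550,000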

On Tue, Sep 22, 2020 at 16:25 Prakash R <ira.p...@gmail.com> wrote:
--
*************************************************
Chunichi Shimbun, Digital Editing Department
Isao Matsunami

1-2-18-908 Shinmichi, Nishi-ku, Nagoya 451-0043
mobile: +81-90-3954-5786
mail: isa...@on.rim.or.jp  PGP:1DF1 4682
*************************************************

Thad Guidry

Sep 22, 2020, 8:37:21 AM
to openr...@googlegroups.com
Hi Prakash,

Are you using the OpenRefine 3.4 Windows kit with embedded Java?  https://openrefine.org/download.html

Can you attach in a reply your support.log file, which can be found in the OpenRefine folder where you unzipped and installed it?

Also, can you copy and paste any error text from the OpenRefine console window while you are performing that text facet, to help us debug further?

Yes, OpenRefine works great on all 3 major OSes: Windows, Linux, and macOS.


Prakash R

Sep 23, 2020, 1:16:40 AM
to openr...@googlegroups.com
Dear Thad,

I didn't find the support.log file in the folder.

Prakash


Prakash R

Sep 23, 2020, 6:27:02 AM
to openr...@googlegroups.com
Dear Thad,

I have split the file and run 500,000 (5 lakh) rows. It gives me clusters for 13,987 of 135,681 names. When I correct the clusters and click "Merge Selected & Re-Cluster", I get the message "A web page is slowing down your browser. What would you like to do?" I am using Firefox. Please help me.

Prakash

Thad Guidry

Sep 23, 2020, 8:29:26 AM
to openr...@googlegroups.com
Hi Prakash,

The clustering dialog display uses JavaScript, and if there are many rows of clusters, the browser may flag the display and update work as very compute-heavy.
There is not much you can do about that at this time other than work with smaller cluster sets and, preferably, use facets to filter down the set of rows PRIOR to opening the clustering dialog.
That way the clustering dialog will only have a limited set of clustered rows (10,000 or fewer, for example, or much less if you can). Yes, this means you might have to work through some clustered sets 2 or 3 times, but at least things won't seem stuck or throw browser warnings about slowness.

We do hope to improve all of that in future versions of OpenRefine, but this likely won't appear until next year, perhaps the year after.
Until then, work through smaller batches via facets, filtering, etc.
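
(For example, one hypothetical way to partition the rows before clustering, assuming the column holds names: add Facet > Custom text facet... with the GREL expression

    value.slice(0, 1).toUppercase()

which buckets rows by their first letter, then run the clustering dialog on one bucket at a time.)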



Prakash R

Oct 6, 2020, 10:46:20 AM
to openr...@googlegroups.com
Dear Thad,

I need your advice.

Since my system (i5, 10th Gen., 32 GB RAM) was not able to perform the cluster operation, I am looking at AWS. Would you please suggest a configuration for processing 1,500,000 rows and 2 columns of data on AWS? Which OS is best on AWS? I am planning to rent the system for a week and finish the work.

Regards
Prakash

Thad Guidry

Oct 6, 2020, 11:00:28 AM
to openr...@googlegroups.com
Hi Prakash,

I would use a dedicated ETL tool that supports fuzzy string matching (string similarity).

Pentaho has this.
KNIME has this (many examples on their Hub let you drag a workflow directly into the KNIME dashboard running on your computer, e.g. https://hub.knime.com/knime/spaces/Examples/latest/08_Other_Analytics_Types/01_Text_Processing/09_Fuzzy_String_Matching ).
And many other ETL tools do as well.

I think KNIME would help you much more for your task. (It will consume much less memory than OpenRefine 3.x; our 4.x version will be similar in architecture to KNIME.) And learning KNIME would be worthwhile, since it's already used in government, science, and pharmaceuticals.



Tom Morris

Oct 6, 2020, 11:47:10 AM
to openr...@googlegroups.com
This is likely the same as the problem reported in 2013 in https://github.com/OpenRefine/OpenRefine/issues/695

On Wed, Sep 23, 2020 at 8:29 AM Thad Guidry <thadg...@gmail.com> wrote:
The clustering dialog display uses JavaScript, and if there are many rows of clusters, the browser may flag the display and update work as very compute-heavy.
There is not much you can do about that at this time other than work with smaller cluster sets and, preferably, use facets to filter down the set of rows PRIOR to opening the clustering dialog.
 ...
We do hope to improve all of that in future versions of OpenRefine, but this likely won't appear until next year, perhaps the year after.
Until then, work through smaller batches via facets, filtering, etc.

Actually, I fixed it back in July: https://github.com/OpenRefine/OpenRefine/pull/2996
It will be included in the next release of OpenRefine. If you would like to experiment with it, you could try any snapshot release from Aug/Sept.

Note that no amount of server resources will help, because this is purely a front-end scalability issue. For the example in the issue, with 400K rows and 41K clusters, server time to compute the clusters was 3-7 seconds, while browser time was 200 seconds. I improved that to 80 seconds but, more importantly, introduced a cap on the number of rows rendered, so that they can be displayed in under 10 seconds. By iteratively merging clusters and using the cluster characteristic facets, you should be able to work your way through all the clusters for review.

Tom


Thad Guidry

Oct 6, 2020, 2:09:46 PM
to openr...@googlegroups.com
Tom,
1.5 million rows... is work, lots and lots of work, even with the new fix (which we do appreciate!).
I still prefer other tools when the scale of the problem is that large. But that's me.
One thing I like in other tools, which we cannot do in OpenRefine yet, is parallelizing clustering runs across CPU cores or entire machines (e.g., Levenshtein running on 4 cores against a column while Jaro-Winkler runs on 4 cores against another column, or the same one).

Prakash,
What Tom meant about "no amount of server resources" and the front-end scalability issue is directed at OpenRefine; he was not talking about other tools.
I do hope that eventually OpenRefine will have robust clustering options and parallelization for larger data, and I think we'll get there.
You might also want to understand the difference between edit-distance algorithms and token-based algorithms. For English strings, I typically use the Jaro-Winkler algorithm (an edit-distance type), which helps with typos, and then the Sørensen-Dice algorithm (a token-based type) to find broad domain-concept similarities, since it overestimates. Here's a halfway decent primer on what I mean: https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227
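
(To make the distinction concrete, a minimal Python sketch. These are hand-rolled for clarity, standing in for the Jaro-Winkler and Sørensen-Dice families; they are not what OpenRefine or KNIME actually runs:)

    def levenshtein(a, b):
        # edit-distance type: counts single-character insertions,
        # deletions, and substitutions (classic dynamic programming)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def dice(a, b):
        # token-based type: Sorensen-Dice overlap of character-bigram sets
        A = {a[i:i + 2] for i in range(len(a) - 1)}
        B = {b[i:i + 2] for i in range(len(b) - 1)}
        return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

    print(levenshtein("Chennai", "Chenai"))  # 1 -- one deletion catches the typo
    print(round(dice("Roja Muthiah Library", "Muthiah Roja Library"), 2))  # 0.95 -- word order barely matters

Jaro-Winkler additionally rewards shared prefixes, which is why it does well on typos near the end of a name.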




Prakash R

Oct 7, 2020, 12:18:40 AM
to openr...@googlegroups.com
Dear Thad and Tom,

Thank you so much for your response. It is very useful. 

Would it be possible to share any other tools for cleaning up this type of large data?

Prakash

Fran Parras

Oct 7, 2020, 4:21:15 AM
to openr...@googlegroups.com
Thad, my understanding is that you were exploring other runners like Spark, and in the past I saw OpenRefine compatibility with Spark as well. Just to double-check: can Spark address the scalability problem? What do you think?

Cheers,
Fran

Antonin Delpeuch (lists)

Oct 7, 2020, 4:35:01 AM
to openr...@googlegroups.com

Hi Fran,

Yes, we are working on making the backend handle large datasets better. I haven't looked at your issue closely, but since Tom mentioned that it was purely a frontend-side performance issue, it is unrelated to these improvements.

Best,

Antonin

Fran Parras

Oct 7, 2020, 4:36:20 AM
to openr...@googlegroups.com

Yves P.

Oct 7, 2020, 4:41:53 AM
to openr...@googlegroups.com
On Oct 6, 2020, at 20:09, Thad Guidry <thadg...@gmail.com> wrote:

You might also want to understand the difference between edit-distance algorithms and token-based algorithms.

For English strings, I typically use the Jaro-Winkler algorithm (an edit-distance type), which helps with typos, and then the Sørensen-Dice algorithm (a token-based type) to find broad domain-concept similarities, since it overestimates. Here's a halfway decent primer on what I mean: https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227

Thanks for this interesting article :)

Jaro-Winkler and Sørensen-Dice are not yet supported in OpenRefine, are they?

__
Yves

Tom Morris

Oct 8, 2020, 8:47:16 PM
to openr...@googlegroups.com
Correct. The string similarity library that we use provides Jaro-Winkler, but we haven't integrated it.

As always, if either of these is a feature you're interested in, please create an enhancement request with some supporting words on why it would be useful: https://github.com/OpenRefine/OpenRefine/issues/new/choose

Tom

Tom Morris

Oct 8, 2020, 8:56:54 PM
to openr...@googlegroups.com
>> On 07/10/2020 10:21, Fran Parras wrote:
>>
>> Thad, my understanding is that you were exploring other runners like Spark, and in the past I saw OpenRefine compatibility with Spark as well. Just to double-check: can Spark address the scalability problem? What do you think?
>
> On Wed, Oct 7, 2020 at 4:35 AM Antonin Delpeuch (lists) <li...@antonin.delpeuch.eu> wrote:
>
> Yes, we are working on making the backend handle large datasets better. I haven't looked at your issue closely, but since Tom mentioned that it was purely a frontend-side performance issue, it is unrelated to these improvements.
>

Yes, it's not an uncommonly held view that all scalability issues can
be solved with more memory or processors or both, but there are a
number of front-end browser scalability issues as well:
- text facets with a large number of choices (one of the first we fixed,
which is why there's now a cap on the number of choices)
- records with large numbers of rows
- very wide rows
- clustering with large numbers of clusters and/or choices (this is
the one that I hopefully fixed in July, which is why I'm curious
whether the fix works in this case)

While 1.5M rows is above the OpenRefine design center, my 400K-row
messy clustering test only took ~5 sec to cluster. ~20 sec for 1.5M
would be tedious, but perhaps not as tedious as switching to an
entirely different tool.

Tom