Open Refine is crashing


alice brennan

Jan 9, 2014, 5:38:59 PM
to openr...@googlegroups.com
Hi -- I'm using refine to look through police records. 

At the moment I have a data set that is 99,000 rows by about 10 columns. I've increased the ram allocation on my mac but for some reason it's still operating really slowly and tends to crash a lot when I try and cluster certain columns. 

Could you give me some advice as to what it might be? I'm on a retina, so it's a new comp with capabilities and I'm using version 2.5. 

Hope to hear from you soon. 

Alice 
Screen Shot 2014-01-09 at 5.38.24 PM.png

Thad Guidry

Jan 9, 2014, 9:23:49 PM
to openr...@googlegroups.com
Alice,

It looks like you're not clustering...but instead trying to use the Text Facet ?  and it is showing over 41,000 choices.

What interesting things are you trying to find out about the Name column ?

Similar First names ?  Use a custom text facet that partitions to only seeing the 1st name with GREL:  value.partition(" ")[0]

Similar Last names ?  just change to see the part after the first space with GREL: value.partition(" ")[2]  (partition returns [before, separator, after], so index 1 is only the space itself)
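
As a sanity check on the indices, here is a small Python analogue (not GREL). Python's built-in str.partition also returns a three-part result of before, separator, and after, which matches how the GREL documentation describes partition. The helper name and the sample name are made up for illustration.

```python
def split_name(full_name):
    """Split on the first space, mirroring the three-part partition result."""
    before, separator, after = full_name.partition(" ")
    # 'separator' is the space itself; 'after' is everything following it.
    return before, after


first, rest = split_name("robert smith")
print(first)  # first name part
print(rest)   # everything after the first space
```

Note that a name with a middle name or initial keeps everything after the first space in the second part, so the "last name" facet may still need cleanup.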

Let us know more about what you're trying to do exactly with this data...

You also might want to download and use 2.6 beta... 2.5 has lots of bugs that we have fixed as well.




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Nick Hayes

Jan 28, 2014, 5:51:24 PM
to openr...@googlegroups.com
I also downloaded Waterfox, which is a 64-bit version of Firefox and will allow for more RAM allocation.  Just make sure you have more than 4 GB free.

Martin Magdinier

Jan 28, 2014, 7:38:19 PM
to openrefine

Nick, thanks for this tip. Did you experience a real improvement when using Waterfox? Can you share more about your experience?


Tom Morris

Jan 29, 2014, 10:34:21 AM
to openr...@googlegroups.com
As Thad implies, asking your browser to sort a list of 41,000 names might be asking too much of it.  That's much bigger than the default limit for the text facet, so you must have increased the limit at some point.

The sorting is currently done client-side to save a server round trip and provide the user with snappy performance, but this assumes that the list contains a reasonable number of entries.  For a list this size, it'd be better for us to ask the server to do the sort and send us all 41,000 entries again in sorted order (in part because JavaScript/your browser isn't very good at sorting).

For now I'd suggest using additional facets or subsetting your data set in some way (Thad had some good suggestions, as well), so that you don't have an enormous number of entries in your text facet.

This isn't a memory issue, so more RAM isn't going to help.
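
To illustrate why subsetting helps, here is a rough Python sketch of the idea (not OpenRefine code). The column names, sample rows, and helper functions are all hypothetical; the point is only that narrowing the rows first shrinks the list of choices the name facet has to display and sort.

```python
from collections import Counter

# Hypothetical sample rows; the column names are assumptions, not from the thread.
rows = [
    {"name": "robert smith", "year": "2012"},
    {"name": "robert smith", "year": "2013"},
    {"name": "ann king", "year": "2013"},
    {"name": "joe king", "year": "2012"},
]


def facet_choices(rows, column):
    """Count each distinct value in a column, like a text facet's choice list."""
    return Counter(row[column] for row in rows)


def subset(rows, column, value):
    """Keep only rows where the column equals the value, like selecting one facet choice."""
    return [row for row in rows if row[column] == value]


all_names = facet_choices(rows, "name")
names_2013 = facet_choices(subset(rows, "year", "2013"), "name")

print(len(all_names), "choices on the full data")
print(len(names_2013), "choices after narrowing to one year")
```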

Tom

alice brennan

Jan 30, 2014, 4:32:33 PM
to openr...@googlegroups.com
here is a screen shot of it. 
Screen Shot 2014-01-30 at 4.27.20 PM.png

Thad Guidry

Jan 30, 2014, 4:50:06 PM
to openr...@googlegroups.com
Alice,

Describe for us the usefulness of your facet on name.  What are you trying to figure out: how many last names are "King" ?  How many "King" names there are ?  What are you trying to figure out about the values inside that name column ?

If you need to just sort on a column like name, and make the sort permanent as well, use the Sort command on the dropdown arrow above any column.




alice brennan

Jan 30, 2014, 5:03:53 PM
to openr...@googlegroups.com
Hey -- so these are arrest records -- so the name column is super important. I don't just want all the roberts faceted, I want all the 'robert smith' occurrences. Basically I want to know how many times particular individuals have been arrested and be able to facet that easily so then I can dig around in the other columns. 

Does this make sense? 

Yeah there are over 41,000 choices but this data set contains 99,000 rows, meaning there are certain people who have been arrested MANY times (185 times in some instances) ... 
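
For the counting part on its own, a minimal Python sketch of "how many times does each exact name appear" might look like the following. It is an illustration outside of Refine, not a Refine feature; the function names and sample names are made up, and it only normalizes case and spacing, so differently spelled names stay separate.

```python
from collections import Counter


def arrest_counts(names):
    """Count occurrences of each name after normalizing case and whitespace."""
    # No fuzzy matching here: "rob smith" and "robert smith" remain distinct.
    return Counter(" ".join(name.split()).lower() for name in names)


def most_arrested(names, top=10):
    """Return the `top` most frequent names as (name, count) pairs."""
    return arrest_counts(names).most_common(top)


sample = ["Robert Smith", "robert  smith", "Ann King", "Robert Smith"]
top = most_arrested(sample, top=2)
print(top)
```

Sorting by count this way happens once over the whole list, which is the step that was slow in the browser.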

Thank you! 

Thad Guidry

Jan 30, 2014, 9:32:58 PM
to openr...@googlegroups.com
So....

I would create a new column based on the value in your name column (basically a copy, so that you do not disturb your original column)...then, use the clustering menu option on that copied name column to cluster the similar names using the various algorithms available in that dialog.

Does that make sense to you Alice ?  Have you seen the various video tutorials around the net on how to use OpenRefine's clustering dialog ?

alice brennan

Jan 31, 2014, 12:05:06 AM
to openr...@googlegroups.com
Hey -- Yeah I've used refine a bit before. Thank you for your advice. I want to be able to facet and then export a count, and/or delve deeper into each name etc. through the facet search. I've done it many times before on a previous version of refine -- it just seems this version isn't dealing with so many choices. I've used refine for big data sets in the past and it's not been this slow. I assume it's because there are so many choices, right? Is there any solution to this at all? Or is refine limited to 56,000? 
Cheers for your help!!!

A




Alice Brennan, 
Journalist/Producer/Researcher
Tel: 347 247 5550
follow me on twitter  http://twitter.com/alicitabrennan




alice brennan

Jan 31, 2014, 12:08:59 AM
to openr...@googlegroups.com
Ah right -- are you guys intending on increasing the limit at all? 
For these sorts of data sets (i.e. police data) it's very useful. I might just break it up by date and work with smaller data sets -- I was hoping to keep it all together though because I'm doing geographic as well as demographic and time sensitive analysis. 
Thanks for all your help thus far. 
A



Thad Guidry

Jan 31, 2014, 1:54:45 PM
to openr...@googlegroups.com
Alice,

Clustering is not the same notion as Faceting.  In OpenRefine they are two different things, as mentioned on our wiki docs.

Please take a look at the various tutorials on the web that cover the topic of "Clustering in OpenRefine"....here is one that I found for you that is very similar to your use case:

If that does not help, then you might want to engage with one of us for a bit of one-on-one paid coaching/consulting with you.

Let us know,

alice brennan

Jan 31, 2014, 2:07:51 PM
to openr...@googlegroups.com
Hey -- I know this, and that's why I want to facet, not cluster... but when I facet them and then try to arrange by count it crashes. This has been the problem all along. I don't want to cluster. I want to facet and then be able to facet on other columns based on the first facet I did. 
Does that make sense? 

