Using OpenRefine to select a "random" set of records to work on


Raj Janakarajan

Jun 4, 2014, 6:32:09 PM
to openr...@googlegroups.com
Hello,

My file is about 2 million records, so I would like to select a random set of, say, 10,000 records to discover data issues, and then apply the fixes to the complete dataset. I have been using the "Load at most ... rows of data" setting on the screen before creating a project. Does this feature select a random set of records, or the first/last set of records?

Raj 

Thad Guidry

Jun 4, 2014, 9:36:11 PM
to openr...@googlegroups.com
The first x number of rows; the remaining rows are discarded and not imported into the project.

Alternatively, you can import all the rows and then create smaller sub-projects by exporting: use a GREL expression such as row.index < 10000 (or whatever range you want) to flag the rows of interest, then export only those flagged rows. Experiment! And I hope you're enjoying OpenRefine!
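[Editor's note: a hedged sketch of the flag-and-export route described above. The 10,000 cutoff is just an illustration; the menu paths reflect OpenRefine's "All" column dropdown.] A custom text facet (Facet → Custom text facet on any column) with an expression like:

```
row.index < 10000
```

yields true/false buckets; selecting `true` restricts the grid to the first 10,000 rows, which can then be flagged (All → Edit rows → Flag rows) and exported as a smaller project.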

Let us know if we can help further, or if you can help us with donations toward the project to finish polishing the 2.6 beta.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Raj Janakarajan

Jun 4, 2014, 10:32:37 PM
to openr...@googlegroups.com
Thad, thank you.  Raj





--
Data Architect ❘ Zephyr Health 
589 Howard St. ❘ San Francisco, CA 94105
o: +1 415 529-7649 | s: raj.janakarajan
http://www.zephyrhealth.com

Tom Morris

Jun 4, 2014, 11:13:08 PM
to openr...@googlegroups.com
First of all, if your records aren't too big/wide and you've got a reasonable amount of memory, 2 million rows is easily doable in Refine, and then you can just subset using a facet on row.index modulo your sampling factor. I was working on a narrow (5 columns?) data set of 3.3 million rows in Refine the other day with no problem at all.
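[Editor's note: a sketch of the modulo-facet approach above; the factor 200 is just an illustration.] A custom text facet with an expression such as:

```
row.index % 200 == 0
```

marks roughly every 200th row `true`, so on 2 million rows, clicking the `true` bucket restricts the view to a systematic sample of about 10,000 rows.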

If you really need to sample on import, we don't have that currently.  Reservoir sampling (as well as just straight probability sampling) is something that's been on my mind to add, but, although it's straightforward, there are many things ahead of it on the priority list.  I've heard that one of the Yahoo researchers has done work in this space, but they haven't yet contributed it to the community.

Fortunately, the Unix (or Cygwin) command line can do this trivially with commands like those referenced in this StackOverflow answer. The nice thing about the command line is that you can also use the cut command to include only the columns of interest, cutting down the amount of data you need to deal with even more.
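[Editor's note: a minimal sketch of the command-line route above, assuming GNU coreutils (`shuf`, `cut`) are available. The file name `data.csv`, the 1,000-row toy file, and the sample size of 100 are all stand-ins for your own data.]

```shell
# Build a small synthetic CSV to demonstrate on (stand-in for a 2M-row file).
printf 'id,name,score\n' > data.csv
for i in $(seq 1 1000); do
  printf '%s,row%s,%s\n' "$i" "$i" $((i % 7)) >> data.csv
done

# Keep the header row, then append a random sample of 100 data rows.
head -n 1 data.csv > sample.csv
tail -n +2 data.csv | shuf -n 100 >> sample.csv

# Optionally keep only the columns of interest (here, columns 1 and 3)
# to cut the data down even further before importing into Refine.
cut -d, -f1,3 sample.csv > sample-narrow.csv
```

The `tail -n +2 | shuf` split keeps the header out of the shuffle so it stays as the first line of the sample.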

Tom



Karl Stutzman

Jun 10, 2014, 3:55:23 PM
to openr...@googlegroups.com
Although I haven't hit the limit of what I can do with OpenRefine on my computer after ramping up the memory allocation, I am contemplating working with some larger data sets.

Could you give more detail on what counts as "too big/wide" in terms of records, and what a "reasonable amount of memory" is? Are there practical limits to OpenRefine beyond the capacity of one's machine?

Before I try something that won't work, it would be nice to know the limits of OpenRefine.

Thank you,
Karl


--
Karl Stutzman
Anabaptist Mennonite Biblical Seminary Library