New OpenRefine Extension: GOKb Utilities


Owen Stephens

May 22, 2017, 9:00:36 AM
to OpenRefine
I'm pleased to announce a new extension for OpenRefine: the GOKb Utilities extension for OpenRefine adds a number of new functions to the OpenRefine data cleaning software.

The utilities in this extension were originally developed as part of the Global Open Knowledgebase (GOKb) project. The GOKb project developed software and a service to facilitate a community-managed set of freely available information about electronic resources (such as electronic journals and books). OpenRefine is a key part of the data management process within GOKb, and OpenRefine was tightly integrated into the GOKb application and database through the development of an OpenRefine extension. However, a few of the utilities developed did not rely on the integration with GOKb - and they have now been separated from the main extension and bundled together here. I've also added some additional utilities that were developed outside the GOKb project.

This extension does not rely on GOKb at all and can be used directly with OpenRefine like any OpenRefine extension.

This extension includes the following functions:
  • Prepend rows: Add new blank rows to an existing OpenRefine project. Accessed via the 'All' dropdown menu "All->Edit Rows->Prepend rows"
  • Trim all data: Remove leading/trailing whitespace from all cells in the project. Accessed via the 'All' dropdown menu "All->Trim all data"
  • extractHost: new GREL function that extracts a host name from a URL
  • inArray: new GREL function that checks for the existence of a value in an array
  • randomNumber: new GREL function that generates a random integer in a specified range
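
To give a rough idea of what the three GREL functions do, here is a sketch in Python rather than GREL. It is not the extension's code: the names extract_host, in_array and random_number are made up for illustration, and treating the range as inclusive at both ends is an assumption, so check the extension itself for the exact behaviour.

```python
import random
from urllib.parse import urlparse


def extract_host(url):
    """Return the host name part of a URL, or None if there isn't one."""
    return urlparse(url).hostname


def in_array(values, candidate):
    """Return True if candidate is one of the items in values."""
    return candidate in values


def random_number(low, high):
    """Return a random integer in the range low..high.

    Both ends are included here; that is an assumption, not a statement
    about how the extension's randomNumber treats its bounds.
    """
    return random.randint(low, high)


if __name__ == "__main__":
    print(extract_host("https://github.com/ostephens/refine-gokbutils/issues"))
    print(in_array(["print", "online"], "online"))
    print(random_number(1, 6))
```
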
To install the extension, download the zip file from https://github.com/ostephens/refine-gokbutils/archive/master.zip, unzip the files and drop the resulting folder into the /extensions folder in OpenRefine.

I've only tested this with 2.7 rc2, but it should work with any version. Any issues or requests for development can be posted at https://github.com/ostephens/refine-gokbutils/issues

I hope some of this is helpful to some of you.

Best wishes

Owen

The GOKb Utils extension was made possible by the following:

## Contributors (alphabetical order):

- [Ian Ibbotson](https://github.com/ianibbo)
- [Steve Osguthorpe](https://github.com/sosguthorpe)
- [Owen Stephens](https://github.com/ostephens)

## Acknowledgements
The GOKb project, without which this extension would not exist, was funded by the Andrew W. Mellon Foundation. GOKb was initially designed and implemented by the Kuali OLE Founding partners: 
- North Carolina State University (lead school)
- Indiana University
- University of Florida
- Lehigh University
- Duke University
- University of Chicago
- University of Maryland
- University of Michigan
- University of Pennsylvania
- Jisc of the United Kingdom


Ettore Rizza

May 22, 2017, 10:21:59 AM
to OpenRefine
Great! Thank you very much, Owen (and the other participants in the project). 

Did that take you a long time? There are a hundred features that I would like to see added to OpenRefine *, and I understand that we will need to get our hands dirty to see them. How can we participate without spending money on a bounty**? That's the question.

* As I am in the field of information sciences, the first would be something that would add metadata to OpenRefine projects.

** As I am in the field of information sciences, I've no money...

Thad Guidry

May 22, 2017, 10:55:12 AM
to OpenRefine
Hmm... it might make sense for us as a community to set up a working branch that compiles with the Kotlin programming language, and to start doing some refactoring work on that branch, since Kotlin is gaining tremendous traction and is a drop-in language for the JVM. Google has even approved Kotlin for Android development: https://backchannel.com/the-language-that-stole-android-developers-hearts-807fdbf07c2a

In moving to Kotlin (even a few bits or classes at a time, and IntelliJ makes this crazy easy with CTRL-ALT-SHIFT K), we might attract more developers, and it would certainly make Owen's life easier by letting him hack directly on OpenRefine and produce fewer extensions.

Going to ponder a few things and look around this week.

Jacky, do you have any thoughts?
-Thad


Owen Stephens

May 22, 2017, 11:36:03 AM
to OpenRefine
Hi Ettore,

> Did that take you a long time?
So the overall development happened over such a long period of time it's hard to answer. The original GOKb extension was a large undertaking and included significant integration with a back end application - and it was that integration and related functionality that took a lot of the development time.

However, these small bits of functionality probably didn't take long - but I don't know exactly how long. I can tell you from my recent experience adding simple GREL functions[1] (and two of the GREL functions in here were written by me - inArray and randomNumber) that writing a simple new GREL function took only an hour or so - given the rest of the extension framework was already in place.

I think basically the answer is 'how long is a piece of string' as an extension can contain functionality as simple or complex as you want to write - so there is no straightforward answer. Some things are simple, some are complicated :)

For example I'm not sure I'd even know where to start to add metadata to OpenRefine projects :) Although projects already have some metadata I think (there is an existing ProjectMetadata class) - but extending the metadata stored and creating UI to interact with the metadata is all new stuff which would need thinking about. This all costs - either time or money  - and of course requires expertise.

My expertise and time are limited - so I can do simple basic things, and through following examples I'm trying to understand how to do slightly more difficult things (for example, at some point I want to pull apart the "Trim all data" function in this extension to understand how I might develop additional functions that act across all cells in the project in one go) - but ultimately I rely on others to develop more complex features.

As an aside - I would encourage anyone that can to contribute to OR development via BountySource https://www.bountysource.com/teams/openrefine/issues?tracker_ids=32795 - although I know this isn't possible for everyone, if you find OR useful enough to give a few dollars a month it will help the development of the product.

Not sure how much this answer helps! I guess there are no easy answers here

Best wishes

Owen

qi cui

May 23, 2017, 9:37:52 PM
to OpenRefine
Thad, I have very little knowledge of Kotlin so I cannot comment too much on it. It is JVM based and could be a good candidate.

Scala is also good: it is functional in nature and is used a lot in the computation/big data/data transformation fields, so it might be worth considering.

Ettore Rizza

May 24, 2017, 12:09:53 AM
to OpenRefine
Thank you very much for these explanations and for this extension, Owen!

The new "trim all data" feature will soon become indispensable. But I ask naively: would it be very hard to generalize this functionality and make it an "apply to all" function, in which it would suffice to indicate an operation (for example "toNumber()") and apply it to all columns? This would really be a big step forward for OpenRefine.

Owen Stephens

May 24, 2017, 4:45:59 AM
to OpenRefine
Hi Ettore,

Glad you will find it useful (it was that particular function that made me want to get this extension out there)

In terms of extending it in the way you suggest - I don't know the answer without more work, but my feeling is that adding a similar function which applied a particular operation (like 'toNumber') across all cells would be easy by adapting the current function. However, writing more generalisable functionality that allows you to write GREL expressions to apply across all cells would be more difficult.
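
As a rough illustration of the "one fixed operation across all cells" idea, here is a minimal sketch in Python. It is not how the extension or OpenRefine implements anything: the project is modelled as a plain list of rows, and to_number and apply_to_all_cells are hypothetical names used only for this example.

```python
def to_number(value):
    """Turn a string into an int or float when possible.

    Anything that is not a string, or does not parse as a number,
    is returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return value


def apply_to_all_cells(rows, operation):
    """Return a new table with operation applied to every cell of every row."""
    return [[operation(cell) for cell in row] for row in rows]


if __name__ == "__main__":
    table = [["1", " 2.5 ", "abc"], [None, "x", "10"]]
    print(apply_to_all_cells(table, to_number))
```

The fixed-operation case only needs a function of one cell value. The harder, more general case would swap that function for a user-written expression, which is where the preview question comes in.
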

One of the problems with functions that affect all cells is the challenge of having a UI which still allows users to see in real-time what their change will do - because the current GREL expression/Preview window is designed to preview one column at a time - and it is hard to know how this would extend to give a real-time preview across all the columns.

With the 'trim all whitespace' the Preview isn't so important as the changes aren't very perceivable to the human eye - so the Preview is of limited use.

I'm not sure the Preview is essential (there is always the option to undo) but where you start to move away from this I think it needs some careful thought as it breaks the usual OR approach.

If you have a list of potential 'apply to all cells' candidate functions (like toNumber) please create an issue at https://github.com/ostephens/refine-gokbutils/issues and I'll have more of a look when I get a chance (I want an excuse to get a better understanding of how this code works anyway as it was written by someone other than me!)

Owen

qi cui

May 26, 2017, 8:00:00 PM
to OpenRefine
I once took a look at the "Transform" menu. It is feasible to add a similar "Transform All" menu to allow users to select the columns and apply GREL and other operations.

Is this what you are looking for? 

Ettore Rizza

May 27, 2017, 3:17:37 AM
to OpenRefine
@qi cui  If it is easily achievable, it would be of course a great improvement. When performing a data cleaning on a file containing dozens of columns, there is often a time when you must leave Open Refine and use a scripting language. 

A real-life example: 50 columns of badly encoded text, where we must apply to every column value.replace('Ã©', 'é').replace('Ã¨', 'è').replace('Ã§', 'ç')... But there are hundreds of other examples.
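
For comparison, this is roughly what the scripting-language fallback looks like in Python. It is only a sketch under assumptions: the badly encoded text is taken to be UTF-8 that was read as Latin-1 (so 'é' shows up as 'Ã©'), each row is modelled as a dict of column name to value, and REPLACEMENTS, fix_encoding and fix_all_columns are hypothetical names.

```python
# Assumed mapping from mis-decoded sequences to the intended characters.
# Only three pairs are listed; a real clean-up would need many more.
REPLACEMENTS = {
    "Ã©": "é",
    "Ã¨": "è",
    "Ã§": "ç",
}


def fix_encoding(value):
    """Apply every replacement to one cell value; non-strings pass through."""
    if not isinstance(value, str):
        return value
    for bad, good in REPLACEMENTS.items():
        value = value.replace(bad, good)
    return value


def fix_all_columns(rows):
    """Return new rows with fix_encoding applied to every column of every row."""
    return [
        {column: fix_encoding(cell) for column, cell in row.items()}
        for row in rows
    ]
```
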

Owen's "trim all" function is already a big enhancement, but in an ideal world you should find this function directly in the file import screen, as well as a "clean column names" function (leaving unique column names composed only of lowercase letters, numbers and underscores).
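
A minimal sketch of that column-name clean-up in Python, assuming nothing about how OpenRefine would do it: clean_column_names is a hypothetical name, accented letters are simply dropped rather than transliterated, and the "column" fallback and numeric suffixes are arbitrary choices.

```python
import re


def clean_column_names(names):
    """Return unique names made only of lowercase letters, digits and underscores."""
    used = set()
    cleaned = []
    for name in names:
        # Lowercase, then collapse every run of other characters into "_".
        base = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
        if not base:
            base = "column"  # arbitrary fallback for names with nothing left
        # Add a numeric suffix until the name is not already taken.
        candidate = base
        suffix = 2
        while candidate in used:
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        cleaned.append(candidate)
    return cleaned
```
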



Thad Guidry

May 27, 2017, 9:42:47 AM
to OpenRefine
I would go even higher than that and have an "All Cells ->" menu above "Edit cells ->".
Then it is easier for folks to write even simple JavaScript or Python extensions for that menu itself, along with keeping a default Transform...

-Thad

qi cui

May 27, 2017, 7:45:54 PM
to OpenRefine

Will take some time next week to implement this.

peter.m...@hmri.org.au

Jun 4, 2017, 5:42:04 AM
to OpenRefine
Owen, will this extension be incorporated into the base OR code download, so that it will be included in future updates? Sorry for the process question; being new, I don't quite understand the workflows and design models.

Peter

Owen Stephens

Jun 5, 2017, 4:04:23 AM
to OpenRefine
Hi Peter,

The development of extensions is independent of the development of the base OR code, so there are many extensions out there that have not, and probably will not, be incorporated into the base code.

What will happen with this particular extension is impossible to say. It looks like there is some interest in taking some aspects of the functions we've developed here and incorporating them into the base code - but there is no particular expectation on my part that this will happen. I'm hoping we can do some further work with our extension, and if some of it gets into the core code, that's great - but if not, I hope we can continue to provide it via the extension.

Owen