OpenRefine and R Lang Support

Thad Guidry

Apr 22, 2018, 1:24:40 PM

to r-h...@r-project.org, OpenRefine Development, in...@bedatadriven.com, openrefine

Hello Statheads !

As part of planning Phase 2 of OpenRefine and Google News Initiative funding, OpenRefine is expecting to add support for useful technologies that News Reporting agencies and organizations use. Statistics and Data Cleaning often become a need and one of those technologies happens to be your beloved R lang. :-)

OpenRefine is asking the R Lang community for help and involvement in understanding and scoping their desires and needs, since they will be the ones directly benefiting from this joint effort, with funding for development provided by Google News Initiative.

So far we have only this issue / idea of integrating R lang : https://github.com/OpenRefine/OpenRefine/issues/1226

I would be happy to set up a collaborative doc and perhaps host a few online knowledge sharing sessions with any R Lang Developers / Expert Users who would be interested in collaborating with our OpenRefine team. Just let us know !

Phase 2 Enhancements for OpenRefine - PLANNING STAGE - R lang support has been added

Phase 1 Enhancements for OpenRefine - IN PROGRESS

Regards,

Thad Guidry

OpenRefine Foundation
NOT a Statistician :-)

Thad Guidry

Apr 23, 2018, 7:14:01 PM

to Alexander Bertram, r-h...@r-project.org, OpenRefine Development, openrefine, Maarten-Jan Kallen

Awesome, thanks Alex !

Yes, exactly: defining WHAT common use cases users would like to see. I'd love to see a list grow or issues opened. Keep in mind that OpenRefine is not a statistical tool; it exists to support Data Cleaning, Text Mining, etc. But if the community wants to see tighter integration with R, statistics and UI, then I would not be opposed. It's just that our OpenRefine team would not be the ones to support that directly, though we would certainly help support extensions. We have a big task to improve our UI, and I think a lot of cool things could be done with extensions once that task has finished (we're hoping this year).

-Thad

On Mon, Apr 23, 2018 at 2:15 AM Alexander Bertram <al...@bedatadriven.com> wrote:

Hi Thad,

Long time fan of Refine and now Open Refine, and happy to collaborate!

Renjin should make it pretty easy to integrate within OpenRefine from a technical perspective; I think the biggest challenge would be defining what R integration would look like.

I think enabling R as a language would be a great first step. This should be quite straightforward to do in a first pass, and I can help wire up our just-in-time compiler if performance becomes an issue.

I can imagine that your users would also benefit from more "guided" tools that could be powered by R and R packages. For example, there are a number of good text mining and natural language processing packages that you could embed to provide a "Sentiment Analysis Wizard" or something similar.

But for starters, I've subscribed to the ticket on GitHub and will see what I can do to help!

Best,
Alex

Tony Hirst

May 11, 2018, 9:23:56 AM

to OpenRefine

That sounds interesting - and makes me wonder: what would it mean if OpenRefine could act as a Jupyter client?

Off the top of my head:

- it would be able to launch / connect to a Jupyter kernel (eg R or python)

- this would allow code based transformations to be executed using those kernels

By the by, I also note other integrations between OpenRefine and bits of the Jupyter ecosystem, such as launching from a notebook server menu: https://github.com/betatim/openrefineder

I guess it would also be possible to display OpenRefine in a panel in Jupyterlab? But what if OpenRefine acted on a data structure that other components in the Jupyterlab context could see and access...?

--tony

Thad Guidry

May 11, 2018, 9:54:42 AM

to openr...@googlegroups.com

Hi Tony !

Our development team doesn't know much about R Lang.

It would be helpful to have folks put problem statements on a new Wiki page (https://github.com/openrefine/openrefine/wiki) to describe the WHYS rather than the HOWS. "we could put OpenRefine in a panel on Jupyterlab"... ok WHY ? How does it even help a user?

We really need to understand WHY integrating the 2 technologies makes sense and WHAT are the benefits for users.

Capturing Good Problem Statements would help us. And preferably those would be written by someone very experienced with R Lang on a daily basis.

-Thad

Tony Hirst

May 11, 2018, 10:34:04 AM

to OpenRefine

Why?

Suppose I am working on a crappy dataset, viewing a fragment of a csv file or dataframe in one panel, working with that data in another.

If OpenRefine provided a view onto the same dataframe, I could be cleaning it in OpenRefine as I work on the analysis of it in another panel.

As it currently stands, I have to import and export data from OpenRefine if I actually want to analyse it.

--tony

John Little

May 11, 2018, 11:17:56 AM

to openr...@googlegroups.com

Thad.

I'm intrigued by this general proposal but am wondering about the characteristics of a good problem statement. Are there existing problem statements on the wiki that can guide a response to your request?

As I read your initial outline I imagined the outcome would be similar to how Jython is an alternative to GREL. This is interesting to me because I do use R and OpenRefine, and do not use Python. Alex's reply about Renjin sounds like it fits the bill. Then again, I have no experience with Renjin so I can only say it sounds good. And at the same time, this kind of integration is generally beyond the scope of my thinking or experience.

Nonetheless, I can draft a variation of the existing wiki page covering Jython and substitute some R functions. Is that getting at your request? It seems that's more a How than a Why.

As to the Why, my sense is that Jython is often used to extend OpenRefine when the natural constraints of OR limit advanced data transformations. I can certainly imagine that similar activity can be done with R inside of an OpenRefine expression window, particularly with the Tidyverse packages which are more familiar to me than base-R. I can also imagine how an R programmer, like a Python programmer, can write and share code snippets to be pasted into the expression window by non-R (or non-Python) OpenRefine users.

I suppose getting anything up on the wiki is a start, but I don't generally think much about code platform development so I'm basically looking for examples that respond to your question.

--John

Thad Guidry

May 11, 2018, 10:12:25 PM

to openr...@googlegroups.com

Then John, you're a perfect candidate to teach me :-) , so that I can wrap my brain around why so many folks want to see this integration happen. I hear about it all the time, but I don't understand any of their pain. I'd love to see why so many R Lang folks use OpenRefine for cleaning, and what is lacking in R Lang or Tidyverse or Jupyter clients/notebooks that makes them even want to spin up OpenRefine.

Do you have time this weekend to do a Hangout and show me R Lang and your tools and workflows ?

I asked Tony Hirst the same thing on our issue just now https://github.com/OpenRefine/OpenRefine/issues/1226

-Thad

Ettore Rizza

May 11, 2018, 11:04:00 PM

to OpenRefine

> I can wrap my brain around why so many folks want to see this integration happen, I hear about it all the time, but I don't understand any of their pain.

I think John has perfectly summed up the expectations of the R and OpenRefine users (which is my case too): to be able to use R instead of Jython.

On the one hand, because not everyone knows Python. On the other hand, because R has more than 10 000 packages that allow you to do complicated things with little code. For example, Natural language processing. Exploratory.io (already mentioned in another thread) is an excellent example of what the integration of R in OpenRefine might look like.

Obviously, we will not be able to use ggplot2 to make graphics. But the integration of R in Refine would make it possible to write simpler R scripts, without "for" loops, "apply" calls and other boilerplate code.

Thad Guidry

May 11, 2018, 11:16:19 PM

to openr...@googlegroups.com

Ettore,

So I hear this need:

1. So you would like to see R as an expression language (with a much nicer interface than what we have now with our expression editor popup) ?

But I also hear about this need:

2. Would you also like to see OpenRefine's Project data storage as an actual dataframe that you could point R to and you could manipulate from other tools or R Studio or whatever ? or not so much ?

-Thad

Ettore Rizza

May 12, 2018, 5:03:42 AM

to OpenRefine

> 2. Would you also like to see OpenRefine's Project data storage as an actual dataframe that you could point R to and you could manipulate from other tools or R Studio or whatever ? or not so much ?

@Thad It would be very cool to switch more easily from OpenRefine to R or Python through an intermediate format like feather (based on Apache Arrow). In general, anything that can facilitate the integration of OpenRefine in a Data Science workflow deserves to be encouraged. I feel that there are not enough data scientists in the user base. It's a shame.

@Tony By chance, I was just playing with Felix Lohmeier's Open Refine CLI in Jupyter. Who knows, there may be a way to write one or two snippets that make it easier to switch from a notebook to Refine and vice versa.

Thad Guidry

May 12, 2018, 9:21:08 AM

to openr...@googlegroups.com

I'm still trying to wrap my brain around the many parts of the ecosystem of Jupyter and R lang itself. Jupyter seems to be very much an OpenRefine kind of web application, but built differently, for the different purpose of sharing and interactive visualization.

Jupyter Parts:
1. The Notebook Document Format
Jupyter Notebooks are an open document format based on JSON. They contain a complete record of the user's sessions and include code, narrative text, equations and rich output.

2. Interactive Computing Protocol
The Notebook communicates with computational Kernels using the Interactive Computing Protocol, an open network protocol based on JSON data over ZMQ and WebSockets.

3. Kernels

Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
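
As a rough illustration of part 2, here is a sketch of the JSON payload a client would build to ask a kernel to run some code. It only shows the shape of an execute_request message; the real protocol also signs the message and sends it as a multipart ZMQ frame, which is left out here. The helper name make_execute_request and the "openrefine" username are hypothetical, chosen only for this example.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id):
    """Build the JSON-serialisable parts of an execute_request message."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "username": "openrefine",  # hypothetical client name
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",
        },
        "parent_header": {},  # empty because this message starts an exchange
        "metadata": {},
        "content": {
            "code": code,  # source text in the kernel's language (R here)
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


msg = make_execute_request("toupper('refine')", session_id=uuid.uuid4().hex)
wire = json.dumps(msg)  # what would travel over ZMQ or a WebSocket
```

Any client that can produce and consume messages of this shape could, in principle, hand code to an R or Python kernel and read the reply.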

There are certainly similarities and overlap in 1 , 2, and 3 with OpenRefine.

But I don't know where the data is actually stored. If notebooks can be shared, then it seems the data is also stored not only on the JUPYTER_PATH but also in the Nb Format itself ?

A. Is there a need in OpenRefine to have an Import Jupyter Notebook ?

B. Or do we not even worry about that need and just allow a user to have a Jupyter Notebook open and OpenRefine open at the same time and seamlessly work with the same data at the same time ?

If B is more highly valued to users, then does anyone have any idea about how that might be technically feasible ? I am clueless about Jupyter and R for the most part, and don't want to waste hours reading just to frame up some architecture integration documentation. I'd rather just cut to the chase and let the community help me draft that architecture integration. I cannot do that alone.

SO.... Here's the start of the Wiki.... feel free to fill in TECHNICAL DETAILS if you can.

https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-integration-ideas-with-R-lang-and-Jupyter

-Thad

Ettore Rizza

May 12, 2018, 10:25:56 AM

to OpenRefine

> But I don't know where the data is actually stored. If notebooks can be shared, then it seems the data is also stored not only on the JUPYTER_PATH but also in the Nb Format itself ?

Exactly. The notebook stores what was displayed the last time you saved and checkpointed. If I open an old .ipynb file, it shows me directly the result of my last commands.

This data is, in this case, stored in the .ipynb file as an HTML table.

By default, the notebook does not store variables. But by using the magic keyword %store, we can save them in the Python-specific pickle format.
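
To make the storage point concrete, here is a minimal sketch that parses a hand-written notebook document and lists which outputs were saved alongside the code. The notebook content is invented for the example (a single cell showing a one-row table), and stored_outputs is a hypothetical helper, not part of any Jupyter library.

```python
import json

# A tiny, hand-written notebook in the version 4 document format.
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "source": ["df.head()"],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 1,
          "metadata": {},
          "data": {
            "text/plain": ["0  Paris"],
            "text/html": ["<table><tr><td>Paris</td></tr></table>"]
          }
        }
      ]
    }
  ]
}
"""


def stored_outputs(nb):
    """Return (source, saved MIME types) for every output of every code cell."""
    found = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for out in cell.get("outputs", []):
            data = out.get("data", {})
            found.append(("".join(cell["source"]), sorted(data)))
    return found


nb = json.loads(notebook_json)
results = stored_outputs(nb)
```

Here results pairs the cell's code with the representations saved for it, including the HTML table. Only the rendered output is in the file; the dataframe itself is not.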

Thad Guidry

May 12, 2018, 10:28:34 AM

to openr...@googlegroups.com

My head is still spinning. (I keep seeing a lot of inefficiencies with all that, which is probably why I cannot understand it yet)

Ettore, please contribute to the wiki and its details pages if you can.

-Thad

John Little

May 14, 2018, 10:11:51 AM

to openr...@googlegroups.com

I'd be happy to show you what I know over Hangout. Sadly, as you can see, this weekend was not possible. But I'm pretty flexible this week if you want to try another time. --John

John Little

May 14, 2018, 10:37:30 AM

to openr...@googlegroups.com

One thing I want to ask, based on the weekend thread... is integrating Jupyter the same as integrating R? I readily admit to not being a Jupyter user ("but I have a lot of friends who are"). I only want to suggest that there may be different developing use cases, which may share selective commonalities. Anyway, there's a lot to consider. I think Thad said his head was swimming. Yup, me too.

Thad Guidry

May 14, 2018, 11:02:25 AM

to openr...@googlegroups.com

I am seeing less value in integrating with Jupyter. I could be wrong, but the only thing I see is potentially the same use case: sharing the same data across a common platform that R and Jupyter also share.

https://github.com/OpenRefine/OpenRefine/wiki/DatasharingWithR

-Thad

Tony Hirst

May 18, 2018, 5:09:14 AM

to OpenRefine

For my own workflow, with scripts written using Python in Jupyter notebooks, I use OpenRefine to clean data because it is more direct, and there are operations that I don't have to think about writing the code for - I can just select them from menus.

To make the workflow reproducible, eg applying the cleaning steps over multiple similarly formatted files, I can then export the history file and automate against that.
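
As a sketch of what "automating against" an exported history could start from: the history is a JSON list of operations, so a script can at least read it and check what it is about to replay. The two operations below are invented for the example, and the field names are modelled on what an exported history typically looks like, so treat them as assumptions. The summarize helper is hypothetical.

```python
import json

# Hand-written stand-in for an exported operation history (assumed field names).
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column date using expression grel:value.trim()",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "date",
    "expression": "grel:value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-rename",
    "description": "Rename column date to published",
    "oldColumnName": "date",
    "newColumnName": "published"
  }
]
"""


def summarize(operations):
    """Return (operation id, column it touches) for each recorded step."""
    return [
        (op["op"], op.get("columnName") or op.get("oldColumnName"))
        for op in operations
    ]


operations = json.loads(history_json)
summary = summarize(operations)
```

A summary like this is a cheap sanity check before applying the same steps to the next similarly formatted file.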

I think there are two ways in which Jupyter client support might be useful:

1) if I can connect to a Jupyter kernel, then I can write code to transform the data in OpenRefine using any language that has a Jupyter kernel, as long as I can reliably pass data to the kernel and retrieve it (once transformed) into OpenRefine. I am just using Jupyter kernels to execute code for me in an arbitrary language and with a suitable environment described (the kernel can have all necessary packages installed, for example). The data flow is: data in OpenRefine, transformed via Jupyter kernel code, returned to OpenRefine.

2) Jupyter kernels support multiple connections. So I can *imagine* a workflow where I am exploring a dataset in Jupyter notebooks whilst at the same time cleaning it in OpenRefine, ie using OpenRefine functions to transform the data I am working on via Jupyter notebooks. Here the data flow is: data in Jupyter kernel; data transformed in OpenRefine; data returned to Jupyter kernel.

--tony

Thad Guidry

May 18, 2018, 6:15:16 AM

to openr...@googlegroups.com

Thanks Tony,

Yes I have already added Data Sharing as a desired feature.

https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-integration-ideas-with-R-lang-and-Jupyter

-Thad

Ettore Rizza

Jul 15, 2018, 10:26:54 AM

to OpenRefine

Just a little example that illustrates why I would love to see R built into OpenRefine. Here is the GREL formula I used to transform French date strings such as "1er janvier 2018", containing non-breaking spaces, into dates:

value.escape('html').replace('&nbsp;', ' ').unescape('html').replace('1er', '1').replace('janvier', 'january').replace('février', 'february').replace('mars', 'march').replace('avril', 'april').replace('mai', 'may').replace('juin', 'june').replace('juillet', 'july').replace('août', 'august').replace('septembre', 'september').replace('octobre', 'october').replace('novembre', 'november').replace('décembre', 'december').toDate()

And here's what I'd like to see (value = "1er février 2018"):

library(lubridate)

return dmy(value)

>> 2018-02-01

I know that I can create an issue on GitHub to improve this or that function. But it takes time, and the developers of OpenRefine can be counted on the fingers of one hand. With R as a scripting language, it's a bit like having Hadley Wickham or Garrett Grolemund join the team.
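
For comparison, here is roughly what the same clean-up looks like when written by hand in a general-purpose scripting language. This is a plain Python 3 sketch, not a tested Jython expression for OpenRefine. The function name parse_french_date and the month table are made up for the example.

```python
from datetime import date

# Lookup table replacing the long chain of replace() calls.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(value):
    """Parse strings like '1er février 2018' into a date object."""
    # Turn non-breaking spaces into ordinary ones before splitting.
    day_text, month_text, year_text = value.replace("\u00a0", " ").split()
    day = 1 if day_text == "1er" else int(day_text)
    month = FRENCH_MONTHS[month_text.lower()]
    return date(int(year_text), month, day)


parsed = parse_french_date("1er\u00a0février\u00a02018")
```

It is shorter than the GREL chain, but the month table still has to be written and maintained by hand, which is exactly the work a ready-made date-parsing package would take over.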
