R data frame integration with Clojure


Carsten Behring

Sep 27, 2017, 9:29:22 AM
to Numerical Clojure
Dear all,

I am slowly trying to move my data science work to Clojure; currently I work exclusively in R.

I would be far more confident in this if I knew that I could still use R whenever something is missing.
For pure data munging, Clojure is more or less on the same level as R.
But as there are far more R packages (for anything you can imagine), I would like to see
an easy just-in-case integration.

I am aware of the ggplot integration in Gorilla repl, and this is indeed a good idea.

But that's only for plotting, which is only a subset of R functionality.

I would like to have a generic way to work on data frames in R and Clojure with as little friction as possible.
Ideally I could start creating a data frame in Clojure and, if I miss certain functionality (= specific packages) in Clojure, go to R and back.

This could be done with a pipeline orchestrated by "make" or similar, i.e. out-of-process, communicating via files.
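For the file-based route, the Clojure side mostly has to write a data frame out in a format R can read. A minimal sketch, assuming CSV as the exchange format and simple values that need no quoting; `rows->csv` is a hypothetical helper name, not part of any library mentioned here:

```clojure
(require '[clojure.string :as str])

(defn rows->csv
  "Render a sequence of maps as CSV text, one column per key in `ks`.
  Hypothetical helper: no quoting or escaping, so it only suits simple values."
  [ks rows]
  (let [header (str/join "," (map name ks))
        line   (fn [row] (str/join "," (map #(get row %) ks)))]
    (str/join "\n" (cons header (map line rows)))))

;; Two rows, two columns; the result could be spit to a file for R to read.
(def csv-text (rows->csv [:a :b] [{:a 1 :b 2} {:a 10 :b 20}]))
```

The resulting text could be written with `spit` and picked up by an R script run from "make"; real data would need a proper CSV library for quoting.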

I would prefer a solution based on JRI.
I prototyped some functions (using JRI and partially "clj-jri"), and they led to the idea of having three bridge functions in Clojure:

(assign-dataframe-in-R  var-name dataset)

(execute-r-source "transform.R")

(get-dataset-from-R var-name)



Together they should:
- transparently convert a Clojure data structure into an R object (data.frame or data_frame),
- set the object in the current R session,
- execute a piece of R code in-process which transforms the arbitrary input data.frame into an arbitrary output data frame,
- convert the resulting R data frame back into a Clojure data structure and return it.
This is rather straightforward to do with JRI.
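The conversion steps can be sketched without any R binding. R data frames are column-oriented, so a sequence of maps has to be pivoted into named columns on the way in and back into rows on the way out. The function names below are hypothetical, and the step that hands the columns to JRI is deliberately left out:

```clojure
(defn rows->columns
  "Turn a sequence of maps into a map of column vectors.
  Keys missing from a row become nil."
  [rows]
  (let [ks (distinct (mapcat keys rows))]
    (into {} (for [k ks] [k (mapv #(get % k) rows)]))))

(defn columns->rows
  "Inverse of rows->columns: a map of equal-length column vectors
  becomes a vector of row maps."
  [columns]
  (let [ks (keys columns)]
    (if (empty? ks)
      []
      (apply mapv (fn [& vs] (zipmap ks vs)) (map columns ks)))))

(def df [{:a 1 :b 2} {:a 10 :b 20}])

;; Pivot to columns, then back to rows.
(def df-columns (rows->columns df))
(def df-again (columns->rows df-columns))
```

Each column vector would then map onto one R vector of the data frame; type handling (numeric vs. string columns, missing values) is the part this sketch does not cover.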

This approach has the huge advantage that transform.R is a perfectly normal R file, which can be edited and debugged
in the same Emacs instance as the Clojure code. The Clojure and R files can sit side by side.

I have decided not to rely on the dataset implementations of Incanter / core.matrix for different reasons,
but to use sequences-of-maps instead.
Inspired by this package: https://github.com/sbelak/huri

I am wondering what you think about this approach, and whether you are aware of a place where this code could live.

I was also looking at the "feather" project, which is based on Apache Arrow (https://github.com/wesm/feather).
It has Python and R implementations of a common binary representation of a data frame, but there is no Java/Clojure implementation yet.

Regards,

Carsten

Carsten Behring

Sep 27, 2017, 9:48:12 AM
to Numerical Clojure
Example code to use it would be:

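;; a data frame as a sequence of maps
(def df [{:a 1 :b 2} {:a 10 :b 20}])
;; bind it to the R variable df_in
(assign-df-to-R "df_in" df)

;; run the R script in the same R session
(R/eval-source "./test.R")

;; read the R variable df_out back as Clojure data
(df-from-R "df_out")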


With test.R :

library(tidyverse)
df_out <- df_in %>% 
    mutate(colNew=c(1,2)) 

Mike Anderson

Sep 27, 2017, 11:58:18 PM
to Numerical Clojure
core.matrix is designed perfectly to deal with use cases like this. I would suggest making a core.matrix implementation that wraps R dataframes (although if you don't want to do direct interop, you can just load the data from a file into a core.matrix dataset, which provides similar functionality).

Process would be something like:
1. Implement a wrapper type for an R dataframe (using defrecord or deftype)
2. Extend the mandatory core.matrix protocols to this datatype (so that all the compliance tests pass, at a minimum)
3. Extend the core.matrix dataset protocols to this datatype (to access things like column names etc.)
4. Optionally implement other core.matrix protocols (for performance and/or more advanced features)
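The shape of steps 1 and 3 can be sketched as follows. This does not use the real core.matrix protocols: `ColumnarDataset` is a local, hypothetical stand-in so that the example runs on its own, and the record holds plain Clojure vectors where a real wrapper would hold an R object.

```clojure
;; Hypothetical stand-in for the dataset-level protocols; the real ones
;; live in core.matrix and are not reproduced here.
(defprotocol ColumnarDataset
  (column-names [ds] "Column names in order.")
  (column [ds col-name] "Values of one column as a vector.")
  (row-count [ds] "Number of rows."))

;; Wrapper type (step 1). `col-names` is a vector of names and `cols` a map
;; from name to vector of values; a real version would wrap an R data frame.
(defrecord RDataFrame [col-names cols]
  ColumnarDataset
  (column-names [_] col-names)
  (column [_ col-name] (get cols col-name))
  (row-count [_] (count (get cols (first col-names)))))

(def wrapped (->RDataFrame [:a :b] {:a [1 10] :b [2 20]}))
```

Code written against the protocol, such as `(column wrapped :a)`, would then not care whether the data sits in Clojure or in R, which is the point of the approach.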


Benefits:
1. You can then use R dataframes in any situation where code expects a core.matrix array
2. You can use all standard core.matrix operations on an R dataframe
3. You can easily convert between R dataframes and other core.matrix implementations with code like (array :r-dataframe some-array) or (array :vectorz r-dataframe)

Happy to help guide you on making a core.matrix implementation work correctly if you decide to go for this solution. There may be a little complexity if you want the implementation to be efficient, e.g. you may want a separate lightweight wrapper class for a single dataframe row etc.

Carsten Behring

Sep 28, 2017, 3:09:53 AM
to Numerical Clojure
Hmm.
My first reaction was that this is not my use case.
But then I thought about it again...

A big part of the work for my use case is indeed to "convert"
a data frame from a Clojure data structure to a data structure which JRI can pass to the R process.

This means converting the Clojure data structure to type "REXP",
see here:

REXP is more or less like "java.lang.Object", so it can express any R data type in terms of Java classes.


So it might indeed be possible and useful to think about a core.matrix "implementation" based on class REXP.

The strange thing here is that for my use case I would not need to implement any "mathematical operation" in the implementation
(only basic accessors to retrieve all columns and all rows).
Is this useful to do?

So I would (at least initially) only implement a tiny subset of all the protocols:

This could then help for interoperability of matrices between clojure and R.

But my main use case is "interoperability of datasets".
I looked at the clojure.core.matrix dataset, which is not protocol-based,
so I cannot implement "dataset" based on "REXP".

But what could indeed be done is to implement the conversion
from "dataset to REXP" with a function which takes a

DataSet type/object from

as input.


I see two alternatives for dataset interop between Clojure and R.
Either

- a REXP-based implementation of a subset of the core.matrix protocols, or
- ad hoc conversion functions from core.matrix.dataset to REXP and back

I tend towards the second choice, with the additional small problem that I am not sure whether to settle on
clojure.core.matrix datasets (vs. a simple sequence of maps).


Any advice on this?

Carsten Behring

Sep 28, 2017, 9:50:19 AM
to Numerical Clojure
I found a project which does exactly what I need,
just for an older version of Incanter (1.5.6),
i.e. using the old dataset implementation that is not based on core.matrix.

I think the first step forward would be to make it work with
the Incanter 1.9.0 Dataset implementation; this should be rather straightforward.

As a second step, it could then handle core.matrix objects as well.
This should be rather straightforward too, I hope.

I will do this work in here:

Please provide me with any comments.

Mars0i

Sep 28, 2017, 11:24:05 PM
to Numerical Clojure
I have nothing practical to add, but I want to express support for what you're doing.  Thanks. 

I agree that neither Incanter nor any other pure Clojure tool will ever catch up with the number and variety of useful statistical packages available in R.  Easing integration with R in any way is a good thing.

Carsten Behring

Sep 29, 2017, 6:37:51 AM
to Numerical Clojure
I wrote a short summary of the main extension I plan for rincanter, which should allow a more seamless integration between R and Clojure through "easy exchange of data frames".


All needed pieces for this are present in rincanter already, so it should be rather easy to implement.

My version of rincanter already works with the main clojure.core.matrix data structures, namely "matrix" and "dataset",
and was updated to Clojure 1.8.0 and Incanter 1.9.0.

This version of rincanter had already been changed to communicate with the R session over a client-server protocol, using the "Rserve" package of R.
I am not 100% sure whether this is "better" than the original implementation, which started an R session via a native library.
It removes the ugly requirement for native libraries in the project.clj,
but it requires starting an R session beforehand (or calling a Clojure function which shells out to do so).

For me, it worked without problems.

Benchmarking this is probably important to do.


https://github.com/behrica/rincanter

If you have any further comments, please make them on GitHub directly.

Peter Schmiedeskamp

Sep 29, 2017, 4:09:38 PM
to Numerical Clojure
You might also take a look at Renjin, which is an implementation of R for the JVM. I've not personally used it other than some cursory poking around, but you might find the interop between it and Clojure easier.


If you end up using this in any way, you should consider writing a blog post :-)

Cheers,
Peter

Carsten Behring

Sep 30, 2017, 9:55:20 AM
to Numerical Clojure
Thanks for the hint to Renjin.

If I needed to re-implement the interop between Clojure/Java and R, it might have been worth looking at.
But I found "rincanter", which did exactly that some years ago, and it seems to work well.
So I based "rojure" on it and needed only some small changes. See my announcement of the first "rojure" release.


I am doubtful about the long-term success of Renjin; it seems far too much work
to get all R packages (written in C++, Fortran, R) working reliably.

So for Clojure I believe it is more realistic to assume that "most" day-to-day functionality needed for data science will appear rather quickly in Clojure / Java.
Java 9 and its new REPL might help there as well.

So we need some form of bridging towards R / Python.

This can be done in various forms, all with advantages and drawbacks:

- rincanter/rojure
- opencpu
- renjin
- gg4clj
- make and csv file exchange
- a Java/clojure implementation of "feather" https://github.com/wesm/feather, missing for now.