Re: Is there any desire or need for a Clojure DataFrame? (X-POST from Numerical Clojure mailing list)


Christopher Small

Mar 9, 2016, 6:47:44 PM
to Clojure

If you're going to do any work in this area, I would highly encourage you to do it as part of the core.matrix library. That is what Incanter is or will be using for its dataset implementation. But it's nice for those abstractions and implementations to be separate from Incanter itself, since Incanter is a rather large dependency.

Core.matrix is certainly (in my eyes) becoming the de facto matrix computation library in the Clojure ecosystem, and I think that in the level of interop between different implementations, and in the extent of utilization by the Clojure community, we rival the Python offerings. However, while core.matrix has some dataset protocols, API functions and basic implementations, there's still some work to do to get the full expressiveness of the data.frame pattern as seen in R and Pandas. Specifically, there is no support for setting rownames (or arbitrary "name" assignments beyond those of a single dimension, i.e. columns). This is something I started working on a while back, but wasn't able to finish. I could potentially push what I came up with to a fork, but unfortunately, I don't have any more time to work on the problem at the moment.

Mike Anderson is a great project maintainer, and will probably be happy to help guide you in stitching together a solution.

Best

Chris





On Wednesday, March 9, 2016 at 12:57:31 PM UTC-8, arthur.ma...@gmail.com wrote:
Is there any desire or need for a Clojure DataFrame?


By DataFrame, I mean a structure similar to R's data.frame, and Python's pandas.DataFrame.

Incanter's DataSet may already be fulfilling this purpose, and if so, I'd like to know if and how people are using it.

From quickly researching, I see that some prior work has been done in this space, such as:


Rather than going off and creating a competing implementation (https://xkcd.com/927/), I'd like to know if anyone here is actively working on, or would like to work on, a DataFrame and related utilities for Clojure (and by extension Java). Is it something that's sorely needed, or is everybody happy with using Incanter or some other library that I'm not aware of? If there's already a de facto standard out there, would anyone care to point it out?

As background information:

My specific use-case is in NLP and ML, where I often explore and prototype in Python, but I'm then left to deal with a smattering of libraries on the JVM (Mallet, Weka, Mahout, ND4J, DeepLearning4j, CoreNLP, etc.), each with their own ad-hoc implementations of algorithms, matrices, and utilities for reading data. It would be great to have a unified way to explore my data in the Clojure REPL, and then serve the same code and models in production.

I would love for Clojure to have a broadly compatible ecosystem similar to Python's Numpy/Pandas/Scikit-*/Scipy/matplotlib/GenSim, etc. core.matrix and Incanter appear to fulfill a large chunk of those roles, but I don't know whether they've become the de facto standards in the community yet.

Any feedback is greatly appreciated.

Daniel Slutsky

Mar 9, 2016, 7:04:17 PM
to Clojure
Thank you for raising this question.

By the way, one desired feature for a Clojure dataframe abstraction would be good interop with Renjin's dataframes.
Renjin is a JVM-based rewrite of (a subset of) R. It offers a large number of JVM-based statistical libraries. Most of them rely on the dataframe abstraction for their data. R is also very Lisp-like in its data representation, so wrapping all this with Clojure would be a delight.

Christopher Small

Mar 9, 2016, 7:52:23 PM
to clo...@googlegroups.com
Sounds great; and sure thing, will do :-)

The basic idea I had was to implement a bidirectional index mapping names <-> indices. This requires making sure you keep the index up to date any time you change the data, but seemed the easiest way forward.
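As a rough illustration of the bidirectional index idea, here is a minimal sketch in plain Clojure. It is not the code from the fork; the names `make-index`, `index-of`, `label-at` and `remove-label` are hypothetical, and it assumes labels are unique. Rebuilding both directions on every change is the simplest way to keep them in sync.

```clojure
;; Hypothetical sketch: a two-way index between labels and positions.
;; The vector gives position -> label, the map gives label -> position.
(defn make-index
  "Builds a two-way index from an ordered collection of unique labels."
  [labels]
  (let [labels (vec labels)]
    (assert (or (empty? labels) (apply distinct? labels))
            "labels must be unique")
    {:idx->name labels
     :name->idx (zipmap labels (range))}))

(defn index-of
  "Position of a label, or nil if the label is unknown."
  [index label]
  (get-in index [:name->idx label]))

(defn label-at
  "Label at a position, or nil if the position is out of range."
  [index i]
  (get-in index [:idx->name i]))

(defn remove-label
  "Drops a label and rebuilds both directions so positions stay contiguous."
  [index label]
  (make-index (remove #(= label %) (:idx->name index))))

;; Usage
(def rows (make-index [:alice :bob :carol]))
(index-of rows :bob)                        ;=> 1
(label-at rows 2)                           ;=> :carol
(index-of (remove-label rows :alice) :bob)  ;=> 0
```

Any operation that reorders or drops rows in the data would need a matching update of the index, which is the bookkeeping cost mentioned above.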

My fork is here: https://github.com/metasoarous/core.matrix/commits/develop

Here are a couple of related issues:

https://github.com/mikera/core.matrix/issues/193
https://github.com/mikera/core.matrix/issues/220

Hope you can come up with something nice!

I would focus first on coming up with what seems like a nice set of protocols, so that we can be flexible with implementations. Ideally, we'd be able to just apply some wrapper to any core.matrix array, vector, matrix, etc. that provided named/labeled access to the data, and would be fairly seamless with the rest of the library. But you should also be able to wrap something like Renjin's dataframes (as Daniel Slutsky mentioned; just implement the protocols using their classes, I imagine). There might have to be some iteration here. Like: initial protocol design -> initial implementation -> redraft protocols -> try new implementation -> redraft protocols, etc. I've noticed that it can be difficult to properly abstract implementation details away from the protocol/API on the first go (though you might have mastered this more than I :-)).

My 2c

Good luck!
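To make the wrapper idea concrete, here is a minimal sketch of what a labeling protocol could look like. Everything in it is an assumption for illustration: `PLabeled`, `labels`, `select-label` and `LabeledMatrix` are hypothetical names and not part of core.matrix, and the wrapped data is a plain nested vector limited to two dimensions.

```clojure
;; Hypothetical protocol for labeled access; not part of core.matrix.
(defprotocol PLabeled
  (labels [this dim]
    "Returns the vector of labels for dimension dim, or nil if unlabeled.")
  (select-label [this dim label]
    "Returns the slice stored under label along dimension dim."))

;; A wrapper around a nested vector plus a map of dimension -> labels.
(defrecord LabeledMatrix [data dim-labels]
  PLabeled
  (labels [_ dim] (get dim-labels dim))
  (select-label [this dim label]
    (let [ls (or (labels this dim) [])
          i  (.indexOf ^java.util.List ls label)]
      (when (neg? i)
        (throw (ex-info "Unknown label" {:dim dim :label label})))
      (case (long dim)
        0 (nth data i)               ; a whole row
        1 (mapv #(nth % i) data)     ; one element from every row
        (throw (ex-info "Only 2 dimensions in this sketch" {:dim dim}))))))

;; Usage
(def m (->LabeledMatrix [[1 2] [3 4]] {0 [:r1 :r2], 1 [:x :y]}))
(select-label m 0 :r2)  ;=> [3 4]
(select-label m 1 :x)   ;=> [1 3]
```

A second backend (for example a Renjin data frame) would implement the same protocol against its own classes, which is where the redrafting of the protocol is likely to start.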

Chris



On Wed, Mar 9, 2016 at 4:29 PM, <arthur.ma...@gmail.com> wrote:
Chris, thanks for the reply. 

It's good to know that I'm not the only one who misses this functionality! My goal is definitely to be compatible with Incanter and core.matrix, as they both seem mature, and I will never have the time to implement that functionality from scratch myself. I'll be studying the source of Pandas over the next few days, as I want to have a good idea of how they implement their dataframes before starting on the Clojure version. My long-term goal is for future authors to look to this set of core tools for data analysis as the basis for any packages they build.

If you'd like to publish whatever you've written (hacked up code is ok), I'll take a look at that as a starting point, or at least as one possible design.

- Arthur


Mikera

Mar 10, 2016, 7:45:44 PM
to Clojure
core.matrix maintainer here.

I think it would be great to have more work on dataframe-type support. I think the right strategy is as follows:
a) Make use of the core.matrix Dataset protocols where possible (or add new ones)
b) Create implementation(s) for these protocols for whatever back-end data frame implementation is being used

The beauty of core.matrix is that we *can* support multiple implementations without fragmentation, because the protocol based approach means that every implementation can use the same API. This is already working well for the array programming APIs (it's easy to mix and match Clojure data structures, Vectorz Java-based arrays, GPU backed arrays in computations). We just need to do the same for DataFrames.
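The protocol-based dispatch described here can be sketched in a few lines of plain Clojure. This is only an illustration of the mechanism, not core.matrix's actual protocols: `PElements`, `elements` and this `add` are hypothetical names, and the two "implementations" are a Clojure vector and a Java `double[]`.

```clojure
;; Hypothetical protocol for illustration; core.matrix's real protocols differ.
(defprotocol PElements
  (elements [this] "Returns the elements as a seq of numbers."))

;; Implementation 1: Clojure persistent vectors.
(extend-type clojure.lang.IPersistentVector
  PElements
  (elements [v] (seq v)))

;; Implementation 2: Java double arrays. (Class/forName "[D") is double[].
(extend (Class/forName "[D")
  PElements
  {:elements (fn [arr] (seq arr))})

(defn add
  "Element-wise sum of any two PElements values, returned as a vector."
  [a b]
  (mapv + (elements a) (elements b)))

;; Usage: the caller mixes backends without any explicit conversion.
(add [1.0 2.0] (double-array [10.0 20.0]))  ;=> [11.0 22.0]
```

The single `add` function never inspects the concrete types; supporting another backend means adding one more protocol implementation rather than changing the API.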

Now: the current core.matrix Dataset API is a bit focused on 2D data tables, but I think it can be extended to general N-dimensional dataframe capability. Would be a great project for someone to take on, happy to give guidance and help merge in changes as needed.

I don't have a particularly strong opinion on which Dataframe implementations are best, but it looks like Spark and Renjin are both great candidates and would be very useful additions to the Clojure numerical ecosystem. If we do things right, they should interoperate easily with the core.matrix APIs, making Clojure ideal for "glue" code across such implementations.

Dragan Djuric

Mar 10, 2016, 8:09:14 PM
to Clojure
This is already working well for the array programming APIs (it's easy to mix and match Clojure data structures, Vectorz Java-based arrays, GPU backed arrays in computations). 

While we could agree to some extent on the other parts of your post, the GPU part is *NOT* true: I would like you to point me to a single implementation anywhere (Clojure or other) that (easily or not) mixes and matches arrays in RAM and arrays on the GPU backend. It simply does not work that way.

Mikera

Mar 10, 2016, 8:47:58 PM
to Clojure
On Friday, 11 March 2016 09:09:14 UTC+8, Dragan Djuric wrote:
This is already working well for the array programming APIs (it's easy to mix and match Clojure data structures, Vectorz Java-based arrays, GPU backed arrays in computations). 

While we could agree to some extent on the other parts of your post, the GPU part is *NOT* true: I would like you to point me to a single implementation anywhere (Clojure or other) that (easily or not) mixes and matches arrays in RAM and arrays on the GPU backend. It simply does not work that way.

You misunderstand my point. Obviously, there may need to be some copying when you move between managed and unmanaged memory. 

But I'm not talking about that: the point is that this can happen "under the hood", without the user needing to do explicit conversions etc. All thanks to the protocol implementations, you can mix and match GPU, native and Java backed instances with the same API. 

core.matrix can trivially do stuff like (add! native-array java-array) for example.

What's not to like about that?

Mikera

Mar 13, 2016, 11:45:53 PM
to Clojure


On Friday, 11 March 2016 09:21:09 UTC+8, arthur.ma...@gmail.com wrote:
Renjin's and Spark's dataframes are not going to be easily separated from their respective codebases, as far as my brief perusal of the source can tell. I agree that N-D DataFrames would be a good addition to the ecosystem, similar to the goals of Python's xarray (xarray.pydata.org). However, it is not a priority for me at this time. Thanks for pointing out the DataSet proposal. I'll take a look at that later.

On a slightly related note, where is the best place to ask core.matrix questions? I have some small questions about sparse matrix support in core.matrix, and what sparse formats are implemented.

There is the Numerical Clojure group: 

For quick questions / discussion many people are on the #data-science channel in the Clojure slack  

Or you can just file a core.matrix issue with a question: I'm usually quite responsive with these and they may serve as a reference for future people who run into similar questions: 

Chaoya Li

Jun 4, 2016, 11:51:49 PM
to Clojure
Hi, I'm interested in a Clojure DataFrame implementation. How is this going now? Are you coding for core.matrix or are you writing a new library from scratch? How can I join this project?

On Thursday, March 10, 2016 at 4:57:31 AM UTC+8, arthur.ma...@gmail.com wrote:

Daniel Slutsky

Jun 7, 2016, 4:19:15 AM
to Clojure
Hi.

I'm experimenting with Renjin interop - in particular, trying to make Renjin objects implement core.matrix protocols (as mikera suggested).

I hope to be able to share a draft soon, and then ask for your opinions about it.



On Monday, June 6, 2016 at 6:28:59 PM UTC+3, arthur.ma...@gmail.com wrote:
Chaoya,

     I haven't been working on this, and I don't really intend to anytime soon; there's other work that I must attend to in the immediate time frame.

- Arthur