Statistics roadmap post

Simon Byrne

Jan 13, 2016, 5:26:00 PM

to julia-stats

Hi All,

The roadmap for the Moore Foundation work has been posted here:

http://juliacomputing.com/blog/2016/01/14/stats-roadmap.html

Comments/thoughts welcome.

-Simon

Iain Dunning

Jan 13, 2016, 6:10:26 PM

to julia-stats

Hey Simon,

Interesting reading! I think it's a good summary of work that could be done, but it feels like it's missing a final paragraph that outlines JC's timeline for how the grant money will be spent and the particular things JC will be tackling first. This is probably good advertisement for JC in itself, but it also guides possible contributors to things that JC might not be able to get around to.

Cheers,

Iain

jock....@gmail.com

Jan 13, 2016, 6:25:51 PM

to julia-stats

Simon, that's a great list. Nailing the basics and bringing them up to a modern, best-in-class standard is definitely in order and totally the right use of the grant. Done well, there's nothing that couldn't be built on top of this foundation. I second Iain's comment re deliverables and timelines: it'd be good not only for devs and collaborators, but also for businesses that are assessing Julia for commercial use.

David Anthoff

Jan 13, 2016, 6:29:50 PM

to julia...@googlegroups.com

Yes, great write-up and plan!

And I concur with Iain: it would be great if the role JC will play in this could be made more explicit. Maybe even say who from JC is going to coordinate this and contribute? I, for one, don’t even know who is part of JC other than the original creators of Julia (I assume?), so it would be nice to associate names with the statistics effort that is sponsored by the Moore Foundation.

Cheers,

David

Viral B. Shah

Jan 13, 2016, 11:57:20 PM

to julia-stats

I think a timeframe is a good next step. This was to get the discussion started, and we can now create a rough prioritization that we can all debate here. Some of these things are straightforward and some are more exploratory, but the goal is exactly as discussed in this thread: get going on a solid foundation and longer-term projects so that other folks can build on it.

Thinking aloud, should the DataFrames work be the highest priority? The challenge with that is that it is more exploratory and could possibly have false starts. Or should we focus on making the stats packages higher quality?

There will be some dependencies on the compiler in terms of optimizations, which will be folded into the regular julia development as part of this work.

On other JC related questions, I really should just update our website.

-viral

Lars Tonkard

Jan 14, 2016, 11:31:01 AM

to julia-stats

The modeling work can be handled temporarily through PyCall if more involved tests are needed, so my vote (not that it counts much) is for DataFrames and tidy data.

- Any reason not to just lift the API from dplyr?

- How will the expression system work? Can macros be purposed for this?

Questions about the modeling:

- Do we want to improve on R's formula syntax and have a generic model building front end?

- Aside from streaming tuples into iterative models, how is it intended to fit models on disparate backends? Can one hook into the C API and do in-database linear algebra? This seems technically infeasible otherwise.

Simon Byrne

Jan 14, 2016, 12:13:26 PM

to julia-stats

To answer several questions:

> ... it would be great if the role JC will play in this could be made more explicit. Maybe even say who from JC is going to coordinate this and contribute?

I'll be the one mostly dedicated to this on the Julia Computing side, though others will be involved.

> Any reason not to just lift the API from dplyr?

dplyr has lots of good ideas (which is why I mentioned it explicitly), but it relies heavily on R's nonstandard evaluation and weird scoping. Establishing a clear, straightforward syntax here is certainly one of the more challenging parts.

> How will the expression system work? Can macros be purposed for this?

This is one of the main challenges. Macros are incredibly useful, but can often be a bit too magical, making code difficult to reason about.

> Do we want to improve on R's formula syntax and have a generic model building front end?

Yes. R's formula interface is certainly powerful, but does have its drawbacks. Once you step out of the linear model context, the idea of using + to specify the columns to use in the model is a bit odd. It would also be useful if the syntax for referring to columns was the same as the data manipulation framework.

> Aside from streaming tuples into iterative models, how is it intended to fit models on disparate backends? Can one hook into the C API and do in-database linear algebra? This seems technically infeasible otherwise.

My idea is that you should be able to pass any "data table" object (e.g. a DataFrame, or some DB query) to a model-fitting function and have it "just work". The exact method would depend on the model and backend, but there are a variety of approaches we could use so that you don't have to keep the whole dataset in memory (e.g. chunked QR, stochastic gradient, distributed linear algebra).
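
As a rough illustration of fitting a model without holding the whole dataset in memory, here is a minimal sketch in Python (used here only for illustration, not Julia). It is not the chunked-QR or stochastic-gradient approach named above: it swaps in a simpler technique, one-pass accumulation of the normal-equation sums for a single-predictor least-squares line, which is less numerically robust than QR. The names `fit_line_in_chunks` and `in_memory_chunks` are hypothetical and do not belong to any existing package.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[float, float]  # one observation: (x, y)


def fit_line_in_chunks(chunks: Iterable[Sequence[Row]]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x while seeing only one chunk at a time.

    Only five running sums are kept, so memory use does not grow
    with the number of rows.
    """
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1.0
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Solve the 2x2 normal equations in closed form.
    det = n * sxx - sx * sx
    if det == 0.0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return intercept, slope


def in_memory_chunks(rows: List[Row], size: int) -> Iterator[List[Row]]:
    """Stand-in for a chunked data source such as a database cursor."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Noise-free data on the line y = 1 + 2x, delivered three rows at a time.
rows = [(float(x), 1.0 + 2.0 * x) for x in range(10)]
intercept, slope = fit_line_in_chunks(in_memory_chunks(rows, 3))
print(intercept, slope)
```

The fitting function only asks for something it can iterate over in chunks, so the same call would work whether the rows come from a list, a file reader, or a query result. That is the sense of "just work" sketched here.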

-Simon

Lars Tonkard

Jan 14, 2016, 12:38:35 PM

to julia-stats

> My idea is that you should be able to pass any "data table" object (e.g. DataFrame, or some DB query) to a model-fitting function and have it "just work".

This sounds amazing.

> It would also be useful if the syntax for referring to columns was the same as the data manipulation framework.

You mean some unification of the two?

What would this look like? A DAG of columns and Distributions.jl?

Johan Sigfrids

Jan 14, 2016, 2:03:25 PM

to julia-stats

Will migrating DataFrames to NullableArrays be part of this?

Drew G

Jan 15, 2016, 7:43:21 AM

to julia-stats

This is a fantastic list and makes me extremely excited about Julia's future as a modern language for data analysis.

Simon, I'm glad to hear that you recognize the need for a clear, straightforward syntax for dataframes, and would like to reiterate your point. I hope that much of the time and attention goes into making the task of working with data on a day-to-day basis a clear and simple pleasure, even if that data is small and sits in memory. In that sense, I think of "modern" as being equally about the elegance of the front-end interface and syntax as about the breadth of back-end support.

I'm sure you have thought about this as well, but just wanted to share my perspective. Looking forward to all of the great work ahead!

Benjamin Deonovic

Jan 16, 2016, 1:16:20 PM

to julia-stats

> Complementing the above work, we intend to support a more flexible choice of algorithms, such as QR, Cholesky, stochastic gradient descent, MCMC techniques (for example via Lora.jl or Stan.jl), and variational methods for Bayesian models.

Just want to point out that Mamba.jl is a much more mature MCMC package in Julia. Lora.jl has just recently gone through a major revamp and is still in heavy development: it doesn't have any convergence diagnostics or plotting features, and the master branch only contains a few samplers (the devel branch has several more). Stan.jl requires the user to have Stan installed, so I don't think it would be an appropriate addition to GLM.jl. Stan.jl also utilizes Mamba for convergence diagnostics and plotting.

This isn't a slight against Lora.jl or Stan.jl. Theodore Papamarkou is doing a great job with Lora.jl, and Rob Goedman's port of Stan to Julia is fantastic. I just wish Mamba got a bit more traffic than it does. Of course, having several packages that do the same thing is not a bad thing; it can encourage innovation and development. It does seem a bit unfair, though, that Lora.jl gets to be featured in JuliaStats.

Alex Williams

Jan 16, 2016, 2:06:06 PM

to julia...@googlegroups.com

This is perhaps a question/comment for the julia-opt mailing list, but I've previously thought that it would be nice if GLM.jl was linked into the optimization environment (mostly Optim.jl) rather than calling its own internal optimization routines.

It would be really awesome to develop packages for stochastic gradient descent and related techniques. This one looks like a good start: https://github.com/lindahua/SGDOptim.jl

Rob J. Goedman

Jan 16, 2016, 3:39:20 PM

to julia...@googlegroups.com

I fully support Benjamin’s email (with respect to Stan.jl vs. Mamba.jl).

Mamba.jl should be part of JuliaStats in my opinion.

Regards,

Rob

Lars Tonkard

Jan 16, 2016, 7:09:40 PM

to julia-stats

I think it's the lack of autodiff, which is an extension of a difference in philosophy of use.

Simon Byrne

Jan 17, 2016, 5:11:46 AM

to julia-stats

>> It would also be useful if the syntax for referring to columns was the same as the data manipulation framework.

> You mean some unification of the two?

What I mean is that how you refer to a column of data should be the same across all packages. I'm not 100% sure what that should be yet, though.

> Will migrating DataFrames to NullableArrays be part of this?

Possibly. Though NullableArrays is currently the most performant option, it is somewhat unsatisfying in that values all end up being wrapped with a Nullable type, which is awkward to use. One of the parts of this project is to improve the performance of small Union types, for example by generating explicit branches at compile time rather than relying on runtime dispatch. If this could be made reasonably fast then it might be worth sticking with the current approach.

Another approach would be to allow DataFrames to accept different array types, so that ordinary well-typed vectors can be used if there is no missing data, or even other AbstractVectors such as Ranges. This would then allow use of either DataArrays or NullableArrays.
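
To make the contrast concrete, here is a loose analogy in Python (used only for illustration; it says nothing about how Julia or NullableArrays is implemented). The hypothetical `Nullable` wrapper boxes every element, so each one must be unwrapped before use. A column of "value or None" behaves like a small union, and a plain list or range with no missing data can go through the same function unchanged. All names below are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Generic, Iterable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Nullable(Generic[T]):
    """Hypothetical wrapper: every element is boxed, whether present or missing."""
    value: Optional[T] = None
    has_value: bool = False

    def get(self) -> T:
        if not self.has_value:
            raise ValueError("missing value")
        return self.value


def mean_wrapped(column: List[Nullable[float]]) -> float:
    """Mean of the present values; each element has to be unwrapped first."""
    present = [item.get() for item in column if item.has_value]
    return sum(present) / len(present)


def mean_union(column: Iterable[Optional[float]]) -> float:
    """Mean of the present values; elements are used directly after a None check."""
    present = [item for item in column if item is not None]
    return sum(present) / len(present)


wrapped = [Nullable(1.0, True), Nullable(), Nullable(3.0, True)]
union = [1.0, None, 3.0]

print(mean_wrapped(wrapped))       # column of boxed values
print(mean_union(union))           # column of "float or None"
print(mean_union([1.0, 3.0]))      # ordinary list, no missing data
print(mean_union(range(1, 4)))     # a range used as a column
```

The union-style function accepts several kinds of column without conversion, which is the convenience argument for the second approach. Whether that can be made fast is the compiler question raised above.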

> In that sense, I think of "modern" as being equally about the elegance of the front-end interface and syntax as about the breadth of back-end support.

I certainly agree. This will probably take several iterations, but hopefully we can get something elegant.

> Just want to point out that Mamba.jl is a much more mature MCMC package in julia.

Sorry, I had forgotten about Mamba.jl (the list wasn't intended to be exhaustive), but I would certainly hope to include that as well. It is a very nice package.
