R and statistical programming


Harlan Harris

Feb 25, 2012, 1:22:53 PM
to juli...@googlegroups.com
Hi everyone, I've installed Julia and have started playing with it. Amazing work so far! But I wanted to start a thread to talk about whether Julia has the potential to be a competitor to R, or not. So far, it seems like people are thinking of it as an alternative to Matlab, or maybe SciPy or Clojure/Incanter. Of the programming languages used for data analysis, R has (in my view) the most desperate need for a reboot.

Quick summary of R for people who haven't used it before. It's the open-source implementation of the venerable S programming language, written in the early 90s by statisticians, for statisticians. It's a mostly-interpreted language, written in C and Fortran, with features designed for interactive exploration and modeling of real-world data. It has Scheme-like scoping and support for functional programming, is partially homoiconic (with some effort), uses lazy evaluation, and has several mostly-terrible bolted-on OOP extensions. Everything is a vector. It has an amazing set of libraries written by the community, including most new cutting-edge statistical algorithms. It is also slow and finicky, with an outdated and wildly inconsistent core library, and only-partially-resolved issues with big data, parallelism, and graphics.

You can try it out here: http://www.stats4stem.org/rweb.html

Let me demo a few features that are absolutely critical for Julia to have any hope of adoption from the statistical community. Feel free to paste this code into RWeb:

# quick R demo showing NA, data frames, attributes, formulas, lm, and optional parameters

# Factors are sorta like enumerated types, but more flexible and with great support by the statistics library functions. By default, strings are converted to
# factors, but here I'm disabling that.
options(stringsAsFactors=FALSE)

# Three important things here: NA, attributes/names, and the data frame data structure.
# NA, a special value for ALL data types in R, is distinct from NaN. This is an absolute must for any potential competitor to R.
# I'm creating a data.frame, which is a list of heterogeneously-typed vectors constrained to have the same length, a la a database table.
# That should be easy to create in Julia. But also note that the columns have names. These are stored as "attributes" of the structure. All
# objects in R can have as many attributes as needed, and many of those attributes (like row and column names, dimension names)
# are used extensively by many functions. It's hard to imagine how an interactive exploration of data could work if things could not be
# trivially labelled.
qq <- data.frame(x = c(1:4, NA),
                 group = c('A', 'B', 'B', 'A', 'A'),
                 y = c(2, 5, 7, 8, 9.5))

# four different views of the qq object. Under the hood, the first two calls get shuffled off to print.data.frame() and summary.data.frame().
# Not a problem for Julia.
print(qq)
summary(qq)
str(qq)
attributes(qq)

# Lists, like qq, can be indexed easily by name. The second form is the obvious one; the first is easier to type.
# Note, however, that the second function call has a named parameter. It's difficult to imagine doing an interactive
# statistical analysis if you had to keep looking up argument positions.
# Also note the nice vectorized addition and other mathematical operators, all of which do the appropriate thing with the
# NA.
mean(qq$x)
mean(qq[,'x'], na.rm=TRUE)
qq$x + 1
sqrt(qq$x)

# The first argument is a formula type, which is an expression that is not evaluated immediately. Here, I'm saying that y is
# modeled by x and group, which will be columns in the qq data structure. Note that this function is returning an object
# of class "lm", which the REPL automatically prints, using print.lm().
lm(y ~ x + group, qq)

# Note the large number of options to this function. Again, some way of using named parameters is going to be really key
# for clean, clear, reproducible data exploration.
args(lm)



There's a lot of stuff in Julia I really like so far. Implicit multiplication is awesome. Compact function declarations are awesome. Coroutines and the parallel operations support look to be awesome. I love the way the type system is put together, and the lightweight feel to the methods. Macros look very useful and well-implemented.

How are you guys, the developers of Julia, thinking about R and statistical (vs mathematical) programming?

 -Harlan

ps - speaking of "you guys", who are you guys? There are no bios on julialang.org that I can see! Is the project grant-supported, or volunteer labor?

Viral Shah

Feb 25, 2012, 2:13:40 PM
to juli...@googlegroups.com
This is a good question. What should julia do in order to provide an alternative to R? In my opinion, it is R's libraries and its community that define it - rather than the language design or its implementation.

It would be worthwhile to study the design of R's libraries and use that to implement julia's statistics functionality rather than basing it on Matlab's statistics toolbox. How usable would R's libraries be? So far, we have not paid much attention to this, and we need to start thinking about it.

As for the authors, Jeff is a grad student at MIT, Stefan at UC Santa Barbara, Alan Edelman is Professor of Mathematics at MIT, and I work for the Government of India. Julia is largely a volunteer effort, with almost all of us working on it as much as we can, but anchored around the MIT community. For example, it was used in the Parallel Computing course at MIT (http://beowulf.csail.mit.edu/18.337/index.html). We will be looking for grants and such to support work, travel to conferences, etc.

-viral

Viral Shah

Feb 25, 2012, 2:35:35 PM
to juli...@googlegroups.com
As you note, it is fairly straightforward to implement a data structure like R's data frames in julia, including names for the columns, and indexing columns with names. Most likely, julia currently has all the functionality needed for this.

Named parameters are a feature that has been planned from day one, but we just haven't gotten around to implementing it. So far we have managed OK without it, but it will become a higher priority as more libraries are written.

-viral


Harlan Harris

Feb 25, 2012, 2:56:03 PM
to juli...@googlegroups.com
I mostly agree that it's the libraries and the community, rather than the language, that draw users. But I do think there are key aspects of the language that draw package authors. In particular, ubiquitous support for NA is absolutely critical.

I started working a bit on NA for Julia. Some basic functionality is quite easy, but I do think it's worthwhile considering whether NA should be a core aspect of the language, rather than an add-on. Using unions to implement NAs, as I tried, may have substantial performance penalties. Here's my code so far: https://gist.github.com/1910073 (Incidentally, love that github supports Julia syntax highlighting!!)

I also agree, mostly, that R's stats toolbox is a better basis than Matlab's, but there are some serious limitations to R's core libraries that should be avoided by Julia. (I also agree with your later comments that data.frames are probably straightforward, and that named parameters need to get prioritized!)

Oh, two other quick things:

1-indexing is definitely the way to go, for this language. It's an index, not an offset.

It seems like a few people have started using the #julialang hashtag on Twitter.

 -Harlan

Jeff Bezanson

Feb 25, 2012, 3:44:22 PM
to juli...@googlegroups.com
Thanks Harlan, this is a great writeup and very helpful to us.

Unfortunately right now the implementation is not smart enough to
store an array of Union types efficiently using bit pattern tricks.
Using NaNs and minimum-integer values like R would be the way to go.
Best is to define a bits type:

bitstype 64 NAInt <: Integer

and operations on this can be overloaded to respect the special NA value.
As an aside, I would use
const NA = _NA()
instead of the macro...1 character shorter!
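Jeff's sentinel idea can be sketched outside Julia; this Python stand-in (the `NAInt` name and `INT64_MIN` sentinel choice are illustrative, not Julia's actual implementation) reserves one 64-bit pattern as NA and makes the overloaded operation check for and propagate it:

```python
# Illustrative sketch: a 64-bit integer type that reserves the minimum
# value as its NA sentinel and propagates NA through arithmetic, roughly
# what overloading operators on a `bitstype 64 NAInt` would do.
INT64_MIN = -2**63  # the reserved bit pattern standing in for NA

class NAInt:
    def __init__(self, value):
        self.value = value  # INT64_MIN means NA

    @property
    def is_na(self):
        return self.value == INT64_MIN

    def __add__(self, other):
        # NA is "poisonous" here by construction: NA + x == NA
        if self.is_na or other.is_na:
            return NAInt(INT64_MIN)
        return NAInt(self.value + other.value)

    def __repr__(self):
        return "NA" if self.is_na else repr(self.value)

NA = NAInt(INT64_MIN)
print(NA + NAInt(3))        # NA
print(NAInt(2) + NAInt(3))  # 5
```

In the Julia version the check would be compiled into each overloaded method, so plain `Int` arithmetic pays no cost at all.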

Stefan Karpinski

Feb 25, 2012, 3:53:48 PM
to juli...@googlegroups.com
I'm pretty sure that GitHub's apparent support for Julia syntax highlighting is purely accidental. I'm not sure what language they think our files are, but whatever it is, the highlighting looks pretty good. I think that's mostly due to the dirt-simple, plain-vanilla nature of Julia's syntax. We've tried very hard to not do anything weird with syntax.

Is there any reason that NaNs can't be treated as N/A for statistical functions? That seems like the simplest and certainly most efficient option. Otherwise there's no way to store large data matrices as efficient inline arrays. However, since Float64 NaNs actually have a "payload" of 52 bits (and Float32 NaNs have 23 bits) that can be set to anything you want while still being a NaN, we could potentially have special NaN values that mean N/A in order to distinguish them from NaNs produced in other ways.
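The payload trick is easy to demonstrate at the bit level; here is a hedged Python sketch (the function names are invented for illustration) that builds a NaN carrying a chosen tag and reads it back:

```python
import math
import struct

# A Float64 NaN has an all-ones 11-bit exponent plus 52 payload bits; any
# nonzero payload is still a NaN. Setting the quiet bit and stashing a tag
# in the remaining 51 bits gives a NaN that can be told apart from the
# NaNs ordinary arithmetic produces.
def tagged_nan(tag):
    assert 0 <= tag < 2**51
    bits = (0x7FF << 52) | (1 << 51) | tag  # exponent all ones + quiet bit + tag
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def nan_tag(x):
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return bits & (2**51 - 1)

NA = tagged_nan(1954)  # arbitrary tag chosen for illustration
print(math.isnan(NA))  # True: the hardware sees an ordinary NaN
print(nan_tag(NA))     # 1954: but the payload distinguishes it
```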

I've probably done the most R programming of any of the core Julia developers, but it's been a fairly long time. I agree with Harlan's overall assessment of R. There's absolutely no reason that we can't be as good at stats as R is. One of the huge advantages Julia has over R is that the stats library doesn't have to be written in C — you can write it in Julia, have it be fast, and use all the lovely goodies like generic programming and multiple dispatch. With that additional leverage, "catching up" to R should be much more doable than if we had to redo all of the R-specific C code that R's libraries consist of.

At the same time, I think we need to avoid a lot of the mistakes that R made. The good part of R is that for people who just want to run an ANOVA or some other very sophisticated statistical analysis, it's usually as simple as doing `anova(x)` or something like that (I forget what the actual call is — that may be it). The result is a big old object that has everything you could possibly want to know about the anova analysis of the data `x`. It may even plot the results for you in some graphical form. That's great for usability for programming novices, but horrible for programmability. So the main thing I'd want to improve over R is programmability and reusable building blocks for statistics. I don't want to be forced to have an explicit vector of all my data — what if the format I have is an empirical PDF or empirical CDF? In R you're pretty much screwed unless you rewrite the tests for yourself. So let's make something where you can define a stats test once and just as easily apply it to an explicit list of data points, a PDF or a CDF — or whatever other form you happen to have your data in.
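The "define a statistic once, apply it to any representation" idea is essentially dispatch on the data's type. A minimal Python stand-in (the `Samples` and `Pmf` types are hypothetical, and single dispatch stands in for Julia's multiple dispatch):

```python
from functools import singledispatch

class Samples:
    """Raw data points."""
    def __init__(self, points):
        self.points = points

class Pmf:
    """An empirical distribution as (value, probability) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

@singledispatch
def mean(data):
    raise TypeError(f"no mean method for {type(data).__name__}")

@mean.register
def _(data: Samples):
    return sum(data.points) / len(data.points)

@mean.register
def _(data: Pmf):
    # Same statistic, different representation: expected value of the PMF.
    return sum(v * p for v, p in data.pairs)

print(mean(Samples([1, 2, 3, 4])))       # 2.5
print(mean(Pmf([(0, 0.5), (10, 0.5)])))  # 5.0
```

Any stats test written against the generic `mean` works unchanged on either representation, which is the reusable-building-blocks property Stefan describes.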

Stefan Karpinski

Feb 25, 2012, 3:58:00 PM
to juli...@googlegroups.com
In general, I would be amenable to setting aside typemin(Int32) and typemin(Int64) as NA values for integral types. Those values are a royal pain in the ass anyway. For example, abs(typemin(Int64)) == typemin(Int64) — i.e. it's still a negative value! There's no good way around this except to throw an error, which would kill performance for integers, which is obviously unacceptable. Unfortunately, there's also no way to make typemin(Int64) and typemin(Int32) "poisonous" like NaN is in IEEE 754. So you could potentially add an integer NA to some other integer and get a meaningless non-NA value.
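Both quirks are easy to reproduce. This Python sketch emulates 64-bit two's-complement wrapping explicitly (Python's own integers are arbitrary-precision, so the `wrap64` helper is an illustrative stand-in for native hardware arithmetic):

```python
INT64_MIN = -2**63

def wrap64(x):
    """Reduce x to a signed 64-bit value, as native integer hardware would."""
    x &= 2**64 - 1
    return x - 2**64 if x >= 2**63 else x

# abs(typemin(Int64)) wraps back to typemin(Int64): still negative!
print(wrap64(-INT64_MIN))  # -9223372036854775808

# And the sentinel is not "poisonous": adding 1 silently yields an
# ordinary, meaningless value instead of staying NA.
print(wrap64(INT64_MIN + 1))  # -9223372036854775807
```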

Won g'Faere

Feb 26, 2012, 7:47:22 AM
to juli...@googlegroups.com


For data analysis, it's hard to imagine how a language could be used if it lacked missing values.

R has NA_integer_, NA_real_, NA_complex_, and NA_character_. Plain NA is a logical NA, which is convenient to write and gets coerced to the appropriate type.

Some other statistical languages have even more powerful notions allowing more than one type of NA.  This is helpful to distinguish among different sorts of missingness such as structural missing values vs. erroneous measurement or refusal to answer a question vs. data entry error.  See this description of SAS:


Peer Stritzinger

Feb 26, 2012, 8:50:53 AM
to juli...@googlegroups.com
On Sat, Feb 25, 2012 at 9:58 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
> In general, I would be amenable to setting aside typemin(Int32) and
> typemin(Int64) as NA values for integral types. Those values are a royal
> pain in the ass anyway. For example, abs(typemin(Int64)) == typemin(Int64) —
> i.e. it's still a negative value! There's not good way around this except to
> throw an error, which would kill performance for integers, which is
> obviously unacceptable. Unfortunately, there's also no way to make
> typemin(Int64) and typemin(Int32) "poisonous" like NaN is in IEEE 754. So
> you could potential add an integer NA to some other integer and get a
> meaningless non-NA values.

For non-floats there will always be one extra check involved for every
use of an integer type.
This would be a high price to pay for all other uses.

Couldn't the type system be used to decide if such a check needs to be made?

OTOH, a union-based solution for Type-or-NA will probably be faster
than R anyway, so it won't hurt those who need it too much.
A union-based approach also has the advantage of allowing unlimited
special values.

-- Peer


Harlan Harris

Feb 26, 2012, 11:24:39 AM
to juli...@googlegroups.com
Won, thanks, yes. Here's another, clearer resource about SAS (which, in general, is probably not a good model for Julia): http://www.pauldickman.com/teaching/sas/missing.php

And section 3.3.4 of this document describes the R NA better: http://cran.r-project.org/doc/manuals/R-lang.pdf

Peer, it sounds like you're arguing for a set of parallel data types that support NA (or maybe multiple NAs), implemented by a union, with appropriate conversions and promotions? It does seem like a type check at compilation time would mean that if you wanted to use non-NA-supporting types, there would be no need for a run-time check, and the operation would be C-fast. Seems reasonable to me. And yes, beating R in many operations is a pretty low bar...

(Funny/sad side note: in R, parens and blocks are explicitly represented in the AST, so y = (a + b) is slower than y = a + b, and if I remember right, {y = a + b} is slower still.)

What's the right way to organize thinking through the rest of the issues around this, and then implementing them? I'm happy to work with others on it. It'd be great if we had someone with more experience working on the R (or SAS!) internals than I have.

 -Harlan

Matt Shotwell

Feb 26, 2012, 12:26:58 PM
to juli...@googlegroups.com
Here are some additional ideas/features where R may give some
inspiration:

1. sample quantiles: R's "quantile" function is mature.

2. quantile, density, and distribution functions: The standalone "Rmath"
library may be useful, if only for translation.

3. formatted data i/o: R provides a series of read.* functions, for
example, "read.csv". Memory mapping for big data would be nice also.

4. formulas, "~" operator: Formulas make it easy to create and
manipulate design matrices, among other uses.

5. optimal Lapack routines for symmetric and positive definite matrices:
It's tricky to balance this with the overhead (e.g., checking for
positive definiteness). The R "Matrix" package may be helpful.

6. graphics - the notion of building graphics from basic shapes is
useful. It might be timely to consider a standardized framework for
graphics. For example, if there were a standardized abstract graphics
object (e.g., a layout, coordinate system, collection of shapes and
their attributes), the tasks of writing render/output methods (e.g., to
SVG) could then be taken up by non-Julia core developers.
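Item 4's formula machinery can be sketched concretely. Assuming nothing about R's actual model.matrix internals, a formula like `y ~ x + group` amounts to expanding a table's columns into a numeric design matrix, with the categorical variable dummy-coded (the `design_matrix` helper here is invented for illustration):

```python
# Intercept + numeric x + indicator column for group != baseline,
# i.e. the matrix an OLS fit of y ~ x + group would be run against.
def design_matrix(x, group, baseline="A"):
    return [[1.0, xi, 1.0 if g != baseline else 0.0]
            for xi, g in zip(x, group)]

X = design_matrix([1, 2, 3, 4], ["A", "B", "B", "A"])
print(X)  # [[1.0, 1, 0.0], [1.0, 2, 1.0], [1.0, 3, 1.0], [1.0, 4, 0.0]]
```

R's formulas also handle interactions, nesting, and transformations, but this column-expansion step is the core of what `model.frame`/`model.matrix` produce.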

Keep up the great work on Julia!

Matt

P.S. There appears to be a typo in the Standard Library Reference:

flipdim(A, d) — Reverse A in dimension d.
flipud(A) — Equivalent to flip(1,A).
fliplr(A) — Equivalent to flip(2,A).

Shouldn't "flip(1,A)" be "flipdim(A,1)"?

Jeff Bezanson

Feb 26, 2012, 2:34:42 PM
to juli...@googlegroups.com
Thanks for the typo catch.

I will repeat, if you define

bitstype 64 NAInt <: Integer

you can then overload operators for NAInt to respect the NA value, and
in all other respects this will be just as efficient as Int. And of
course Int performance won't change.

For fancier metadata like different kinds of missing values, I'd think
the most efficient approach is to have separate numeric arrays with
the metadata and use masking operations and such.

Stefan Karpinski

Feb 26, 2012, 8:00:35 PM
to juli...@googlegroups.com
Yes, SAS-style NA with arbitrary "missing reason" codes is really just sneaking an entire second column of metadata into your dataset, which to me makes much more sense to do explicitly. For just a single NA value, the ability to define your own new integer types is very powerful here.
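The "explicit second column" point can be sketched directly; the reason codes and names below are invented for illustration, not any real API:

```python
# Values plus a parallel metadata column of reason codes: exactly what
# SAS-style multi-NA smuggles into one column, kept explicit instead.
OK, REFUSED, DATA_ENTRY_ERROR = 0, 1, 2  # example reason codes

values  = [12.0, float('nan'), 30.5, float('nan')]
reasons = [OK,   REFUSED,      OK,   DATA_ENTRY_ERROR]

# Masking against the metadata: mean over the non-missing entries only.
present = [v for v, r in zip(values, reasons) if r == OK]
print(sum(present) / len(present))  # 21.25

# The reasons remain queryable separately, e.g. how many refusals:
print(reasons.count(REFUSED))  # 1
```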

James Bullard

Feb 27, 2012, 1:28:52 PM
to juli...@googlegroups.com
As a long-term R programmer who begrudgingly agrees with most of the astute criticisms Harlan made of R in his original email, I am really excited about the prospect of Julia becoming a replacement. For my money, R is invaluable for making good statistical graphics, both with the standard plotting packages and via ggplot2. As a bit of background, ggplot2 is a library by Hadley Wickham implementing the grammar of graphics, which makes plotting many variables at once feel natural. The price you pay is unacceptably slow performance.

I'm wondering where I can keep up with, comment on, and help with the development of native Julia plotting functionality. There are benefits to having plotting as a library, but what I really think is handy with R (as opposed to, say, Python) is that I can start R and just do:

plot(1:10)

immediate gratification. I think it is imperative that any technical computing language have this at the core. Julia's use of generics makes this type of functionality very similar to R's, except much more sound fundamentally, e.g.,

plot(lm(y ~ x))

works in R through a very arcane method of dispatch, but from a user's perspective it is nice - I can plot fits, I can plot matrices, I can plot vectors. The ability of a library writer to provide implementations of common generics greatly simplifies a user's interaction. For instance, if I implement some complicated statistical model, I'll provide functions like

residuals, coefficients, predict, ...

which really have meaning across broad classes of models and can make the exploration and use of different models near "pluggable".
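The pluggable-generics idea works in any language with dispatch; here is a minimal Python stand-in (class and function names are hypothetical) where any model exposing `coefficients`/`residuals`/`predict` works with the same downstream code:

```python
class LeastSquaresLine:
    """Fit y = a + b*x by ordinary least squares."""
    def __init__(self, x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        b = sxy / sxx
        self._coef = (my - b * mx, b)
        self._x, self._y = x, y

    def coefficients(self):
        return self._coef

    def predict(self, x):
        a, b = self._coef
        return [a + b * xi for xi in x]

    def residuals(self):
        return [yi - fi for yi, fi in zip(self._y, self.predict(self._x))]

def summarize(model):
    # Works on ANY model object exposing the same three methods:
    # swap in a different model class and nothing here changes.
    print("coef:", model.coefficients())
    print("max abs residual:", max(abs(r) for r in model.residuals()))

fit = LeastSquaresLine([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
summarize(fit)
```

`summarize` never mentions `LeastSquaresLine`; that decoupling is what makes models near "pluggable".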


Just some comments. Thanks for the really nice language and documentation; as I can see from the activity of this mailing list, there is definitely an itch to scratch.

jim

Stefan Karpinski

Feb 27, 2012, 3:31:07 PM
to juli...@googlegroups.com
Ggplot2 looks like a very promising source of good ideas for how to design our graphics API. I suspect we can probably make it much faster, too. One thing we need to consider that ggplot2 doesn't appear to (at a very cursory glance, admittedly) is interaction: we want users to be able to make interactive graphics. I really like D3 for web-based rendering, but generating JSON and SVG data directly at the Julia graphics API level is not the right level of abstraction, imo.

Harlan Harris

Feb 27, 2012, 3:38:40 PM
to juli...@googlegroups.com
James, thanks for weighing in! I think it should be just as easy, if not easier, to implement the plot(lm()) sort of thing in Julia as it is in R. It'll "just" be necessary to design the data structures and methods with a lot of thought.

I love ggplot2 and use it constantly. (In another window, this very minute, in fact!) It's an amazing way to iteratively explore your data and visualize it at the same time. However, I happen to know that Hadley is interested in dynamic/interactive graphics too, and at some point plans to get back to that. R is a pretty terrible platform to think about designing interactive graphics, though, which is probably why he hasn't tried.

Back on the NA topic, there's a github issue about this here: https://github.com/JuliaLang/julia/issues/470. It does not seem resolved at this point. We may need/want to try implementing and testing out a few different options to see what works best in practice...

 -Harlan

Nick

Feb 28, 2012, 5:30:47 AM
to julia-dev
Hi

Julia looks really exciting!

As someone who much prefers to use NumPy/SciPy for data analysis, I would agree that plotting at least similar to R's and Python's matplotlib is a must. A match for the IPython shell functionality would be nice too :) IPython and the nice Python language specifics are the single biggest reason I use Python rather than R or Matlab (apart from cost).

For R's data.frame, another potentially interesting source of
inspiration is Python's pandas library, which may perhaps give some
ideas on how to avoid shortcomings / improve functionality relative to
data.frame (https://github.com/pydata/pandas).

Look forward to seeing Julia grow and hope to get a chance to take it
for a real test-run soon.

Nick


Douglas Bates

Feb 28, 2012, 10:37:27 AM
to juli...@googlegroups.com
I have been following this thread with interest.  When I first discovered julia I immediately thought of it in the context of a statistical language like R (I am a member of the core development team for R and, before that, S).


On Sunday, February 26, 2012 12:26:58 PM UTC-5, Matt Shotwell wrote:
Here are some additional ideas/features where R may give some
inspiration:

1. sample quantiles: R's "quantile" function is mature.

2. quantile, density, and distribution functions: The standalone "Rmath"
library may be useful, if only for translation.

I think that would be a great idea.  With luck it should be possible to get a large part of the quantile, density, and distribution functions implemented easily.
 

3. formatted data i/o: R provides a series of read.* functions, for example, "read.csv". Memory mapping for big data would be nice also.

First up is getting a data-frame-like capability as all those functions create data frames.  It may be useful to get an opinion from John Chambers, the designer of the S language, on how he would design data frames now.  He has said that if he had it to do over again he would implement them differently.

Has there been any thought to interfacing with SQLite or a similar database library?

4. formulas, "~" operator: Formulas make it easy to create and manipulate design matrices, among other uses.

That would help a lot.  Almost all the model-fitting functions in R have, as the first two arguments, "formula" and "data" then go through calls to model.frame and model.matrix to set up the model matrices, response values, offset, etc.  Building a capability like that would be very helpful in porting over modelling capabilities.

5. optimal Lapack routines for symmetric and positive definite matrices:
It's tricky to balance this with the overhead (e.g., checking for
positive definiteness). The R "Matrix" package may be helpful.

Speaking as one of the authors of the R Matrix package I would probably not go down that path.  At the R level the Matrix package is based upon S4 classes and methods which would not be easy to reimplement.  These days I am concentrating more on Eigen (http://eigen.tuxfamily.org) for linear algebra through the RcppEigen package for R.  The motivation for the Matrix package was to provide dense and sparse matrix classes for use in fitting mixed-effects models in what is now the lme4 package.  The development version of lme4, called lme4Eigen, uses Eigen for linear algebra.

lme4Eigen, RcppEigen and Rcpp are all C++-based - which makes interfacing with julia a bit more complicated.  However, Eigen is a template library, meaning that it consists of templated header files, which makes handling different types much easier.  When compiling Lapack/BLAS you must decide if you are going to use 32-bit ints or 64-bit ints and then stay with that choice.

Especially with Eigen 3.1, which is now in alpha release, the handling of sparse matrices is very good - much easier to work with than sparseSuite - and it provides interfaces to the sparseSuite code and Pardiso/MKL BLAS, if you have a license for it.  There is also an OpenGL interface.

6. graphics - the notion of building graphics from basic shapes is
useful. It might be timely to consider a standardized framework for
graphics. For example, if there were a standardized abstract graphics
object (e.g., a layout, coordinate system, collection of shapes and
their attributes), the tasks of writing render/output methods (e.g., to
SVG) could then be taken up by non-Julia core developers.

Yes, there has been a huge amount of work by Paul Murrell on the grid package which is what makes the lattice and ggplot2 packages possible.  I would bypass the traditional R graphics system in favor of the grid-based approach.

Harlan Harris

Feb 28, 2012, 11:43:50 AM
to juli...@googlegroups.com
Prof. Bates, what a treat to have you join this discussion!

I'm curious what John Chambers has to say about data.frames. There are clear performance issues with them in some contexts - or does he mean that the syntax and interface for them should be better/different?

I also think that the library functions surrounding data.frame manipulation should borrow heavily from Hadley Wickham's incredibly powerful and intuitive plyr and reshape2 packages, and not so much from the core R libraries. Someone should sit down and think out an orthogonal Julia-ish design, rather than just porting blindly from R (or Pandas).

 -Harlan

Douglas Bates

Feb 28, 2012, 6:21:11 PM
to juli...@googlegroups.com
s/sparseSuite/SuiteSparse/

Fishtank

unread,
Feb 28, 2012, 6:22:21 PM2/28/12
to juli...@googlegroups.com
Hello,

A colleague of mine pointed this language out. Have spent the last hour reading the manual. A few suggestions from a daily R user (mostly for data manipulation):

1. The data frame is essential.
2. NA - tried getting away from it, but cannot. NA is essential for every type, though I don't see the need for a complex NA or an integer NA (at least visible to the user; how it's implemented internally is another matter). Also, there should be one for strings and booleans too.
3. The ability to subset vectors using TRUE and FALSE, e.g. a[x] where x is a vector of TRUE and FALSE values.
4. Vector boolean logic, e.g. df[ df$column1 & df$column2, ] subsets those rows for which column1 and column2 are TRUE.

(3) and (4) make using R really nice.
5. Graphics - grid is good. I'm not much of a fan of ggplot, only because lattice works so well for me. Once one groks lattice, it can really do everything. All are based on grid, so providing grid-like capabilities is most useful. But why not just interface to Qt graphics? It's probably an uphill task, but I'm not talking about PyQt, rather graphics primitives based on a Qt library.
6. Formula syntax and types
7. I didn't see default arguments in function definitions - did I miss it?
8. Having lm (and others) return an object which can be plotted, or have residuals extracted from it, or be treated in many other ways, is extremely useful to the data analyst (DA). It may be inefficient from a programming standpoint, but R is first and foremost a DSL used by DAs to analyze data, and anything that makes that easier is 'good'.
9. Is there a concept of a package/module/library system? I see it mentioned in the Potential Features, and with documentation at the level of Python's, hunting for functions in modules won't be a problem.
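Items (3) and (4) above can be shown in a couple of lines of R — logical vectors used directly as indices:

```r
# Item 3: subset a vector with a logical mask
a <- c(10, 20, 30, 40)
x <- c(TRUE, FALSE, TRUE, FALSE)
a[x]    # 10 30

# Item 4: vectorized boolean logic to subset data frame rows
df <- data.frame(column1 = c(TRUE, TRUE, FALSE),
                 column2 = c(TRUE, FALSE, FALSE))
df[df$column1 & df$column2, ]   # only the rows where both columns are TRUE
```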

I agree with a point made above, that to a certain extent it's the libraries and not the language that make R popular. But things like NA, data frames, and the S3 class system (S4 is tedious to code for) do contribute to its ease of use.

I like the simple syntax. Very nice stuff. Hope something improves upon R - it more often than not makes me upset.
Cheers
Sapsi

P.S. Can we have Haskell/Mathematica-style pattern matching? It's so much fun.

Jeff Bezanson

unread,
Feb 28, 2012, 7:04:54 PM2/28/12
to juli...@googlegroups.com
Good stuff to think about. I only have time to respond to a bit of it now:

#3 we definitely have
#4 is easy with indexing but the syntax is a bit different

Not 100% sure what #6 means, but we have quoted expressions, i.e.
":(x+y)" gives you a data structure representing that symbolic
expression.

#7 - Currently you can simulate optional arguments easily with
multiple dispatch:
f(x, y) = 1
f(x) = f(x, 0)

We use this all over the place. But, some more compact syntax for that
might be nice.

We're not the kind of language that has multiple object systems. We
have one powerful system.

Our philosophy is to push almost everything to libraries, because
people will not all agree. Some people want bigints, some people might
want saturating arithmetic, some people want NA, other people want
performance at any cost. Boolean is a core concept (e.g. for picking
which branch to take), so we can't complicate it with issues like
missing data. But a separate BoolNA type is of course fine.
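A separate NA-aware boolean along the lines Jeff suggests could look roughly like this in today's Julia (BoolNA here is a hypothetical sketch, not an existing type):

```julia
# Hypothetical NA-aware boolean; `nothing` plays the role of NA.
struct BoolNA
    value::Union{Bool, Nothing}
end

isna(b::BoolNA) = b.value === nothing

# In this sketch NA simply propagates through & (R is slightly cleverer:
# NA & FALSE is FALSE, since the result is known regardless of the NA).
Base.:&(a::BoolNA, b::BoolNA) =
    (isna(a) || isna(b)) ? BoolNA(nothing) : BoolNA(a.value & b.value)
```

So `BoolNA(true) & BoolNA(nothing)` is NA, while `BoolNA(true) & BoolNA(false)` carries the value `false`, and plain Bool stays uncomplicated for branching.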

Stefan Karpinski

unread,
Feb 28, 2012, 7:33:24 PM2/28/12
to juli...@googlegroups.com
#6 means that R has a syntax for writing "formulas" of variables. Personally, I never really grokked it. Maybe someone here with more R expertise can explain how it works.

Saptarshi Guha

unread,
Feb 28, 2012, 9:16:10 PM2/28/12
to juli...@googlegroups.com
It's a way of writing statistical models (or dependency structures). E.g., if y depends on x and y and their interactions:

y ~ x+y+x:y

Lattice also uses this to determine how to display panels.

This can be separated out into a library/module, so the user could write something like

F( FML( y~ x+y+x:y), ...)

and FML could parse the formula.

So it's not a deal breaker. I cannot answer how it is parsed internally.
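A sketch of the FML idea in Julia, using quoted expressions (the name fml and the returned fields are hypothetical):

```julia
# Split a quoted formula :(y ~ rhs) into response and predictors.
# In Julia, `~` parses as an ordinary call, so the pieces are in ex.args.
function fml(ex::Expr)
    ex.head == :call && ex.args[1] == :~ || error("not a formula")
    (response = ex.args[2], predictors = ex.args[3])
end

f = fml(:(y ~ x + z))
# f.response is :y; f.predictors is the unevaluated :(x + z)
```

Note that Julia gives `:` (the range operator) a different precedence than R's formula mini-language, so interaction terms like x:y would need their own handling.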

Q: So for a string with NA, I would have a type (like a union) which is either a String or a Boolean (indicating NA)?

Regards
Sapsi

Harlan Harris

unread,
Feb 28, 2012, 9:17:09 PM2/28/12
to juli...@googlegroups.com
R has lazy evaluation, which gets used/abused a number of ways in the language and by package authors. Formulas are (more or less, I think...) essentially just bits of syntax that get parsed into an AST but never evaluated, with several special operators (~, :, |). The functions can then take the AST apart and use it for various tasks, such as constructing model matrices in regression models.

Here's an example of how it's used:

fit <- lmer(y ~ a + b + (1 | c), data)

The first argument isn't evaluated (I think -- or if it is, I think a formula just returns itself). For Julia, which presumably will not have lazy evaluation, it probably makes sense to explicitly quote formula expressions. Something like:

fit = lmer(:(y ~ a + b + (1 | c)), data)

Which would presumably parse into something like:

(~, y, +(+(a, b), (|, 1, c)))

Which could then be used as needed to define a model matrix or whatever.
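As a quick check against current Julia syntax, that quoted formula already parses into essentially the structure sketched above (a sketch; Julia's operator precedences differ somewhat from R's formula mini-language):

```julia
# `~` parses as an ordinary call, so the pieces are easy to pull apart
ex = :(y ~ a + b + (1 | c))

ex.head       # :call
ex.args[1]    # :~
ex.args[2]    # :y  (the response)
ex.args[3]    # the unevaluated right-hand side, a + b + (1 | c)
```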

That would require defining the operators. What restrictions does the lexer put on Julia operators? How do you define a new one?

 -Harlan

Jeff Bezanson

unread,
Feb 28, 2012, 9:27:21 PM2/28/12
to juli...@googlegroups.com
Ok, I see.

We don't have support for user-defined operators in the lexer (i.e.
declaring that some sequence of characters is to be parsed as an
operator). But, I'm happy to add support for extra operators in the
lexer as long as it doesn't clash with anything.

As a language enthusiast I have to point out that lazy evaluation and
accessing the parse tree are totally orthogonal --- you can easily
have one without the other. Using lazy evaluation to achieve parse
tree access works, but is very strange.

Viral Shah

unread,
Feb 29, 2012, 7:04:14 AM2/29/12
to juli...@googlegroups.com
Hi Doug,

I think getting the library is relatively well-understood - quantiles, density, distributions, database hookups, formatted i/o (csvread already exists), matrix stuff, and even graphics.

Not being a regular user of R, the real question that I haven't quite understood is the embedding of data frames so deeply into the R language. What makes them so essential? Is it important for julia to have them exactly the way they are in R, or can we do things better today? Why is Matlab's statistics toolbox not sufficient from a design standpoint, for example?

You mention that John Chambers has talked about doing it differently if he were to start afresh. Would you be able to help with driving some of these design decisions relating to statistical programming? Are there key papers that describe the design decisions behind S? If we can get the design right, then getting the functionality will be relatively straightforward.

-viral

Andrei Jirnyi

unread,
Feb 29, 2012, 11:56:57 AM2/29/12
to julia-dev
On Feb 25, 12:22 pm, Harlan Harris <har...@harris.name> wrote:
> several mostly-terrible bolted-on OOP extensions. Everything is a vector.

I actually like the R version of OOP a lot :) particularly S3.
The best part is transparency: you don't need to know anything about
OOP to use the interfaces - plot(lm(y~x)) just works. Implementing
things is rather straightforward too, even for people who have never
seen any other programming language. I like S-Plus even more, though
(I really miss detach(save=T) in R).
The main thing that really handicaps R imo is the horrible
documentation (whoever came up with the standard package documentation
format must have been on some kind of a sabotage mission).

That being said, getting Julia to a place where it is comparable with
R in terms of functionality would be rather tricky, honestly I'd say
impossible -- R itself is a very quickly moving target, with packages
being contributed all the time (e.g. compare the number of
user-written statistics packages for R to that for Matlab, and Matlab has
many more users than Julia!) Also, now we are seeing folks like SAS
and even some database vendors adding functionality to run R code from
within -- I'd suspect it will be a long time before Julia is in that
position.

Going after the Matlab statistics toolbox users seems like a much more
realistic objective, as functionality is much more limited there, and
it is a much slower-moving target to keep up with.

Adding Julia <-> R interop would immediately add a lot of value, too,
as would a better equivalent to the data frame object (Matlab is
rather clunky here). Having a nice time series object would also be
pretty useful, although this one is tricky to get right ("xts" package
in R is quite decent though).

--aj

Viral Shah

unread,
Feb 29, 2012, 12:10:34 PM2/29/12
to juli...@googlegroups.com
Matching R function for function is not fun, almost impossible, and should certainly not be our goal. What we need to figure out is the essence of R's design for statistics usage, and build it into julia.

I guess one could call R from julia at some point, but that leads to a system that is almost impossible to debug. That has never stopped anyone though!

Apart from the interop, what you are suggesting is that data frame objects and time series objects would be nice to have. I'll try to put a wiki page together combining all the suggestions from this thread, and perhaps a design and target will emerge out of that.

-viral

R. Michael Weylandt

unread,
Feb 29, 2012, 12:24:31 PM2/29/12
to juli...@googlegroups.com
What's the knock on the R documentation system? As a system, I
actually quite like it: being able to type

? name.of.function

and have text-based documentation (in the form of a man page)
automatically appear in the same window seems pretty ideal. To be
able to browse the same documentation in hyper-linked form through a
browser is just a lovely bonus. For those who don't know, the system
requires the developer to provide documentation for every user-facing
(exported from the namespace) function in an R-specific LaTeX-esque
format that allows inter-help links and runnable examples which can
also be called by running example(name.of.function) from the prompt.

I agree the R help pages can often be too terse (I generally prefer
that style, but I recognize I'm in a small minority there) and aren't
the most user-friendly, but the *system* itself seems to be one of the
best I've used. I certainly prefer it to Python's or Perl's; Matlab is
similar in that one can call documentation from the help line with
many of the same features, but the virtue there seems to be in the
extent, not the implementation. Julia, much further down the road,
could be well served by a parallel system.

I'm a relatively active R user -- though certainly not on Dr. Bates'
level -- and once I get a little more acquainted with Julia, I'd be
more than happy to help develop some first-order statistical tools or
provide one user's feedback on the good bits and bad bits of R's data
structures.

A few steps back in the thread there was a question about R's formula
interface: for those who haven't played with it, I'll give a quick
intro. Most R modelling functions can be run with the following
syntax:

model.name( y ~ x, data = data.set.name)

What this does (in short) is to create a new environment in which the
formula terms are treated as names for vectors which get their values
from the columns of the data frame -- the function then creates the
model as appropriate. The syntax is relatively straightforward:

~ means "based on / as a function of / responding to"
+ means "add this term"
- means "leave this one out"
. means "all other terms not accounted for (a wildcard)"
| is used for nesting.
* and : are used for interaction terms.

The power of this interface is that it abstracts from the user the
nature of these variables: if I were to change x from continuous to
categorical, my formula wouldn't have to change at all: the model
matrix is set up behind the scenes (in some of the most difficult R
code to grok). If I add a variable to my data set and I have the
wild-card in, it automatically adapts and if I don't it nicely ignores
it. To change the kind of model, it suffices to change the function
name (e.g., lm to glm) while the formula can proceed unchanged even
though the computation is almost entirely different.
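The invariance described above is easy to see on a tiny made-up data set: the formula stays the same whether x is continuous or categorical, and R rebuilds the model matrix behind the scenes.

```r
# Same formula, different variable type, different model matrix
d <- data.frame(y = c(1.2, 2.3, 2.9, 1.1, 2.4, 3.1),
                x = c(1, 2, 3, 1, 2, 3))

names(coef(lm(y ~ x, data = d)))   # "(Intercept)" "x"      -- one slope
d$x <- factor(d$x)
names(coef(lm(y ~ x, data = d)))   # "(Intercept)" "x2" "x3" -- contrasts
```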

This idea gets pushed even further in data aggregation / reshaping
functions. To take an example from R-help the other day, a poster had
a data set of three variables: a continuous response (y) and two
categorical predictors (x1, x2). He wanted to group by both of these
categories and calculate subset medians; in R, this was as easy as
aggregate(y ~ x1 + x2, data = dat, FUN = median), which one could
read as: using the data set "dat", aggregate y based on x1 and x2 and
apply the median function to each group. To get medians based on just
x1 or x2, it's as easy as dropping the other term.
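The aggregate() call described above, made runnable on a small synthetic data set:

```r
# Group y by both categorical predictors and take the subset medians
dat <- data.frame(y  = c(1, 2, 3, 4, 5, 6),
                  x1 = c("a", "a", "b", "b", "a", "b"),
                  x2 = c("u", "v", "u", "v", "u", "v"))

aggregate(y ~ x1 + x2, data = dat, FUN = median)
# one row per (x1, x2) combination, e.g. (a, u) -> median(c(1, 5)) = 3
```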

It would be very powerful to adopt this interface -- and perhaps Julia
could do it more cleanly in the model matrix code, since backward
compatibility doesn't have to be a consideration -- but I think
it's R at its best (unlike drop = TRUE or stringsAsFactors =
TRUE... those who've used R can feel my scowl coming through).

Thanks for the great project! I'm looking forward to having a little
more time to play with Julia and hopefully (when the semester cools
down) to contribute to an emergent statistical library.

Best,

Michael

Won g'Faere

unread,
Feb 29, 2012, 3:23:09 PM2/29/12
to juli...@googlegroups.com
On Wed, Feb 29, 2012 at 12:10 PM, Viral Shah <vi...@mayin.org> wrote:
Matching R function for function is not fun, almost impossible, and should certainly not be our goal. What we need to figure out is the essence of R's design for statistics usage, and build it into julia.

I guess one could call R from julia at some point, but that leads to a system that is almost impossible to debug. That has never stopped anyone though!

Apart from the interop, what you are suggesting is that data frame objects and time series objects would be nice to have. I'll try put a wiki page together combining all the suggestions from this thread, and perhaps a design and target will emerge out of that.


The popular time series classes in R are not part of the core of R but are user-level packages, and in Python pandas (which corresponds to R data frames) is also a user-level package, so these items don't necessarily have to be part of the language proper.  Similarly, no special syntax is really needed for formulas (R has special syntax but it's not really so important), so this could be pushed out of the core as well.

For the language it's probably best to stick to items that don't require domain knowledge and just make sure that the basics are in place and that it's easy for those with domain-level knowledge to add the requisite features through add-on packages.  It's too bad if all they do is replicate R functionality, but at least that uses up their time and not the time of the core group pushing the language forward.  Furthermore, it insulates the language against design errors and moving in non-productive directions, as it's easier to replace an add-on package than to change the language.


Andrei Jirnyi

unread,
Feb 29, 2012, 3:46:29 PM2/29/12
to julia-dev
On Feb 29, 11:24 am, "R. Michael Weylandt"
<michael.weyla...@gmail.com> wrote:
> browser is just a lovely bonus. For those who don't know, the system
> requires the developer to provide documentation for every user-facing
> (exported from the namespace) function in a R-specific LaTeX-esque
> format

Yes, this is true; and I am wondering if it hurts much more than it
helps. For a large package, there can be hundreds of functions -- the
vast majority of them being very special-purpose helper functions,
example datasets, and such. The documentation PDFs just have them all
exposed, listed alphabetically, which typically results in 100+
pages of junk, only 2 or 3 of them being of any use. Most packages
have no other documentation -- "vignettes" are nice but rare. Just
putting a single page at the beginning listing what is actually
important would be a big deal, and not that much to ask from the
authors.

"?function" at the prompt is nice, but every language has that.
Finding functions that fit a particular purpose is not trivial, though
-- if you have a good way of doing that, please let me know, because I
surely don't... Even Stata's [search] (which searches for a term in
all function descriptions, first in local packages, then in the
contributed packages in the repository) is somewhat helpful. Is there
an R equivalent?

--aj

Douglas Bates

unread,
Feb 29, 2012, 4:47:16 PM2/29/12
to juli...@googlegroups.com
On Tuesday, February 28, 2012 6:04:54 PM UTC-6, Jeff Bezanson wrote:
Good stuff to think about. I only have time to respond to a bit of it now:

#3 we definitely have
#4 is easy with indexing but the syntax is a bit different

Not 100% sure what #6 means but we have quoted expresions, i.e.
":(x+y)" gives you a data structure representing that symbolic
expression.

That would probably be sufficient for implementing a lot of the formula language.  In R the '~' symbol, which is read as "is modeled by", is another operator that protects the expression from being evaluated and returns it as an object of class "formula" (there's a bit more to it than that, but that's the basics).  It can be used in monadic or dyadic form. 

The real work with formulas is done in the functions model.frame and model.matrix.  The model.frame function extracts only the variables from the data frame that are used in the formula and provides some information on the types of variables.  It also evaluates function calls that are not part of the formula language.  So if you create a model frame from the formula log(y) ~ sqrt(x) + z the result will have columns 'log(y)', 'sqrt(x)' and z.  This is also where the various na.action specifications are applied, which in most cases means removing any rows of the frame that contain missing values and storing a bit of information to be able to take vectors of the reduced size and expand them to the original size.  Another typical argument is "subset" which applies a logical expression to the frame to generate a subset of the rows.

After the model frame is created the next call is usually to model.matrix which interprets the formula mini-language to generate the columns of a model matrix.  It is at this point that factors are expanded into a set of columns called the "contrasts" for the factor.  Exactly which set of contrasts is used depends on other parts of the formula.  So this step involves both symbolic and numeric evaluations, similar to what happens in many sparse matrix methods.  Especially when you have categorical variables (factors in R) in the formula you must be careful about expanding them to indicator columns or you can end up with a rank-deficient model matrix.  The symbolic analysis doesn't guarantee that you won't have a rank-deficient model matrix but it works nearly all the time.
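As a concrete illustration of the model.frame step just described, on a small made-up data frame:

```r
# model.frame evaluates the function calls in the formula and keeps
# only the variables involved; column names are the deparsed expressions.
d  <- data.frame(y = c(2.7, 7.4, 20.1), x = c(4, 9, 16), z = c(1, 0, 1))
mf <- model.frame(log(y) ~ sqrt(x) + z, data = d)

names(mf)    # "log(y)" "sqrt(x)" "z"
```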

This formula/data specification is a surprisingly compact and flexible way of generating the model matrix.  The idea was first introduced in a statistical package called GLIM from the UK and has been extended and enhanced in S, SAS and R.

I haven't looked at the statistics toolbox in Matlab but I imagine that the user must generate the model matrix by hand.  This is a tedious and error-prone process.  Once you have been able to use a formula language you find it awkward to go back.

This is also why the data.frame structure is so widely used.  Basically this is a vector of arbitrary named vector-like structures, such as numeric vectors or factors or integer vectors, subject to the constraint that they must all have the same length.  It's similar to a relation in a relational database.  I'll probably slip at some point and say that this vector of vectors is a list of vectors, because the term "list" in R refers to a vector structure that can hold arbitrary structures.  (There are also "pairlist" objects that are Lisp-like lists with cons, cdr, etc., but they are, for the most part, not visible to users.)  So just be cautious that a list in R means a potentially recursive heterogeneous array.

The actual implementation of a data.frame is as a list (in the R sense) of objects that all must have the same length.  They can be indexed as [i,j] like a matrix but they also have a column-wise accessor which is [[i]].  The double bracket is another token in R.  The difference between the double bracket and a single bracket is that a single bracket extractor

myframe["foo"]

applied to a list always returns a list.  The double bracket can only be used with a single index and extracts the constituent vector (in general, element of the list).  The distinction is like that between a subset of size 1 (the single bracket) and the element itself (the double bracket).

The $ operator in R is syntactic sugar for the double bracket extractor with a literal string.  So

myframe[["foo"]]

can be written as

myframe$foo

Data frames can also be attached to the search path, although that practice is discouraged now.  The preferred approach is to use a function called "with" as in

with(myframe, mean(foo))

which has the effect of putting the names of the columns in the frame at the front of the search path during the evaluation of the expression then removing them.
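The accessors described above can be seen side by side on a small frame (a runnable sketch):

```r
# Single bracket, double bracket, $, and with(), side by side
myframe <- data.frame(foo = c(1, 2, 3), bar = c("a", "b", "c"))

myframe["foo"]            # single bracket: a one-column data frame (a list subset)
myframe[["foo"]]          # double bracket: the numeric vector itself
myframe$foo               # sugar for myframe[["foo"]]
with(myframe, mean(foo))  # evaluates mean(foo) with the columns in scope
```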

I enclose a brief example.  Let me know if you would like more details or examples.  I haven't gone over the manual sufficiently closely to decide if there is an arbitrary collection data type like the "list" structure in R.  As for formulas, essentially you just need the parse tree for the expression and some way of flagging it as a formula.  All the rest is done in the functions that manipulate formulas.




formula_Rout.txt

Douglas Bates

unread,
Feb 29, 2012, 5:11:28 PM2/29/12
to juli...@googlegroups.com
On Wednesday, February 29, 2012 6:04:14 AM UTC-6, Viral Shah wrote:
Hi Doug,

I think getting the library is relatively well-understood - quantiles, density, distributions, database hookups, formatted i/o (csvread already exists), matrix stuff, and even graphics.

Not being a regular user of R, the real question that I haven't quite understood is the embedding of data frames so deeply into the R language. What makes them so essential? Is it important for julia to have them exactly the way they are in R, or can we do things better today? Why is Matlab's statistics toolbox not sufficient from a design standpoint, for example?


I replied to a later message indicating that the data frame is a useful way of expressing a table like that of a relational database or a SAS or SPSS data set (SAS and SPSS definitely not being languages to look to for inspiration).

 

You mention that John Chambers has talked about doing it differently if he were to start afresh. Would you be able to help with driving some of these design decisions relating to statistical programming? Are there key papers that describe the design decisions behind S? If we can get the design right, then getting the functionality will be relatively straightforward.

I think he was considering more the implementation than the design per se.  It happens that John is now at Stanford, http://stat.stanford.edu/people/faculty/chambers/index.html, so I mentioned Jeff's talk to him and he may attend.