Syntax for specifying statistical models

Doug Johnson

Apr 6, 2012, 4:14:31 PM
to juli...@googlegroups.com
An earlier post discussed what features from R Julia should adopt and specifically mentioned R formulas.  One area where it seems like Julia could improve on R is in the way statistical models are specified.  R formulas are great for specifying single equation models but there is no standard syntax in R for specifying more complicated statistical models.  That means that when you are using a package which deals with more complicated statistical models you are forced to learn that package's specific syntax.
I raise this because I think there would be big gains from tackling the question early in Julia's lifecycle, before developers create their own custom syntaxes. Designing a syntax for more complicated statistical models is a little tricky, since it should cover as many of the commonly used models as possible without being too complex or giving users too much flexibility. I'm not a statistician, so I'm not really qualified to answer this question, but here are a couple of other people's attempts at it:
1. A general framework for specifying (somewhat) complicated statistical models in R: http://gking.harvard.edu/files/z.pdf
2. Comments on the BUGS framework for specifying statistical models: http://www.stat.columbia.edu/~gelman/research/published/bugsnext2.pdf

John Myles White

Apr 6, 2012, 4:18:56 PM
to juli...@googlegroups.com
I am a major believer in BUGS, but I think BUGS is really an independent language rather than a syntactical convenience like R's formulas. Adding BUGS capability to Julia seems like a major undertaking.

For those interested in using BUGS, I'm probably going to write a wrapper to send Julia data to JAGS soon. How quickly I do it depends largely on the complexity of the code for 'rjags', which I'll simply translate into Julia.

 -- John

Harlan Harris

Apr 6, 2012, 4:38:36 PM
to juli...@googlegroups.com
I tend to agree with John about BUGS/JAGS. But Doug, yes, someone at some point should think through a Julian way to specify linear and related models. One of the real tricks with trying to think about translating R idioms to Julia is that R relies extensively on lazy evaluation, while Julia will evaluate any expression you pass in to a function. So a design like:

lm(y ~ a + b, df)

won't work, because the expression gets evaluated, which defeats the purpose. You can maybe do this, though:

function lm(formula::Expr, dat::DataFrame) ...

lm(:(y ~ a + b), df)

which is not as nice as in R. But you do have to talk one of the core devs into giving you ~ as an operator, which they may be reluctant to do. I'm almost tempted to suggest that formulae should just be non-standard strings! (http://julialang.org/manual/metaprogramming/) Someone could write a special-purpose parser that doesn't have to worry about conflicting with existing operators or whatever.

lm(f"y ~ a + b", df)
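
A minimal sketch of the quoted-expression approach, assuming Julia 1.x, where `y ~ a + b` parses as an ordinary call to `~`. The helper names `formula_terms` and `collect_symbols` are hypothetical and only show that a function receiving an `Expr` can walk it without evaluating anything:

```julia
# Hypothetical helper: check that an expression has the form `lhs ~ rhs`
# and return the response together with the predictor names.
function formula_terms(formula::Expr)
    is_formula = formula.head == :call &&
                 length(formula.args) == 3 &&
                 formula.args[1] == :~
    is_formula || error("expected an expression of the form lhs ~ rhs")
    return formula.args[2], collect_symbols(formula.args[3])
end

# Recursively gather the symbols on the right-hand side.
collect_symbols(s::Symbol) = [s]
collect_symbols(x) = Symbol[]            # numbers and other literals
function collect_symbols(ex::Expr)
    ex.head == :call || return Symbol[]
    # args[1] is the operator name (e.g. :+); the rest are its operands.
    return reduce(vcat, map(collect_symbols, ex.args[2:end]); init=Symbol[])
end

# The variables y, a and b never need to exist; only their names are used.
response, predictors = formula_terms(:(y ~ a + b))
```

A real `lm` would go on to look these names up as columns of the data frame; that part is left out here.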

In any case, the syntax(es) for hierarchical models and multivariate models and such could use a ground-up review, given a few decades of progress in the field since R/S were developed. And if you're not relying on Julia's lexer/parser, you'd probably make some drastically different choices...!

 -Harlan

Jeff Bezanson

Apr 6, 2012, 4:43:15 PM
to juli...@googlegroups.com

Probably happy to make ~ an infix operator since that doesn't conflict. Making it syntactic is also possible but more controversial.

Kevin Van Horn

Apr 7, 2012, 1:12:04 AM
to juli...@googlegroups.com
I would really like to have ~ as an infix operator, too. That would allow one to create something like BUGS/JAGS as an embedded language in Julia -- you'd have some sort of @model macro whose body would specify the conditional distributions; it would evaluate to some sort of abstract syntax tree. Then you could do things like pass the AST to a routine that does automatic differentiation to generate Julia code to compute gradients, or that applies various rules to generate MCMC updates, etc.
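
One way such a macro might be sketched, assuming Julia 1.x (where `~` is already an infix operator). The macro below is hypothetical: it only records each `name ~ distribution` line as data and does no inference. `Normal` is never evaluated, so it does not need to be defined:

```julia
# Hypothetical sketch: capture each `name ~ distribution` statement of the
# body as a (name, expression) pair instead of executing it.
macro model(body)
    stmts = Tuple{Symbol,Expr}[]
    for ex in body.args
        ex isa Expr || continue          # skip LineNumberNodes in the block
        if ex.head == :call && length(ex.args) == 3 && ex.args[1] == :~
            push!(stmts, (ex.args[2], ex.args[3]))
        end
    end
    # Wrap the result in a QuoteNode so the caller gets the pairs as a value.
    return QuoteNode(stmts)
end

m = @model begin
    mu ~ Normal(0, 10)
    y ~ Normal(mu, 1)
end
```

The resulting list of pairs is the kind of structure that could then be handed to a routine that generates gradient code or MCMC updates.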

While we're on the topic, are there any plans to allow user-defined operators? (E.g., specifying that a certain character sequence is a left-associative binary operator of a given precedence.) These can be very useful to make code more readable, especially if you start doing functional-style programming (e.g., the reverse composition operator '|>' used in F#); you can't expect the language designers to think of all the useful operators in advance.
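
For reference, a small sketch of what this looks like in Julia 1.x, as far as I know: the parser accepts a fixed set of operator symbols (many of them Unicode) with predetermined precedence, and users can attach methods to them, but cannot declare new operator tokens or precedences. The choice of `⊕` below is arbitrary and purely illustrative:

```julia
# `|>` is defined in Base: x |> f means f(x).
smallest = [3, 1, 2] |> sort |> first

# Defining a method for an operator symbol the parser already knows
# makes it usable in infix position.
⊕(a::Tuple, b::Tuple) = map(+, a, b)
total = (1, 2) ⊕ (3, 4)
```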

Vitalie Spinu

Apr 7, 2012, 2:32:29 AM
to juli...@googlegroups.com
>>>> Jeff Bezanson <jeff.b...@gmail.com>

>>>> on Fri, 6 Apr 2012 16:43:15 -0400 wrote:

> Probably happy to make ~ an infix operator since that doesn't conflict.
> Making it syntactic is also possible but more controversial.

Please don't be in a hurry here. Julia has something that R never had
and that R users are not accustomed to: macros.

Statistical languages like BUGS should be implemented as a
microlanguage, similar to the Common Lisp loop macro. Formulas are
rudimentary, 20-year-old constructs, good only for specifying simple,
one-line models.

Many other much-praised features of R are old and no longer correspond
to current needs. Please don't be in a hurry to take things from R as
they are, especially at the expense of modifying the language.

Vitalie.

>>> formulas are great for specifying single equation models but there is no standard syntax in R for specifying more complicated statistical models.

Vitalie Spinu

Apr 7, 2012, 2:45:27 AM
to juli...@googlegroups.com
>>>> Vitalie Spinu <spin...@gmail.com>

>>>> on Sat, 07 Apr 2012 08:32:29 +0200 wrote:

>>>> Jeff Bezanson <jeff.b...@gmail.com>
>>>> on Fri, 6 Apr 2012 16:43:15 -0400 wrote:

>> Probably happy to make ~ an infix operator since that doesn't conflict.
>> Making it syntactic is also possible but more controversial.

> Please don't be in a hurry here. Julia has something that R never had
> and that R users are not accustomed to: macros.

> Statistical languages like BUGS should be implemented as a
> microlanguage, similar to the Common Lisp loop macro. Formulas are
> rudimentary, 20-year-old constructs, good only for specifying simple,
> one-line models.

I forgot to mention that manipulating R formulas programmatically is
hell, as one needs text parsing to construct them.
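
By contrast, a formula held as a Julia expression can be assembled from parts with no text parsing. A minimal sketch, assuming Julia 1.x parsing of `~` as a call; the variable names are illustrative only:

```julia
# Build `y ~ a + b + c` from a list of predictor names.
predictors = [:a, :b, :c]
rhs = Expr(:call, :+, predictors...)     # a + b + c
formula = Expr(:call, :~, :y, rhs)       # y ~ a + b + c

# Interpolation into a quoted expression builds the same tree.
same = :(y ~ $rhs)
```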

Harlan Harris

Apr 7, 2012, 8:44:19 AM
to juli...@googlegroups.com
Vitalie, can you give an example of how you're thinking about using macros to describe statistical models? It's not 100% obvious to me what you mean by that. Thanks!

Vitalie Spinu

Apr 7, 2012, 10:40:40 AM
to juli...@googlegroups.com
>>>> Harlan Harris <har...@harris.name>

>>>> on Sat, 7 Apr 2012 08:44:19 -0400 wrote:

> Vitalie, can you give an example of how you're thinking about using macros
> to describe statistical models? It's not 100% obvious to me what you mean
> by that. Thanks!

Well, obviously I am not 100% sure either. The main problem in a stat
language is how you cleanly express hierarchy and grouping.

Just to give a visual feeling of a "microlanguage", here is how an
example of a normal hierarchy with heterogeneity across group1 and
group2 might look:

@model with data MYDATA
  for vars
    a ~ rvNorm mu = 0 sigma = 1 by GROUP1
    s ~ rvGamma alpha = 1 beta = 1 by GROUP2
    b ~ rvNorm mu = a sigma = s
    Y ~ rvNorm mu = b*X sigma = 3
  do
    fun = s + 3
    precision = 1/s

  return [a, b, precision, fun]
@end

Without an infix "~" operator in Julia, it might look like:

@model with MYDATA
  for a::rvNorm{0,1} by GROUP1
      s::rvGamma{alpha, beta} by GROUP2
      b::rvNorm{a,s}
      Y::rvNorm{b*X}
  do begin
    fun = s + 3
    precision = 1/s
  end
  return [a, b, precision, fun]
@end

MYDATA contains X, Y, GROUP1, GROUP2. The last two are factors in the R sense.

The above is sort of vectorized, R-ish thinking; something closer to
BUGS syntax might be more appropriate for Julia.

I was trying to get something along these lines done for R a year or so
ago (https://github.com/vitoshka/pbm). I will resume working on it
pretty soon; I still hope to get something out of the idea.
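
The individual `name::Dist{...}` declarations in the second sketch are already valid Julia syntax, so a macro could take them apart without a custom parser (the surrounding `with`/`by`/`do` keywords are not, and would need other handling). A rough illustration, with a hypothetical helper `parse_decl`:

```julia
# Hypothetical helper: split `name::Dist{p1, p2, ...}` into its parts.
function parse_decl(ex::Expr)
    ex.head == :(::) && length(ex.args) == 2 ||
        error("expected name::Dist{...}")
    name, spec = ex.args
    spec isa Expr && spec.head == :curly ||
        error("expected Dist{...} after ::")
    # spec.args[1] is the distribution name; the rest are its parameters.
    return (name = name, dist = spec.args[1], params = spec.args[2:end])
end

# Nothing is evaluated: rvNorm, a and s are just names in the tree.
d = parse_decl(:(b::rvNorm{a, s}))
```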


> On Sat, Apr 7, 2012 at 2:32 AM, Vitalie Spinu <spin...@gmail.com> wrote:
>> >>>> Jeff Bezanson <jeff.b...@gmail.com>
>> >>>> on Fri, 6 Apr 2012 16:43:15 -0400 wrote:
>>
>> > Probably happy to make ~ an infix operator since that doesn't conflict.
>> > Making it syntactic is also possible but more controversial.
>>
>> Please don't be in a hurry here. Julia has something that R never had
>> and that R users are not accustomed to: macros.

To be clear, I don't mind an infix ~ operator at all. I am against
giving it the quoting meaning by default (as in R). But this is
apparently also what Jeff meant.


Best,
Vitalie.

Harlan Harris

Apr 7, 2012, 11:00:58 AM
to juli...@googlegroups.com
Oh, interesting, thank you! We're actually talking about almost the same thing, I think. You're proposing using macros to define a new microlanguage/DSL, which can use the Julia lexer but with a new parser/syntax. The way that Julia implements non-standard string literals (if I understand the docs correctly) would be to interpret something like f"y ~ a + (1 | b)" by using a macro with the form:

macro f_str(x)
  Formula(x)
end

Which could do anything we want it to. It looks like you actually want more than just the formulas in the DSL, right? You're putting the specifications for the data to use and some other aspects of the model in the DSL too.
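
A self-contained version of the string-macro idea, as a sketch only. It assumes Julia 1.x and a hypothetical `Formula` type that simply stores the text; a real implementation would parse the string inside the macro:

```julia
# Hypothetical container for a formula; here it just keeps the raw text.
struct Formula
    text::String
end

# For f"...", Julia calls the macro f_str with the literal's contents
# as a plain String, before the surrounding code runs.
macro f_str(x)
    Formula(x)
end

fm = f"y ~ a + (1 | b)"
```

Because the macro sees only a string, the formula syntax is free to differ from Julia's own operators and precedence rules.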

Presumably you'd want to link this low-level syntax to a higher-level syntax used by end-users, who may just want to do a linear regression or something, and not have to define everything in terms of the underlying math.

Also, presumably whatever model-building DSL emerges should be useful for specifying algorithmic models as well -- decision trees and SVMs and so forth.

It's also worth noting that people use formula notation in R to specify things that have nothing to do with statistical models. Hadley likes to use them for things like facet_grid(x ~ y) in ggplot2, and cast(a ~ b + c ~ d) in reshape2. This may be ill advised, but it's worth noting.

 -Harlan