It would be worthwhile to study the design of R's libraries and use that to implement julia's statistics functionality rather than basing it on Matlab's statistics toolbox. How usable would R's libraries be? So far, we have not paid much attention to this, and we need to start thinking about it.
As for the authors, Jeff is a grad student at MIT, Stefan is at UC Santa Barbara, Alan Edelman is Professor of Mathematics at MIT, and I work for the Government of India. Julia is largely a volunteer effort, with almost all of us working on it as much as we can, but anchored around the MIT community. For example, it was used in the Parallel Computing course at MIT (http://beowulf.csail.mit.edu/18.337/index.html). We will be looking for grants and such to support work, travel to conferences, etc.
-viral
Named parameters have been planned from day one, but we just haven't got around to implementing them. So far, we have managed OK without them, but they will become a higher priority as more libraries are written.
-viral
On 25-Feb-2012, at 11:52 PM, Harlan Harris wrote:
Thanks Harlan, this is a great writeup and very helpful to us.
Unfortunately right now the implementation is not smart enough to
store an array of Union types efficiently using bit pattern tricks.
Using NaNs and minimum-integer values like R would be the way to go.
Best is to define a bits type:
bitstype 64 NAInt <: Integer
and operations on this can be overloaded to respect the special NA value.
As an aside, I would use
const NA = _NA()
instead of the macro...1 character shorter!
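To make the bits-type idea a little more concrete, here is a rough sketch in present-day Julia syntax, using a wrapper struct rather than a bits type; the names NAInt64, NAVALUE and isna are purely illustrative, not anything that exists:
struct NAInt64
    x::Int64
end
const NAVALUE = typemin(Int64)   # reserved bit pattern standing in for NA
const NA64 = NAInt64(NAVALUE)
isna(a::NAInt64) = a.x == NAVALUE
# Arithmetic propagates NA; only + is shown, other operators would follow the same pattern.
Base.:+(a::NAInt64, b::NAInt64) = (isna(a) || isna(b)) ? NA64 : NAInt64(a.x + b.x)
# NAInt64(2) + NA64 evaluates to NA64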
For non-floats there will always be one extra check involved for every
use of an integer type.
This would be a high price to pay for all other uses.
Couldn't the type system be used to decide if such a check needs to be made?
OTOH: a union-based (or union-like) solution for Type-or-NA will probably be
faster than R anyway, so it won't hurt those that need it too much.
With a union-based approach there is the advantage of having unlimited
special values.
-- Peer
1. sample quantiles: R's "quantile" function is mature.
2. quantile, density, and distribution functions: The standalone "Rmath"
library may be useful, if only for translation.
3. formatted data i/o: R provides a series of read.* functions, for
example, "read.csv". Memory mapping for big data would be nice also.
4. formulas, "~" operator: Formulas make it easy to create and
manipulate design matrices, among other uses.
5. optimal Lapack routines for symmetric and positive definite matrices:
It's tricky to balance this with the overhead (e.g., checking for
positive definiteness). The R "Matrix" package may be helpful.
6. graphics - the notion of building graphics from basic shapes is
useful. It might be timely to consider a standardized framework for
graphics. For example, if there were a standardized abstract graphics
object (e.g., a layout, coordinate system, collection of shapes and
their attributes), the tasks of writing render/output methods (e.g., to
SVG) could then be taken up by non-Julia core developers.
Keep up the great work on Julia!
Matt
P.S. There appears to be a typo in the Standard Library Reference:
flipdim(A, d) — Reverse A in dimension d.
flipud(A) — Equivalent to flip(1,A).
fliplr(A) — Equivalent to flip(2,A).
Shouldn't "flip(1,A)" be "flipdim(A,1)"?
I will repeat, if you define
bitstype 64 NAInt <: Integer
you can then overload operators for NAInt to respect the NA value, and
in all other respects this will be just as efficient as Int. And of
course Int performance won't change.
For fancier metadata like different kinds of missing values, I'd think
the most efficient approach is to have separate numeric arrays with
the metadata and use masking operations and such.
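As a small illustration of that masking approach (just a sketch; the data and variable names are made up), a parallel Bool array marks which entries are missing and reductions are taken over the unmasked values:
data = [3, 7, 5, 9]
ismissing_mask = [false, true, false, false]   # true marks an NA entry
present = data[.!ismissing_mask]               # drop the masked entries
avg = sum(present) / length(present)           # mean over observed values only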
I have been following this thread with interest. When I first discovered julia I immediately thought of it in the context of a statistical language like R (I am a member of the core development team for R and, before that, S).
On Sunday, February 26, 2012 12:26:58 PM UTC-5, Matt Shotwell wrote:
Here are some additional ideas/features where R may give some
inspiration:
1. sample quantiles: R's "quantile" function is mature.
2. quantile, density, and distribution functions: The standalone "Rmath"
library may be useful, if only for translation.
I think that would be a great idea. With luck it should be possible to get a large part of the quantile, density, and distribution functions implemented easily.
3. formatted data i/o: R provides a series of read.* functions, for example, "read.csv". Memory mapping for big data would be nice also.
First up is getting a data-frame-like capability as all those functions create data frames. It may be useful to get an opinion from John Chambers, the designer of the S language, on how he would design data frames now. He has said that if he had it to do over again he would implement them differently.
Has there been any thought to interfacing with SQLite or a similar database library?
4. formulas, "~" operator: Formulas make it easy to create and manipulate design matrices, among other uses.
That would help a lot. Almost all the model-fitting functions in R have, as the first two arguments, "formula" and "data", then go through calls to model.frame and model.matrix to set up the model matrices, response values, offset, etc. Building a capability like that would be very helpful in porting over modelling capabilities.
5. optimal Lapack routines for symmetric and positive definite matrices:
It's tricky to balance this with the overhead (e.g., checking for
positive definiteness). The R "Matrix" package may be helpful.
Speaking as one of the authors of the R Matrix package, I would probably not go down that path. At the R level the Matrix package is based upon S4 classes and methods, which would not be easy to reimplement. These days I am concentrating more on Eigen (http://eigen.tuxfamily.org) for linear algebra, through the RcppEigen package for R. The motivation for the Matrix package was to provide dense and sparse matrix classes for use in fitting mixed-effects models in what is now the lme4 package. The development version of lme4, called lme4Eigen, uses Eigen for linear algebra.
lme4Eigen, RcppEigen and Rcpp are all C++-based, which makes interfacing with julia a bit more complicated. However, Eigen is a template library, meaning that it consists of templated header files, which makes handling different types much easier. When compiling Lapack/BLAS you must decide if you are going to use 32-bit ints or 64-bit ints and then stay with that choice.
Especially with Eigen 3.1, which is now in alpha release, the handling of sparse matrices is very good - much easier to work with than SuiteSparse - and it provides interfaces to the SuiteSparse code and Pardiso/MKL BLAS, if you have a license for it. There is also an OpenGL interface.
6. graphics - the notion of building graphics from basic shapes is
useful. It might be timely to consider a standardized framework for
graphics. For example, if there were a standardized abstract graphics
object (e.g., a layout, coordinate system, collection of shapes and
their attributes), the tasks of writing render/output methods (e.g., to
SVG) could then be taken up by non-Julia core developers.
Yes, there has been a huge amount of work by Paul Murrell on the grid package, which is what makes the lattice and ggplot2 packages possible. I would bypass the traditional R graphics system in favor of the grid-based approach.
Good stuff to think about. I only have time to respond to a bit of it now:
#3 we definitely have
#4 is easy with indexing but the syntax is a bit different
Not 100% sure what #6 means but we have quoted expressions, i.e.
":(x+y)" gives you a data structure representing that symbolic
expression.
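For instance, inspecting the pieces of that data structure (head and args are the fields of Julia's Expr type):
ex = :(x + y)
typeof(ex)    # Expr
ex.head       # :call
ex.args       # Any[:+, :x, :y]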
#7 - Currently you can simulate optional arguments easily with
multiple dispatch:
f(x, y) = 1
f(x) = f(x, 0)
We use this all over the place. But, some more compact syntax for that
might be nice.
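A slightly larger, purely illustrative example of the same pattern, where the one-argument method just supplies a default for the two-argument one (rescale is a made-up helper, not a library function):
rescale(x, s) = x ./ s
rescale(x) = rescale(x, maximum(abs.(x)))
# rescale([1.0, 2.0, 4.0]) gives [0.25, 0.5, 1.0]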
We're not the kind of language that has multiple object systems. We
have one powerful system.
Our philosophy is to push almost everything to libraries, because
people will not all agree. Some people want bigints, some people might
want saturating arithmetic, some people want NA, other people want
performance at any cost. Boolean is a core concept (e.g. for picking
which branch to take), so we can't complicate it with issues like
missing data. But a separate BoolNA type is of course fine.
We don't have support for user-defined operators in the lexer (i.e.
declaring that some sequence of characters is to be parsed as an
operator). But, I'm happy to add support for extra operators in the
lexer as long as it doesn't clash with anything.
As a language enthusiast I have to point out that lazy evaluation and
accessing the parse tree are totally orthogonal --- you can easily
have one without the other. Using lazy evaluation to achieve parse
tree access works, but is very strange.
Hi Doug,
I think getting the library is relatively well-understood - quantiles, density, distributions, database hookups, formatted i/o (csvread already exists), matrix stuff, and even graphics.
Not being a regular user of R, the real question that I haven't quite understood is the embedding of data frames so deeply into the R language. What makes them so essential? Is it important for julia to have them exactly the way they are in R, or can we do things better today? Why is Matlab's statistics toolbox not sufficient from a design standpoint, for example?
You mention that John Chambers has talked about doing it differently if he were to start afresh. Would you be able to help with driving some of these design decisions relating to statistical programming? Are there key papers that describe the design decisions behind S? If we can get the design right, then getting the functionality will be relatively straightforward.
-viral
To be able to type
? name.of.function
and have text-based documentation (in the form of a man page)
automatically appear in the same window seems pretty ideal. To be
able to browse the same documentation in hyper-linked form through a
browser is just a lovely bonus. For those who don't know, the system
requires the developer to provide documentation for every user-facing
(exported from the namespace) function in a R-specific LaTeX-esque
format that allows inter-help links and runnable examples which can
also be called by running example(name.of.function) from the prompt.
I often agree the R help pages can be too terse for some (I generally
prefer that style, but I recognize I'm in the small minority there)
and aren't the most user friendly, but the *system* itself seems to be
one of the best I've used. I certainly prefer it to python or perl;
Matlab is similar in that one can call documentation from the help
line with many of the same features, but the virtue there seems to be
in the extent, not the implementation. Julia, much further down the
road, could be well served by a parallel system.
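One possible shape for such a system, sketched in present-day Julia syntax: a documentation string attached directly to a function definition and retrievable from the prompt with ?. The function below (meanimpute) is just a made-up example, not anything that exists:
"""
    meanimpute(x)

Replace NaN entries of x with the mean of the non-NaN entries.
"""
function meanimpute(x)
    good = filter(!isnan, x)
    m = sum(good) / length(good)
    map(v -> isnan(v) ? m : v, x)
end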
I'm a relatively active R user -- though certainly not on Dr. Bates'
level -- and once I get a little more acquainted with Julia, I'd be
more than happy to help develop some first-order statistical tools or
provide one user's feedback on the good bits and bad bits of R's data
structures.
A few steps back in the thread there was a question about R's formula
interface: for those who haven't played with it, I'll give a quick
intro. Most R modelling functions can be run with the following
syntax:
model.name( y ~ x, data = data.set.name)
What this does (in short) is to create a new environment in which the
formula terms are treated as names for vectors which get their values
from the columns of the data frame -- the function then creates the
model as appropriate. The syntax is relatively straightforward:
~ means "based on / as a function of / responding to"
+ means "add this term"
- means "leave this one out"
. means "all other terms not accounted for (a wildcard)"
| is used for nesting.
* and : are used for interaction terms.
The power of this interface is that it abstracts from the user the
nature of these variables: if I were to change x from continuous to
categorical, my formula wouldn't have to change at all: the model
matrix is set up behind the scenes (in some of the most difficult R
code to grok). If I add a variable to my data set and I have the
wild-card in, it automatically adapts and if I don't it nicely ignores
it. To change the kind of model, it suffices to change the function
name (e.g., lm to glm) while the formula can proceed unchanged even
though the computation is almost entirely different.
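For what it's worth, the formula shape itself is easy to capture on the Julia side with quoted expressions; this is only a sketch of parsing the syntax, not of the model-matrix machinery:
f = :(y ~ x1 + x2)           # ~ parses as an ordinary binary call, binding looser than +
f.args[1]                    # :~
response   = f.args[2]       # :y
predictors = f.args[3]       # :(x1 + x2)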
This idea gets pushed even further in data aggregation / reshaping
functions: to take an example from R help the other day, a poster had
a data set of three variables: a continuous response (y) and two
categorical predictors (x1, x2): he wanted to group by both of
these categories and calculate subset medians: in R, this was as easy
as aggregate(y ~ x1 + x2, data = dat, FUN = median) -- which one could
read as: using the data set "dat," aggregate y based on x1 and x2 and
apply the median function to each group. To get medians based on just
x1 or x2, it's as easy as dropping the other term.
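For comparison, a plain-Julia version of that particular computation (made-up data; median lives in the Statistics standard library in current Julia) shows how much convenience the formula-plus-data-frame layer is buying:
using Statistics

y  = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
x1 = ["a", "a", "b", "b", "a", "b"]
x2 = [1, 2, 1, 2, 1, 2]

groups = Dict{Tuple{String,Int},Vector{Float64}}()
for i in eachindex(y)
    push!(get!(groups, (x1[i], x2[i]), Float64[]), y[i])
end
medians = Dict(k => median(v) for (k, v) in groups)   # subset medians keyed by (x1, x2)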
It would be very powerful to adopt this interface -- and perhaps Julia
could do it more cleanly in the model matrix code since
backward compatibility doesn't have to be a consideration -- but I think
it's R at its best (unlike drop = TRUE or stringsAsFactors =
TRUE...those who've used R can feel my scowl coming through).
Thanks for the great project! I'm looking forward to having a little
more time to play with Julia and hopefully (when the semester cools
down) to contribute to an emergent statistical library.
Best,
Michael
Matching R function for function is not fun, almost impossible, and should certainly not be our goal. What we need to figure out is the essence of R's design for statistics usage, and build it into julia.
I guess one could call R from julia at some point, but that leads to a system that is almost impossible to debug. That has never stopped anyone though!
Apart from the interop, what you are suggesting is that data frame objects and time series objects would be nice to have. I'll try put a wiki page together combining all the suggestions from this thread, and perhaps a design and target will emerge out of that.
Ggplot2 looks like a very promising source of good ideas for how to design our graphics API. I suspect we can probably make it much faster too. One thing that we need to consider that ggplot2 doesn't appear to (at a very cursory glance, admittedly), is interaction: we want users to be able to make interactive graphics. I really like D3 for web-based rendering, but generating json and svg data directly at the Julia graphics API level is not the right level of abstraction, imo.
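To make the "standardized abstract graphics object" idea slightly more concrete, here is a purely hypothetical sketch in present-day Julia syntax; every type and function name is made up. The point is only the separation of the scene description from the renderers:
abstract type Shape end

struct Circle <: Shape
    x::Float64; y::Float64; r::Float64
end

struct Scene
    width::Float64
    height::Float64
    shapes::Vector{Shape}
end

svg(c::Circle) = "<circle cx=\"$(c.x)\" cy=\"$(c.y)\" r=\"$(c.r)\"/>"

render_svg(s::Scene) =
    string("<svg width=\"$(s.width)\" height=\"$(s.height)\">\n",
           join(map(svg, s.shapes), "\n"), "\n</svg>")

# Other backends (D3/JSON, an interactive canvas, ...) would add their own
# render_* methods against the same Scene object.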
Are you looking for `??` which uses fuzzy matching to search
documentation? It's not perfect, but I think it does a pretty good
job. Off-topic, but the findFn function of the sos package tries to
help with this as well.
Michael
On Mar 1, 11:21 am, "R. Michael Weylandt" <michael.weyla...@gmail.com>
wrote:
> Ahh -- I do understand what you mean, but I think that's a criticism
> of badly designed namespaces rather than of the system. Only
> user-facing functions or dataset require documentation -- if a package
> maintainer blindly exports everything from the namespace, then there
> will be documentation of "hidden" functions required, but that
> shouldn't be the case generally. (A common example is that "hidden" S3
Typically the exported names do serve a purpose -- example datasets
are used in examples (duh), various helper functions provide advanced
functionality or useful shortcuts for expert users, data manipulation
functions help convert data to the appropriate input data structures,
etc. etc. -- so there is usually a good reason to export them and
include their descriptions in the big reference file. Yet, for someone
who has no experience with the package they are all just noise. Also,
function examples in the PDFs would have been much more useful had
they included some output (especially for the graphical stuff). None
of this would be difficult to do, especially with all the literate
programming facilities that exist out there -- yet instead we have a
system which is worse than useless (had there been no PDFs at all, at
least perhaps more authors would be compelled to provide short
examples on their websites...)
-viral
I'm sure there are a couple of critical things that have to happen in the top half of this list that I've totally forgotten. I hope to have some time to work on the first two of those at some point soon...
-Harlan
This is a good roadmap - should we capture it on a wiki page? Filing issues may be a bit early.
-viral
I'd love to hear what others have to say on the topic. For now, let's just get started. Looking at your list of things for statistical programming, most of them would need to be in core once developed and stabilized.
I suspect that once we get to a release, we will have to put some process in place for future development and introduction of new features. Until we get to 1.0, I personally prefer that most of the stuff stay in the mainline development branch so that everyone can take a look at it, and things don't get too dispersed. Once we get to a point where we want to release 1.0, we can decide what is core, and what is not, and then create separate projects.
-viral