Variable scope bug in ddply


Stavros Macrakis

Jun 8, 2011, 5:42:58 PM
to manipulatr
   local({a <- 2; ddply(data.frame(q = 1:3), .(q), mutate, r = q/a)})
   Error in eval(expr, envir, enclos) : object 'a' not found

gives an error when a is a local variable.

If the variable is global, there is no problem:

   b<-2; ddply(data.frame(q=1:3),.(q),mutate,r=q/b)

mutate on its own seems OK:

   local({c <- 2; mutate(data.frame(q = 1:3), r = q/c)})

and the problem is not specific to mutate:

> local({a<-2; ddply(data.frame(q=1:3),.(q),summarize,r=sum(q)/a)})
Error in eval(expr, envir, enclos) : object 'a' not found
> local({a<-2; ddply(data.frame(q=1:3),.(q),transform,r=sum(q)/a)})
Error in eval(expr, envir, enclos) : object 'a' not found
> local({a<-2; ddply(data.frame(q=1:3),.(q),subset,q>a)})
Error in eval(expr, envir, enclos) : object 'a' not found

Hadley Wickham

Jun 13, 2011, 11:51:57 AM
to Stavros Macrakis, manipulatr
This is actually a really challenging problem to solve because of the
mix of dynamic and lexical scoping. One work around is to explicitly
define a function:

local({
  a <- 2
  df <- data.frame(q = 1:3)
  ddply(df, "q", function(piece) {
    mutate(piece, r = q / a)
  })
})

I have tried to fix it properly once, only to fail. lapply() works
around the problem by manually constructing and evaluating calls at
the C level, which I'd really rather avoid.

Hadley


--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Stavros Macrakis

Jun 13, 2011, 1:53:50 PM
to Hadley Wickham, manipulatr
I think it's an important problem to solve.  Otherwise you can't nest ddply in a natural way, as in

ddply(baseball,
      .(),
      function(all) {
        total <- sum(all$hr)              ## only works if changed to <<-
        ddply(all,
              .(lg),
              function(league) {
                lg_total <- sum(league$hr)  ## only works if changed to <<-
                ddply(league,
                      .(id),
                      summarize,
                      HR = sum(hr),
                      HRlg = sum(hr)/lg_total,
                      HRtot = sum(hr)/total)
              })
      })

Of course, if you change to <<-, not only is that ugly (and pollutes the global name space), but it presumably prevents you from using .parallel.

I don't see why it's a dynamic vs. static scope issue.  I'd have thought it's a matter of constructing the correct environments and evaluating each expression in the right one.  But then, I'm not quite ready to volunteer to fix this myself....
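For what it's worth, the one-piece version of that idea seems to work with plain eval() -- here piece stands for one group of the data and env for the captured local environment:

```r
# evaluate a captured expression against the data, with the environment
# where the expression was written as the enclosure
piece <- data.frame(q = 1:3)
env   <- local({a <- 2; environment()})   # stands in for the caller's frame
eval(quote(q / a), envir = piece, enclos = env)
# q is found in the data frame, a in env: 0.5 1.0 1.5
```

Presumably the difficulty is threading env through plyr's layers of helper functions, not the final eval() itself.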

          -s

Peter Meilstrup

Jun 13, 2011, 2:49:46 PM
to Stavros Macrakis, Hadley Wickham, manipulatr

ddply() hands down its "..." arguments to alply() and in turn to
llply(). By the time llply() is working, there's no way to tell which
environment the expression "q/a" was defined in.

It's an abstraction breakdown that R suffers when you want to pass
your arguments down to a worker function and that worker function does
nonstandard eval.

I think it could be fixed if there was a way to read off the
environment associated with a promise object. delayedAssign() lets you
construct promises with arbitrary expressions and environments, and
substitute() lets you read off the expressions, so to my mind there's
no reason not to be able to read off the environment.
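A small illustration of the asymmetry (f and the variables here are just toy stand-ins):

```r
# substitute() can recover a promise's expression...
f <- function(x) substitute(x)
f(q / a)        # returns the unevaluated call q/a
# ...and delayedAssign() can build a promise from an expression plus an
# environment, but base R exposes no inverse that reads that environment back
e <- new.env(); e$a <- 2
delayedAssign("x", a + 1, eval.env = e)
x               # forcing the promise evaluates a + 1 in e, giving 3
```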

(I might be characterizing the problem wrong. It's something I only
recently started banging my head against.)

Peter

Stavros Macrakis

Jun 13, 2011, 3:36:55 PM
to Peter Meilstrup, Hadley Wickham, manipulatr
On Mon, Jun 13, 2011 at 14:49, Peter Meilstrup <peter.m...@gmail.com> wrote:
> ddply() hands down its "..." arguments to alply() and in turn to
> llply(). By the time llply() is working, there's no way to tell which
> environment the expression "q/a" was defined in.
>
> It's an abstraction breakdown that R suffers when you want to pass
> your arguments down to a worker function and that worker function does
> nonstandard eval.

Well, clearly the topmost function needs to capture its enclosing environment and pass it to inner functions whenever expressions (as opposed to closures) are being passed around.

Naively, that doesn't seem that hard -- you just give every function that takes an expression as an argument an optional argument, .env, holding that environment, defaulting to environment().  The signature of ddply would be

function (.data, .variables, .fun = NULL, ..., .progress = "none", 
    .drop = TRUE, .parallel = FALSE, .env = environment() )

and when it calls ldply it overrides the default, passing .env = .env.  And since .fun may perform non-standard evaluation, it needs a .env argument as well.

Problem: what if .fun *doesn't* have a .env argument?  It would be pretty ugly to have to do something like

        if (".env" %in% names(as.list(args(.fun)))) ...
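(For what it's worth, formals() reads a little better, and wrapping the function in args() also covers primitives; has_env_arg is a made-up name:)

```r
# test whether a function declares a .env formal argument
has_env_arg <- function(f) ".env" %in% names(formals(args(f)))
has_env_arg(function(piece, .env) NULL)  # TRUE
has_env_arg(mean)                        # FALSE
```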

As I said above, "naively, that doesn't seem that hard".  I fiddled quickly with the code and couldn't make it work, but I'm afraid I need to get back to my regularly-scheduled job now....

           -s

Hadley Wickham

Jun 13, 2011, 3:46:25 PM
to Peter Meilstrup, Stavros Macrakis, manipulatr
> ddply() hands down its "..." arguments to alply() and in turn to
> llply(). By the time llply() is working, there's no way to tell which
> environment the expression "q/a" was defined in.

Well it is possible, but it's tricky - you need to capture the
parent.frame() (not the parent environment) when llply is called, and
then pass that down along the call stack. I've managed to do that.

The challenge is then figuring out how to evaluate the function in the
right context so that it can access both the variables in the environment
in which it was called (dynamic scope) and the variables representing
the data that's being processed (lexical scope). I could never figure
this out.
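For a single layer with no indirection, the recipe would look something like this (ply1 is a toy stand-in, not how plyr is actually structured):

```r
# capture the caller's frame up front, then evaluate the expression with
# each piece of the data as a mask and that frame as the enclosure
ply1 <- function(df, split_col, expr) {
  expr <- substitute(expr)
  env  <- parent.frame()       # dynamic scope: where ply1() was called from
  pieces <- lapply(split(df, df[[split_col]]), function(piece) {
    piece$r <- eval(expr, piece, enclos = env)  # data mask + caller's frame
    piece
  })
  do.call(rbind, pieces)
}
local({a <- 2; ply1(data.frame(q = 1:3), "q", q / a)})  # no error; r = q/a
```

The extra layers of alply()/llply() between the capture and the eval() are presumably where this breaks down in practice.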

Hadley

Stavros Macrakis

Jun 13, 2011, 3:57:36 PM
to Hadley Wickham, Peter Meilstrup, manipulatr
Couldn't you create an inner environment containing the local variables you want it to see, e.g.

   e <- list2env( ...local mappings... , parent = .env )
   
and evaluate the expressions within that?   ...local mappings... presumably are things like column names
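For one piece, with the column names as the local mappings, that would be something like this sketch (.env stands for the captured calling frame):

```r
.env  <- local({a <- 2; environment()})       # captured calling frame
piece <- data.frame(q = 1:3)                  # one piece of the split data
e <- list2env(as.list(piece), parent = .env)  # columns shadow .env
eval(quote(q / a), envir = e)                 # 0.5 1.0 1.5
```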

        -s

Hadley Wickham

Jun 13, 2011, 4:15:58 PM
to Stavros Macrakis, Peter Meilstrup, manipulatr
> Well, clearly the topmost function needs to capture its enclosing
> environment and pass it to inner functions whenever expressions (as opposed
> to closures) are being passed around.

But it's not the enclosing environment - it's the parent (calling)
frame. This is the difference between lexical and dynamic scope.

Hadley

Hadley Wickham

Jun 13, 2011, 4:17:09 PM
to Stavros Macrakis, Peter Meilstrup, manipulatr
> Couldn't you create an inner environment containing the local variables you
> want it to see, e.g.
>    e <- list2env( ...local mappings... , parent = .env )
>
> and evaluate the expressions within that?   ...local mappings... presumably
> are things like column names

Yes, and that may have been something I tried. But I have the same
problem as you - this isn't really the job I get paid to do, so I have
to minimise the amount of time I spend on it. I'm investigating
business models so that I could hire a programmer to work full-time
for me, but it's likely to be a while before I find both the money and
the right person.

Stavros Macrakis

Jun 13, 2011, 5:55:13 PM
to Hadley Wickham, Peter Meilstrup, manipulatr
How about a work-study student?  There must be a lot of smart scholarship students at Rice who'd find this sort of thing not just remunerative but also educational?

            -s