Profiles are Dangerous?

Michael Smith

Oct 1, 2014, 8:53:38 AM

to drake-w...@googlegroups.com

I just tried switching between two profiles, where the input file differs depending on the profile. In code,

$ cat Drakefile
PROFILE:=profile-default
%include $[PROFILE]

out.rds <- $[INFILE], script.R
  ./script.R $[INFILE] $OUTPUT

However, when I export an alternative profile that contains another infile, and then run drake again, it says 'nothing to do.' But clearly the infile is different now (based on the new profile), so 'out.rds' should be updated.

A workaround is to touch the new infile, but that sort of defeats the whole purpose of having a data workflow tool and using profiles.

Michael Smith

Oct 3, 2014, 10:13:22 AM

to drake-w...@googlegroups.com

Any suggestions how to deal with this problem, short of +... or manually touching the relevant files?

Thanks,

M

Aaron Crow

Oct 10, 2014, 3:02:50 PM

to Michael Smith, drake-w...@googlegroups.com

Hi Michael, did you solve this? I've tried to reproduce but, at least at first glance, I'm seeing Drake do the right thing (detect that the newer infile specified by a new profile means a rerun of the step is needed). If this is still a problem for you, please let me know your precise steps to reproduce.

--
You received this message because you are subscribed to the Google Groups "drake-workflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to drake-workflo...@googlegroups.com.
Visit this group at http://groups.google.com/group/drake-workflow.
For more options, visit https://groups.google.com/d/optout.

Michael Smith

Oct 10, 2014, 8:54:13 PM

to Aaron Crow, drake-w...@googlegroups.com

Hi Aaron,

Haven't solved it. Attached is a minimal reproducible example. You can
run it as follows. When I run `drake` for the second time, I would
expect it to re-create `out.rds` based on `in-A.rds`, but it doesn't do
anything.

$ R -q
> saveRDS(1, "in-default.rds") # Create example infiles.
> saveRDS(2, "in-A.rds")
> q("no")

$ drake
The following steps will be run, in order:
1: out.rds <- in-default.rds, script.R [missing output]
Confirm? [y/n] y
Running 1 steps with concurrence of 1...

--- 0. Running (missing output): out.rds <- in-default.rds, script.R
[1] 9
--- 0: out.rds <- in-default.rds, script.R -> done in 0.25s
Done (1 steps run).

$ export PROFILE=profile-A

$ drake
Nothing to do.

Thanks,
M


drake-example.zip

Aaron Crow

Oct 12, 2014, 3:07:23 PM

to Michael Smith, Artem Boytsov, drake-w...@googlegroups.com

Right. Sorry Michael, I didn't test this correctly before my first response. I can reproduce it now.

I guess this isn't actually a bug per se. Drake is deciding not to rerun because out.rds *is* actually current relative to in-A.rds, assuming that in-A.rds existed before your first run. And Drake is not tracking "out of date" by input file names, but just simply by comparing timestamps between the inputs and outputs. I suppose Drake could be upgraded to remember what input filenames were used to create the output and refer back to that, but my sense is that's a whole new can of worms.
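To make the timestamp comparison concrete, here is a rough shell sketch of that kind of check. It is not Drake's actual implementation: `is_stale` and `report` are hypothetical helpers, and the file times are set by hand to mimic the example in this thread (both infiles created before the first run).

```bash
# Hypothetical helper: a timestamp-only staleness check, NOT Drake's real code.
# Succeeds (exit 0) if the output is missing or any input is newer than it.
is_stale() {
  local output=$1; shift
  [ -e "$output" ] || return 0
  local input
  for input in "$@"; do
    [ "$input" -nt "$output" ] && return 0
  done
  return 1
}

# Hypothetical helper: print the decision for one step.
report() {
  local label=$1; shift
  if is_stale "$@"; then
    echo "$label: step would rerun"
  else
    echo "$label: nothing to do"
  fi
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Both infiles and the script exist before the first run...
touch -t 201410010800 in-default.rds in-A.rds script.R
# ...and out.rds is built afterwards, from in-default.rds.
touch -t 201410010900 out.rds

report "profile-default" out.rds in-default.rds script.R
report "profile-A" out.rds in-A.rds script.R

# The touch workaround: make the new infile newer than the output.
touch in-A.rds
report "profile-A after touch" out.rds in-A.rds script.R
```

The first two lines print "nothing to do" because out.rds is newer than every input under either profile; only after the touch does the profile-A check report a rerun. Nothing in the check knows which infile the output was actually built from.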

One workaround is to include a variable in your output file name, and make sure that variable is different for each profile. A clean way to do this is to include either PROFILE or INFILE in the filename, e.g.:

out.$[PROFILE].txt <- $[INFILE]

So then...

The following steps will be run, in order:

1: out.profile2.drakefile.txt <- infile2.txt [missing output]

(And if you have further steps that use that output, you'll of course need to use the same file name construction.)
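As a rough illustration of why the per-profile name helps, the shell sketch below mimics the naming only; it does not run Drake. `output_for_profile` is a hypothetical helper, and the profile names are taken from the example output above.

```bash
# Hypothetical helper mirroring the out.$[PROFILE].txt naming; not part of Drake.
output_for_profile() {
  printf 'out.%s.txt\n' "$1"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Pretend the step has already been run once under the first profile.
touch "$(output_for_profile profile1.drakefile)"

for profile in profile1.drakefile profile2.drakefile; do
  out=$(output_for_profile "$profile")
  if [ -e "$out" ]; then
    echo "$out: exists, so only timestamps decide"
  else
    echo "$out: missing output, step would run"
  fi
done
```

Because each profile maps to its own output file, switching profiles points the step at a file that does not exist yet, so a missing-output check fires regardless of input timestamps.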

Another feature to consider is Drake's branching, which allows you to specify a "tag" for all outputs at runtime. I think this would be a bit of an annoyance in your case (you'd need to be sure to specify an appropriate tag for each run), but something to know about anyway.

Curious if Artem has any thoughts on this scenario?

Michael Smith

Oct 12, 2014, 9:35:57 PM

to Aaron Crow, Artem Boytsov, drake-w...@googlegroups.com

Hi Aaron,

Thanks, that's a nice workaround.

The only problem I see is that if I have a lot of chained steps down the
road, it's going to result in a _lot_ of derived files (for each
profile), which I actually would like to avoid.

Thanks,
M

