Working with modules from multiple sources

Eliot McIntire

Apr 12, 2018, 11:53:44 AM

to SpaDES Users

Dear SpaDES users,

As our case studies expand, we are running into growing pains. The question now is:

To maintain reproducibility and version control as we expand to new cases, how do we organize a project when modules come from multiple sources?

One solution has been to use "links" locally on a machine. I have not liked this method because it is not machine transferable, i.e., whatever scripts I put together on my Windows machine will not work on my Linux machine.

Another suggestion is to use this folder structure:

MyProject
    - modules
        - module1
        - module2
        - module3
        - module4
        - ...
    - inputs
    - outputs
        - output1
        - output2
        - ...
    - cache
        - cache1
        - cache2
        - ...

When all of the project's modules are self-contained, this works as is: use a single GitHub.com repository for the whole project.

When the project has modules from other people or sources, we keep this structure but combine .gitignore with "nested" git repositories, for example:

MyProject
    - modules
        - module1
        - module2 (.gitignored in MyProject, git clone fork of PredictiveEcology/module2)
        - module3
        - module4 (.gitignored in MyProject, git clone fork of amc/module4)
        - ...
    - inputs
    - outputs
        - output1
        - output2
        - ...
    - cache
        - cache1
        - cache2
        - ...
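
A minimal sketch of the .gitignore plus nested-clone layout described above. To keep it runnable without network access, a local repository named "upstream-module2" stands in for a fork such as PredictiveEcology/module2; that name, the temporary directory, and the file module1.R are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch only: "upstream-module2" is a hypothetical local stand-in for a
# remote fork, so the example runs offline.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# A stand-in for someone else's module repository.
git init -q "$work/upstream-module2"
git -C "$work/upstream-module2" commit -q --allow-empty -m "initial module2 commit"

# The project repository, with its own module tracked as usual.
git init -q "$work/MyProject"
cd "$work/MyProject"
mkdir -p modules/module1
echo "# module1" > modules/module1/module1.R

# Ignore the externally maintained module, then clone it into place.
echo "modules/module2/" >> .gitignore
git clone -q "$work/upstream-module2" modules/module2

# Stage everything; the ignored nested clone is left out.
git add .
git status --short
```

With this layout MyProject records nothing about which commit of module2 was used, so that information has to be tracked some other way.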

Any thoughts?

Tati Micheletti

Apr 12, 2018, 1:04:20 PM

to SpaDES Users

Dear SpaDESers,

I do think it is nice to be able to have the "latest" version of modules such as LCC05 and LBMR, which are constantly being tested and improved by several of us. However, even though I understand that this first approach is not "machine transferable", I also think we need to avoid keeping several copies of SpaDES modules on our local machines, and to be able to use the same most recent modules regardless of the project. The proposed structure might also confuse people who are not so comfortable with GitHub. I would like to discuss the first approach, structured like this:

SpaDES-modules
    - module1 (git repo)
    - module2 (git repo)
    - module3 (git repo)
    - module4 (git repo)
    - ...

MyProject
    - inputs
    - outputs
        - output1
        - output2
        - ...
    - cache
        - cache1
        - cache2
        - ...

"Problems" to consider with this approach:

1. Make sure you correctly set the path (setPaths()) to your 'modules' folder (which will be outside your projects) - this will have to be hard coded for each project/machine/user, but is the only thing that each user will have to do when receiving a project. It is pretty minimal in my opinion and doesn't influence reproducibility;

2. Make sure you always run git pull from upstream (maybe it is possible to code this in the global script, for example?), where the most up-to-date modules will be (I thought about creating a GitHub profile that will only host the SpaDES modules we have been working on, and any new modules from other sources will have to be cloned from those other developers);

3. If you make any changes to a module that you think are good for everybody (e.g., changing if/else to switch statements in older modules, organizing functions into separate R scripts so the module is more easily readable for non-programmers, etc.), make sure you open a pull request for these so we always have the most up-to-date modules.
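
As a rough sketch of point 2, a shared modules folder could be refreshed with a small loop like the one below. It assumes each module is an ordinary git clone whose current branch tracks the repository it was cloned from; the local "upstream-module1" repository and the temporary paths are hypothetical stand-ins so the example runs offline.

```shell
#!/bin/sh
# Sketch only: paths and repository names are hypothetical stand-ins.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
modules_dir="$work/SpaDES-modules"
mkdir -p "$modules_dir"

# Set up one shared module cloned from a stand-in upstream repository.
git init -q "$work/upstream-module1"
git -C "$work/upstream-module1" commit -q --allow-empty -m "first version"
git clone -q "$work/upstream-module1" "$modules_dir/module1"

# Upstream moves ahead after the clone was made.
git -C "$work/upstream-module1" commit -q --allow-empty -m "newer version"

# Update every git repository found directly under the modules folder.
for repo in "$modules_dir"/*/; do
    [ -d "${repo}.git" ] || continue   # skip folders that are not git clones
    echo "Updating $repo"
    git -C "$repo" pull -q --ff-only   # refuse to merge if histories diverged
done
```

Using --ff-only means a module with local, unpushed changes that conflict with upstream is reported as an error rather than silently merged.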

I am not sure if we would have any problems with SpaDES for not having the modules folder inside the project folder. But I consider this approach actually more modular because the modules are not within each project, they are something apart from it and the same modules are reusable for all projects. What do you think?

Tati

Alex Chubaty

Apr 12, 2018, 2:04:08 PM

to spades...@googlegroups.com

Originally, I had taken Tati's approach and kept a single directory for all my SpaDES modules; however, I've since begun to appreciate the simplicity of using a single directory per project. It's self-contained and doesn't require keeping track of external (module) dependencies (which could otherwise be problematic if two projects are using different module versions). That being said, using nested git repos is a bad idea; use git submodules instead.
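
For readers unfamiliar with git submodules, here is a minimal sketch of adding a module to a project as a submodule. The local "upstream-module4" repository is a hypothetical stand-in for a remote module repository so the example runs offline; the protocol.file.allow setting is only needed because the stand-in is a local path, and would normally be unnecessary with an https or ssh URL.

```shell
#!/bin/sh
# Sketch only: "upstream-module4" is a hypothetical local stand-in for a
# remote module repository.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# A stand-in for an externally maintained module.
git init -q "$work/upstream-module4"
git -C "$work/upstream-module4" commit -q --allow-empty -m "initial module4 commit"

# The project repository.
git init -q "$work/MyProject"
cd "$work/MyProject"

# Register the module as a submodule under modules/module4.
# Recent git versions block local-path submodules unless the file
# protocol is explicitly allowed, hence the -c option here.
git -c protocol.file.allow=always submodule add "$work/upstream-module4" modules/module4

# The project now records the submodule URL (.gitmodules) and the exact
# commit of module4 in use.
git commit -q -m "Add module4 as a submodule"
git submodule status
```

Unlike a .gitignored nested clone, the parent repository pins the specific module commit, so someone cloning the project can retrieve the same module version with git submodule update --init.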

On the other hand, I still use a common-directory approach for large datasets. I use symlinks (which are OS-specific) to keep only a single copy of large data files on my hard drive (really important for laptops and SSDs). The only issue with this approach is if you're on Windows... I had started playing around writing some R functions that would create symlinks in a cross-platform way, but gave up when trying to get things working on Windows. If anyone is interested in pursuing this further, I can make that code available.
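
On Linux or macOS, the single-copy idea can be sketched with plain ln -s as below. The directory "shared-data" and the file name landcover.tif are hypothetical, and this does not cover the Windows case mentioned above.

```shell
#!/bin/sh
# Sketch only (POSIX systems): directory and file names are hypothetical.
set -eu
work=$(mktemp -d)

# One shared location holds the single real copy of a large data file.
mkdir -p "$work/shared-data" "$work/MyProject/inputs"
echo "placeholder" > "$work/shared-data/landcover.tif"

# The project's inputs folder only contains a symbolic link to it.
ln -s "$work/shared-data/landcover.tif" "$work/MyProject/inputs/landcover.tif"

# Listing the folder shows the link and where it points.
ls -l "$work/MyProject/inputs"
```

Because the link stores an absolute path, it breaks if the shared folder is moved or the project is copied to another machine.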

Alex Chubaty

Apr 12, 2018, 2:54:44 PM

to SpaDES Users

I've gone ahead and added my symlink code to my amc package on GitHub (development branch): https://github.com/achubaty/amc/blob/development/R/symlinks.R

Also relevant is the `copy` function which deals with symlinks: https://github.com/achubaty/amc/blob/development/R/copy.R

Both functions probably need better names; both definitely need better testing (especially on Windows). I appreciate any input/suggestions.

Eliot McIntire

Apr 12, 2018, 4:22:50 PM

to SpaDES Users

Since I and many others use Windows, symlinks don't seem like a workable option. But perhaps they will be part of a solution on some level.

I am glad to see that git submodules exist. That was the hope behind the initial posted suggestion here. I will likely start going that way.

Please add to this list of pros and cons of "submodules" (Eliot, with Alex's modification) vs. "external folder" (Tati) vs. "symlinks" (Alex):

https://docs.google.com/document/d/1_L7SA7Cyfr9Y7ef0ZG7g2ieA8HsKi_ZAhfwS4qGV4K0/edit?usp=sharing

Alex Chubaty

May 16, 2018, 5:39:35 PM

to SpaDES Users

On the data side of things, see the discussion at https://github.com/PredictiveEcology/SpaDES/issues/320

Alex Chubaty

Jun 21, 2018, 10:23:19 AM

to spades...@googlegroups.com

I forgot to mention my recent blog post on this topic, which hopefully clarifies the use of git submodules: http://predictiveecology.org/2018/06/14/managing-large-spades-projects.html
