Best practice for locating largish data

Steve Cumming

Mar 12, 2018, 4:52:39 AM

to SpaDES Users

I am revising the component of my fire model, and have gotten to the module scfmCrop. This module takes as input a national landcover map, an "age" map, and a fairly large fire incidence database.

My understanding is that those objects should be loaded (or even downloaded) in the .inputObjects event, and so should presumably be located in the module's data subdirectory.

Is this considered best practice even for large data objects shared among many users?

sc

Alex Chubaty

Mar 12, 2018, 11:02:54 AM

to SpaDES Users

Yes, currently the module's data directory is the place to put data. Remember that modules need to be self-contained, and coding your module to put data anywhere else (i.e., outside the module's directory) means you're imposing your idiosyncratic file path on other users. Don't do this.

For large data sets, especially those used by multiple modules, I use symlinks to a single location on my machine. This is something I needed to set up manually, but it means each module can see its own data while the actual files live in one place, instead of having multiple copies scattered around everywhere. I haven't been successful getting an auto-symlinking scheme to work within SpaDES because of problems on Windows (admin access is required). I may revisit this issue if there's a pressing need.
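A minimal sketch of that manual symlink setup, using only base R. The directory layout, the second module name (otherModule) and the file name are hypothetical, and everything lives under a temporary directory so the example is safe to run. On Windows, file.symlink() may fail without elevated privileges, which is the problem described above.

```r
# Hypothetical layout: one shared data store and two module data directories.
root <- tempfile("spades_symlink_demo_")
shared_dir <- file.path(root, "shared_data")
module_data <- file.path(root, "modules", c("scfmCrop", "otherModule"), "data")
dir.create(shared_dir, recursive = TRUE)
for (d in module_data) dir.create(d, recursive = TRUE)

# A tiny stand-in for a large shared data file (hypothetical name).
shared_file <- file.path(shared_dir, "landcover.csv")
writeLines(c("id,class", "1,forest"), shared_file)

# Link the single shared file into each module's data directory.
# The link target is an absolute path, so it resolves from any location.
links <- file.path(module_data, basename(shared_file))
linked <- vapply(links, function(l) file.symlink(shared_file, l), logical(1))

if (!all(linked)) {
  warning("Some symlinks could not be created (common on Windows without admin rights).")
}
```

Each module then reads the file through its own data directory, while only one physical copy exists on disk.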

Alex

Steve Cumming

Mar 12, 2018, 3:27:13 PM

to SpaDES Users

Thank you, just checking.

Eliot McIntire

Mar 12, 2018, 4:21:17 PM

to SpaDES Users

My current workflow follows below. The use case is multiple modules using the same dataset, all cropped and masked to the study area:

1. Have a copy of the data in the cloud somewhere, either an actual online repository or a personally created one with Google Drive (download via the googledrive package) or Dropbox (download via the rdrop2 package).
2. In every module that needs the data, use "suppliedElsewhere" (a new function in SpaDES.core@development) inside the .inputObjects function to determine whether the object has been supplied elsewhere (via the inputs or objects argument in simInit, another module's .inputObjects, or another module's "init" event).
3. If no other module has downloaded it, then download it, and make it into the R object you need.
4. Subsequent modules all do the same, and thus will not download it if the object is correct.

The result is that only one module will have the data in its data folder. This will not work if each module needs its own unique object derived from the one downloaded file.

if (!suppliedElsewhere("object", sim)) {
  sim$object <- Cache(prepInputs, ...)
}
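To see why this guard yields a single download, here is a toy emulation in base R that runs without SpaDES. It is only a sketch: a plain environment stands in for the simList, and already_supplied(), fetch_landcover() and input_objects_sketch() are hypothetical stand-ins, not SpaDES.core functions. The real suppliedElsewhere also checks the other sources listed above, which this toy version ignores.

```r
# A plain environment stands in for the simList (assumption for illustration).
sim <- new.env()
downloads <- 0L  # counts how often the expensive fetch step actually runs

# Hypothetical helper: TRUE if an earlier module already created the object.
already_supplied <- function(name, sim) {
  exists(name, envir = sim, inherits = FALSE)
}

# Hypothetical stand-in for the download-and-prepare step.
fetch_landcover <- function() {
  downloads <<- downloads + 1L
  data.frame(id = 1:3, class = c("forest", "water", "urban"))
}

# Every module's .inputObjects would carry the same guard.
input_objects_sketch <- function(sim) {
  if (!already_supplied("landcover", sim)) {
    sim$landcover <- fetch_landcover()
  }
  invisible(sim)
}

input_objects_sketch(sim)  # first module: object is missing, so it is fetched
input_objects_sketch(sim)  # second module: object exists, so the fetch is skipped
```

After both calls the counter has only been incremented once, which mirrors the "only one copy" behaviour described in the workflow.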

Benefits of this:

- Modular: no matter which group of modules somebody tries, they will get the dataset if they need it, but only one copy.
- The cloud is the "original" source, so everybody who runs the module starts with the same dataset.
- No need for symlinks (which, as Alex mentioned, don't work cross-platform between Windows and Linux/Mac).
