hey bambooers,
i have a question about some functionality and wanted to see what you think... let me describe the scenario/use case a bit
so in the nigeria project (nmis) we are going to try to get bamboo into the data pipeline. for now, all of the custom analysis will still be in R, but we want to push the data into bamboo and have the nmis website run off of static files that it pulls from bamboo (as csvs) whenever we trigger an update (either on a fab:deploy or a cron job). the idea is that we can get nmis running off of bamboo now, and then migrate the analysis from R into bamboo later as we add functionality for the cleaning steps and tune performance.
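to make the nmis side concrete, here is a rough sketch (python, purely for illustration) of what the fab/cron refresh step could look like. the base url, the dataset_ids, and the csv export route are all placeholders -- i'm assuming bamboo can hand a dataset back as csv at something like /datasets/<id>.csv, so adjust to whatever the real export route is:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance
DATASETS = {
    # hypothetical dataset_ids that nmis would be configured with
    "facilities": "abc123",
    "lga_summaries": "def456",
}

def refresh_static_files(out_dir="static/data"):
    # pull each dataset from bamboo as csv and overwrite the static file
    for name, dataset_id in DATASETS.items():
        resp = requests.get("%s/datasets/%s.csv" % (BAMBOO_URL, dataset_id))
        resp.raise_for_status()
        with open("%s/%s.csv" % (out_dir, name), "w") as f:
            f.write(resp.text)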
the ideal workflow can be performed entirely by a data analyst (i.e. a non-developer), offline, with no developer intervention. after making any changes to their calculations, the analyst runs the analysis scripts, which pull the latest data using formhub.R, perform the analysis, and export the csvs to bamboo (using a new bamboo.R package). at that point the analyst is done. with the data up to date in bamboo, the nmis site is free to refresh its data (fab/cron).
this would be nice (right?), but there is one small issue. eventually the entire analysis will be done in bamboo, so the dataset_ids won't change: the datasets will be continuously updated and nmis will know to pull from a fixed set of ids. however, since we are not quite there yet, every time we re-run the analysis in R we essentially need to reload the dataset in bamboo (many things may change, including both the data and the schema). that means we either need to (1) delete the old dataset and create a new one, or (2) somehow clear out the old dataset and replace it with the new data. the problem with (1) is that the ids change, so nmis would somehow have to learn the new ids. the problem with (2) is that bamboo doesn't yet have a nice way to do it.
here is what i propose: i think we want to go with (2), because it is the more typical usage pattern in bamboo. we generally want one place (one id) for a given dataset and to manipulate it there. to support this, i think we should add the ability to essentially delete a dataset and reset it without losing the original id. it seems like a reasonable function to be able to start over in the same place (right?). it makes dataset re-creation a bit cleaner and results in less dataset pollution (new datasets piling up because the old ones were never deleted). the api change i think makes sense is to accept an existing_id query parameter on dataset creation, where existing_id is the dataset_id of the dataset to remove and replace with the data that is also being sent (since it is a create action, either csv/json data and/or an sdf schema). semantically, i could also see the argument for making this a put instead of a post...
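to make the proposal concrete, here is roughly what the call could look like from a client (python just for illustration). the existing_id parameter is the proposed addition and doesn't exist yet, and i'm assuming the create endpoint keeps accepting a csv_file upload:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance

def replace_dataset(existing_id, csv_path):
    # re-create a dataset in place: drop whatever is behind existing_id and
    # load the new csv, keeping the same dataset_id for consumers like nmis
    with open(csv_path, "rb") as f:
        resp = requests.post(
            "%s/datasets" % BAMBOO_URL,
            params={"existing_id": existing_id},  # the proposed parameter
            files={"csv_file": f},
        )
    resp.raise_for_status()
    return resp.json()  # expected to echo back the same dataset_id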
note: we might also want to add a function that deletes all of the data (or a portion of it, given some criteria) while keeping the schema intact, for cases where we want to dump the rows but keep updating the dataset in the same place. i can see this being very useful (especially when you only want to keep around a certain window of constantly updating data), but it is not a requirement for the current scenario.
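purely for illustration, that could look something like the following -- the route and the query parameter are completely made up here, just to give the idea a shape:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance

def clear_dataset_rows(dataset_id, query=None):
    # hypothetical: delete all rows (or only those matching an optional
    # query) while leaving the dataset's schema and id untouched
    params = {"query": query} if query else {}
    resp = requests.delete(
        "%s/datasets/%s/data" % (BAMBOO_URL, dataset_id),  # made-up route
        params=params,
    )
    resp.raise_for_status()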
what do you think about this? does the scenario make sense? is it something that we want to support? does the implementation work or could it be more generalized?
let me know!
cheers,
mark