hey bambooers,
i have a question about some functionality and wanted to see what you think... let me describe the scenario/use case a bit
so in the nigeria project (nmis) we are going to try to get bamboo into the data pipeline. for now, all of the custom analysis will still be in R, but we want to push the data into bamboo and have the nmis website run off of static files that it pulls from bamboo (as csvs) whenever we trigger an update (either on a fab:deploy or a cron job). the idea is that we can get nmis running off of bamboo now, and then migrate the analysis from R into bamboo later as we add functionality for the cleaning steps and tune performance.
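to make the nmis side concrete, here is a rough sketch (python, purely for illustration) of what the fab/cron refresh step could look like. the base url, the dataset_ids, and the csv export route are all placeholders -- i'm assuming bamboo can hand a dataset back as csv at something like /datasets/<id>.csv, so adjust to whatever the real export route is:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance
DATASETS = {
    # hypothetical dataset_ids that nmis would be configured with
    "facilities": "abc123",
    "lga_summaries": "def456",
}

def refresh_static_files(out_dir="static/data"):
    # pull each dataset from bamboo as csv and overwrite the static file
    for name, dataset_id in DATASETS.items():
        resp = requests.get("%s/datasets/%s.csv" % (BAMBOO_URL, dataset_id))
        resp.raise_for_status()
        with open("%s/%s.csv" % (out_dir, name), "w") as f:
            f.write(resp.text)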
the ideal workflow can be performed entirely by a data analyst (i.e. a non-developer), offline, with no developer intervention. after making any changes to their calculations, the analyst runs the analysis scripts, which pull the latest data using formhub.R, perform the analysis, and export the csvs to bamboo (using a new bamboo.R package). at that point the analyst is done. with the data up to date in bamboo, the nmis site is free to refresh its data (fab/cron).
this would be nice (right?), but there is one small issue. eventually the entire analysis will be done in bamboo, so the dataset_ids won't change: the datasets will be continuously updated and nmis will know to pull from a fixed set of ids. however, since we are not quite there yet, every time we re-run the analysis in R we essentially need to reload the dataset in bamboo (many things may change, including both the data and the schema). that means we either need to (1) delete the old dataset and create a new one, or (2) somehow clear out the old dataset and replace it with the new data. the problem with (1) is that the ids change, so nmis would somehow have to learn the new ids. the problem with (2) is that bamboo doesn't yet have a nice way to do it.
here is what i propose: i think we want to go with (2), because it is the more typical usage pattern in bamboo. we generally want one place (one id) for a given dataset and to manipulate it there. to support this, i think we should add the ability to essentially delete a dataset and reset it without losing the original id. it seems like a reasonable function to be able to start over in the same place (right?). it makes dataset re-creation a bit cleaner and results in less dataset pollution (new datasets piling up because the old ones were never deleted). the api change i think makes sense is to accept an existing_id query parameter on dataset creation, where existing_id is the dataset_id of the dataset to remove and replace with the data that is also being sent (since it is a create action, either csv/json data and/or an sdf schema). semantically, i could also see the argument for making this a put instead of a post...
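to make the proposal concrete, here is roughly what the call could look like from a client (python just for illustration). the existing_id parameter is the proposed addition and doesn't exist yet, and i'm assuming the create endpoint keeps accepting a csv_file upload:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance

def replace_dataset(existing_id, csv_path):
    # re-create a dataset in place: drop whatever is behind existing_id and
    # load the new csv, keeping the same dataset_id for consumers like nmis
    with open(csv_path, "rb") as f:
        resp = requests.post(
            "%s/datasets" % BAMBOO_URL,
            params={"existing_id": existing_id},  # the proposed parameter
            files={"csv_file": f},
        )
    resp.raise_for_status()
    return resp.json()  # expected to echo back the same dataset_id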
note: we might also want to add a function that deletes all of the data (or a portion of it, given some criteria) while keeping the schema intact, for cases where we want to dump the rows but keep updating the dataset in the same place. i can see this being very useful (especially when you only want to keep around a certain window of constantly updating data), but it is not a requirement for the current scenario.
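purely for illustration, that could look something like the following -- the route and the query parameter are completely made up here, just to give the idea a shape:

import requests

BAMBOO_URL = "http://bamboo.example.org"  # placeholder bamboo instance

def clear_dataset_rows(dataset_id, query=None):
    # hypothetical: delete all rows (or only those matching an optional
    # query) while leaving the dataset's schema and id untouched
    params = {"query": query} if query else {}
    resp = requests.delete(
        "%s/datasets/%s/data" % (BAMBOO_URL, dataset_id),  # made-up route
        params=params,
    )
    resp.raise_for_status()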
what do you think about this? does the scenario make sense? is it something that we want to support? does the implementation work or could it be more generalized?
let me know!
cheers,
mark