Major proposed change: Experiment/Dataset many-to-many

Tim Dettrick

Apr 15, 2012, 9:01:49 PM

to tardis...@googlegroups.com

Greetings all,

Short version:

We (the team here at UQ) are considering making the Experiment/Dataset relationship many-to-many (i.e. datasets can exist in multiple experiments) with a view to generalizing experiment into a “collection” of datasets. The aim is to solve the “3-tier problem” for everybody without breaking lots of existing functionality based on experiments. In the process, grouping datasets would start feeling more like tagging than hierarchical placement.

This is a medium-term idea, and feedback would be appreciated. The idea is to keep it simple initially and give MyTardis room to grow.

Long version:

We had a chat with our lab managers here at UQ on Friday, and once again we ran into the infamous “3-tier” problem. In this case though, it wasn’t that we needed “n-tiers”, it was more that there were many different ways to collect datasets.

The lab managers pointed out that some researchers would be quite happy to group their data by the booking session that they collected it in, but in other cases grouping by sample would be useful, and in other cases grouping for publication might be useful. We had originally intended to solve this by moving/splitting datasets between collections, but the lab managers pointed out that the relationship to the booking session was actually a nice one to keep for finding data. We could always use metadata to mark it, but that still left us wondering how to initially group datasets.

We considered tagging to group related datasets to samples… at which point the tag would need a title, description, and the possibility of associating metadata with it… which is really an experiment. So, why not simply extend experiment?

The idea would be to transform experiment into a more flexible “collection”, and would most likely happen in stages:

1. First, the relationship between Experiment/Dataset would be transformed into many-to-many. This would be a fairly big change, but existing installations could be left largely unaffected.

2. Experiments would morph into collections, with existing experiments acquiring a “type” of “experiment”. This would provide room for other “types” of groupings.

3. The new collections would be slimmed down, with truly “experiment” concepts being shifted out into metadata or one-to-one “facet” objects.

4. At some future point it might make sense for collections to include other collections, to create a “project” or “gallery” concept.
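To make stages 1 and 2 a little more concrete, here is a minimal sketch in plain Python. It is not the MyTardis Django models; the class names `Dataset` and `Collection`, the `kind` field and the `link` helper are all hypothetical, picked only to illustrate the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances can go in sets
class Dataset:
    description: str
    collections: set = field(default_factory=set)


@dataclass(eq=False)
class Collection:
    title: str
    kind: str = "experiment"  # stage 2: existing experiments would keep this type
    datasets: set = field(default_factory=set)


def link(collection: Collection, dataset: Dataset) -> None:
    """Record membership on both sides of the many-to-many relationship."""
    collection.datasets.add(dataset)
    dataset.collections.add(collection)


# Hypothetical example: one dataset grouped by booking session and by publication.
booking = Collection("Booking session")
publication = Collection("Publication", kind="publication")
scan = Dataset("Crystal scan")
link(booking, scan)
link(publication, scan)
```

The point is only that the same dataset object appears in both collections without being copied, which is what makes grouping feel like tagging.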

Does this sound like an improvement to MyTardis? It’s certainly a lot of work, but it would help remove one of the most often stated limitations of MyTardis. I feel like I’m running out ahead of a lot of the development community these days, so I’d like to get some feedback about whether you think this is the right way to solve the problem (or even if the problem needs solving in your case).

It would certainly have an impact on how the file-system is arranged, as “experiment/dataset/datafile” would be impossible to replicate without file-system links. Permissions would stay largely the same, but deletion would be an interesting concept that could potentially cause issues. (Most likely we’d handle it like hard-links, with the last reference removing the dataset, but that doesn’t solve data file deletion.)
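A rough sketch of the hard-link style deletion, assuming we simply track which collections reference each dataset. The `DatasetStore` class and its method names are hypothetical, and as noted above this says nothing about deleting individual data files.

```python
class DatasetStore:
    """Tracks which collections reference each dataset (all names hypothetical)."""

    def __init__(self):
        self._refs = {}  # dataset id -> set of collection ids

    def link(self, dataset_id, collection_id):
        self._refs.setdefault(dataset_id, set()).add(collection_id)

    def unlink(self, dataset_id, collection_id):
        """Remove one reference; return True if the dataset itself was deleted."""
        if dataset_id not in self._refs:
            return False
        refs = self._refs[dataset_id]
        refs.discard(collection_id)
        if not refs:
            # Last reference gone: the dataset goes too, like a hard link.
            del self._refs[dataset_id]
            return True
        return False

    def exists(self, dataset_id):
        return dataset_id in self._refs


store = DatasetStore()
store.link("ds1", "booking")
store.link("ds1", "paper")
first = store.unlink("ds1", "booking")  # one reference remains
second = store.unlink("ds1", "paper")   # last reference removed
```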

What do you think?

Thank you,

Tim Dettrick

Senior Software Engineer

ITEE eResearch Group

The University of Queensland

Steve Bennett

Apr 15, 2012, 10:40:10 PM

to tardis...@googlegroups.com

Hi Tim,

Interesting idea. The question "what is a collection" is one of the biggest issues in digital humanities at the moment, and recurs in lots of domains. As you note, objects don't necessarily belong strictly to one collection: so then, what kinds of collections could it belong to, who would define them, how persistent would they be, etc?

It seems to me there are maybe four kinds of collections on the table here:

1) Containers for datasets (i.e., a unique home that the dataset belongs to - currently an 'experiment')

2) Flexible collections (i.e., non-unique containers that behave much like current experiments)

3) Labels/tags (explicit, but minimal, ways to group datasets together)

4) Other, implicit groupings (eg, anything by the same author, anything on a certain date, anything made by a certain instrument...)

I certainly like the idea of more flexibility, and particularly ditching the notion that datasets belong uniquely to "experiments". (Works great for the Sync, not so good elsewhere)

Would it be possible for a collection to simultaneously act like 2) and 3) above? That is, you can quickly add a label to a dataset, then give additional metadata to that label, so it's fairly indistinguishable from an experiment now.

> The lab managers pointed out that some researchers would be quite happy to group their data by the booking session that they collected it in, but in other cases grouping by sample would be useful, and in other cases grouping for publication might be useful.

Do these "groupings" have to be permanent, or stored anywhere? What if there was just a mechanism to see all datasets associated with a sample (a search)? Does that meet the same needs?

> 1. First, the relationship between Experiment/Dataset would be transformed into many-to-many. This would be a fairly big change, but existing installations could be left largely unaffected.

Good.

> 2. Experiments would morph into collections, with existing experiments acquiring a “type” of “experiment”. This would provide room for other “types” of groupings.

Any idea what you would do with the "type"? Does Tardis care whether a collection has type "sample", "experiment", "booking session"...? I think pretty soon that would be restrictive - the ultimate is to start with a search query, then save that as a collection if it proves useful. (Much like Lighthouse does with "ticket bins")
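To illustrate the "save a search as a collection" idea, here is a small sketch where a collection is just a stored predicate rather than stored membership. The `SavedSearch` class, the dictionary fields and the sample and instrument values are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SavedSearch:
    """A collection defined by a query rather than by stored membership."""
    title: str
    matches: Callable[[dict], bool]

    def datasets(self, all_datasets):
        # Membership is computed on demand, so newly added matches show up automatically.
        return [d for d in all_datasets if self.matches(d)]


# Made-up records standing in for real datasets.
datasets = [
    {"name": "scan-1", "sample": "lysozyme", "instrument": "A"},
    {"name": "scan-2", "sample": "insulin", "instrument": "A"},
    {"name": "scan-3", "sample": "lysozyme", "instrument": "B"},
]
by_sample = SavedSearch("Lysozyme data", lambda d: d["sample"] == "lysozyme")
names = [d["name"] for d in by_sample.datasets(datasets)]
```

Nothing here needs a "type": the query itself says what the grouping means.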

> 3. The new collections would be slimmed down, with truly “experiment” concepts being shifted out into metadata or one-to-one “facet” objects.

Yep - would need improvements to the display of metadata. But yes - fields on experiments are just metadata like any other. (OTOH, the code to query metadata fields is messier than real database fields - might need some handy helper methods.)
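By "handy helper methods" I mean something along these lines. This is only a sketch that assumes metadata is stored as a list of (name, value) pairs per object; `get_param` and `filter_by_param` are hypothetical names, not existing Tardis functions.

```python
def get_param(obj, name, default=None):
    """Return the first metadata value called `name`, or `default` if absent."""
    for key, value in obj["parameters"]:
        if key == name:
            return value
    return default


def filter_by_param(objs, name, value):
    """Select objects whose metadata has `name` set to exactly `value`."""
    return [o for o in objs if get_param(o, name) == value]


# Made-up collections with metadata instead of fixed fields.
collections = [
    {"title": "A", "parameters": [("institution", "UQ"), ("sample", "S1")]},
    {"title": "B", "parameters": [("institution", "Monash")]},
]
uq_titles = [c["title"] for c in filter_by_param(collections, "institution", "UQ")]
```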

> 4. At some future point it might make sense for collections to include other collections, to create a “project” or “gallery” concept.

Yeah, maybe. Also for use cases like "here are all the datasets collected by my PhD students".

> Does this sound like an improvement to MyTardis?

Definitely - and it's basically a necessity for Tardis to continue to colonise other disciplines.

> It’s certainly a lot of work, but it would help remove one of the most often stated limitations of MyTardis. I feel like I’m running out ahead of a lot of the development community these days, so I’d like to get some feedback about whether you think this is the right way to solve the problem (or even if the problem needs solving in your case).

Yes, I think it's basically right.

> It would certainly have an impact on how the file-system is arranged, as “experiment/dataset/datafile” would be impossible to replicate without file-system links. Permissions would stay largely the same, but deletion would be an interesting concept that could potentially cause issues. (Most likely we’d handle it like hard-links, with the last reference removing the dataset, but that doesn’t solve data file deletion.)

Maybe we keep a "spiritual home" for each dataset - the collection it was originally created under, and leave it in that directory. For our users, the fact that there is some structure on disk is a benefit - it will be helpful in case of any migration away from Tardis in the future. But maybe it's not crucial.
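As a sketch of the "spiritual home" idea: the on-disk location depends only on the collection the dataset was created under, so linking it elsewhere later never moves files. The `storage_path` function, the `home_collection` field and the `/var/store` root are all assumptions made up for illustration.

```python
from pathlib import PurePosixPath


def storage_path(root, dataset):
    """Directory for a dataset, based only on the collection it was created under."""
    return PurePosixPath(root) / str(dataset["home_collection"]) / str(dataset["id"])


dataset = {"id": 42, "home_collection": 7, "collections": {7}}
before = storage_path("/var/store", dataset)
dataset["collections"].add(9)  # later linked into a second collection
after = storage_path("/var/store", dataset)
```

Both calls give the same directory, so the structure on disk stays readable even once the database relationship is many-to-many.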

Steve

Steve Androulakis

Apr 16, 2012, 1:07:11 AM

to tardis...@googlegroups.com

Hi Tim, Steve B,

I really quite like this idea, for the most part. I think this is the first 'expansion' of the model I've heard of that doesn't detract from the conceptual simplicity of what is there now, or threaten to knock over the dominoes towards 'infinite customisation' (eg. n tiers).

Also, it still feels experiment-centric which I believe will prove more and more useful in time, particularly for public dissemination / contextualising groups of data for processing/sharing.

This idea would immediately solve a problem I deal with semi-regularly and increasingly: publishing experiments containing datasets from different blocks of time at the synchrotron.

In my case, the 'experiment' I publish relates to a single publication. MyTardis receives data from the synchrotron in multiple experiments that correspond to different scheduled times and contain a lot of data not intended for publication (ever or under this particular published experiment).

The answer at present has been for me to copy the data and datasets manually out of the existing 'scheduled time' experiments and form a compilation that maps to a publication (FYI I do this with the staging area and Create Experiment interface). This has worked, but it is fiddly, not something end users should attempt, and results in multiple copies of files stored in the exact same file store.

If I have read the proposed changes correctly, one would simply create a new experiment intended for the public, and 'link' the datasets from various other experiments under it. Cool.

As for the file system, I realise to do this 'properly' we'd probably have to comprehensively change the underlying file structure. This would break the synchrotron workflow somewhat (file transmission would have to specify destination datasets' folders in all likelihood instead of just an experiment folder) but I'm not sure if it's so bad to retain the current file structure, as you guys suggested.

I'm trying to divorce this system from a distributed, 'web accessible file system' as much as possible, because I believe there are numerous gains from that - so it shouldn't really matter to a user where a file is stored in the back end at all. As an aside, I think those wanting an infinitely flexible file store with a web front end should just get Dropbox Pro accounts or equivalent. There's a lot of merit in that kind of approach, but we're doing something conceptually different and complementary, as should be obvious now!

I haven't had enough time to think about the concept of defined experiment 'types', their form and their implications for the system. Hey, I'm on annual leave here! But I'll add that even with all of this I still think there would be a need for tags/keywords.

TL;DR: Tim, you have the right idea on the major stuff, so I agree with Steve B.

Cheers,

Steve

Sent from my iPad
