Hi All,
I've created a new branch: sample-table-impl-b1 which supports Samples and Equipment largely as described below:
· I’ve added a Schema type of GENERAL to allow Schema to be defined that can be attached to any of Experiment, Dataset or Datafile. This would be used to allow Samples, and possibly Equipment, to exist at any level. The search routines will need to be updated to handle GENERAL schema. I don’t know if the METS parser needs modification.
· I’m not sure what Steve means by “making Equipment core schema ... and tying it to the create experiment interface”. There isn’t any need to do anything special for Equipment (or Samples). For the synchrotron we would define Equipment with the same set of parameters as defined in the (to be removed) Equipment table (I’ve put this in initial_data.json, but it is Synchrotron specific).
· Schema has a pretty-print name, see Schema.name and Schema.displayName(). The experiment view page has been updated to use the user friendly name. The experiment view needs to be modified to consistently and sensibly order the printing of parameter sets.
· I couldn’t really follow Grischa’s example display of Sample and Equipment below, but think that what’s in the branch is close.
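To make the GENERAL schema-type and display-name ideas above concrete, here is a minimal plain-Python sketch (the names and structure are my own illustration, not the actual models.py code in the branch):

```python
# Hypothetical sketch of schema "types" allowing a GENERAL schema that can
# be attached at any level. Names are illustrative, not the real TARDIS models.
EXPERIMENT, DATASET, DATAFILE, GENERAL = range(4)

class Schema:
    def __init__(self, namespace, name=None, schema_type=GENERAL):
        self.namespace = namespace
        self.name = name          # optional pretty-print name
        self.type = schema_type

    def display_name(self):
        # Fall back to the namespace when no friendly name was set.
        return self.name if self.name else self.namespace

    def attachable_to(self, level):
        # A GENERAL schema may be attached to any of the three levels;
        # the older per-model types stay restricted to their own level.
        return self.type == GENERAL or self.type == level

equipment = Schema("http://synchrotron.example/equipment", "Equipment")
assert equipment.attachable_to(DATASET)      # GENERAL attaches anywhere
```

The search routines would then match on `attachable_to` rather than on an exact type, which is roughly the update mentioned above.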
The one thing that I’m not sure I agree with is not supporting global Equipment. The synchrotron runs about 700 experiments per year, and this number will grow significantly. Equipment at the synchrotron has about 8 parameters. This means that the same 8 parameters are going to be added about 70 times each year, which seems very wasteful, and also makes it more work to maintain (each copy could get out of sync).
To implement global ParameterSets we can add the ability to reference a ParameterSet by namespace and parameter value to the parser. We would also have to write garbage collection housekeeping (delete ParameterSets that aren’t referenced by any Experiment / Dataset / Datafile).
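The garbage-collection housekeeping could look something like the following sketch (plain Python standing in for the real ORM queries; all names here are assumptions):

```python
# Delete ParameterSets no longer referenced by any Experiment / Dataset /
# Datafile. Plain-Python sketch of the housekeeping pass; the real version
# would be a database query over the reference tables.
def unreferenced_parameter_sets(parameter_sets, experiments, datasets, datafiles):
    referenced = set()
    for obj in (*experiments, *datasets, *datafiles):
        referenced.update(obj["parameter_sets"])
    # Anything never referenced is a deletion candidate.
    return [ps for ps in parameter_sets if ps not in referenced]

garbage = unreferenced_parameter_sets(
    ["equip-a", "equip-b", "sample-1"],
    experiments=[{"parameter_sets": ["equip-a"]}],
    datasets=[{"parameter_sets": ["sample-1"]}],
    datafiles=[],
)
# "equip-b" has no remaining references, so it is the deletion candidate
```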
There’s a “Test Experiment.mets” file in tardis that can be loaded to provide an example (this won’t be merged into the trunk if we decide to go with this implementation).
Cheers,
Alistair
Thanks for the clarification.
Are you happy with Sample and Equipment as defined in sample-table-impl-b1?
Thanks,
Alistair
> · Schema has a pretty-print name, see Schema.name and
> Schema.displayName(). The experiment view page has been updated to use the
> user friendly name. The experiment view needs to be modified to
> consistently and sensibly order the printing of parameter sets.
Since all the schemas will come under the category of GENERAL, can you
please suggest a sensible method of ordering these parameter sets?
> · I couldn’t really follow Grischa’s example display of Sample and
> Equipment below, but think that what’s in the branch is close.
I think Grischa was suggesting that we could just use the existing
ParameterSet infrastructure.
> The one thing that I’m not sure I agree with is not supporting global
> Equipment. The synchrotron runs about 700 experiments per year, and this
> number will grow significantly. Equipment at the synchrotron has about 8
> parameters. This means that the same 8 parameters are going to be added
> about 70 times each year, which seems very wasteful, and also makes it more
> work to maintain
You are making the assumption that the 8 fields you are storing are
actually useful. I would suggest that only 4 of the fields are
actually useful: make, model, type, serial no. The rest of them are
just adding unnecessary descriptions that will need to be maintained.
I think the 2800 rows per year is not really a big problem, and if
you were to run 10 times the experiments you would still only be
storing 28000 rows per year, which isn't a big deal.
> (each copy could get out of sync).
The most obvious field to go out of date will be the "decommissioning
date", which I don't really see as being very closely associated with
datasets. I guess what I am getting at is that the data which is
supposed to populate these new schemas really isn't related to
data-specific metadata.
> To implement global ParameterSets we can add the ability to reference a
> ParameterSet by namespace and parameter value to the parser. We would also
> have to write garbage collection housekeeping (delete ParameterSets that
> aren’t referenced by any Experiment / Dataset / Datafile).
I don't understand how you are implementing these global ParameterSets;
can you please explain further?
I would also like to ask some questions about the suggested Sample
Schema; it seems to me that many of the fields on that schema are
pointless. Hazard? Safety Information? Isn't it a bit pointless to be
storing that in a datastore that will be accessed well after the
experiment is completed? Also, what does the Sample Table 'Principal
Name' refer to?
Could you also update your docstrings to match the syntax as
described by http://sphinx.pocoo.org/
Thanks,
Russell
--
Russell Sim
Senior Software Developer
Monash ARDC-EIF Data Capture and Metadata Store
Building 75, Clayton Campus, Wellington Road, Clayton, Victoria. 3800
Telephone: +613 9902 0795
Facsimile: +613 9905 9888
Email: russe...@monash.edu
Web: http://www.monash.edu/eresearch
So did I. I went back and tracked down this change; Gerson introduced it on
29 Nov as part of his Sample table implementation.
Gerson, can you provide the reasoning behind the introduction of the Schema
types?
If it was purely to support the Sample and Equipment types, and no one else
wants it, I'm happy to have it backed out.
> Since all the schemas will come under the category of GENERAL, can you
> please suggest a sensible method of ordering these parameter sets?
Probably by Schema.name. Any better suggestions?
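A minimal sketch of ordering by Schema.name, with a namespace fallback for schemas that have no friendly name (illustrative structures, not the real models):

```python
# Order parameter sets by the schema's friendly name, falling back to the
# namespace so unnamed schemas still sort deterministically.
def ordered_parameter_sets(parameter_sets):
    return sorted(
        parameter_sets,
        key=lambda ps: (ps["schema"]["name"] or ps["schema"]["namespace"]).lower(),
    )

psets = [
    {"schema": {"namespace": "http://x/sample", "name": "Sample"}},
    {"schema": {"namespace": "http://x/anon", "name": None}},
    {"schema": {"namespace": "http://x/equipment", "name": "Equipment"}},
]
# Sorts as: Equipment, then the unnamed schema (under "http://x/anon"),
# then Sample, since the fallback key is the raw namespace string.
```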
> I think Grischa was suggesting that we could just use the existing
> ParameterSet infrastructure.
That's what I'm doing... :-)
On Global ParameterSets:
> I don't understand how you are implementing these global ParameterSets;
> can you please explain further?
You suggested making the relationship between ParameterSets and the base
table (Experiment, Dataset, Datafile) a many-to-many relationship, which
Gerson has done. This means that multiple Experiments can reference a
single ExperimentParameterSet, which is what I'm referring to as a Global
ParameterSet.
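As a toy illustration of what the many-to-many link buys (plain Python, not the actual Django models): two Experiments referencing one shared ParameterSet record, so the equipment parameters are stored once rather than duplicated per experiment.

```python
# Toy sketch: with a many-to-many link, several Experiments can reference
# one ParameterSet record, so the equipment parameters are stored once.
shared_equipment = {"make": "Acme", "model": "X1"}   # one stored copy

experiment_links = {            # experiment id -> attached parameter sets
    "exp-001": [shared_equipment],
    "exp-002": [shared_equipment],
}

# Both experiments see the same record; a correction made once is seen by all.
shared_equipment["model"] = "X1-rev2"
assert experiment_links["exp-001"][0]["model"] == "X1-rev2"
assert experiment_links["exp-002"][0] is experiment_links["exp-001"][0]
```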
> I guess what I am getting at is that the data which is
> supposed to populate these new schemas really isn't related to
> data-specific metadata.
I take your point about it not really being data specific metadata, but the
practicality of having accurate data may win out over pure data specific
metadata.
> You are making the assumption that the 8 fields you are storing are
> actually useful. I would suggest that only 4 of the fields are
> actually useful: make, model, type, serial no. The rest of them are
> just adding unnecessary descriptions that will need to be maintained.
> I think the 2800 rows per year is not really a big problem, and if
> you were to run 10 times the experiments you would still only be
> storing 28000 rows per year, which isn't a big deal.
It's not the total additional storage that is significant, but that each
piece of information is duplicated many times (70 in my example).
And Samples:
The safety information (Hazard, etc.) is there because one of the problems
the facilities have is getting accurate sample information (it is usually
just spread throughout log books). A number of facilities have used the
approach of storing the safety information (which has to be kept up to date)
with the metadata to improve the quality of the sample information.
I copied the attributes, including Principal Name, out of the safety
information collected by the Proposals system at the Synchrotron.
> Could you also update your docstrings to match the syntax as
> described by http://sphinx.pocoo.org/
I looked up the web site to figure out how to format definition lists (which
I used for the Field documentation). I'll work my way through the tutorial,
but what in particular needs changing?
Thanks!
Alistair
-----Original Message-----
From: tardis...@googlegroups.com [mailto:tardis...@googlegroups.com]
On Behalf Of Russell Sim
Sent: Tuesday, 7 December 2010 04:43
To: tardis...@googlegroups.com
Subject: Re: On Equipment Metadata (and Sample)
On 08/12/10 05:03, Alistair Grant wrote:
> On 7 December 2010 04:43, Russell wrote:
>
>> Hang on a second, i though that any schema could be applied to any type?
>>
> So did I. I went back and tracked down this change; Gerson introduced it on
> 29 Nov as part of his Sample table implementation.
>
> Gerson, can you provide the reasoning behind the introduction of the Schema
> types?
>
> If it was purely to support the Sample and Equipment types, and no one else
> wants it, I'm happy to have it backed out.
>
Yes, it's been introduced to support Sample and Equipment.
>
>
>> Since all the schemas will come under the category of GENERAL, can you
>> please suggest a sensible method of ordering these parameter sets?
>>
> Probably by Schema.name. Any better suggestions?
>
>
>
>> I think Grischa was suggesting that we could just use the existing
>> ParameterSet infrastructure.
>>
> That's what I'm doing... :-)
>
Grischa was actually trying to suggest that we use the existing
ParameterSet infrastructure (i.e. not make any more changes to support
Sample and Equipment). He was suggesting that if we ever need to provide
sample information, it would be added as experiment/dataset/datafile
parameters through the existing experiment/dataset/datafile ParameterSet
infrastructure.
Now that I've read more about what is going to be stored as sample
metadata, wouldn't it be better for this information to be stored
somewhere else? Won't it make tardis less generic? Will people really be
doing this type of search: "can you find me all the datafiles that
used a sample with this particular hazard"?
>
Hi All,
I think we're very close to reaching consensus on ParameterSets, Samples and Equipment.
I've summarised the major points below. I’ll also update the SampleTable and ParameterSets wiki pages to reflect the proposed implementation once the GoogleCode maintenance (which makes the site read-only) is finished.
Guiding Principles (with some explanatory comments):
· To keep federation straightforward, Experiments in TARDIS should be self-contained entities that can be copied from one instance of TARDIS to another.
o This means that we shouldn’t implement what I was referring to as global ParameterSets, i.e. ParameterSets that are referenced from more than one Experiment.
· TARDIS’s core task is to store metadata about experimental data, i.e. it isn’t about storing lots of static information about equipment.
o Knowing which equipment was used to record data is still useful, however it should be minimal, such as an identifier of the equipment and a reference to where more information is available.
o If there isn’t an equipment register, a small equipment register application could be co-hosted with TARDIS and referenced from within the Equipment metadata, as described above.
· TARDIS will store metadata that reflects the state as it existed when the metadata was recorded.
o I.e. in general we’re not about going back and updating information if it becomes out of date.
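As an illustration of the minimal-equipment-metadata principle above, an Equipment ParameterSet might carry little more than an identifier and a pointer to the register (every name and URL below is made up for the sketch):

```python
# Hypothetical minimal Equipment ParameterSet: identify the instrument and
# point at the authoritative register instead of duplicating its details.
equipment_pset = {
    "schema": "http://tardis.example/schemas/equipment-ref",   # made-up namespace
    "parameters": {
        "equipment_id": "SAXS-DET-01",                         # made-up identifier
        "register_url": "http://registry.example/equipment/SAXS-DET-01",
    },
}
# Static facts (make, model, decommissioning date, ...) live in the register,
# not in TARDIS, so there is nothing here to go stale.
```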
Design Decisions:
· Sample and Equipment information should be stored using the general ParameterSet functionality, i.e. there’s no code specific to Samples or Equipment. We will be extending the general ParameterSet functionality as described in the following bullet points.
· We want to be able to store multiple items of Equipment against a single Experiment / Dataset / Datafile. This is possible using the existing ParameterSet functionality.
· We want to be able to share Samples between Datasets or Datafiles. This requires the implementation of the many-to-many relationship between Datasets and Datafiles and their associated ParameterSets. Note that it doesn’t extend to Experiments, see the guiding principle on Experiment encapsulation above.
· Schema will get a user-friendly name that will be displayed on the View Experiment page and used for sorting ParameterSets.
The design as proposed above will of course still allow administrators to define extended Sample or Equipment definitions if they wish, e.g. storing sample hazard information. That is a purely local decision about what is important metadata.
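The design decisions above can be put together in toy form (plain Python, illustrative names only): Samples and Equipment are ordinary ParameterSets, and a Sample set may be shared by several Datasets within one self-contained Experiment.

```python
# Toy sketch of the agreed model: Samples and Equipment are plain
# ParameterSets; a Sample set may be shared by several Datasets, but all
# within a single self-contained Experiment.
sample = {"schema": "Sample", "parameters": {"name": "crystal-42"}}

experiment = {
    "parameter_sets": [
        {"schema": "Equipment", "parameters": {"make": "Acme", "serial": "007"}},
    ],
    "datasets": [
        {"id": "ds1", "parameter_sets": [sample]},   # the same Sample object...
        {"id": "ds2", "parameter_sets": [sample]},   # ...attached to two Datasets
    ],
}

ds1, ds2 = experiment["datasets"]
assert ds1["parameter_sets"][0] is ds2["parameter_sets"][0]
```

Because everything hangs off one `experiment` record, copying the experiment to another TARDIS instance carries the shared Sample with it, which is the encapsulation property the first guiding principle asks for.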
Thanks,
Alistair
-----Original Message-----
From: tardis...@googlegroups.com [mailto:tardis...@googlegroups.com] On Behalf Of Steve Androulakis
Sent: Wednesday, 8 December 2010 00:36
To: tardis-devel
Subject: Re: On Equipment Metadata (and Sample)
Actually, I favour generic in this case. I agree - I don't see people
· TARDIS will store metadata that reflects the state as it existed when the metadata was recorded.
o I.e. in general we’re not about going back and updating information if it becomes out of date.
Hi Ian,
I haven’t seen the functionality being added in EIF019, but I believe that it will address your concerns, i.e. being able to correct mistakes, make annotations, etc. Steve or Russell might give more details.
I was referring to the case where information becomes out of date over a “long” time, e.g. a piece of equipment is decommissioned, we won’t go back and update metadata to reflect the fact that it is now decommissioned. It was operating when the metadata was entered.
Cheers,
Alistair
From: tardis...@googlegroups.com [mailto:tardis...@googlegroups.com] On Behalf Of Ian Thomas
Sent: Wednesday, 8 December 2010 23:55
To: tardis...@googlegroups.com
Subject: Re: On Equipment Metadata (and Sample)
Hi all,
Hi All,
I agree, and have already dropped the global parameter set idea, i.e. sharing ParameterSets between Experiments.
What I would still like to see is the ability to attach one ParameterSet to multiple Datasets or Datafiles within an Experiment.
Experiments will still be self-contained. See also:
· The ParameterSets page on the wiki
· Models.py r541 in sample-table-impl-b1.
Note that code still needs to be added to DatasetParameterSet and DatafileParameterSet to enforce the business logic that ParameterSets may not be attached to multiple Experiments. This isn’t a problem for ExperimentParameterSets, as they may only be attached to a single Experiment.
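A sketch of that business-logic check (plain Python; the names are my assumptions, not the branch’s actual code):

```python
# Enforce: a shared Dataset ParameterSet must stay within one Experiment.
def owning_experiments(pset, dataset_to_experiment):
    # The set of Experiments reachable from the datasets this set is attached to.
    return {dataset_to_experiment[d] for d in pset["datasets"]}

def check_single_experiment(pset, dataset_to_experiment):
    if len(owning_experiments(pset, dataset_to_experiment)) > 1:
        raise ValueError("ParameterSet may not span multiple Experiments")

dataset_to_experiment = {"ds1": "exp1", "ds2": "exp1", "ds3": "exp2"}
check_single_experiment({"datasets": ["ds1", "ds2"]}, dataset_to_experiment)  # ok
# {"datasets": ["ds1", "ds3"]} would raise: it spans exp1 and exp2.
```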
Grischa wrote:
Given the low amount of data we are going to store, i.e. fewer than a million rows, we do not really need to worry about storing potentially duplicate data.
We’re expecting the DatafileParameter table to grow to over 500 million records over time.
Thanks,
Alistair