On Equipment Metadata (and Sample)


Steve Androulakis

Dec 5, 2010, 7:21:13 PM
to tardis-devel
There's been a lot of talk about the storage of equipment and sample
metadata. Here's my attempt to define what we can assume so we can move on.

Aspects I'm aware of about equipment:
- Many pieces of equipment can be associated with an experiment/
dataset. I know of specific cases where this is true already.
- Equipment could realistically be reported at any object level
(experiment/dataset/datafile). Though I think datafile is going too
far?
- Equipment strikes me as being more of a schema (in the parameter
model) than its own model. This seems to be the current line of thinking.
- If we're making equipment a core schema and tying it to the 'create
experiment' interface and other data entry points as semi-required
information, then the required minimum information should be small.
- The absolute minimum information I can think of is Name (e.g. the
software PHASER, or the instrument Rigaku R-Axis) and possibly Version.
- Optional information could include a serial number.
- I'm not claiming to know all the minimum and optional information,
but this is leading up to a point...
- I'm starting to doubt the need for equipment to exist on its own and
be referenced from multiple objects by key, if the information we're
storing about equipment is small enough.
- Could it be enough to simply use parameters and sets as they are now
for equipment?
- Filtering/searching and easier selection could be done with
DISTINCT, i.e. get me all DISTINCT equipment names. You could then
select one of the distinct names and use DISTINCT on version to pick a
version number, or simply type it in and filter based on that.
Potentially bad for performance?
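The DISTINCT idea above can be sketched in plain SQL. Here's a minimal sketch using an in-memory SQLite table standing in for the parameter tables; the table and column names are made up for illustration, not TARDIS's actual schema:

```python
import sqlite3

# Hypothetical flattened parameter table standing in for
# experiment/dataset parameters; NOT the real TARDIS schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE parameter (
    schema_ns TEXT, name TEXT, string_value TEXT)""")
rows = [
    ("equipment", "name", "Rigaku R-Axis"),
    ("equipment", "name", "PHASER"),
    ("equipment", "name", "PHASER"),  # duplicates collapse under DISTINCT
    ("equipment", "version", "2.1"),
    ("equipment", "version", "2.3"),
]
conn.executemany("INSERT INTO parameter VALUES (?, ?, ?)", rows)

# "Get me all DISTINCT equipment names" for a selection widget.
names = [r[0] for r in conn.execute(
    """SELECT DISTINCT string_value FROM parameter
       WHERE schema_ns = 'equipment' AND name = 'name'
       ORDER BY string_value""")]
print(names)  # -> ['PHASER', 'Rigaku R-Axis']
```

A second query with `name = 'version'` filtered on the chosen equipment name would drive the version selector the same way.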
- Equipment used and its specific configuration are two separate
things. Specific configurations should supplement equipment
definitions if needed and be in their own custom schemas, no need to
make that part of the core due to the large potential variance in
configurations to be reported.
- In summation: If these assumptions are all true, then no models
would need to be changed, a basic equipment schema would be created,
and we'd all have to reflect the required nature of the models
(required/optional fields) in our business logic and cleverly create
filters to give users the ability to filter on the equipment
information they want.

Does Sample have similar aspects from a model/logic design perspective?
I think it does, though you'd definitely give people the ability to
annotate datafiles, whereas I'm not sure for Equipment.

Gerson Galang

Dec 5, 2010, 8:19:40 PM
to tardis...@googlegroups.com
Sounds like there are going to be a few more email exchanges about this
topic, so it's probably better if I work on implementing other tardis
functionality and come back to this one once we decide how we should
implement the equipment and sample support.

Cheers

Steve Androulakis

Dec 5, 2010, 8:24:15 PM
to tardis-devel
Sounds good to me!

Steve

Grischa

Dec 6, 2010, 12:36:14 AM
to tardis-devel
I had a look at the currently implemented way of handling metadata,
schemas etc. (last week's merc-sprint1 branch).
It already seems very flexible and usable for any conceivable purpose.
In my opinion it should get two more features:
1. The schema table should get a field for a pretty-print name that
makes sense to the user. Eg. "Beamline details" or "Crystallisation
conditions".
2. The schema table should get a "type" field, by which schema
displays can be grouped.

After doing that the display of a made-up dataset with three schemas
attached could look like this (in parentheses the internal
representation):

EQUIPMENT (following this comes a list of schemas of type "equipment")
Beamline MX1 (schema: http://tardis.edu.au/protein-xtal/beamlines)
- parameter 1: 123 (dataset_parameter entry linked to the dataset
parameterset defined with above schema)
- parameter 2: abc (same)
Cryo-system (schema: http://tardis.edu.au/protein-xtal/cryo)
- temperature: 100 deg C (dataset_parameter etc..)
- gas: Helium (same)

SAMPLE (following this comes a list of schemas of type "sample")
Protein Crystal (schema: http://tardis.edu.au/protein-xtal/crystals)
- name: my crystal
- number: 57

or any other permutation of those.
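The grouped display above could be produced with a simple group-and-render pass over the schemas. A sketch follows; the `type` and `pretty_name` fields are the two proposed additions (not existing model fields), and the data mirrors the made-up dataset above:

```python
from itertools import groupby

# Proposed extra Schema fields: "type" for grouping and a pretty-print
# name. Values below are illustrative, mirroring the mock-up display.
schemas = [
    {"type": "equipment", "pretty_name": "Beamline MX1",
     "namespace": "http://tardis.edu.au/protein-xtal/beamlines",
     "params": {"parameter 1": "123", "parameter 2": "abc"}},
    {"type": "equipment", "pretty_name": "Cryo-system",
     "namespace": "http://tardis.edu.au/protein-xtal/cryo",
     "params": {"temperature": "100 deg C", "gas": "Helium"}},
    {"type": "sample", "pretty_name": "Protein Crystal",
     "namespace": "http://tardis.edu.au/protein-xtal/crystals",
     "params": {"name": "my crystal", "number": "57"}},
]

def render(schemas):
    """Group parameter sets by schema type, then print each set's
    pretty name, namespace and parameters."""
    lines = []
    for stype, group in groupby(sorted(schemas, key=lambda s: s["type"]),
                                key=lambda s: s["type"]):
        lines.append(stype.upper())
        for schema in group:
            lines.append("%s (schema: %s)" % (schema["pretty_name"],
                                              schema["namespace"]))
            for name, value in schema["params"].items():
                lines.append("- %s: %s" % (name, value))
    return "\n".join(lines)

print(render(schemas))
```

The real view would render the same grouping through a Django template rather than string building, but the shape is the same.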

By adding this we support any of the use cases while preserving the
existing model except for the two extra fields.
I also agree with the use of DISTINCT.

Cheers



Alistair Grant

Dec 6, 2010, 4:01:26 PM
to tardis...@googlegroups.com

Hi All,

 

I've created a new branch, sample-table-impl-b1, which supports Samples and Equipment largely as described below:

- I've added a Schema type of GENERAL to allow Schemas to be defined that can be attached to any of Experiment, Dataset or Datafile. This would be used to allow Samples, and possibly Equipment, to exist at any level. The search routines will need to be updated to handle GENERAL schemas. I don't know if the METS parser needs modification.
- I'm not sure what Steve means by "making Equipment core schema ... and tying it to the create experiment interface". There isn't any need to do anything special for Equipment (or Samples). For the synchrotron we would define Equipment with the same set of parameters as defined in the (to be removed) Equipment table (I've put this in initial_data.json, but it is Synchrotron specific).
- Schema has a pretty-print name, see Schema.name and Schema.displayName(). The experiment view page has been updated to use the user-friendly name. The experiment view needs to be modified to consistently and sensibly order the printing of parameter sets.
- I couldn't really follow Grischa's example display of Sample and Equipment below, but think that what's in the branch is close.

The one thing that I'm not sure I agree with is not supporting global Equipment. The synchrotron runs about 700 experiments per year, and this number will grow significantly. Equipment at the synchrotron has about 8 parameters. This means that the same 8 parameters are going to be added about 70 times each year, which seems very wasteful, and also makes it more work to maintain (each copy could get out of sync).

 

To implement global ParameterSets we can add the ability to reference a ParameterSet by namespace and parameter value to the parser.  We would also have to write garbage collection housekeeping (delete ParameterSets that aren’t referenced by any Experiment / Dataset / Datafile).
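The garbage-collection housekeeping could be a periodic sweep that deletes any ParameterSet no object references. A sketch over plain dicts follows; in the real system this would be a Django ORM query, and all names here are illustrative:

```python
# ParameterSets keyed by id; each Experiment/Dataset/Datafile lists the
# ParameterSet ids it references (illustrative data, not the real ORM).
parameter_sets = {1: "beamline params", 2: "cryo params", 3: "orphaned"}
experiments = [{"parameter_sets": [1]}]
datasets = [{"parameter_sets": [1, 2]}]
datafiles = []

def sweep_unreferenced(parameter_sets, *object_lists):
    """Delete ParameterSets not referenced by any object."""
    referenced = set()
    for objects in object_lists:
        for obj in objects:
            referenced.update(obj["parameter_sets"])
    for ps_id in list(parameter_sets):
        if ps_id not in referenced:
            del parameter_sets[ps_id]

sweep_unreferenced(parameter_sets, experiments, datasets, datafiles)
print(sorted(parameter_sets))  # -> [1, 2]; the orphan (3) is gone
```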

 

There's a "Test Experiment.mets" file in tardis that can be loaded to provide an example (this won't be merged into the trunk if we decide to go with this implementation).

 

 

Cheers,

Alistair

Steve Androulakis

Dec 6, 2010, 4:06:45 PM
to tardis-devel
> - I'm not sure what Steve means by "making Equipment core schema ...
> and tying it to the create experiment interface".  There isn't any need to
> do anything special for Equipment (or Samples).  For the synchrotron we
> would define Equipment with the same set of parameters as defined in the (to
> be removed) Equipment table (I've put this in initial_data.json, but it is
> Synchrotron specific).

Core schema = in the initial JSON fixture in the core codebase.

Tying it to the create experiment interface was perhaps not
descriptive enough. I mean making it prominent. There'll probably be
the ability to create an experiment and choose any schema to add
parameters to any object, but I think equipment is important enough
and central enough to appear in its own section on the page. Just a
visual thing, not talking about internal logic. :)

Alistair Grant

Dec 6, 2010, 4:32:15 PM
to tardis...@googlegroups.com
Hi Steve,

Thanks for the clarification.

Are you happy with Sample and Equipment as defined in sample-table-impl-b1?

Thanks,
Alistair

Steve Androulakis

Dec 6, 2010, 4:54:34 PM
to tardis-devel
I like the Sample/Equipment implementations. Simple and good!

I understand what you mean re: global equipment and the synchrotron, and
I think that's fine. Anyone else got an issue with global equipment
definitions referenced by key?

Russell Sim

Dec 6, 2010, 10:42:52 PM
to tardis...@googlegroups.com
On 7 December 2010 08:01, Alistair Grant <akg...@gmail.com> wrote:
> - I've added a Schema type of GENERAL to allow Schema to be defined
> that can be attached to any of Experiment, Dataset or Datafile.  This would
> be used to allow Samples, and possibly Equipment, to exist at any level.

Hang on a second, I thought that any schema could be applied to any type?

> - Schema has a pretty-print name, see Schema.name and
> Schema.displayName().  The experiment view page has been updated to use the
> user friendly name.  The experiment view needs to be modified to
> consistently and sensibly order the printing of parameter sets.

Since all the schemas will come under the category of GENERAL, can you
please suggest a sensible method of ordering these parameter sets?

> - I couldn't really follow Grischa's example display of Sample and
> Equipment below, but think that what's in the branch is close.

I think Grischa was suggesting that we could just use the existing
ParameterSet infrastructure.

> The one thing that I’m not sure I agree with is not supporting global
> Equipment.  The synchrotron runs about 700 experiments per year, and this
> number will grow significantly.  Equipment at the synchrotron has about 8
> parameters.  This means that the same 8 parameters are going to be added
> about 70 times each year, which seems very wasteful, and also makes it more
> work to maintain

You are making the assumption that the 8 fields you are storing are
actually useful. I would suggest that only 4 of the fields are
actually useful: make, model, type, serial no. The rest of them are
just adding unnecessary descriptions that will need to be maintained.
I think the 2800 rows per year is not really a big problem, and if
you were to run 10 times the experiments you would still only be
storing 28000 rows per year, which isn't a big deal.

> (each copy could get out of sync).

The most obvious field to go out of date will be the "decommissioning
date", which I don't really see as being very closely associated with
datasets? I guess what I am getting at is that the data which is
supposed to populate these new schemas really isn't data-specific
metadata.

> To implement global ParameterSets we can add the ability to reference a
> ParameterSet by namespace and parameter value to the parser.  We would also
> have to write garbage collection housekeeping (delete ParameterSets that
> aren’t referenced by any Experiment / Dataset / Datafile).

I don't understand how you are implementing these global ParameterSets;
can you please explain further?

I would also like to ask some questions about the suggested Sample
Schema; it seems to me that many of the fields on that schema are
pointless. Hazard? Safety Information? Isn't it a bit pointless to be
storing that in a datastore that will be accessed well after the
experiment is completed? Also, what does the Sample Table 'Principal
Name' refer to?

Could you also update your doc strings to match the syntax as
described by http://sphinx.pocoo.org/

Thanks,
Russell

--
Russell Sim
Senior Software Developer
Monash ARDC-EIF Data Capture and Metadata Store
Building 75, Clayton Campus, Wellington Road, Clayton, Victoria. 3800
Telephone: +613 9902 0795
Facsimile: +613 9905 9888
Email: russe...@monash.edu
Web: http://www.monash.edu/eresearch

Alistair Grant

Dec 7, 2010, 1:03:34 PM
to tardis...@googlegroups.com
On 7 December 2010 04:43, Russell wrote:
> Hang on a second, I thought that any schema could be applied to any type?

So did I. I went back and tracked down this change; Gerson introduced it on
29 Nov as part of his Sample table implementation.

Gerson, can you provide the reasoning behind the introduction of the Schema
types?

If it was purely to support the Sample and Equipment types, and no one else
wants it, I'm happy to have it backed out.


> Since all the schemas will come under the category of GENERAL can you
> please suggest a sensible method ordering these parameter sets?

Probably by Schema.name. Any better suggestions?
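Ordering by Schema.name is a one-liner wherever the parameter sets are rendered. A sketch; the namespace as a secondary key is my assumption here, added only to keep the order stable when display names collide:

```python
# Parameter sets tagged with their schema's display name and namespace
# (illustrative values; TARDIS would pull these from the Schema model).
parameter_sets = [
    {"schema_name": "Cryo-system", "namespace": "ns/cryo"},
    {"schema_name": "Beamline details", "namespace": "ns/beamlines"},
    {"schema_name": "Beamline details", "namespace": "ns/beamlines2"},
]

# Sort by display name, then namespace as a deterministic tie-breaker.
ordered = sorted(parameter_sets,
                 key=lambda ps: (ps["schema_name"], ps["namespace"]))
print([ps["schema_name"] for ps in ordered])
# -> ['Beamline details', 'Beamline details', 'Cryo-system']
```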


> I think Grischa was suggesting that we could just use the existing
> ParameterSet infrastructure.

That's what I'm doing... :-)


On Global ParameterSets:

> I don't understand how you are implementing these global ParameterSets
> can you please explain further?

You suggested making the relationship between ParameterSets and the base
table (Experiment, Dataset, Datafile) a many-to-many relationship, which
Gerson has done. This means that multiple Experiments can reference a
single ExperimentParameterSet, which is what I'm referring to as a Global
ParameterSet.


> I guess what i am getting at is that the data which is
> supposed to populate these new schema really isn't related to data
> specific metadata.

I take your point about it not really being data specific metadata, but the
practicality of having accurate data may win out over pure data specific
metadata.

> You are making the assumption that the 8 fields you are storing are
> actually useful. I would suggest that only 4 of the fields are
> actually useful: make, model, type, serial no. The rest of them are
> just adding unnecessary descriptions that will need to be maintained.
> I think for the 2800 rows per year is not really a big problem and if
> you were to run 10 times the experiments then you would still only be
> storing 28000 rows per year which isn't a big deal.

It's not the total additional storage that is significant, but that each
piece of information is duplicated many times (70 in my example).

And Samples:

The safety information (Hazard, etc.) is there because one of the problems
the facilities have is getting accurate sample information (it is usually
just spread throughout log books). A number of facilities have used the
approach of storing the safety information (which has to be kept up to date)
with the metadata to improve the quality of the sample information.

I copied the attributes, including Principal Name, out of the safety
information collected by the Proposals system at the Synchrotron.

> Could you also update your doc strings to match the syntax as
> described by http://sphinx.pocoo.org/

I looked up the web site to figure out how to format definition lists (which
I used for the Field documentation). I'll work my way through the tutorial,
but what in particular needs changing?


Thanks!
Alistair



Gerson Galang

Dec 7, 2010, 5:33:52 PM
to tardis...@googlegroups.com
Please see my comments interleaved below...

On 08/12/10 05:03, Alistair Grant wrote:
> On 7 December 2010 04:43, Russell wrote:
>
>> Hang on a second, I thought that any schema could be applied to any type?
>>
> So did I. I went back and tracked down this change, Gerson introduced it on
> 29 Nov as part of his Sample table implementation.
>
> Gerson, can you provide the reasoning behind the introduction of the Schema
> types?
>
> If it was purely to support the Sample and Equipment types, and no one else
> wants it, I'm happy to have it backed out.
>

Yes it's been introduced to support sample and equipment.


>
>
>> Since all the schemas will come under the category of GENERAL can you
>> please suggest a sensible method ordering these parameter sets?
>>
> Probably by Schema.name. Any better suggestions?
>
>
>
>> I think Grisha was suggesting that we could just use the existing
>> ParameterSet infrastructure.
>>
> That's what I'm doing... :-)
>

Grischa was actually trying to suggest that we use the existing
parameterset infrastructure (i.e. not make any more changes to support
sample and equipment). He was suggesting that if we ever need to provide
sample information, it should be added as experiment/dataset/datafile
parameters through the experiment/dataset/datafile parameterset
infrastructure.

Now that I've read more about what is going to be stored as sample
metadata, wouldn't it be better for this information to be stored
somewhere else? Won't it make tardis less generic? Will people really be
doing this type of search: "can you find me all the datafiles that have
used a sample that has this particular hazard"?

Steve Androulakis

Dec 7, 2010, 6:35:59 PM
to tardis-devel
Actually, I favour generic in this case. I agree - I don't see people
searching on datafiles that have used a particular sample that
contains a specific hazard. However, I do anticipate people wanting to
_see_ it reported when they view a sample. The existing infrastructure
can do that well in my view.

Alistair Grant

Dec 8, 2010, 2:39:37 PM
to tardis...@googlegroups.com

Hi All,

 

I think we're very close to reaching consensus on ParameterSets, Samples and Equipment.

 

I've summarised the major points below. I'll also update the SampleTable and ParameterSets wiki pages to reflect the proposed implementation once the GoogleCode maintenance (which makes the site read-only) is finished.

 

Guiding Principles (with some explanatory comments):

 

- To keep federation straightforward, Experiments in TARDIS should be self-contained entities that can be copied from one instance of TARDIS to another.
  - This means that we shouldn't implement what I was referring to as global ParameterSets, i.e. ParameterSets that are referenced from more than one Experiment.
- TARDIS's core task is to store metadata about experimental data, i.e. it isn't about storing lots of static information about equipment.
  - Knowing which equipment was used to record data is still useful; however, it should be minimal, such as an identifier of the equipment and a reference to where more information is available.
  - If there isn't an equipment register, a small equipment register application could be co-hosted with TARDIS and referenced from within the Equipment metadata, as described above.
- TARDIS will store metadata that reflects the state as it existed when the metadata was recorded.
  - I.e. in general we're not about going back and updating information if it becomes out of date.

Design Decisions:

- Sample and Equipment information should be stored using the general ParameterSet functionality, i.e. there's no code specific to Samples or Equipment. We will be extending the general ParameterSet functionality as described in the following bullet points.
- We want to be able to store multiple items of Equipment against a single Experiment / Dataset / Datafile. This is possible using the existing ParameterSet functionality.
- We want to be able to share Samples between Datasets or Datafiles. This requires the implementation of the many-to-many relationship between Datasets and Datafiles and their associated ParameterSets. Note that it doesn't extend to Experiments; see the guiding principle on Experiment encapsulation above.
- Schema will get a user-friendly name that will be displayed on the View Experiment page and used for sorting ParameterSets.

The design as proposed above will of course still allow administrators to define extended Sample or Equipment definitions if they wish, e.g. storing sample hazard information. That is a purely local decision about what is important metadata.

 

 

Thanks,

Alistair

 

 


Ian Thomas

Dec 8, 2010, 5:54:47 PM
to tardis...@googlegroups.com
Hi all,

[snip] 

- TARDIS will store metadata that reflects the state as it existed when the metadata was recorded.
  - I.e. in general we're not about going back and updating information if it becomes out of date.


Does this address the situation where the recorded metadata is incorrect and needs to be corrected after the fact?
This might occur with automatic ingestion, but more so with the create_experiment staging area function, where users
may have to enter data by hand and may make mistakes.

Ian

Alistair Grant

Dec 8, 2010, 6:06:52 PM
to tardis...@googlegroups.com

Hi Ian,

 

I haven’t seen the functionality being added in EIF019, but I believe that it will address your concerns, i.e. being able to correct mistakes, make annotations, etc.  Steve or Russell might give more details.

 

I was referring to the case where information becomes out of date over a “long” time, e.g. a piece of equipment is decommissioned, we won’t go back and update metadata to reflect the fact that it is now decommissioned.  It was operating when the metadata was entered.

 

Cheers,

Alistair

 


Grischa

Dec 8, 2010, 6:46:20 PM
to tardis...@googlegroups.com
Hi all,

Gerson was correct in his interpretation of what I meant.
Initially, I had thought many-to-many would be good. However, I then played out the following likely scenario, which also touches on Ian's comment:
User creates experiment 1, creates new parameterset.
User creates experiment 2, attaches same parameterset.
User realises experiment 1 needs additional/different parameters and changes them.
Without intention the user would have changed data of experiment 2.
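The scenario above is the classic shared-mutable-state hazard, and it can be demonstrated in a few lines (plain dicts standing in for parameter sets, purely for illustration, not the actual models):

```python
# A shared parameterset attached to two experiments (dicts standing in
# for database rows; under many-to-many both rows point at one set).
shared_ps = {"temperature": "100 deg C", "gas": "Helium"}
experiment1 = {"parameter_sets": [shared_ps]}
experiment2 = {"parameter_sets": [shared_ps]}  # same object attached

# User edits experiment 1's parameters...
experiment1["parameter_sets"][0]["gas"] = "Nitrogen"

# ...and experiment 2 silently changes too.
print(experiment2["parameter_sets"][0]["gas"])  # -> Nitrogen
```

Avoiding this either means copying the parameterset on edit or forbidding sharing in the first place, which is what the one-to-many design does.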

Given the low amount of data we are going to store, i.e. fewer than a million rows, we do not really need to worry about storing potentially duplicate data.

Also, for schemas a grouping field may be more robust than a schema naming convention that relies on the user/admin reading/following the documentation.

Cheers

Alistair Grant

Dec 9, 2010, 4:57:58 AM
to tardis...@googlegroups.com

Hi All,

 

I agree, and have already dropped the global parameter set idea, i.e. sharing ParameterSets between Experiments.

 

What I would still like to see is the ability to attach one ParameterSet to multiple Datasets or Datafiles within an Experiment.

 

Experiments will still be self-contained.  See also:

 

- The ParameterSets page on the wiki

- Models.py r541 in sample-table-impl-b1.
Note that for DatasetParameterSet and DatafileParameterSet, code still needs to be added to enforce the business logic that ParameterSets are not allowed to be attached to multiple Experiments. It isn't a problem for ExperimentParameterSets, as they may only be attached to a single Experiment.
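That business-logic check could live in the model's validation step; a sketch of the invariant over plain data (in the real branch this would be a Django clean() method or signal, and the structures and names here are illustrative):

```python
# Each Dataset records its owning experiment and the parameterset ids
# it references (illustrative structures, not the real models).
datasets = [
    {"experiment": "exp1", "parameter_sets": [10]},
    {"experiment": "exp1", "parameter_sets": [10, 11]},  # sharing is fine
    {"experiment": "exp2", "parameter_sets": [10]},      # violates the rule
]

def check_single_experiment(datasets):
    """Return parameterset ids attached to datasets of more than one
    Experiment, which the business logic must forbid."""
    owners = {}
    for ds in datasets:
        for ps_id in ds["parameter_sets"]:
            owners.setdefault(ps_id, set()).add(ds["experiment"])
    return sorted(ps for ps, exps in owners.items() if len(exps) > 1)

print(check_single_experiment(datasets))  # -> [10]
```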

 

Grischa wrote:

Given the low amount of data we are going to store, i.e. fewer than a million rows, we do not really need to worry about storing potentially duplicate data.

 

We’re expecting the DatafileParameter table to grow to over 500 million records over time.

 

Thanks,

Alistair

Grischa

Dec 9, 2010, 6:39:35 PM
to tardis...@googlegroups.com
Oh, I made an error due to the ambiguity of the term "experiment"... In my thinking a Tardis-experiment may consist of several real experiments, but each real experiment would be stored in a dataset. Apologies for the confusion.

If you replace "experiment" with "dataset" in my example, you can see that there is still a problem with many-to-many.

Also, if there is concern about duplicated information at the file level, why not store common metadata at the dataset level? It would automatically apply to all attached files.

Cheers