Hierarchy of data in HIRI

Guillaume Prévost

Oct 15, 2013, 11:57:36 PM

to tardis...@googlegroups.com

Hi,

We are starting to set up a myTardis platform for the micro-well equipment of the HIRi lab (Health Innovation Research Institute) at RMIT, mostly by creating new filters that extract metadata from the files produced by these instruments (Flexstation III). After analysing the data files HIRi sent us a little further in order to work on a myTardis filter, I noticed a conceptual issue in the hierarchy of the data that will need to be sorted out before development of that filter starts.

As you know, the hierarchy of the data saved in myTardis is as follows:

-> Experiment
   |-> Dataset
       |-> Datafile

From the files we have received, it seems that one data file can relate to more than one experiment and contain more than one dataset.

For example, in the XML format export, which is the easiest to read:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<microplateDoc xmlns="http://moleculardevices.com/microplateML">
  <fileVersion>2.1.0</fileVersion>
  <experimentSection sectionName="BasicEndpoint"/>
  <experimentSection sectionName="Exp01">
    <plateSection>
      ...
    </plateSection>
  </experimentSection>
</microplateDoc>

This is also true for the PDA format, a proprietary format that is the standard output of the software (SoftMax Pro) used to work with the instrument, and which is the one we'll need to work on.

When uploading one PDA file to myTardis, that single file could relate to more than one experiment or dataset, which would not fit the myTardis model.
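To make the issue concrete, here is a minimal sketch that lists the experiment sections found in one XML export. It uses only the Python standard library and assumes the export looks like the sample above; the function name `list_experiment_sections` and the inline `SAMPLE` are illustrative, and the plate content that was elided in the sample is left empty here.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample export above.
NS = {"mp": "http://moleculardevices.com/microplateML"}

# Trimmed copy of the sample; the plate content is elided in the original.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<microplateDoc xmlns="http://moleculardevices.com/microplateML">
  <fileVersion>2.1.0</fileVersion>
  <experimentSection sectionName="BasicEndpoint"/>
  <experimentSection sectionName="Exp01">
    <plateSection/>
  </experimentSection>
</microplateDoc>"""


def list_experiment_sections(xml_bytes):
    """Return (sectionName, number of plateSection children) per experiment section."""
    root = ET.fromstring(xml_bytes)
    result = []
    for section in root.findall("mp:experimentSection", NS):
        plates = section.findall("mp:plateSection", NS)
        result.append((section.get("sectionName"), len(plates)))
    return result


if __name__ == "__main__":
    for name, plate_count in list_experiment_sections(SAMPLE):
        print(name, plate_count)
```

A single file yielding more than one entry here is exactly the case that does not map onto one experiment and one dataset.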

In the best case, if each datafile relates to a single experiment & dataset, we would still end up with a hierarchy where each experiment has a single dataset containing a single datafile.

For example, for N data files:

-> Experiment 1
   |-> Dataset 1.1
       |-> Datafile 1.1.1
-> Experiment 2
   |-> Dataset 2.1
       |-> Datafile 2.1.1
...
-> Experiment N
   |-> Dataset N.1
       |-> Datafile N.1.1

instead of the classic myTardis hierarchy where each experiment has several datasets and each dataset has several datafiles.

Can you see a way to handle this cleanly, hopefully without having to ask the researchers to limit the output to one experiment per datafile (they may come up with datafiles containing several experiments)? Has anyone faced this type of issue before? If so, how was it addressed or resolved?

Thanks for your help,

Guillaume

Steve Bennett

Oct 16, 2013, 1:16:45 AM

to tardis...@googlegroups.com

Hi Guillaume,

> From the files we have received, it seems that one data file can relate to more than one experiment and contain more than one dataset.

Can you give an example of a file relating to more than one experiment? Typically in the projects I've worked on, the MyTardis "experiment" really corresponds to something more like "project" for the researcher, so it would never happen that a file would relate to two at once.

> For example, in the XML format export, which is the easiest to read:
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
> <microplateDoc xmlns="http://moleculardevices.com/microplateML">
>   <fileVersion>2.1.0</fileVersion>
>   <experimentSection sectionName="BasicEndpoint"/>
>   <experimentSection sectionName="Exp01">
>     <plateSection>
>       ...
>     </plateSection>
>   </experimentSection>
> </microplateDoc>
>
> This is also true for the PDA format, a proprietary format that is the standard output of the software (SoftMax Pro) used to work with the instrument, and which is the one we'll need to work on.
>
> When uploading one PDA file to myTardis, that single file could relate to more than one experiment or dataset, which would not fit the myTardis model.
>
> In the best case, if each datafile relates to a single experiment & dataset, we would still end up with a hierarchy where each experiment has a single dataset containing a single datafile.

That's not terrible, although a better approach might be to group all these 'single datafiles' in a single dataset, if that makes sense. Something like:

Experiment
- single-files
-- datafile1
-- datafile2
-- datafile3
- dataset1
-- ...
- dataset2
-- ...

But only if that makes sense to the users.
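As a rough sketch of that grouping (plain Python, not MyTardis code): datasets that would hold exactly one file are folded into a catch-all dataset, and everything else is left alone. The function name `group_single_files` and the input shape (a dict mapping dataset name to a list of file names) are assumptions made for illustration only.

```python
def group_single_files(datasets, catch_all="single-files"):
    """Fold one-file datasets into a single catch-all dataset.

    `datasets` maps a dataset name to its list of datafile names.
    """
    grouped = {}
    singles = []
    for name, files in datasets.items():
        if len(files) == 1:
            # A dataset holding a single file: move the file to the catch-all.
            singles.extend(files)
        else:
            grouped[name] = list(files)
    if singles:
        # Put the catch-all dataset first, as in the layout above.
        grouped = {catch_all: singles, **grouped}
    return grouped


if __name__ == "__main__":
    layout = {
        "a": ["datafile1"],
        "dataset1": ["x", "y"],
        "b": ["datafile2"],
    }
    print(group_single_files(layout))
```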

I think the important thing here is not to be too driven by the structure of the files themselves, but to rearrange or decompose things to give the best result in mytardis. For example, sometimes you come across researchers that work with more than 3 "levels" of data:

- project/experiment/dataset/datafile

- experiment/sample/dataset/datafile

In those cases, we have to flatten them into the mytardis structure, for example:

- project+experiment/dataset/datafile

- experiment/sample+dataset/datafile

But maybe the users prefer:

- experiment+sample/dataset/datafile
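The flattening above amounts to merging two adjacent levels of a four-level path into one. A minimal sketch in plain Python, with a hypothetical helper name `flatten_levels` and "+" as an arbitrary separator:

```python
def flatten_levels(levels, merge_at, sep="+"):
    """Merge levels[merge_at] and levels[merge_at + 1] into a single level."""
    if not 0 <= merge_at < len(levels) - 1:
        raise ValueError("merge_at must point at a level that has a successor")
    merged = sep.join(levels[merge_at:merge_at + 2])
    return levels[:merge_at] + [merged] + levels[merge_at + 2:]


if __name__ == "__main__":
    print(flatten_levels(["project", "experiment", "dataset", "datafile"], 0))
    print(flatten_levels(["experiment", "sample", "dataset", "datafile"], 1))
    print(flatten_levels(["experiment", "sample", "dataset", "datafile"], 0))
```

Which pair to merge is the part to settle with the users; the code is the same either way.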

> For example, for N data files:
>
> -> Experiment 1
>    |-> Dataset 1.1
>        |-> Datafile 1.1.1
> -> Experiment 2
>    |-> Dataset 2.1
>        |-> Datafile 2.1.1
> ...
> -> Experiment N
>    |-> Dataset N.1
>        |-> Datafile N.1.1
>
> instead of the classic myTardis hierarchy where each experiment has several datasets and each dataset has several datafiles.

This looks like it might be more like:

Experiment 1
- Dataset 1.1
-- Datafile 1.1.1
- Dataset 1.2
-- Datafile 1.2.1
...

Ask the users :) Typically, the semantics for "experiment" and "dataset" are:

- experiment: a series of closely related datasets

- dataset: a set of data produced at one time

> Can you see a way to handle this cleanly, hopefully without having to ask the researchers to limit the output to one experiment per datafile (they may come up with datafiles containing several experiments)? Has anyone faced this type of issue before? If so, how was it addressed or resolved?

If I'm not mistaken, the information model was changed to many-to-many, so datasets can belong to several experiments. I'd first check that that's really what's required here, though.
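To illustrate what a dataset belonging to several experiments would mean for a file like the sample, here is a toy sketch in plain Python. It is not the MyTardis model code; the class `Catalogue`, its method names, and the file name `plate-run.pda` are all hypothetical.

```python
from collections import defaultdict


class Catalogue:
    """Toy many-to-many link table between datasets and experiments."""

    def __init__(self):
        self.experiments_of = defaultdict(set)  # dataset name -> experiment names
        self.datasets_of = defaultdict(set)     # experiment name -> dataset names

    def link(self, dataset, experiment):
        # Record the relationship in both directions.
        self.experiments_of[dataset].add(experiment)
        self.datasets_of[experiment].add(dataset)


if __name__ == "__main__":
    catalogue = Catalogue()
    # One dataset (holding one uploaded file) linked to both experiment sections.
    catalogue.link("plate-run.pda", "BasicEndpoint")
    catalogue.link("plate-run.pda", "Exp01")
    print(sorted(catalogue.experiments_of["plate-run.pda"]))
```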