In looking at how to implement the creation of DataCollection objects, we
need to decide a few things.
1. Should they be mutable after creation?
2. Can there be more than one instance with the same internal information?
3. Should they have a unique identifier?
3a. And if so, how unique (inside one BridgeDB instance, or worldwide)?
3b. Do these identifiers persist over time?
4. If BridgeDB is backed by a Database should DataCollection information
be saved in the Database?
For comparison, the answers for a DataSource are:
1. They are mutable (but see next answer)
2. There can never be two instances with similar internal information
(excluding two pointers to the same object)
So if a DataSource is changed, it affects every use of that DataSource
system wide.
3. DataSources have unique identifiers.
3a. They are only unique to one BridgeDB instance, although two BridgeDB
instances using the same data input methods will normally share the same
identifiers.
3b. Identifiers are recreated every time BridgeDB is started (normally
by reading them from some input).
4. There is no code (known to me) that reads and writes DataSources from
a Database.
Although DataSource can be created on the fly based on the sysCode
stored in a file or Database.
My personal opinions for a DataCollection are:
1. Yes DataCollections should be mutable. (See question 3)
2. Yes it MUST be possible to have more than one Instance of a
DataCollections.
2a. DataCollections are opinions: "I think these DataSources map." So
that DataCollection represents my opinion on one data collection.
Another person may also have an opinion on a data collection, stored in
his own DataCollection. These two objects will look the same but are in
fact different real-world things. If we later (in an extension to
BridgeDB) add Provenance, the two will have different Provenance.
3. Yes each DataCollection should have a unique ID.
3a. I am not sure these need to be worldwide unique. URIs are one way of
doing identifiers that helps ensure worldwide uniqueness, but that brings
up the issue of whether the ID should change if the DataCollection is
changed. URIs also bring up the big question of whether they should be
resolvable and, if so, for how long (versioning etc.).
3b. Yes, I think DataCollection IDs should persist over time. Which means
they MUST be stored, generated by some rule, or hard coded in.
4. DataCollection creation, changing and retrieving should (BUT NEVER
MUST) be based on a database.
One way to do this is use a DataCollectionFactory interface. Systems
without a Database will just pass the calls on to DataCollection
Constructors/ methods. Systems with a database will save the information
before passing the call on.
4a. I used this approach for Provenance and would at some point like to
propose a similar approach for DataSources (without making it required).
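
The factory idea in point 4 could be sketched roughly as below. This is
only an illustration under assumptions: the DataCollection fields, the
method names create/rename, and both factory classes are hypothetical
and not existing BridgeDB code, and a plain Map stands in for the
database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the DataCollection class discussed in this thread.
final class DataCollection {
    private final String id;   // persistent identifier (question 3)
    private String name;       // mutable after creation (question 1)

    DataCollection(String id, String name) {
        this.id = id;
        this.name = name;
    }

    String getId() { return id; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

// Hypothetical interface: callers go through the factory instead of
// calling constructors and setters directly.
interface DataCollectionFactory {
    DataCollection create(String id, String name);
    void rename(DataCollection collection, String newName);
}

// System without a database: just pass the calls on.
final class PlainDataCollectionFactory implements DataCollectionFactory {
    @Override
    public DataCollection create(String id, String name) {
        return new DataCollection(id, name);
    }

    @Override
    public void rename(DataCollection collection, String newName) {
        collection.setName(newName);
    }
}

// System with a database: save the information first, then pass the call on.
// The Map is only a placeholder for real storage.
final class StoringDataCollectionFactory implements DataCollectionFactory {
    private final DataCollectionFactory delegate = new PlainDataCollectionFactory();
    final Map<String, String> store = new HashMap<>();

    @Override
    public DataCollection create(String id, String name) {
        store.put(id, name);
        return delegate.create(id, name);
    }

    @Override
    public void rename(DataCollection collection, String newName) {
        store.put(collection.getId(), newName);
        delegate.rename(collection, newName);
    }
}
```

Code that only knows the DataCollectionFactory interface would work the
same way with or without a database behind it, which is what keeps the
database optional.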
--
Christian Brenninkmeijer
University of Manchester
MyGrid team
I'm fine with either approach.
> 2. Can there be more than one instance with the same internal information?
Yes, from my perspective.
> 3. Should they have an Unique Identifier?
Yes.
> 3a And if so how Unique? (Inside one BridgeDB Instance or world wide)?
One instance.
> 3b Do these Identifiers persist over time?
Some do, like all those that originate from identifiers.org.
> 4. If BridgeDB is backed by a Database should DataCollection information be
> saved in the Database?
Martijn?
> My personal opinions for a DataCollection are:
> 1. Yes DataCollections should be mutable. (See question 3)
> 2. Yes it MUST be possible to have more than one Instance of a
> DataCollections.
> 2a. DataCollections are opinions. I think these DataSources map. So that
> DataCollection represent my Opinion on one data collection. Another person
> may also have an opinion on a data collection stored in his own
> DataCollection. These two Object will look the same but are in fact
> different real world things. If we later (in an extension to BridgeDB) add
> Provenance the two will have different Provenance.
I'm actually having a discussion with someone from MIRIAM about what
is a collection :) So, this is very much subjective indeed.
...
Looking forward to Martijn's ideas... I have not seen anything I'm against...
Egon
--
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
Sorry for being away for a while. Please remind me if you need a reply on
something, because I might have missed some things in the past two weeks.
On 07/03/12 14:10, Christian Brenninkmeijer wrote:
> Note: For details of what a DataCollection is please see previous thread.
>
> In looking at how to implement the creation of DataCollection Object we
> need to decide a few things.
>
> 1. Should they be mutable after creation?
I like immutability but as you say DataSource isn't immutable either, so
I guess I can go either way.
> 2. Can there be more than one instance with the same internal information?
It's probably not as critical, but it's nice to do so. A big reason for
enforcing single instances for DataSource is that it's much easier to
test for equality. And this again makes it easier to test for equality
on Xref. This helps especially with usage of java.util.Set, which is a
common use case.
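
To illustrate the point about equality, here is a minimal sketch. The
classes Source and Ref are hypothetical stand-ins for DataSource and
Xref, not the real BridgeDB classes. Because the registry hands out at
most one Source per system code, reference comparison is enough, and a
java.util.Set of Ref objects de-duplicates as expected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DataSource: one instance per system code.
final class Source {
    private static final Map<String, Source> BY_CODE = new HashMap<>();
    private final String sysCode;

    // Private constructor: instances can only come from the registry.
    private Source(String sysCode) {
        this.sysCode = sysCode;
    }

    // Returns the existing instance for this code, creating it on first use.
    static synchronized Source getBySystemCode(String sysCode) {
        return BY_CODE.computeIfAbsent(sysCode, Source::new);
    }

    String getSystemCode() { return sysCode; }
    // equals/hashCode are deliberately not overridden: identity suffices.
}

// Hypothetical stand-in for Xref: an identifier within a Source.
final class Ref {
    private final String id;
    private final Source source;

    Ref(String id, Source source) {
        this.id = id;
        this.source = source;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Ref)) return false;
        Ref other = (Ref) o;
        // Reference comparison is safe because Source instances are unique.
        return id.equals(other.id) && source == other.source;
    }

    @Override
    public int hashCode() {
        return 31 * id.hashCode() + source.hashCode();
    }
}
```

If multiple instances with the same content were allowed, Ref.equals
would instead need a field-by-field comparison of the two Source
objects, and any later mutation of those fields could break Sets that
already contain the Ref.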
> 3. Should they have an Unique Identifier?
If identifiers.org defines one, then you can use that one.
> 3a And if so how Unique? (Inside one BridgeDB Instance or world wide)?
> 3b Do these Identifiers persist over time?
> 4. If BridgeDB is backed by a Database should DataCollection information
> be saved in the Database?
Not sure about this... The information about DataSources is indeed not
in a real "database". However, DataSource info is loaded from a
configuration file, which could be seen as a cache of information from
MIRIAM plus some extras.
So in that sense, DataSource is backed by the MIRIAM registry database.
What is important to me is that however you implement this, it should be
possible to run offline, using a cache (like we're doing right now) or
gracefully falling back to something else. Software that forces you to
work online all the time is impractical.
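
One way to read the offline requirement is sketched below, purely as an
assumption about how it could be structured. SourceInfoLoader and
FallbackLoader are hypothetical names; the primary loader might be an
online registry lookup and the fallback a local configuration file.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical abstraction over "where the DataSource info comes from".
interface SourceInfoLoader {
    List<String> load() throws IOException;
}

// Tries the primary loader first and falls back to the cached copy
// if the primary fails, e.g. because there is no network connection.
final class FallbackLoader implements SourceInfoLoader {
    private final SourceInfoLoader primary;
    private final SourceInfoLoader fallback;

    FallbackLoader(SourceInfoLoader primary, SourceInfoLoader fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public List<String> load() throws IOException {
        try {
            return primary.load();
        } catch (IOException e) {
            // Primary source unavailable: use the fallback instead.
            return fallback.load();
        }
    }
}
```

The rest of the code would only see a SourceInfoLoader, so it would not
need to know whether the data came from the network or from the cache.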
>
> For comparison the answers for a DataSource are
> 1. They are mutable (but see next answer)
> 2. There can never be two instances with similar internal information
> (excluding two pointers to the same object)
> So if a DataSource is changed it affects every use of that DataSource
> system wide.
> 3. DataSources have unique identifiers.
That depends on your point of view... You could call
DataSource.getByFullName("Entrez") and then
DataSource.getByFullName("Entrez gene") and end up with two objects.