Fwd: datum, data set, data structure

Chris Stoeckert

Jan 5, 2009, 3:20:48 PM
to obi-denr...@googlegroups.com, informatio...@googlegroups.com
Happy New Year All!
As we prepare for the OBI workshop next month, I'd like to start discussion of the DENRIE core terms.
The first one on the list (see Dec 17 mail) is datum/data; data set; data structure.

Thank you Alan for starting this thread. I guess it makes sense to have one discussion rather than two, but I am not sure how to manage this with different lists and different memberships.

Cheers,
Chris

Begin forwarded message:

From: "Alan Ruttenberg" <alanrut...@gmail.com>
Date: January 5, 2009 2:21:03 AM EST
Cc: "Chris Stoeckert" <stoe...@pcbi.upenn.edu>, "Melanie Courtot" <mcou...@gmail.com>, "James Malone" <mal...@ebi.ac.uk>
Subject: Re: datum, data set, data structure

Hi Folks,

I had a conversation with Barry this evening talking some of this
through. The outcome:

Don't distinguish data/datum. He thinks a distinction could be made on
the basis of the paper:
http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf but we
don't think it is necessary. Perhaps use "Data item" as the parent
term to avoid this.

On the question of the definition of data item and information content
entity, the proposal, bringing the conversations on this thread
together, would be that data items are information content entities
that are intended to be truthful statements about something (modulo
measurement error) and are constructed/acquired by methods which
reliably tend to produce (approximately) truthful statements. (Barry
can probably provide better wording).

Data items would include not only the results of measurements, but
also the results of computations that derive information from such
results, such as summarizations/averages, or which are created in order
to organize them for access or processing.

For Bjoern, another example of a data item that is not the result of
an assay would be, e.g., the nodes of a tree data structure used to
manage a sorted list of measurements (e.g. to approximate, more
efficiently, the solution of a many-body problem).

A data set would itself be a data item, some bundling of other data
items for some purpose, without implication of homogeneity or other
structural constraints, at least at the broadest level.
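
As an illustration only, the proposal that a data set is itself a data item bundling other data items could be sketched as follows. The class and field names (DataItem, DataSet, about, value, produced_by, members) are assumptions made for this sketch and are not IAO or OBI terms.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class DataItem:
    """An information content entity intended as a truthful statement."""
    about: str        # what the item is about (hypothetical field)
    value: Any        # the recorded or computed content
    produced_by: str  # the method that produced it, i.e. its provenance


@dataclass
class DataSet(DataItem):
    """A data item that bundles other data items for some purpose."""
    members: List[DataItem] = field(default_factory=list)


tide = DataItem(about="high tide", value=2.72, produced_by="tide gauge")
count = DataItem(about="number of readings", value=1, produced_by="computation")

# Members need not be homogeneous: a measurement and a derived count sit together.
bundle = DataSet(about="tide study", value=None, produced_by="curation",
                 members=[tide, count])
```

Because DataSet subclasses DataItem, a data set can appear as a member of another data set, which matches the "no structural constraints at the broadest level" reading.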

Thus the definition of data items involves provenance, intention,
truthiness ;-), and computation. Not a simple thing ...

We also discussed the issue of data content versus structure. Take for
example some set of referring data items, such as a set of recorded
observations of animal behavior (positions, activity, participants,
etc).  This set of data items might be organized/structured in a
number of ways - as one or more lists, as a relational database, in
other kinds of data structures optimized for one or another purpose.
It seems that we might then want to think about data structure as
independent from data content and define independent hierarchies for
each, and relations that allow us to combine them using defined
classes.
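
A small sketch of the content/structure separation, under the assumption that the observations are simple tuples (the animals, positions and activities are made up for illustration): the same content is held in two different structures and compared.

```python
from collections import defaultdict

# Hypothetical observations of animal behavior: (animal, position, activity).
observations = [
    ("fox", (3, 4), "resting"),
    ("owl", (1, 9), "hunting"),
    ("fox", (5, 2), "hunting"),
]

# Structure 1: a flat list, in recording order.
as_list = list(observations)

# Structure 2: an index keyed by animal, better suited to per-animal lookup.
as_index = defaultdict(list)
for animal, position, activity in observations:
    as_index[animal].append((position, activity))

# Flattening the index recovers the same content, whatever the structure.
flattened = {(a, p, act) for a, rows in as_index.items() for p, act in rows}
same_content = flattened == set(as_list)
```

The two containers differ in organization only, which is the sense in which structure might be modelled in a hierarchy separate from content.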

For data structures it might be better to not roll our own, but
instead defer to some formalism that already exists for describing
data structures (assuming a reasonable one exists, which I do assume).
Perhaps Jonathan might bring up some candidate.

Ok, that's all for now. Your thoughts (particularly Bjoern's) solicited.

Best,
Alan






On Sun, Dec 14, 2008 at 9:09 PM, Bjoern Peters <bpe...@liai.org> wrote:

I am with Jonathan on pretty much his entire mail. More below.

Barry Smith wrote:
At 10:58 AM 12/14/2008, Jonathan Rees wrote:


I gave a talk on Friday in which I distinguished data from software
according to the way it is considered to be correct. Software is
correct if it does what it's supposed to do, while data is correct if
what it means to say about the world is true. I don't know whether
this distinction works, as I'm not a philosopher, but no one in the
audience complained, and when I floated the idea with Gerry Sussman he
said: Oh, you mean software is a priori, and data is a posteriori.
This made sense to me...


As far as data is concerned, this seems to me to be on the right
track.  Ontologically it would be expressed in terms of an aboutness
relation --

x is a datum --> x is about something (a.k.a true of something)

BS

Currently IAO says that all information content entities are 'about'
something. That is why I asked what is the intended differentiation from
data. From what Barry writes I assume that is wrong, and only data but
not specifications (to me: plans, objectives, hypotheses, conclusions,
laws) are about something? Can we then have the second relation
'specifies', with x is a specification --> x specifies something, and
instances in reality either meet or don't meet a specification?

If data is 'true of something' and it is an information artifact =
created by a sentient, then there has to be an initial creation process.
I still believe that process should be a 'measurement / observation',
which we are considering synonyms of 'assay' in OBI. By capturing the
creation process of data, we will capture the provenance and aboutness
of the data. Subsequent data transformations can retain that without
making reference to it.

Alan gave examples of what is data that is not output of an assay
according to our current definition. I thought the measurement of
distance earth to moon would be an assay (with the Object pair
moon-earth playing the role of evaluant). I agree that 'length of the
file' poses many problems though, making it toxic on many levels. Isn't
there a length to a file independent of it being recorded as
data (dependent continuants on GDCs again)? Is the length the space on
the disk? Dependent on encoding? I would skip this example, as it
doesn't seem too relevant for this discussion.

Overall I would like to see a class that captures all kinds of
observation / measurement processes that create data, and I am happy to
change the label 'assay', or revise the definition. Also, if my
understanding of 'data' is considered too narrow, I would still like an
entity that captures only the output of measurement / observation
processes and transformations thereof.

- Bjoern

Years ago Brian Cantwell Smith threw a wrench into this tidy
distinction by insisting that software was also about the world, and
could be wrong by saying things about the world that were wrong (by
being an incorrect model). But I will leave that aside.

I would say that what is right about the attempt to link data to assay
is that data's ability to be empirically right is special. Music is
not data, unless it's the subject matter of a study of music, and
articles of theoretical mathematics are not data, unless they are the
subject matter of a study of mathematical articles. (This is a
difference on quotation level.) (Both of these situations are outside
the scope of IAO if I understand the project correctly.)

So naively, I'd say that data has meaning and provenance.

I know you need to account for two dimensions, that of entities
(exemplified by a piece of paper on which measurements are scrawled -
or rather that aspect of that piece of paper that comprises those
measurements) and that of whatever it is that such entities can have
in common, such that we say that they "say the same thing". If you
make statements about the former you risk failing to be able to
transfer that information to other such entities. I.e. if P(x) means
that the measurement written on x is 2.72, and y is a "copy" of x,
then you'd like to be able to conclude that P(y). I would assert that
while grounding in reality is a fine thing, what normal people really
say about information rarely mentions sufficient particulars to
identify that grounding. Instead they say: "I read that high tide was
2.72 meters" and nobody really cares which computer they used or which
disk stored that information. At best they care about the provenance
of the information. This is not a criticism of entity-orientation in
principle, just a warning that the statements we record in curation
need to allow for the right kinds of generalization.

Given the importance of provenance it would be worth making sure that
information never gets separated from the processes involved in its
life history (an agent uttering something, another agent understanding
it). Probably data without provenance is useless or nonsensical. (If a
little bird tells you a tide-level measurement, you would want to
check it before calling it "data".) I think everyone here has this
point, but it's easy to forget because for some purposes you will
forget where the information came from, and just use it on faith
(because you already determined that it was OK).

My experience is that there will never be completely satisfactory
terminology around these things (data, datum, information entity,
message, file, etc.), and you have to just set your sights low and
rely on good definitions.

As for the atomic data and composite data problem - is there any need
to make this distinction? In curation will you ever care? In logic you
care, and atomic propositions are distinguished from composite
propositions (such as conjunctions) by being called "literals", but
this is usually only of interest when you're trying to develop
procedures for proving theorems or testing entailment.

So here are the things I see still on the table:
- datum in the entity sense, such as previous day's high and low
temperatures recorded in a copy of the newspaper sitting in front of
Barry at breakfast
- datum in the abstract sense (generically dependent continuant?),
whatever it is about that entity that is shared by the copy of a
different newspaper sitting in front of me at lunch that reports the
same high and low temperatures
- orthogonally to the above, a distinction between atomic and compound
(I posit that this distinction may not be important)
- orthogonally to the above, one or more classifications into
different kinds of information entity, capturing important
distinctions in meaning (microarray obtained how? software,
literature, etc.) and/or structure (nucleotide sequence, table of
intensity readings, pair of temperature readings) and/or syntax (XML,
CSV, etc.), driven by use cases.

I think the theory should capture the principle that "everything
that's said is said by someone" (i.e. you cannot have a temperature
record without both a temperature and someone or something to record
it) and further that if it's not said for a reason it's hard to
consider it to be information (or data).

I'll try not to influence choice of terms. I understand why there are
problems with "data" and "information" (which I use more or less
interchangeably - I know that's a waste) but urge that we first figure
out which things (classes) need terms and then work towards the least
bad terminology choices. Don't hold out for terminology that feels
really good to everyone because, in this context at least, it will
never come along.

I know it doesn't seem to be in scope, but I wouldn't be surprised if
we ended up reinventing bits of classical logic (propositions,
inference, theories, interpretations, models) and/or the animal
communication literature by the time we're done with this project.
None of this apparatus was invented just for the sake of inventing
apparatus; it derived from a sincere attempt to account for the
phenomena of talking and thinking about the world, which I don't see
as being very different from the IAO activity.

Jonathan

On Sun, Dec 14, 2008 at 12:06 AM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:

On Sat, Dec 13, 2008 at 11:13 PM, Bjoern Peters <bpe...@liai.org> wrote:

No one seems to like my inclusion of assay in the definition of data. I
can live with a broader view of things labeled 'data', but would then
like to know what differentiates 'data' from 'information content
entity'.

Information content entities include information about a realizable
entity, such as specifications.


I would also like to point out the definition of processes
labeled 'assay' that we have in OBI, and would like to know what 'data'
do not originate from such processes:

A measurement of the distance between the earth and the moon.
The length of a file.


An assay is a process with the objective to create as an output
information about a material entity (bearing evaluant role).

-Alan

James Malone wrote:

The word data is the Latin plural of datum; data are multiple datum so
I don't think the parentage works as proposed.  I think data is a
collection of datum, the equivalent of a part_of for information
entities.  I'm tempted to suggest that data and data set are actually
synonyms; they talk about collections of multiple datum.  I think in
modern times people use them synonymously even though historically the
'set' part of data set was probably used to imply a collection of
unique records (like a mathematical set of non-repeating members).  If
we wish to separate them, then I would say a data set is a collection
of data (which is a collection of datum).

I'm not clear on how to make the distinction on what is an atomic,
singular piece of information and agree that is completely context
specific.  Can we somehow tie that into the definition or does that
represent a fudge?  I think we should try to keep this simple though
and as generic as possible lest we tie ourselves up in knots; I really
would not like to see 'assay' etc in the definition for instance.

Regarding data structure; I was really thinking computer science
centric when I was talking about it previously.  I was thinking trees,
hash maps, records in a database, (though could this be a paper filing
system or an index in a book perhaps?)  It is information about the
way the data is organised, rather than about the information the data
actually represents.

In summary I personally don't think any of the above, data, datum,
data set or data structure would fall under a parental hierarchy to
one another.

Cheers,

James




On Sat, Dec 13, 2008 at 8:06 PM, Bjoern Peters <bpe...@liai.org
<mailto:bpe...@liai.org>> wrote:


   I think you need to include in this discussion the definition of
   'information content entity', to make sure it ends up different from
   'datum'. BTW: why is there no 'information artifact' class?

   Jonathan's definition seems to be computer science inspired, as in
   data is different from code. For OBI at least, I thought we instead
   wanted datum limited to 'scientific data', which would be the output of
   measurements and observations (in OBI: assay), and data
   transformations thereof.
   Check out this song as an elucidation of the definition:
   http://faculty.washington.edu/crowther/Misc/Songs/showme.shtml

   Translating this to OBI/IAO, and trying to define the singular vs.
   plural, I would propose:
   An instance of datum (or data point): is an (information
   artifact/information content entity?) that is the output of a single
   instance of an assay or a data transformation, and is about a single
   instance of an evaluant.

   The problem is going to be assays like microarrays, which could be
   described as thousands of simultaneous assays with each array probe.
   Maybe we can and should capture this in the assay definitions, where an
   'atomic assay' gives exactly one datum, vs. a parallelized assay can be
   broken down into many atomic assays (FACS --> single cell, microarray
   --> single probe, 454-sequencing --> single read) which produce a data
   set from a single assay. Other data sets are produced as the output of
   serial applications of assays (even different ones) in an
   investigation.

   I agree that it will be tricky to keep the separation of datum and
   data set throughout, so it would be nice to have a parent class for
   both. How about this:

   data: is information that is the output of assays or data transformations
      datum: is data that can be traced to an atomic assay
      data set: is data that can be broken down into multiple datum

   (I am not too happy about the 'data' label)

   I don't think I understood 'data structure' sufficiently in the
   discussion below to place it.

   - Bjoern

   Alan Ruttenberg wrote:
Dear IAOen,

Here are parts of a conversation leading to a question about the
definitions and distinctions between datum, data structure, and data
set.

Although James offers a distinction, "datum is an information content
entity that is a representation of a single item of information", I
see a couple of issues. First, the representation would be of anything,
not just information. Second, representations are often composite
structures. Just as there is no thing that has no parts (at least
until we get to subatomic particles), I don't see how to figure out
what is singular in this case.

We don't have a formal definition yet, but Jonathan wrote this note:

"datum -- well, this will be very tricky to define, but maybe some
information-like stuff that might be put into a computer and that is
meant, by someone, to denote and/or to be interpreted by some
process... I would include lists, tables, sentences... I think I might
defer to Barry, or to Brian Cantwell Smith"

If we follow this, then datum would be a superclass of data set and
data structure (assuming we have both these other terms). However, Chris
and James both find this counterintuitive, thinking that datum implies
singular.

My thoughts are that data structures have parts of different kinds,
and that data sets are aggregates that are collected together either
because of common provenance, or for common purpose, and which tend to
have some collection of parts of the same kind, among other things.
This might suggest:

Datum
 Data structure (or structured data?)
    Data set

So, anybody have thoughts about this?

-Alan


Melanie:

- Where does our relation DT -> rendering go?
 We talked about having "is_rendered_by" during the denrie calls.
- How do we deal with graph and tree data structures?

6. graph

A graph is a collection of points and lines connecting some (possibly
empty) subset of them. The points of a graph are most commonly known as
graph vertices, but may also be called "nodes" or simply "points."
Similarly, the lines connecting the vertices of a graph are most
commonly known as graph edges, but may also be called "arcs" or
"lines."

definition source: WEB:http://mathworld.wolfram.com/Graph.html


7. tree data structure (as a child of the above graph)

label: tree data structure (disambiguation with forest tree)
A tree data structure is an acyclic connected graph. It is a widely-used
data structure that emulates a hierarchical tree structure with a set of
linked nodes. Each node has a set of zero or more child nodes, and
at most one parent node.

definition source: WEB:http://en.wikipedia.org/wiki/Tree_data_structure
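
To make the "acyclic connected graph" part of the tree definition concrete, here is a minimal sketch that tests it for an undirected graph given as vertices and edges. The function name is_tree is hypothetical, the empty graph is treated as not a tree by assumption, and every edge endpoint is assumed to be listed among the vertices. It relies on the standard fact that a connected graph on n vertices is acyclic exactly when it has n - 1 edges.

```python
from collections import deque


def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and acyclic."""
    vertices = set(vertices)
    if not vertices:
        return False  # assumption: the empty graph is not counted as a tree
    # A connected graph on n vertices is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == vertices
```

For example, is_tree(["a", "b", "c"], [("a", "b"), ("a", "c")]) holds, while adding the edge ("b", "c") creates a cycle and the check fails. The "at most one parent" clause only applies once a root is chosen, which this undirected check does not model.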

Initial idea from Chris was to add graph as sibling of datum and data
set: "I think graph and other data structures are not types of datum
but rather aggregates of data in a particular structure. How about
making graph a sibling of datum?"

Chris: (responding to)

 Melanie: -- graph (would need to be added to IAO, probably as a
child of datum IAO_0000027)

I think graph and other data structures are not types of datum but
rather aggregates of data in a particular structure. How about making
graph a sibling of datum?


Alan:

a rendering is about some data, so I would make it a subproperty of
is_about.

What's tricky is what Chris alludes to - data set, versus data
structure, versus datum, in that a rendering could be of any of them,
yet if they are all siblings that suggests a common superclass.

Some questions: Can a datum sometimes be a data structure, or even a
data set? Any suggestions on how to clearly differentiate among them?

I think for now, to make progress, having data structure be a sibling
will do, but I might expect it to change when we think it through a
bit more. It's a bit like granularity.

James:

So my first question is, what is datum?  The current definition is very
loose, so I'll try and tighten it a little as we iterate.  I would
propose:

datum is an information content entity that is a representation of a
single item of information, such as from an observation, statement of
perceived fact, a communication, a calculation or as the result of a
process.

Key here is singular form of the class.  So data set is an aggregation of
datum, I would propose, so the class should contain some
has_part/is_aggregation or similar relating to datum.  The crucial thing
for a 'data set' as opposed to just 'lots of datum' is that they have some
common feature, even if it is just they were collected at same time, I
would suggest.  Data structure is optional information about the
organisation and relation between the data in a data set.  I would go
further to say that even data that is randomly collected, such as a bag of
words model, but that is contained within a data structure could be
considered a data set as the common feature is the data structure which
binds them.  Bag of words is probably not the best example because the
other common feature is of course they are all words :)

So to clarify, I think datum is the atomic unit, data set should be
defined in terms of this atomic unit and with an extra clause that the
data share some common feature, and data structure is information about
the organisation of the data set.  My first thoughts on this...

Chris:

Hi James,

I agree with your views. I might go further and say that data
structures have specified relationships (i.e., the structure) between
data where a data set is an aggregate of data with some common
feature.

To answer Alan, I don't think datum can be a data structure or a data
set.

Melanie Courtot

Jan 5, 2009, 7:14:23 PM
to Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, obi-denr...@googlegroups.com, Kieran O'Neill
Hi,

If I follow you, we would have information content entity, and its
subclass data item (previously data/datum).
Data item would encompass all things which are somehow "scientifically
true". Maybe we should use a different label, something like
"scientific data"?

We were discussing with Kieran in my group, and we are not sure about
the part regarding intended truthfulness of the data item. Results of
scientific processes seem OK (including measurements, computation
etc). But what about intentionally falsified data, for example?

Melanie


On 5-Jan-09, at 6:40 AM, Jonathan Rees wrote:

> I don't agree that the "data item" notion as you described it is
> complex. It is merely deep. I like it because it parallels the
> conventional definition of "knowledge" as justified true belief. A
> data item is approximately justified approximately true approximate
> belief.
>
> I sometimes say that things like this are "meant" - i.e. whatever said
> the thing, means it, perhaps with some qualification (every statement
> in science is qualified).
>
> I take it data items are GDCs, or something of that ilk, and not the
> scrawlings-in-the-margin-of-Fermat's-copy kind of individual. The
> latter will have to be called something, right? The sooner we have
> agreed distinct names the better.
>
> It does seem a bit bold in a BFO sense to separate syntax from
> semantics like this, but it is quite conventional. When we say that
> Fred said X, we can mean either syntax or semantics, and the
> difference is a level of quotation. If we say "Fred said his name",
> this is a statement of semantics: he might have said "Fred",
> "Frederick", "Mr. Flintstone", "my name is Fred", etc. If we say "Fred
> said 'Fred'", this is a statement of syntax: in particular it tells us
> that he did not say "Mr. Flintstone". Confusion of these two levels
> through misplaced or missing quotation marks in notation design is one
> of the major bugaboos of software engineering.
>
> (Alan, I know how much you dislike HTTP content negotiation, but I
> have to point out that this syntax/semantics distinction is very
> similar in flavor. CN advocates would like to say that there is an
> "abstract document" or "document without commitment to encoding" that
> says that Fred is a rubble expert. That is, when you name an AD, you
> are naming a meaning (semantics). Then when you do a GET naming the
> AD, what you receive is a "representation" - a syntactic entity ("Fred
> is a rubble expert") that expresses that meaning (that Fred is a
> rubble expert). Data item = abstract document, encoding =
> representation. The property that unites all the representations of an
> AD is that they all encode the same "data item". CN suffers of course
> in having no theory of provenance, but that is a separate question.)
>
> I don't think we should attempt a single universal *syntactic* theory
> of data structures, as there seems to be a never ending supply of such
> universal theories. Mime types seem a good place to start. I think you
> just have to be opportunistic, and invent syntactic classes as you
> need them.
>
> Maybe there is a need to talk about data structures at a non-syntactic
> level. If you want to talk about data structures at the semantic
> level, you'd have to do so in terms of constituent propositions, if
> you believe that meanings are propositional (and you should). E.g. you
> could take the meaning of a syntactic entity that can be parsed as a
> list to be a conjunction (another kind of list, or set) of the
> meanings of the individual list elements. This might apply to the
> tree data structure example - to lift it to the semantic level you
> would not be able to talk about structure without first identifying
> propositional content - what the nodes mean. This is not easy. You'd
> have to figure out whether conjunctions (trees, etc.) are meanings,
> and if not then what sort of beast they are (data structure? parse
> tree? XML infoset? RDF graph?). You'd have to account for algorithms
> that are generic - that don't care what the semantics is, but operate
> usefully at a purely syntactic level (e.g. sorting).
>
> It would be unfortunate to have to introduce a separate level of
> structural analysis sitting in between syntax and semantics, but if
> that's what's needed, so it goes. I guess we should explore the
> options.
>
> As for data structure frameworks - each programming language has its
> own ideas, with lots of variation around what the base types are and
> what the upper part of the type hierarchy looks like (are records
> collections? are collections functions?). IDL was popular for a while,
> but seems to be in decline. Now there's SOAP and so on. There are
> various automated theorem proving systems (e.g. Larch, PRL) with
> theories often tuned to the particular inference technology being
> used. I would prefer to be more scholarly and attempt to identify
> consensus use of terms in the mathematics and computer science
> literatures (e.g. set, list, tree), but this means invention and
> design, which would be bad. It would also over time descend into
> idiosyncrasy and overcommitment just as all such systems do. So I'm
> not sure what I'd suggest. I'll think about it.
>
> Jonathan
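
The syntax/semantics distinction in the quoted message ("Fred said 'Fred'" versus "Fred said his name") can be sketched roughly as below. The DENOTES table and both function names are hypothetical, introduced only for illustration.

```python
# Hypothetical mapping from utterances (syntax) to what they denote (semantics).
DENOTES = {
    "Fred": "fred-flintstone",
    "Frederick": "fred-flintstone",
    "Mr. Flintstone": "fred-flintstone",
    "Barney": "barney-rubble",
}


def same_syntax(a: str, b: str) -> bool:
    """'Fred said "Fred"': compare the quoted strings themselves."""
    return a == b


def same_meaning(a: str, b: str) -> bool:
    """'Fred said his name': compare what the strings denote."""
    return DENOTES[a] == DENOTES[b]


# A generic algorithm such as sorting works on syntax alone
# and never consults DENOTES.
sorted_utterances = sorted(DENOTES)
```

Here same_meaning("Fred", "Mr. Flintstone") holds while same_syntax("Fred", "Mr. Flintstone") does not, and the sort runs without any access to what the strings mean.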

Allyson Lister

Jan 6, 2009, 4:26:24 AM
to obi-denr...@googlegroups.com, Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, Kieran O'Neill
Hi all,

Happy New Year!

Melanie - I agree with your interpretation of the structure of data item and its parents/children. That was also what I got from Alan's email. As for using a different label such as "scientific data", I see two problems. Firstly, would we then need to define what scientific is? Secondly, by finishing the label with the word "data", we end up with the data versus datum problem that Alan mentioned, which was why, I am guessing, "data item" was chosen in the first place. (I may have missed other parts of the conversation - I've been trying to keep up my reading of the OBI lists, but it hasn't been perfect of late!) However, I agree completely about the truthfulness aspect, but am not sure how to resolve it...

thanks :)

2009/1/6 Melanie Courtot <mcou...@gmail.com>



--
Thanks,
Allyson :)

Allyson Lister
Research Associate
Centre for Integrated Systems Biology for Ageing and Nutrition
Newcastle University
http://www.cisban.ac.uk
School of Computing Science
Newcastle University
Newcastle upon Tyne, NE1 7RU

Alan Ruttenberg

Jan 6, 2009, 8:15:09 AM
to obi-denr...@googlegroups.com, Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, Kieran O'Neill
On Mon, Jan 5, 2009 at 7:14 PM, Melanie Courtot <mcou...@gmail.com> wrote:
>
> Hi,
>
> If I follow you, we would have information content entity, and its subclass
> data item (previously data/datum).
> Data item would encompass all things which are somehow "scientifically
> true". Maybe we should use a different label, something like "scientific
> data"?
>
> We were discussing with Kieran in my group, and we are not sure about the
> part regarding intended truthfulness of the data item. Results of scientific
> processes seem OK (including measurements, computation etc). But what about
> intentionally falsified data for example?

The intentionally falsified data is a nice example. I can think of two
approaches. First is to consider it a mistaken classification.
Following the fallibilist point of view we would reclassify this as
not data once it was discovered not to be.

Or we could think about Barry's "methods which reliably tend to
produce" part of the definition, and treat the falsified research as
part of a system which does reliably tend to produce (approximately)
true statements.

I would lean to the former, though note that the syntactic aspect of
the data needs to be considered as something more broadly applicable
than just to data.

On the matter of scientific data, I think this is too narrow for the
top level. Consider a coin collector's listing of the years, types, and
markings on the collection of coins they have. These are data items,
but I don't think I would consider them to be scientific. Similarly a
catalog of items for sale and their prices by a vendor.

-Alan

Chris Stoeckert

Jan 6, 2009, 2:53:24 PM
to obi-denr...@googlegroups.com, Jonathan Rees, informatio...@googlegroups.com, James Malone, Kieran O'Neill
Hi,
So to summarize for the purpose of taking this to the OBI workshop:

data item is an information content entity that is intended to be a
truthful statement about something (modulo measurement error) and is
constructed/acquired by a method which reliably tends to produce
(approximately) truthful statements. (synonym datum, data). Is a
necessary condition that data items have to be the output of some
process?

data set is a data item with some bundling of other data items for
some purpose, without implication of homogeneity or other structural
constraints, at least at the broadest level.

Need a definition for data structure. Is it an information content
entity? Or is it a different type of generically dependent continuant?

Cheers,
Chris

Melanie Courtot

Jan 6, 2009, 3:32:23 PM
to informatio...@googlegroups.com, Alan Ruttenberg, obi-denr...@googlegroups.com, Jonathan Rees, Chris Stoeckert, James Malone, Kieran O'Neill

Considering the following example:

If I say "OBI contains 1344 classes" (I believe my reasoner which says
there are 1344 classes) then I have a data item.

If I say "OBI contains 1700 classes" (which I calculated by hand and
believed to be true, but later on I check and see that I made a
mistake and the correct number is indeed 1344), what I thought was a
data item would be reclassified as not being one.

If I say "OBI contains 1700 classes" (which I know is not true, but
I'm trying to advertise OBI and 1700 sounds better than 1344 ;) ), I
don't have a data item, because I can't produce a measurement with
this output, and I didn't intend to represent the truth.

If I say "Next year, OBI will contain 2000 classes" (which I believe
to be true but it's just my personal opinion) I have only an
information content entity, not a data item, as it has not been
corroborated by some kind of evidence (measurement). If I now draw a
graph of the number of classes in OBI over time, and then extrapolate
from that that by next year OBI will indeed contain 2000 classes, I
have a data item.

If Protege says there are 1340 classes in OBI, and Pellet says there
are 1345, they are both data items about the same thing, as in both
cases the intent was to provide a true representation of the number of
classes in OBI (even though neither of them is actually right, but
that is part of the uncertainty in the measurement).

Does that seem to match what you mean?

Melanie



On 6-Jan-09, at 5:59 AM, hog...@gmail.com wrote:

> It may also be worthwhile to consider the distinction between a
> measurement and the record of a measurement, and the role that may
> play here. My initial thought is that data fall into the latter
> category. In addition to imprecision/error in the process of
> measuring, there is also the potential for error in making the
> record. It is possible that two different entities perform the
> measurement vs. the record of the measurement (e.g., a machine makes
> the measurement, then I transcribe the value on the machine's
> display onto paper or into some electronic system).
>
> Records of measurements may fail to reflect reality for lots of
> reasons: transposed digits, misplacement of the decimal point, wrong
> units of measure, and intentional falsification, to name just a few.
> A record of a measurement may even refer to a measurement that never
> existed.
>
> Falsification of data may therefore fall into three basic categories:
>
> 1. Misrecording the outcome or some other aspect (such as date/time)
> of a measurement/assessment. For example, stating that patient #234
> had a Hemoglobin A1C of 5.5 instead of 7.5.
> 2. Recording the existence of a measurement/assessment that never
> existed.
> 3. Intentionally leaving out from the record a measurement/
> assessment that did occur.
>
> The difference between errors and falsification is in the intent of
> the recorder: did the person make a mistake or intentionally
> misrepresent something?
>
> Regardless, per Alan, once we know a record is in error, we
> reclassify it as not being data.
>
> Bill
>
> On Jan 6, 2009 8:15am, Alan Ruttenberg <alanrut...@gmail.com>
---
Mélanie Courtot
TFL- BCCRC
675 West 10th Avenue
Vancouver, BC
V5Z 1L3, Canada




Dirk Derom

Jan 12, 2009, 10:32:34 PM
to obi-denrie-branch
Hi all,

Could I get some clarification on the discussion on datum, data item,
data, data structures and scientific data?

First of all, I agree with dropping the term ‘data’ and using
something that still makes sense (e.g. data item), along with data
sets (a collection of data items) and data structure (organized data
items). ‘Data’ is too vague and common to redefine its meaning without
confusing people. I would like to avoid the common usage of data
within science and its singular datum without losing a rather obvious
term such as data item.

What I don’t understand is why we need to integrate the process in the
definition of data. Integrating the process in the definition is
acceptable if we recognize that a data item is always a processed/
translated data item.
Each process ‘does something’ with the pre-processed data. Hence, it
is not data to me, but some kind of processed data. The difference
would be the difference between a spike train and a significant effect
by processing the spike train. We could nevertheless add the process
in data item, but not as a necessary element, something like ‘and
could be constructed/acquired by a method’ (instead of ‘is
constructed…’).
Every data item is obviously collected/acquired, but I don’t see the
process of collecting as an intrinsic (ontological) part of the data
item (just as I don’t see the collector, the researcher who actually
collects/acquires the data, as an intrinsic part of the data item). If
I’m overlooking something, then please fill me in.

What I really don’t like is the ‘truth’ element in the definition of
data item. Truth is rather difficult and philosophically (too) hard to
define. I realize that BFO uses a framework which states that there is
a possible match between truth and representation through ontology.
But as far as I understand its philosophical framework, it (cf. truth)
seems to relate to the actual possibility of constructing a
representational ontology and not to the terms/entities in
domain-specific ontologies. Hence it does not follow that a data item
is a truthful statement.
I would think that data has no relation with truth, at least not until
it’s translated/processed into truthful statements, which for me would
be more related to ‘information’ or ‘knowledge’. Adding truth to data
will create a continuous shift in ‘is X data or not?’, since truth and
knowledge shift quite often. Take the very unlikely case where we find
that spike trains do not represent brain functions (some claim they do
not; however, I’m not expert enough to judge such statements). Then
all of a sudden, that’s no longer data, because it has no truth-value?
Knowledge/information accumulation can be related to truth, however
its truthful status is never decided upon (cf. the ‘tend to’). Even
the more cautious definition (‘tend to’ or ‘are known to produce’)
seems a little too much. This might have something to do with what I
think a data item is: a potential source of truthful information. A
data item cannot be ‘wrong’ or ‘true’. It’s simply a data item. How we
use it, how we process/select/collect it might be truthful or
erroneous, but a data item is always what it is, namely a ‘data item’.
For me truth can only be addressed in the definition of knowledge or
information, and the ‘tend to produce truthful’ should be part of an
information definition: information as a ‘truthful statement about
something, obtained by collecting/acquiring a data set (items,
structure) using a method which reliably tends to produce truthful
statements’.
As an alternative for data item, I kind of like the wiki definition:
“Data refers to a collection of facts usually (!) collected as the
result of experience, observation or experiment, or processes within a
computer system, or a set of premises. This may consist of numbers,
words, or images, particularly as measurements or observations of a
set of variables.” A data item then could be: “Data item refers to a
fact usually collected as the result of experience, observation or
experiment, or processes within a computer system, or a set of
premises. This may be a number, word, or image, particularly as a
measurement or observation of a set of variables.” I really like the
‘usually’ since it addresses the question I had about adding the
process as an essential part of the definition.

Hope it all makes sense,
D

Chris Stoeckert

Jan 13, 2009, 3:47:48 PM
to obi-denr...@googlegroups.com
Hi Dirk,
Just to put this in context, I proposed a definition from Alan:

"data item is an information content entity that is intended to be a
truthful statement about something (modulo measurement, error) and is
constructed/acquired by a method which reliably tends to produce
(approximately) truthful statements. (synonym datum, data). "

Your alternate proposed definition is:


“Data item refers to a fact usually collected as the result of
experience, observation or experiment, or processes within a computer
system, or a set of premises. This may be a number, word, or image,
particularly as a measurement or observation of a set of variables.”

I'm OK with your definition for a general data item in IAO. However,
in OBI, I only care about data items in the context of experiments and
so it's necessary to relate the data to the process that generated
them. I understand that for IAO data items may not have that
restriction, but I would want to apply that restriction in OBI. We
could create a measured data item (and an observed data item) in OBI
as types of data items that carry these restrictions.

I also like considering data items as facts rather than bringing in
truthiness.

Cheers,
Chris

Fostel, Jennifer (NIH/NIEHS) [C]

Jan 14, 2009, 10:17:18 AM
to obi-denr...@googlegroups.com
i like Chris's distinction about measured data and observed data. that
is what OBI is describing. we may also need to add something about the
record of a conclusion (aka finding) which may lead us to something
messy along the lines of "human-reasoned data", but this can wait.

regarding fact vs truth, truth is generally understood, but is a high
standard and therefore may be off-putting. i looked up a definition of
fact, and found this:
a piece of information about circumstances that exist or events that
have occurred

it does not include anything about the piece of information being
intended to be true, accurate as far as possible to know or whatever
condition we mentally add to the word "fact". may be simpler to stick
with "truth" as modified by Alan. :-)


Melanie Courtot

Jan 14, 2009, 1:47:48 PM
to obi-denr...@googlegroups.com
Do I understand correctly that we should probably distinguish 2 "data items"?

1. the general one for IAO
which would cover all data items and is out of scope for OBI

2. the more specific one for OBI-DENRIE
which would correspond to experimental, corroborated data which are the output of an observation or measurement (actually, in the OBI scope, they would be the output of data transformation and/or assay only). Maybe this would suggest that we should use the general IAO one and add these process restrictions to create the OBI one?

Melanie

Chris Stoeckert

Jan 14, 2009, 2:30:42 PM
to obi-denr...@googlegroups.com
Hi Melanie,
Yes, and the approach of adding a process restriction for the OBI one makes sense.
Chris
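
For readers who find code easier to follow than prose, the approach agreed above (a general IAO data item, with the OBI one obtained by adding a process restriction) might be sketched roughly as follows. This is only an illustration under stated assumptions, not an agreed OBI or IAO model: the class names, the `produced_by` field, and the `OBI_PROCESSES` set are all hypothetical, and the two process names are taken from Melanie's message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataItem:
    """General (IAO-style) data item: no restriction on how it was produced."""
    value: object
    about: str
    produced_by: Optional[str] = None  # hypothetical: name of the producing process, if any


@dataclass(frozen=True)
class DataSet(DataItem):
    """A data item that bundles other data items; no homogeneity is implied."""
    members: Tuple[DataItem, ...] = ()


# Assumption: in OBI scope, only these processes produce data items.
OBI_PROCESSES = {"assay", "data transformation"}


def is_obi_data_item(item: DataItem) -> bool:
    """The OBI data item is the general one plus a process restriction."""
    return item.produced_by in OBI_PROCESSES


coin_year = DataItem(value=1909, about="mint year of a coin in a collection")
eye_track = DataItem(value=[0.1, 0.2], about="eye movement", produced_by="assay")
bundle = DataSet(value=None, about="mixed bundle", members=(coin_year, eye_track))

print(is_obi_data_item(coin_year))   # general data item only
print(is_obi_data_item(eye_track))   # also satisfies the OBI restriction
print(isinstance(bundle, DataItem))  # a data set is itself a data item
```

The restriction is written as a predicate over the general class rather than baked into the class itself, which mirrors the idea that IAO stays unrestricted and OBI adds the condition on top.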

Dirk Derom

Jan 14, 2009, 8:13:18 PM
to obi-denr...@googlegroups.com
I like the distinction, but have a (slightly off topic) question on what a process actually means.

Say you have a participant in an experiment, and he/she/it has no explicit task (an explicit task might be 'fixate at the central dot'), and one would record eye movement. I assume that these eye movements are data items.
What would be the process of these data items? Is 'measuring eye movement' a process? Or does process refer to what causes the data items to appear (e.g. the task), or more or less the collecting of it?

I don't seem to grasp the connection with data item that well.





--
Kind regards,
Dirk Derom.

--------------------------------------------------------
Check one of the following websites:
-> Neuroinformatics:
http://www.metaneva.org/
-> Graphic Design:
http://sarahverroken.com/
http://sarahverroken.blogspot.com/

Bjoern Peters

Jan 16, 2009, 3:43:30 PM
to obi-denr...@googlegroups.com
eye movements are a process.
recording eye movements is a process (=assay).
the output of the recording process is data


Dirk Derom wrote:
> I like the distinction, but have a (slightly off topic) question on
> what a process actually means.
>
> Say you have a participant in an experiment, and he/she/it has no
> explicit task (an explicit task might be 'fixate at the central dot'),
> and one would record eye movement. I assume that these eye movements
> are data items.
> What would be the process of these data items? Is 'measuring eye
> movement' a process? Or does process refer to what causes the data
> items to appear (e.g. the task), or more or less the collecting of it?
>
> I don't seem to grasp the connection with data item that well.
>
>


--
Bjoern Peters
Assistant Member
La Jolla Institute for Allergy and Immunology
9420 Athena Circle
La Jolla, CA 92037, USA
Tel: 858/752-6914
Fax: 858/752-6987
http://www.liai.org/pages/faculty-peters
