datum, data set, data structure

Alan Ruttenberg

Dec 13, 2008, 1:39:13 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
Dear IAOen,

Here are parts of a conversation leading to a question about the
definitions and distinctions between datum, data structure, and data
set.

Although James offers a distinction, "datum is an information content
entity that is a representation of a single item of information", I
see a couple of issues. First, the representation would be of anything,
not just information. Second, representations are often composite
structures. Just as there is no thing that has no parts (at least
until we get to subatomic particles), I don't see how to figure out
what is singular in this case.

We don't have a formal definition yet, but Jonathan wrote this note:

"datum -- well, this will be very tricky to define, but maybe some
information-like stuff that might be put into a computer and that is
meant, by someone, to denote and/or to be interpreted by some
process... I would include lists, tables, sentences... I think I might
defer to Barry, or to Brian Cantwell Smith"

If we follow this, then datum would be a superclass of data set and
data structure (assuming we have both these other terms). However,
Chris and James both find this counterintuitive, thinking that datum
implies singular.

My thoughts are that data structures have parts of different kinds,
and that data sets are aggregates that are collected together either
because of common provenance, or for common purpose, and which tend to
have some collection of parts of the same kind, among other things.
This might suggest:

Datum
  Data structure (or structured data?)
  Data set

So, anybody have thoughts about this?

-Alan


Melanie:

- Where does our relation DT -> rendering go?
We talked about having "is_rendered_by" during the denrie calls.
- How do we deal with graph and tree data structures?

6. graph

A graph is a collection of points and lines connecting some (possibly
empty) subset of them. The points of a graph are most commonly known as
graph vertices, but may also be called "nodes" or simply "points."
Similarly, the lines connecting the vertices of a graph are most
commonly known as graph edges, but may also be called "arcs" or
"lines."

definition source: WEB:http://mathworld.wolfram.com/Graph.html


7. tree data structure (as a child of the above graph)

label: tree data structure (disambiguation with forest tree)

A tree data structure is an acyclic connected graph. It is a widely used
data structure that emulates a hierarchical tree structure with a set of
linked nodes. Each node has a set of zero or more children nodes, and
at most one parent node.

definition source: WEB: http://en.wikipedia.org/wiki/Tree_data_structure
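
To make the "acyclic connected graph" wording concrete, here is a minimal Python sketch that checks whether an undirected graph is a tree. The function name `is_tree` and the vertex/edge-list representation are illustrative assumptions and are not part of IAO or of the cited definitions.

```python
from collections import defaultdict


def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected and acyclic."""
    vertices = set(vertices)
    if not vertices:
        return False  # assumption: the empty graph is not treated as a tree here
    # A connected acyclic graph on n vertices has exactly n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == vertices
```

Note that the check covers only the graph-theoretic part of the definition; the "at most one parent node" clause additionally presupposes a chosen root, which this sketch does not model.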

Initial idea from Chris was to add graph as sibling of datum and data
set: "I think graph and other data structures are not types of datum
but rather aggregates of data in a particular structure. How about
making graph a sibling of datum?"

Chris: (responding to)

Melanie: -- graph (would need to be added to IAO, probably as a
child of datum IAO_0000027)

I think graph and other data structures are not types of datum but
rather aggregates of data in a particular structure. How about making
graph a sibling of datum?


Alan:

a rendering is about some data, so I would make it a subproperty of
is_about.

What's tricky is what Chris alludes to - data set, versus data
structure, versus datum, in that a rendering could be of any of them,
yet if they are all siblings that suggests a common superclass.

Some questions: Can a datum sometimes be a data structure, or even a
data set? Any suggestions on how to clearly differentiate among them?

I think for now, to make progress, having data structure be a sibling
will do, but I might expect it to change when we think it through a
bit more. It's a bit like granularity.

James:

So my first question is, what is datum? The current definition is very
loose, so I'll try and tighten it a little as we iterate. I would
propose:

datum is an information content entity that is a representation of a
single item of information, such as from an observation, statement of
perceived fact, a communication, a calculation or as the result of a
process.

Key here is the singular form of the class. So a data set is an aggregation
of datum instances, I would propose, so the class should contain some
has_part/is_aggregation or similar relation to datum. The crucial thing
for a 'data set' as opposed to just 'lots of datum' is that they have some
common feature, even if it is just that they were collected at the same
time, I would suggest. Data structure is optional information about the
organisation of, and relations between, the data in a data set. I would go
further and say that even data that is randomly collected, such as a bag of
words model, but that is contained within a data structure, could be
considered a data set, as the common feature is the data structure which
binds them. Bag of words is probably not the best example because the
other common feature is of course that they are all words :)

So to clarify, I think datum is the atomic unit, data set should be
defined in terms of this atomic unit and with an extra clause that the
data share some common feature and data structure is information about the
organisation of the data set. My first thoughts on this...

Chris:

Hi James,

I agree with your views. I might go further and say that data
structures have specified relationships (i.e., the structure) between
data where a data set is an aggregate of data with some common
feature.

To answer Alan, I don't think datum can be a data structure or a data
set.

Bjoern Peters

Dec 13, 2008, 3:06:43 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
I think you need to include in this discussion the definition of
'information content entity', to make sure it ends up different from
'datum'. BTW: why is there no 'information artifact' class?

Jonathan's definition seems to be computer science inspired, as in data
is different from code. For OBI at least, I thought we instead wanted
datum limited to 'scientific data', which would be the output of
measurements and observations (in OBI: assay), and data transformations
thereof.
Check out this song as an elucidation of the definition:
http://faculty.washington.edu/crowther/Misc/Songs/showme.shtml

Translating this to OBI/IAO, and trying to define the singular vs.
plural, I would propose:
An instance of datum (or data point): is an (information
artifact/information content entity?) that is the output of a single
instance of an assay or a data transformation, and is about a single
instance of an evaluant.

The problem is going to be assays like microarrays, which could be
described as thousands of simultaneous assays with each array probe.
Maybe we can and should capture this in the assay definitions, where an
'atomic assay' gives exactly one datum, vs. a parallelized assay can be
broken down into many atomic assays (FACS --> single cell, microarray
--> single probe, 454-sequencing --> single read) which produce a data
set from a single assay. Other data sets are produced as the output of
serial applications of assays (even different ones) in an investigation.

I agree that it will be tricky to keep the separation of datum and data
set throughout, so it would be nice to have a parent class for both. How
about this :

data: is information that is the output of assays or data transformations
datum: is data that can be traced to an atomic assay
data set: is data that can be broken down into multiple datum

(I am not too happy about the 'data' label)
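
One way to read the three proposed definitions is as a small class hierarchy. The sketch below is only an illustration under that assumption: the class names mirror the labels above, while the field names and the `run_parallelized_assay` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Data:
    """Parent class: information that is the output of assays or data transformations."""


@dataclass
class Datum(Data):
    """Data that can be traced to a single atomic assay."""
    value: float
    atomic_assay: str  # hypothetical identifier, e.g. one microarray probe


@dataclass
class DataSet(Data):
    """Data that can be broken down into multiple datum instances."""
    members: List[Datum] = field(default_factory=list)


def run_parallelized_assay(readings):
    """Model a parallelized assay as many atomic assays, one datum per probe."""
    return DataSet([Datum(value, probe) for probe, value in readings.items()])
```

Under this reading a microarray run yields one `DataSet` whose members are each traceable to a single probe, which is the decomposition described for FACS, microarrays, and 454 sequencing.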

I don't think I understood 'data structure' sufficiently in the
discussion below to place it.

- Bjoern

Richard Biehl (DOQS)

Dec 13, 2008, 3:43:46 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
I think I'd be uncomfortable if the definition of data ended up including a
reference to assays. While data is clearly an output of an instance of
assay, the concept of assay isn't part of the concept of data. I don't have
a good suggestion at this point, just my general unease.

As for singularity.... data v. datum: I think the distinction is going to
require some contextualization of the definition. A datum to the chemist is
likely to be data to the quantum chromodynamicist. I think the question
will be whether or not the prospective datum has a data structure that is
relevant to the context. If not, it's a datum. If so, it's data. There's
also the idea as to whether or not such structure is actually necessary to
the data/datum. NAME is data if FIRST NAME and LAST NAME are relevant and
necessary (e.g. Cher?). As data grow more comprehensive (e.g. ADDRESS,
CONTACT INFO, PERSONNEL RECORD), it becomes harder and harder to claim we're
looking at something without necessary and relevant structure. The
macro-world is mostly data/datasets. If data and datum become a sliding
scale based on context, the notion of fractals will be needed to govern an
interpretation of level-penetrating analyses.

Data structures represent a different form of artifact, on a different
meta-level from data and datum. Both levels have data structure though, so
data structure has to be interpreted according to the meta-level being
discussed.

- Rick

Barry Smith

Dec 13, 2008, 3:57:37 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
At 03:43 PM 12/13/2008, Richard Biehl (DOQS) wrote:

>I think I'd be uncomfortable if the definition of data ended up including a
>reference to assays. While data is clearly an output of an instance of
>assay, the concept of assay isn't part of the concept of data. I don't have
>a good suggestion at this point, just my general unease.

I agree. There are many ways of collecting data.


>As for singularity.... data v. datum: I think the distinction is going to
>require some contextualization of the definition. A datum to the chemist is
>likely to be data to the quantum chromodynamicist.

Here I disagree. The data are different, and this is so even where --
if we bracket issues of granularity -- they refer to the same
entities in reality.

> I think the question
>will be whether or not the prospective datum has a data structure that is
>relevant to the context. If not, it's a datum. If so, it's data.

I do not understand any of this.


> There's
>also the idea as to whether or not such structure is actually necessary to
>the data/datum. NAME is data if FIRST NAME and LAST NAME are relevant and
>necessary (e.g. Cher?).

This seems wrong. A datum can have an internal complexity.
If each of <Cher Smith> and <Sonny Jones> is a datum, then taking the
two together we get a pair of data.

> As data grow more comprehensive (e.g. ADDRESS,
>CONTACT INFO, PERSONNEL RECORD), it becomes harder and harder to claim we're
>looking at something without necessary and relevant structure. The
>macro-world is mostly data/datasets.

I hope not. I certainly did not have data for breakfast, and the
planet Earth, though bigger, is also not a datum.

> If data and datum become a sliding
>scale based on context, the notion of fractals will be needed to govern an
>interpretation of level-penetrating analyses.

Now I am getting worried. We are here trying to get clear about very
simple matters. Our endeavors should not be scientifically sophisticated.
BS

James Malone

Dec 13, 2008, 4:02:42 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
The word data is the Latin plural of datum; data are multiple datum so I don't think the parentage works as proposed.  I think data is a collection of datum, the equivalent of a part_of for information entities.  I'm tempted to suggest that data and data set are actually synonyms; they talk about collections of multiple datum.  I think in modern times people use them synonymously even though historically the 'set' part of data set was probably used to imply a collection of unique records (like a mathematical set of non-repeating members).  If we wish to separate them, then I would say a data set is a collection of data (which is a collection of datum). 

I'm not clear on how to make the distinction on what is an atomic, singular piece of information and agree that is completely context specific.  Can we somehow tie that into the definition or does that represent a fudge?  I think we should try to keep this simple though and as generic as possible lest we tie ourselves up in knots; I really would not like to see 'assay' etc in the definition for instance.

Regarding data structure; I was really thinking computer science centric when I was talking about it previously.  I was thinking trees, hash maps, records in a database, (though could this be a paper filing system or an index in a book perhaps?)  It is information about the way the data is organised, rather than about the information the data actually represents.

In summary I personally don't think any of the above, data, datum, data set or data structure would fall under a parental hierarchy to one another. 

Cheers,

James

Bjoern Peters

Dec 13, 2008, 11:13:26 PM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
No one seems to like my inclusion of assay in the definition of data. I
can live with a broader view of things labeled 'data', but would then
like to know what differentiates 'data' from 'information content
entity'. I would also like to point out the definition of processes
labeled 'assay' that we have in OBI, and would like to know what 'data'
do not originate from such processes:

An assay is a process with the objective to create as an output
information about a material entity (bearing evaluant role).

- Bjoern

Alan Ruttenberg

Dec 14, 2008, 12:06:25 AM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone, Jonathan Rees
On Sat, Dec 13, 2008 at 11:13 PM, Bjoern Peters <bpe...@liai.org> wrote:
>
> No one seems to like my inclusion of assay in the definition of data. I
> can live with a broader view of things labeled 'data', but would then
> like to know what differentiates 'data' from 'information content
> entity'.

Information content entities include information about a realizable
entity, such as specifications.

> I would also like to point out the definition of processes
> labeled 'assay' that we have in OBI, and would like to know what 'data'
> do not originate from such processes:

A measurement of the distance between the earth and the moon.
The length of a file.

> An assay is a process with the objective to create as an output
> information about a material entity (bearing evaluant role).

-Alan

Jonathan Rees

Dec 14, 2008, 10:58:40 AM
to Alan Ruttenberg, informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone
I gave a talk on Friday in which I distinguished data from software
according to the way it is considered to be correct. Software is
correct if it does what it's supposed to do, while data is correct if
what it means to say about the world is true. I don't know whether
this distinction works, as I'm not a philosopher, but no one in the
audience complained, and when I floated the idea with Gerry Sussman he
said: Oh, you mean software is a priori, and data is a posteriori.
This made sense to me...

Years ago Brian Cantwell Smith threw a wrench into this tidy
distinction by insisting that software was also about the world, and
could be wrong by saying things about the world that were wrong (by
being an incorrect model). But I will leave that aside.

I would say that what is right about the attempt to link data to assay
is that data's ability to be empirically right is special. Music is
not data, unless it's the subject matter of a study of music, and
articles of theoretical mathematics are not data, unless they are the
subject matter of a study of mathematical articles. (This is a
difference on quotation level.) (Both of these situations are outside
the scope of IAO if I understand the project correctly.)

So naively, I'd say that data has meaning and provenance.

I know you need to account for two dimensions, that of entities
(exemplified by a piece of paper on which measurements are scrawled -
or rather that aspect of that piece of paper that comprises those
measurements) and that of whatever it is that such entities can have
in common, such that we say that they "say the same thing". If you
make statements about the former you risk failing to be able to
transfer that information to other such entities. I.e. if P(x) means
that the measurement written on x is 2.72, and y is a "copy" of x,
then you'd like to be able to conclude that P(y). I would assert that
while grounding in reality is a fine thing, what normal people really
say about information rarely mentions sufficient particulars to
identify that grounding. Instead they say: "I read that high tide was
2.72 meters" and nobody really cares which computer they used or which
disk stored that information. At best they care about the provenance
of the information. This is not a criticism of entity-orientation in
principle, just a warning that the statements we record in curation
need to allow for the right kinds of generalization.
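
The P(x)/P(y) point can be sketched in code by separating the concrete bearer from the content it carries. This is a minimal illustration, not an IAO model: the `Bearer` and `Content` classes, their fields, and `copy_of` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """What copies have in common, e.g. the statement 'high tide was 2.72 meters'."""
    quantity: str
    value: float
    unit: str


@dataclass(frozen=True)
class Bearer:
    """A concrete entity (a sheet of paper, a disk) carrying some content."""
    location: str
    content: Content


def P(x: Bearer) -> bool:
    """P(x): the measurement written on x is 2.72."""
    return x.content.value == 2.72


def copy_of(x: Bearer, new_location: str) -> Bearer:
    """Make a copy: a different bearer with the same content."""
    return Bearer(new_location, x.content)
```

Because `P` is defined through the content rather than through the particular bearer, it holds of every copy, which is the kind of generalization the paragraph above asks for.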

Given the importance of provenance it would be worth making sure that
information never gets separated from the processes involved in its
life history (an agent uttering something, another agent understanding
it). Probably data without provenance is useless or nonsensical. (If a
little bird tells you a tide-level measurement, you would want to
check it before calling it "data".) I think everyone here has this
point, but it's easy to forget because for some purposes you will
forget where the information came from, and just use it on faith
(because you already determined that it was OK).

My experience is that there will never be completely satisfactory
terminology around these things (data, datum, information entity,
message, file, etc.), and you have to just set your sights low and
rely on good definitions.

As for the atomic data and composite data problem - is there any need
to make this distinction? In curation will you ever care? In logic you
care, and atomic propositions are distinguished from composite
propositions (such as conjunctions) by being called "literals", but
this is usually only of interest when you're trying to develop
procedures for proving theorems or testing entailment.

So here are the things I see still on the table:
- datum in the entity sense, such as previous day's high and low
temperatures recorded in a copy of the newspaper sitting in front of
Barry at breakfast
- datum in the abstract sense (generically dependent continuant?),
whatever it is about that entity that is shared by the copy of a
different newspaper sitting in front of me at lunch that reports the
same high and low temperatures
- orthogonally to the above, a distinction between atomic and compound
(I posit that this distinction may not be important)
- orthogonally to the above, one or more classifications into
different kinds of information entity, capturing important
distinctions in meaning (microarray obtained how? software,
literature, etc.) and/or structure (nucleotide sequence, table of
intensity readings, pair of temperature readings) and/or syntax (XML,
CSV, etc.), driven by use cases.

I think the theory should capture the principle that "everything
that's said is said by someone" (i.e. you cannot have a temperature
record without both a temperature and someone or something to record
it) and further that if it's not said for a reason it's hard to
consider it to be information (or data).

I'll try not to influence choice of terms. I understand why there are
problems with "data" and "information" (which I use more or less
interchangeably - I know that's a waste) but urge that we first figure
out which things (classes) need terms and then work towards the least
bad terminology choices. Don't hold out for terminology that feels
really good to everyone because, in this context at least, it will
never come along.

I know it doesn't seem to be in scope, but I wouldn't be surprised if
we ended up reinventing bits of classical logic (propositions,
inference, theories, interpretations, models) and/or the animal
communication literature by the time we're done with this project.
None of this apparatus was invented just for the sake of inventing
apparatus; it derived from a sincere attempt to account for the
phenomena of talking and thinking about the world, which I don't see
as being very different from the IAO activity.

Jonathan

Barry Smith

Dec 14, 2008, 1:59:35 PM
to informatio...@googlegroups.com, Alan Ruttenberg, informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone
At 10:58 AM 12/14/2008, Jonathan Rees wrote:

>I gave a talk on Friday in which I distinguished data from software
>according to the way it is considered to be correct. Software is
>correct if it does what it's supposed to do, while data is correct if
>what it means to say about the world is true. I don't know whether
>this distinction works, as I'm not a philosopher, but no one in the
>audience complained, and when I floated the idea with Gerry Sussman he
>said: Oh, you mean software is a priori, and data is a posteriori.
>This made sense to me...

As far as data is concerned, this seems to me to be on the right
track. Ontologically it would be expressed in terms of an aboutness
relation --

x is a datum --> x is about something (a.k.a true of something)

BS

Bjoern Peters

Dec 14, 2008, 9:09:03 PM
to informatio...@googlegroups.com, Alan Ruttenberg, Chris Stoeckert, Melanie Courtot, James Malone
I am with Jonathan on pretty much his entire mail. More below.

Barry Smith wrote:
> At 10:58 AM 12/14/2008, Jonathan Rees wrote:
>
>
>> I gave a talk on Friday in which I distinguished data from software
>> according to the way it is considered to be correct. Software is
>> correct if it does what it's supposed to do, while data is correct if
>> what it means to say about the world is true. I don't know whether
>> this distinction works, as I'm not a philosopher, but no one in the
>> audience complained, and when I floated the idea with Gerry Sussman he
>> said: Oh, you mean software is a priori, and data is a posteriori.
>> This made sense to me...
>>
>
> As far as data is concerned, this seems to me to be on the right
> track. Ontologically it would be expressed in terms of an aboutness
> relation --
>
> x is a datum --> x is about something (a.k.a true of something)
>
> BS
>
Currently IAO says that all information content entities are 'about'
something. That is why I asked what is the intended differentiation from
data. From what Barry writes I assume that is wrong, and only data but
not specifications (to me: plans, objectives, hypotheses, conclusions,
laws) are about something? Can we then have the second relation
'specifies', with x is a specification --> x specifies something, and
instances in reality either meet or don't meet a specification?

If data is 'true of something' and it is an information artifact =
created by a sentient, then there has to be an initial creation process.
I still believe that process should be a 'measurement / observation',
which we are considering synonyms of 'assay' in OBI. By capturing the
creation process of data, we will capture the provenance and aboutness
of the data. Subsequent data transformations can retain that without
making reference to it.

Alan gave examples of what is data that is not output of an assay
according to our current definition. I thought the measurement of
distance earth to moon would be an assay (with the Object pair
moon-earth playing the role of evaluant). I agree that 'length of the
file' poses many problems though, making it toxic on many levels. Isn't
there a length to a file independent of it being recorded as
data (dependent continuants on GDCs again)? Is the length the space on
the disk? Dependent on encoding? I would skip this example, as it
doesn't seem too relevant for this discussion.

Overall I would like to see a class that captures all kinds of
observation / measurement processes that create data, and I am happy to
change the label 'assay', or revise the definition. Also, if my
understanding of 'data' is considered too narrow, I would still like an
entity that captures only the output of measurement / observation
processes and transformations thereof.

- Bjoern

Alan Ruttenberg

Jan 5, 2009, 2:21:03 AM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone
Hi Folks,

I had a conversation with Barry this evening talking some of this
through. The outcome:

Don't distinguish data/datum. He thinks a distinction could be made on
the basis of the paper:
http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf but we
don't think it is necessary. Perhaps use "Data item" as the parent
term to avoid this.

On the question of the definition of data item and information content
entity the proposal, bringing the conversations on this thread
together would be that data items are information content entities
that are intended to be truthful statements about something (modulo
measurement error) and are constructed/acquired by methods which
reliably tend to produce (approximately) truthful statements. (Barry
can probably provide better wording).

Data items would include not only the results of measurements, but
also the results of computations that derive information from such
results, such as summarizations/averages, or which are created in order to
organize them for access or processing.

For Bjoern, another example of a data item that is not the results of
an assay would be, e.g. the nodes of a tree data structure used to
manage a sorted list of measurements (e.g. to approximate, more
efficiently, the solution of a many-body problem).

A data set would itself be a data item, some bundling of other data
items for some purpose, without implication of homogeneity or other
structural constraints, at least at the broadest level.

Thus the definition of data items involves provenance, intention,
truthiness ;-), and computation. Not a simple thing ...

We also discussed the issue of data content versus structure. Take for
example some set of referring data items, such as a set of recorded
observations of animal behavior (positions, activity, participants,
etc). This set of data items might be organized/structured in a
number of ways - as one or more lists, as a relational database, in
other kinds of data structures optimized for one or another purpose.
It seems that we might then want to think about data structure as
independent from data content and define independent hierarchies for
each, and relations that allow us to combine them using defined
classes.
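
As a rough illustration of content being independent of structure, the sketch below stores the same set of observations in two different structures and recovers identical content from both. The `Observation` fields and the function names are assumptions made for the example only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One recorded observation of animal behavior (fields are illustrative)."""
    participant: str
    activity: str
    position: tuple


def as_list(observations):
    """Structure 1: a list ordered by participant, then activity."""
    return sorted(observations, key=lambda o: (o.participant, o.activity))


def as_index(observations):
    """Structure 2: a mapping from participant to that participant's observations."""
    index = defaultdict(list)
    for o in observations:
        index[o.participant].append(o)
    return dict(index)


def content(structure):
    """Recover the structure-independent content as a set of observations."""
    if isinstance(structure, dict):
        return {o for group in structure.values() for o in group}
    return set(structure)
```

Here `content(as_list(obs))` and `content(as_index(obs))` are equal for any input, so a class of "observations of animal behavior" and a class of "list" or "index" could be defined separately and combined.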

For data structures it might be better to not roll our own, but
instead defer to some formalism that already exists for describing
data structures (assuming a reasonable one exists, which I do assume).
Perhaps Jonathan might bring up some candidate.

Ok, that's all for now. Your thoughts (particularly Bjoern's) solicited.

Best,
Alan

Jonathan Rees

Jan 5, 2009, 9:40:57 AM
to informatio...@googlegroups.com, Chris Stoeckert, Melanie Courtot, James Malone
I don't agree that the "data item" notion as you described it is
complex. It is merely deep. I like it because it parallels the
conventional definition of "knowledge" as justified true belief. A
data item is approximately justified approximately true approximate
belief.

I sometimes say that things like this are "meant" - i.e. whatever said
the thing, means it, perhaps with some qualification (every statement
in science is qualified).

I take it data items are GDCs, or something of that ilk, and not the
scrawlings-in-the-margin-of-Fermat's-copy kind of individual. The
latter will have to be called something, right? The sooner we have
agreed distinct names the better.

It does seem a bit bold in a BFO sense to separate syntax from
semantics like this, but it is quite conventional. When we say that
Fred said X, we can mean either syntax or semantics, and the
difference is a level of quotation. If we say "Fred said his name",
this is a statement of semantics: he might have said "Fred",
"Frederick", "Mr. Flintstone", "my name is Fred", etc. If we say "Fred
said 'Fred'", this is a statement of syntax: in particular it tells us
that he did not say "Mr. Flintstone". Confusion of these two levels
through misplaced or missing quotation marks in notation design is one
of the major bugaboos of software engineering.
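
The two levels of quotation can be shown with a small sketch. The `DENOTES` table and the identifier `person:fred` are hypothetical stand-ins for whatever maps expressions to what they denote.

```python
# Hypothetical table of expressions that all denote the same person.
DENOTES = {
    "Fred": "person:fred",
    "Frederick": "person:fred",
    "Mr. Flintstone": "person:fred",
}


def said_syntactically(utterance: str, quoted: str) -> bool:
    """'Fred said "Fred"': compare the character strings themselves."""
    return utterance == quoted


def said_semantically(utterance: str, referent: str) -> bool:
    """'Fred said his name': compare what the utterance denotes."""
    return DENOTES.get(utterance) == referent
```

An utterance of "Mr. Flintstone" satisfies the semantic statement but not the syntactic one, which is exactly the difference that a missing pair of quotation marks hides.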

(Alan, I know how much you dislike HTTP content negotiation, but I
have to point out that this syntax/semantics distinction is very
similar in flavor. CN advocates would like to say that there is an
"abstract document" or "document without commitment to encoding" that
says that Fred is a rubble expert. That is, when you name an AD, you
are naming a meaning (semantics). Then when you do a GET naming the
AD, what you receive is a "representation" - a syntactic entity ("Fred
is a rubble expert") that expresses that meaning (that Fred is a
rubble expert). Data item = abstract document, encoding =
representation. The property that unites all the representations of an
AD is that they all encode the same "data item". CN suffers of course
in having no theory of provenance, but that is a separate question.)
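
A sketch of the "data item = abstract document, encoding = representation" analogy, using only the Python standard library. It does not perform any HTTP request; the `encode`/`decode` helpers and the example content are assumptions made for illustration.

```python
import csv
import io
import json

# Illustrative content: the data item that all representations encode.
data_item = {"subject": "Fred", "expertise": "rubble"}


def encode(item, media_type):
    """Produce one representation (a syntactic entity) of the data item."""
    if media_type == "application/json":
        return json.dumps(item, sort_keys=True)
    if media_type == "text/csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=sorted(item))
        writer.writeheader()
        writer.writerow(item)
        return buffer.getvalue()
    raise ValueError(f"unsupported media type: {media_type}")


def decode(text, media_type):
    """Recover the data item from one of its representations."""
    if media_type == "application/json":
        return json.loads(text)
    if media_type == "text/csv":
        return dict(next(csv.DictReader(io.StringIO(text))))
    raise ValueError(f"unsupported media type: {media_type}")
```

The two encodings differ as strings, yet both decode to the same `data_item`; that shared result plays the role of the property uniting all representations of an abstract document.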

I don't think we should attempt a single universal *syntactic* theory
of data structures, as there seems to be a never ending supply of such
universal theories. Mime types seem a good place to start. I think you
just have to be opportunistic, and invent syntactic classes as you
need them.

Maybe there is a need to talk about data structures at a non-syntactic
level. If you want to talk about data structures at the semantic
level, you'd have to do so in terms of constituent propositions, if
you believe that meanings are propositional (and you should). E.g. you
could take the meaning of a syntactic entity that can be parsed as a
list to be a conjunction (another kind of list, or set) of the
meanings of the individual list elements. This might apply to the
tree data structure example - to lift it to the semantic level you
would not be able to talk about structure without first identifying
propositional content - what the nodes mean. This is not easy. You'd
have to figure out whether conjunctions (trees, etc.) are meanings,
and if not then what sort of beast they are (data structure? parse
tree? XML infoset? RDF graph?). You'd have to account for algorithms
that are generic - that don't care what the semantics is, but operate
usefully at a purely syntactic level (e.g. sorting).
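
A toy sketch of the contrast drawn above, assuming meanings are propositions that can be evaluated against a model of the world. The `world` dictionary, the `(quantity, value)` encoding of readings, and all function names are hypothetical.

```python
def meaning_of_list(items, interpret):
    """Semantic level: the list is true if every element's proposition is true."""
    return all(interpret(item) for item in items)


def generic_sort(items):
    """Syntactic level: orders items without consulting what they mean."""
    return sorted(items)


# Hypothetical state of the world and two readings that make claims about it.
world = {"tide_m": 2.72, "temp_c": 11.0}
readings = [("tide_m", 2.72), ("temp_c", 11.0)]


def holds(reading):
    """Interpret a (quantity, value) pair as the proposition 'quantity has value'."""
    quantity, value = reading
    return world.get(quantity) == value
```

`meaning_of_list(readings, holds)` needs the interpretation function to say anything, whereas `generic_sort(readings)` works on the tuples alone.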

It would be unfortunate to have to introduce a separate level of
structural analysis sitting in between syntax and semantics, but if
that's what's needed, so it goes. I guess we should explore the
options.

As for data structure frameworks - each programming language has its
own ideas, with lots of variation around what the base types are and
what the upper part of the type hierarchy looks like (are records
collections? are collections functions?). IDL was popular for a while,
but seems to be in decline. Now there's SOAP and so on. There are
various automated theorem proving systems (e.g. Larch, PRL) with
theories often tuned to the particular inference technology being
used. I would prefer to be more scholarly and attempt to identify
consensus use of terms in the mathematics and computer science
literatures (e.g. set, list, tree), but this means invention and
design, which would be bad. It would also over time descend into
idiosyncrasy and overcommitment just as all such systems do. So I'm
not sure what I'd suggest. I'll think about it.

Jonathan

Melanie Courtot

Jan 5, 2009, 7:14:23 PM
to Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, obi-denr...@googlegroups.com, Kieran O'Neill
Hi,

If I follow you, we would have information content entity, and its
subclass data item (previously data/datum).
Data item would encompass all things which are somehow "scientifically
true". Maybe we should use a different label, something like
"scientific data"?

We were discussing with Kieran in my group, and we are not sure about
the part regarding intended truthfulness of the data item. Results of
scientific processes seem OK (including measurements, computations,
etc.). But what about intentionally falsified data, for example?

Melanie

Alan Ruttenberg

Jan 6, 2009, 8:15:09 AM
to obi-denr...@googlegroups.com, Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, Kieran O'Neill
On Mon, Jan 5, 2009 at 7:14 PM, Melanie Courtot <mcou...@gmail.com> wrote:
>
> Hi,
>
> If I follow you, we would have information content entity, and its subclass
> data item (previously data/datum).
> Data item would encompass all things which are somehow "scientifically
> true". Maybe we should use a different label, something like "scientific
> data"?
>
> We were discussing with Kieran in my group, and we are not sure about the
> part regarding intended truthfulness of the data item. Results of scientific
> processes seem OK (including measurements, computations, etc.). But what about
> intentionally falsified data, for example?

The intentionally falsified data is a nice example. I can think of two
approaches. First is to consider it a mistaken classification.
Following the fallibilist point of view we would reclassify this as
not data once it was discovered not to be.

Or we could lean on the "methods which reliably tend to produce" part
of Barry's wording, and treat the falsifying research as part of a
larger system which does reliably tend to produce (approximately) true
statements.

I would lean to the former, though note that the syntactic aspect of
the data needs to be considered as something more broadly applicable
than just to data.

On the matter of scientific data, I think this is too narrow for the
top level. Consider a coin collector's listing of the years, types, and
markings on the collection of coins they have. These are data items,
but I don't think I would consider them to be scientific. Similarly for
a vendor's catalog of items for sale and their prices.

-Alan

hog...@gmail.com

Jan 6, 2009, 8:59:36 AM
to Alan Ruttenberg, obi-denr...@googlegroups.com, Jonathan Rees, informatio...@googlegroups.com, Chris Stoeckert, James Malone, Kieran O'Neill
It may also be worthwhile to consider the distinction between a measurement and the record of a measurement, and the role that may play here. My initial thought is that data fall into the latter category. In addition to imprecision/error in the process of measuring, there is also the potential for error in making the record. It is possible that one entity performs the measurement and a different entity makes the record of it (e.g., a machine makes the measurement, then I transcribe the value on the machine's display onto paper or into some electronic system).

Records of measurements may fail to reflect reality for lots of reasons: transposed digits, misplacement of the decimal point, wrong units of measure, and intentional falsification, to name just a few. A record of a measurement may even refer to a measurement that never existed.

Falsification of data may therefore fall into three basic categories:

1. Misrecording the outcome or some other aspect (such as date/time) of a measurement/assessment. For example, stating that patient #234 had a Hemoglobin A1C of 5.5 instead of 7.5.
2. Recording the existence of a measurement/assessment that never existed.
3. Intentionally leaving out from the record a measurement/assessment that did occur.

The difference between errors and falsification is in the intent of the recorder: did the person make a mistake or intentionally misrepresent something?

Regardless, per Alan, once we know a record is in error, we reclassify it as not being data.
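
As a rough illustration of the measurement/record distinction and the three categories above, here is a minimal Python sketch. The class and function names (Measurement, Record, Discrepancy, compare) are hypothetical and chosen only for this example. Intent is deliberately not modelled, so the same discrepancy may be either an honest error or a falsification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Discrepancy(Enum):
    NONE = "record matches measurement"
    MISRECORDED = "record differs from the measurement"          # category 1
    NONEXISTENT = "record of a measurement that never happened"  # category 2
    OMITTED = "measurement left out of the record"               # category 3


@dataclass(frozen=True)
class Measurement:
    subject: str
    value: float


@dataclass(frozen=True)
class Record:
    subject: str
    value: float


def compare(measurement: Optional[Measurement],
            record: Optional[Record]) -> Discrepancy:
    """Classify how a record relates to the measurement it claims to report."""
    if measurement is None and record is None:
        raise ValueError("nothing to compare")
    if measurement is None:
        return Discrepancy.NONEXISTENT
    if record is None:
        return Discrepancy.OMITTED
    if (measurement.subject, measurement.value) != (record.subject, record.value):
        return Discrepancy.MISRECORDED
    return Discrepancy.NONE


# The Hemoglobin A1C example: measured 7.5, written down as 5.5.
result = compare(Measurement("patient #234", 7.5), Record("patient #234", 5.5))
```

Whether a given Discrepancy counts as error or falsification would need a separate assertion about the recorder's intent, which this sketch leaves out.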

Bill

Melanie Courtot

Jan 6, 2009, 3:32:23 PM
to informatio...@googlegroups.com, Alan Ruttenberg, obi-denr...@googlegroups.com, Jonathan Rees, Chris Stoeckert, James Malone, Kieran O'Neill

Considering the following example:

If I say "OBI contains 1344 classes" (I believe my reasoner which says
there are 1344 classes) then I have a data item.

If I say "OBI contains 1700 classes" (which I calculated by hand and
believed to be true, but later on I check and see that I made a
mistake and the correct number is indeed 1344), what I thought was a
data item would be reclassified as not being one.

If I say "OBI contains 1700 classes" (which I know is not true, but
I'm trying to advertise OBI and 1700 sounds better than 1344 ;) ), I
don't have a data item, because I can't produce a measurement with
this output, and I didn't intend to represent the truth.

If I say "Next year, OBI will contain 2000 classes" (which I believe
to be true but it's just my personal opinion) I have only an
information content entity, as it has not been corroborated by some
kind of evidence (measurement). If I now draw a graph of the evolution
in the number of classes in OBI over time, and then extrapolate from it
that by next year OBI will indeed contain 2000 classes, I have a
data item.

If Protege says there are 1340 classes in OBI, and Pellet says there
are 1345, they are both data items about the same thing, as in both
cases the intent was to provide a true representation of the number of
classes in OBI (even though neither of them is actually right, but
that's part of the uncertainty in the measurement).
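
One way to read the cases above is as a rule over three yes/no questions. The following Python sketch is only my paraphrase of these examples under that assumption. The Statement fields and the is_data_item function are hypothetical names, not an agreed definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    text: str
    intended_as_true: bool        # did the author mean to represent the truth?
    backed_by_evidence: bool      # produced by measurement, computation, extrapolation?
    discovered_false: bool = False  # has an error since been found?


def is_data_item(s: Statement) -> bool:
    """Assumed rule: intended as true, evidence-backed, and not yet refuted."""
    return s.intended_as_true and s.backed_by_evidence and not s.discovered_false


cases = [
    Statement("OBI contains 1344 classes (reasoner)", True, True),
    Statement("OBI contains 1700 classes (hand count, later corrected)", True, True, True),
    Statement("OBI contains 1700 classes (advertising)", False, False),
    Statement("Next year OBI will contain 2000 classes (opinion)", True, False),
    Statement("Next year OBI will contain 2000 classes (extrapolated)", True, True),
    Statement("OBI contains 1340 classes (Protege)", True, True),
    Statement("OBI contains 1345 classes (Pellet)", True, True),
]

verdicts = [is_data_item(c) for c in cases]
```

Under this rule the Protege and Pellet counts both stay data items until one of them is actually shown to be wrong, which matches the reclassification approach discussed earlier in the thread.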

Does that seem to match what you mean?

Melanie
---
Mélanie Courtot
TFL- BCCRC
675 West 10th Avenue
Vancouver, BC
V5Z 1L3, Canada



