Here are parts of a conversation leading to a question about the
definitions and distinctions between datum, data structure, and data
set.
Although James offers a distinction, "datum is an information content
entity that is a representation of a single item of information", I
see a couple of issues. First, the representation would be of anything,
not just information. Second, representations are often composite
structures. Just as there is no thing that has no parts (at least
until we get to subatomic particles), I don't see how to figure out
what is singular in this case.
We don't have a formal definition yet, but Jonathan wrote this note:
"datum -- well, this will be very tricky to define, but maybe some
information-like stuff that might be put into a computer and that is
meant, by someone, to denote and/or to be interpreted by some
process... I would include lists, tables, sentences... I think I might
defer to Barry, or to Brian Cantwell Smith"
If we follow this, then datum would be a superclass of data set and
data structure (assuming we have both these other terms). However,
Chris and James both find this counterintuitive, thinking that datum
implies singular.
My thoughts are that data structures have parts of different kinds,
and that data sets are aggregates that are collected together either
because of common provenance, or for common purpose, and which tend to
have some collection of parts of the same kind, among other things.
This might suggest:
Datum
Data structure (or structured data?)
Data set
So, anybody have thoughts about this?
-Alan
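One way to read the arrangement Alan lists above, taking Jonathan's note at face value so that datum is the superclass, can be sketched in Python. This is only an illustration: the class names, the `content`, `parts`, `members` and `grouping` fields, and the sample values are all assumptions, not terms agreed for IAO.

```python
from dataclasses import dataclass, field


@dataclass
class Datum:
    """Broad reading: any information-like content meant to denote something."""
    content: object = None


@dataclass
class DataStructure(Datum):
    """A datum whose parts may be of different kinds (e.g. a record)."""
    parts: dict = field(default_factory=dict)


@dataclass
class DataSet(Datum):
    """A datum aggregating parts of the same kind, grouped by provenance or purpose."""
    members: list = field(default_factory=list)
    grouping: str = ""  # hypothetical field: why the members were collected together


record = DataStructure(parts={"name": Datum("Fred"), "age": Datum(42)})
readings = DataSet(members=[Datum(36.6), Datum(37.1)], grouping="same thermometer run")

# Under this reading a record and a data set are both datums.
print(isinstance(record, Datum), isinstance(readings, Datum))
```

The last line makes the objection concrete: with this hierarchy a whole data set counts as a datum, which is the consequence Chris and James find counterintuitive.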
Melanie:
- Where does our relation DT -> rendering go?
We talked about having "is_rendered_by" during the denrie calls.
- How do we deal with graph and tree data structures?
6. graph
A graph is a collection of points and lines connecting some (possibly
empty) subset of them. The points of a graph are most commonly known as
graph vertices, but may also be called "nodes" or simply "points."
Similarly, the lines connecting the vertices of a graph are most
commonly known as graph edges, but may also be called "arcs" or
"lines."
definition source: WEB:http://mathworld.wolfram.com/Graph.html
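The graph definition above can be illustrated with a minimal Python sketch. The vertex names and the `neighbours` helper are made up for the example, and representing edges as unordered pairs is an assumption (the definition does not say whether lines are directed).

```python
# Vertices ("nodes", "points") and edges ("arcs", "lines") as plain sets.
vertices = {"a", "b", "c", "d"}
edges = {frozenset({"a", "b"}), frozenset({"b", "c"})}  # "d" is connected to nothing


def neighbours(vertex, edges):
    """Return the vertices joined to `vertex` by some edge."""
    return {other for edge in edges if vertex in edge for other in edge if other != vertex}


print(sorted(neighbours("b", edges)))  # ['a', 'c']
print(neighbours("d", edges))          # set()
```

Vertex "d" shows the "possibly empty subset" clause: a point can belong to the graph without any line touching it.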
7. tree data structure (as a child of the above graph)
label: tree data structure (disambiguation with forest tree)
A tree data structure is an acyclic connected graph. It is a widely used
data structure that emulates a hierarchical tree structure with a set of
linked nodes. Each node has a set of zero or more children nodes, and
at most one parent node.
definition source: WEB: http://en.wikipedia.org/wiki/Tree_data_structure
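The "acyclic connected graph" part of this definition can be checked mechanically. The sketch below is one possible test, assuming an undirected graph given as a set of vertices and a list of `(a, b)` pairs. The function name `is_tree` is hypothetical, and the parent/child orientation mentioned in the definition is not modelled.

```python
def is_tree(vertices, edges):
    """True if the undirected graph (vertices, edges) is connected and acyclic."""
    if not vertices:
        return False  # assumption: the empty graph is not counted as a tree
    # A connected graph on n vertices is acyclic exactly when it has n - 1 edges.
    if len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first walk from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == set(vertices)


print(is_tree({"root", "left", "right"}, [("root", "left"), ("root", "right")]))  # True
print(is_tree({"a", "b", "c"}, [("a", "b"), ("b", "c"), ("c", "a")]))             # False
```

The second call fails because the three edges form a cycle, so the structure is a graph but not a tree.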
Initial idea from Chris was to add graph as sibling of datum and data
set: "I think graph and other data structures are not types of datum
but rather aggregates of data in a particular structure. How about
making graph a sibling of datum?"
Chris: (responding to)
Melanie: -- graph (would need to be added to IAO, probably as a
child of datum IAO_0000027)
I think graph and other data structures are not types of datum but
rather aggregates of data in a particular structure. How about making
graph a sibling of datum?
Alan:
a rendering is about some data, so I would make it a subproperty of
is_about.
What's tricky is what Chris alludes to - data set, versus data
structure, versus datum, in that a rendering could be of any of them,
yet if they are all siblings that suggests a common superclass.
Some questions: Can a datum sometimes be a data structure, or even a
data set? Any suggestions on how to clearly differentiate among them?
I think for now, to make progress, having data structure be a sibling
will do, but I might expect it to change when we think it through a
bit more. It's a bit like granularity.
James:
So my first question is, what is datum? The current definition is very
loose, so I'll try and tighten it a little as we iterate. I would
propose:
datum is an information content entity that is a representation of a
single item of information, such as from an observation, statement of
perceived fact, a communication, a calculation or as the result of a
process.
Key here is the singular form of the class. So a data set is an aggregation of
datum instances, and I would propose that the class contain some
has_part/is_aggregation or similar relation to datum. The crucial thing
for a 'data set' as opposed to just 'lots of datum' is that they have some
common feature, even if it is just that they were collected at the same time, I
would suggest. Data structure is optional information about the
organisation and relation between the data in a data set. I would go
further to say that even data that is randomly collected, such as a bag of
words model, but that is contained within a data structure could be
considered a data set as the common feature is the data structure which
binds them. Bag of words is probably not the best example because the
other common feature is of course they are all words :)
So to clarify, I think datum is the atomic unit, data set should be
defined in terms of this atomic unit and with an extra clause that the
data share some common feature and data structure is information about the
organisation of the data set. My first thoughts on this...
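James's three-way split can be sketched as follows, in Python. This is a rough reading of his proposal rather than an agreed model: the field names `value`, `members`, `common_feature` and `structure` are assumptions, and the bag-of-words values are invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datum:
    """Atomic unit: a representation of a single item of information."""
    value: object


@dataclass
class DataSet:
    """An aggregate of datums that share some common feature."""
    members: list
    common_feature: str                            # e.g. "collected at the same time"
    structure: dict = field(default_factory=dict)  # optional organisation info


words = [Datum(w) for w in ["fred", "rubble", "fred"]]
bag = DataSet(members=words, common_feature="tokens from one document",
              structure={"kind": "bag", "ordered": False})
print(len(bag.members), bag.structure["kind"])
```

Here the data structure is not a class of its own but optional information attached to the data set, which is how James describes it. Chris's refinement, that the structure consists of specified relationships between the data, would need something richer than a dictionary.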
Chris:
Hi James,
I agree with your views. I might go further and say that data
structures have specified relationships (i.e., the structure) between
data where a data set is an aggregate of data with some common
feature.
To answer Alan, I don't think datum can be a data structure or a data
set.
I sometimes say that things like this are "meant" - i.e. whatever said
the thing, means it, perhaps with some qualification (every statement
in science is qualified).
I take it data items are GDCs, or something of that ilk, and not the
scrawlings-in-the-margin-of-Fermat's-copy kind of individual. The
latter will have to be called something, right? The sooner we have
agreed distinct names the better.
It does seem a bit bold in a BFO sense to separate syntax from
semantics like this, but it is quite conventional. When we say that
Fred said X, we can mean either syntax or semantics, and the
difference is a level of quotation. If we say "Fred said his name",
this is a statement of semantics: he might have said "Fred",
"Frederick", "Mr. Flintstone", "my name is Fred", etc. If we say "Fred
said 'Fred'", this is a statement of syntax: in particular it tells us
that he did not say "Mr. Flintstone". Confusion of these two levels
through misplaced or missing quotation marks in notation design is one
of the major bugaboos of software engineering.
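The two levels of quotation can be made concrete with a small Python sketch. The set `DENOTES_FRED` and both function names are hypothetical. The set stands in for whatever decides that different strings carry the same meaning.

```python
# Hypothetical table: strings that all denote the same person.
DENOTES_FRED = {"Fred", "Frederick", "Mr. Flintstone", "my name is Fred"}


def said_his_name(utterance):
    """Semantic report: true for any string that means Fred's name."""
    return utterance in DENOTES_FRED


def said_quoted(utterance, quoted):
    """Syntactic report: true only for the exact string quoted."""
    return utterance == quoted


utterance = "Mr. Flintstone"
print(said_his_name(utterance))        # True
print(said_quoted(utterance, "Fred"))  # False
```

The same utterance satisfies the semantic report and fails the syntactic one, which is the distinction that misplaced quotation marks erase.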
(Alan, I know how much you dislike HTTP content negotiation, but I
have to point out that this syntax/semantics distinction is very
similar in flavor. CN advocates would like to say that there is an
"abstract document" or "document without commitment to encoding" that
says that Fred is a rubble expert. That is, when you name an AD, you
are naming a meaning (semantics). Then when you do a GET naming the
AD, what you receive is a "representation" - a syntactic entity ("Fred
is a rubble expert") that expresses that meaning (that Fred is a
rubble expert). Data item = abstract document, encoding =
representation. The property that unites all the representations of an
AD is that they all encode the same "data item". CN suffers of course
in having no theory of provenance, but that is a separate question.)
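The analogy can be sketched in Python as one "abstract document" with several encodings. This is a deliberately simplified illustration: `REPRESENTATIONS` and `negotiate` are hypothetical, the JSON body is invented, and the `accept` argument is a plain list in preference order rather than a parsed HTTP Accept header.

```python
# Hypothetical "abstract document": one meaning, several encodings keyed by media type.
REPRESENTATIONS = {
    "text/plain": "Fred is a rubble expert",
    "application/json": '{"subject": "Fred", "expertise": "rubble"}',
}


def negotiate(accept):
    """Return (media_type, body) for the first acceptable type, else None."""
    for media_type in accept:
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    return None


print(negotiate(["text/html", "application/json"]))
print(negotiate(["image/png"]))  # None
```

Both bodies are different syntactic entities, and what unites them is only that they are meant to encode the same data item.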
I don't think we should attempt a single universal *syntactic* theory
of data structures, as there seems to be a never ending supply of such
universal theories. Mime types seem a good place to start. I think you
just have to be opportunistic, and invent syntactic classes as you
need them.
Maybe there is a need to talk about data structures at a non-syntactic
level. If you want to talk about data structures at the semantic
level, you'd have to do so in terms of constituent propositions, if
you believe that meanings are propositional (and you should). E.g. you
could take the meaning of a syntactic entity that can be parsed as a
list to be a conjunction (another kind of list, or set) of the
meanings of the individual list elements. This might apply to the
tree data structure example - to lift it to the semantic level you
would not be able to talk about structure without first identifying
propositional content - what the nodes mean. This is not easy. You'd
have to figure out whether conjunctions (trees, etc.) are meanings,
and if not then what sort of beast they are (data structure? parse
tree? XML infoset? RDF graph?). You'd have to account for algorithms
that are generic - that don't care what the semantics is, but operate
usefully at a purely syntactic level (e.g. sorting).
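A toy Python version of the list-as-conjunction idea, next to a generic algorithm that ignores meaning, might look like this. It is a sketch under strong assumptions: meanings are collapsed to truth values, and `WORLD`, the sample sentences and `meaning_of_list` are all invented for the example.

```python
def meaning_of_list(elements, meaning_of):
    """Take a list's meaning to be the conjunction of its elements' meanings."""
    return all(meaning_of(e) for e in elements)


# Hypothetical interpretation: each string asserts a fact found in a tiny "world".
WORLD = {"Fred is a rubble expert", "Fred lives in Bedrock"}
claims = ["Fred lives in Bedrock", "Fred is a rubble expert"]

print(meaning_of_list(claims, lambda s: s in WORLD))  # True

# Sorting is generic: it compares the strings and never consults WORLD.
print(sorted(claims))
```

The first call needs an interpretation before it can say anything, while `sorted` works on the strings alone, which is the gap between the semantic and the purely syntactic level.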
It would be unfortunate to have to introduce a separate level of
structural analysis sitting in between syntax and semantics, but if
that's what's needed, so it goes. I guess we should explore the
options.
As for data structure frameworks - each programming language has its
own ideas, with lots of variation around what the base types are and
what the upper part of the type hierarchy looks like (are records
collections? are collections functions?). IDL was popular for a while,
but seems to be in decline. Now there's SOAP and so on. There are
various automated theorem proving systems (e.g. Larch, PRL) with
theories often tuned to the particular inference technology being
used. I would prefer to be more scholarly and attempt to identify
consensus use of terms in the mathematics and computer science
literatures (e.g. set, list, tree), but this means invention and
design, which would be bad. It would also over time descend into
idiosyncrasy and overcommitment just as all such systems do. So I'm
not sure what I'd suggest. I'll think about it.
Jonathan
If I follow you, we would have information content entity, and its
subclass data item (previously data/datum).
Data item would encompass all things which are somehow "scientifically
true". Maybe we should use a different label, something like
"scientific data"?
We were discussing with Kieran in my group, and we are not sure about
the part regarding intended truthfulness of the data item. Result of
scientific processes seem ok (including measurements, computation
etc). But what about intentionally falsified data for example?
Melanie
The intentionally falsified data is a nice example. I can think of two
approaches. First is to consider it a mistaken classification.
Following the fallibilist point of view we would reclassify this as
not data once it was discovered not to be.
Or we could appeal to Barry's "methods which reliably tend to
produce" clause, and treat the falsifying research as part of a
system which does reliably tend to produce (approximately) true
statements.
I would lean to the former, though note that the syntactic aspect of
the data needs to be considered as something more broadly applicable
than just to data.
On the matter of scientific data, I think this is too narrow for the
top level. Consider a coin collector's listing of the years, types, and
markings on the collection of coins they have. These are data items,
but I don't think I would consider them to be scientific. Similarly a
catalog of items for sale and their prices by a vendor.
-Alan