From: "Alan Ruttenberg" <alanrut...@gmail.com>
Date: January 5, 2009 2:21:03 AM EST
Cc: "Chris Stoeckert" <stoe...@pcbi.upenn.edu>, "Melanie Courtot" <mcou...@gmail.com>, "James Malone" <mal...@ebi.ac.uk>
Subject: Re: datum, data set, data structure

Hi Folks,
I had a conversation with Barry this evening talking some of this
through. The outcome:
Don't distinguish data/datum. He thinks a distinction could be made on
the basis of the paper:
http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf but we
don't think it is necessary. Perhaps use "Data item" as the parent
term to avoid this.
On the question of the definition of data item and information content
entity, the proposal, bringing the conversations on this thread
together, would be that data items are information content entities
that are intended to be truthful statements about something (modulo
measurement error) and are constructed/acquired by methods which
reliably tend to produce (approximately) truthful statements. (Barry
can probably provide better wording).
Data items would include not only the results of measurements, but
also the results of computations that derive information from such
results, such as summarizations/averages, or which are created in
order to organize them for access or processing.
For Bjoern, another example of a data item that is not the result of
an assay would be, e.g., the nodes of a tree data structure used to
manage a sorted list of measurements (e.g. to approximate, more
efficiently, the solution of a many-body problem).
A data set would itself be a data item, some bundling of other data
items for some purpose, without implication of homogeneity or other
structural constraints, at least at the broadest level.
Thus the definition of data items involves provenance, intention,
truthiness ;-), and computation. Not a simple thing ...
We also discussed the issue of data content versus structure. Take for
example some set of referring data items, such as a set of recorded
observations of animal behavior (positions, activity, participants,
etc). This set of data items might be organized/structured in a
number of ways - as one or more lists, as a relational database, in
other kinds of data structures optimized for one or another purpose.
It seems that we might then want to think about data structure as
independent from data content and define independent hierarchies for
each, and relations that allow us to combine them using defined
classes.
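The content/structure split described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Observation` fields, the sample values, and the `content` helper are hypothetical and are not part of IAO or OBI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One recorded observation of animal behavior (hypothetical fields)."""
    animal: str
    position: tuple
    activity: str


# Made-up sample observations, for illustration only.
observations = [
    Observation("A", (0.0, 1.0), "feeding"),
    Observation("B", (2.5, 1.5), "resting"),
    Observation("A", (0.5, 1.2), "grooming"),
]

# Structure 1: a flat list, in recording order.
as_list = list(observations)

# Structure 2: an index keyed by animal, roughly what a database lookup gives.
as_index = {}
for obs in observations:
    as_index.setdefault(obs.animal, []).append(obs)


def content(structure):
    """Recover the structure-independent content: the set of observations."""
    if isinstance(structure, dict):
        return frozenset(o for group in structure.values() for o in group)
    return frozenset(structure)


# Two different structures, one and the same content.
same_content = content(as_list) == content(as_index)
```

Here the two containers differ in kind, yet `content` maps both to the same set, which is the sense in which structure and content could sit in independent hierarchies.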
For data structures it might be better to not roll our own, but
instead defer to some formalism that already exists for describing
data structures (assuming a reasonable one exists, which I do assume).
Perhaps Jonathan might bring up some candidate.
Ok, that's all for now. Your thoughts (particularly Bjoern's) solicited.
Best,
Alan
On Sun, Dec 14, 2008 at 9:09 PM, Bjoern Peters <bpe...@liai.org> wrote:

I am with Jonathan on pretty much his entire mail. More below.

Barry Smith wrote:

At 10:58 AM 12/14/2008, Jonathan Rees wrote:

I gave a talk on Friday in which I distinguished data from software
according to the way it is considered to be correct. Software is
correct if it does what it's supposed to do, while data is correct if
what it means to say about the world is true. I don't know whether
this distinction works, as I'm not a philosopher, but no one in the
audience complained, and when I floated the idea with Gerry Sussman he
said: Oh, you mean software is a priori, and data is a posteriori.
This made sense to me...

As far as data is concerned, this seems to me to be on the right
track. Ontologically it would be expressed in terms of an aboutness
relation --

x is a datum --> x is about something (a.k.a true of something)

BS

Currently IAO says that all information content entities are 'about'
something. That is why I asked what is the intended differentiation from
data. From what Barry writes I assume that is wrong, and only data but
not specifications (to me: plans, objectives, hypothesis, conclusions,
laws) are about something? Can we then have the second relation
'specifies', with x is a specification --> x specifies something, and
instances in reality either meet or don't meet a specification?

If data is 'true of something' and it is an information artifact =
created by a sentient, then there has to be an initial creation process.
I still believe that process should be a 'measurement / observation',
which we are considering synonyms of 'assay' in OBI. By capturing the
creation process of data, we will capture the provenance and aboutness
of the data. Subsequent data transformations can retain that without
making reference to it.

Alan gave examples of what is data that is not output of an assay
according to our current definition.
I thought the measurement of
distance earth to moon would be an assay (with the Object pair
moon-earth playing the role of evaluant). I agree that 'length of the
file' poses many problems though, making it toxic on many levels. Isn't
there a length to a file independent of it being recorded as
data (dependent continuants on GDCs again)? Is the length the space on
the disk? Dependent on encoding? I would skip this example, as it
doesn't seem too relevant for this discussion.

Overall I would like to see a class that captures all kinds of
observation / measurement processes that create data, and I am happy to
change the label 'assay', or revise the definition. Also, if my
understanding of 'data' is considered too narrow, I would still like an
entity that captures only the output of measurement / observation
processes and transformations thereof.

- Bjoern

Years ago Brian Cantwell Smith threw a wrench into this tidy
distinction by insisting that software was also about the world, and
could be wrong by saying things about the world that were wrong (by
being an incorrect model). But I will leave that aside.

I would say that what is right about the attempt to link data to assay
is that data's ability to be empirically right is special. Music is
not data, unless it's the subject matter of a study of music, and
articles of theoretical mathematics are not data, unless they are the
subject matter of a study of mathematical articles. (This is a
difference on quotation level.) (Both of these situations are outside
the scope of IAO if I understand the project correctly.)

So naively, I'd say that data has meaning and provenance.

I know you need to account for two dimensions, that of entities
(exemplified by a piece of paper on which measurements are scrawled -
or rather that aspect of that piece of paper that comprises those
measurements) and that of whatever it is that such entities can have
in common, such that we say that they "say the same thing".
If you
make statements about the former you risk failing to be able to
transfer that information to other such entities. I.e. if P(x) means
that the measurement written on x is 2.72, and y is a "copy" of x,
then you'd like to be able to conclude that P(y). I would assert that
while grounding in reality is a fine thing, what normal people really
say about information rarely mentions sufficient particulars to
identify that grounding. Instead they say: "I read that high tide was
2.72 meters" and nobody really cares which computer they used or which
disk stored that information. At best they care about the provenance
of the information. This is not a criticism of entity-orientation in
principle, just a warning that the statements we record in curation
need to allow for the right kinds of generalization.

Given the importance of provenance it would be worth making sure that
information never gets separated from the processes involved in its
life history (an agent uttering something, another agent understanding
it). Probably data without provenance is useless or nonsensical. (If a
little bird tells you a tide-level measurement, you would want to
check it before calling it "data".) I think everyone here has this
point, but it's easy to forget because for some purposes you will
forget where the information came from, and just use it on faith
(because you already determined that it was OK).

My experience is that there will never be completely satisfactory
terminology around these things (data, datum, information entity,
message, file, etc.), and you have to just set your sights low and
rely on good definitions.

As for the atomic data and composite data problem - is there any need
to make this distinction? In curation will you ever care?
In logic you
care, and atomic propositions are distinguished from composite
propositions (such as conjunctions) by being called "literals", but
this is usually only of interest when you're trying to develop
procedures for proving theorems or testing entailment.

So here are the things I see still on the table:

- datum in the entity sense, such as previous day's high and low
  temperatures recorded in a copy of the newspaper sitting in front of
  Barry at breakfast
- datum in the abstract sense (generically dependent continuant?),
  whatever it is about that entity that is shared by the copy of a
  different newspaper sitting in front of me at lunch that reports the
  same high and low temperatures
- orthogonally to the above, a distinction between atomic and compound
  (I posit that this distinction may not be important)
- orthogonally to the above, one or more classifications into
  different kinds of information entity, capturing important
  distinctions in meaning (microarray obtained how? software,
  literature, etc.) and/or structure (nucleotide sequence, table of
  intensity readings, pair of temperature readings) and/or syntax (XML,
  CSV, etc.), driven by use cases.

I think the theory should capture the principle that "everything
that's said is said by someone" (i.e. you cannot have a temperature
record without both a temperature and someone or something to record
it) and further that if it's not said for a reason it's hard to
consider it to be information (or data).

I'll try not to influence choice of terms. I understand why there are
problems with "data" and "information" (which I use more or less
interchangeably - I know that's a waste) but urge that we first figure
out which things (classes) need terms and then work towards the least
bad terminology choices.
Don't hold out for terminology that feels
really good to everyone because, in this context at least, it will
never come along.

I know it doesn't seem to be in scope, but I wouldn't be surprised if
we ended up reinventing bits of classical logic (propositions,
inference, theories, interpretations, models) and/or the animal
communication literature by the time we're done with this project.
None of this apparatus was invented just for the sake of inventing
apparatus; it derived from a sincere attempt to account for the
phenomena of talking and thinking about the world, which I don't see
as being very different from the IAO activity.

Jonathan

On Sun, Dec 14, 2008 at 12:06 AM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:

On Sat, Dec 13, 2008 at 11:13 PM, Bjoern Peters <bpe...@liai.org> wrote:

> No one seems to like my inclusion of assay in the definition of data. I
> can live with a broader view of things labeled 'data', but would then
> like to know what differentiates 'data' from 'information content
> entity'.

Information content entities include information about a realizable
entity, such as specifications.

> I would also like to point out the definition of processes
> labeled 'assay' that we have in OBI, and would like to know what 'data'
> do not originate from such processes:

A measurement of the distance between the earth and the moon.
The length of a file.

> An assay is a process with the objective to create as an output
> information about a material entity (bearing evaluant role).

-Alan

James Malone wrote:

The word data is the Latin plural of datum; data are multiple datum so
I don't think the parentage works as proposed. I think data is a
collection of datum, the equivalent of a part_of for information
entities. I'm tempted to suggest that data and data set are actually
synonyms; they talk about collections of multiple datum.
I think in
modern times people use them synonymously even though historically the
'set' part of data set was probably used to imply a collection of
unique records (like a mathematical set of non-repeating members). If
we wish to separate them, then I would say a data set is a collection
of data (which is a collection of datum).

I'm not clear on how to make the distinction on what is an atomic,
singular piece of information and agree that is completely context
specific. Can we somehow tie that into the definition or does that
represent a fudge? I think we should try to keep this simple though
and as generic as possible lest we tie ourselves up in knots; I really
would not like to see 'assay' etc in the definition for instance.

Regarding data structure; I was really thinking computer science
centric when I was talking about it previously. I was thinking trees,
hash maps, records in a database, (though could this be a paper filing
system or an index in a book perhaps?) It is information about the
way the data is organised, rather than about the information the data
actually represents.

In summary I personally don't think any of the above, data, datum,
data set or data structure would fall under a parental hierarchy to
one another.

Cheers,
James

On Sat, Dec 13, 2008 at 8:06 PM, Bjoern Peters <bpe...@liai.org> wrote:

I think you need to include in this discussion the definition of
'information content entity', to make sure it ends up different from
'datum'. BTW: why is there no 'information artifact' class?

Jonathan's definition seems to be computer science inspired, as in
data is different from code.
For OBI at least, I thought we instead wanted
datum limited to 'scientific data', which would be the output of
measurements and observations (in OBI: assay), and data
transformations thereof.

Check out this song as an elucidation of the definition:
http://faculty.washington.edu/crowther/Misc/Songs/showme.shtml

Translating this to OBI/IAO, and trying to define the singular vs.
plural, I would propose:

An instance of datum (or data point): is an (information
artifact/information content entity?) that is the output of a single
instance of an assay or a data transformation, and is about a single
instance of an evaluant.

The problem is going to be assays like microarrays, which could be
described as thousands of simultaneous assays with each array probe.
Maybe we can and should capture this in the assay definitions,
where an 'atomic assay' gives exactly one datum, vs. a parallelized
assay that can be broken down into many atomic assays (FACS --> single
cell, microarray --> single probe, 454-sequencing --> single read)
which produce a dataset from a single assay. Other data sets are
produced as the output of serial applications of assays (even
different ones) in an investigation.

I agree that it will be tricky to keep the separation of datum and
dataset throughout, so it would be nice to have a parent class for
both. How about this:

data: is information that is the output of assays or data
transformations
datum: is data that can be traced to an atomic assay
data set: is data that can be broken down into multiple datum
(I am not too happy about the 'data' label)

I don't think I understood 'data structure' sufficiently in the
discussion below to place it.

- Bjoern

Alan Ruttenberg wrote:

Dear IAOen,

Here are parts of a conversation leading to a question about the
definitions and distinctions between datum, data structure, and data
set.

Although James offers a distinction, "datum is an information content
entity that is a representation of a single item of information", I
see a couple of issues.
First, the representation would be of
anything, not just information. Second, representations are often
composite structures. Just as there is no thing that has no parts (at
least until we get to subatomic particles), I don't see how to figure
out what is singular in this case.

We don't have a formal definition yet, but Jonathan wrote
this note:

"datum -- well, this will be very tricky to define, but maybe some
information-like stuff that might be put into a computer and that is
meant, by someone, to denote and/or to be interpreted by some
process... I would include lists, tables, sentences... I think I
might defer to Barry, or to Brian Cantwell Smith"

If we follow this, then datum would be a superclass of data set and
data structure (assuming we have both these other terms). However
Chris and James both find this counterintuitive, thinking that datum
implies singular.

My thoughts are that data structures have parts of different kinds,
and that data sets are aggregates that are collected together either
because of common provenance, or for common purpose, and which tend to
have some collection of parts of the same kind, among other things.

This might suggest:

Datum
Data structure (or structured data?)
Data set

So, anybody have thoughts about this?

-Alan

Melanie:

- Where does our relation DT -> rendering go?
  We talked about having "is_rendered_by" during the denrie calls.
- How do we deal with graph and tree data structures?

6. graph
A graph is a collection of points and lines connecting some (possibly
empty) subset of them. The points of a graph are most commonly known
as graph vertices, but may also be called "nodes" or simply "points."
Similarly, the lines connecting the vertices of a graph are most
commonly known as graph edges, but may also be called "arcs" or
"lines."
definition source: WEB: http://mathworld.wolfram.com/Graph.html

7. tree data structure (as a child of the above graph)
label: tree data structure (disambiguation with forest tree)
a tree data structure is an acyclic connected graph.
It is a
widely-used data structure that emulates a hierarchical tree structure
with a set of linked nodes. Each node has a set of zero or more
children nodes, and at most one parent node.
definition source: WEB: http://en.wikipedia.org/wiki/Tree_data_structure

Initial idea from Chris was to add graph as sibling of datum and
dataset: "I think graph and other data structures are not types of
datum but rather aggregates of data in a particular structure. How
about making graph a sibling of datum?"

Chris: (responding to)

Melanie: -- graph (would need to be added to IAO, probably as a
child of datum IAO_0000027)

I think graph and other data structures are not types of datum but
rather aggregates of data in a particular structure. How about making
graph a sibling of datum?

Alan:

a rendering is about some data, so I would make it a subproperty of
is_about.

What's tricky is what Chris alludes to - data set, versus data
structure, versus datum, in that a rendering could be of any of them,
yet if they are all siblings that suggests a common superclass.
Some questions: Can a datum sometimes be a data structure, or even a
data set? Any suggestions on how to clearly differentiate among them?
I think for now, to make progress, having data structure be a sibling
will do, but I might expect it to change when we think it through a
bit more. It's a bit like granularity.

James:

So my first question is, what is datum? The current definition is very
loose, so I'll try and tighten it a little as we iterate. I would
propose:

datum is an information content entity that is a representation of a
single item of information, such as from an observation, statement of
perceived fact, a communication, a calculation or as the result of a
process.

Key here is singular form of the class. So data set is an aggregation
of datum, I would propose so the class should contain some
has_part/is_aggregation or similar relating to datum.
The crucial thing
for a 'data set' as opposed to just 'lots of datum' is that they have
some common feature, even if it is just they were collected at same
time, I would suggest. Data structure is optional information about
the organisation and relation between the data in a data set. I would
go further to say that even data that is randomly collected, such as a
bag of words model, but that is contained with a data structure could
be considered a data set as the common feature is the data structure
which binds them. Bag of words is probably not the best example
because the other common feature is of course they are all words :)

So to clarify, I think datum is the atomic unit, data set should be
defined in terms of this atomic unit and with an extra clause that the
data share some common feature and data structure is information
about the organisation of the data set. My first thoughts on this...

Chris:

Hi James,

I agree with your views. I might go further and say that data
structures have specified relationships (i.e., the structure) between
data where a data set is an aggregate of data with some common
feature.

To answer Alan, I don't think datum can be a data structure or a
dataset.
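As a rough illustration of the tree definition quoted above ("an acyclic connected graph"), the Python sketch below checks that property for an undirected graph given as vertices and edges. The function name `is_tree` and the vertex/edge-pair representation are assumptions made for this sketch only; they do not come from IAO or from the cited definition sources. It also assumes that an empty graph is not counted as a tree.

```python
def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and acyclic."""
    vertices = set(vertices)
    edges = list(edges)
    if not vertices:
        return False  # assumption: the empty graph is not treated as a tree
    # A connected graph on n vertices is acyclic exactly when it has n - 1 edges,
    # so any other edge count rules the graph out immediately.
    if len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal from an arbitrary vertex to test connectivity.
    seen = set()
    stack = [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adjacency[v] - seen)
    return seen == vertices
```

For example, `is_tree([1, 2, 3], [(1, 2), (1, 3)])` holds, while adding the edge `(2, 3)` creates a cycle and the check fails.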
If I follow you, we would have information content entity, and its
subclass data item (previously data/datum).
Data item would encompass all things which are somehow "scientifically
true". Maybe we should use a different label, something like
"scientific data"?
We were discussing with Kieran in my group, and we are not sure about
the part regarding intended truthfulness of the data item. Results of
scientific processes seem ok (including measurements, computation
etc). But what about intentionally falsified data, for example?
Melanie
On 5-Jan-09, at 6:40 AM, Jonathan Rees wrote:
> I don't agree that the "data item" notion as you described it is
> complex. It is merely deep. I like it because it parallels the
> conventional definition of "knowledge" as justified true belief. A
> data item is approximately justified approximately true approximate
> belief.
>
> I sometimes say that things like this are "meant" - i.e. whatever said
> the thing, means it, perhaps with some qualification (every statement
> in science is qualified).
>
> I take it data items are GDCs, or something of that ilk, and not the
> scrawlings-in-the-margin-of-Fermat's-copy kind of individual. The
> latter will have to be called something, right? The sooner we have
> agreed distinct names the better.
>
> It does seem a bit bold in a BFO sense to separate syntax from
> semantics like this, but it is quite conventional. When we say that
> Fred said X, we can mean either syntax or semantics, and the
> difference is a level of quotation. If we say "Fred said his name",
> this is a statement of semantics: he might have said "Fred",
> "Frederick", "Mr. Flintstone", "my name is Fred", etc. If we say "Fred
> said 'Fred'", this is a statement of syntax: in particular it tells us
> that he did not say "Mr. Flintstone". Confusion of these two levels
> through misplaced or missing quotation marks in notation design is one
> of the major bugaboos of software engineering.
>
> (Alan, I know how much you dislike HTTP content negotiation, but I
> have to point out that this syntax/semantics distinction is very
> similar in flavor. CN advocates would like to say that there is an
> "abstract document" or "document without commitment to encoding" that
> says that Fred is a rubble expert. That is, when you name an AD, you
> are naming a meaning (semantics). Then when you do a GET naming the
> AD, what you receive is a "representation" - a syntactic entity ("Fred
> is a rubble expert") that expresses that meaning (that Fred is a
> rubble expert). Data item = abstract document, encoding =
> representation. The property that unites all the representations of an
> AD is that they all encode the same "data item". CN suffers of course
> in having no theory of provenance, but that is a separate question.)
>
> I don't think we should attempt a single universal *syntactic* theory
> of data structures, as there seems to be a never ending supply of such
> universal theories. Mime types seem a good place to start. I think you
> just have to be opportunistic, and invent syntactic classes as you
> need them.
>
> Maybe there is a need to talk about data structures at a non-syntactic
> level. If you want to talk about data structures at the semantic
> level, you'd have to do so in terms of constituent propositions, if
> you believe that meanings are propositional (and you should). E.g. you
> could take the meaning of a syntactic entity that can be parsed as a
> list to be a conjunction (another kind of list, or set) of the
> meanings of the individual list elements. This might apply to the
> tree data structure example - to lift it to the semantic level you
> would not be able to talk about structure without first identifying
> propositional content - what the nodes mean. This is not easy. You'd
> have to figure out whether conjunctions (trees, etc.) are meanings,
> and if not then what sort of beast they are (data structure? parse
> tree? XML infoset? RDF graph?). You'd have to account for algorithms
> that are generic - that don't care what the semantics is, but operate
> usefully at a purely syntactic level (e.g. sorting).
>
> It would be unfortunate to have to introduce a separate level of
> structural analysis sitting in between syntax and semantics, but if
> that's what's needed, so it goes. I guess we should explore the
> options.
>
> As for data structure frameworks - each programming language has its
> own ideas, with lots of variation around what the base types are and
> what the upper part of the type hierarchy looks like (are records
> collections? are collections functions?). IDL was popular for a while,
> but seems to be in decline. Now there's SOAP and so on. There are
> various automated theorem proving systems (e.g. Larch, PRL) with
> theories often tuned to the particular inference technology being
> used. I would prefer to be more scholarly and attempt to identify
> consensus use of terms in the mathematics and computer science
> literatures (e.g. set, list, tree), but this means invention and
> design, which would be bad. It would also over time descend into
> idiosyncrasy and overcommitment just as all such systems do. So I'm
> not sure what I'd suggest. I'll think about it.
>
> Jonathan
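One way to picture the suggestion in the quoted message, that the meaning of a list can be taken as the conjunction of the meanings of its elements while a generic algorithm such as sorting operates on syntax alone, is the Python sketch below. Everything in it is an assumption for illustration: a "meaning" is modeled crudely as a predicate over a dictionary standing in for the world, and the low-tide value is made up.

```python
from typing import Callable, Dict, List, Tuple

# Crude stand-in for a proposition: a predicate over a "world" dictionary.
Proposition = Callable[[Dict[str, float]], bool]


def meaning_of_list(element_meanings: List[Proposition]) -> Proposition:
    """Meaning of a list = conjunction of the meanings of its elements."""
    return lambda world: all(m(world) for m in element_meanings)


# Each entry pairs a syntactic form (text) with its meaning (hypothetical data).
entries: List[Tuple[str, Proposition]] = [
    ("low tide 0.40 m", lambda w: w["low_tide_m"] == 0.40),
    ("high tide 2.72 m", lambda w: w["high_tide_m"] == 2.72),
]

list_meaning = meaning_of_list([m for _, m in entries])

world = {"high_tide_m": 2.72, "low_tide_m": 0.40}
holds = list_meaning(world)  # semantic level: is the whole list true of the world?

# Syntactic level: sorting the texts never consults what they mean.
sorted_texts = sorted(text for text, _ in entries)
```

The sort works on any strings whatsoever, which is the sense in which it is generic, whereas `list_meaning` is only defined once the propositional content of each element has been identified.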
The intentionally falsified data is a nice example. I can think of two
approaches. First is to consider it a mistaken classification.
Following the fallibilist point of view we would reclassify this as
not data once it was discovered not to be.
Or we could focus on Barry's "methods which reliably tend to
produce" clause and treat the falsifying research as part of a
system which does reliably tend to produce (approximately) true
statements.
I would lean to the former, though note that the syntactic aspect of
the data needs to be considered as something more broadly applicable
than just to data.
On the matter of scientific data, I think this is too narrow for the
top level. Consider a coin collector's listing of the years, types, and
markings on the collection of coins they have. These are data items,
but I don't think I would consider them to be scientific. Similarly a
catalog of items for sale and their prices by a vendor.
-Alan
data item is an information content entity that is intended to be a
truthful statement about something (modulo measurement error) and is
constructed/acquired by a method which
reliably tends to produce (approximately) truthful statements. (synonym
datum, data). Is it a necessary condition that data items have to be the
output of some process?
data set is a data item with some bundling of other data items for
some purpose, without implication of homogeneity or other structural
constraints, at least at the broadest level.
Need a definition for data structure. Is it an information content
entity? Or is it a different type of generically dependent continuant?
Cheers,
Chris
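The two draft definitions above can be sketched as a small class hierarchy in Python. This is only a sketch under stated assumptions: the class and field names (`about`, `method`, `purpose`, `members`) and the `flatten` helper are hypothetical and are not IAO terms or relations. The point it tries to show is that a data set is itself a data item and may bundle heterogeneous members, including other data sets.

```python
from dataclasses import dataclass, field


@dataclass
class InformationContentEntity:
    about: str  # what the entity is about


@dataclass
class DataItem(InformationContentEntity):
    method: str  # how the item was constructed/acquired


@dataclass
class DataSet(DataItem):
    purpose: str = ""  # why the members were bundled
    members: list = field(default_factory=list)  # no homogeneity required


def flatten(item):
    """Yield every non-set data item bundled, directly or indirectly, in item."""
    if isinstance(item, DataSet):
        for member in item.members:
            yield from flatten(member)
    else:
        yield item


distance = DataItem(about="earth-moon distance", method="measurement")
coins = DataSet(
    about="coin collection",
    method="cataloguing",
    purpose="inventory",
    members=[
        DataItem(about="coin year", method="inspection"),
        DataItem(about="coin markings", method="inspection"),
    ],
)
# A data set bundling a measurement and another data set.
bundle = DataSet(about="mixed records", method="bundling",
                 purpose="archive", members=[distance, coins])
leaves = list(flatten(bundle))
```

Whether data structure would fit into such a hierarchy, or belongs elsewhere, is exactly the open question raised in the message above; the sketch does not try to answer it.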
Your alternate proposed definition is:
“Data item refers to a fact usually collected as the result of
experience, observation or experiment, or processes within a computer
system, or a set of premises. This may be a number, word, or image,
particularly as a measurement or observation of a set of variables.”
I'm OK with your definition for a general data item in IAO. However,
in OBI, I only care about data items in the context of experiments and
so it's necessary to relate the data to the process that generated
them. I understand that for IAO data items may not have that
restriction but would want to apply that restriction in OBI. We could
create a measured data item (and an observed data item) in OBI as
types of data items that carry these restrictions.
I also like considering data items as facts rather than bringing in
truthiness.
Cheers,
Chris
regarding fact vs truth, truth is generally understood, but is a high
standard and therefore may be off-putting. i looked up a definition of
fact, and found this:
a piece of information about circumstances that exist or events that
have occurred
it does not include anything about the piece of information being
intended to be true, accurate as far as possible to know or whatever
condition we mentally add to the word "fact". may be simpler to stick
with "truth" as modified by Alan. :-)
Cheers,
Chris
Dirk Derom wrote:
> I like the distinction, but have a (slightly off topic) question on
> what a process actually means.
>
> Say you have a participant in an experiment, and he/she/it has no
> explicit task (an explicit task might be 'fixate at the central dot'),
> and one would record eye movement. I assume that these eye movements
> are data items.
> What would be the process of these data items? Is 'measuring eye
> movement' a process? Or does process refer to what causes the data
> items to appear (e.g. the task), or more or less the collecting of it?
>
> I don't seem to grasp the connection with data item that well.
>
>
> 2009/1/15 Chris Stoeckert <stoe...@pcbi.upenn.edu>
>> <alanruttenb...@gmail.com <mailto:alanruttenb...@gmail.com>>
--
Bjoern Peters
Assistant Member
La Jolla Institute for Allergy and Immunology
9420 Athena Circle
La Jolla, CA 92037, USA
Tel: 858/752-6914
Fax: 858/752-6987
http://www.liai.org/pages/faculty-peters