Good RDF/Bad RDF experiment


Steve Baskauf

Mar 7, 2012, 10:24:47 PM
to tdwg...@googlegroups.com
I finished a round of experimenting with the TDWG sandbox using a file
that contained "bad" RDF which used a literal object with a predicate
that had a range defined as a non-literal class. You can see what I did
at http://code.google.com/p/tdwg-rdf/wiki/GoodRdfBadRdf . I really only
found out one very interesting thing: when inferencing was turned on,
the 4sr reasoner didn't seem to have any problem inferring that a
literal object was an instance of the dcterms:Agent class based on the
range of dcterms:creator. That's something that you can't declare
directly (well, I don't think you can) because a literal can't be the
subject of a triple. Also, when asked what the rdf:type was of a
literal, it behaved as if it did not have one, rather than saying that
its type was rdfs:Literal. It is hard to know if any kind of
generalizations can be made about this or if these are just
idiosyncrasies of the 4sr reasoner.
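To make the inference concrete, here is a toy sketch of the RDFS range-entailment rule (rdfs3) that 4sr appears to be applying. The plain-string "triples", the ex: names, and the function are all illustrative, not 4sr internals:

```python
# Toy illustration of the RDFS range entailment rule (rdfs3):
#   if (p rdfs:range C) and (s p o), then infer (o rdf:type C).
# Plain tuples stand in for triples; this is not a real triple store.

RANGE = "rdfs:range"
TYPE = "rdf:type"

def infer_range_types(triples):
    """Apply rdfs3 once and return the inferred rdf:type triples."""
    ranges = {s: o for (s, p, o) in triples if p == RANGE}
    inferred = set()
    for (s, p, o) in triples:
        if p in ranges:
            inferred.add((o, TYPE, ranges[p]))
    return inferred

data = [
    ("dcterms:creator", RANGE, "dcterms:Agent"),
    # "Bad" RDF: a literal object where an Agent is expected.
    ("ex:image1", "dcterms:creator", '"Steve Baskauf"'),
]

# Note the rule fires regardless of whether the object is a literal,
# which is why a reasoner can type a literal as dcterms:Agent.
print(infer_range_types(data))
```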

Because the 4sr reasoner claimed to reason on subProperty relationships,
I tried investigating whether any inferred triples were generated
based on the declaration that dcterms:creator is a subproperty of
dc:creator. However, nothing happened. So either 4sr doesn't really
make use of subProperty relationships or I don't know what I'm doing
(always a distinct possibility!). I tried looking at the instructions,
but they apparently haven't been written yet. :-(
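For comparison, the subProperty entailment that 4sr claimed but did not produce (RDFS rule rdfs7) can be sketched the same way; again the tuples and ex: names are invented for illustration:

```python
# Toy illustration of the RDFS subproperty entailment rule (rdfs7):
#   if (p rdfs:subPropertyOf q) and (s p o), then infer (s q o).

SUBPROP = "rdfs:subPropertyOf"

def infer_subproperty_triples(triples):
    """Apply rdfs7 once and return the inferred triples."""
    supers = {s: o for (s, p, o) in triples if p == SUBPROP}
    inferred = set()
    for (s, p, o) in triples:
        if p in supers:
            inferred.add((s, supers[p], o))
    return inferred

data = [
    ("dcterms:creator", SUBPROP, "dc:creator"),
    ("ex:image1", "dcterms:creator", "ex:baskauf"),
]

# A reasoner honoring rdfs7 should surface the dc:creator triple.
print(infer_subproperty_triples(data))
```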

If anybody has access to a different SPARQL endpoint that does
inferencing, they can try repeating the experiments or doing others.
The file is at
http://bioimages.vanderbilt.edu/rdf/examples/good-rdf-bad-rdf-test.rdf.
For some reason, I can never get my files to load into triplestore 3
(maybe because the server does not correctly identify them as
application/rdf+xml).

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu


Bob Morris

Mar 7, 2012, 11:37:41 PM
to tdwg...@googlegroups.com
This is exactly what I would expect.

Although I didn't look extensively, I found nothing in dcterms that
would prevent something from being both an object property and a
datatype property. More particularly, I find nothing in either
dcterms or rdfs that requires that the class dcterms:Agent and the
class rdfs:Literal cannot intersect. Assuming that is correct, you
need a reasoner that finds a contradiction in something more
restrictive than the formal semantics of rdfs.
You will need an axiom that either asserts or infers such
disjointness. Some species of OWL will have
that--principally OWL DL and some lower ones--but from
http://www.w3.org/TR/owl-ref/ Sec 4:

"NOTE: In OWL Full, object properties and datatype properties are not
disjoint. Because data values can be treated as individuals, datatype
properties are effectively subclasses of object properties. In OWL
Full owl:ObjectProperty is equivalent to rdf:Property. In practice,
this mainly has consequences for the use of
owl:InverseFunctionalProperty. See also the OWL Full characterization
in Sec. 8.1."

But also, http://dublincore.org/documents/dcmi-terms doesn't declare
itself an owl ontology in the first place, so one would not expect to
do OWL reasoning.

I am pretty sure that to reason entirely within rdfs and the things
within dwc and those rdf vocabularies dwc uses, you will have to
promulgate a "best practice" in the form of an added axiom
that asserts this disjointness (or accept the not-formally-
contradictory situation that you found). Furthermore, there may be
other such conundrums besides this particular one. That is
why there are more OWL species in OWL 2 than in OWL 1: the different
OWL 2 "profiles" make different compromises in support of different
uses.

For things that \are/ declared as OWL ontologies, the diagnoses of
http://www.mygrid.org.uk/OWL/Validator and
http://owl.cs.manchester.ac.uk/validator/ are often fairly
informative about what goes with something that is declared as an OWL
ontology.

It's disarmingly simple to fall into OWL FULL when designing an
ontology. http://purl.org/dsw/ and
http://lod.taxonconcept.org/ontology/txn.owl both do, and probably
neither can enforce a distinction between object properties and
datatype properties.

Bob

--
Robert A. Morris

Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390

IT Staff
Filtered Push Project
Harvard University Herbaria
Harvard University

email: morri...@gmail.com
web: http://efg.cs.umb.edu/
web: http://etaxonomy.org/mw/FilteredPush
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my
own behalf and in no way should be deemed to express
official positions of The University of Massachusetts at Boston or
Harvard University.

Bob Morris

Mar 8, 2012, 12:01:42 AM
to tdwg...@googlegroups.com
In this post, near the end, I wrote

For things that \are/ declared as OWL ontologies, the diagnoses of
http://www.mygrid.org.uk/OWL/Validator and
http://owl.cs.manchester.ac.uk/validator/ are often fairly
informative about what goes with something that is declared as an OWL
ontology.


But it should say

For things that \are/ declared as OWL ontologies, ... informative
about what goes wrong with...

Steve Baskauf

Mar 8, 2012, 7:00:36 AM
to tdwg...@googlegroups.com
The way I am looking at this, there are three questions relevant to the issue of using literals with properties whose declared range is a non-literal class:

1. Can we do that?
2. Can we get away with doing that?
3. Should we do that?

As far as question 1 is concerned, Bob (and, I think, previously Hilmar) has asserted that we CAN do it, at least under relaxed conditions (OWL Full, for example).

However, just because something is logically possible doesn't mean that doing it will have desirable consequences.  I am logically able to do things like never wearing my seat belt when I drive or sticking paper clips into electrical sockets, but doing those things may have undesirable consequences for me.  In the one experiment that I have done, I showed that I can "get away with" using a literal with dcterms:creator without "breaking" the SPARQL query (i.e., I get a result for the query rather than an error message or program crash).  However, as we know from experimental science, supporting a hypothesis in one experiment does not demonstrate that the hypothesis will be supported in all experiments.  In this case, not only do we not want to "break" existing applications, but we also don't want to break applications that programmers might want to create in the future.

As for question 3, there are things that I can *get away with* doing (for example, labeling a men's room "broom closet" or vice versa) that I *should not* do.  That is because our society functions effectively when there are symbolic conventions that we agree upon and follow.  I will never forget riding in a taxi in Santa Cruz, on the "frontier" of Bolivia, and seeing traffic lights lying along the side of the road.  I asked the driver if they were just putting them up.  He told me "no"; they were just taking them down because nobody paid attention to them.  Traffic lights are only useful when people know that they should behave in certain ways (i.e., stopping) when the light is red and then behave that way.  I believe that there is a reason why Darwin Core says things like "A list (concatenated and separated) of names.." in the definition of dwc:recordedBy.  It sets the expectation that users of dwc:recordedBy will provide a string value for that term.  If I write code which substitutes the value of dwc:recordedBy for x in the following:

document.getElementById('copy').innerHTML='This occurrence was recorded by '+x+'.';

I want to know that I will get something that looks like:

This occurrence was recorded by Steve Baskauf.

If a data provider goes against the definition given in the DwC standard and uses a URI with dwc:recordedBy because they can "get away with it", then I'm going to get things like:

This occurrence was recorded by http://bioimages.vanderbilt.edu/contact/baskauf.

Is that "bad"?  I would say "yes"!
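A minimal sketch of the defensive alternative, assuming a hypothetical render_recorded_by helper: a client could at least detect when a URI has been supplied where the DwC definition leads it to expect a name, rather than splicing it blindly into prose. (In parsed RDF the node kind is known, so this string-sniffing heuristic is only for illustration.)

```python
# Sketch of a defensive client: render a human-friendly string for a
# dwc:recordedBy value that may be a literal name or (against the DwC
# definition) a URI. Heuristic only; in parsed RDF the node kind is known.

from urllib.parse import urlparse

def render_recorded_by(value):
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https", "urn"):
        # A URI snuck in where a name was expected; flag it rather
        # than splicing it into prose as if it were a name.
        return f"This occurrence was recorded by the agent identified by <{value}>."
    return f"This occurrence was recorded by {value}."

print(render_recorded_by("Steve Baskauf"))
print(render_recorded_by("http://bioimages.vanderbilt.edu/contact/baskauf"))
```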

So from the standpoint of the purpose of this group, which I think includes articulating best practices, questions 2 and 3 are much more relevant than question 1.  I hope that in a reasonably short period of time (measured in weeks rather than months) we can arrive at a consensus on a few general best-practice statements that would guide people's use of well-known terms in RDF.

Steve

Hilmar Lapp

Mar 8, 2012, 9:34:59 AM
to tdwg...@googlegroups.com

On Mar 7, 2012, at 11:37 PM, Bob Morris wrote:

> from http://www.w3.org/TR/owl-ref/ Sec 4:
>
> "NOTE: In OWL Full, object properties and datatype properties are not
> disjoint. Because data values can be treated as individuals, datatype
> properties are effectively subclasses of object properties. In OWL
> Full owl:ObjectProperty is equivalent to rdf:Property. In practice,
> this mainly has consequences for the use of owl:InverseFunctionalProperty. See also the OWL Full characterization in Sec. 8.1."

Thanks for pointing that out, Bob, and indeed I was suspecting this to be the case. Never hurts to read and remind oneself of the spec.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

Steve Baskauf

Mar 8, 2012, 9:57:18 AM
to tdwg...@googlegroups.com
Actually, a better example of why it would be "bad" to use a literal with a property having a defined non-literal range might have been a property like foaf:maker which to my knowledge has always unambiguously had the expectation that its object would be a URI reference.  That expectation could be exploited by writing an application which would allow the user to attempt to learn more about the creator by dereferencing the URI*.  Such a feature would be guaranteed to fail if data providers started providing literals as objects of triples having foaf:maker as their predicate.  I don't provide an example from Darwin Core because I do not believe that Darwin Core presently has any properties that are unambiguously intended to have URIs as their values.  But that is a topic for another thread.

Steve

* I acknowledge that there are URIs which are not necessarily dereferenceable, such as URNs.  However, even using a non-dereferenceable URI as an object would allow the application to "look up" more information about the resource referenced by the URI if that information were contained in a triple store or RDF data dump to which the application had access.

On 3/8/2012 6:00 AM, Steve Baskauf wrote:
...  I believe that there is a reason why Darwin Core says things like "A list (concatenated and separated) of names.." in the definition of dwc:recordedBy.  It sets the expectation that users of dwc:recordedBy will provide a string value for that term.  If I write code which substitutes the value of dwc:recordedBy for x in the following:


document.getElementById('copy').innerHTML='This occurrence was recorded by '+x+'.';

I want to know that I will get something that looks like:

This occurrence was recorded by Steve Baskauf.

If a data provider goes against the definition given in the DwC standard and uses a URI with dwc:recordedBy because they can "get away with it", then I'm going to get things like:

This occurrence was recorded by http://bioimages.vanderbilt.edu/contact/baskauf.

Is that "bad"?  I would say "yes"!

Hilmar Lapp

Mar 8, 2012, 10:11:02 AM
to tdwg...@googlegroups.com
On Mar 8, 2012, at 7:00 AM, Steve Baskauf wrote:

I believe that there is a reason why Darwin Core says things like "A list (concatenated and separated) of names.." in the definition of dwc:recordedBy.  It sets the expectation that users of dwc:recordedBy will provide a string value for that term.

True.

  If I write code which substitutes the value of dwc:recordedBy for x in the following:

document.getElementById('copy').innerHTML='This occurrence was recorded by '+x+'.';

I want to know that I will get something that looks like:

This occurrence was recorded by Steve Baskauf.

If a data provider goes against the definition given in the DwC standard and uses a URI with dwc:recordedBy because they can "get away with it", then I'm going to get things like:

This occurrence was recorded by http://bioimages.vanderbilt.edu/contact/baskauf.

Is that "bad"?  I would say "yes"!

That depends on how you're looking at it. For one, it is bad because your code is bad - simply assuming that you don't have to dereference the URI to get a label is brittle coding. It's also bad because it is indeed counter to the expectation given in the DwC textual definition of the property. It would, however, be quite good if I were a Linked Data client trying to aggregate things for the person named "Steve Baskauf".

BTW, according to the DwC documentation dwc:recordedBy refines dwc:accordingTo, which doesn't seem to exist (anymore?). At least I can't find it in the documentation. dwc:recordedBy also doesn't define a range, so as far as machines are concerned, they really need to be prepared to find anything as the object. (Machines can't interpret definitions written for humans.)

Hilmar Lapp

Mar 8, 2012, 10:20:25 AM
to tdwg...@googlegroups.com
I forgot to comment on this point from an earlier post in this thread. I think the mindset that we need properties to carry all the semantics about "what to expect", or what kind of thing the property value denotes, is a remnant of our relational modeling days. There is very little place for that in an RDF world. Objects can, and should, speak for themselves: if we use dereferenceable URIs wherever possible, and if we write software clients that don't make unwarranted assumptions, then we don't need a gazillion different properties just to tell us certain nuances about the expected property value.

RDF is really different from relational modeling. In a relational database, the combination of table and column, and the definition of column type, tell us mostly what we need to know about dealing with a column's value, and so we obsess about those things.  We need to fully let go of this paradigm in an RDF world, or we are not gaining its benefits and might as well continue doing relational data and XML.

-hilmar

Steve Baskauf

Mar 8, 2012, 10:39:09 AM
to tdwg...@googlegroups.com, Hilmar Lapp
Responses inline


On 3/8/2012 9:11 AM, Hilmar Lapp wrote:

  If I write code which substitutes the value of dwc:recordedBy for x in the following:

document.getElementById('copy').innerHTML='This occurrence was recorded by '+x+'.';

I want to know that I will get something that looks like:

This occurrence was recorded by Steve Baskauf.

If a data provider goes against the definition given in the DwC standard and uses a URI with dwc:recordedBy because they can "get away with it", then I'm going to get things like:

This occurrence was recorded by http://bioimages.vanderbilt.edu/contact/baskauf.

Is that "bad"?  I would say "yes"!

That depends on how you're looking at it. For one, it is bad because your code is bad - simply assuming that you don't have to dereference the URI to get a label is brittle coding.
I'm not assuming that.  I'm assuming that by some means, the application is already aware of the triple

[resource] dwc:recordedBy "Steve Baskauf"

or

[resource] dwc:recordedBy http://bioimages.vanderbilt.edu/contact/baskauf

(depending on which part of the example we are referring to) and wants to make use of that information.  It would indeed be bad to try to infer anything about the URI from the form of its string without dereferencing it.

It's also bad because it is indeed counter to the expectation given in the DwC textual definition of the property. It would, however, be quite good if I were a Linked Data client trying to aggregate things for the person named "Steve Baskauf".

BTW according to the DwC documentation dwc:recordedBy refines dwc:accordingTo, which doesn't seem to exist (anymore?). At least I can't find it in the documentation. dwc:recordedBy also doesn't define a range, so as far as a machine is concerned, they really need to be prepared to find anything as the object. (Machines can't interpret definitions written for humans.)
Well, this is an example of what I have been complaining about with regard to the lack of clarity about which documents actually constitute the normative definition of Darwin Core.  I believe (but do not know with certainty) that the document http://rs.tdwg.org/dwc/rdf/dwcterms.rdf IS the normative definition.  In that document, dwc:accordingTo exists and has the following properties:

<rdf:Description rdf:about="http://rs.tdwg.org/dwc/terms/accordingTo">
    <rdfs:label xml:lang="en-US">According To</rdfs:label>   
    <rdfs:comment xml:lang="en-US">Abstract term to attribute information to a source.</rdfs:comment>       
    <rdfs:isDefinedBy rdf:resource="http://rs.tdwg.org/dwc/terms/"/>   
    <dcterms:issued>2009-01-21</dcterms:issued>   
    <dcterms:modified>2009-01-21</dcterms:modified>   
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>   
    <dcterms:hasVersion rdf:resource="http://rs.tdwg.org/dwc/terms/history/#accordingTo-2009-01-21"/>               
    <dcterms:replaces rdf:resource="http://rs.tdwg.org/dwc/terms/accordingTo-2009-01-21"/>   
    <dwcattributes:status>recommended</dwcattributes:status>      
    <dwcattributes:abcdEquivalence>not in ABCD</dwcattributes:abcdEquivalence>       
</rdf:Description>

So if I am correct about what constitutes the normative document, dwc:accordingTo IS a current and recommended term.  However, as you note, it is found nowhere in the human-readable documents such as http://rs.tdwg.org/dwc/terms/index.htm.  This is unacceptably sloppy in my view, but as yet there hasn't been any response to the issue that I raised about it (http://code.google.com/p/darwincore/issues/detail?id=134).

As far as the issue of what a machine should be ready to do, I think that is what lies at the core of this discussion.  If we expect a machine to figure out any kind of slop that people create and call RDF, then we may be making the job of writing software so difficult that nobody will do it.  But I don't write software, so maybe the job is easier than I think.  I think this would be a good next topic of discussion, but it should be started under a new subject line.  I will do that.

Steve Baskauf

Mar 8, 2012, 10:48:40 AM
to tdwg...@googlegroups.com, Hilmar Lapp, John Deck
I think that Hilmar's and my last emails crossed.  I am going to take Hilmar's last comment and use it as the basis of a thread with a new subject line.  I am going to stay on the sidelines and listen on this topic because I think that people who are or may be writing applications to expose, aggregate, or make use of RDF data should weigh in on this.  In particular, I would like to hear some feedback from John Deck and others who have written or are writing software to expose databases as RDF.  See http://biscicol.blogspot.com/ .

Steve


On 3/8/2012 9:20 AM, Hilmar Lapp wrote:
I forgot to comment on this point from an earlier post in this thread. I think the mindset that we need properties to carry all the semantics about "what to expect", or what kind of thing the property value denotes, is a remnant of our relational modeling days. There is very little place for that in an RDF world. Objects can, and should, speak for themselves: if we use dereferenceable URIs wherever possible, and if we write software clients that don't make unwarranted assumptions, then we don't need a gazillion different properties just to tell us certain nuances about the expected property value.

RDF is really different from relational modeling. In a relational database, the combination of table and column, and the definition of column type, tell us mostly what we need to know about dealing with a column's value, and so we obsess about those things.  We need to fully let go of this paradigm in an RDF world, or we are not gaining its benefits and might as well continue doing relational data and XML.

-hilmar


Hilmar Lapp

Mar 8, 2012, 11:01:58 AM
to tdwg...@googlegroups.com

On Mar 8, 2012, at 10:39 AM, Steve Baskauf wrote:

If we expect a machine to figure out any kind of slop that people create and call RDF, then we may be making the job of writing software so difficult that nobody will do it.

Well, the web as we know it took off because browser software was written with the ability to figure out any kind of slop that people create and call HTML.

I think people who can write software can also get themselves to write clever software. But it's a bigger challenge to get people who don't understand metadata, let alone RDF, to write metadata fully compliant with specifications that can't be validated, because the specifications are a collection of conventions rather than database-enforced integrity constraints.

Data sharing is a messy business.

Steve Baskauf

Mar 8, 2012, 12:03:35 PM
to tdwg...@googlegroups.com, Hilmar Lapp, John Deck
I see your point about HTML.  But I'm still left with several questions.  If it is so easy to write clever software to interpret RDF slop, then why hasn't the amazing "Semantic Web" come to fruition by now, after more than a decade of talking about it?  Also, what is the purpose of making the distinction in OWL between datatype properties and object properties if it doesn't really matter whether metadata providers use strings or URIs with particular properties?  I realize it is difficult in an email to convey one's tone, but I'm not trying to be cheeky here.  I actually don't know or understand the answers to these questions in light of what Hilmar has just said.  If it doesn't really matter how people write their RDF because the software will just fix it all, then I'm wondering whether we actually need this group or not.  We could just tell people to do whatever they want rather than trying to define best practices.

Steve


On 3/8/2012 10:01 AM, Hilmar Lapp wrote:

On Mar 8, 2012, at 10:39 AM, Steve Baskauf wrote:

If we expect a machine to figure out any kind of slop that people create and call RDF, then we may be making the job of writing software so difficult that nobody will do it.

Well, the web as we know it took off because browser software was written with the ability to figure out any kind of slop that people create and call HTML.

I think people who can write software can also get themselves to write clever software. But it's a bigger challenge to get people who don't understand metadata let alone RDF to write metadata fully compliant with specifications that can't be validated because the specifications are a collection of conventions rather than database-enforced integrity constraints.

Data sharing is a messy business.



Hilmar Lapp

Mar 8, 2012, 7:29:18 PM
to Steve Baskauf, tdwg...@googlegroups.com, John Deck
On Mar 8, 2012, at 12:03 PM, Steve Baskauf wrote:

I see your point about HTML.  But I'm still left with several questions.  If it is so easy to write clever software to interpret RDF slop, then why hasn't the amazing "Semantic Web" come to fruition by now after more than a decade of talking about it?

We should be careful in this group to not confuse technologies. Publishing RDF by itself isn't the semantic web, nor is publishing Linked Data. It's perhaps a great (because low-barrier) step towards it, but a semantic web also needs ontologies that support reasoning to the extent of supporting agent-based decision making and knowledge discovery. Building ontologies (in OWL, for example) that allow rich and meaningful reasoning and are commonly reused is a *hard* problem.

And Linked Data *has* taken off, BTW.

We could just tell people to do whatever they want rather than trying to define best-practices.

To (ab)use the HTML analogy further, there are lots and lots of best practices for writing HTML documents. The reason isn't that writing poor HTML slop is difficult or crashes browsers (as we know it doesn't), but that creating HTML in ways that are search-engine optimized, more maintainable, more findable, more robust across different browsers, etc., is increasingly involved.

So I think there's lots of room for pointing out which practices have which advantages, downsides, or consequences in ways that people may not be aware of. In this context, I'd strongly recommend reading the draft W3C Interest Group note on "Mapping and linking life science data using RDF" that I posted earlier:


Quoting from the abstract:  "This W3C Note summarizes emerging practices for creating and publishing healthcare and life sciences data as Linked Data in such a way that they are discoverable and useable by users, Semantic Web agents, and applications." Isn't that much of what we want to accomplish here for biodiversity data?

-hilmar

Steve Baskauf

Mar 9, 2012, 4:24:15 PM
to Hilmar Lapp, tdwg...@googlegroups.com
Hilmar,


On 3/8/2012 6:29 PM, Hilmar Lapp wrote:

On Mar 8, 2012, at 12:03 PM, Steve Baskauf wrote:

I see your point about HTML.  But I'm still left with several questions.  If it is so easy to write clever software to interpret RDF slop, then why hasn't the amazing "Semantic Web" come to fruition by now after more than a decade of talking about it?

We should be careful in this group to not confuse technologies. Publishing RDF by itself isn't the semantic web, nor is publishing Linked Data. It's perhaps a great (because low-barrier) step towards it, but a semantic web also needs ontologies that support reasoning to the extent of supporting agent-based decision making and knowledge discovery. Building ontologies (in OWL, for example) that allow rich and meaningful reasoning and are commonly reused is a *hard* problem.

And Linked Data *has* taken off, BTW.
Sorry, I'm letting my cynicism show through.  I understand that there has been a lot of progress in exposing RDF.  I will still assert that there is a serious lack of easily accessible and usable applications that actually make use of it.  The only web-based applications that I've become aware of and used are basically just Linked Data browsers (see http://code.google.com/p/tdwg-rdf/wiki/Beginners#0.3.6._Web_interfaces) and they often fail to dereference the source URIs or are off-line.  I may just be missing the "good stuff".  If so, please create links to them in the "Significant documents, tools, and RDF implementations" section of the page I linked above, or send the links to me and I'll post them.  Also, I feel that there must be more useful non-web (i.e. locally installed) client software than I know about and have listed at http://code.google.com/p/tdwg-rdf/wiki/Beginners#0.3.5._Software_tools , so I would love to have additions there.


So I think there's lots of room for pointing out which practices have which advantages, downsides, or consequences in ways that people may not be aware of. In this context, I'd strongly recommend reading the draft W3C Interest Group note on "Mapping and linking life science data using RDF" that I posted earlier:


Quoting from the abstract:  "This W3C Note summarizes emerging practices for creating and publishing healthcare and life sciences data as Linked Data in such a way that they are discoverable and useable by users, Semantic Web agents, and applications." Isn't that much of what we want to accomplish here for biodiversity data?
Yes!  I think I got that email when I was swamped at the end of the (northern hemisphere) fall semester and apparently overlooked it.  I found it near the bottom of my inbox.  I have added a link to it at the bottom of the TG home page http://code.google.com/p/tdwg-rdf/ as well as on the resource pages I listed above.  Thanks for drawing it to our attention again.  I'm going to try to give it a thorough read this weekend.

Steve

Bob Morris

Mar 11, 2012, 10:35:33 PM
to tdwg...@googlegroups.com
In my opinion, no answers to these issues will work in a vacuum.
All computer (and, for that matter, human) information frameworks need
structures and content that facilitate their primary use
cases. As this WG's charge suggests, producing RDF from other forms of
data---especially relational database data---is not technically deep,
but producing \useful/ RDF depends on the use. Some uses will be
immune to "bad" data, some will require bleeding-edge but
existing tools, some will require skilled programmers, and some will not
be possible at all.

Some maybe good news:

The W3C RDB2RDF Working Group (http://www.w3.org/2001/sw/rdb2rdf/) two
weeks ago released a Candidate Recommendation, R2RML: RDB to RDF
Mapping Language (http://www.w3.org/TR/r2rml/).
It, or at least the documents of the Working Group that led to it,
might well inform some of the ways tdwg-rdf should consider dealing
with "bad" data, at least as emitted from legacy RDBs. Perhaps
especially http://www.w3.org/TR/2010/WD-rdb2rdf-ucr-20100608/#uc "Use
Cases" has sections worth reading. FWIW, the W3C LODD IG Note
"Mapping and linking life science data using RDF",
https://docs.google.com/document/d/1XzdsjCfPylcyOoNtDfAgz15HwRdCD-0e0ixh21_U0y0/edit?hl=en_US&pli=1
that Hilmar mentioned cites a 3-year old paper of the rdb2rdf group.
Probably it is appropriate to verify that the Candidate Recommendation
and the LODD Note remain consistent, at least as far as tdwg-rdf
concerns may emerge.
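To make the RDB-to-RDF idea concrete, here is a toy row-to-triples mapping in the spirit of what R2RML describes declaratively. This is not R2RML syntax, and the table, column, and namespace names are invented for illustration:

```python
# Toy row-to-triples mapping in the spirit of RDB2RDF (not R2RML syntax).
# Table, column, and namespace names are invented for illustration.

def row_to_triples(row, base_iri, pk_column, predicate_map):
    """Mint a subject IRI from the primary key and emit one triple per
    mapped column. (Minting IRIs from primary keys is exactly the
    practice questioned later in this thread.)"""
    subject = f"{base_iri}{row[pk_column]}"
    return [
        (subject, predicate, str(row[column]))
        for column, predicate in predicate_map.items()
        if column in row
    ]

row = {"empno": 7369, "ename": "SMITH"}
triples = row_to_triples(
    row,
    base_iri="http://data.example.com/employee/",
    pk_column="empno",
    predicate_map={"ename": "ex:name"},
)
print(triples)
# [('http://data.example.com/employee/7369', 'ex:name', 'SMITH')]
```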

Some OK news:

"Bad" RDF is probably perfectly useful for
-- discovery applications such as provided by Linked Open Data protocols.
-- purely syntactic data integration (i.e. RDF graph aggregation)
-- largely human-centric applications requiring no machine reasoning

Some further opinions interspersed below.

On Thu, Mar 8, 2012 at 12:03 PM, Steve Baskauf
<steve....@vanderbilt.edu> wrote:
> I see your point about HTML.  But I'm still left with several questions.  If
> it is so easy to write clever software to interpret RDF slop, then why
> hasn't the amazing "Semantic Web" come to fruition by now after more than
> a decade of talking about it?

In part because the emphasis has been on the "Web" at least as much as
on the "Semantic". By contrast, the biomedical informatics community
and the military information retrieval community have made vastly more
progress by focusing on knowledge representation more than data
discovery and dissemination. In fairness, both these communities have
always been substantially better funded than almost any other group
interested in Semantic <X>, for any value of <X>.

>  Also, what is the purpose of making the
> distinction in OWL between datatype properties and object properties if it
> doesn't really matter whether metadata providers use strings or URIs with
> particular properties?

Sometimes it \does/ matter, depending on what \other/ restrictions
providers wish to place on their knowledge representation.
http://answers.semanticweb.com/questions/1367/restrictions-on-rdfproperty
has an example. But in OWL, as in many other modeling languages, there
often is not a unique way to "preserve" any particular single kind of
restriction. This point is made very well in the brief
http://www.w3.org/TR/owl2-profiles/#Introduction

> I realize it is difficult in an email to convey
> one's tone, but I'm not trying to be cheeky here.  I actually don't know or
> understand the answers to these questions in the light of what Hilmar has
> just said.  If it doesn't really matter how people write their RDF because
> the software will just fix it all, then I'm wondering whether we actually
> need this group or not.  We could just tell people to do whatever they want
> rather than trying to define best-practices.

Cheer up. The software will "just fix it all" only in the simplest of
cases, useful though they are. There's still plenty of work for
tdwg-rdf to do. It's just not clear what that is yet. :-)

--

Steve Baskauf

Mar 12, 2012, 9:38:22 AM
to tdwg...@googlegroups.com, Bob Morris
On 3/11/2012 9:35 PM, Bob Morris wrote:
> Some maybe good news:
>
> The W3 RDB2RDF Working Group http://www.w3.org/2001/sw/rdb2rdf/ two
> weeks ago released a Candidate Recommendation R2RML: RDB to RDF
> Mapping Language, http://www.w3.org/TR/r2rml/
>
I did not read this recommendation very carefully since it is not really
my "thing", but in skimming it, there was one aspect of it which I found
troubling.

In the examples in section 2 ( http://www.w3.org/TR/r2rml/#overview ),
the IRIs (i.e. the broader term for URIs) are generated using the primary
keys of the database. See "triples map" rule 1 in that section. I was
under the impression that this was not considered a good practice, at
least when the intent is that the identifiers be persistent. I looked
through the TDWG GUID Applicability Statement standard and actually it
doesn't mention the issue of using primary database keys in GUIDs, but
the LSID Applicability Statement (standard at:
http://www.tdwg.org/standards/150/ , pdf viewable in browser at
http://bioimages.vanderbilt.edu/pages/LSID%20AS_2011_01_final.pdf ) does
talk about it in Recommendation 11: "LSID Authorities should not use the
primary key of relational database tables as object identifications.
Providers should create an extra column in the table (or a separate
table) to manage the LSID independently of the primary
key."  Although this advice is given
specifically in the context of LSIDs, I think that the general principle
holds for any identifier that is intended to be stable and persistent.

I suppose perhaps that the intention of the creators of the
Recommendation was to facilitate "quick and dirty" conversion of
metadata from relational databases into RDF triples. But it seems to me
to be a really bad idea to suggest this as a general practice. The
system that is described in the Recommendation facilitates the mapping
of database column headings to well-known predicates, so I don't see why
they can't give examples where the IRIs are created from a mapping of a
column which contains a stable local identifier which is not the primary
key.
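To illustrate the point (this example is not from the Recommendation; the table and column names are invented), an R2RML triples map can mint IRIs from a dedicated stable-identifier column just as easily as from the primary key:

```
@prefix rr:      <http://www.w3.org/ns/r2rml#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<#SpecimenMap>
    rr:logicalTable [ rr:tableName "SPECIMEN" ] ;
    # IRI built from a STABLE_ID column managed independently of the
    # primary key, along the lines of LSID AS Recommendation 11
    rr:subjectMap [ rr:template "http://example.org/specimen/{STABLE_ID}" ] ;
    rr:predicateObjectMap [
        rr:predicate dcterms:identifier ;
        rr:objectMap [ rr:column "STABLE_ID" ]
    ] .
```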

I may just be misunderstanding what they are talking about. If so, I
would welcome clarification from somebody who knows more about it.
Steve

Steve Baskauf

Mar 13, 2012, 10:17:18 PM
to tdwg...@googlegroups.com
Bob Morris wrote:
Although I didn't look extensively, I found nothing in dcterms that
would prevent something from being both an object property and a
datatype property.  More particularly, I find nothing in either
dcterms or rdfs that requires that the class dcterms:Agent and the
class rdfs:Literal cannot intersect.  
From the guide ( http://dublincore.org/documents/2008/01/14/dc-rdf-notes/#sect-3 ) to implementers to the changes introduced with the DCMI Recommendation "Expressing Dublin Core metadata using the Resource Description Framework (RDF)" (i.e. http://dublincore.org/documents/dc-rdf/ ):

"In 'Expressing Simple Dublin Core in RDF/XML', a dc:creator is a name:

<http://www.example.com> dc:creator "John Smith"

In 'Expressing Qualified Dublin Core in RDF/XML', in contrast, a dc:creator is an entity, as in:

<http://www.example.com> dc:creator <http://www.example.org/person32>

... The new RDF encoding specification supports both of these constructs but bases the choice of one form over the other on the range of a property. A property with a 'literal' range will follow the former pattern, while a property with a 'non-literal' range will follow the latter.

In accordance with this approach, the DCMI Usage Board has assigned appropriate ranges to the DCMI properties. A range of "Agent" has been given to dcterms:creator and dcterms:contributor, where 'Agent' is defined as 'A resource that acts or has the power to act'. Similarly, appropriate ranges have been specified for the other DCMI terms. The range 'Literal' applies only to metadata terms which are typically associated with a single value string, such as dcterms:date or dcterms:identifier."

...

I am pretty sure that to reason entirely within rdfs and the things
within dwc and those rdf vocabularies dwc uses, you will have to
promulgate a "best practice" in the form of the addition of an axiom
that asserts this disjunction (or accept the not-formally
contradictory situation that you found).  ...
I think that DCMI has already promulgated a best practice for us.  It is expressed in human-readable form as opposed to machine-readable semantics, but it is a best practice nonetheless.  It is intended to help get rid of the confusion between a person and that person's name, "a pain in the butt for implementors"(1)
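In Turtle, the distinction the DCMI guide draws looks like this (foaf:name is just one plausible way to describe the agent; it is not prescribed by the guide):

```
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# dc:creator: the value is a name string
<http://www.example.com> dc:creator "John Smith" .

# dcterms:creator: the value is an agent (range dcterms:Agent),
# which can then be described with its own triples
<http://www.example.com> dcterms:creator <http://www.example.org/person32> .
<http://www.example.org/person32> foaf:name "John Smith" .
```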

Steve

(1) http://wiki.foaf-project.org/w/UsingDublinCoreCreator

Steve Baskauf

Mar 20, 2012, 1:24:10 PM
to tdwg...@googlegroups.com
Actually, the reason why Darwin-SW ( http://purl.org/dsw/ ) was not
validating as OWL DL was because Protege was inserting unnecessary
(re-)declarations of rdfs:Datatype, owl:ObjectProperty, and
owl:DatatypeProperty, and unnecessarily stating that several properties
were subproperties of owl:topObjectProperty. It had nothing to do with
the ontology itself - the Manchester validator just didn't like those
particular statements for some reason. We have removed the offending
statements and it actually validates just fine as OWL DL at either
http://owl.cs.manchester.ac.uk/validator/ or
http://www.mygrid.org.uk/OWL/Validator .

Steve

On 3/7/2012 10:37 PM, Bob Morris wrote:
>
> It's disarmingly simple to fall into OWL FULL when designing an
> ontology. http://purl.org/dsw/ and
> http://lod.taxonconcept.org/ontology/txn.owl both do, and probably
> neither can enforce a distinction between object properties and
> datatype properties.
>
> Bob
>
>

--

Bob Morris

Mar 21, 2012, 4:35:23 PM
to tdwg...@googlegroups.com

Good news. Now get rid of the unnecessary Functional and InverseFunctional requirements, and I'll be an almost perfectly happy camper when my applications are trying to define mappings into relational databases that happen to be emitting some form of DwC, perhaps DSW.

A good candidate for a best practices topic would be something like "what are the tdwg community use cases requiring InverseFunctional properties?"  (My guess is none.)  More generally, one could consider whether there are use cases for the other three presently defined OWL2 profiles.  (My guess is maybe.)

Hilmar Lapp

Mar 21, 2012, 5:00:26 PM
to tdwg...@googlegroups.com

On Mar 21, 2012, at 4:35 PM, Bob Morris wrote:

A good candidate for a best practices topic would be something like "what are the tdwg community use cases requiring InverseFunctional properties?".  (My guess is none).

Having those would allow us to infer identity of individuals without having to assert it. I already hear you saying that in most cases such an inference would come as an unwanted surprise, but even if they really are unwanted, such inferences might help in identifying erroneous or undesirable data publishing practices.

I mean, once you are in OWL, you might as well give an OWL reasoner a chance to help you. The stronger your assertions, the more a reasoner will be able to do with them (even if sometimes things you didn't want - but you can then use those to improve your ontologies and data).

Steve Baskauf

Mar 22, 2012, 7:03:24 AM
to tdwg...@googlegroups.com
I have started Issue 11 in the TDWG-RDF issue tracker (http://code.google.com/p/tdwg-rdf/issues/detail?id=11 ) on the topic of owl:FunctionalProperty and owl:InverseFunctionalProperty.  Unfortunately, the previous discussion about it is under a subject line that is not directly related to the subject.  Sigh....

Joel and I discussed strategy a little while ago and concluded (I think) that it might be good to come up with something less definite than a "best practices" statement, which could be called a "usage guideline".  The difference might be that a usage guideline is more of a statement of "this is how people are doing something" rather than "this is how you should do something".  It might be possible at some point to "promote" a usage guideline to a best practice if there were sufficient consensus.  But this would allow us to say something concrete about what we have learned about topics that we discuss without actually having to reach a consensus on it. 

Steve

Steve Baskauf

Apr 4, 2012, 1:46:03 PM
to tdwg...@googlegroups.com, gsa...@unb.ca, John Wieczorek
At the iDigBio meeting last week, we took advantage of the convergence
of several task group members as well as John Wieczorek (of the Darwin
Core task group) to discuss an issue
(http://code.google.com/p/tdwg-rdf/issues/detail?id=9 ) which was
brought up in the Task Group survey
(http://code.google.com/p/tdwg-rdf/wiki/Survey ). Specifically, one of
the important issues related to the use of Darwin Core terms as RDF
predicates is how to use DwC terms whose current definition specify that
they should have values which are concatenated and separated text
lists. I have posted notes from the meeting at
http://code.google.com/p/tdwg-rdf/wiki/DwcStringTermsAsRdf and you can
read them for the details.

The bottom line was that the assembled group felt that the existing DwC
terms (e.g. dwc:recordedBy) should be used in accordance with the
current term definition (i.e. not repeated, with a single string literal
value which is a concatenated list) and that it would be best to have a
new term which would be repeatable and designated specifically for use
with a URI (as opposed to literal) object. There were two suggestions
for how to accomplish this:

1. Create a new term in the current DwC namespace
(http://rs.tdwg.org/dwc/terms/) which is a modification of the current
term, e.g. dwc:recordedByURI .

2. Create a term in a new namespace which is understood to contain terms
that are intended for use with URI objects, e.g. dwcuri:recordedBy .

If option 1 were chosen, it would require making the changes through the
existing DwC namespace policy
(http://rs.tdwg.org/dwc/terms/namespace/index.htm) which could be a
lengthy process before the terms were available for use. However, it
would avoid a proliferation of namespaces.

If option 2 were chosen, it would not necessarily require a change to
Darwin Core itself (although maybe it would if the namespace were under
http://rs.tdwg.org/dwc/ e.g. http://rs.tdwg.org/dwc/dwcuri/ ). Using
the Darwin-sw namespace (http://purl.org/dsw/ ) was suggested as a
possibility. (I did not take a position on that suggestion since I've
somewhat recused myself from promotion of DSW in this context. Cam may
want to respond to that suggestion.) An advantage of using a different
namespace is that we could avoid confusing people who are not interested
in RDF by steering them away from the documentation about the use of the
terms in the new namespace (vs. having terms added to the existing quick
reference guide for the regular DwC namespace).
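As a sketch of how the two usages would look side by side (the dwcuri: namespace and the agent URIs here are hypothetical, standing in for whichever option is chosen):

```
@prefix dwc:    <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcuri: <http://rs.tdwg.org/dwc/dwcuri/> .   # hypothetical (option 2)

# current definition: a single concatenated, separated literal
<http://example.org/occ/123> dwc:recordedBy "Smith, J. | Doe, A." .

# proposed: a repeatable term whose objects are URIs
<http://example.org/occ/123> dwcuri:recordedBy <http://example.org/agent/jsmith> ,
                                               <http://example.org/agent/adoe> .
```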

I would appreciate feedback about these options or the issue in
general. You can make them as replies to this email or as comments
under Issue 9 in the Issue Tracker
(http://code.google.com/p/tdwg-rdf/issues/detail?id=9 ).

Steve

joel sachs

Apr 4, 2012, 5:48:25 PM
to tdwg...@googlegroups.com, gsa...@unb.ca, John Wieczorek
Hi Steve and all,

A couple of things:

i. There's no way to prevent terms from being repeated on the web. This
maybe isn't clear with dwc:recordedBy, but consider dwc:associatedMedia.
Many people may contribute, for example, pictures of a specimen. Each
contributor will create a triple of the form:
_:foo dwc:associatedMedia X

So whether the Xs are literals or URIs, there will be repeated elements.

ii. Of the 4 main DwC representations that we talk about - spreadsheets,
rdf, xml, and rdbms - only spreadsheets do not easily permit repeated
elements. Of course, spreadsheets are the most common of the 4, so I
understand why the standard must accommodate them. The current definition
is
"A list (concatenated and separated) of identifiers (publication, global
unique identifier, URI) of media associated with the Occurrence."

Lists are allowed to have a single element, so, as I see it, the current
definition should suffice. But maybe there are cases I'm overlooking.

Basically, I'm wondering: what were the arguments in favor of introducing
new terms?

Thanks -
Joel.

joel sachs

Apr 4, 2012, 5:58:43 PM
to tdwg...@googlegroups.com, gsa...@unb.ca, John Wieczorek
BTW, thank you to Steve, JohnW, JohnD, Paul, and Bob for getting together
to discuss this. I'd love to see these TDWG-RDF SIGs happen more often,
even when Steve and I aren't there. (For example, neither of us will be at
SPNHC, but maybe someone could choose an outstanding issue, facilitate an
informal BoF on the subject, and report back.)

Joel.

On Wed, 4 Apr 2012, Steve Baskauf wrote:

gsallen

Apr 5, 2012, 1:46:38 PM
to TDWG RDF/OWL Task Group
Thank you all very much for following up on this concern, and to Steve
for inviting me to join this group.

Our project has become somewhat bogged down in the mud of insufficient
resources, so it will be some time before it gets anywhere near a
completed RDF structure. I have, therefore, not been following the
developments very closely. I will weigh in since asked though.

Option 1 makes more sense to me, but mostly because of my limited
understanding of RDF. I'm not sure that it would be my first choice
even once I understand it better. There is a growing movement
towards developing name authorities as URIs (orcid.org, for example),
so this may become easier to implement in the coming years. Many of
our collectors happen to be students, and won't register their names,
though, so it will be difficult to capture all names through a
standardised registry, leaving the matter wide open to confusion.
That the dwc namespace policy would have to be rewritten stands
against this option.

Option 2 is probably preferable even if more difficult for the
layperson to interpret. It saves the dwc rewrite, and keeps the
problem firmly in the hands of the wonks who understand this stuff. I
see that as a good thing.

In the discussion note, I see the recommendation of using commas to
separate values in the concatenated lists. Please consider using
semicolons, not commas, to separate values in these lists. The standard
way of including names in any database is "LastName, First(or Inits)", a
format that includes a comma as an intrinsic element. Using a semicolon
also helps prevent potential problems when transferring data in csv
format.
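A made-up example of why the separator matters (the occurrence URI is hypothetical):

```
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .

# comma-separated: ambiguous (two collectors, or four?)
<http://example.org/occ/456> dwc:recordedBy "Allen, Geoffrey, Baskauf, Steven" .

# semicolon-separated: the name boundaries are unambiguous
<http://example.org/occ/456> dwc:recordedBy "Allen, Geoffrey; Baskauf, Steven" .
```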

--------------------------------------------

Geoffrey Allen
Digital Projects Librarian
Electronic Text Centre
Harriet Irving Library
University of New Brunswick
Fredericton, NB E3B 5H5
Tel: (506) 447-3250
Fax: (506) 453-4595
gsa...@unb.ca

Steve Baskauf

Apr 13, 2012, 4:13:57 PM
to tdwg...@googlegroups.com, roge...@mac.com, John Wieczorek, J.Ke...@napier.ac.uk
Joel,
A couple comments about your reply:


joel sachs wrote:
i. There's no way to prevent terms from being repeated on the web. This 
maybe isn't clear with dwc:recordedBy, but consider dwc:associatedMedia. 
Many people may contribute, for example, pictures of a specimen. Each 
contributor will create a triple of the form:
_:foo dwc:associatedMedia X

So whether the Xs are literals or URIs, there will be repeated elements.
  
I think that the issue here is not that we are somehow trying to prevent people from repeating dwc:recordedBy when it is used with literals.  People will do what they want, but I think that it is highly likely that people who currently populate this field in a flat database in accordance with the DwC definition (a single concatenated and separated string) will simply expose that string as a single literal value for dwc:recordedBy.  Whether or not they present a single concatenated literal value or several literal values, an application is still going to be stuck with either just presenting the string(s) for humans to view, or to try to parse out the string(s) and guess what they mean. 
ii. Of the 4 main DwC representations that we talk about - spreadheets, 
rdf, xml, and rdbms - only spreadsheets do not easily permit repeated 
elements. Of course, spreadsheets are the most common of the 4, so I 
understand why the standard must accommodate them. The current definition 
is
"A list (concatenated and separated) of identifiers (publication, global 
unique identifier, URI) of media associated with the Occurrence."

Lists are allowed to have a single element, so, as I see it, the current 
definition should suffice. But maybe there are cases I'm overlooking.

Basically, I'm wondering: what were the arguments in favor of introducing 
new terms?
  
I think that the primary issue here is not so much repeatability, but rather a desire to indicate (through use of a URI-specific term) that it is reasonable to expect that the object of the term is a URI which represents a single resource and that the object could be investigated to determine its properties.

If programmers write RDF-based applications which intend to make use of the existing dwc:associatedMedia term, they should assume that it is likely that a provider will provide a concatenated string list (since that is what the definition suggests), but also that the provider might provide multiple values which are individual literals.  If we do not create separate terms intended for use with URIs, then the programmer also must consider that the term might be repeated and have URI-reference objects.  One could argue that this is not a problem and that we just need to have clever programmers that can write applications that are able to do a significant amount of parsing and guessing about what people mean when they provide literal values.

But it seems to me that the ultimate direction that we want to go with RDF is clarity and simplicity.  For a term like associatedMedia, when used in RDF we really would rather not have strings at all since the term is intrinsically designed to make connections between two resources.  We currently allow concatenated string lists with dwc:associatedMedia as a concession to the fact that there are a lot of people who are using spreadsheets and can't handle URI references.  But in the end, we want something better than that.

I think that the approach that we are talking about here is similar to that which was taken with the Taxon Concept Schema (TCS) standard (http://www.tdwg.org/standards/117/ viewable directly in a browser at http://bioimages.vanderbilt.edu/pages/TCS-Schema-UserGuide-v1.3.pdf) and its RDF representation, the  Taxon Concept Ontology (http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonConcept.owl).  In section 12 of the TCS User Guide, there is a recognition that scientific name strings and AccordingTo publication strings may be the only keys available for identifying taxon concepts at the present time.  "Firstly the scientific name needs to be resolved and then the 'AccordingTo' authors and publication need to be resolved.  Neither of these is trivial and may be difficult to automate."  If centrally issued GUIDs become available, the problem becomes "technically simple".  Consequently, the Taxon Concept Ontology has a datatype property, tc:nameString, "A string representation of the TaxonName for this concept." and an object property, tc:hasName, which can have a URI object.  Presumably, tc:nameString would be used to identify taxon names in the absence of GUIDs or alongside the GUID if the data provider wanted to make the name string available.  But tc:hasName would be used to refer to a taxon name if it were identified as part of a centralized repository.  There are corresponding terms for the AccordingTo references: a datatype property tc:accordingToString and an object property tc:accordingTo presumably for the same reason.

What we would facilitate by having both dwc:associatedMedia and dwc:associatedMediaURI would be a way for existing (possibly concatenated) string references (which may or may not be URIs) to be exposed immediately, while providing a way for "clean" URI references to replace or be provided in addition to the string representations at some point in the future. 

I am going to copy Jessie Kennedy and Roger Hyam on this email in case they want to correct any misreading of the intent of TCS on my part.  They aren't on the TDWG RDF email list at this point, but if they want to reply to me, I will post their response to the list.

Steve

Nico Franz

Apr 13, 2012, 4:26:36 PM
to tdwg...@googlegroups.com, roge...@mac.com, John Wieczorek, J.Ke...@napier.ac.uk
Dear Steve:

   I believe your reading of the TCS is correct; it was designed with the expectation that eventually, resolution based on name strings would evolve into resolution based on more complex pieces of information (concepts). Perhaps this article is also relevant:

http://www.jbiomedsem.com/content/2/1/7

Best,

Nico


Nico M. Franz, Ph.D.
Associate Professor & Curator of Insects
School of Life Sciences
PO Box 874501
Arizona State University
Tempe, AZ 85287-4501

Office: (480) 965-2036
Collection: (480) 965-2850
Fax: (480) 965-6899
E-mail: nico.franz @ asu.edu

Franz Lab: http://franz.lab.asu.edu/
ASUHIC: http://hasbrouck.asu.edu/symbiota/portal/index.php

Steve Baskauf

Apr 19, 2012, 2:59:16 PM
to TDWG-RDF TG
I'm forwarding this response which got bounced from the list since Jessie isn't on the list.
Steve

-------- Original Message --------
Subject: RE: [tdwg-rdf: 44] Issue 9 Repeatable properties in lieu of properties that specify concatenated lists (in DwC)
Date: Thu, 19 Apr 2012 10:39:12 -0500
From: Kennedy, Jessie <J.Ke...@napier.ac.uk>
To: Baskauf, Steven James <steve....@Vanderbilt.Edu>, tdwg...@googlegroups.com <tdwg...@googlegroups.com>
CC: roge...@mac.com <roge...@mac.com>, John Wieczorek <tu...@berkeley.edu>


Hi Steve,

 

Regarding your comment on the intent of TCS – you are correct – TCS was not meant as a definition where everything must be complete, but rather to allow people to record what they did know for legacy data and, if they had good new data, to capture that too.

 

TCS was developed to serve the wide ranging interpretations of how one might describe a Taxon.

We started with the quite strict interpretation from our own work on the Prometheus model (published in Taxon) and then when I worked on the SEEK project and looked at how ecologists described taxa – it was very different. I then worked with TDWG to understand the differing perspectives of taxa across the community. In order to deal with new and legacy data it was almost impossible to specify what a taxon must have, but we wanted to allow people to capture what they could provide, believing that if GUIDs were created the concepts could be improved over time as they were required for research and thereby it was worth investing the effort. This didn’t necessarily imply that we would have one GUID for a given taxon concept which others would come along and edit but it would allow people to capture their meaning and part of defining any new concept would be to relate their concept to other existing ones – the thought being that the author of the concept could decide whether what he/she meant was congruent, included, overlapped etc with other described concepts. Slowly but surely the network of GUIDs and concepts would grow with the more important ones being sorted out first.

In the ideal world we would have the full description of all specimens, all characteristics of those specimens, the defining characteristics, the basionym etc all defined but we need to be realistic to start, so tying the taxon to at least a physical description would be something to determine meaning from.

 

It seems we haven’t really committed to this approach enough yet – but I still think it’s the way to go.

 

Hope this helps,

 

Jessie



Paul Murray

Apr 23, 2012, 1:00:11 AM
to tdwg...@googlegroups.com
Tossing up some (fairly low-level) ideas.

In RDF/XML, is it possible to generate a triple that has a literal as the subject? I'd like to do this to support indexed case-insensitive searching in SPARQL:

:someTaxonName :genusPart "Doodia" .
"Doodia" :toUpper "DOODIA" .

The same thing can be done just by having another property on the taxon name object, but that would mean duplicate properties for hasEpithet, hasAuthor, etc etc. And quite a lot of duplicate properties, as I'd like to do this to support Tony Rees' taxamatch algorithm (which involves a static transformation of scientific names), and stripping diacritics from author names to make searching easier there.

The other reason is a gut-feel design decision: the transformation to uppercase is not something we are saying about the taxon name object, it is about the string "Doodia" regardless of where it might be used.

Having looked into it a little further, I suspect that it just isn't valid, although at a graph level it would be perfectly ok. Protege won't accept the triples above, let alone any RDF/XML. I think I'm going to have to go with a "MatchingStrings" object and a "genusPartStrings" property … no, that won't do. There will be an enormous number of duplicates because each blank node has a distinct identity.

A solution that would work is a MatchingStrings object that has a URI that is a hashed version of the string.

Thus:

:someTaxonName :genusPartStrings strings:8787785 .
strings:8787785 rdf:value "Doodia" .
strings:8787785 :toUpper "DOODIA" .
strings:8787785 :toTaxamatch "DDA" .

and so on. The generated RDF might contain duplicates, but there won't be redundant anonymous blank nodes created - just repeated property assertions which are not stored in the graph as repeats. At least, they oughtn't be.

An alternative to using a hash is simply to use the (URL-encoded) string itself as the URI. You could even get away with making up a URI scheme "literal", to discourage RDF engines from fetching them.

:someTaxonName :genusPart "Doodia" .
:someTaxonName :genusPartStrings <literal:Doodia> .
<literal:Doodia> rdf:value "Doodia" .
<literal:Doodia> :toAscii "Doodia" .
<literal:Doodia> :toUpperAscii "DOODIA" .
<literal:Doodia> :toUpper "DOODIA" .
<literal:Doodia> :toTaxamatch "DDA" .

This has a number of advantages, transparency being the big one. The disadvantage is that you get long URIs, but not if you are only doing this for epithets and authority strings.

I'm not even sure that discouraging RDF engines from snarfing them is a useful goal - why not simply use http://mydomain/strings/ as the root for all of these objects? The webserver there would not even need to store the strings, it could simply generate the RDF based on the requested url. The SPARQL server needs to have the strings in its dataset, but it only needs the ones we actually want to search on - scientific name parts and authority strings.

So, I would define an RDF class Strings, or Mappings, or Conversions, to be used in this manner, and predicates for the various conversions (rdf:value for the plain text). Importantly, I *can* make use of this without having to define a :genusPartStrings predicate. In SPARQL, the query is simply

?strings :toUpper "DOODIA". # the search
?strings rdf:value ?verbatim . # the string that matched the search
?taxonName :genusPart ?verbatim # the taxon name object whose genus part is "Doodia"

*that* works just fine, and means we don't have to clutter up the space with a swag of duplicate properties. The key is using a URL rather than blank nodes so that we don't have redundant objects in the dataset, and the fact that IRIs will accommodate pretty much anything if it's URL encoded.
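Putting the pieces together with http URIs rather than a made-up scheme, the data for the query above might look like this (the namespaces and the strings/ path are placeholders):

```
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix :    <http://example.org/terms#> .

:someTaxonName :genusPart        "Doodia" ;
               :genusPartStrings <http://example.org/strings/Doodia> .

<http://example.org/strings/Doodia>
    rdf:value "Doodia" ;      # the verbatim string
    :toUpper  "DOODIA" ;      # searchable transformation
    :toTaxamatch "DDA" .      # another static transformation
```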


Steve Baskauf

Jun 9, 2012, 10:54:41 AM
to tdwg...@googlegroups.com
I started writing this email over two months ago and got sidetracked.  What I wanted to do was to try to initiate a conversation about the pros (if any) and cons of declaring properties defined in OWL to be functional or inverse functional.  I have created Issue 11 in the TDWG-RDF issue tracker (code.google.com/p/tdwg-rdf/issues/detail?id=11 ) to address this issue.  If you wish to state a well-composed response to the issue, I encourage you to enter it as a comment on the issue tracker and then send a note to the list saying that you have done so.  This will create a more easily accessed record than replies on the list which would need to be dug out of the email archive. 

I'm going to give an example of the type of property which I think Bob objected to, state the reason why Cam and I declared it to be functional, and then guess why I think Bob didn't like that.  As it is always dangerous to try to read Bob's mind, he will hopefully correct errors in my guess.

In the OWL definition of Darwin-SW (viewable at http://code.google.com/p/darwin-sw/source/browse/trunk/dsw.owl ) the property dsw:atEvent which has the range dwc:Event and domain dwc:Occurrence is declared to be functional.  The rationale for this is based on the idea that there is a one-to-many relationship between an event and occurrences at that event.  See http://code.google.com/p/darwin-sw/wiki/RelationshipToExistingModels "Darwin Core (DwC) Standard, 2009" diagrams where "crows-feet" or triangles are used to indicate one-to-many relationships.  Because (in this model) each occurrence is associated with one event, dsw:atEvent is declared to be functional (see http://www.w3.org/TR/owl-ref/#FunctionalProperty-def ). 

The problem here (I think) is similar to the problem with domain and range declarations: declaring a property to be functional does not somehow prevent a user from assigning more than one object resource to the same subject resource.  Rather, if two users assign different object resources to the same subject via that property, a reasoner would infer that the two object resources are the same.

In the example of http://bioimages.vanderbilt.edu/baskauf/11713.rdf , the occurrence identified by the URI http://bioimages.vanderbilt.edu/baskauf/11713#occ (documented by an image of a leaf) is connected by dsw:atEvent to the event identified by the URI http://bioimages.vanderbilt.edu/baskauf/11713#eve .  Now let's say that someone collected the leaf and the twig that bore it and preserved them as a specimen.  In the interest of reusing identifiers, the herbarium curator decided to reuse the occurrence URI.  However, rather than considering the event to occur during a very short interval of time (2002-06-02T10:14:03), the curator may want to associate the occurrence (using dsw:atEvent) with an event that was a bout of collecting spanning an entire week (e.g. a date range of 2002-06-01 through 2002-06-07), represented by some other URI "uri1".

Doing this would imply to a reasoner that the event identified by uri1 is owl:sameAs the event identified by http://bioimages.vanderbilt.edu/baskauf/11713#eve , and would subsequently allow the reasoner to create inferred triples such that every statement made with uri1 as the subject would also be made with http://bioimages.vanderbilt.edu/baskauf/11713#eve as the subject, and vice versa.  This would be bad because the two events really aren't the same - they have radically different time ranges and meanings to the users who conceived them.
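For the record, the inference described here can be sketched in Turtle.  The dsw: prefix IRI and the URI <http://example.org/uri1> (standing in for "uri1") are illustrative, not normative:

```turtle
@prefix dsw: <http://purl.org/dsw/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

dsw:atEvent a owl:FunctionalProperty .

# Two providers attach different events to the same occurrence:
<http://bioimages.vanderbilt.edu/baskauf/11713#occ>
    dsw:atEvent <http://bioimages.vanderbilt.edu/baskauf/11713#eve> ,
                <http://example.org/uri1> .

# Because dsw:atEvent is functional, an OWL reasoner may infer:
# <http://bioimages.vanderbilt.edu/baskauf/11713#eve>
#     owl:sameAs <http://example.org/uri1> .
```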

Have I correctly identified the objection, Bob?

Steve

Bob Morris

Jun 10, 2012, 7:33:23 PM
to tdwg...@googlegroups.com
You correctly read my mind in part. I do not like prematurely closing
the world to no particular effect. However, one might argue that this
is a design philosophy, and in the discussion thread you cite in
Issue 11, Hilmar Lapp expresses the opposite philosophy. All we can
do is agree to disagree about that. But there are worse problems,
namely about tractability, a brief discussion of which I have added to
Issue 11.

Steve Baskauf

Jun 11, 2012, 8:35:48 AM
to tdwg...@googlegroups.com
Bob,
I've posted a reply in the issue tracker for the record.  The text of the reply is
------------------------------
Thank you for your comments and for looking carefully at darwin-sw.  I'm a bit puzzled about it not validating as OWL DL, because we did get it to validate as DL when we posted the last version, although that was months ago now and I no longer remember the details.  Nevertheless, your point about Functional and InverseFunctional properties is well taken.  We had not put this level of thought into the implications of assigning those properties, but rather included them because we thought they modeled the relationships as we saw them.  So it might be going too far to say that it was part of a design philosophy, and it might be better if we took them out. 

However, if the Functional/InverseFunctional properties were not there, what problems would there be with declaring pairs of properties to be inverse?  It would result in a machine inferring a triple stating the inverse relationship (i.e. A dsw:atEvent B implies B dsw:eventOf A), but that is actually exactly what we want to happen.  In the example I gave at http://bioimages.vanderbilt.edu/baskauf/11713.rdf I declare dsw:atEvent and dsw:locatedAt relationships, but if those triples were loaded in a store, I would like for the manager of the triplestore to be able to infer the inverse relationships that I did not state explicitly.  For example, as that RDF currently stands, a query asking what events occurred at that location using

<http://bioimages.vanderbilt.edu/baskauf/11713#loc> dsw:locates ?event

would not pick up the relationship between http://bioimages.vanderbilt.edu/baskauf/11713#loc and http://bioimages.vanderbilt.edu/baskauf/11713#eve as it is currently described using dsw:locatedAt without the two properties dsw:locatedAt and dsw:locates being declared as inverse properties. 

I suppose that an alternative would be to simply provide only one object property to connect each class rather than pairs of inverse properties and that would force people to write the triples in only one direction.  But that would impose more constraints on how people had to write their RDF.  It seems to me that it would be better to impose the burden on the data consumers (i.e. to infer the inverse relationships) rather than to impose the burden on the data providers (i.e. to express the relationships in only one way) because it seems more likely that the data consumers are generally going to be more savvy in doing their jobs than the providers.  Besides, the data consumers are going to have to do a lot of other processing tasks to clean up the data anyway. 
-----------------------------
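The inverse-property inference Steve wants consumers to be able to make can be sketched in Turtle (the dsw: prefix IRI here is illustrative):

```turtle
@prefix dsw: <http://purl.org/dsw/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

dsw:locates owl:inverseOf dsw:locatedAt .

# Stated by the data provider:
<http://bioimages.vanderbilt.edu/baskauf/11713#eve>
    dsw:locatedAt <http://bioimages.vanderbilt.edu/baskauf/11713#loc> .

# Inferable by the consumer's reasoner:
# <http://bioimages.vanderbilt.edu/baskauf/11713#loc>
#     dsw:locates <http://bioimages.vanderbilt.edu/baskauf/11713#eve> .
```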


Paul Murray

Jun 12, 2012, 9:25:33 PM
to tdwg...@googlegroups.com

On 10/06/2012, at 12:54 AM, Steve Baskauf wrote:

> The problem here (I think) is similar to the problem with domain and range declarations: declaring a property to be functional does not somehow prevent a user from assigning more than one object resource to the same subject resource. Rather, if two users assign different object resources to the same subject via that property, a reasoner would infer that the two object resources are the same.

I suspect the problem with atEvent isn't the particular *way* that OWL deals with functional properties; it's that making this property functional doesn't reflect how occurrences and events actually relate. At a guess, the issue is events that contain sub-events. An occurrence at event "collecting from this particular tree" is also an occurrence at "collecting at this particular gully".

One way to deal with this type of thing is to have an atEvent that specifically means "most granular, atomic event", and a different predicate - inEvent? - for non-exclusive inclusion. Thus

* atEvent is functional
* atEvent is a subproperty of inEvent

And this could be a general pattern for predicates that may or may not be vague: atLocation/inLocation, atTime/inTimeRange and so on.

With respect to the idea of an event being event:partOf another event (or multiple ones)

* partOf is transitive and reflexive (but not symmetric)
* the property chain (inEvent,partOf) is a subproperty of inEvent

It would be nice also to have

* the property chain (inverse(atEvent), inEvent) is a subproperty of partOf

The effect would be that OWL could deduce that, if an occurrence is both specifically at "collecting from this tree" and generally in "collecting in this gully", then the tree-collecting event is part of the gully-collecting event. But this will break compliance with OWL DL, sadly.
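Paul's pattern can be written down in OWL 2 Turtle.  The :example namespace and the exact axiom spelling below are my own sketch of his bullet points, not something taken from darwin-sw:

```turtle
@prefix :     <http://example.org/pattern#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:atEvent a owl:FunctionalProperty ;
    rdfs:subPropertyOf :inEvent .

:partOf a owl:TransitiveProperty , owl:ReflexiveProperty .

# (inEvent o partOf) is a subproperty of inEvent:
:inEvent owl:propertyChainAxiom ( :inEvent :partOf ) .

# The chain Paul notes would break OWL DL compliance:
# (inverse(atEvent) o inEvent) as a subproperty of partOf
:partOf owl:propertyChainAxiom ( [ owl:inverseOf :atEvent ] :inEvent ) .
```

In the OWL 2 RDF mapping, the subject of owl:propertyChainAxiom is the superproperty and the list members are composed in order.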

Bob Morris

Jun 12, 2012, 11:52:27 PM
to tdwg...@googlegroups.com
I've added to the discussion in the issue tracker with this text:

Apology and partial retraction of
http://code.google.com/p/tdwg-rdf/issues/detail?id=11#c2

Ah, darwin-sw v. 0.2.1 is indeed OWL DL, and is indeed declared such
by the Manchester Validator. My error is that InverseFunctional is
forbidden in DL only for Data properties, not for Object properties,
which is all that you apply it to. Thus my objection on tractability
grounds is wrong, and I retract all of the paragraph beginning with
"BUT". I still hold to the position that a less restrictive ontology,
perhaps with best practices implemented in rule languages, is more
useful than an ontology that enforces those practices.

Separately, I think that in general one should also consider which of
the OWL2 standard profiles an ontology fits (See
http://owl.cs.manchester.ac.uk/validator/) and inquire whether there
is a case for each. See especially
http://www.w3.org/2007/OWL/wiki/Profiles. For example, OWL2 EL is said
to be especially useful when there is a large number of classes and/or
properties. This might apply to those who wish to model every species
as a class, as has recently come up in the discussion list.


Hilmar Lapp

Jun 13, 2012, 11:02:42 AM
to tdwg...@googlegroups.com
One of the fastest reasoners so far (and for large datasets like the one we have in Phenoscape, the only really viable one) has been ELK, which is limited to OWL-EL. There's a tool published by R. Hoehndorf, called Elvira (if I recall that correctly from memory), that transforms DL into EL, obviously with possible loss of axioms that can't be recast in EL. (His paper explains this.)

-hilmar


Steve Baskauf

Jun 24, 2012, 6:13:28 PM
to tdwg...@googlegroups.com, John Wieczorek
Several months ago we discussed the issue of extending Darwin Core to provide terms that were designed specifically to be used in RDF and to have URI-referenced objects as opposed to literal objects which may or may not consist of concatenated lists (Issue 9 on the issue tracker: http://code.google.com/p/tdwg-rdf/issues/detail?id=9 ).  See http://code.google.com/p/tdwg-rdf/wiki/DwcStringTermsAsRdf for meeting notes.

To break the issue down, there really were two components related to the use of DwC terms as RDF: the issue of creating terms where there was an expectation that the object of a triple would be identified by a URI, and the issue of creating clarity about whether one should expect a property having a literal object to have a value that is a concatenated, separated list of entities, or whether one should expect the term to be repeated with each object a string representing a single entity.  In our discussion, we spent a considerable amount of time talking about the latter issue, but in retrospect I think that the former is the one that is really the most relevant. 

There were two options that were put on the table (see the forwarded message below).  One was to create new terms within the current dwc: namespace which were designated specifically for use with URI objects, e.g. dwc:recordedByURI as the URI version of dwc:recordedBy.  The other was to create a new namespace designated for use with URI objects which used the same term names, e.g. dwcuri:recordedBy.  At the time, there didn't seem to be compelling reasons why one should be chosen over the other.  However, having had a lot of time since then to reflect on the issue, I have become convinced that the second approach is the best one to follow.  I believe this because I think that the second approach can be part of a slightly broader effort that could resolve several other issues that we have identified as problems with the use of DwC terms as RDF (namely: no RDF guide for DwC, problems with Dublin Core terms imported into DwC, and lack of clarity about use of the "ID" terms).  I will outline my reasoning below.

From the standpoint of implementation, one could establish the new terms through promulgation of a rule rather than proposing a long list of term additions.  The rule could be relatively simple, something like: any term in the dwc: namespace for which it is appropriate to have a non-literal object will have a term in the dwcuri: namespace which consists of the same local name (sensu http://www.w3.org/TR/swbp-vocab-pub/#naming ) as the dwc: term but which is appended to the dwcuri: namespace string.  If new terms are subsequently added to the dwc: namespace or if the definitions of terms in the dwc: namespace are changed in the future, an addition or change involving the corresponding term in the dwcuri: namespace would occur.  The determination of which terms would appropriately exist in the new namespace could be made by this group at the time the new namespace was created.  At the time of subsequent term additions, the appropriateness of a corresponding dwcuri: term would be included as a part of the discussion. 

At the time of the adoption of the new namespace dwcuri:, it would be appropriate to develop a Darwin Core RDF guide similar to the existing text (http://rs.tdwg.org/dwc/terms/guides/text/index.htm ) and XML (http://rs.tdwg.org/dwc/terms/guides/xml/index.htm ) guides.  I think that this guide could be relatively simple since general guidelines for the use of RDF are already well established.  In addition to providing a list of terms considered appropriate to be included in the new dwcuri: namespace, there would be a few additional guidelines:

1. The terms in the existing namespace (dwc:) are assumed to have literal objects.  This would allow for relatively straightforward mapping of existing databases using DwC to RDF using one of the tools specifically designed to do this (e.g. those listed at http://code.google.com/p/tdwg-rdf/wiki/Beginners5RDFdata#5.3.2.1._Serving_data_from_an_existing_relational_database_as_RD see also section 2 of https://docs.google.com/a/nescent.org/document/d/1XzdsjCfPylcyOoNtDfAgz15HwRdCD-0e0ixh21_U0y0/edit?hl=en_US ).  Applications whose developers wish to display DwC metadata properties for humans could do so easily by simply displaying the literal string.  Applications which would like to attempt to parse out concatenated and separated lists and do something with them would be on their own just as they would be under the existing system where output is generated from a non-RDF database.  We simply would not get into the issue of whether it was appropriate to repeat dwc: properties rather than providing a single list under one property.  If providers want to parse out the entity strings from the concatenated, separated list before creating the RDF and then repeat the properties with one entity per property, good for them - consumers would be on their own to figure out what to do with that as well. 

2. Terms in the new namespace would be assumed to be repeatable.  They would not be assumed to any structure of the sort listed for some of the dwc: terms (e.g. http://rs.tdwg.org/dwc/terms/index.htm#recordedBy where the first listed agent is primarily responsible).  If the provider desires such structure, it should be created by having the URI represent an entity whose structure is defined using typical methods in RDF (e.g. foaf:Group instances having multiple foaf:member properties, see http://xmlns.com/foaf/spec/#term_Group) but not RDF containers (http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#containers) such as Bag, Seq, and Alt which are not in widespread use. 
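The FOAF approach suggested in point 2 can be sketched in Turtle.  The dwcuri: prefix is bound here to a placeholder IRI, and all example.org URIs are invented for illustration:

```turtle
@prefix dwcuri: <http://rs.tdwg.org/dwc/rdf/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .

# A hypothetical occurrence recorded by a team, with the team's
# structure expressed through FOAF rather than an RDF container:
<http://example.org/occ/123> dwcuri:recordedBy <http://example.org/team/7> .

<http://example.org/team/7> a foaf:Group ;
    foaf:member <http://example.org/agent/jane> ,
                <http://example.org/agent/juan> .
```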

3. The potentially problematic issues involved with the inclusion as a part of DwC of Dublin Core terms in the dcterms: namespace having non-literal ranges (see Issue 2 http://code.google.com/p/tdwg-rdf/issues/detail?id=2 and http://code.google.com/p/tdwg-rdf/wiki/DublinCore#2.3._Possible_courses_of_action_where_TDWG_vocabularies_suggest ) could be solved by adding the corresponding terms in the dc: namespace to the "regular" literal Darwin Core namespace (dwc:) list and then moving the non-literal range dcterms: versions of the terms to the list of terms suitable for use with URI objects.  For example, dc:language would be listed with the "regular" dwc: terms and dcterms:language would be listed on the URI list.  There will undoubtedly be legacy providers who will continue to supply literal values for the non-literal (dcterms:) terms rather than using the corresponding dc: terms.  Developers of consuming applications should be advised to be on the lookout for that.  Data aggregators should be requested to map the inappropriate dcterms: properties to the corresponding dc: properties as a part of their data cleaning - this would be appropriate because all dcterms: terms which have corresponding dc: terms are defined to be subproperties of the corresponding dc: terms, so such triples can be inferred by a reasoner anyway. 

4. The "ID" terms (e.g. dwc:occurrenceID, dwc:eventID, etc.) would be declared to be inappropriate for use in RDF.  Here's why: the URI of the subject resource would be specified in its rdf:about attribute, not as a property.  A string representation of the identifier itself should be supplied as the value of a dcterms:identifier property of the subject resource.  The ID terms would NOT be appropriate as properties for specifying relationships of a subject resource to an object resource of another class (i.e. as an object property sensu OWL) since it is not clear that this is their intended use (see http://rs.tdwg.org/dwc/terms/guides/xml/index.htm#classes where ID terms are used to represent both subjects and objects and http://code.google.com/p/tdwg-rdf/wiki/DublinCore#1.3.2.4._dcterms:identifier_%28AC%29 for discussion) and because they are defined in RDF as subproperties of dcterms:identifier, which has a literal range, and thus can be reasoned to have literal ranges themselves.  This leaves us with the problem of a lack of terms for expressing the relationships between resources of different classes (i.e. object properties), but that is really a separate and more complicated issue that should be addressed separately, probably in conjunction with other groups outside of TDWG, e.g. the Semantics of Biodiversity workshop, the Genomic Standards Consortium (GSC), The Ontology for Biomedical Investigations (OBI), etc.
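Point 4 can be illustrated with a small Turtle fragment; whether the dcterms:identifier string equals the URI itself or is some other identifier string is up to the provider:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# The occurrence's URI simply *is* the subject of the triples (the
# rdf:about attribute in RDF/XML); a string form of the identifier
# is supplied with dcterms:identifier, not with dwc:occurrenceID.
<http://bioimages.vanderbilt.edu/baskauf/11713#occ>
    dcterms:identifier "http://bioimages.vanderbilt.edu/baskauf/11713#occ" .
```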

As was noted earlier in the discussion, taking the "separate namespace" approach would not be likely to confuse DwC users who are unfamiliar with RDF - they would simply ignore the RDF guidelines and new namespace, and keep doing what they are doing now with the "old" DwC terms.  On the other hand, data providers who want to provide richer content and less ambiguity by using more fully described URI-referenced objects will now be able to do so with clarity.  There is already precedent for creating namespaces other than http://rs.tdwg.org/dwc/terms/ within the root DwC namespace (http://rs.tdwg.org/dwc/) to achieve certain goals, e.g. the DwC type vocabulary namespace (http://rs.tdwg.org/dwc/dwctype/) and the DwC attributes namespace (http://rs.tdwg.org/dwc/terms/attributes/).  So what I'm proposing here would be a reasonable extension to DwC in that spirit (perhaps dwcrdf: = http://rs.tdwg.org/dwc/rdf/).  As far as the normative RDF definitions of the new terms are concerned, I think that following the precedent of the current DwC terms (i.e. no ranges and domains and rather uncomplicated defining properties) would be advisable, so creating a document that can provide RDF when the terms are dereferenced should be fairly straightforward. 

In the previous email discussion, I don't remember actually hearing any objection to making moves in this direction (just lack of certainty about what the moves should be), and at the physical meeting of interested parties at iDigBio I felt that there was quite a bit of support for the idea.  So unless I hear objections to what I've proposed, I would like to move forward with this.  I would like at least one other group member to volunteer to work with me on coming up with a draft proposal in the next several weeks.  I'm willing to do the heavy lifting of creating the draft itself if one or more persons will provide feedback, proofread, etc.  We can then bring it back to the group for comments and changes.  If there is consensus satisfaction with the result, we could then as a task group submit it as an issue on the DwC issue tracker for official consideration as an addition to the Darwin Core standard.  I believe that proposals such as this are one of the outcomes that people were wanting from the task group, so I think that if we as a group make a recommendation there will be little controversy in accepting the proposal. 

If you are interested in wo