Representing RDF data as cnt:chars is a bad idea

Vladimir Alexiev

Feb 8, 2012, 3:27:35 AM

to oac-d...@googlegroups.com

Sec 3.9.1 Inline Data Annotations says: "possible to include RDF using various techniques, including Named Graphs and reification. For consistency, we recommend the use of ContentAsText unless there is a requirement or use case that other techniques can solve more easily"

I think it's a very bad idea to embed RDF data in an RDF graph as anything but triples, i.e. in any encoded representation.

- This creates more work to parse and decode it
- More importantly, it defeats semantic repository indexes and query optimizations. You can do nothing with the data before you decode it: cannot search, filter, count...

I propose to change this text to: "It is recommended not to use this representation for RDF data, since it makes the data opaque to semantic repository storage and querying mechanisms. For RDF you should use application-specific properties, or techniques such as Named Graphs and reification".

And add this to the end of 3.7.3.1 Inline Constraints: "You should not use this technique for representing RDF data. See the next section for a better technique"
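
To make the opacity argument concrete, here is a minimal Python sketch. It is not taken from the OAC spec: the in-memory triple set, the `match()` helper and all `ex:` identifiers are assumptions made purely for illustration. It shows that a triple pattern finds a statement kept as triples, but finds nothing when the same statement is packed into a cnt:chars literal.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# The match() helper and the ex: identifiers are hypothetical.

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Body RDF kept as plain triples.
as_triples = {
    ("ex:anno1", "oac:hasBody", "ex:body1"),
    ("ex:body1", "foaf:depicts", "ex:Moon"),
    ("ex:body1", "ex:shape", "ex:Crescent"),
}

# The same body RDF packed into a single cnt:chars literal.
as_chars = {
    ("ex:anno1", "oac:hasBody", "ex:body1"),
    ("ex:body1", "cnt:chars",
     "ex:body1 foaf:depicts ex:Moon ; ex:shape ex:Crescent ."),
}

depicts_moon_triples = match(as_triples, p="foaf:depicts", o="ex:Moon")
depicts_moon_chars = match(as_chars, p="foaf:depicts", o="ex:Moon")

print(len(depicts_moon_triples))  # the pattern sees the statement
print(len(depicts_moon_chars))    # the pattern cannot see inside the literal
```

A real repository would use SPARQL rather than this helper, but the effect is the same: the literal has to be decoded and loaded before any pattern can reach its contents.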

Robert Sanderson

Feb 8, 2012, 4:17:03 PM

to oac-d...@googlegroups.com

Hi Vladimir,

The reasons we had for suggesting this approach are as follows:

* We would have MUCH preferred to use a named graph approach, but our
understanding is that TriG is not widely supported. As the TriG
constructions are part of the base Annotation document, it would mean
that the serialization would not be parsable at all, and we strongly
wish to avoid that situation. If this is not the case, please let us
know so we can revise things.

* Reification is possible, but it is our understanding that it has
fallen out of favor. It also does not scale well when you have many
triples to reify rather than just a single statement.

* The vast majority of annotation consuming systems will need to dig
deeper into the Body resource anyway. For plain text you would want
to enable keyword searching, which requires parsing (into words) and
storing (into a keyword database). For XML it requires parsing, and
storing. For external resources in RDF it requires retrieving,
parsing and storing. So overall it's a very minimal condition when
the resource is in RDF and embedded within the annotation, and having
a special case for it seemed overkill.

* Consistency is very important in a standard. By not introducing a
special case, all processors have the same (minimal) algorithm to deal
with annotation bodies:
- If the mime-type is given, and not understandable, do nothing
- If the resource is not embedded, retrieve it.
- Determine the correct parser to use based on the format of the body
- Parse the body
- Act on the parsed information
And this follows for RDF, XML, JSON, CSV, or whatever other formats are used.
If there were a special case for RDF, then that needs to be checked
for every time by every processor.

* Provenance is very important to maintain. If the triples of the
Body end up directly in the Annotation graph, then they would be
assigned the same creator (and other metadata) as the Annotation. For
many use cases, it is extremely important that the agent asserting the
statements is correctly recorded.
Imagine someone layering an annotation interface over a third party
RDF data source. Suddenly, the annotation interface becomes the
asserter for the third party data, rather than the real agent.
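
The uniform body-processing steps listed above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `PARSERS` registry, the dict keys `format`, `chars` and `uri`, and the caller-supplied `fetch` function are all hypothetical, and a missing MIME type is simply treated like an unknown one.

```python
import csv
import io
import json

# Hypothetical parser registry keyed by MIME type. A real processor
# would register RDF, XML and other parsers in the same way.
PARSERS = {
    "application/json": json.loads,
    "text/csv": lambda text: list(csv.reader(io.StringIO(text))),
    "text/plain": str.split,  # split into words, e.g. for keyword indexing
}

def process_body(body, fetch):
    """Run the same steps for every body, whatever its format.

    `body` is a plain dict with hypothetical keys: "format" (MIME type),
    "chars" (embedded content) and "uri" (location of external content).
    `fetch` is a caller-supplied function that retrieves a URI as text.
    """
    # Choose the parser from the format; if it is not understood, do nothing.
    parser = PARSERS.get(body.get("format"))
    if parser is None:
        return None
    # If the resource is not embedded, retrieve it.
    text = body.get("chars")
    if text is None:
        text = fetch(body["uri"])
    # Parse the body; acting on the parsed result is left to the caller.
    return parser(text)

result = process_body(
    {"format": "application/json", "chars": '{"depicts": "Moon"}'},
    fetch=None,  # not needed here because the content is embedded
)
print(result)
```

Nothing in the function branches on RDF specifically: adding an RDF parser would mean one more registry entry, not a new code path.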

Hope that makes things more clear?

And again, if there is a well supported technique for embedding RDF
that allows the Annotation to be parsed, and to embed multiple
statements, we'd very much like to know about it.

Rob

Vladimir Alexiev

Feb 9, 2012, 4:16:02 AM

to oac-d...@googlegroups.com

> * We would have MUCH preferred to use a named graph approach, but our
> understanding is that TriG is not widely supported.

- You don't have to use named graphs. It could all be represented as triples
(ala reification)
- Please define "widely supported". TriG is supported by Sesame and Jena
(through RIOT), just google "sesame parse trig" or "jena parse trig"
- Certainly parsing cnt:chars is less widely supported

> * Reification is possible, but it is our understanding that it has
> fallen out of favor.

That's what many sites say, and rumor is the new RDF working group will
obsolete it officially...
But I think no clear-cut reasons are provided, so for me the question is
still open.
For ResearchSpace I chose to use reification because then I can annotate a
property-instance with or without talking about the object.
(i.e. rdf:predicate with or without rdf:object).

> * The vast majority of annotation consuming systems will need to dig
> deeper into the Body resource anyway. For plain text you would want
> to enable keyword searching, which requires parsing (into words) and
> storing (into a keyword database)

OWLIM (and some other semantic repositories) do FTS indexing over RDF
literals (and URIs).
In OWLIM you can tune how the FTS molecule (FTS document associated to each
node) is formed.

> overall it's a very minimal condition when
> the resource is in RDF and embedded within the annotation, and having
> a special case for it seemed overkill.

> Consistency is very important in a standard. By not introducing a
> special case, all processors have the same (minimal) algorithm

But OAC is based on RDF, so clearly RDF is already a special case!
Any OAC processor necessarily needs to work with RDF.
By recommending that part of the data should be stored in an encoded way,
you make it harder for use cases where all data can be captured in RDF.

For the same reason, I think the spec should promote RDF
extensions/specializations, hence my previous proposal "RDF Constraints".

> with annotation bodies:
> - If the mime-type is given, and not understandable, do nothing
> - If the resource is not embedded, retrieve it.
> - Determine the correct parser to use based on the format of the body
> - Parse the body
> - Act on the parsed information

> If there were a special case for RDF, then that needs to be checked
> for every time by every processor.

But the last bullet varies by the nature of the data!
- if it's RDF then you can query it, search by specific fields, count it,
etc. using SPARQL
- if it's anything else, you can't do that unless you first store it locally
then define your own query language

And if the data is in RDF then the processor only needs to do the last
bullet.

> * Provenance is very important to maintain. If the triples of the
> Body end up directly in the Annotation graph, then they would be
> assigned the same creator (and other metadata) as the Annotation.

How is this necessarily so? Correct me if I'm wrong but the spec doesn't say
a lot about provenance, so I think both of these below are viable:
- oac:Body could have its own dc:creator, different from that of
oac:Annotation
- an application can use named graphs to put each piece in its own graph,
and provide lots of provenance info about the graph

You don't need to pack data to an encoded string to give it provenance
information.
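
As a rough illustration of the second option (one graph per piece, with provenance stated about the graph), here is a small Python sketch. The quad list, the graph names and the `creator_of_statement()` helper are hypothetical and not drawn from the spec or from any particular RDF library.

```python
# Toy quad store: (subject, predicate, object, graph).
# All identifiers below are hypothetical.
quads = [
    ("ex:anno1", "rdf:type", "oac:Annotation", "ex:annoGraph"),
    ("ex:anno1", "oac:hasBody", "ex:body1", "ex:annoGraph"),
    ("ex:body1", "foaf:depicts", "ex:Moon", "ex:bodyGraph"),
    # Provenance is stated per graph, not inherited from the annotation.
    ("ex:annoGraph", "dc:creator", "ex:AnnotationTool", "ex:meta"),
    ("ex:bodyGraph", "dc:creator", "ex:ThirdPartySource", "ex:meta"),
]

def creator_of_statement(quads, s, p, o):
    """Find the graph holding (s, p, o), then return that graph's dc:creator."""
    for qs, qp, qo, graph in quads:
        if (qs, qp, qo) == (s, p, o):
            for gs, gp, go, _ in quads:
                if gs == graph and gp == "dc:creator":
                    return go
    return None

body_creator = creator_of_statement(quads, "ex:body1", "foaf:depicts", "ex:Moon")
anno_creator = creator_of_statement(quads, "ex:anno1", "oac:hasBody", "ex:body1")
print(body_creator, anno_creator)
```

The body statement keeps its own asserter even though it sits in the same store as the annotation, and it remains queryable as a triple.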

> if there is a well supported technique for embedding RDF
> that allows the Annotation to be parsed, and to embed multiple
> statements, we'd very much like to know about it.

Sure: that technique is RDF :-)

PS: please consider this principle
** EVERY piece of OAC data should be represented in RDF
- It's an extension of the current topic (don't encode data that's already
in RDF).
- I have not proposed it since I'm not sure about it
- it would imply breaking up SVG and fragment-URI data into triples
- cons: it would not leverage established standards for representing
polylines and media fragments
- pro: the data would be available for direct processing, no parsing
required
- pro: it would enable more elaborate ontological modeling (e.g. think CRM:
E46 Section Definition and temporal modeling)

Robert Sanderson

Feb 9, 2012, 11:44:36 AM

to oac-d...@googlegroups.com

On Thu, Feb 9, 2012 at 2:16 AM, Vladimir Alexiev <vlad...@sirma.bg> wrote:
>> * We would have MUCH preferred to use a named graph approach, but our
>> understanding is that TriG is not widely supported.
> - You don't have to use named graphs. It could all be represented as triples
> (ala reification)

Reification is not even on the table any more at the W3C as far as I can tell.
And if you read the following thread, even graph literals are very
*very* hotly debated:
http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0021.html

And also here on the Linked Data list about named graphs:
http://lists.w3.org/Archives/Public/public-lod/2012Jan/0006.html

As Ivan Herman says, there are 400-500 mails on this subject in the
RDF working group archive. We are not going to solve it over in the
very small OAC corner of the RDF world.

> - Please define "widely supported". TriG is supported by Sesame and Jena
> (through RIOT), just google "sesame parse trig" or "jena parse trig"

By widely supported I mean that if you pick any reasonably thoroughly
developed and supported RDF library across all commonly used
languages, then it is supported.
For example: rdflib in Python is the most well known/well used library
and it appears to only support TriX. RAP for PHP only supports TriX,
but easyrdf does seem to support TriG. RDFQuery for Javascript does
not appear to support either. And so forth.

> - Certainly parsing cnt:chars is less widely supported

I think you miss the point. If we mandate a named graph serialization,
then we are *significantly* limiting the number of frameworks in which
OAC can be implemented. We would be saying "You must use this one
serialization and one of these few implementations", and that's not
something that I'm prepared to do (speaking for myself, if not OAC in
general). The difference between supporting a particular RDF
construction and a different data model plus serialization is very
significant. If you don't support the data model and serialization,
then you cannot parse the annotation *at all*. If you don't support
the already mandated cnt:chars, then you can't process the contents of
the body, but you can still understand the annotation as a whole.

>> * Reification is possible, but it is our understanding that it has
>> fallen out of favor.
> That's what many sites say, and rumor is the new RDF working group will
> obsolete it officially...

Then we're not going to do that. Writing a known to be obsolete
specification is counter productive.

>> Consistency is very important in a standard. By not introducing a
>> special case, all processors have the same (minimal) algorithm
> But OAC is based on RDF, so clearly RDF is already a special case!

Almost ... OAC has a data model which is expressed in RDF.

> Any OAC processor necessarily needs to work with RDF.

Currently, yes. But the data model can easily be transferred to
another encoding. This would not be the case if it mandated particular
RDF constructions.

> By recommending that part of the data should be stored in an encoded way,
> you make it harder for use cases where all data can be captured in RDF.

There is nothing preventing you from putting in additional triples to
an annotation graph and still being OAC compliant.

[... snip ...]

> PS: please consider this principle
> ** EVERY piece of OAC data should be represented in RDF

We have considered it, and it's simply not the direction in which OAC
is going. We already have pushback from developers in different
communities and people on this very list about the RDF-centric
approach. To then deprecate the use of existing standards, in favor
of non-standard approaches which are not widely implemented, is simply
not going to happen within an *interoperability* specification.

Rob

Vladimir Alexiev

Feb 10, 2012, 8:05:02 AM

to oac-d...@googlegroups.com

> There is nothing preventing you from putting in additional triples to
> an annotation graph and still being OAC compliant.

Agree.
But OAC should promote extensions that standardize such triples,
and should NOT recommend that triples be encoded in an opaque way.

> > Any OAC processor necessarily needs to work with RDF.
> Currently, yes. But the data model can easily be transferred to
> another encoding. This would not be the case if it mandated particular
> RDF constructions.

*Currently* OAC is expressed in RDF.
*Currently* OAC recommends particular RDF constructions (namely, the OAC
classes and properties).

Yet currently you recommend that certain RDF data should be encoded: what's
the benefit?
If in the future OAC decides to target a different data model,
it would have to specify how all current OAC RDF constructions are mapped to
that data model.
Having part of the RDF data encoded may or may not save you something in the
*future*,
but surely it's a concrete disadvantage *now*.

> Reification is not even on the table any more at the W3C...
> graph literals are very *very* hotly debated...
> [also hot discussion] about named graphs...

Ok, so OAC should not take a standpoint and pick one or the other
construct/approach.
But it should *not* recommend encoding the RDF either!

> ... appears to only support TriX... seem to support TriG... does not appear
> to support either

> If we mandate a named graph serialization,
> then we are *significantly* limiting the number of frameworks

Where did I propose to mandate a named graph serialization???
I said OAC should recommend that RDF properties should be represented as
triples, and not encoded.

If these tools are important to the OAC community, then OAC should mandate
that named graphs should not be used.
But by not mandating the format of cnt:chars, how do you increase
interoperability?

You cannot resolve a problem by hiding it under a carpet.
You cannot resolve the open question "what RDF constructs are appropriate"
by putting the RDF in an encoded string.

> To deprecate the use of existing standards, in favor
> of non-standard approaches which are not widely implemented, is simply
> not going to happen within an *interoperability* specification.

Ironically, encoding triples into anything does exactly that in regards to
the RDF standard :-)

Robert Sanderson

Feb 10, 2012, 9:52:50 AM

to oac-d...@googlegroups.com

Hi Vladimir,

I think I'm getting confused :)

If you're not promoting named graphs (not standard yet, and breaks
non-conforming implementations), reification (obsolete soon), or
encoding in a literal with cnt:chars, could you explain in more detail
what you are in favour of?

Could you please explain how you would serialize the following situation?

A machine has extracted triples from a natural language paragraph that
describes the moon in Van Gogh's The Starry Night, and knows where the
moon is in a time range of a video. It wants to describe the target
area within the frames as a circle, but uses the media fragments
standard to identify the time range in the video.

The attached diagram shows this, but doesn't actually make explicit
that the box is a cnt:ContentAsText, and the value is in cnt:chars.
How would you do it?

Also, perhaps you could take a stab at turning SVG and CSS into RDF?
We use SVG for areas within video and image data (or anything with 2
dimensions, really) and CSS for providing styling hints to the client.
This second requirement has come out of discussions with Annotation
Ontology and the NISO eBooks working group, to explain why it isn't in
the beta spec.

Many thanks!

Rob

moma-moon-machine.png

Vladimir Alexiev

Feb 14, 2012, 9:44:16 PM

to oac-d...@googlegroups.com

Hi Rob!

> cnt:ContentAsText, and the value is in cnt:chars.

Thanks for the excellent example! Discussing a particular example can
advance the discussion faster.

So it's something like this (blatantly skipping prefixes for brevity):

<body> a oac:Body, cnt:ContentAsText;
cnt:chars """uu1 foaf:depicts <Moon>;
shape <Crescent>;
bolometricLuminosity [value 0.5; units <SolarLuminosity>].
""".

> what you are in favour of? How would you do it?

I am in favor of not packing RDF into an encoded string, but keeping it as
triples. I'd do it like this:

<body> a oac:Body;
foaf:depicts <Moon>;
shape <Crescent>;
bolometricLuminosity [value 0.5; units <SolarLuminosity>].

What's the trouble with this representation?
- cnt:chars "boxes" the Body (very obvious from your picture :-)
- I think boxing into a named graph is much better, despite the doubts
whether named graphs are implemented by various client frameworks
- but WHY is such a box needed in the first place? What's the problem with
an unboxed representation as above?

> If you're not promoting reification (obsolete soon)...

This thread has nothing to do with reification.
I've used rdf:subject, rdf:predicate and rdf:object in another thread (for
ConstrainedBody over RDF data).
But I'd be much obliged if anyone can explain what's the curse hanging over
these 3 simple properties.
If I called them something else, would you accept my proposal for
ConstrainedBody over RDF data?

> take a stab at turning SVG and CSS into RDF?

If I had a need for that and a use case, I could :-)
- if you have a SVG engine at your disposal and need general-shape image
annotations, SvgConstraint in a string is great
- if you need simple rectangles, using a #xywh fragment is simpler and therefore better
http://www.openannotation.org/spec/beta/#DM_Frag_Media
- if you need to correlate image coordinates (pixels) to painting
coordinates (cm) in order to correlate:
-- image annotations to sampling records (e.g. "took pigment sampling 3cm
from TOP and 2cm from LEFT")
-- or annotations on several images of the same painting
THEN I'd model this with explicit RDF properties. And I'd never encode those
to cnt:chars
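
For the pixel-to-centimetre case, a minimal Python sketch of the arithmetic might look like the following. It handles only the pixel form of the `#xywh` media fragment; the example URI, the function names and the scale of 100 pixels per cm are assumptions chosen to match the "3cm from TOP and 2cm from LEFT" sampling record.

```python
from urllib.parse import urlparse

def parse_xywh(uri):
    """Extract (x, y, w, h) in pixels from a '#xywh=x,y,w,h' fragment.

    Only the pixel form is handled; the 'percent:' form is not.
    """
    fragment = urlparse(uri).fragment
    for part in fragment.split("&"):
        name, _, value = part.partition("=")
        if name == "xywh":
            if value.startswith("pixel:"):
                value = value[len("pixel:"):]
            x, y, w, h = (int(n) for n in value.split(","))
            return x, y, w, h
    raise ValueError("no xywh fragment in " + uri)

def pixels_to_cm(region, pixels_per_cm):
    """Convert a pixel region to painting coordinates, given the image scale."""
    return tuple(n / pixels_per_cm for n in region)

# Hypothetical image URI and scale (100 pixels per cm of painting).
region = parse_xywh("http://example.org/painting.jpg#xywh=200,300,100,50")
region_cm = pixels_to_cm(region, pixels_per_cm=100)
print(region_cm)  # x and y come out as 2 cm from LEFT and 3 cm from TOP
```

Once the region is held as explicit numeric values, correlating it with a sampling record or with another image of the same painting is a comparison of numbers rather than a string-decoding step.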

--

In this thread I argue that if you have RDF data, given that OAC is
expressed in RDF, you should NOT encode the RDF data.
If you encode the RDF, that defeats semantic repository indexes and query
optimizations.

You wouldn't store encoded relational data in a single RDBMS field (that
violates several Codd normal forms),
so why would you do it for RDF?
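
The relational analogy can be illustrated with a small sketch using Python's built-in sqlite3 module. The table names, column names and the `depicts=Moon;shape=Crescent` encoding are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Encoded: all properties packed into one text field.
conn.execute("CREATE TABLE body_encoded (id TEXT, data TEXT)")
conn.execute(
    "INSERT INTO body_encoded VALUES ('body1', 'depicts=Moon;shape=Crescent')")

# Normalized: one row per property value.
conn.execute("CREATE TABLE body_props (id TEXT, prop TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO body_props VALUES (?, ?, ?)",
    [("body1", "depicts", "Moon"), ("body1", "shape", "Crescent")])

# Normalized rows can be filtered exactly (and indexed) by the database.
normalized = conn.execute(
    "SELECT id FROM body_props WHERE prop = 'depicts' AND value = 'Moon'"
).fetchall()

# The encoded field only supports substring matching; the application
# still has to decode the string to be sure what it found.
encoded = conn.execute(
    "SELECT data FROM body_encoded WHERE data LIKE '%depicts=Moon%'"
).fetchall()

print(normalized)
print(encoded)
```

The second query returns the whole encoded string, so counting, joining or filtering on individual properties falls back to application code, which is the same situation as RDF packed into cnt:chars.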
