How to identify instances of Content

Gedas

Nov 24, 2006, 4:12:07 PM
to atom-owl
Hello there,

I have been working on my Master's thesis, which is related to Atom 1.0
feeds, and I have a problem that maybe you could help me solve.

There is an ontology representing the Atom feed and its whole structure (
http://bblfish.net/work/atom-owl/2006-06-06/AtomOwl.html ). I parse the
feeds and create instances in the ontology based on this schema. To
work with the ontology I use the Jena 2.4 framework.

Following the idea of RDF, every instance has to be identified by a URI.
For :Feed and :Entry it is possible to identify instances by the :id
attribute. A :Person (in the role of :author or :contributor) can be
identified by :email. A :Category, as I understand it, is identified by
its :scheme attribute. But how do I identify :Content instances?

Because :Entry and :Content have a one-to-one relation, I was thinking
of working around this by naming the instances "Entry:id" + "#content",
but it does not seem clean to me...

A graphical representation of the problem is shown at:
http://www.stud.ktu.lt/~gedsaka/feeds/dublicated_content_instances.png
. It shows a subset of the feeds I parsed twice from tbray's blog. If
I parse them once more, another :Content instance will appear that is
connected to the same :body and the same :type literals.

Maybe there is something wrong in my approach to creating and
identifying these instances? Or is there a good suggestion for how to
identify these :Content instances?

Thanks for your advice!

Best regards,
Gediminas Sakalauskas

Henry Story

Nov 24, 2006, 9:41:26 PM
to atom...@googlegroups.com
On 24 Nov 2006, at 13:12, Gedas wrote:
> Hello there,
>
> I have been working on my Master's thesis, which is related to Atom 1.0
> feeds, and I have a problem that maybe you could help me solve.

Very nice. Let me do my best.

> There is an ontology representing the Atom feed and its whole structure (
> http://bblfish.net/work/atom-owl/2006-06-06/AtomOwl.html ). I parse the
> feeds and create instances in the ontology based on this schema. To
> work with the ontology I use the Jena 2.4 framework.

Very good. You may want to check a very recent mail on this archive.
I have just released a new xquery to transform atom to AtomOwl. I
will publish that as soon as I get a little feedback. Though this is
a little problematic as it creates even more anonymous blank nodes.

> Following the idea of RDF, every instance has to be identified by a URI.

Or a blank node!
If you don't have a URI to assign to an object, you can use a blank
node, which is just a way of saying: "there exists some thing that...."

> For :Feed and :Entry it is possible to identify instances by the :id
> attribute. A :Person (in the role of :author or :contributor) can be
> identified by :email. A :Category, as I understand it, is identified by
> its :scheme attribute. But how do I identify :Content instances?

This is perhaps the slightly tricky part of AtomOwl. I think I will
need to make this clearer on the AtomOwl page. An Atom Entry is a
blank node. You cannot identify it with the id, because an entry
represents the state of a resource at a time. An Atom Feed can have a
number of entries with the same id. If you use the id to identify the
entry, then you will end up with an entry with multiple titles,
multiple updated time stamps, etc...

If you want to generate ids for entries, the best you may be able to do
is to create a DURI, combining the updated time stamp and the id.
( http://larry.masinter.net/duri.html )
Or you could have an inferencing engine that understood CIFPs
( http://esw.w3.org/topic/CIFP ).

But the Atom group has not wanted to make any statement about
identity. They have recently added an app:edited time stamp. As a
result the above DURI could be misleading. A CIFP on (:id, :updated)
could at best be seen as a useful theory that may help you with
garbage collection from time to time. Perhaps as Atom evolves the
CIFP will have to be on (:id, :edited).
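
As a rough illustration of treating (:id, :updated) as a CIFP, the following Python sketch merges entry records that agree on both values. The dict layout and the function name merge_by_cifp are assumptions made for this example; they are not part of AtomOwl or Jena.

```python
def merge_by_cifp(entries, key_props=("id", "updated")):
    """Merge records that agree on every property in key_props.

    Each record is a plain dict standing in for a blank node
    (a hypothetical layout, not an AtomOwl or Jena structure).
    """
    merged = {}
    for entry in entries:
        # Two nodes with the same key values are taken to be the same node.
        key = tuple(entry[prop] for prop in key_props)
        merged.setdefault(key, {}).update(entry)
    return list(merged.values())


entries = [
    {"id": "http://example.org/item", "updated": "2006-11-24T16:12:07Z", "title": "First"},
    {"id": "http://example.org/item", "updated": "2006-11-24T16:12:07Z", "title": "First"},
    {"id": "http://example.org/item", "updated": "2006-11-26T09:35:01Z", "title": "Revised"},
]
unique = merge_by_cifp(entries)
print(len(unique))  # two distinct entry states remain
```

The same function could be called with key_props=("id", "edited") on records that carry an edited value, if that pair turns out to be the better key.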

> Because :Entry and :Content has relation One-To-One I was thinking to
> overcome this and name the instances "Entry:id" + "#content", but it
> not seems clean for me...

A Content is very close to a literal. I don't know exactly what the
identity criterion of an id should be. You may wish to have it
stronger than duri:{date}:{entryid}#content, as you may want
different entries to be able to share the same content.

For the moment this is unclear.

Perhaps a CIFP on :body and :type would be good. Any two contents
with the same body, and same mime types, are the same content?

This would argue for defining urls for each mime type, and then using
them to define literals, which could be used like so:

"<div></p></div>"^text:xhtml .

( which is just shorthand for:
[] text:xhtml "<div></p></div>" )

If text:xhtml is inverse functional, then the above is equivalent to

"<div></p></div>"^^text:xhtml .

where text:xhtml is a subPropertyOf :body. I have been arguing for
and against that on this list.
I can't remember my arguments against any more. Oh yes, see
http://groups-beta.google.com/group/atom-owl/browse_thread/thread/b3779504d28d482a?hl=en
for an argument against the last step, making content a literal.

But I don't think that is an argument against (:body :type) being a
CIFP. Hmmm. No, it is, because if the HTML contains relative links,
then the same body with two different :base relations could in fact be
very different HTML content.

[] :content [ :base <http://sex.eg.com/>;
              :type "text/html";
              :body "This is <a href=\"/fun.html\">fun</a>" ] .

compare

[] :content [ :base <http://jesus.eg.com/>;
              :type "text/html";
              :body "This is <a href=\"/fun.html\">fun</a>" ] .
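
A small Python sketch of why :base matters, using the standard library's urljoin to resolve the relative link from the two examples above. The variable names are only illustrative.

```python
from urllib.parse import urljoin

relative_link = "/fun.html"  # the href used in both :body values

# Resolve the same relative link against each :base.
first = urljoin("http://sex.eg.com/", relative_link)
second = urljoin("http://jesus.eg.com/", relative_link)

print(first)
print(second)
print(first == second)
```

The two bodies are character-for-character identical, yet the link they contain points at different resources.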


The tricky thing here for CIFPs is to ask when (:type :body) form a
CIFP and when (:type :base :body ) form a CIFP. This may be what
people are thinking when they say that we are missing tuples in rdf.
Because of the open world assumption we cannot deduce from the
absence of a :base relation that there is not going to be an
extra :base or :lang relation that is going to be discovered at a
later date to distinguish two pieces of content we thought were the
same. As a result, we would have to put a CIFP on
(:body :type :base :lang) and then give a value for each of these, even
if only a default value. This is OK if we think that we have a complete
list of all the relations that are distinctive on a piece
of :content, which I am not sure we have (but I may be wrong). This
may be just a question of working case by case, for different mime
types.
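
One way to picture a CIFP on (:body :type :base :lang) with default values is the Python sketch below. The default values, the dict layout, and the name content_key are assumptions for illustration only; AtomOwl does not define them.

```python
# Hypothetical defaults used when a content node says nothing about a property.
DEFAULTS = {"base": "", "lang": ""}


def content_key(content):
    """Return the tuple of values that would identify a content node."""
    filled = {**DEFAULTS, **content}
    return (filled["body"], filled["type"], filled["base"], filled["lang"])


body = 'This is <a href="/fun.html">fun</a>'
a = {"body": body, "type": "text/html", "base": "http://sex.eg.com/"}
b = {"body": body, "type": "text/html", "base": "http://jesus.eg.com/"}
c = {"body": body, "type": "text/html"}
d = {"body": body, "type": "text/html"}

print(content_key(a) == content_key(b))  # False: the :base values differ
print(content_key(c) == content_key(d))  # True: both fall back to the defaults
```

The open-world caveat still applies: if a :base for c were discovered later, its key would change.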

Well this is not completely true. Inside a graph one can reason with
a closed world. One can reason about what is not said inside a graph.
(see http://www.w3.org/2000/10/swap/doc/Reach )
So inside a graph one could create rules to identify 2 contents that
are the same. Across graphs this works less well in a generalized
form. But perhaps I just have not thought about it carefully enough.

You can reduce the duplicate contents somewhat if you give an entry a
stronger identity, since an entry can only have one content (:content
is functional). This is the equivalent of your idea of giving the
content a URI#content. Again, an inferencing engine would not need the
id to deduce that the two contents are the same.

I think this would be a good topic for research: what are the rules
of content identity.

There is a different way of thinking about content, shown here:
http://infomesh.net/2002/notation3/#proposed

[ is :author of [ :title [ :fr "Le Petit Prince" ] ] ] :name "Fred" .

What does this suggest for content?

Perhaps

[] :content [ :fr [ :html [ :body "<b>dis</b>" ] ] ] .

But I don't think that helps identify the content, nor does it make
things clearer.


> A graphical representation of the problem is shown at:
> http://www.stud.ktu.lt/~gedsaka/feeds/dublicated_content_instances.png
> . It shows a subset of the feeds I parsed twice from tbray's blog. If
> I parse them once more, another :Content instance will appear that is
> connected to the same :body and the same :type literals.
>
> Maybe there is something wrong in my approach to creating and
> identifying these instances? Or is there a good suggestion for how to
> identify these :Content instances?

No. It's just that content identity is not well defined. It would
have been easier if we could have just used literals, as these make
literal identification easier.

If you can think of a way to turn these into literals again, I'd be
thankful.

Have a look at this though:

http://groups-beta.google.com/group/atom-owl/browse_thread/thread/b3779504d28d482a?hl=en

Henry

Gediminas Sakalauskas

Nov 26, 2006, 9:35:01 AM
to atom...@googlegroups.com
Hello Henry,

Thanks so much for your advice! It makes everything much clearer in some respects.

Here I will try to write down what I understood. I find that the
perception of all this Semantic Web stuff is ambiguous: even when people
are thinking along the same lines, sometimes they do not agree ;)

First, I hope that the new AtomOwl will not change too much, and that
the increasing number of anonymous blank nodes will not affect my code
too much. It will also be very interesting to see the improved schema.

I knew about blank nodes, but when I have instances I can't use them
without identification. It is like table records in a relational DB
without a primary key.

Here we have a case where it matters not only WHERE the resource is,
but also WHEN it was modified. In an RDB we would have PK + timestamp
to identify this.

So I think the best and easiest solution would be to use a URN based
on the duri and tbr methods, with the pattern:
urn:tbr:<dateUpdated>:<id>
- dateUpdated – year, month, day, hour, minutes (200611261601)
- id – taken from the id tag, which in most cases is a URI (http://example.org/item)

So to identify instances I am considering the following rules:
- For :Feed - :id + :updated
- For :Entry - :id + :updated
- For :Content - URN of its entry + #Content
If this later changes so that :Content is used only as a literal, it
will not be much of a problem to drop it.
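
The naming rules above could be sketched in Python as follows. The function names entry_urn and content_name are invented for this example, and the pattern is simply the one written in this message, not a registered URN namespace.

```python
from datetime import datetime


def entry_urn(entry_id, updated):
    """Build urn:tbr:<dateUpdated>:<id> with minute precision."""
    return f"urn:tbr:{updated:%Y%m%d%H%M}:{entry_id}"


def content_name(entry_id, updated):
    """Name the entry's single content after the entry URN."""
    return entry_urn(entry_id, updated) + "#Content"


updated = datetime(2006, 11, 26, 16, 1)
print(entry_urn("http://example.org/item", updated))
print(content_name("http://example.org/item", updated))
```

Because dateUpdated stops at minutes, two updates of the same entry within one minute would receive the same name under this sketch.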

Then, to identify
- :Person - :email of the author or contributor
- :Generator - :generatorVersion
But this is still not clear to me. Maybe there is some information
about how to use foaf:Agent, which aggregates both? Some information on
it would be good.

Then :Category is fully described by its :scheme attribute. But the
strategy for using :scheme + :term is still unclear. Is :scheme
just the root element of the category graph or tree, with a
combination of :term attributes representing the path? Or does :term
represent only the leaves of the tree, while the :scheme structure
represents all the nodes?

I would also be thankful if you could point me to some information that
explains the usage of :Link. At the moment it is unclear to me why it
connects :FeedOrEntry with :Content. Why do we have these attributes in
:Link? I have been reading the mailing list and docs, but I did not
catch the idea.

That's all for this time.

Thank you in advance.

Best regards.
Gediminas Sakalauskas

Henry Story

Nov 27, 2006, 9:07:14 PM
to atom...@googlegroups.com
On 26 Nov 2006, at 06:35, Gediminas Sakalauskas wrote:
> [...]
atom2turtle.xquery