On Graphs

Henry Story

Jul 20, 2006, 10:09:25 AM

to atom...@googlegroups.com, bloged, us...@sommer.dev.java.net

Elias Torres has put up an Atom server with a SPARQL endpoint. It
uses graphs a lot, so this is perhaps a good time to get a better
understanding of that area.

1. Queso
--------

Elias recently put up an Atom server called Queso, with a SPARQL
interface using an older version of the atom-owl ontology.
See his blog posting: http://torrez.us/archives/2006/07/17/471/ (note
that the server does break down every so often).

What surprised me at first is Queso's use of named graphs. Using the
sparql command line tool that comes with Jena's ARQ library, I can
query their endpoint like this:

hjs@bblfish:0$ bin/sparql --query ~/tmp/rdf/query --service http://abdera.watson.ibm.com:8080/sparql

After having placed the following in the "query" file:

PREFIX : <http://www.w3.org/2005/10/23/Atom#>
SELECT ?g ?id ?t
WHERE { GRAPH ?g {
[ :id ?id; :title [ ?r ?t] ].
}
}

Notice that one has to specify which named graph one queries for this
to work.

It turns out that the graphs are named by the URN ids of the entries
or feed documents. So the entry with id
urn:lsid:abdera.watson.ibm.com:entries:240559180 can be retrieved
like this:

hjs@bblfish:0$ cwm.py http://abdera.watson.ibm.com:8080/atom/entry/edit/urn:lsid:abdera.watson.ibm.com:entries:240559180 --n3

2. BlogEd
---------

I have been thinking of things from a client perspective mostly,
working as I am on BlogEd, so things present themselves to me in a
different light.

The current version of BlogEd is just a blog editor that keeps a full
history of all blog entries written by its single author. There is
no need there to keep track of any graphs: all the data comes from the
same person and is produced by the same tool, so there is never any
need to remove information from the one big graph that contains all
the information in BlogEd.

In the next version of BlogEd on the other hand I want to go a lot
further. I want to have BlogEd be both a blog editor and aggregator,
so that I can use it to read news as well as write it. This means
that I will be reading feeds all over the place on the internet. And
since those feeds can change, I need to be able to track where I got
information from, so that I can also remove that information or at
least ignore it, when it changes.

So the idea is that if I read a feed from http://crook.com/myfeed for
a few weeks, and then realise that I never want to have anything to
do with that feed, I should just be able to delete the graphs tagged as
having come from <http://crook.com/myfeed>, and perhaps the graphs
linked to from that feed by the 'next' relation. Those linked graphs
would of course have to contain 'first' relations that point back to
crook.com/myfeed, or else it may end up deleting unrelated feeds. If I
don't keep provenance information I would find it difficult to remove
that feed without also removing information others have published
about that feed.

This means that for BlogEd every Atom file will need to be a graph.
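
The deletion rule described above can be sketched roughly as follows.
This is only an illustration, not BlogEd code: it assumes an in-memory
store that maps each graph name (the URL the document was fetched from)
to a set of plain (subject, predicate, object) string triples. The
predicate names "iana:next" and "iana:first" and the function name
forget_feed are made up for the example.

```python
# Hypothetical in-memory quad store: graph name (source URL) -> set of triples.
# Triples are plain (subject, predicate, object) string tuples.
NEXT = "iana:next"    # assumed predicate linking a feed page to its next page
FIRST = "iana:first"  # assumed predicate linking a page back to the first page


def forget_feed(store, feed_url):
    """Delete the graph fetched from feed_url, plus any pages reachable
    through 'next' links that point back to feed_url with 'first'.
    Returns the list of graph names that were removed."""
    removed = []
    pending = [feed_url]
    while pending:
        name = pending.pop()
        triples = store.pop(name, None)
        if triples is None:
            continue  # never fetched, or already removed
        removed.append(name)
        for s, p, o in triples:
            if p != NEXT or o not in store:
                continue
            # Only follow the link if the target page itself claims feed_url
            # as its first page; otherwise it is somebody else's feed.
            if (o, FIRST, feed_url) in store[o]:
                pending.append(o)
    return removed
```

Because every document lives in its own graph, anything that other
feeds have said about crook.com/myfeed stays untouched: only the graphs
that came from that feed are dropped.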

3. The User's Perspective
-------------------------

RDF data should be very mergeable. So what I am wondering is how much
an end user should have to know about how the information is clustered
into graphs. Since there are so many ways of doing this, it could end
up being quite complicated for an end user to know how to query the
data.

3.1 Querying a Server
---------------------

In the case of Queso I wonder if the end user really needs to know
anything about the graphs at all. Since you control all the data on
your server, it should be quite easy for you to merge all of it
without fear of contradiction. You know if the title is correct, you
can make sure that there is no id that is broken, and that the
updated time stamps are correct. Queso controls all the data. So it
should be possible to merge all the graphs and have that be the
default graph queried.
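
As a minimal sketch of what "merge all the graphs" amounts to, assume
again that each named graph is just a set of (subject, predicate,
object) string triples, and that blank nodes have already been given
labels that are distinct per graph. The merged default graph is then a
plain set union. The function names are hypothetical and this says
nothing about how Queso stores its data.

```python
def merged_default_graph(named_graphs):
    """Return the union of all named graphs as one set of triples.
    A triple stated in several graphs appears only once in the result."""
    merged = set()
    for triples in named_graphs.values():
        merged |= triples
    return merged


def select(graph, predicate):
    """Tiny stand-in for a query: all (subject, object) pairs that are
    related by the given predicate, in sorted order."""
    return sorted((s, o) for s, p, o in graph if p == predicate)
```

With such a merged graph, a query like select(merged, ":id") no longer
needs to name any graph at all, which is the convenience the end user
is after.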

3.2 Querying BlogEd
-------------------

If BlogEd had a SPARQL query interface, then it may not be so easy to
reliably create a merged graph of all the feeds read, since these
feeds may contain contradictory information, the source of
information not being controlled by BlogEd.

It would be interesting to work out which information is most likely
to be corrupted. Perhaps there is information that can't be corrupted,
and that can therefore always be added to the general pool of knowledge.

Let us take an <entry>...</entry> document placed at <http://ok.com/e1>:

<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href="http://example.org/2003/12/13/atom03"/>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2003-12-13T18:30:02Z</updated>
<summary>Some text.</summary>
</entry>

If I translate that into N3:

[ a :Entry;
  :title "Atom-Powered Robots Run Amok"^:text;
  iana:alternate [ a :Content; :src <http://example.org/2003/12/13/atom03> ];
  :id "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"^^xsd:anyURI;
  :updated "2003-12-13T18:30:02Z"^^xsd:dateTime;
  :summary "Some text."^:text
] .

Then unless I have a CIFP on :id and :updated, which could merge two
entries, adding some false information to the database won't overwrite
good information. The worst that could happen is that some queries on
the id would relate entries that don't really have anything in common.
But perhaps it is on discovering that two such entries don't really
have anything in common that the software could go on to ask the user
how to deal with the situation: which entry is the likely offending
one, and what to do about it.
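
One way the software might notice such a situation is sketched below,
under simplifying assumptions: graphs are sets of (subject, predicate,
object) string triples keyed by the URL they were fetched from, and
titles are flattened to plain strings rather than going through the
:text relation. The function name find_id_clashes is hypothetical.

```python
from collections import defaultdict


def find_id_clashes(named_graphs):
    """Group entries by their (:id, :updated) pair and report every pair
    that is claimed by entries whose titles disagree, together with the
    graphs those entries came from."""
    by_key = defaultdict(set)  # (id, updated) -> {(graph name, title)}
    for name, triples in named_graphs.items():
        props = defaultdict(dict)  # entry node -> {predicate: object}
        for s, p, o in triples:
            props[s][p] = o
        for entry in props.values():
            if ":id" in entry and ":updated" in entry:
                key = (entry[":id"], entry[":updated"])
                by_key[key].add((name, entry.get(":title", "")))
    return {
        key: sorted(claims)
        for key, claims in by_key.items()
        if len({title for _, title in claims}) > 1
    }
```

Each clash lists the graph names involved, so the user can be asked
which source is the likely offending one without any of the original
statements having been overwritten.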

So even here, it may be worth always having the default graph be a
merge of all the other ones. It makes querying a lot easier.