So I've given it a lot of thought and I have a proposal. Please give it
some thought and let me know what you think.
I'd like to reposition Versa as more of a general-purpose Web data query
language. Certainly it should always be as compatible as possible
with RDF, but I'd like to give it a richer domain, and a simpler,
well-bounded data model. Several reasons for this:
1) So much of what's happening with regard to Web data, whether we like
it or not, is not happening directly in RDF. A good example is social
networking relationships. Sure the LOD community [1] is trying hard to
keep up with all that work, but I'm not convinced they'll always be able
to do so. It would be nice to generalize the idea of Web data query to
address this more directly.
2) Some of the stuff happening in more general Web data does not fit the
RDF model well as is, anyway. N-ary relationships and ordered
information are a great example. RDF is frankly pretty cumbersome in
the way it captures such constructs, and it would be nice not to have to
wrangle with RDF-isms while trying to get to the essence of the data.
I'd like Versa to be able to handle such constructs more directly.
3) SPARQL, like it or not, has the official stamp on it as the RDF query
language, and I think it would be productive to reduce the head-to-head
positioning.
The actual proposal is to define Versa 2.0 as a lightweight data model,
based on RDF, but designed for a bit more expressiveness, as mentioned
above. The sketch of the data model is:
The graph is represented as a set of n-tuples, not triples. This
accommodates RDF as well as all the various RDF subgraph (quad)
proposals, and a more general view as well. For example ordering a set
of triples can be done by having a tuple for the ordering information.
Note: a tuple in Versa is just a special case subgraph of length 1
Versa also supports a few primitive data types:
* Ordered lists
* Sets
* Associative arrays (a la hashes or Python dictionaries)
* Number type (a la XPath and Versa 1)
* String (Unicode) type (a la XPath and Versa 1)
* Boolean type (a la XPath and Versa 1)
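A minimal sketch of the data model described above, in Python. This is only an illustration under stated assumptions: the `graph` variable, the `"ordering"` marker, the `tuples_of_length` helper, and the example.org URIs are all hypothetical and not part of any Versa specification.

```python
# Hypothetical sketch: a graph held as a set of n-tuples rather than triples.
graph = set()

# A classic RDF triple is simply a 3-tuple.
t1 = ("http://example.org/entry1", "atom:title", "First post")
t2 = ("http://example.org/entry1", "atom:category", "mytopic")
graph.add(t1)
graph.add(t2)

# A quad (triple plus a named subgraph/context) is a 4-tuple.
graph.add(("http://example.org/entry1", "atom:author", "Uche",
           "http://example.org/graph1"))

# Ordering information is carried by one more tuple that lists
# the other tuples in the intended order (assumed convention).
graph.add(("ordering", (t1, t2)))


def tuples_of_length(g, n):
    """Return the tuples in graph g that have exactly n members."""
    return {t for t in g if len(t) == n}
```

Under this reading, plain RDF data is just the subset of 3-tuples, and richer structures live alongside it in the same set.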
With this in mind the language actually doesn't need to change much at
all. The biggest change is a new syntax proposal for access of n-ary
relationships. Short version:
In Versa 1, a basic traversal expression was:
all() - atom:category -> "mytopic"
However, if you think of an atom category, it's an n-ary relationship:
<entry> - subject ----> "mytopic"
             |
             |
             +- scheme -> http://www.dmoz.org/
Meaning that the category comes from the Open Directory.
You can approximate this in RDF with the usual blank node technique:
<entry> - a:subject -> [bnode] --- rdf:value -> "mytopic"
                          |
                          |
                          +- m:scheme --> http://www.dmoz.org/
In Versa 1 you'd have to do:
(all() - a:category -> *) - a:scheme -> eq(@"http://www.dmoz.org/")
Which is not syntactically horrible, but it does introduce that magic
intermediate object which does not really exist in the underlying
model. In my Versa 2 proposal you can do:
all() - a:category[a:scheme] -> eq(<http://www.dmoz.org/>)
Note: the updated URI literal syntax
Note: Versa 2 still enforces URI for the predicate "axis", including for
accessing parts of an n-ary relationship other than the object.
The nice thing is that this syntax can be used easily to address other
metadata of classic triples, such as confidence and trust assertions,
time/place, general context, etc.
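To make the bracket syntax a bit more concrete, here is a rough Python sketch of how such a traversal might be evaluated. It is not the Versa implementation: the `Relationship` class, the `select_subjects` function, and the entry names are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """Hypothetical n-ary relationship: a triple plus named extra parts."""
    subject: str
    predicate: str
    obj: object
    params: dict = field(default_factory=dict)


def select_subjects(rels, predicate, test, param=None):
    """Return subjects having a `predicate` relationship whose object
    (or whose named part `param`, when given) passes `test`."""
    matches = set()
    for r in rels:
        if r.predicate != predicate:
            continue
        if param is None:
            value = r.obj
        elif param in r.params:
            value = r.params[param]
        else:
            continue  # this relationship has no such part
        if test(value):
            matches.add(r.subject)
    return matches


rels = [
    Relationship("entry1", "a:category", "mytopic",
                 {"a:scheme": "http://www.dmoz.org/"}),
    Relationship("entry2", "a:category", "mytopic",
                 {"a:scheme": "http://example.org/tags/"}),
    Relationship("entry3", "a:category", "othertopic"),
]

# Roughly: all() - a:category -> eq("mytopic")
by_object = select_subjects(rels, "a:category", lambda v: v == "mytopic")

# Roughly: all() - a:category[a:scheme] -> eq(<http://www.dmoz.org/>)
by_scheme = select_subjects(rels, "a:category",
                            lambda v: v == "http://www.dmoz.org/",
                            param="a:scheme")
```

The scheme is reached directly from the relationship, with no intermediate object; confidence, time or context could be looked up the same way as further named parts.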
Thoughts?
--
Uche Ogbuji http://uche.ogbuji.net
Founding Partner, Zepheira http://zepheira.com
Linked-in profile: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
http://wiki.xml3k.org/Versa/Boulder_vice
Comments welcome. I'm excited enough about Boulder Vice Versa that I'll
probably start noodling on experimental implementation, so now would be
a good time to stop me if you think I'm way off the rails ;)
glenn mcdonald wrote:
> Taking this thought a little further, semantically it seems to me that
> what we want to say in the Uche's mass example is that Uche has one
> mass, and that mass has various quantifications.
I think the idea of quantification is not general enough, and for this
specific case I wonder whether it's a natural concept.
> This is even clearer
> if we think about following this information over time. In that case
> Uche has some number of mass-measurements, each of which can have
> various quantifications. But if you do this with n-ary relationships
> you might be tempted to say
>
> Uche mass (value 95, unit KGs, date 2008-07-13)
>
> This is confused, though. Unit modifies value, but not date.
No confusion at all. Unit *and* date both modify the relationship
("mass"). This is analogous to the XML:
<mass unit="KG" date="2008-07-13">95</mass>
It's pretty well understood in XML design, and in how link parameters
work in HyTime, etc., that attributes should *never* modify each other.
They only modify the element (the link, etc.).
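The XML analogy can be checked with Python's standard `xml.etree.ElementTree`. This is just a small sketch; the variable names are illustrative. Both attributes hang off the element and neither is nested under the other.

```python
import xml.etree.ElementTree as ET

# Parse the element from the analogy above.
elem = ET.fromstring('<mass unit="KG" date="2008-07-13">95</mass>')

relationship = elem.tag        # the relationship itself
value = float(elem.text)       # the "object" of the relationship
modifiers = dict(elem.attrib)  # unit and date, siblings on the element
```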
> Date
> applies to the combination of value and unit, not either by
> themselves, and really neither unit nor date directly modify Uche.
> Also, if we then add these three:
>
> Uche mass (value 209.44, unit LBs, date 2008-07-13)
> Uche mass (value 90, unit KGs, date 2008-07-18)
> Uche mass (value 198.42, unit LBs, date 2008-07-18)
> glenn mass (value 90, unit KGs, date 2003-06-04)
>
> Do we have five measurements or three? (Or two, or four?) How many
> masses?
>
You ask an ambiguous question, even if we omit the computer language
consideration. If you take a US classroom ruler, measure across your
palm, and get "3 inches", then flip the ruler over, measure the
same palm, and get "7.5cm", do you have one or two measurements? That
depends on circumstances and conventions and individual tendencies.
The point is that in your above construction, I know clearly how to
construct a query for anything I want to know. So for example, if I
want a time series of measurements, regardless of their units, that's
easy. So I don't see a problem.
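As a rough illustration of the time-series query mentioned above, here is a Python sketch over the five quoted statements. The tuple layout and the `time_series` and `measurement_dates` helpers are assumptions for this example only, not Versa syntax.

```python
# Each n-ary relationship: (subject, predicate, value, other parts).
masses = [
    ("Uche", "mass", 95, {"unit": "KGs", "date": "2008-07-13"}),
    ("Uche", "mass", 209.44, {"unit": "LBs", "date": "2008-07-13"}),
    ("Uche", "mass", 90, {"unit": "KGs", "date": "2008-07-18"}),
    ("Uche", "mass", 198.42, {"unit": "LBs", "date": "2008-07-18"}),
    ("glenn", "mass", 90, {"unit": "KGs", "date": "2003-06-04"}),
]


def time_series(rels, subject, predicate):
    """All (date, value, unit) rows for a subject, ordered by date."""
    rows = [(parts["date"], value, parts["unit"])
            for subj, pred, value, parts in rels
            if subj == subject and pred == predicate]
    return sorted(rows)


def measurement_dates(rels, subject, predicate):
    """Distinct dates on which the subject has a matching relationship."""
    return sorted({date for date, _, _ in time_series(rels, subject, predicate)})


uche_series = time_series(masses, "Uche", "mass")
uche_dates = measurement_dates(masses, "Uche", "mass")
```

Whether one counts rows or distinct dates is a choice made at query time, which is the point: the data does not have to settle "how many measurements" in advance.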
> So ultimately it seems preferable, to me, to go ahead and expand the
> schema to have measurements and masses:
>
Err, what schema?
I think XPath is just one example of my very strong belief that a query
or information modeling language should never require a schema.
> Uche measurement measurement1
> measurement1 date 2008-07-13
> measurement1 mass mass1
> mass1 inKGs 95
>
> Thus the four new statements can be incorporated sensibly:
>
> mass1 inLBs 209.44
> Uche measurement measurement2
> measurement2 date 2008-07-18
> measurement2 mass mass2
> mass2 inKGs 90
> mass2 inLBs 205.03
> glenn measurement measurement3
> measurement3 date 2003-06-04
> measurement3 mass mass2
>
Basically artificial, reified blank nodes. If these are not in the
original information model (and I doubt many life-like info models would
use such reifications) I think they muddle things pretty badly.
> Now we know that we have three measurements, and two masses.
I don't see how you know this any more than in the past example. If the
model with the introduced artificial nodes truly matches the original
information space, then it should have the same knowledge content.
Otherwise you've basically distorted the original. This is precisely
what I'm trying to avoid.
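A small Python sketch of what the flattening involves. The `flatten` function, the `_:bN` node labels, and the use of `rdf:value` for the object are assumptions chosen to mirror the usual blank node technique, not a prescribed mapping.

```python
import itertools


def flatten(rel, counter):
    """Turn one n-ary relationship into a cascade of triples by
    introducing a generated node that the original did not contain."""
    subject, predicate, value, parts = rel
    bnode = f"_:b{next(counter)}"
    triples = [(subject, predicate, bnode), (bnode, "rdf:value", value)]
    triples += [(bnode, key, val) for key, val in sorted(parts.items())]
    return triples


counter = itertools.count(1)
rel = ("Uche", "mass", 95, {"unit": "KGs", "date": "2008-07-13"})
triples = flatten(rel, counter)

# Every resulting triple mentions the generated node.
artificial = [t for t in triples if "_:b1" in t]
```

One statement in the original becomes four triples, each of which refers to a node that exists only because of the encoding.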
> And we
> can more cleanly think about whether two masses are the same, and how
> the inLBs relationship is a property of a mass that can be derived
> from its inKGs property, rather than inLBs or inKGs (or unit) being a
> property of a person.
>
Again I don't see added clarity. Elaboration does not necessarily lead
to clarity.
> Incidentally, though, I totally support the idea that a usable data
> model should have built-in support for relationships with multiple
> targets. I should be able to say
>
> Uche measurement [measurement1, measurement2]
>
> without having to intermingle syntactic index-number stuff with my
> actual data.
>
I think the query data model should be close enough to natural
expressivity that there is less chance it requires artifice for
a faithful representation. I strongly think that this is impossible
without first-class N-ary relationships. I've thought that since my
first days of trying to use RDF for modeling in 2000 or so (and yes I
and others argued then that RDF really needed to fix its expressional
limitations), and I personally see many of RDF's problems coupled to its
insistence on an artificially simplified model. That's why I'd like to
avoid such limitations with Versa. BVV can query an RDF-type model just
fine, but it can also query richer models just fine, without having to
subject them to pre-computational distortion.
This really doesn't make sense to me at all, especially the last
sentence. A relationship is an abstract concept that can be expressed
in many ways, including using SQL idiom (i.e. tables, normalization,
etc.) It can be expressed in an infinite number of ways within that
idiom, and it can be expressed in an infinite number of other idioms.
> So it's not, I think, just a question of whether the complexity is
> pushed down into the data model or up into the query language, it's a
> question of whether we're actually going to model the true nature of
> the data, rather than just particular flattened views of it!
>
Exactly, but I think that by adding artificial nodes to turn an N-ary
relationship into a cascade of triples, you're flattening it. So
apparently we both want the most natural expression, and yet we come up
with starkly different approaches for doing so. That doesn't surprise
me one bit.