Radical Versa 2.0 proposal

Uche Ogbuji

Jun 23, 2008, 5:29:59 PM
to versa...@googlegroups.com
So far the Versa 2.0 efforts have been focused on largely incremental
improvements to Versa 1.0 to fill out capability gaps and such. I must
admit that one of the things that has slowed me down on Versa 2.0
(though I haven't always really clearly understood this reason) is that
I think Versa needs more than that. I think it needs a fresh re-think,
from the reason for being all the way through language design.

So I've given it a lot of thought and I have a proposal. Please give it
some thought and let me know what you think.

I'd like to reposition Versa as more of a general-purpose Web data query
language. Certainly it should always be as compatible as possible
with RDF, but I'd like to give it a richer domain, and a simpler,
well-bounded data model. Several reasons for this:

1) So much of what's happening with regard to Web data, whether we like
it or not, is not happening directly in RDF. A good example is social
networking relationships. Sure the LOD community [1] is trying hard to
keep up with all that work, but I'm not convinced they'll always be able
to do so. It would be nice to generalize the idea of Web data query to
address this more directly.
2) Some of the stuff happening in more general Web data does not fit the
RDF model well as is, anyway. n-ary relationships and ordered
information are a great example. RDF is frankly pretty cumbersome in
the way it captures such constructs, and it would be nice not to have to
wrangle with RDF-isms while trying to get to the essence of the data.
I'd like Versa to be able to handle such constructs more directly.
3) SPARQL, like it or not, has official dress as the RDF query
language, and I think it would be productive to reduce the head-to-head
positioning.

The actual proposal is to define Versa 2.0 as a lightweight data model,
based on RDF, but designed for a bit more expressiveness, as mentioned
above. The sketch of the data model is:

The graph is represented as a set of n-tuples, not triples. This
accommodates RDF as well as all the various RDF subgraph (quad)
proposals, and a more general view as well. For example ordering a set
of triples can be done by having a tuple for the ordering information.

Note: a tuple in Versa is just a special case subgraph of length 1

Versa also supports a few primitive data types:

* Ordered lists
* Sets
* Associative arrays (a la hashes or Python dictionaries)
* Number type (a la XPath and Versa 1)
* String (Unicode) type (a la XPath and Versa 1)
* Boolean type (a la XPath and Versa 1)
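
As a rough illustration of the data model sketched above (not part of the proposal itself), the graph can be pictured in Python as a set of tuples of varying length. The example URIs and the names `graph`, `ORDER`, `t1` and `t2` are assumptions made up for this sketch.

```python
# Hypothetical sketch: a Versa 2 graph pictured as a set of n-tuples.
# A classic RDF triple is a 3-tuple; a quad adds a fourth slot (e.g. a context).
ENTRY = "http://example.org/entry/1"          # made-up URIs for illustration
CATEGORY = "http://example.org/ns#category"
TITLE = "http://example.org/ns#title"
CONTEXT = "http://example.org/graphs/feed"
ORDER = "http://example.org/ns#order"

t1 = (ENTRY, CATEGORY, "mytopic")             # triple
t2 = (ENTRY, TITLE, "Hello", CONTEXT)         # quad

graph = {t1, t2}

# Ordering a set of tuples by adding one more tuple that carries the order.
# Tuples of strings are hashable, so they can be nested inside another tuple.
graph.add((ORDER, t1, t2))

# The primitive types listed above have close Python analogues:
# ordered list -> list, set -> set, associative array -> dict,
# number -> float, string -> str, boolean -> bool.
```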

With this in mind the language actually doesn't need to change much at
all. The biggest change is a new syntax proposal for access of n-ary
relationships. Short version:

In Versa 1, a basic traversal expression was:

all() - atom:category -> "mytopic"

However, if you think of an atom category, it's an n-ary relationship:

<entry> - subject ----> "mytopic"
              |
              |
              +- scheme -> http://www.dmoz.org/

Meaning that the category comes from the Open Directory

You can approximate this in RDF with the usual blank node technique:

<entry> - a:subject -> [bnode] --- rdf:value -> "mytopic"
                          |
                          |
                          +- m:scheme --> http://www.dmoz.org/

In Versa 1 you'd have to do:


(all() - a:category -> *) - a:scheme -> eq(@"http://www.dmoz.org/")


Which is not syntactically horrible, but it does introduce that magic
intermediate object which does not really exist in the underlying
model. In my Versa 2 proposal you can do:

all() - a:category[a:scheme] -> eq(<http://www.dmoz.org/>)

Note: the updated URI literal syntax
Note: Versa 2 still enforces URI for the predicate "axis", including for
accessing other parts of an n-ary relationship than the object.

The nice thing is that this syntax can be used easily to address other
metadata of classic triples, such as confidence and trust assertions,
time/place, general context, etc.
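
To make the idea concrete, here is a minimal Python sketch of how a bracketed slot access such as a:category[a:scheme] might be evaluated. It is only an illustration under stated assumptions: the `Rel` type, the `traverse` helper and the `entry1`..`entry3` identifiers are hypothetical, and the sketch returns the matching relationships rather than claiming what the real expression would yield.

```python
from typing import Callable, NamedTuple

class Rel(NamedTuple):
    """One n-ary relationship: the classic triple plus extra named slots."""
    subject: str
    predicate: str
    obj: str
    extra: dict  # extra slots keyed by predicate-style names, e.g. "a:scheme"

def traverse(rels, predicate: str, slot: str, test: Callable[[str], bool]):
    """Keep relationships with the given predicate whose `slot` passes `test`."""
    return [r for r in rels
            if r.predicate == predicate
            and slot in r.extra
            and test(r.extra[slot])]

rels = [
    Rel("entry1", "a:category", "mytopic", {"a:scheme": "http://www.dmoz.org/"}),
    Rel("entry2", "a:category", "mytopic", {"a:scheme": "http://example.org/tags/"}),
    Rel("entry3", "a:title", "Hello", {}),
]

# Rough analogue of: all() - a:category[a:scheme] -> eq(<http://www.dmoz.org/>)
matches = traverse(rels, "a:category", "a:scheme",
                   lambda v: v == "http://www.dmoz.org/")
```

The same helper would work for other slots (confidence, time, context) simply by passing a different slot name and test.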

Thoughts?

--
Uche Ogbuji http://uche.ogbuji.net
Founding Partner, Zepheira http://zepheira.com
Linked-in profile: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/

Uche Ogbuji

Jul 13, 2008, 3:13:08 PM
to versa...@googlegroups.com
I gave my Versa 2.0 proposal a name and consolidated information on the
idea:

http://wiki.xml3k.org/Versa/Boulder_vice

Comments welcome. I'm excited enough about Boulder Vice Versa that I'll
probably start noodling on experimental implementation, so now would be
a good time to stop me if you think I'm way off the rails ;)

glenn mcdonald

Jul 18, 2008, 11:53:30 PM
to Versa Query
Supporting n-ary relationships directly in the query language is an
interesting idea, but how much better is it, overall, than turning
them into intermediate nodes in a pure binary (that is, triples)
model? So instead of

(subject=<http://purl.org/person/uche>, rel=m:mass, value=95,
u:unit=<http://purl.org/unit/kg>)

you'd do

(subject=<http://purl.org/person/uche>, rel=m:mass, <_1>),
(subject=<_1>, rel=s:scalar, value=95),
(subject=<_1>, rel=u:unit, value=<http://purl.org/unit/kg>)

or possibly

(subject=<http://purl.org/person/uche>, rel=m:mass, <_1>),
(subject=<_1>, rel=u:in_kgs, value=95)

Either way the data-meta-model is simpler, the query language doesn't
need to "pivot", and all data elements are equally addressable. One
way in which this approach holds up where your n-ary extension breaks
down is if the structure has more than one extra layer. In your
examples, the units are metadata on the scalar mass, and the scheme is
metadata on the category, but what if you needed to represent meta-
meta-data on the meta-data, like a confidence score for whether the
units were actually kilograms, or a date on which the category was in
the given scheme?
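
A minimal Python sketch of the conversion described above, turning a nested n-ary description into plain triples with generated intermediate nodes. The `binarize` and `fresh_ids` helpers and the `c:confidence` predicate are hypothetical names used only for illustration.

```python
import itertools

def fresh_ids():
    """Yield intermediate node names _1, _2, ... (naming is an assumption)."""
    return (f"_{i}" for i in itertools.count(1))

def binarize(subject, rel, value, fresh):
    """Flatten a possibly nested value into (subject, rel, value) triples."""
    if not isinstance(value, dict):
        return [(subject, rel, value)]
    node = next(fresh)                      # one intermediate node per layer
    triples = [(subject, rel, node)]
    for key, inner in value.items():
        triples.extend(binarize(node, key, inner, fresh))
    return triples

UCHE = "http://purl.org/person/uche"
KG = "http://purl.org/unit/kg"

# The mass example, with a confidence score attached to the unit itself
# (metadata on metadata), which simply becomes one more layer of nodes.
mass = {"s:scalar": 95, "u:unit": {"rdf:value": KG, "c:confidence": 0.9}}
triples = binarize(UCHE, "m:mass", mass, fresh_ids())
```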

glenn mcdonald

Jul 25, 2008, 5:21:33 PM
to Versa Query
Taking this thought a little further, semantically it seems to me that
what we want to say in Uche's mass example is that Uche has one
mass, and that mass has various quantifications. This is even clearer
if we think about following this information over time. In that case
Uche has some number of mass-measurements, each of which can have
various quantifications. But if you do this with n-ary relationships
you might be tempted to say

Uche mass (value 95, unit KGs, date 2008-07-13)

This is confused, though. Unit modifies value, but not date. Date
applies to the combination of value and unit, not either by
themselves, and really neither unit nor date directly modify Uche.
Also, if we then add these four:

Uche mass (value 209.44, unit LBs, date 2008-07-13)
Uche mass (value 90, unit KGs, date 2008-07-18)
Uche mass (value 198.42, unit LBs, date 2008-07-18)
glenn mass (value 90, unit KGs, date 2003-06-04)

Do we have five measurements or three? (Or two, or four?) How many
masses?

So ultimately it seems preferable, to me, to go ahead and expand the
schema to have measurements and masses:

Uche measurement measurement1
measurement1 date 2008-07-13
measurement1 mass mass1
mass1 inKGs 95

Thus the four new statements can be incorporated sensibly:

mass1 inLBs 209.44
Uche measurement measurement2
measurement2 date 2008-07-18
measurement2 mass mass2
mass2 inKGs 90
mass2 inLBs 198.42
glenn measurement measurement3
measurement3 date 2003-06-04
measurement3 mass mass2

Now we know that we have three measurements, and two masses. And we
can more cleanly think about whether two masses are the same, and how
the inLBs relationship is a property of a mass that can be derived
from its inKGs property, rather than inLBs or inKGs (or unit) being a
property of a person.
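
For illustration, the measurement/mass statements above can be loaded as plain triples in Python and counted directly, with the pounds figure derived from kilograms. This is only a sketch: the conversion factor is approximate and the `in_lbs` helper is a hypothetical name.

```python
LB_PER_KG = 2.20462  # approximate conversion factor

triples = [
    ("Uche", "measurement", "measurement1"),
    ("measurement1", "date", "2008-07-13"),
    ("measurement1", "mass", "mass1"),
    ("mass1", "inKGs", 95),
    ("Uche", "measurement", "measurement2"),
    ("measurement2", "date", "2008-07-18"),
    ("measurement2", "mass", "mass2"),
    ("mass2", "inKGs", 90),
    ("glenn", "measurement", "measurement3"),
    ("measurement3", "date", "2003-06-04"),
    ("measurement3", "mass", "mass2"),
]

# Distinct measurement nodes and distinct mass nodes in the data.
measurements = {o for s, p, o in triples if p == "measurement"}
masses = {o for s, p, o in triples if p == "mass"}

def in_lbs(mass):
    """Derive the pounds value of a mass node from its inKGs statement."""
    kgs = next(o for s, p, o in triples if s == mass and p == "inKGs")
    return round(kgs * LB_PER_KG, 2)
```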


Incidentally, though, I totally support the idea that a usable data
model should have built-in support for relationships with multiple
targets. I should be able to say

Uche measurement [measurement1, measurement2]

without having to intermingle syntactic index-number stuff with my
actual data.

glenn mcdonald

Jul 29, 2008, 9:41:18 AM
to Versa Query
Not to have this conversation entirely with myself, but I feel like I
should note that I realize this isn't a new subject. The most obvious
current reference is http://www.w3.org/TR/swbp-n-aryRelations/ ,
although that describes *methods* for turning n-ary relationships into
binary, given that RDF is binary, rather than making any kind of case
for why this might be desirable in general.

My feeling, after doing a lot of binary-relationship modeling, is that
a lot of the time the "obvious" n-ary relationship is actually an
artifact of a particular limited perspective. The example of Uche's
mass makes sense if you're mainly only thinking about the data from
the perspective of Uche (that is, of people). It makes less sense if
you try to turn the data around and think about it from the
perspective of masses, or of measurement dates. In fact, defining that
data as a n-ary relationship amounts to defining it as a denormalized
table!

So it's not, I think, just a question of whether the complexity is
pushed down into the data model or up into the query language, it's a
question of whether we're actually going to model the true nature of
the data, rather than just particular flattened views of it!

Uche Ogbuji

Jul 30, 2008, 1:06:12 AM
to versa...@googlegroups.com
Yeah, I really think this would benefit from broader discussion. For
one thing these are more aesthetic than scientific matters, and that's
why I don't really believe one query language will ever fit all.

glenn mcdonald wrote:
> Taking this thought a little further, semantically it seems to me that
> what we want to say in Uche's mass example is that Uche has one
> mass, and that mass has various quantifications.

I think the idea of quantification is not general enough, and for this
specific case I wonder whether it's a natural concept.

> This is even clearer
> if we think about following this information over time. In that case
> Uche has some number of mass-measurements, each of which can have
> various quantifications. But if you do this with n-ary relationships
> you might be tempted to say
>
> Uche mass (value 95, unit KGs, date 2008-07-13)
>
> This is confused, though. Unit modifies value, but not date.

No confusion at all. Unit *and* date both modify the relationship
("mass"). This is analogous to the XML:

<mass unit="KG" date="2008-07-13">95</mass>

It's pretty well understood in XML design, and in how link parameters
work in HyTime, etc., that attributes should *never* modify each other.
They only modify the element (the link, etc.).

> Date
> applies to the combination of value and unit, not either by
> themselves, and really neither unit nor date directly modify Uche.
> Also, if we then add these four:
>
> Uche mass (value 209.44, unit LBs, date 2008-07-13)
> Uche mass (value 90, unit KGs, date 2008-07-18)
> Uche mass (value 198.42, unit LBs, date 2008-07-18)
> glenn mass (value 90, unit KGs, date 2003-06-04)
>
> Do we have five measurements or three? (Or two, or four?) How many
> masses?
>

You ask an ambiguous question, even if we omit the computer language
consideration. If you take a US classroom ruler, measure your palm
across, and get "3 inches", then flip the ruler over and measure the
same palm and get "7.5cm", do you have one or two measurements? That
depends on circumstances and conventions and individual tendencies.

The point is that in your above construction, I know clearly how to
construct a query for anything I want to know. So for example, if I
want a time series of measurements, regardless of their units, that's
easy. So I don't see a problem.
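
As a hedged sketch of that kind of query, here is the quoted n-ary data held as Python tuples with a dictionary for the extra slots, and a time series pulled out regardless of unit. The `records` layout and the `time_series` helper are assumptions for illustration, not BVV syntax.

```python
# Each record: (subject, relationship, extra slots of the n-ary relationship)
records = [
    ("Uche", "mass", {"value": 95, "unit": "KGs", "date": "2008-07-13"}),
    ("Uche", "mass", {"value": 209.44, "unit": "LBs", "date": "2008-07-13"}),
    ("Uche", "mass", {"value": 90, "unit": "KGs", "date": "2008-07-18"}),
    ("Uche", "mass", {"value": 198.42, "unit": "LBs", "date": "2008-07-18"}),
    ("glenn", "mass", {"value": 90, "unit": "KGs", "date": "2003-06-04"}),
]

def time_series(subject, rel="mass"):
    """Return (date, value, unit) rows for a subject, ordered by date."""
    rows = [(slots["date"], slots["value"], slots["unit"])
            for subj, r, slots in records
            if subj == subject and r == rel]
    return sorted(rows)  # ISO dates sort chronologically as strings

series = time_series("Uche")
```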


> So ultimately it seems preferable, to me, to go ahead and expand the
> schema to have measurements and masses:
>

Err, what schema?

I think XPath is just one example of my very strong belief that a query
or information modeling language should never require a schema.

> Uche measurement measurement1
> measurement1 date 2008-07-13
> measurement1 mass mass1
> mass1 inKGs 95
>
> Thus the four new statements can be incorporated sensibly:
>
> mass1 inLBs 209.44
> Uche measurement measurement2
> measurement2 date 2008-07-18
> measurement2 mass mass2
> mass2 inKGs 90
> mass2 inLBs 198.42
> glenn measurement measurement3
> measurement3 date 2003-06-04
> measurement3 mass mass2
>

Basically artificial, reified blank nodes. If these are not in the
original information model (and I doubt many life-like info models would
use such reifications) I think they muddle things pretty badly.

> Now we know that we have three measurements, and two masses.

I don't see how you know this any more than in the previous example. If the
model with the introduced artificial nodes truly matches the original
information space, then it should have the same knowledge content.
Otherwise you've basically distorted the original. This is precisely
what I'm trying to avoid.


> And we
> can more cleanly think about whether two masses are the same, and how
> the inLBs relationship is a property of a mass that can be derived
> from its inKGs property, rather than inLBs or inKGs (or unit) being a
> property of a person.
>

Again I don't see added clarity. Elaboration does not necessarily lead
to clarity.


> Incidentally, though, I totally support the idea that a usable data
> model should have built-in support for relationships with multiple
> targets. I should be able to say
>
> Uche measurement [measurement1, measurement2]
>
> without having to intermingle syntactic index-number stuff with my
> actual data.
>

I think the query data model should just be close enough to natural
expressivity that there is less chance of it requiring artifice for
a faithful representation. I strongly think that this is impossible
without first-class N-ary relationships. I've thought that since my
first days of trying to use RDF for modeling in 2000 or so (and yes I
and others argued then that RDF really needed to fix its expressional
limitations), and I personally see many of RDF's problems coupled to its
insistence on an artificially simplified model. That's why I'd like to
avoid such limitations with Versa. BVV can query an RDF-type model just
fine, but it can also query richer models just fine, without having to
subject them to pre-computational distortion.

Uche Ogbuji

Jul 30, 2008, 1:12:10 AM
to versa...@googlegroups.com
glenn mcdonald wrote:
> Not to have this conversation entirely with myself, but I feel like I
> should note that I realize this isn't a new subject. The most obvious
> current reference is http://www.w3.org/TR/swbp-n-aryRelations/ ,
> although that describes *methods* for turning n-ary relationships into
> binary, given that RDF is binary, rather than making any kind of case
> for why this might be desirable in general.
>
> My feeling, after doing a lot of binary-relationship modeling, is that
> a lot of the time the "obvious" n-ary relationship is actually an
> artifact of a particular limited perspective. The example of Uche's
> mass makes sense if you're mainly only thinking about the data from
> the perspective of Uche (that is, of people). It makes less sense if
> you try to turn the data around and think about it from the
> perspective of masses, or of measurement dates. In fact, defining that
> data as a n-ary relationship amounts to defining it as a denormalized
> table!
>

This really doesn't make sense to me at all, especially the last
sentence. A relationship is an abstract concept that can be expressed
in many ways, including using SQL idiom (i.e. tables, normalization,
etc.). It can be expressed in an infinite number of ways within that
idiom, and it can be expressed in an infinite number of other idioms.

> So it's not, I think, just a question of whether the complexity is
> pushed down into the data model or up into the query language, it's a
> question of whether we're actually going to model the true nature of
> the data, rather than just particular flattened views of it!
>

Exactly, but I think that by adding artificial nodes to turn an N-ary
relationship into a cascade of triples, you're flattening it. So
apparently we both want the most natural expression, and yet we come up
with starkly different approaches for doing so. That doesn't surprise
me one bit.

glenn mcdonald

Jul 30, 2008, 3:49:14 PM
to Versa Query
If it's just you and me talking, and we immediately agree to differ,
this will be a pretty short conversation. I guess I don't really
believe that the phrase "natural expressivity" has useful meaning in
the context of abstract data modeling. A query language is
inextricable from the data model it interrogates, and if we believe
that the query-language discussion doesn't end with SPARQL, then we
shouldn't have to assume that the data-model discussion ends with RDF,
nor with any other existing thing someone happens to already have.

But it sounds like you're defining BVV's problem space as just the
query language, and thus aren't thinking about proposing changes to
how data is represented. That's your decision, obviously, so I won't
push this point any further here.


Two details, solely for closure:

1. You say:

> You ask an ambiguous question, even if we omit the computer language
> consideration. If you take a US classroom ruler, measure your palm
> across, and get "3 inches", then flip the ruler over and measure the
> same palm and get "7.5cm", do you have one or two measurements?

Turn this around, though. If you do only the first measurement, I
assume we can agree that you have one measurement. We can still
calculate what 3 inches means in centimeters. I want to know the
difference between a) a single measurement with a calculated
equivalent, and b) two direct measurements in two different units.
Thus this is a case, to me, in which the n-ary version loses
information that the binary version can quite naturally represent.


2. I said:

> So ultimately it seems preferable, to me, to go ahead and expand the
> schema to have measurements and masses:


Then you said:

> Err, what schema?

> I think XPath is just one example of my very strong belief that a query
> or information modeling language should never require a schema.

I meant schema in the sense only of the things and relationships you
have in the data. Replace the word "schema" in my sentence with "data"
if the word "schema" is too loaded for you.

As it happens, I'm also generally in favor of schemas, but that's a
separate discussion we haven't even touched on here.


glenn