
To RDF or not to RDF?


Matthew Gertner

Aug 2, 2006, 10:23:36 AM
I've been struggling with the right attitude to take towards RDF in
Mozilla. I'd be very interested in other opinions since I know a lot of
people have strong feelings about this. I would summarize my current
view as follows:

The RDF implementation in Mozilla has a couple of major issues. One is
that the API is very verbose, especially in C++. The other is that it
has been used exclusively where other, much simpler approaches would
have been more appropriate (contents.rdf being an excellent example of
where this has already been fixed). Seen in this light, it's easy to
understand why people hear "RDF" and go "oh, yuk!".

That said, the idea of a general-purpose data model is quite compelling.
We have many use cases where we want to be able to do "stuff" to some piece
of domain-specific data. Examples: map it into SQL tables, serialize it
as/parse it from XML, display it as HTML/XUL.

Certainly the RDF implementation in Mozilla is not sufficient for these
purposes. Most notably, it is lacking the notion of a schema. What we've
done is to describe our RDF data structures using XML schema (RELAX NG
to be specific), with some extensions to add subtyping. In this way, we
can define a data structure for, say, a person with all the appropriate
fields (name, birthdate, favorite color, etc.) and another for, say, a
car. And we can use exactly the same code to serialize instances to XML,
store them in a SQL database, display them in the browser (though this
also requires a template of some sort to do properly), etc.
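To make the idea concrete, here is a purely hypothetical sketch (plain JS objects standing in for the real RELAX NG schemas; none of these names come from our actual code, and XML escaping is ignored for brevity):

// One generic serializer, driven entirely by a schema description.
var personSchema = { name: "person", fields: ["name", "birthdate", "favoriteColor"] };
var carSchema    = { name: "car",    fields: ["make", "model", "year"] };

function toXML(schema, instance) {
  var xml = "<" + schema.name + ">";
  for (var i = 0; i < schema.fields.length; i++) {
    var f = schema.fields[i];
    xml += "<" + f + ">" + instance[f] + "</" + f + ">";
  }
  return xml + "</" + schema.name + ">";
}

// The same code handles both types; only the schema differs:
toXML(personSchema, { name: "Ada", birthdate: "1815-12-10", favoriteColor: "blue" });
toXML(carSchema,    { make: "Skoda", model: "Octavia", year: 2006 });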

Of course, most of this can be accomplished with plain vanilla
XML+schemas. But we would have had to reinvent a lot of the stuff that
is already in RDF: IDs, data sources, a central registry (the RDF
service), an API that includes interfaces for data type primitives, etc.
Besides the aforementioned verbosity of the API (fairly straightforward
to address) and the lack of thread-safety (somewhat less so but still
eminently doable), I can't see any good reason not to use RDF. Certainly
fixing the weakness of the current implementation would be far easier
than building everything from scratch, and you get the benefit of
conforming to a standard (with the faint promise of some sort of
interoperability with other RDF implementations).

So to restate my question more precisely:
1) Do people see a need for the type of thing I describe inside the
Mozilla core?
2) If so, is RDF the way to go and, if it isn't, what other approaches
would be worth looking at?

Benjamin Smedberg

Aug 2, 2006, 11:33:33 AM
Matthew Gertner wrote:

> That said, the idea of a general-purpose data model is quite compelling.
> We have many use cases where we want to be able to do "stuff" to some piece
> of domain-specific data. Examples: map it into SQL tables, serialize it
> as/parse it from XML, display it as HTML/XUL.

The idea of a general-purpose data model is a snare into which many bright
engineers have fallen. Any data model which claims to be able to represent
everything is either completely wrong or hopelessly complex. RDF happens to
be both. Note that I'm not talking about the Mozilla implementation per se,
or the RDF/XML serialization, but the data model itself.

You can map SQL tables to RDF, but what value does that give you over a
simple (custom) XML format (or JSON, or some plaintext format)? Supposedly
it gives you the ability to aggregate data from multiple sources. But it is
painfully evident that simplistic aggregation of RDF doesn't produce useful
results. Typically you're left with mostly disconnected graphs of anonymous
nodes.

To solve this problem, RDF invented abstraction layers to state "this node
is the same as this node"... which is all fine and good, except that it
requires extra intelligence from either the aggregation engine or the client
code.

Then, there is the problem that you may not trust data from some sources. To
solve this problem, RDF variously uses reification and other techniques to
identify data origins.

Eventually, you've constructed a system that could, perhaps, theoretically
contain any data, but is completely unusable by real-world applications.

And we haven't even touched the problem of presenting that data to the user!

I think that any promises of a "general purpose data model" should be
treated with considerable skepticism, and that we should focus our energies
on domain-specific data models (microsummaries and feeds). We should give
data consumers (extensions and applications, and even websites the user has
chosen) the ability to glean information off the web as the user browses.

This really doesn't have anything to do with providing a unified data store
"under the hood" in the mozilla platform, because I haven't seen a good
use-case for doing so. Storing data in SQLite databases, or XML files, or
JSON text files, provides a good storage solution for any of the use cases
I've seen.

Prove me wrong ;-)

--BDS

Matthew Gertner

Aug 5, 2006, 1:33:13 PM
Benjamin,

Thanks for your comments. I actually agree with a lot of what you said.
I think there are several questions here and it's worth dissecting them
and analyzing them in turn.

As far as general-purpose data models are concerned, you may be
conflating two different issues. Leaving aside RDF entirely for a
moment, I certainly believe that models of this type have great value.
Modeling all data structures in an application-specific manner implies
that we must rewrite the same code over and over again to process it,
display it, store it, etc. I would argue that this is the reason that
writing things like Places has proven to be so difficult.

This was a major motivation for the specification of XML. It's nice, for
example, not to have to write lexical analyzers over and over again for
arbitrary serialization formats. My idea is that this type of advantage
can be pushed a lot further with some extensions to XML (as a model),
most notably through the use of schemas. So to answer your question
about SQL, the advantage is that I don't have to write data mapping code
by hand for every type of data structure. Of course, this doesn't have
anything to do with RDF per se, but rather with the use of formal
schemas that can be used to deduce a correct mapping for use by a
generic framework.
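As a (again hypothetical) sketch of what I mean, the same sort of schema description could drive the SQL mapping, so no per-type code gets written by hand:

// Hypothetical: derive the SQL DDL from the same schema object that drives
// XML serialization, instead of hand-writing a mapping per data type.
var personSchema = { name: "person", fields: ["name", "birthdate", "favoriteColor"] };

function toCreateTable(schema) {
  var cols = schema.fields.map(function (f) { return f + " TEXT"; });
  return "CREATE TABLE " + schema.name + " (id INTEGER PRIMARY KEY, " +
         cols.join(", ") + ")";
}

// toCreateTable(personSchema) ->
//   "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT, favoriteColor TEXT)"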

The alternative you propose, to "focus our energies on domain-specific
data models", is the one that I would treat with "considerable
scepticism". We've been down this road before, for example, with HTML,
which became a bloated mess precisely because we were trying to
shoe-horn every possible type of data into it. Once again, XML was
developed to counter this (since it's a metalanguage), and if you look
at what's happening with feeds (to take one of your examples), we're
seeing the same trend with formats like Atom.

My view used to be that we could do all this with plain vanilla XML and
schemas, but I've come around to the view that the model also needs IDs.
In fact, I see a lot of value in RDF's approach to treating resources as
simple labels (URIs), with the actual data residing in one or more data
sources. A central registry like Mozilla's RDF service is also essential.

But should the model actually be according-to-Hoyle RDF? Considering
that we use a smallish subset of RDF with lots of extensions, the answer
to this question is not obvious to me. On the one hand, the hope of
interoperability might lead us to say yes. On the other hand, RDF is too
bloated and ambitious, as you ably argue, so using it might have
implications that would better be avoided. In general I am wary of any
type of design-by-committee "boil the ocean" standard, a category into
which RDF fits neatly.

Interoperability will happen when Mozilla or AllPeers or someone else
comes up with a model, API and implementation that have great practical
value "out of the box" and which evolve based on experience gained by
actually using them. I have absolutely no doubt that we will find a
useful way to model and reuse structured data on the web. Whether RDF
and Mozilla's implementation thereof represent a good starting point was
my original question. I still don't know what the answer is, but it's
certainly telling that no one has stepped up to the plate and advocated
the merits of RDF as a universal data model.

Matt

Benjamin Smedberg

Aug 7, 2006, 10:10:09 AM
Matthew Gertner wrote:

> As far as general-purpose data models are concerned, you may be
> conflating two different issues. Leaving aside RDF entirely for a
> moment, I certainly believe that models of this type have great value.
> Modeling all data structures in an application-specific manner implies
> that we must rewrite the same code over and over again to process it,
> display it, store it, etc. I would argue that this is the reason that
> writing things like Places has proven to be so difficult.

There are perhaps two different ideas in play here. One is a
general-purpose data model that actually represents data. The other is a
common data format that makes it easy to manipulate data (display, store,
process). I believe the former (a universal data representation) is not only
not desirable, but utterly impossible. A standard data representation,
however, is within reach. But I'd say that we already have several good
standard data representation models:

1) XML
2) JS objects
3) sql (sqlite) tables

With the continuing work on E4X, the first two are merging. Instead of
grafting a problematic theoretical data structure like RDF onto our
platform, we should use the existing data formats that we have. For example,
Neil Deakin's work on the XUL templating engine will allow us to plug in
template backends for XML and JS objects, and perhaps even a direct template
engine for SQL. JS serialization is no longer a mysterious art, and XML
serializes itself.
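To make that concrete, a quick sketch (E4X as it works in Gecko 1.8, plus SpiderMonkey's toSource() for plain JS objects):

// E4X makes XML a first-class value in JS:
var who = "Jan";
var person = <person>
               <name>{who}</name>
               <favoriteColor>blue</favoriteColor>
             </person>;

person.name.toString();   // "Jan"
person.toXMLString();     // the XML serializes itself back to markup

// And plain JS objects round-trip easily too:
var obj = { name: "Jan", favoriteColor: "blue" };
var src = obj.toSource();  // '({name:"Jan", favoriteColor:"blue"})'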

> This was a major motivation for the specification of XML. It's nice, for
> example, not to have to write lexical analyzers over and over again for
> arbitrary serialization formats. My idea is that this type of advantage
> can be pushed a lot further with some extensions to XML (as a model),
> most notably through the use of schemas. So to answer your question

Or less complex and theoretical language specifiers like RelaxNG.

> about SQL, the advantage is that I don't have to write data mapping code
> by hand for every type of data structure. Of course, this doesn't have
> anything to do with RDF per se, but rather with the use of formal
> schemas that can be used to deduce a correct mapping for use by a
> generic framework.

The question is, of course, "mapping to what"? It would be pretty easy to
write a little adapter that remaps SQLite (or particular SQLite query
results) as a hierarchical/XML data structure or as JSON.
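Something like this, say (the exact mozStorage method names are from memory, and the table and columns are made up):

// Rough sketch of a SQLite-to-JS adapter using mozStorage:
const Cc = Components.classes, Ci = Components.interfaces;

function peopleAsObjects(dbFile /* nsIFile */) {
  var storage = Cc["@mozilla.org/storage/service;1"]
                  .getService(Ci.mozIStorageService);
  var conn = storage.openDatabase(dbFile);
  var stmt = conn.createStatement("SELECT id, name, favorite_color FROM people");

  var rows = [];
  while (stmt.executeStep()) {
    rows.push({ id: stmt.getInt32(0),
                name: stmt.getUTF8String(1),
                favoriteColor: stmt.getUTF8String(2) });
  }
  stmt.reset();
  return rows;   // hand this to toSource()/a JSON serializer, or build XML from it
}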

> The alternative you propose, to "focus our energies on domain-specific
> data models", is the one that I would treat with "considerable
> scepticism". We've been down this road before, for example, with HTML,
> which became a bloated mess precisely because we were trying to
> shoe-horn every possible type of data into it. Once again, XML was
> developed to counter this (since it's a metalanguage), and if you look
> at what's happening with feeds (to take one of your examples), we're
> seeing the same trend with formats like Atom.

These are all domain-specific data formats. They are made extensible through
careful design and the use of namespaces. But they don't pretend to be able
to represent or model everything. This is precisely their strength, that
they can focus on the domain they're solving.

--BDS

Neil Deakin

Aug 7, 2006, 12:45:06 PM
Benjamin Smedberg wrote:

> 1) XML
> 2) JS objects
> 3) sql (sqlite) tables
>
> With the continuing work on E4X, the first two are merging. Instead of
> grafting a problematic theoretical data structure like RDF onto our
> platform, we should use the existing data formats that we have. For
> example, Neil Deakin's work on the XUL templating engine will allow us
> to plug in template backends for XML and JS objects, and perhaps even a
> direct template engine for SQL. JS serialization is no longer a
> mysterious art, and XML serializes itself.
>

I should point out though that none of the three "data models" you
list above can be used for most usages of templates.

/ Neil

Vladimir Vukicevic

Aug 7, 2006, 3:28:07 PM
Neil wrote:
> Benjamin Smedberg wrote:
>
>> 1) XML
>> 2) JS objects
>> 3) sql (sqlite) tables
>
> I should point out though that none of the three "data models" you
> list above can be used for most usages of templates.

Hmm, why not? I would have thought that with a custom Query Processor you
could use any of the above to generate query results. We don't have query
processors (or query languages) designed for the above, but it should be
possible, no?
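Something hand-wavy like the following, say -- this isn't the real template interface, just the shape of the idea for a JS-object backend:

// Hypothetical pluggable backend: compile a "query" and generate results
// from a datasource that is just an array of JS objects.
var jsObjectQueryProcessor = {
  compileQuery: function (queryText) {
    return { property: queryText };   // here a query is just a property name
  },
  generateResults: function (datasource, query) {
    // one result per object that has the queried property
    return datasource.filter(function (item) {
      return query.property in item;
    });
  }
};

var results = jsObjectQueryProcessor.generateResults(
  [{ name: "Ada" }, { name: "Alan" }, { model: "Octavia" }],
  jsObjectQueryProcessor.compileQuery("name"));   // -> the two person objects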

- Vlad


Axel Hecht

Aug 8, 2006, 1:51:52 PM

Well, in comparing data models, you're comparing one general data model to
2.5 others. On that basis, XML/JSON as serialization formats for 1.5 trees
have the very same limitations in terms of types as directed graphs do, as
does SQL as a query language for a flock of tables. Schemas are just no fun;
that's why nobody uses them.

One weakness that is unique to RDF, though, is the lack of a clear
specification of what aggregation really is. I think that the scheme
implemented in Mozilla, which explicitly aggregates while preserving the
individual data sources, is the right way to fix that. In the spec,
aggregation is probably closer to a bug, but in an implementation it
can be a strong feature.

On the data model itself, RDF can represent arbitrary graphs, including
cycles and shared branches, which are hard to model with tree-based data
models. As such, it does have a valid place in the world.

I do think that there are real-life use cases for aggregation, too.
Extension metadata would be one, if we didn't have that broken.

The other upcoming use case I see is an intermediate format between
parsers of microformats and users (in the sense of extensions) of the
metadata. I think that the RDF model comes naturally when adding metadata
to document fragments. That does of course require an API that people
want to use; I guess I should escalate one of my mails from brendan and
shaver to .jsengine.

On a general note, I'm not sure why there seems to be this fuzzy angst
about anonymous nodes. They're just 'things'.

On an API level, I think that IDL doesn't lend itself to describing
pleasant query APIs for any data model. Think DOM, and compare it to E4X.
The same problem comes up with our existing RDF API and the proposals I
have seen so far, and with the mozStorage API. My sense is that js2 in
particular will want APIs that are tailored to that language and blend in
nicely, not something as strongly typed and object-bound as IDL.

Axel

Matthew Gertner

Aug 9, 2006, 10:35:29 AM
Benjamin Smedberg wrote:
> There are perhaps two different ideas in play here. One is a
> general-purpose data model that actually represents data. The other is
> a common data format that makes it easy to manipulate data (display,
> store, process). I believe the former (a universal data representation)
> is not only not desirable, but utterly impossible. A standard data
> representation, however, is within reach. But I'd say that we already
> have several good standard data representation models:
>
> 1) XML
> 2) JS objects
> 3) sql (sqlite) tables
>
> With the continuing work on E4X, the first two are merging. Instead of
> grafting a problematic theoretical data structure like RDF onto our
> platform, we should use the existing data formats that we have. For
> example, Neil Deakin's work on the XUL templating engine will allow us
> to plug in template backends for XML and JS objects, and perhaps even a
> direct template engine for SQL. JS serialization is no longer a
> mysterious art, and XML serializes itself.

Without schemas, you can't e.g. map XML into SQL tables without writing
specific code for every data format. And once you add schemas, you've
adopted a data model with similar expressive power to RDF (or so it
appears to me). The main attraction of RDF for our application is that
it adds the notion of unique IDs, which is essential if you're going to
track things as they move across the wire, into/out of the database, etc.

The bottom line is that you can't achieve generic processing without
schemas, whether you're representing your data as JS objects, SQL, XML,
RDF or whatever. So am I interpreting your stance correctly as being
against use of schemas to describe data formats and attempting to
perform generic processing on data by exploiting the information in
these schemas?

>> This was a major motivation for the specification of XML. It's nice,
>> for example, not to have to write lexical analyzers over and over
>> again for arbitrary serialization formats. My idea is that this type
>> of advantage can be pushed a lot further with some extensions to XML
>> (as a model), most notably through the use of schemas. So to answer
>> your question
>
>
> Or less complex and theoretical language specifiers like RelaxNG.

Note that I am using "schemas" to refer to any XML vocabulary
description language, not W3C XML Schema specifically. We use RelaxNG.

>
> The question is, of course, "mapping to what"? It would be pretty easy
> to write a little adapter that remaps SQLite (or particular SQLite query
> results) as a hierarchical/XML data structure or as JSON.

Yes, but the reverse is impossible without XML schemas. SQL databases
have a schema, which is why your "little adapter" is possible.

> These are all domain-specific data formats. They are made extensible
> through careful design and the use of namespaces. But they don't pretend
> to be able to represent or model everything. This is precisely their
> strength, that they can focus on the domain they're solving.

"Careful design and use of namespaces" are there to do precisely that:
enable e.g. Atom feeds to contain any sort of data. To the extent that
this data is described by a microformat or some other type of data
definition language, I still fail to see why RDF is any different/worse.
Perhaps the RDF crowd has been more vocal about their lofty ambitions,
but the XML PSVI was devised to do exactly the same thing: enable
universal data representation and generic processing.

Matt

Benjamin Smedberg

Aug 9, 2006, 11:05:56 AM
to Matthew Gertner
Matthew Gertner wrote:

> Without schemas, you can't e.g. map XML into SQL tables without writing
> specific code for every data format. And once you add schemas, you've
> adopted a data model with similar expressive power to RDF (or so it
> appears to me). The main attraction of RDF for our application is that
> it adds the notion of unique IDs, which is essential if you're going to
> track things as they move across the wire, into/out of the database, etc.
>
> The bottom line is that you can't achieve generic processing without
> schemas, whether you're representing your data as JS objects, SQL, XML,
> RDF or whatever. So am I interpreting your stance correctly as being
> against use of schemas to describe data formats and attempting to
> perform generic processing on data by exploiting the information in
> these schemas?

Yes. I believe that any attempt to perform generic processing of data is
misguided. Generic adapters can be used to go from specific formats to
generic formats (SQL to XML), and custom glue code which has domain-specific
knowledge should do the reverse. That doesn't mean that the code has to be
complex: a little E4X and/or XPath can make the code expressive and simple
(assuming a simple and straightforward data model to begin with).
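For example, pulling a few fields out of an (invented) address-book document with XPath is only a handful of lines of domain-specific glue:

// Domain-specific glue: extract the fields we care about from an XML
// document (the "person"/"name"/"city" vocabulary is made up for the example).
function extractPeople(doc) {
  var result = doc.evaluate("//person", doc, null,
                            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var people = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    var node = result.snapshotItem(i);
    people.push({
      name: node.getElementsByTagName("name")[0].textContent,
      city: node.getElementsByTagName("city")[0].textContent
    });
  }
  return people;   // ready to bind into SQL INSERTs or display
}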

> "Careful design and use of namespaces" are there to do precisely that:
> enable e.g. Atom feeds to contain any sort of data. To the extent that
> this data is described by a microformat or some other type of data
> definition language, I still fail to see why RDF is any different/worse.

The point is that Atom is focused (for the most part) on solving a
particular domain of problems having to do with syndication of data, and
solving it well. They have defined semantics for what processors are
supposed to do when they encounter an Atom extension they don't know how to
process.

--BDS

Lev Serebryakov

Aug 9, 2006, 1:28:47 PM
to Neil Deakin, dev-pl...@lists.mozilla.org
Hello Neil,

Monday, August 7, 2006, 8:45:06 PM, you wrote:

ND> I should point out though that none of the three "data models" you
ND> list above can be used for most usages of templates.
And this is a pity, if we are speaking about Mozilla as a platform (yes, I keep singing the same song here, because Firefox 1.5 is already a good browser, and the interesting part is the new system-independent platform, not based on Java or .NET).
And the typical desktop application author is used to SQL. Yes, SQL (err, read "relational data model" where I say SQL) is not ideal for large trees, but people are used to Object -> Relation mapping in many, many, many cases.
It seems, in my experience, that creating a custom tree view is simpler than mapping data to RDF (for use with templates) :(

--
Best regards,
Lev mailto:l...@serebryakov.spb.ru

Neil Deakin

Aug 9, 2006, 1:52:41 PM
Vladimir Vukicevic wrote:
> Neil wrote:
>> Benjamin Smedberg wrote:
>>
>>> 1) XML
>>> 2) JS objects
>>> 3) sql (sqlite) tables
>>
>> I should point out though that none of the three "data models" you
>> list above can be used for most usages of templates.

I should clarify that I meant most usages of templates in Mozilla code,
not most usages in general.

Neil

Matthew Gertner

Aug 12, 2006, 2:24:12 AM
to Benjamin Smedberg
Benjamin Smedberg wrote:
> Yes. I believe that any attempt to perform generic processing of data is
> misguided. Generic adapters can be used to go from specific formats to
> generic formats (SQL to XML), and custom glue code which has
> domain-specific knowledge should do the reverse. That doesn't mean that
> the code has to be complex: a little E4X and/or XPath can make the code
> expressive and simple (assuming a simple and straightforward data model to
> begin with).

To say that "any attempt to perform generic processing of data is
misguided" is a pretty sweeping statement. Also, I guess I don't
understand what you mean by "specific formats to generic formats" since
SQL is pretty generic from my point of view.

As I mentioned before, the key point for me is that data representations
should have schemas. C++ objects have schemas (in the form of class
definitions), as do SQL tables. I can't see any reason why XML
should be the exception. And (again repeating myself) once you add
schemas to XML you're very close to what I, at least, believe RDF was/is
trying to achieve.

> The point is that Atom is focused (for the most part) on solving a
> particular domain of problems having to do with syndication of data, and
> solving it well. They have defined semantics for what processors are
> supposed to do when they encounter an Atom extension they don't know how
> to process.

This is a total red herring. "Syndication of data" is not a limited
niche application, it's about distribution and reuse of data in general.
The fact that feed formats have been used mainly for display up til now
doesn't mean they won't be used for automated processing of various
types as well. In fact, the more I think about it, the more I believe
that Atom could be the "new RDF" in that regard.

I dare say we'll agree to disagree on this until there is more
concrete evidence to support our respective viewpoints (personally I
hope AllPeers will be a convincing argument for my stance). Nonetheless,
this discussion is very useful and thought-provoking, although I have to
admit that I still haven't decided whether it serves our interests to
use RDF (at least nominally).

Matt

Benjamin Smedberg

Aug 14, 2006, 9:51:14 AM
Matthew Gertner wrote:

> As I mentioned before, the key point for me is that data representations
> should have schemas. C++ objects have schemas (in the form of class
> definitions), as do SQL tables. I can't see any reason why XML
> should be the exception. And (again repeating myself) once you add
> schemas to XML you're very close to what I, at least, believe RDF was/is
> trying to achieve.

That depends entirely on what the schema is useful for. Schemas can be used
for lots of different things:

1) schemas can define a validation model to ensure that a document matches
some vocabulary
2) schemas can define an editing model to aid editors in generating a
document that matches some vocabulary
3) schemas can declare some of the semantics of a vocabulary (datatypes, for
example, or aggregation rules) so that the document can be processed more
intelligently by machines.

And, though this is getting away from "schemas" per se: 4) schemas can
provide a user presentation of data.

>> The point is that Atom is focused (for the most part) on solving a
>> particular domain of problems having to do with syndication of data,
>> and solving it well. They have defined semantics for what processors
>> are supposed to do when they encounter an Atom extension they don't
>> know how to process.
>
> This is a total red herring. "Syndication of data" is not a limited
> niche application, it's about distribution and reuse of data in general.
> The fact that feed formats have been used mainly for display up til now
> doesn't mean they won't be used for automated processing of various
> types as well. In fact, the more I think about it, the more I believe
> that Atom could be the "new RDF" in that regard.

The distinction is that you would never represent street addresses in "Atom
format". The Atom vocabulary allows for feeds to contain street addresses,
but that requires that the street address format be specified using some
other vocabulary.

RDF vocabularies are not clearly delineated. An RDF "vocabulary" cannot
precisely define the interactions between itself and other vocabularies. You
cannot say that an RSS feed resource may only have one RSS "title", because
aggregation of data can easily provide more than one title. So you have to
write massively complicated processing and display rules to deal with
multiple titles (or ignore the fact that there are multiple titles... which
is worse?)
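In toy form (made-up URIs and a made-up vocabulary, obviously):

// Two sources each make a statement about the same feed resource.
var sourceA = [{ s: "http://example.org/feed", p: "rss:title", o: "My Weblog" }];
var sourceB = [{ s: "http://example.org/feed", p: "rss:title", o: "Jan's Blog" }];

// Naive aggregation is just concatenation of the statements...
var merged = sourceA.concat(sourceB);

// ...so the "one title per feed" constraint silently becomes two titles:
merged.filter(function (st) { return st.p === "rss:title"; }).length;   // 2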

A clear delineation and combination of domain-specific vocabularies is what
I'm arguing for, against the flat combination of vocabulary data inherent in
RDF.

--BDS
