On Mar 16, 5:26 pm, Robert Sanderson <azarot...@gmail.com> wrote:
Well, if your batches are meaningful, or you need to preserve order,
this makes sense.
In my case, the batches are purely a technical solution to efficiently
transport data between client and server.
There is no reason why these batches should become part of the content
model. I have a Faculty aggregation that aggregates a large number of
publications; some sort of batch aggregation in the middle would only
make the model more confusing.
> The reason that we went with this method is that it's otherwise impossible
> to distinguish pages of a resource map from multiple complete resource
> maps. For example, one might have 3 Turtle pages and 10 RDF/XML pages. In
> our scenario there would be two abstract resource maps (one for each
> serialization), rather than 13 separate ore:isDescribedBy links.
I'm not sure I understand this correctly. For the batching strategy you
do not need to list all of the isDescribedBy links at once. You only
need to list the first batch; once a client downloads that, it will
find a link to the next batch.
This is all a bit implicit, though; you could add a dcterms:requires
link between the different batches:
large rdf:type ore:Aggregation
large ore:isDescribedBy ReM.rdf # the current ReM
large ore:isDescribedBy ReM.rdf?offset=100 # another ReM
ReM.rdf dcterms:requires ReM.rdf?offset=100
Once you download ?offset=100 you will find that it requires
?offset=200, which requires ?offset=300, etc.
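To make this concrete, here is a rough sketch (in Python with rdflib;
the URLs and the ?offset parameter are just the placeholders from the
example above) of a client that walks such a dcterms:requires chain:

from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

def harvest_chain(start_rem_uri):
    combined = Graph()
    todo = [URIRef(start_rem_uri)]
    seen = set()
    while todo:
        rem = todo.pop()
        if rem in seen:
            continue
        seen.add(rem)
        batch = Graph()
        batch.parse(rem)          # fetch and parse this batch of the ReM
        combined += batch         # merge its statements into one graph
        # queue every further batch this one says it requires
        todo.extend(batch.objects(rem, DCTERMS.requires))
    return combined

# e.g. harvest_chain("http://example.org/ReM.rdf")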
I don't really see a problem with mixing multiple dc:formats in an
Aggregation: the client can choose the best ReM it can find (one using
a format it understands) and then start downloading all the ReMs in
that format (dcterms:hasFormat can be used to indicate that this is
the same data in another format).
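Just as an illustration (the media types and the preference order here
are my own assumptions), picking one ReM per format could look like
this:

from rdflib import Namespace, Literal

DCTERMS = Namespace("http://purl.org/dc/terms/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

PREFERRED_FORMATS = ["text/turtle", "application/rdf+xml"]  # assumed preference order

def choose_rem(initial_graph, aggregation):
    # all the ReMs the authoritative ReM points to
    rems = list(initial_graph.objects(aggregation, ORE.isDescribedBy))
    for fmt in PREFERRED_FORMATS:
        for rem in rems:
            if (rem, DCTERMS["format"], Literal(fmt)) in initial_graph:
                return rem
    return None   # nothing we understand; fall back to the authoritative ReM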
Besides using dc:format and dc:requires, we could take this one step
further and add a dcterms:temporal
to a ReM to indicate that this ReM only deals with a certain period in
time. Consider the following:
large ore:isDescribedBy ReM.rdf?since=yesterday
ReM.rdf?since=yesterday dcterms:temporal "start=2010-03-16T00:00:00Z,
end=2010-03-17T00:00:00Z"
By using dcterms:temporal the harvesting client could understand that
this ReM only returns aggregations/statements modified on March 16.
The harvesting client can choose to only download this ReM if it only
needs an update from yesterday.
Furthermore, since this period is fixed (it lies entirely in the
past), the server only needs to determine what was modified once (at
midnight) and can then cache this data for the next 24 hours. This
makes the otherwise very expensive problem of determining the correct
date of a ReM much more manageable.
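A sketch of how a harvester could use this, assuming the informal
"start=..., end=..." encoding from the example above:

from datetime import datetime, timezone
from rdflib import Namespace

DCTERMS = Namespace("http://purl.org/dc/terms/")

def parse_period(literal):
    # parse the assumed "start=2010-03-16T00:00:00Z, end=2010-03-17T00:00:00Z" encoding
    parts = dict(p.strip().split("=", 1) for p in str(literal).split(",") if "=" in p)
    to_dt = lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return to_dt(parts["start"]), to_dt(parts["end"])

def needs_download(graph, rem, harvested_up_to):
    period = graph.value(rem, DCTERMS.temporal)
    if period is None:
        return True                    # no temporal hint: fetch it to be safe
    _start, end = parse_period(period)
    return end > harvested_up_to       # skip batches that end before our last harvest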
Additionally, a ReM could also have dcterms:conformsTo predicates
specifying which ontology is used to mark up the aggregation:
large ore:isDescribedBy ReM.rdf
ReM.rdf dcterms:conformsTo <http://purl.org/dc/terms#>
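And the matching check on the client side could be as simple as the
following (which ontologies a harvester "knows" is of course an
assumption of this sketch):

from rdflib import Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
KNOWN_ONTOLOGIES = {URIRef("http://purl.org/dc/terms#")}  # what this harvester claims to support

def understands(graph, rem):
    declared = set(graph.objects(rem, DCTERMS.conformsTo))
    # nothing declared means nothing to rule out; otherwise require some overlap
    return not declared or bool(declared & KNOWN_ONTOLOGIES)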
To summarise, I would like to propose the following:
- A harvesting client can choose a more appropriate ReM after loading
the initial authoritative ReM through the Aggregation URI. The client
knows which ReMs are available because of the ore:isDescribedBy
predicates. The choice of ReM is determined by inspecting what these
different ReMs do. They might be in different formats (dcterms:format),
use different metadata standards (dcterms:conformsTo) or only deal
with specific periods (dcterms:temporal). Of course this is a
completely optional step, and it can be extended with additional
predicates that are only understood by specific harvesters.
- A harvesting client should download all available ReMs that make
sense (no need to download a ReM in a dc:format the harvester does not
understand, or something that dcterms:conformsTo an unknown ontology)
to get a complete picture of an Aggregation; see the sketch after this
list. This allows the server to send the data in batches (through
multiple ReMs). Additionally the server can use dcterms:requires
predicates to make the batching more explicit, allowing harvesters
that do understand these predicates to download more precisely what
they need. Harvesters that do not have a clue what all these extra
ReMs mean will simply download them all, which will probably result in
duplicate statements (which in itself is not really a problem in RDF).
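To tie the two points together, here is a rough end-to-end sketch of
the proposed harvesting behaviour (all URLs are placeholders, and the
"usable" callback stands for whatever combination of format,
conformsTo and temporal checks a harvester implements, as sketched
above):

from rdflib import Graph, Namespace, URIRef

ORE = Namespace("http://www.openarchives.org/ore/terms/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

def harvest(aggregation_uri, usable):
    aggregation = URIRef(aggregation_uri)
    initial = Graph()
    initial.parse(aggregation_uri)      # the initial authoritative ReM
    combined = Graph()
    combined += initial
    # keep only the ReMs this harvester can actually use
    todo = [rem for rem in initial.objects(aggregation, ORE.isDescribedBy)
            if usable(initial, rem)]
    seen = set()
    while todo:
        rem = todo.pop()
        if rem in seen:
            continue
        seen.add(rem)
        batch = Graph()
        batch.parse(rem)
        combined += batch               # duplicate statements are harmless in RDF
        # follow explicit batching links when the server provides them
        todo.extend(batch.objects(rem, DCTERMS.requires))
    return combined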
The nice thing about this is that ORE maps can scale from a single
file on a webserver to a batching webservice that can be queried to
only return incremental data changes. It's also an extensible model
that would work now, without having to change the ORE specification.
I hope this is all reasonably coherent; let me know what you think.