Harvesting large aggregations

Jasper Op de Coul

Mar 16, 2010, 6:45:07 AM

to OAI-ORE

Hi,

I recently added ORE support to the IR site of the Erasmus University.
One of the faculties involved asked if they could harvest all the
publications created by their faculty using ORE.
I tried to come up with a solution for this, which I will outline
below.

The first thing I did was to make the faculty an aggregation that
aggregates all of their publication aggregations. This was simple
enough, but I had a hard time figuring out what to use for the
modification date of the faculty aggregation.

The faculty holds several thousand publications that will be harvested
daily. It seemed a bit wasteful to let the harvest agent download these
publications all the time just to find out if they've been modified, so
I decided to include the modification date of the publication res.map
in the faculty graph, like so:

<..faculty/aggr/xyz> ore:aggregates <..publication/aggr/xyz>
<..publication/resm/xyz> ore:describes <..publication/aggr/xyz>
<..publication/resm/xyz> dct:modified "2010-01-01T00:00:00"

This could allow a harvester to determine if a res.map is modified
without having to download it first. Of course, this res.map is not
authoritative.
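
A minimal sketch of how a harvester might use these triples, assuming the faculty graph has already been parsed into (subject, predicate, object) tuples and that the harvester records when it last fetched each res.map. The function name, the `last_seen` dictionary and the sample URIs are hypothetical and only serve the illustration.

```python
from datetime import datetime

DCT_MODIFIED = "dct:modified"

def resmaps_to_refetch(triples, last_seen):
    """Return res.map URIs whose advertised dct:modified is newer than our copy.

    triples   -- iterable of (subject, predicate, object) strings
    last_seen -- dict mapping res.map URI -> datetime of our last harvest
    """
    stale = []
    for subject, predicate, obj in triples:
        if predicate != DCT_MODIFIED:
            continue
        modified = datetime.fromisoformat(obj)
        # A res.map we have never fetched is always considered stale.
        if subject not in last_seen or modified > last_seen[subject]:
            stale.append(subject)
    return stale

# Hypothetical excerpt of a faculty graph with two publications.
faculty_graph = [
    ("faculty/aggr/xyz", "ore:aggregates", "publication/aggr/1"),
    ("publication/resm/1", "ore:describes", "publication/aggr/1"),
    ("publication/resm/1", DCT_MODIFIED, "2010-01-01T00:00:00"),
    ("faculty/aggr/xyz", "ore:aggregates", "publication/aggr/2"),
    ("publication/resm/2", "ore:describes", "publication/aggr/2"),
    ("publication/resm/2", DCT_MODIFIED, "2010-03-01T00:00:00"),
]
last_seen = {
    "publication/resm/1": datetime(2010, 2, 1),
    "publication/resm/2": datetime(2010, 2, 1),
}
print(resmaps_to_refetch(faculty_graph, last_seen))
```

Only the second res.map is reported, because its advertised date is later than the harvester's last fetch.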

Generating the modification date of a res.map can be quite an expensive
operation, since aggregations can consist of many resources and binary
files that all have their own modification date.

Because of this, I added a batching strategy to the faculty
aggregation. It turns out that this can easily be implemented using
multiple resource maps. By default only the first 100 aggregations are
shown, together with the following statements:

<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz>
<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>

I think it makes sense for a harvester to first download all the
res.maps describing the aggregation before it actually starts
processing the aggregation and its aggregated resources. So it would
also download the res.map with the offset, which would in turn contain
the following statements:

<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>
<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=200>

This would trigger the harvester to download the next batch, and so on
until there are no more resource maps to download.
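
The batching loop described above can be sketched as follows. This is an illustration under assumptions, not the repository's code: `fetch` is a hypothetical callable that returns the triples of one resource map, and the in-memory `PAGES` dictionary stands in for the server.

```python
from collections import deque

def harvest_all_resmaps(first_url, fetch):
    """Follow ore:isDescribedBy links until no unseen resource map remains.

    fetch(url) must return an iterable of (subject, predicate, object) triples.
    Returns (URLs in download order, set of all triples found).
    """
    queue = deque([first_url])
    seen = {first_url}
    order = []
    graph = set()  # a set, so statements repeated across batches collapse
    while queue:
        url = queue.popleft()
        order.append(url)
        for triple in fetch(url):
            graph.add(triple)
            _, predicate, obj = triple
            if predicate == "ore:isDescribedBy" and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return order, graph

# Hypothetical server content: three batches, one publication each for brevity.
AGGR = "faculty/aggr/xyz"
BASE = "faculty/resm/xyz"
PAGES = {
    BASE: [
        (AGGR, "ore:isDescribedBy", BASE),
        (AGGR, "ore:isDescribedBy", BASE + "?offset=100"),
        (AGGR, "ore:aggregates", "publication/aggr/1"),
    ],
    BASE + "?offset=100": [
        (AGGR, "ore:isDescribedBy", BASE + "?offset=100"),
        (AGGR, "ore:isDescribedBy", BASE + "?offset=200"),
        (AGGR, "ore:aggregates", "publication/aggr/101"),
    ],
    BASE + "?offset=200": [
        (AGGR, "ore:isDescribedBy", BASE + "?offset=200"),
        (AGGR, "ore:aggregates", "publication/aggr/201"),
    ],
}

order, graph = harvest_all_resmaps(BASE, PAGES.__getitem__)
print(order)
```

The `seen` set keeps the harvester from downloading a batch twice when several batches point at the same resource map.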

The only problem I'm having now is that I don't know what to use as
modification date on the faculty aggregation. Ideally it would be the
last modification time of any of the aggregated publication
aggregations. However, this is rather expensive to find out. For now I
made the faculty aggregation have the current time as modification
time, forcing a harvester to always download the whole aggregation.
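
For illustration, the "ideal" date mentioned above is simply the maximum over the publication dates. The sketch below assumes the timestamps are already available as a list, which is exactly the expensive part in practice; the function name and sample values are hypothetical.

```python
from datetime import datetime, timezone

def aggregation_modified(member_modified, fallback):
    """Return the newest member timestamp, or `fallback` if there are no members."""
    return max(member_modified, default=fallback)

# Hypothetical modification dates of three publication aggregations.
publication_modified = [
    datetime(2010, 1, 1, tzinfo=timezone.utc),
    datetime(2010, 3, 15, 9, 30, tzinfo=timezone.utc),
    datetime(2009, 11, 2, tzinfo=timezone.utc),
]
now = datetime.now(timezone.utc)
print(aggregation_modified(publication_modified, fallback=now).isoformat())
```

Passing the current time as `fallback` reproduces the present behaviour for an empty aggregation.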

Anyway, any feedback on all this would be greatly appreciated. You can
view the real ORE graphs here:

http://repub.eur.nl/resource/erim/rdf.xml (faculty)
http://repub.eur.nl/resource/pub_17919/rdf.xml (example article)

Kind regards,
Jasper

--
Jasper Op de Coul -- Infrae
t +31 10 243 7051 -- http://infrae.com
Hoevestraat 10 3033GC Rotterdam -- The Netherlands

Robert Sanderson

Mar 16, 2010, 12:26:54 PM

to oai...@googlegroups.com

On Tue, Mar 16, 2010 at 3:45 AM, Jasper Op de Coul <jas...@infrae.com> wrote:

I recently added ORE support to the IR site of the Erasmus University.

Great work Jasper :) I saw you demonstrate the system at OAI6 last year in Geneva, so I'm glad that it has become production code.

[...] Because of this, I added a batching strategy to the faculty aggregation.
It turns out that this can easily be implemented using multiple
resource maps. By default only the first 100 aggregations are shown, together
with the following statements:

<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz>
<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>

Ed Summers and I discussed this problem a couple of weeks ago at the Dev8D [un]conference and came to a similar but not identical solution. This was admittedly in a pub, so criticism of the idea is very much appreciated :)
Also, please note that this is a problem for RDF in general, not just ORE! If we can solve it neatly, it may become a pattern for other systems as well.

If we have an aggregation (Large), which needs to be serialized into multiple files, we thought that there should be one abstract resource map that somehow contains the parts. As we have a solution for an abstract resource that contains other resources in ORE, our model was:

Large ore:isDescribedBy Abstract-ReM
Large rdf:type ore:Aggregation

Abstract-ReM rdf:type ore:Aggregation // this breaks the range of isDescribedBy
Abstract-ReM ore:aggregates ReM-0
Abstract-ReM ore:aggregates ReM-1
...
Abstract-ReM ore:aggregates ReM-n

And then add proxies with next/prev relationships so that the order of the aggregated resource map parts can be maintained, if necessary. [Remember that it's a graph, so in theory there is no order... but for paging to be useful, we need the order!]

This information is then, of course, encoded in a resource map for Abstract-ReM; call it ReM-A:

Abstract-ReM ore:isDescribedBy ReM-A
ReM-A rdf:type ore:ResourceMap

The reason that we went with this method is that it's otherwise impossible to distinguish pages of a resource map from multiple complete resource maps. For example, one might have 3 Turtle pages and 10 RDF/XML pages. In our scenario there would be two abstract resource maps (one for each serialization), rather than 13 separate ore:isDescribedBy links.
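
A rough sketch of this model, generating the statements listed above for the 3 Turtle / 10 RDF/XML example. The helper name and the per-format resource names (`Abstract-ReM-turtle`, `ReM-rdfxml-0`, and so on) are made up for illustration. The proxies with next/prev links are left out, since no predicates for them are named here.

```python
def abstract_rem_triples(aggregation, abstract_rem, rem_a, pages):
    """Build the statements tying an aggregation to one paged serialization."""
    triples = {
        (aggregation, "rdf:type", "ore:Aggregation"),
        (aggregation, "ore:isDescribedBy", abstract_rem),
        # Typing the abstract ReM as an Aggregation breaks the range of
        # ore:isDescribedBy, as noted in the model above.
        (abstract_rem, "rdf:type", "ore:Aggregation"),
        (abstract_rem, "ore:isDescribedBy", rem_a),
        (rem_a, "rdf:type", "ore:ResourceMap"),
    }
    triples.update((abstract_rem, "ore:aggregates", page) for page in pages)
    return triples

# Hypothetical page names: 3 Turtle pages and 10 RDF/XML pages.
pages_by_format = {
    "turtle": [f"ReM-turtle-{i}" for i in range(3)],
    "rdfxml": [f"ReM-rdfxml-{i}" for i in range(10)],
}

graph = set()
for fmt, pages in pages_by_format.items():
    graph |= abstract_rem_triples("Large", f"Abstract-ReM-{fmt}", f"ReM-A-{fmt}", pages)

links_from_large = [t for t in graph
                    if t[0] == "Large" and t[1] == "ore:isDescribedBy"]
print(len(links_from_large))  # 2 links instead of 13
```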

Rob

Andreas Aschenbrenner

Mar 16, 2010, 3:20:05 PM

to oai...@googlegroups.com

Dear Jasper,

Sounds like a fun challenge, and I'm very impressed by your approach.

However, I'm not sure whether I completely understand your objectives
and your design decisions. Please allow this question: as far as I
understand it, you aim to expose a collection (xy) of aggregations
(ORE). You modeled this as an aggregation (ORE) of aggregations (ORE).
Did you consider formats other than ORE for the collection (xy) though?
Perhaps sitemaps (http://sitemaps.org/protocol.php) could expose all
the aggregations (ORE); with that, the harvester can always pick the
most recent ones, etc.

Best wishes,
aA

Jasper Op de Coul

Mar 18, 2010, 10:27:45 AM

to OAI-ORE

Hi Rob,

Well, if your batches are meaningful, or you need to preserve order,
this makes sense. In my case, the batches are purely a technical
solution to efficiently transport data between client and server.
There is no reason why these batches should become part of the content
model. I have a Faculty aggregation that aggregates a large number of
publications; some sort of batch aggregation in the middle would only
make the model more confusing.

> The reason that we went with this method is that it's otherwise impossible
> to distinguish pages of a resource map from multiple complete resource
> maps. For example, one might have 3 Turtle pages and 10 RDF/XML pages. In
> our scenario there would be two abstract resource maps (one for each
> serialization), rather than 13 separate ore:isDescribedBy links.

I'm not sure if I understand this correctly. For the batching strategy
to work, you do not need to list all of the isDescribedBy links at
once. You only need to list the first batch; once a client downloads
that, it will find a link to the next batch. This is all a bit
implicit though, but you could add a dcterms:requires between the
different batches:

large rdf:type ore:Aggregation
large ore:isDescribedBy ReM.rdf # the current ReM
large ore:isDescribedBy ReM.rdf?offset=100 # another ReM
ReM.rdf dcterms:requires ReM.rdf?offset=100

Once you download ?offset=100 you will find that it requires
?offset=200, which requires ?offset=300, etc.

I don't really see a problem with mixing multiple dc:formats in an
Aggregation. The client can choose the best ReM it can find (one using
a format it understands) and then start downloading all the ReMs in
that format (dcterms:hasFormat can be used to indicate that this is
the same data in another format).

Besides using dc:format and dcterms:requires, we could take this one
step further and add a dcterms:temporal to a ReM to indicate that this
ReM only deals with a certain period in time. Consider the following:

large ore:isDescribedBy ReM.rdf?since=yesterday
ReM.rdf?since=yesterday dcterms:temporal "start=2010-03-16T00:00:00Z; end=2010-03-17T00:00:00Z;"^^dcterms:Period

By using dcterms:temporal, the harvesting client could understand that
this ReM only returns aggregations/statements modified on March 16.
The harvesting client can choose to download only this ReM if it only
needs an update from yesterday. Furthermore, since this period is a
fixed interval in the past, the server only needs to determine what
was modified once (at midnight on March 17) and can then cache this
data for the next 24 hours. This makes the problem that it can be very
expensive to determine the correct modification date of a ReM much
more manageable.
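
As a sketch of the client side, the period literal could be parsed and compared with the time of the last harvest. The parser below is deliberately lenient and accepts both `;` and `,` between the components; the function names and the decision rule (the incremental ReM is enough when the last harvest is not older than the period start) are assumptions made for this example.

```python
import re
from datetime import datetime, timezone

def parse_period(value):
    """Parse a 'start=...; end=...' literal into (start, end) datetimes."""
    fields = {}
    for part in re.split(r"[;,]", value):
        key, sep, val = part.partition("=")
        if sep:
            fields[key.strip()] = val.strip()

    def to_datetime(text):
        # Older Python versions reject a trailing 'Z' in fromisoformat().
        return datetime.fromisoformat(text.replace("Z", "+00:00"))

    return to_datetime(fields["start"]), to_datetime(fields["end"])

def incremental_rem_suffices(period, last_harvest):
    """True if nothing modified before the period start can have been missed."""
    start, _end = period
    return last_harvest >= start

period = parse_period("start=2010-03-16T00:00:00Z; end=2010-03-17T00:00:00Z;")
recent = datetime(2010, 3, 16, 0, 30, tzinfo=timezone.utc)
old = datetime(2010, 3, 10, tzinfo=timezone.utc)
print(incremental_rem_suffices(period, recent))  # True
print(incremental_rem_suffices(period, old))     # False
```

A client whose last harvest predates the period start would fall back to downloading the full set of ReMs.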

Additionally, a ReM could also have dcterms:conformsTo predicates
specifying which ontology is used to mark up the aggregation:

large ore:isDescribedBy ReM.rdf
ReM.rdf dcterms:conformsTo <http://purl.org/dc/terms/>

To summarise, I would like to propose the following:

- A harvesting client can choose a more appropriate ReM after loading
  the initial authoritative ReM through the Aggregation URI. The client
  knows which ReMs are available because of the ore:isDescribedBy
  predicates. The choice of ReM is determined by inspecting what these
  different ReMs do. They might be in different formats
  (dcterms:format), use different metadata standards
  (dcterms:conformsTo) or only deal with specific periods
  (dcterms:temporal). Of course this is a completely optional step, and
  it can be extended with additional predicates that are only
  understood by specific harvesters.
- A harvesting client should download all available ReMs that make
  sense (no need to download a ReM in a dc:format the harvester does
  not understand, or something that dcterms:conformsTo an unknown
  ontology), to get a complete picture of an Aggregation. This allows
  the server to send the data in batches (through multiple ReMs).
  Additionally, the server can use dcterms:requires predicates to make
  the batching more explicit, allowing harvesters that do understand
  these predicates to download more precisely what they need.
  Harvesters that do not have a clue what all these extra ReMs mean
  will simply download them all, which will probably result in
  duplicate statements (which in itself is not really a problem in
  RDF).
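
The selection step in this proposal could look roughly like this on the client. It assumes the ReM descriptions have already been reduced to a dictionary per URL; the keys `format` and `conformsTo`, the function name and the sample values are hypothetical.

```python
def select_resmaps(candidates, known_formats, known_standards):
    """Pick the ReMs worth downloading.

    candidates -- dict mapping ReM URL -> dict with optional 'format'
                  and 'conformsTo' entries taken from the initial ReM
    A ReM without any annotation is always selected, so a harvester that
    understands nothing extra still downloads everything.
    """
    selected = []
    for url, props in candidates.items():
        fmt = props.get("format")
        standard = props.get("conformsTo")
        if fmt is not None and fmt not in known_formats:
            continue  # a format this harvester cannot parse
        if standard is not None and standard not in known_standards:
            continue  # an ontology this harvester does not know
        selected.append(url)
    return selected

candidates = {
    "ReM.rdf": {"format": "application/rdf+xml"},
    "ReM.ttl": {"format": "text/turtle"},
    "ReM.rdf?offset=100": {},
}
print(select_resmaps(candidates, {"application/rdf+xml"}, set()))
```

Here the Turtle ReM is skipped, while the unannotated batch is still downloaded.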

The nice thing about this is that ORE maps can scale from a single
file on a webserver to a batching webservice that can be queried to
return only incremental data changes. It's also an extensible model
that would work now, without having to change the ORE spec.

I hope this is all reasonably coherent; let me know what you think.

regards,
Jasper


Jasper Op de Coul

Apr 1, 2010, 9:30:12 AM

to OAI-ORE

Hi,

Just replying to myself here. I understand now that my approach would
never work: the modification date of an aggregation has nothing to do
with the modification dates of its aggregated resources, so you can't
look at the modification date of a ReM to determine if any of the
aggregated resources have changed.
I also see now that using proxies to implement batches is not that
bad, and is probably the only correct way to implement batches that
any agent would understand.
To harvest ORE graphs you need something like a bot/spider that keeps
refreshing data loaded from many URLs. I was looking for something
more like PMH, which does batches and incremental harvesting and is
more centralised, and which would make more sense for this specific
use case.

regards,
Jasper
