Sounds like a fun challenge, and I'm very impressed by your approach.
However, I'm not sure I completely understand your objectives and design decisions, so please allow me a question. As far as I understand it, you aim to expose a collection (xy) of aggregations (ORE), and you modeled this as an aggregation (ORE) of aggregations (ORE). Did you consider formats other than ORE for the collection (xy), though? Perhaps sitemaps (http://sitemaps.org/protocol.php) could expose all the aggregations (ORE); with the optional last-modification date a sitemap can carry per URL, the harvester could always pick the most recently changed ones, etc.
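To make the sitemap idea a bit more concrete, here is a minimal sketch in Python of how a harvester might pick only the entries changed since its last run. Everything specific in it is an assumption for illustration: the example.org URLs, the sample sitemap, and the helper name `modified_since` are hypothetical and not taken from your repository. It also assumes all timestamps are given without a time zone.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap listing resource maps; the URLs are placeholders.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/publication/resm/1</loc>
    <lastmod>2010-01-01T00:00:00</lastmod>
  </url>
  <url>
    <loc>http://example.org/publication/resm/2</loc>
    <lastmod>2010-03-15T09:30:00</lastmod>
  </url>
  <url>
    <loc>http://example.org/publication/resm/3</loc>
  </url>
</urlset>"""


def modified_since(sitemap_xml, last_harvest):
    """Return the <loc> URLs whose <lastmod> is later than last_harvest.

    Entries without <lastmod> are returned as well, because the harvester
    cannot tell whether they have changed.
    """
    root = ET.fromstring(sitemap_xml)
    to_fetch = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue  # malformed entry, skip it
        if lastmod is None:
            to_fetch.append(loc.strip())
            continue
        # Assumes naive ISO 8601 timestamps (no time zone suffix).
        if datetime.fromisoformat(lastmod.strip()) > last_harvest:
            to_fetch.append(loc.strip())
    return to_fetch


if __name__ == "__main__":
    for loc in modified_since(EXAMPLE_SITEMAP, datetime(2010, 2, 1)):
        print(loc)
```

With something like this, the harvester would only need one cheap listing to decide what to download, instead of fetching the whole faculty aggregation each time.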
Best wishes, aA
On 16.03.2010 11:45, Jasper Op de Coul wrote:
> Hi,
>
> I recently added ORE support to the IR site of the Erasmus University.
> One of the faculties involved asked if they could harvest all the
> publications created by their faculty using ORE.
> I tried to come up with a solution for this, which I will outline
> below.
>
> The first thing I did was to make the faculty an aggregation that
> aggregates all of their publication aggregations. This was simple
> enough, but I had a hard time figuring out what to use for the
> modification date of the faculty aggregation.
>
> The faculty holds several thousand publications that will be harvested
> daily. It seemed a bit wasteful to let the harvest agent download
> these publications all the time just to find out if they've been
> modified, so I decided to include the modification date of the
> publication res.map in the faculty graph, like so:
>
> <..faculty/aggr/xyz> ore:aggregates <..publication/aggr/xyz>
> <..publication/resm/xyz> ore:describes <..publication/aggr/xyz>
> <..publication/resm/xyz> dct:modified "2010-01-01T00:00:00"
>
> This could allow a harvester to determine if a res.map is modified
> without having to download it first. Of course this res.map is not
> authoritative.
>
> Generating the modification date of a res.map can be quite an
> expensive operation since aggregations can consist of many resources
> and binary files that all have their own modification date.
>
> Because of this, I added a batching strategy to the faculty
> aggregation. It turns out that this can easily be implemented using
> multiple resource maps. By default only the first 100 aggregations
> are shown, together with the following statements:
>
> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz>
> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>
>
> I think it makes sense for a harvester to first download all the
> res.maps describing the aggregation, before it would actually start
> processing the aggregation and its aggregated resources. So it would
> also download the resmap with the offset, which would in turn contain
> the following statements:
>
> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>
> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=200>
>
> Which would trigger the harvester to download the next batch. This
> would continue until there are no more resource maps to download.
>
> The only problem I'm having now is that I don't know what to use as
> modification date on the faculty aggregation. Ideally it would be the
> last modification time of any of the aggregated publication
> aggregations. However, this is rather expensive to find out. For now I
> made the faculty aggregation have the current time as modification
> time, forcing a harvester to always download the whole aggregation.
>
> Anyway, any feedback on all this would be greatly appreciated. You can
> view the real ORE graphs here:
>
> http://repub.eur.nl/resource/erim/rdf.xml (faculty)
> http://repub.eur.nl/resource/pub_17919/rdf.xml (example article)
>
> Kind regards,
> Jasper
>
> --
> Jasper Op de Coul -- Infrae
> t +31 10 243 7051 -- http://infrae.com
> Hoevestraat 10 3033GC Rotterdam -- The Netherlands
>