Re: Harvesting large aggregations


AAschen Mar 16, 2010 12:20 PM
Posted in group: OAI-ORE
Dear Jasper,

This sounds like a fun challenge, and I'm very impressed by your approach.

However, I'm not sure whether I completely understand your objectives
and your design decisions. Please allow me this question: as far as I
understand it, you aim to expose a collection (xy) of aggregations
(ORE), and you modeled this as an aggregation (ORE) of aggregations (ORE).
Did you consider formats other than ORE for the collection (xy), though?
Perhaps sitemaps (http://sitemaps.org/protocol.php) could be used to expose
all the aggregations (ORE); with those the harvester could always pick the
most recent ones, etc.
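
Just to sketch what I mean (the function name build_sitemap and the example
entry below are only illustrative, not taken from your repository): a few
lines of Python could emit such a sitemap, with one <url> entry per
publication res.map and <lastmod> carrying its modification date.

    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def build_sitemap(resource_maps):
        """resource_maps: iterable of (url, lastmod) pairs, lastmod as W3C datetime."""
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url, lastmod in resource_maps:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
            ET.SubElement(entry, "lastmod").text = lastmod
        return ET.tostring(urlset, encoding="unicode")

    # One entry per publication res.map; a harvester then only fetches those
    # whose <lastmod> is newer than its previous run.
    print(build_sitemap([
        ("http://repub.eur.nl/resource/pub_17919/rdf.xml", "2010-01-01T00:00:00"),
    ]))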

Best wishes,
aA

On 16.03.2010 11:45, Jasper Op de Coul wrote:
> Hi,
>
> I recently added ORE support to the IR site of the Erasmus University.
> One of the faculties involved asked if they could harvest all the
> publications created by their faculty using ORE.
> I tried to come up with a solution for this, which I will outline
> below.
>
> The first thing I did was to make the faculty an aggregation that
> aggregates all of their publication aggregations. This was simple enough,
> but I had a hard time figuring out what to use for the modification date
> of the faculty aggregation.
>
> The faculty holds several thousand publications that will be harvested
> daily. It seemed a bit wasteful to let the harvest agent download these
> publications all the time just to find out if they've been modified, so I
> decided to include the modification date of the publication res.map in the
> faculty graph, like so:
>
> <..faculty/aggr/xyz>  ore:aggregates  <..publication/aggr/xyz>
> <..publication/resm/xyz>  ore:describes  <..publication/aggr/xyz>
> <..publication/resm/xyz>  dct:modified  "2010-01-01T00:00:00"
>
> This could allow a harvester to determine whether a res.map has been
> modified without having to download it first. Of course, this res.map is
> not authoritative.
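>
> As a rough sketch (assuming rdflib; the function changed_resource_maps and
> the last-harvest timestamp are only illustrative), a harvester could read
> the faculty graph and keep only the res.maps whose advertised dct:modified
> is newer than its previous run:
>
>     import datetime
>     from rdflib import Graph, Namespace
>
>     ORE = Namespace("http://www.openarchives.org/ore/terms/")
>     DCT = Namespace("http://purl.org/dc/terms/")
>
>     def changed_resource_maps(faculty_graph_url, last_harvest):
>         """Yield res.map URIs whose advertised dct:modified is newer than last_harvest."""
>         g = Graph()
>         g.parse(faculty_graph_url)
>         for resmap in g.subjects(ORE.describes):
>             modified = g.value(resmap, DCT.modified)
>             if modified and datetime.datetime.fromisoformat(str(modified)) > last_harvest:
>                 yield resmap
>
>     # Only these res.maps would then be downloaded and processed in full.
>     for uri in changed_resource_maps("http://repub.eur.nl/resource/erim/rdf.xml",
>                                      datetime.datetime(2010, 1, 1)):
>         print(uri)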
>
> Generating the modification date of a res.map can be quite an expensive
> operation, since aggregations can consist of many resources and binary
> files that all have their own modification date.
>
> Because of this, I added a batching strategy to the faculty aggregation.
> It turns out that this can easily be implemented using multiple resource
> maps. By default only the first 100 aggregations are shown, together
> with the following statements:
>
> <..faculty/aggr/xyz>  ore:isDescribedBy  <..faculty/resm/xyz>
> <..faculty/aggr/xyz>  ore:isDescribedBy  <..faculty/resm/xyz?offset=100>
>
> I think it makes sense for a harvester to first download all the
> res.maps describing the aggregation before it actually starts processing
> the aggregation and its aggregated resources. So it would also download
> the res.map with the offset, which would in turn contain the following
> statements:
>
> <..faculty/aggr/xyz>  ore:isDescribedBy  <..faculty/resm/xyz?offset=100>
> <..faculty/aggr/xyz>  ore:isDescribedBy  <..faculty/resm/xyz?offset=200>
>
> These statements would trigger the harvester to download the next batch.
> This would continue until there are no more resource maps to download.
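>
> For example, a harvester could follow those statements with a small loop
> like the following sketch (again assuming rdflib; collect_resource_maps is
> just an illustrative name):
>
>     from rdflib import Graph, Namespace
>
>     ORE = Namespace("http://www.openarchives.org/ore/terms/")
>
>     def collect_resource_maps(start_url):
>         """Follow ore:isDescribedBy links until no new resource maps turn up."""
>         graph, seen, queue = Graph(), set(), [start_url]
>         while queue:
>             url = queue.pop()
>             if url in seen:
>                 continue
>             seen.add(url)
>             graph.parse(url)
>             # Any res.map mentioned but not fetched yet is the next batch.
>             for resmap in graph.objects(None, ORE.isDescribedBy):
>                 if str(resmap) not in seen:
>                     queue.append(str(resmap))
>         return graph
>
>     # The combined graph then lists every aggregated publication, batch by batch.
>     faculty_graph = collect_resource_maps("http://repub.eur.nl/resource/erim/rdf.xml")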
>
> The only problem I'm having now is that I don't know what to use as the
> modification date of the faculty aggregation. Ideally it would be the
> last modification time of any of the aggregated publication aggregations.
> However, this is rather expensive to find out. For now I made the faculty
> aggregation have the current time as its modification time, forcing a
> harvester to always download the whole aggregation.
>
> Anyway, any feedback on all this would be greatly appreciated. You can
> view the real ORE graphs here:
>
> http://repub.eur.nl/resource/erim/rdf.xml (faculty)
> http://repub.eur.nl/resource/pub_17919/rdf.xml (example article)
>
> Kind regards,
> Jasper
>
> --
> Jasper Op de Coul -- Infrae
> t +31 10 243 7051 -- http://infrae.com
> Hoevestraat 10 3033GC Rotterdam -- The Netherlands
>