I recently added ORE support to the IR site of the Erasmus University. One of the faculties involved asked if they could harvest all the publications created by their faculty using ORE. I tried to come up with a solution for this, which I will outline below.
The first thing I did was to make the faculty an aggregation that aggregates all of their publication aggregations. This was simple enough but I had a hard time figuring out what to use for the modification date of the faculty aggregation.
The faculty holds several thousand publications that will be harvested daily. It seemed a bit wasteful to let the harvest agent download these publications all the time just to find out if they've been modified, so I decided to include the modification date of the publication res.map in the faculty graph, like so:
This could allow a harvester to determine if a res.map is modified without having to download it first. Of course this res.map is not authoritative.
Generating the modification date of a res.map can be quite an expensive operation since aggregations can consist of many resources and binary files that all have their own modification date.
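As a rough sketch of the harvester side (not the actual implementation; the URIs, the dictionaries and the function name are all made up for illustration), comparing the dates listed in the faculty graph against the dates recorded during the previous harvest could look like this. The listed dates are treated as hints only, since they are not authoritative.

```python
from datetime import datetime, timezone


def stale_resource_maps(advertised, last_seen):
    """Return the res.map URIs that are new, or whose advertised
    modification date is later than the one seen at the last harvest."""
    stale = []
    for uri, modified in advertised.items():
        previous = last_seen.get(uri)
        if previous is None or modified > previous:
            stale.append(uri)
    return sorted(stale)


# Hypothetical dates as they might be listed in the faculty graph.
advertised = {
    "http://example.org/pub/1.rdf": datetime(2010, 3, 15, tzinfo=timezone.utc),
    "http://example.org/pub/2.rdf": datetime(2010, 3, 10, tzinfo=timezone.utc),
    "http://example.org/pub/3.rdf": datetime(2010, 3, 16, tzinfo=timezone.utc),
}
# Hypothetical dates remembered from the previous harvest.
last_seen = {
    "http://example.org/pub/1.rdf": datetime(2010, 3, 15, tzinfo=timezone.utc),
    "http://example.org/pub/2.rdf": datetime(2010, 3, 1, tzinfo=timezone.utc),
}

to_fetch = stale_resource_maps(advertised, last_seen)
```

Here only the second and third res.maps would be downloaded again; the first one is unchanged according to the listed date.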
Because of this, I added a batching strategy to the faculty aggregation. It turns out that this can easily be implemented using multiple resource maps. By default only the first 100 aggregations are shown, together with the following statements:
I think it makes sense for a harvester to first download all the res.maps describing the aggregation, before it would actually start processing the aggregation and its aggregated resources. So it would also download the res.map with the offset, which would in turn contain the following statements:
Which would trigger the harvester to download the next batch. This would continue until there are no more resource maps to download.
The only problem I'm having now is that I don't know what to use as modification date on the faculty aggregation. Ideally it would be the last modification time of any of the aggregated publication aggregations. However, this is rather expensive to find out. For now I made the faculty aggregation have the current time as modification time, forcing a harvester to always download the whole aggregation.
Anyway, any feedback on all this would be greatly appreciated. You can view the real ORE graphs here:
On Tue, Mar 16, 2010 at 3:45 AM, Jasper Op de Coul <jas...@infrae.com> wrote:
> I recently added ORE support to the IR site of the Erasmus University.
Great work Jasper :) I saw you demonstrate the system at OAI6 last year in Geneva, so I'm glad that it has become production code.
> [...] Because of this, I added a batching strategy to the faculty aggregation. It turns out that this can easily be implemented using multiple resource maps. By default only the first 100 aggregations are shown, together with the following statements:
Ed Summers and I discussed this problem a couple of weeks ago at the Dev8D [un]conference and came to a similar but not identical solution. This was admittedly in a pub, so criticism of the idea is very much appreciated :) Also, please note that this is a problem for RDF in general, not just ORE! If we can solve it neatly, it may become a pattern for other systems as well.
If we have an aggregation (Large), which needs to be serialized into multiple files, we thought that there should be one abstract resource map that somehow contains the parts. As we have a solution for an abstract resource that contains other resources in ORE, our model was:
Large ore:isDescribedBy Abstract-ReM
Large rdf:type ore:Aggregation

Abstract-ReM rdf:type ore:Aggregation // this breaks the range of isDescribedBy
Abstract-ReM ore:aggregates ReM-0
Abstract-ReM ore:aggregates ReM-1
...
Abstract-ReM ore:aggregates ReM-n
And then add proxies with next/prev relationships so that the order of the aggregated resource map parts can be maintained, if necessary. [Remember that it's a graph, so in theory there is no order... but for paging to be useful, we need the order!]
This information is then, of course, encoded in a resource map for Abstract-ReM, call it ReM-A
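To make the ordering idea concrete, here is a minimal sketch of how a client might recover the page order from the proxies. It assumes the triples of ReM-A are already parsed into plain (subject, predicate, object) tuples; the proxy names and the bare "next" predicate are placeholders, since no specific next/prev vocabulary is fixed above.

```python
def ordered_pages(triples):
    """Walk the proxies' 'next' links to recover the order of the ReM parts."""
    proxy_for = {s: o for s, p, o in triples if p == "ore:proxyFor"}
    next_of = {s: o for s, p, o in triples if p == "next"}
    # The first proxy is the one that no other proxy points to.
    targets = set(next_of.values())
    starts = [proxy for proxy in proxy_for if proxy not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one first page")
    order, seen, current = [], set(), starts[0]
    while current is not None:
        if current in seen:
            raise ValueError("cycle in next links")
        seen.add(current)
        order.append(proxy_for[current])
        current = next_of.get(current)
    return order


# Hypothetical content of ReM-A, deliberately listed out of order.
triples = [
    ("Proxy-1", "ore:proxyFor", "ReM-1"),
    ("Proxy-0", "ore:proxyFor", "ReM-0"),
    ("Proxy-2", "ore:proxyFor", "ReM-2"),
    ("Proxy-0", "next", "Proxy-1"),
    ("Proxy-1", "next", "Proxy-2"),
]
pages = ordered_pages(triples)
```

The graph itself carries no order, so the client has to rebuild it from the links before it can page through the parts.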
The reason that we went with this method is that it's otherwise impossible to distinguish pages of a resource map from multiple complete resource maps. For example, one might have 3 Turtle pages and 10 RDF/XML pages. In our scenario there would be two abstract resource maps (one for each serialization), rather than 13 separate ore:isDescribedBy links.
Sounds like a fun challenge, and I'm very impressed by your approach.
However, I'm not sure whether I completely understand your objectives and your design decisions. Please allow this question: as far as I understood it, you aim to expose a collection (xy) of aggregations (ORE). You modeled this as an aggregation (ORE) of aggregations (ORE). Did you consider formats other than ORE for the collection (xy), though? Perhaps sitemaps (http://sitemaps.org/protocol.php) to expose all the aggregations (ORE); with that, the harvester can always pick the most recent ones, etc.
> On Tue, Mar 16, 2010 at 3:45 AM, Jasper Op de Coul <jas...@infrae.com> wrote:
> > I recently added ORE support to the IR site of the Erasmus University.
> Great work Jasper :) I saw you demonstrate the system at OAI6 last year in Geneva, so I'm glad that it has become production code.
> > [...] Because of this, I added a batching strategy to the faculty aggregation. It turns out that this can easily be implemented using multiple resource maps. By default only the first 100 aggregations are shown, together with the following statements:
> Ed Summers and I discussed this problem a couple of weeks ago at the Dev8D [un]conference and came to a similar but not identical solution. This was admittedly in a pub, so criticism of the idea is very much appreciated :) Also, please note that this is a problem for RDF in general, not just ORE! If we can solve it neatly, it may become a pattern for other systems as well.
> If we have an aggregation (Large), which needs to be serialized into multiple files, we thought that there should be one abstract resource map that somehow contains the parts. As we have a solution for an abstract resource that contains other resources in ORE, our model was:
> Large ore:isDescribedBy Abstract-ReM
> Large rdf:type ore:Aggregation
>
> Abstract-ReM rdf:type ore:Aggregation // this breaks the range of isDescribedBy
> Abstract-ReM ore:aggregates ReM-0
> Abstract-ReM ore:aggregates ReM-1
> ...
> Abstract-ReM ore:aggregates ReM-n
> And then add proxies with next/prev relationships so that the order of the aggregated resource map parts can be maintained, if necessary. [Remember that it's a graph, so in theory there is no order... but for paging to be useful, we need the order!]
> This information is then, of course, encoded in a resource map for Abstract-ReM, call it ReM-A
Well, if your batches are meaningful, or you need to preserve order, this makes sense. In my case, the batches are purely a technical solution to efficiently transport data between client and server. There is no reason why these batches should become part of the content model. I have a Faculty aggregation that aggregates a large number of publications; some sort of batch aggregation in the middle would only make the model more confusing.
> The reason that we went with this method is that it's otherwise impossible to distinguish pages of a resource map from multiple complete resource maps. For example, one might have 3 Turtle pages and 10 RDF/XML pages. In our scenario there would be two abstract resource maps (one for each serialization), rather than 13 separate ore:isDescribedBy links.
I'm not sure if I understand this correctly. For the batching strategy to work, you do not need to list all of the isDescribedBy links at once. You only need to list the first batch; once a client downloads that, it will find a link to the next batch. This is all a bit implicit, but you could add a dcterms:requires between the different batches:
large rdf:type ore:Aggregation
large ore:isDescribedBy ReM.rdf # the current ReM
large ore:isDescribedBy ReM.rdf?offset=100 # another ReM
ReM.rdf dcterms:requires ReM.rdf?offset=100
Once you download ?offset=100 you will find that it requires ?offset=200, which requires ?offset=300, etc.
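A minimal sketch of how a harvester could follow that chain, assuming a hypothetical fetch(uri) callable that returns the statements of one ReM as (subject, predicate, object) tuples. A real client would fetch and parse RDF over HTTP; here a dictionary with made-up publications stands in for the server.

```python
def harvest_batches(first_rem, fetch):
    """Download first_rem and every ReM reachable through dcterms:requires."""
    pending, downloaded, statements = [first_rem], set(), set()
    while pending:
        uri = pending.pop()
        if uri in downloaded:
            continue  # guards against loops in the requires chain
        downloaded.add(uri)
        for s, p, o in fetch(uri):
            statements.add((s, p, o))
            if s == uri and p == "dcterms:requires":
                pending.append(o)
    return downloaded, statements


# Hypothetical server content: three batches, shortened to one publication each.
BATCHES = {
    "ReM.rdf": [
        ("large", "ore:aggregates", "pub-1"),
        ("ReM.rdf", "dcterms:requires", "ReM.rdf?offset=100"),
    ],
    "ReM.rdf?offset=100": [
        ("large", "ore:aggregates", "pub-101"),
        ("ReM.rdf?offset=100", "dcterms:requires", "ReM.rdf?offset=200"),
    ],
    "ReM.rdf?offset=200": [
        ("large", "ore:aggregates", "pub-201"),
    ],
}

downloaded, statements = harvest_batches("ReM.rdf", BATCHES.__getitem__)
aggregated = sorted(o for s, p, o in statements if p == "ore:aggregates")
```

The loop stops by itself when the last batch contains no further dcterms:requires statement, and collecting statements in a set means duplicates across batches do no harm.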
I don't really see a problem with mixing multiple dc:formats in an Aggregation, the client can choose the best ReM it can find (one using a format it understands) and then start downloading all the ReMs in that format (dcterms:hasFormat can be used to indicate that this is the same data in another format).
Besides using dc:format and dcterms:requires, we could take this one step further and add a dcterms:temporal to a ReM to indicate that this ReM only deals with a certain period in time. Consider the following:
large ore:isDescribedBy ReM.rdf?since=yesterday
ReM.rdf?since=yesterday dcterms:temporal "start=2010-03-16T00:00:00Z, end=2010-03-17T00:00:00Z"^^dcterms:Period
By using dcterms:temporal the harvesting client could understand that this ReM only returns aggregations/statements modified on March 16. The harvesting client can choose to download only this ReM if it only needs an update from yesterday. Furthermore, since this period is a fixed interval in the past, the server only needs to determine what was modified once (at midnight on March 17) and can then cache this data for the next 24 hours. This makes the problem that it can be very expensive to determine the correct modification date of a ReM much more manageable.
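On the client side, a rough sketch of reading such a period and checking whether a ReM covers a given moment could look like this. The function names are made up, the parser only handles the start/end fields used in the example above, and treating the end as exclusive is my own assumption.

```python
from datetime import datetime, timezone


def parse_period(value):
    """Parse 'start=..., end=...' into a (start, end) pair of datetimes."""
    fields = {}
    # Accept both ',' and ';' as separators between the fields.
    for part in value.replace(";", ",").split(","):
        if "=" in part:
            key, _, raw = part.strip().partition("=")
            fields[key.strip()] = raw.strip()

    def to_datetime(text):
        # Older Python versions reject a trailing 'Z' in fromisoformat().
        return datetime.fromisoformat(text.replace("Z", "+00:00"))

    return to_datetime(fields["start"]), to_datetime(fields["end"])


def covers(period, moment):
    """True if moment falls inside the period (end treated as exclusive)."""
    start, end = parse_period(period)
    return start <= moment < end


period = "start=2010-03-16T00:00:00Z, end=2010-03-17T00:00:00Z"
noon = datetime(2010, 3, 16, 12, 0, tzinfo=timezone.utc)
next_midnight = datetime(2010, 3, 17, 0, 0, tzinfo=timezone.utc)
inside = covers(period, noon)
outside = covers(period, next_midnight)
```

A harvester that last ran before the start of the period would still need the full set of ReMs; the temporal ReM only helps when the client's gap fits inside the advertised window.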
Additionally, a ReM could also have dcterms:conformsTo predicates specifying which ontology is used to markup the aggregation:
To summarise, I would like to propose the following:
- A harvesting client can choose a more appropriate ReM after loading the initial authoritative ReM through the Aggregation URI. The client knows which ReMs are available because of the ore:isDescribedBy predicates. The choice of ReM is determined by inspecting what these different ReMs do. They might be in different formats (dcterms:format), use different metadata standards (dcterms:conformsTo) or only deal with specific periods (dcterms:temporal). Of course this is a completely optional step, and this can be extended with additional predicates that are only understood by specific harvesters.
- A harvesting client should download all available ReMs that make sense (no need to download a ReM in a dc:format the harvester does not understand, or something that dcterms:conformsTo an unknown ontology), to get a complete picture of an Aggregation. This allows the server to send the data in batches (through multiple ReMs). Additionally the server can use dcterms:requires predicates to make the batching more explicit, allowing harvesters that do understand these predicates to download more precisely what they need. Harvesters that do not have a clue what all these extra ReMs mean will simply download them all, which will probably result in duplicate statements (which in itself is not really a problem in RDF).
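The selection step in the two points above could be sketched roughly as follows. The candidate ReMs, their properties and the example ontology URI are invented for illustration, and keeping a ReM that declares nothing is just one possible policy (it matches the "download them all" fallback).

```python
def select_rems(candidates, formats, vocabularies):
    """Keep the ReMs whose declared format and vocabulary are understood.

    A ReM that does not declare a property is kept, since nothing rules it out.
    """
    chosen = []
    for uri, props in candidates.items():
        fmt = props.get("dcterms:format")
        vocab = props.get("dcterms:conformsTo")
        if fmt is not None and fmt not in formats:
            continue
        if vocab is not None and vocab not in vocabularies:
            continue
        chosen.append(uri)
    return sorted(chosen)


# Hypothetical ReMs found through ore:isDescribedBy, with their properties.
candidates = {
    "ReM.rdf": {"dcterms:format": "application/rdf+xml"},
    "ReM.ttl": {"dcterms:format": "text/turtle"},
    "ReM.rdf?offset=100": {
        "dcterms:format": "application/rdf+xml",
        "dcterms:conformsTo": "http://example.org/unknown-ontology",
    },
    "ReM.atom": {},
}

selected = select_rems(
    candidates,
    formats={"application/rdf+xml"},
    vocabularies={"http://purl.org/dc/terms/"},
)
```

In this made-up case the Turtle ReM and the ReM using an unknown ontology are skipped, while the ReM without any declarations is still downloaded.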
The nice thing about this is that ORE maps can scale from a single file on a webserver to a batching webservice that can be queried to only return incremental data changes. It's also an extendible model, that would work now, without having to change the ORE spec.
I hope this is all a bit cohesive, let me know what you think.
Just replying to myself here. I understand now that my approach would never work: the modification date of an aggregation has nothing to do with the modification dates of its aggregated resources, so you can't look at the modification date of a ReM to determine whether any of the aggregated resources have changed. I also see now that using proxies to implement batches is not that bad, and is probably the only correct way to implement batches that any agent would understand. To harvest ORE graphs you need something like a bot/spider that keeps refreshing data loaded from many URLs. I was looking for something more like PMH, which does batches and incremental harvesting and is more centralised, and which would make more sense for this specific use case.
regards, Jasper
On Mar 18, 4:27 pm, Jasper Op de Coul <jas...@infrae.com> wrote:
> On Mar 16, 5:26 pm, Robert Sanderson <azarot...@gmail.com> wrote:
> [...]