Harvesting large aggregations
Jasper Op de Coul  
From: Jasper Op de Coul <jas...@infrae.com>
Date: Tue, 16 Mar 2010 03:45:07 -0700 (PDT)
Local: Tues, Mar 16 2010 6:45 am
Subject: Harvesting large aggregations
Hi,

I recently added ORE support to the IR site of the Erasmus University.
One of the faculties involved asked if they could harvest all the
publications created by their faculty using ORE.
I tried to come up with a solution for this, which I will outline
below.

The first thing I did was to make the faculty an aggregation that
aggregates all of their publication aggregations. This was simple
enough, but I had a hard time figuring out what to use for the
modification date of the faculty aggregation.

The faculty holds several thousand publications that will be harvested
daily. It seemed a bit wasteful to let the harvest agent download
these publications all the time just to find out if they've been
modified, so I decided to include the modification date of the
publication res.map in the faculty graph, like so:

<..faculty/aggr/xyz> ore:aggregates <..publication/aggr/xyz>
<..publication/resm/xyz> ore:describes <..publication/aggr/xyz>
<..publication/resm/xyz> dct:modified "2010-01-01T00:00:00"

This could allow a harvester to determine if a res.map is modified
without having to download it first. Of course, this res.map is not
authoritative.
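
As a rough illustration of how a harvester might use these dct:modified
hints, here is a minimal Python sketch. The URIs, the dates and the
`last_harvested` bookkeeping are all made up for the example; a real
harvester would read the dates from the parsed faculty graph. Note that
this only tells the harvester whether the res.map itself is newer than
its last download.

```python
from datetime import datetime

# Hypothetical data: dct:modified values found in the faculty graph,
# keyed by publication resource map URI.
advertised = {
    "publication/resm/a": "2010-01-01T00:00:00",
    "publication/resm/b": "2010-03-15T12:30:00",
    "publication/resm/c": "2010-03-16T08:00:00",
}

# Hypothetical harvester state: when each resource map was last downloaded.
last_harvested = {
    "publication/resm/a": "2010-03-01T00:00:00",
    "publication/resm/b": "2010-03-01T00:00:00",
}


def needs_download(advertised, last_harvested):
    """Return resource maps that are new or modified since the last harvest."""
    stale = []
    for uri, modified in sorted(advertised.items()):
        previous = last_harvested.get(uri)
        if previous is None:
            stale.append(uri)  # never harvested before
        elif datetime.fromisoformat(modified) > datetime.fromisoformat(previous):
            stale.append(uri)  # advertised date is newer than our copy
    return stale


to_fetch = needs_download(advertised, last_harvested)
print(to_fetch)
```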

Generating the modification date of a res.map can be quite an
expensive operation, since aggregations can consist of many resources
and binary files that all have their own modification date.

Because of this, I added a batching strategy to the faculty
aggregation. It turns out that this can easily be implemented using
multiple resource maps. By default only the first 100 aggregations are
shown, together with the following statements:

<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz>
<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>

I think it makes sense for a harvester to first download all the
res.maps describing the aggregation, before it would actually start
processing the aggregation and its aggregated resources. So it would
also download the resmap with the offset, which would in turn contain
the following statements:

<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>
<..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=200>

This would trigger the harvester to download the next batch, and so
on, until there are no more resource maps to download.
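
The batch-following behaviour described above could look roughly like
this on the harvester side. This is only a sketch: triples are plain
Python tuples, `FAKE_SERVER` stands in for real HTTP fetches and RDF
parsing, and the publication URIs are invented for the example.

```python
ORE_DESCRIBED_BY = "ore:isDescribedBy"
ORE_AGGREGATES = "ore:aggregates"

AGGR = "faculty/aggr/xyz"

# Stand-in for the server: resource map URI -> triples it would return.
FAKE_SERVER = {
    "faculty/resm/xyz": [
        (AGGR, ORE_AGGREGATES, "publication/aggr/1"),
        (AGGR, ORE_DESCRIBED_BY, "faculty/resm/xyz"),
        (AGGR, ORE_DESCRIBED_BY, "faculty/resm/xyz?offset=100"),
    ],
    "faculty/resm/xyz?offset=100": [
        (AGGR, ORE_AGGREGATES, "publication/aggr/101"),
        (AGGR, ORE_DESCRIBED_BY, "faculty/resm/xyz?offset=100"),
        (AGGR, ORE_DESCRIBED_BY, "faculty/resm/xyz?offset=200"),
    ],
    "faculty/resm/xyz?offset=200": [
        (AGGR, ORE_AGGREGATES, "publication/aggr/201"),
        (AGGR, ORE_DESCRIBED_BY, "faculty/resm/xyz?offset=200"),
    ],
}


def harvest(first_resmap, fetch):
    """Download every resource map reachable via ore:isDescribedBy."""
    pending = [first_resmap]
    seen = set()
    aggregated = []
    while pending:
        uri = pending.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        for _subject, predicate, obj in fetch(uri):
            if predicate == ORE_DESCRIBED_BY and obj not in seen:
                pending.append(obj)  # another batch to download
            elif predicate == ORE_AGGREGATES and obj not in aggregated:
                aggregated.append(obj)
    return aggregated, seen


aggregated, resmaps = harvest("faculty/resm/xyz", FAKE_SERVER.__getitem__)
print(aggregated)
```

The `seen` set matters here because each batch also links back to
itself; without it the loop would never terminate.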

The only problem I'm having now is that I don't know what to use as
modification date on the faculty aggregation. Ideally it would be the
last modification time of any of the aggregated publication
aggregations. However, this is rather expensive to find out. For now I
made the faculty aggregation have the current time as modification
time, forcing a harvester to always download the whole aggregation.

Anyway, any feedback on all this would be greatly appreciated. You can
view the real ORE graphs here:

http://repub.eur.nl/resource/erim/rdf.xml (faculty)
http://repub.eur.nl/resource/pub_17919/rdf.xml (example article)

Kind regards,
Jasper

--
Jasper Op de Coul -- Infrae
t +31 10 243 7051 -- http://infrae.com
Hoevestraat 10 3033GC Rotterdam -- The Netherlands


 
Robert Sanderson  
From: Robert Sanderson <azarot...@gmail.com>
Date: Tue, 16 Mar 2010 09:26:54 -0700
Local: Tues, Mar 16 2010 12:26 pm
Subject: Re: Harvesting large aggregations

On Tue, Mar 16, 2010 at 3:45 AM, Jasper Op de Coul <jas...@infrae.com> wrote:

> I recently added ORE support to the IR site of the Erasmus University.

Great work Jasper :)  I saw you demonstrate the system at OAI6 last year in
Geneva, so I'm glad that it has become production code.

> [...] Because of this, I added a batching strategy to the faculty
> aggregation.
> It turns out that this can easily be implemented using multiple
> resource maps. By default only the first 100 aggregations are shown,
> together with the following statements:

> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz>
> <..faculty/aggr/xyz> ore:isDescribedBy <..faculty/resm/xyz?offset=100>

Ed Summers and I discussed this problem a couple of weeks ago at the Dev8D
[un]conference and came to a similar but not identical solution.  This was
admittedly in a pub, so criticism of the idea is very much appreciated :)
Also, please note that this is a problem for RDF in general, not just ORE!
If we can solve it neatly, it may become a pattern for other systems as
well.

If we have an aggregation (Large), which needs to be serialized into
multiple files, we thought that there should be one abstract resource map
that somehow contains the parts.  As we have a solution for an abstract
resource that contains other resources in ORE, our model was:

Large ore:isDescribedBy Abstract-ReM
Large rdf:type ore:Aggregation

Abstract-ReM rdf:type ore:Aggregation       // this breaks the range of isDescribedBy
Abstract-ReM ore:aggregates ReM-0
Abstract-ReM ore:aggregates ReM-1
...
Abstract-ReM ore:aggregates ReM-n

And then add proxies with next/prev relationships so that the order of the
aggregated resource map parts can be maintained, if necessary.  [Remember
that it's a graph, so in theory there is no order... but for paging to be
useful, we need the order!]

This information is then, of course, encoded in a resource map for
Abstract-ReM, call it ReM-A

Abstract-ReM ore:isDescribedBy ReM-A
ReM-A rdf:type ore:ResourceMap
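
To make the paging model above a little more concrete, here is a small
Python sketch that generates the triples for an abstract resource map
with ordered pages and then recovers the order from them. It is an
illustration only: the proxy URIs are invented, and "ex:next" is a
made-up placeholder for whatever next/prev predicate would be used,
since ORE itself does not define one.

```python
def paged_rem_triples(abstract_rem, pages):
    """Build triples for an abstract ReM aggregating ordered page ReMs."""
    triples = [(abstract_rem, "rdf:type", "ore:Aggregation")]
    proxies = []
    for index, page in enumerate(pages):
        proxy = f"{abstract_rem}/proxy/{index}"  # hypothetical proxy URI
        proxies.append(proxy)
        triples.append((abstract_rem, "ore:aggregates", page))
        triples.append((proxy, "ore:proxyFor", page))
        triples.append((proxy, "ore:proxyIn", abstract_rem))
    # Chain the proxies so the page order survives in an unordered graph.
    for current, following in zip(proxies, proxies[1:]):
        triples.append((current, "ex:next", following))  # placeholder predicate
    return triples


def ordered_pages(triples):
    """Recover page order by walking the ex:next chain from the first proxy."""
    proxy_for = {s: o for s, p, o in triples if p == "ore:proxyFor"}
    next_of = {s: o for s, p, o in triples if p == "ex:next"}
    has_previous = set(next_of.values())
    # The first proxy is the only one that nothing points to.
    current = next((p for p in proxy_for if p not in has_previous), None)
    order = []
    while current is not None:
        order.append(proxy_for[current])
        current = next_of.get(current)
    return order


triples = paged_rem_triples("Abstract-ReM", ["ReM-0", "ReM-1", "ReM-2"])
print(ordered_pages(list(reversed(triples))))
```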

The reason that we went with this method is that it's otherwise impossible
to distinguish pages of a resource map from multiple complete resource
maps.  For example, one might have 3 Turtle pages and 10 RDF/XML pages. In
our scenario there would be two abstract resource maps (one for each
serialization), rather than 13 separate ore:isDescribedBy links.

Rob


 
Andreas Aschenbrenner  
From: Andreas Aschenbrenner <aschenbren...@sub.uni-goettingen.de>
Date: Tue, 16 Mar 2010 20:20:05 +0100
Local: Tues, Mar 16 2010 3:20 pm
Subject: Re: Harvesting large aggregations
Dear Jasper,

Sounds like a fun challenge, and I'm very impressed by your approach.

However, I'm not sure whether I completely understand your objectives
and your design decisions. Please allow me this question: as far as I
understand it, you aim to expose a collection (xy) of aggregations
(ORE). You modeled this as an aggregation (ORE) of aggregations (ORE).
Did you consider formats other than ORE for the collection (xy),
though? Perhaps sitemaps (http://sitemaps.org/protocol.php) could
expose all the aggregations (ORE); with that the harvester can always
pick the most recent ones, etc.

Best wishes,
aA

On 16.03.2010 11:45, Jasper Op de Coul wrote:


 
Jasper Op de Coul  
From: Jasper Op de Coul <jas...@infrae.com>
Date: Thu, 18 Mar 2010 07:27:45 -0700 (PDT)
Local: Thurs, Mar 18 2010 10:27 am
Subject: Re: Harvesting large aggregations
Hi Rob,

On Mar 16, 5:26 pm, Robert Sanderson <azarot...@gmail.com> wrote:

Well, if your batches are meaningful, or you need to preserve order,
this makes sense. In my case, the batches are purely a technical
solution to efficiently transport data between client and server.
There is no reason why these batches should become part of the content
model. I have a Faculty aggregation that aggregates a large number of
publications; some sort of batch aggregation in the middle would only
make the model more confusing.

> The reason that we went with this method is that it's otherwise impossible
> to distinguish pages of a resource map from multiple complete resource
> maps.  For example, one might have 3 Turtle pages and 10 RDF/XML pages. In
> our scenario there would be two abstract resource maps (one for each
> serialization), rather than 13 separate ore:isDescribedBy links.

I'm not sure if I understand this correctly. For the batching strategy
to work, you do not need to list all of the isDescribedBy links at
once. You only need to list the first batch; once a client downloads
that, it will find a link to the next batch. This is all a bit
implicit, but you could add a dcterms:requires between the different
batches:

large rdf:type ore:Aggregation
large ore:isDescribedBy ReM.rdf                     # the current ReM
large ore:isDescribedBy ReM.rdf?offset=100    # another ReM
ReM.rdf dcterms:requires ReM.rdf?offset=100

Once you download ?offset=100 you will find that it requires
?offset=200, which requires ?offset=300, etc.

I don't really see a problem with mixing multiple dc:formats in an
Aggregation. The client can choose the best ReM it can find (one using
a format it understands) and then start downloading all the ReMs in
that format (dcterms:hasFormat can be used to indicate that this is
the same data in another format).

Besides using dc:format and dcterms:requires, we could take this one
step further and add a dcterms:temporal to a ReM to indicate that this
ReM only deals with a certain period in time. Consider the following:

large ore:isDescribedBy ReM.rdf?since=yesterday
ReM.rdf?since=yesterday dcterms:temporal "start=2010-03-16T00:00:00Z,
end=2010-03-17T00:00:00Z"^^dcterms:Period

By using dcterms:temporal the harvesting client could understand that
this ReM only returns aggregations/statements modified on March 16.
The harvesting client can choose to download only this ReM, if it only
needs an update from yesterday. Furthermore, since this period is a
fixed interval in the past, the server only needs to determine what
was modified once (at midnight, March 17) and can then cache this data
for the next 24 hours. This makes the problem that it can be very
expensive to determine the correct modification date of a ReM much
more manageable.
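
A harvester that wants to use such a period has to parse the literal
first. Below is a minimal Python sketch of that step; the function
names are my own, and the parser accepts both commas (as written above)
and semicolons between the components. The `covers_gap` check encodes
one possible policy: the incremental ReM is only sufficient if the
client's previous harvest falls inside the advertised period.

```python
import re
from datetime import datetime, timezone


def parse_period(value):
    """Parse a 'start=..., end=...' literal into two aware datetimes."""
    fields = {}
    for part in re.split(r"[;,]", value):
        if "=" in part:
            key, _, raw = part.partition("=")
            fields[key.strip()] = raw.strip()

    def to_datetime(text):
        # Rewrite the trailing 'Z' so fromisoformat also works on older Pythons.
        return datetime.fromisoformat(text.replace("Z", "+00:00"))

    return to_datetime(fields["start"]), to_datetime(fields["end"])


def covers_gap(period, last_harvest):
    """True if everything changed since last_harvest up to the period end
    is inside the period, so the incremental ReM is enough."""
    start, end = period
    return start <= last_harvest <= end


period = parse_period("start=2010-03-16T00:00:00Z, end=2010-03-17T00:00:00Z")

recent = datetime(2010, 3, 16, 6, 0, tzinfo=timezone.utc)
old = datetime(2010, 3, 14, 0, 0, tzinfo=timezone.utc)

print(covers_gap(period, recent))  # incremental ReM suffices
print(covers_gap(period, old))     # gap before the period: full harvest needed
```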

Additionally, a ReM could also have dcterms:conformsTo predicates
specifying which ontology is used to mark up the aggregation:

large ore:isDescribedBy ReM.rdf
ReM.rdf dcterms:conformsTo <http://purl.org/dc/terms/>

To summarise, I would like to propose the following:

- A harvesting client can choose a more appropriate ReM after loading
  the initial authoritative ReM through the Aggregation URI. The
  client knows which ReMs are available because of the
  ore:isDescribedBy predicates. The choice of ReM is determined by
  inspecting what these different ReMs do. They might be in different
  formats (dcterms:format), use different metadata standards
  (dcterms:conformsTo) or only deal with specific periods
  (dcterms:temporal). Of course this is a completely optional step,
  and it can be extended with additional predicates that are only
  understood by specific harvesters.
- A harvesting client should download all available ReMs that make
  sense (no need to download a ReM in a dc:format the harvester does
  not understand, or something that dcterms:conformsTo an unknown
  ontology), to get a complete picture of an Aggregation. This allows
  the server to send the data in batches (through multiple ReMs).
  Additionally, the server can use dcterms:requires predicates to make
  the batching more explicit, allowing harvesters that do understand
  these predicates to download more precisely what they need.
  Harvesters that do not have a clue what all these extra ReMs mean
  will simply download them all, which will probably result in
  duplicate statements (which in itself is not really a problem in
  RDF).
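
The selection step in this proposal could be sketched in Python as
follows. The candidate list, its dictionary layout and the media types
are assumptions made for the example; in practice this information
would come from the dcterms:format and dcterms:conformsTo statements in
the initial ReM. A ReM that declares nothing is kept, matching the
"download them all" fallback.

```python
# Hypothetical descriptions of the ReMs advertised via ore:isDescribedBy.
candidates = [
    {"uri": "ReM.rdf", "format": "application/rdf+xml",
     "conformsTo": "http://purl.org/dc/terms/"},
    {"uri": "ReM.rdf?offset=100", "format": "application/rdf+xml",
     "conformsTo": "http://purl.org/dc/terms/"},
    {"uri": "ReM.ttl", "format": "text/turtle",
     "conformsTo": "http://purl.org/dc/terms/"},
    {"uri": "ReM.atom", "format": "application/atom+xml"},
    {"uri": "ReM.unknown"},  # no hints at all
]


def select_rems(candidates, known_formats, known_vocabularies):
    """Keep ReMs whose declared format and vocabulary the harvester knows.

    ReMs without a declared format or vocabulary are kept, since the
    harvester has no reason to rule them out.
    """
    selected = []
    for rem in candidates:
        fmt = rem.get("format")
        vocabulary = rem.get("conformsTo")
        if fmt is not None and fmt not in known_formats:
            continue  # a format this harvester cannot parse
        if vocabulary is not None and vocabulary not in known_vocabularies:
            continue  # a metadata standard this harvester does not know
        selected.append(rem["uri"])
    return selected


chosen = select_rems(
    candidates,
    known_formats={"application/rdf+xml"},
    known_vocabularies={"http://purl.org/dc/terms/"},
)
print(chosen)
```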

The nice thing about this is that ORE maps can scale from a single
file on a webserver to a batching webservice that can be queried to
return only incremental data changes. It's also an extensible model
that would work now, without having to change the ORE spec.

I hope this is all reasonably coherent; let me know what you think.

regards,
Jasper


 
Jasper Op de Coul  
From: Jasper Op de Coul <jas...@infrae.com>
Date: Thu, 1 Apr 2010 06:30:12 -0700 (PDT)
Local: Thurs, Apr 1 2010 9:30 am
Subject: Re: Harvesting large aggregations
Hi,

Just replying to myself here. I understand now that my approach would
never work, since the modification date of an aggregation has nothing
to do with the modification date of its aggregated resources, so you
can't look at the modification date of a ReM to determine if any of
the aggregated resources have changed.
I also see now that using proxies to implement batches is not that
bad, and probably the only correct way to implement batches that any
agent would understand.
To harvest ORE graphs you need something like a bot/spider that keeps
refreshing data loaded from many URLs. I was looking for something
more like PMH that does batches and incremental harvesting and is more
centralised, which would make more sense for this specific use case.

regards,
Jasper

On Mar 18, 4:27 pm, Jasper Op de Coul <jas...@infrae.com> wrote:


 