Hi Andreas!
Many thanks for your suggestions!
Generally, I'm not sure whether what I'm asking for is within the scope of ldspider.
Your suggestion to introduce the notion of a 'run' would turn ldspider,
currently a tool for one-shot custom crawls, into more of a continuously
running crawler, and I fear the logic required for that deserves more
careful planning than I can do as an outsider.
However, for my current needs, a small hack could do, and I think it makes sense to try it.
On Tuesday, May 7, 2013 12:24:34 AM UTC+2, aharth wrote:
> Hi,
> On 05/05/13 13:40, Florian Kleedorfer wrote:
> > First of all, thanks for making ldspider, it's great!
> >
> > I would like to use ldspider to keep an up-to-date index of all the
> > data in a distributed linked data application. Each server that is
> > part of the network maintains a page listing all its resources. The
> > application uses immutable data, so I only need to read each document
> > once. Currently I use a seed file containing the listing pages of all
> > known servers, and for simplicity I run ldspider from a looping shell
> > script, which of course downloads all the data on every run. That is
> > fine for the time being, but it needs a fix.
> >
> > My obvious question is: is there a way to make ldspider follow only
> > new URIs, i.e. to keep track of crawls over consecutive runs from the
> > command line?
> There is currently no such functionality. It would be great to have a
> notion of "run", and a way to pick up where the last one left off.
I suspect there may be a lot of state to save between runs if this is
done thoroughly. For my current purposes, it would probably be enough to
avoid re-downloading (and re-processing) resources that
a) have already been downloaded, and
b) are not expected to change.
Your hint about Expires headers and Squid below is excellent for this
purpose. We could collect such information in an HSQLDB database that is
persisted to disk at the end of a crawl, and the state could be read
back in at the beginning of the next run. If that's an approach you can
live with, I could find the time to implement this feature soon, and you
may want to add the data (frontier, etc.) needed to 'pick up where it
left off' later.
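To make that concrete, here is a minimal sketch of the store I have in
mind. None of it is existing ldspider code; the class name, table layout,
and JDBC settings are my own choices, assuming an embedded HSQLDB file
database on the classpath:

import java.sql.*;

/**
 * Sketch of a persistent "seen URIs" store backed by an embedded HSQLDB
 * file database. State written in one run is read back at the start of
 * the next, so already-fetched, non-expired resources can be skipped.
 */
public class SeenUriStore implements AutoCloseable {
    private final Connection conn;

    public SeenUriStore(String dbPath) throws SQLException {
        // "jdbc:hsqldb:file:..." persists the database to disk
        conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:" + dbPath, "SA", "");
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS seen_uris ("
                    + "uri VARCHAR(2048) PRIMARY KEY, "
                    + "fetched TIMESTAMP NOT NULL, "
                    + "expires TIMESTAMP)"); // NULL = immutable, never re-fetch
        }
    }

    /** True if the URI was already fetched and has not expired yet. */
    public boolean isFresh(String uri) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT expires FROM seen_uris WHERE uri = ?")) {
            ps.setString(1, uri);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return false; // never seen before
                Timestamp expires = rs.getTimestamp(1);
                return expires == null || expires.after(
                        new Timestamp(System.currentTimeMillis()));
            }
        }
    }

    /** Record a fetched URI together with its expiry date, if any. */
    public void markSeen(String uri, Timestamp expires) throws SQLException {
        try (PreparedStatement del = conn.prepareStatement(
                     "DELETE FROM seen_uris WHERE uri = ?");
             PreparedStatement ins = conn.prepareStatement(
                     "INSERT INTO seen_uris VALUES (?, NOW(), ?)")) {
            del.setString(1, uri);
            del.executeUpdate();
            ins.setString(1, uri);
            ins.setTimestamp(2, expires);
            ins.executeUpdate();
        }
    }

    @Override
    public void close() throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SHUTDOWN"); // flush state to the .script/.data files
        }
    }
}

The frontier would then skip any URI for which isFresh() returns true,
and the fetcher would call markSeen() with the parsed Expires header (or
null for immutable resources) after each successful download.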
> > Alternatively, is there a comparably simple but better approach?
> You could have correct "expires" headers on the files on the servers
> and have a Squid between the crawler and the server, so that the
> previously fetched URIs are served from the cache.
Good idea, but it's a little complicated to set up compared to just
running ldspider. Also, ldspider treats all downloaded data as new and
passes it to the sink, which is not what I want.
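That said, the expiry information itself is easy to get at on the client
side without Squid: java.net.URLConnection already parses the Expires
header, so the crawler could store it in the database directly. A quick
sketch (not ldspider code):

import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Timestamp;

// Sketch: fetch a resource and record when it expires, so a later run
// knows whether it needs to be re-downloaded at all.
public class ExpiryCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(args[0]).openConnection();
        // getExpiration() parses the Expires header; 0 means no header set
        long expiration = conn.getExpiration();
        Timestamp expires = expiration == 0 ? null : new Timestamp(expiration);
        System.out.println(args[0] + " expires: "
                + (expires == null ? "unknown" : expires));
        conn.disconnect();
    }
}

The value could go straight into the 'expires' column of the store
sketched above.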
> You could add a parameter "since" on the feed, which then only returns
> the new URIs since the date of the last crawl.
I'll probably do that, but the problem is not just the 'feeds': the data
structure is quite interlinked, with links between already-known
resources being added over time. During a later crawl, I expect to find
links to known resources, which I would prefer not to have to download
again.
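On my side that would just mean appending the date of the last
successful run to each listing URL, roughly like this; the parameter
name "since" and the ISO-8601 format are assumptions we'd still have to
agree on:

import java.net.URL;
import java.net.URLEncoder;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: build the feed URL for an incremental crawl. The "since"
// parameter and its date format are proposals, not an existing
// ldspider or server feature.
public class IncrementalFeedUrl {
    public static URL withSince(String feedUri, Date lastRun)
            throws Exception {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        String sep = feedUri.contains("?") ? "&" : "?";
        return new URL(feedUri + sep + "since="
                + URLEncoder.encode(iso.format(lastRun), "UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        // e.g. http://example.org/resources?since=2013-05-01T00%3A00%3A00Z
        System.out.println(withSince("http://example.org/resources",
                new Date(System.currentTimeMillis() - 7L * 24 * 3600 * 1000)));
    }
}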
> You could parse the seen URIs from the access.log and remove them from
> the list of files to crawl.
Sure, but I prefer the DB-based solution above.
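For the record, extracting the seen URIs would look roughly like the
following, assuming an Apache common log format; I'd still rather keep
all the crawl state in one place, though:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: collect request paths of successfully served resources from a
// common-log-format access.log, to subtract them from the seed list.
public class SeenFromAccessLog {
    // host ident user [date] "METHOD /path HTTP/x.y" status size
    private static final Pattern LINE = Pattern.compile(
            "\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"GET (\\S+) [^\"]*\" (\\d{3}) .*");

    public static Set<String> seenPaths(String logFile) throws Exception {
        Set<String> paths = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(logFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                // only count successful fetches (2xx status codes)
                if (m.matches() && m.group(2).startsWith("2")) {
                    paths.add(m.group(1));
                }
            }
        } finally {
            in.close();
        }
        return paths;
    }
}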
Thanks again,
Best regards,
Florian