Frequent crawling with same seed - download only delta?


Florian Kleedorfer

May 5, 2013, 4:40:32 PM
To: ldsp...@googlegroups.com
Hi,

First of all, thanks for making ldspider, it's great!

I would like to use ldspider to keep an up-to-date index of all the data in a distributed linked data application. Each server that is part of the network maintains a page listing all the resources. The application uses immutable data, so I only need to read each document once. Currently I use a seed file containing the listing pages of all known servers, and for simplicity I run ldspider from a looping shell script, which of course results in downloading all the data each time. That is fine for the time being, but it needs a fix.

My obvious question is: is there a way to make ldspider follow only new URIs, i.e., to keep track of crawls over consecutive runs from the command line?

Alternatively, is there a comparably simple but better approach?

best
Florian

Andreas Harth

May 6, 2013, 6:24:34 PM
To: ldsp...@googlegroups.com
Hi,
there is currently no such functionality. It would be great to have a
notion of a "run" and a way to pick up where it left off.

> Alternatively, is there a comparably simple but better approach?

You could have correct "expires" headers on the files on the servers
and have a Squid between the crawler and the server, so that the
previously fetched URIs are served from the cache.

You could add a parameter "since" on the feed, which then only returns
the new URIs since the date of the last crawl.

You could parse the seen URIs from the access.log and remove them from
the list of files to crawl.
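
For instance, something along these lines (a minimal sketch; the file names, the Apache-style log format and the base URI are just assumptions, not anything LDSpider provides):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

/*
 * Sketch of the access.log idea: collect the request paths that already
 * appear in an Apache-style access.log and drop them from the seed list,
 * so the next run only gets URIs that have not been fetched yet.
 * File names, log format and the server base URI are assumptions.
 */
public class SeedFilter {

    private static final Pattern REQUEST = Pattern.compile("\"GET ([^ ]+) HTTP");

    public static void main(String[] args) throws IOException {
        String base = "http://example.org";   // hypothetical base URI of the server

        // Paths of all requests recorded in the log, turned into absolute URIs.
        Set<String> seen = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("access.log"),
                                              StandardCharsets.ISO_8859_1)) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()) {
                seen.add(base + m.group(1));
            }
        }

        // Keep only the seed URIs that do not appear in the log yet.
        List<String> remaining = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("seed.txt"),
                                              StandardCharsets.UTF_8)) {
            String uri = line.trim();
            if (!uri.isEmpty() && !seen.contains(uri)) {
                remaining.add(uri);
            }
        }
        Files.write(Paths.get("seed-new.txt"), remaining, StandardCharsets.UTF_8);
    }
}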

Best regards,
Andreas.

Florian Kleedorfer

May 9, 2013, 3:42:06 PM
To: ldsp...@googlegroups.com
Hi Andreas!

Many thanks for your suggestions!
Generally, I'm not sure whether what I'm asking for is within the scope of ldspider. With your suggestion to introduce the notion of a 'run', ldspider, currently a tool for performing a one-shot custom crawl, would become more of a continuously running crawler, and I fear the logic required for that deserves more careful planning than I can do as an outsider.
However, for my current needs a small hack could do, and I think it makes sense to try it.


On Tuesday, May 7, 2013 12:24:34 AM UTC+2, aharth wrote:
Hi,

On 05/05/13 13:40, Florian Kleedorfer wrote:
> First of all, thanks for making ldspider, it's great!
>
> I would like to use ldspider for keeping an up to date index of all the
> data in a distributed linked data application. Each server that's part
> of the network maintains a page listing all the resources. This
> application uses immutable data so I only need to read each document
> once. Currently I use a seed file containing the listing pages of all
> known servers, and I run ldspider from a looping shell script for
> simplicity, which of course results in downloading all the data each
> time.. fine for the time being, but needs a fix.
>
> My obvious question is: is there a way to make ldspider follow only new
> URIs, i.e. keep track of crawls over consecutive runs from the command-line?

> there is currently no such functionality. It would be great to have a
> notion of a "run" and a way to pick up where it left off.

I suspect there may be a lot of state to save between runs if this is done thoroughly. For my current purposes, it would probably be enough to avoid re-downloading (and re-processing) resources that
a) have been downloaded already, and
b) are not expected to change.
Your hint at Expires headers and Squid below is excellent for this purpose. We could collect such information in an HSQLDB database that is persisted to disk at the end of a crawl, and the state could be read back in at the beginning of the next run.
If that's an approach you can live with, I could find the time to implement this feature soon, and you may want to add the data (frontier, etc.) needed to 'pick up where it left off' later.
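
Roughly, something like the following sketch is what I have in mind: an HSQLDB file database that records each fetched URI together with an optional expiry date. All class, table and column names are placeholders, and how HSQLDB reports duplicate keys would still need to be verified.

import java.sql.*;

/*
 * Sketch of a seen-URI store backed by an HSQLDB file database, persisted to
 * disk between crawler runs. A URI is skipped if it was fetched before and its
 * recorded expiry date (e.g. taken from the Expires header) has not passed.
 * Names are made up; error handling is kept minimal.
 */
public class SeenUriStore implements AutoCloseable {

    private final Connection con;

    public SeenUriStore(String dbPath) throws SQLException {
        // "jdbc:hsqldb:file:..." keeps the database on disk between runs.
        con = DriverManager.getConnection("jdbc:hsqldb:file:" + dbPath, "SA", "");
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE seen_uri (uri VARCHAR(2048) PRIMARY KEY, expires TIMESTAMP)");
        } catch (SQLException tableAlreadyExists) {
            // table was created by an earlier run
        }
    }

    /** True if the URI was fetched before and has not expired yet. */
    public synchronized boolean shouldSkip(String uri) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT expires FROM seen_uri WHERE uri = ?")) {
            ps.setString(1, uri);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return false;                       // never seen
                }
                Timestamp expires = rs.getTimestamp(1);
                // no expiry recorded = immutable resource, never re-fetch
                return expires == null
                        || expires.after(new Timestamp(System.currentTimeMillis()));
            }
        }
    }

    /** Records a fetched URI together with its (optional) expiry date. */
    public synchronized void record(String uri, Timestamp expires) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO seen_uri (uri, expires) VALUES (?, ?)")) {
            ps.setString(1, uri);
            ps.setTimestamp(2, expires);
            ps.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException alreadyRecorded) {
            // assuming HSQLDB reports a duplicate key this way; nothing to do
        }
    }

    @Override
    public void close() throws SQLException {
        // SHUTDOWN flushes the in-process HSQLDB database cleanly to disk.
        try (Statement st = con.createStatement()) {
            st.execute("SHUTDOWN");
        }
        con.close();
    }
}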


> Alternatively, is there a comparably simple but better approach?

> You could have correct "expires" headers on the files on the servers
> and have a Squid between the crawler and the server, so that the
> previously fetched URIs are served from the cache.

Good idea... but it's a little complicated to set up compared to just running ldspider. Also, ldspider treats all downloaded data as new and passes it to the sink, which is not what I want.

> You could add a parameter "since" on the feed, which then only returns
> the new URIs since the date of the last crawl.

I'll probably do that, but the problem is not only the 'feeds': the data structure is quite interlinked, with links between already-known resources being added over time. During a later crawl, I expect to find links to known resources, which I would rather not have to download again.


> You could parse the seen URIs from the access.log and remove them from
> the list of files to crawl.

Sure, but I prefer the DB-based solution above.

Thanks again,
Best regards,
Florian

Andreas Harth

May 9, 2013, 4:54:33 PM
To: ldsp...@googlegroups.com
Hi Florian,
OK, sounds good. Not sure though whether you'd need a database: a
simple hashtable should do the trick equally well, without adding
another dependency. And reading RDF (or Nx) is very easy with
the NxParser.
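
Something like this minimal sketch, for example: an in-memory set loaded from and written back to a plain text file with one URI per line (file name, class name and format are just assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

/*
 * Minimal seen-URI set kept in memory and persisted to a plain text file
 * (one URI per line), avoiding the database dependency. Names are made up.
 */
public class SeenUris {

    private final Set<String> seen = new HashSet<>();
    private final Path file;

    public SeenUris(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            // State from the previous run: one URI per line.
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /** Returns true if the URI has not been seen in this or any earlier run. */
    public boolean isNew(String uri) {
        return seen.add(uri);
    }

    /** Writes the complete set back to disk at the end of the run. */
    public void save() throws IOException {
        Files.write(file, new TreeSet<>(seen), StandardCharsets.UTF_8);
    }
}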

Cheers,
Andreas.

Florian Kleedorfer

May 10, 2013, 3:09:02 AM
To: ldsp...@googlegroups.com

It sure is easier to implement that way. I only fear ugly performance degradation for biggish crawls, both in terms of memory consumption and of disk I/O at startup and shutdown.
But since you prefer it and it'll be faster for me to code, I'll take that route. We can adapt it later if needed.
 
> And reading RDF (or Nx) is very easy with
> the NxParser.

I don't think I'll need to parse any RDF for this task... or do I? Anyway, thanks for the hint!
 


Jürgen Umbrich

May 10, 2013, 3:25:58 AM
To: ldsp...@googlegroups.com
The IRLBot paper of 2008 might be of relevance to your task.





Florian Kleedorfer

May 10, 2013, 5:56:14 PM
To: ldsp...@googlegroups.com
Very much indeed! Thanks for the hint!

Jürgen Umbrich

May 10, 2013, 6:00:51 PM
To: ldsp...@googlegroups.com
Great that it is of interest. It would be very interesting to have something like the DRUM data structure or BEAST for LDSpider ;)


On 10 May 2013 22:56, Florian Kleedorfer <florian.k...@gmail.com> wrote:
> Very much indeed! Thanks for the hint!

Florian Kleedorfer

May 15, 2013, 10:28:46 PM
To: ldsp...@googlegroups.com


On Friday, May 10, 2013 7:00:51 PM UTC-3, juum wrote:
> Great that it is of interest. It would be very interesting to have something like the DRUM data structure or BEAST for LDSpider ;)

I'm afraid that's a little beyond my time budget at the moment; the vanilla solution does well enough for now. When we start doing bigger crawls, we'll consider it, though. Thanks again for the hint!
 