[1]
http://captsolo.net/info/blog_a.php/2006/07/14/sioc_sparql_and_timeline
The post contains more information about the timeline (e.g., the scripts
used) and about problems encountered. One of the problems: once crawled,
SIOC data gets stale quickly. An obvious solution is incremental
crawling - downloading only the new data.
Now incremental crawling is available in our SIOC / RDF crawler [2].
Other features:
- can limit to the same domain (default:on)
- can exclude comments / replies (default:off)
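The same-domain limit could be implemented along these lines (a minimal sketch in Python; the `same_domain` function and example URLs are illustrative, not the crawler's actual code):

```python
from urllib.parse import urlparse

def same_domain(seed_url, candidate_url):
    """Return True if candidate_url is on the same host as the seed URL."""
    return urlparse(seed_url).netloc == urlparse(candidate_url).netloc

# With the same-domain limit on (the default), off-site links are skipped.
print(same_domain("http://example.org/blog/", "http://example.org/post/1"))  # True
print(same_domain("http://example.org/blog/", "http://other.net/post/2"))    # False
```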
How it works:
- run the crawler ( ./run ); its crawling results are saved to
'result.rdf'
- for incremental crawling copy result file 'result.rdf' into
'input.rdf'
- crawl again, and only new posts should be crawled
( incremental crawling is on by default, but it only has an effect if
'input.rdf' is present )
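The core of the incremental step can be sketched as follows (an illustrative Python snippet, not the crawler's actual code - in reality the seen posts would be the URIs parsed out of 'input.rdf'):

```python
def crawl_incremental(discovered_uris, seen_uris):
    """Return only the post URIs that were not crawled before."""
    return [uri for uri in discovered_uris if uri not in seen_uris]

# URIs already present in the previous result file (e.g. 'input.rdf'):
seen = {"http://example.org/post/1", "http://example.org/post/2"}
# URIs discovered on the current crawl:
found = ["http://example.org/post/1", "http://example.org/post/3"]
print(crawl_incremental(found, seen))  # ['http://example.org/post/3']
```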
Please try it out. :)
If you want to know more about how it works and what its
limitations are, please write or look at the code. Bugs can be recorded at:
http://esw.w3.org/topic/SIOC/ToDoList#crawler
[2]
http://sw.deri.org/svn/sw/2005/08/sioc/crawler/releases/crawler_v0.7.tar.gz
(requires Python and Redland)
Uldis