[1]
http://captsolo.net/info/blog_a.php/2006/07/14/sioc_sparql_and_timeline
The post contains more information about the timeline (e.g., the scripts
used) and about problems encountered. One of the problems: once crawled,
SIOC data gets stale quickly. An obvious solution is incremental
crawling - downloading only the new data.
Now incremental crawling is available in our SIOC / RDF crawler [2].
Other features:
- can limit to the same domain (default:on)
- can exclude comments / replies (default:off)
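The same-domain limit could be implemented along these lines (a minimal sketch in Python; the `same_domain` function and example URLs are illustrative, not the crawler's actual code):

```python
from urllib.parse import urlparse

def same_domain(seed_url, candidate_url):
    """Return True if candidate_url is on the same host as the seed URL."""
    return urlparse(seed_url).netloc == urlparse(candidate_url).netloc

# With the same-domain limit on (the default), off-site links are skipped.
print(same_domain("http://example.org/blog/", "http://example.org/post/1"))  # True
print(same_domain("http://example.org/blog/", "http://other.net/post/2"))    # False
```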
How it works:
- run the crawler ( ./run ); its crawling results are saved to
'result.rdf'
- for incremental crawling copy result file 'result.rdf' into
'input.rdf'
- crawl again, and only new posts should be crawled
( incremental crawling is on by default, but it only has an effect if
'input.rdf' is present )
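The core of the incremental step can be sketched as follows (an illustrative Python snippet, not the crawler's actual code - in reality the seen posts would be the URIs parsed out of 'input.rdf'):

```python
def crawl_incremental(discovered_uris, seen_uris):
    """Return only the post URIs that were not crawled before."""
    return [uri for uri in discovered_uris if uri not in seen_uris]

# URIs already present in the previous result file (e.g. 'input.rdf'):
seen = {"http://example.org/post/1", "http://example.org/post/2"}
# URIs discovered on the current crawl:
found = ["http://example.org/post/1", "http://example.org/post/3"]
print(crawl_incremental(found, seen))  # ['http://example.org/post/3']
```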
Please try it out. :)
If you want to know more about how it works and what its
limitations are, please write or look at the code. Bugs can be recorded at:
http://esw.w3.org/topic/SIOC/ToDoList#crawler
[2]
http://sw.deri.org/svn/sw/2005/08/sioc/crawler/releases/crawler_v0.7.tar.gz
(requires Python and Redland)
Uldis