I think there should be an open implementation of a (standalone)
viewer for WARC files, which would also allow a different archiving
system to be used to store these files. In addition, it would make it
possible to view / browse single WARC files (i.e., the pages stored in
them). I also see the need to "export" a single page with all of its
components, e.g., to prove how a web page looked at a certain point in
time (for legal reasons, historical research, etc.).
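
To illustrate what I mean by browsing a single WARC file, here is a rough
Python sketch (standard library only) that walks the records of an
uncompressed WARC stream and lists the captured pages. The function names
iter_warc_records and list_captures are made up for this example, it
assumes every record carries a Content-Length header, and it ignores
folded (multi-line) header values, so please read it as an illustration
of the idea rather than a finished tool.

```python
def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are lower-cased.
    Assumption: no folded (continuation) header lines are present.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The record body is exactly Content-Length bytes long.
        length = int(headers["content-length"])
        payload = stream.read(length)
        yield headers, payload


def list_captures(stream):
    """Return (target URI, capture date) for every 'response' record."""
    return [
        (headers.get("warc-target-uri"), headers.get("warc-date"))
        for headers, _payload in iter_warc_records(stream)
        if headers.get("warc-type") == "response"
    ]
```

For a file on disk one would call list_captures(open(path, "rb")); for a
.warc.gz file, opening it with gzip.open(path, "rb") instead should work as
well, but I have not tested that against real crawl output.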
Speaking of Heritrix: I was reading the manual, and I am having a little
trouble understanding how to set up a crawl job. My task is to archive
only certain pages in a crawl job, i.e., I want to give Heritrix a list
of URLs, each referring to a single page, and have each page collected
including all of its components (e.g., PDF files, images, ...). Could
anybody here give me a hint or a sample job definition?
Thank you and kind regards