WARC Viewer & Heritrix


S.W.Schilke

Feb 18, 2010, 7:02:39 AM
to warc-tools
Dear all,

I think there should be an open implementation of a (standalone)
viewer for WARC files, one that would also allow a different archiving
system to be used to store these files. Such a viewer would make it
possible to view / browse single WARC files (and the pages stored in
them). I also see a need to "export" a single page with all of its
components, e.g., to prove how a web page looked at a certain point in
time (for legal reasons, historical research, etc.).
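
To make the "view / browse a single WARC file" idea a bit more concrete, here is a minimal sketch in Python of how such a viewer might list the records in a file. It assumes an uncompressed WARC 1.0 file and relies only on the basic record layout (a version line, named header fields, a blank line, then Content-Length bytes of content). The function name iter_warc_records and the helper build_record are made up for this illustration and are not part of warc-tools or Heritrix; the sample records are simplified and omit fields a real record would carry (such as WARC-Record-ID and WARC-Date).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; header names are lower-cased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        version = line.strip().decode("ascii")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {"version": version}
        # Read "Name: value" header lines until the blank line that ends them.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def build_record(warc_type, uri, body):
    """Build a simplified in-memory record so the sketch can run without a file."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    build_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + build_record("resource", "http://example.org/logo.png", b"\x89PNG")
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["warc-type"], headers["warc-target-uri"], len(payload))
```

For a real file, the same function could be called on open(path, "rb"). Gzip-compressed WARC files, malformed records, and the "export a page with all components" step are deliberately out of scope for this sketch.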

Speaking of Heritrix: I was reading the manual and I am having a little
trouble understanding how to set up a crawl job. My task is to archive
only certain pages in a crawl job, i.e., I want to give Heritrix a list
of URLs, each referring to a single page, and have those pages collected
together with all of their components (e.g., PDF files, images, ...).
Is there anybody here who could give me a hint or a sample job
definition?

Thank you and kind regards


mark williamson

Feb 18, 2010, 7:15:45 AM
to warc-...@googlegroups.com
The warc browser in the warc tools is designed to allow browsing of warcs created by Heritrix (or other tools), and it is entirely standalone.

I am not sure what you mean by using another archiving system to store these files, but the warc browser in warc tools is open source and you should find it straightforward to adapt it to storage other than a flat file system.

There is a new release due which will support the version 1.0 warcs made by Heritrix 3.0 - we are still testing this to make sure it doesn't break anything. 

I'm afraid things like export and page authenticity from warc are currently beyond the remit of the warc tools project. We (Hanzo Archives) do offer these facilities as part of our commercial offerings.

There are Heritrix mailing lists where you will find excellent support for Heritrix. Check out the Heritrix project pages for details.

-mark 







--
Mark Williamson | ma...@hanzoarchives.com | tel +44 7894 947 343
http://www.hanzoarchives.com
Hanzo provides web archiving products and services to companies whose e-discovery and/or compliance requirements encompass their web content.
Hanzo Archives Limited. Registered in England. Number: 5410483. Office: 64 Clifton Street, London, EC2A 4HB, U.K. VAT: 912 8708 19.

