wget warc one-file output


Alex Garnett

Jul 11, 2016, 4:38:47 PM
to Digital Curation
Hi folks,

Something's been puzzling me, and I'm not sure whether I've just missed it: when using wget to create WARCs with, e.g., the syntax at http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget, multiple people have referred to getting "a single WARC file," usually gzipped, containing all of the page data they fetched, e.g. here: http://inkdroid.org/2016/04/14/warc-work/.

I can't seem to figure out a way to do this -- no matter what parameters I pass to wget, my WARC gzip seems to contain only a single page with none of the site assets; the site assets wind up in a bunch of other directories created wherever I ran wget from. Am I missing something? There don't seem to be any WARC-specific parameters in wget that would affect this.

Here's my syntax:

wget -e robots=off -r -l 1 -p --waitretry 5 --timeout 60 --tries 5 --wait 1 --warc-file="example" http://example.com

This is with wget 1.18. Thanks!

Bertram Lyons

Jul 11, 2016, 4:56:43 PM
to digital-...@googlegroups.com
Hi Alex --

With wget 1.17.1, it works just fine. I get two types of output:

1) warc.gz (serialized recording of all HTTP transactions and content)
2) the standard wget folders/files, with the non-serialized files/code extracted from the target

I used the following test:

wget -e robots=off -r -l 1 -p --waitretry 5 --timeout 60 --tries 5 --wait 1 "https://enjangada.wordpress.com/" --warc-file="enjangada"
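
To check what actually ended up in the warc.gz from a run like this, here is a minimal Python sketch that walks the records and lists the URIs of the response records. It is not part of wget: the function names iter_warc_headers and summarize_warc are made up for illustration, and it assumes the usual WARC record layout (a version line, header lines, a blank line, then a payload of Content-Length bytes). It uses only the standard library; a dedicated WARC library would be more robust.

```python
import gzip
import sys
from collections import Counter


def iter_warc_headers(path):
    """Yield the header fields of each record in a gzipped WARC file.

    Hypothetical helper: assumes every record has a Content-Length header.
    """
    with gzip.open(path, "rb") as fh:  # also reads multi-member gzip files
        while True:
            line = fh.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers = {}
            while True:
                line = fh.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # blank line ends the header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Skip the payload so its bytes are never mistaken for headers.
            fh.read(int(headers.get("content-length", 0)))
            yield headers


def summarize_warc(path):
    """Return (record counts by WARC-Type, list of response target URIs)."""
    types = Counter()
    uris = []
    for headers in iter_warc_headers(path):
        rtype = headers.get("warc-type", "unknown")
        types[rtype] += 1
        if rtype == "response":
            # Some writers wrap the URI in angle brackets; strip them.
            uris.append(headers.get("warc-target-uri", "").strip("<>"))
    return types, uris


if __name__ == "__main__" and len(sys.argv) > 1:
    counts, response_uris = summarize_warc(sys.argv[1])
    for rtype, n in sorted(counts.items()):
        print(f"{rtype}: {n}")
    for uri in response_uris:
        print(uri)
```

Saved under a name of your choosing and pointed at enjangada.warc.gz, it should print one line per response record, so any page requisites that were fetched appear as their own URIs next to the front page. Images are stored as raw bytes inside their response records, so they won't look like readable text if you just decompress and page through the file.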

Best --

Bert



Alex Garnett

Jul 11, 2016, 6:20:27 PM
to Digital Curation
Thanks, Bertram, you're right! I was expecting to see binhexed images in the output, but I was using the wrong test cases.