arc file path in JSON format output of wayback machine timemap


cheyrn

Nov 16, 2018, 7:57:45 AM
to Memento Development
Output from the timemap looks like this:

["com,moonation)/","20010410213828","http://moonation.com:80/","text/html","302","DMWUOF3K3DNGT3ZRZVO3O7EW2SZBFPZD","-","-","324","8836085","DE_crawl4.20010408061427-c/DE_crawl4.20010410213755.arc.gz"]

Does the path DE_crawl4.20010408061427-c/DE_crawl4.20010410213755.arc.gz indicate a resource that can be retrieved? How does one retrieve it?

Kendall

Michael Nelson

Nov 16, 2018, 10:11:05 AM
to 'cheyrn' via Memento Development

Hi Kendall,

That's a CDX file entry:

https://archive.org/web/researcher/cdx_file_format.php

and the ARC file (now more commonly the WARC file) is a tar-like container
format that combines multiple web resources into a single file; see:

https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
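To make the entry above easier to read, here is a minimal Python sketch that maps the JSON array onto named fields. The field names are an assumption: they follow the header row that the Wayback Machine's JSON output usually carries, and parse_cdx_row is a hypothetical helper, not part of any library. Under that assumption, the last three values are the compressed record length, the byte offset of the record, and the name of the ARC file that holds it.

```python
import json

# Assumed field order, taken from the header row usually seen in the
# Wayback Machine's JSON timemap/CDX output. Not stated in this thread.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]


def parse_cdx_row(line):
    """Map one JSON-array CDX row onto named fields (hypothetical helper)."""
    values = json.loads(line)
    if len(values) != len(CDX_FIELDS):
        raise ValueError(
            f"expected {len(CDX_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(CDX_FIELDS, values))


# The row quoted in the original question.
row = (
    '["com,moonation)/","20010410213828","http://moonation.com:80/",'
    '"text/html","302","DMWUOF3K3DNGT3ZRZVO3O7EW2SZBFPZD","-","-",'
    '"324","8836085",'
    '"DE_crawl4.20010408061427-c/DE_crawl4.20010410213755.arc.gz"]'
)

entry = parse_cdx_row(row)
print(entry["filename"], entry["offset"], entry["length"])
```

All values stay strings, as they are in the JSON; convert offset and length with int() if you need to do arithmetic on them.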

In general, ARC/WARC files are not available for download by end users.
They are typically large (~100MB) and contain many records beyond what is
needed to replay a single page.

see:

https://iipc.github.io/warc-specifications/guidelines/warc-implementation-guidelines/
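Since the ARC file itself is generally not downloadable, the usual way to get at a single capture is to replay it through the Wayback Machine. A small Python sketch, assuming the public replay URL pattern of web.archive.org (base URL, then the 14-digit timestamp, then the original URL); replay_url is a hypothetical helper name:

```python
def replay_url(timestamp, original, base="https://web.archive.org/web"):
    """Build a Wayback Machine replay URL for one capture.

    Assumes the public pattern <base>/<timestamp>/<original-url>;
    the base URL is an assumption, not something the CDX entry provides.
    """
    return f"{base}/{timestamp}/{original}"


# Timestamp and original URL taken from the CDX entry in the question.
url = replay_url("20010410213828", "http://moonation.com:80/")
print(url)
```

The capture in the question has status 302, so replaying it will most likely lead to a redirect rather than page content.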

regards,

Michael

----
Michael L. Nelson m...@cs.odu.edu http://www.cs.odu.edu/~mln/
Dept of Computer Science, Old Dominion University, Norfolk VA 23529
+1 757 683 6393 +1 757 683 4900 (f)