PCDM Model for Web Archives


Eoin Kilfeather

Mar 8, 2018, 11:43:18 AM
to pc...@googlegroups.com
Hello,

We are beginning to look again at what to do with our backlog of WARC files and how to manage them beyond bit-level preservation. While we are very much interested in an approach that would integrate them into our Hydra-PCDM models, this might not be possible or desirable because of the way the legacy archive has been captured and delivered.

Our archive has built up over several years as WARCs were delivered by a succession of external contractors who undertook domain-level and targeted crawls on our behalf. In the past we haven't done much with these files, as the management and discovery interfaces were part of the web-archiving contract and our WARCs are browsable and discoverable through a web front end (OpenWayback) provided by the contractors. Consequently, the delivered files are not well structured and come in a variety of formats, i.e. ARCs and WARCs of different versions, with no clear correspondence between the files and their internal resources.

We have subsequently specified that we want our contractor to deliver at least the targeted crawls as a single WARC file per target hostname, which might then allow us to manage the data within our Hydra repository more easily. That said, the utility of doing so isn't immediately obvious or compelling. What would we gain from it? We could imagine a useful workflow where we could, for example, extract PDF resources and have these linked in the models.
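
To make the PDF-extraction idea concrete, here is a minimal sketch, not a tested workflow. It assumes an uncompressed WARC stream with simple, unfolded headers, and it uses only the Python standard library. The function names (iter_warc_records, extract_pdfs, make_record) are hypothetical and exist only for this illustration. It also assumes the HTTP payload is stored without chunked or compressed transfer encoding, which real crawls do not guarantee. For gzipped files the same functions could be given a stream from gzip.open, and for production use a dedicated WARC library would be a safer choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:20])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # empty line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def extract_pdfs(stream):
    """Return a list of (target_uri, pdf_bytes) for responses served as PDF."""
    found = []
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block holds the HTTP headers, a blank line, then the body.
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        is_pdf = any(
            h.lower().startswith(b"content-type:") and b"application/pdf" in h.lower()
            for h in http_head.split(b"\r\n")
        )
        if is_pdf:
            found.append((headers.get("warc-target-uri"), body))
    return found


def make_record(uri, content_type, payload):
    """Build a synthetic response record (illustration only, not crawl data)."""
    http = b"HTTP/1.1 200 OK\r\nContent-Type: " + content_type + b"\r\n\r\n" + payload
    head = (
        b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Target-URI: " + uri
        + b"\r\nContent-Length: " + str(len(http)).encode() + b"\r\n\r\n"
    )
    return head + http + b"\r\n\r\n"


# Two synthetic records: one PDF and one HTML page.
sample = (
    make_record(b"http://example.org/a.pdf", b"application/pdf", b"%PDF-1.4 demo")
    + make_record(b"http://example.org/", b"text/html", b"<html></html>")
)
pdfs = extract_pdfs(io.BytesIO(sample))
```

Each (target URI, bytes) pair returned here could then be deposited as a file and linked from whatever object represents the crawl; how that link is expressed in the PDCM models is left open, since that is the question being asked.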

I would like to know whether any other institutions are in a similar situation and how they are dealing with their "legacy" web archives.

Thanks,

Eoin.
Digital Collections
National Library of Ireland.