PCDM Model for Web Archives

Ilya Kreymer

Aug 25, 2017, 4:26:20 PM
to pc...@googlegroups.com
Hello PCDM Community,

I’m writing to let you all know that the Webrecorder team is interested in creating a PCDM model and profile for web archive data. (Webrecorder, hosted at https://webrecorder.io/, is a free and open-source tool for creating high-fidelity web archives of any web page.) Our goal is to facilitate the preservation of web archive objects in Fedora, and longer term, potentially expose web archive data to other Fedora-based repositories.

I’ve chatted with Nick Ruest a bit about this project, and we’ve presented this idea on the Fedora tech call, but wanted to share this goal with the official PCDM list and to gauge interest in this work from the community. We’ve already experimented with writing WARC files to Fedora and providing replay access directly from Fedora-stored WARCs with promising results, but there is not yet a linked data model for the entirety of web archive data and metadata.

The intent will be to describe all objects associated with web archives, such as files (WARCs, indexes), crawls/recordings, and collections of crawls/recordings, as well as entry point URLs, for example seeds or bookmarks. While our initial use case will be with Webrecorder, we would like the data model to apply to any web archiving use case. It may also be possible to describe individual URLs stored in web archives, and to enable them to be linked from other objects.
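
For illustration only, here is a minimal Python sketch of how the objects listed above might map onto PCDM. It is not a proposed profile. The types pcdm:Collection, pcdm:Object and pcdm:File and the predicates pcdm:hasMember and pcdm:hasFile come from PCDM itself; the class names `Recording` and `WebArchiveCollection`, the `seed_urls` attribute, and the example URIs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PCDM = "http://pcdm.org/models#"  # PCDM namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class ArchiveFile:
    """A stored file such as a WARC or an index (a pcdm:File)."""
    uri: str
    media_type: str


@dataclass
class Recording:
    """One crawl or recording session (a pcdm:Object). Hypothetical name."""
    uri: str
    # Entry point URLs (seeds or bookmarks). No predicate is assumed for
    # these, so they are kept as plain data and not emitted as triples.
    seed_urls: List[str] = field(default_factory=list)
    files: List[ArchiveFile] = field(default_factory=list)


@dataclass
class WebArchiveCollection:
    """A collection of recordings (a pcdm:Collection). Hypothetical name."""
    uri: str
    members: List[Recording] = field(default_factory=list)


def to_triples(coll: WebArchiveCollection) -> List[Tuple[str, str, str]]:
    """Flatten the structure into (subject, predicate, object) triples."""
    triples = [(coll.uri, RDF_TYPE, PCDM + "Collection")]
    for rec in coll.members:
        triples.append((coll.uri, PCDM + "hasMember", rec.uri))
        triples.append((rec.uri, RDF_TYPE, PCDM + "Object"))
        for f in rec.files:
            triples.append((rec.uri, PCDM + "hasFile", f.uri))
            triples.append((f.uri, RDF_TYPE, PCDM + "File"))
    return triples


# Example with made-up URIs.
example = WebArchiveCollection(
    uri="http://example.org/coll/1",
    members=[
        Recording(
            uri="http://example.org/rec/1",
            seed_urls=["http://example.com/"],
            files=[ArchiveFile("http://example.org/rec/1/data.warc.gz",
                               "application/warc")],
        )
    ],
)

for triple in to_triples(example):
    print(triple)
```

How entry point URLs and individual archived URLs should be expressed is left open in this sketch, since that is one of the questions the modeling work would need to answer.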

The first phase of this effort will be to fully include all web archive objects via PCDM which Webrecorder will manage on its own. As a later phase, we also hope that this effort can lead to an eventual way to share web archives stored in Fedora with other repositories, for example, by providing UI/discovery level links from Islandora or Samvera-based repositories to specific web archive bookmarks.

I wanted to ask if anyone here would be interested in being more involved with this process. 

We’d like to officially start on this project early next year, but wanted to reach out to the group now in case anyone has any questions, concerns or suggestions. We’d be happy to help organize a call as well to discuss this more if there is interest. Thanks for your consideration, and we look forward to collaborating with the PCDM community!

Thank you,
Ilya
Webrecorder Lead Developer, Rhizome

Joshua Westgard

Aug 28, 2017, 10:09:38 AM
to PCDM
Dear Ilya,

I personally think this is a great idea, just the sort of thing that PCDM is meant to facilitate. While we at University of Maryland do not currently store our WARCs in Fedora, eventually that is something we will most probably want to do, and so the prospect of developing a common approach to modeling such content using PCDM sounds like a positive development. Thanks for bringing it up. If there is critical mass for a call I would be happy to participate.

Best,
Josh Westgard
University of Maryland Libraries

Christina Harlow

Sep 17, 2017, 3:27:55 PM
to PCDM
Hi all-

Just an idea to throw out there. Seems like a few folks are working on PCDM models for web archiving artifacts (I know I have at my current job), so perhaps it would be a good thing to form a small working group for some determined amount of time (6 months?) around this. Would be a way to get some more PCDM community docs out there where we can work independently of the various implementations, and I wouldn't mind having a small group of folks to work through these models + ideas with, share some specs, etc.

Let me know what y'all think; we could write up a proposal, questions, scope, meeting schedule + send out invites fairly quickly (I have - like others here I'm sure - basically a template for doing that now).

Thanks!
C

Nick Ruest

Sep 18, 2017, 3:45:11 PM
to pc...@googlegroups.com
I think a working group is a great idea!

Should we coordinate around Ilya's timeline? Look to start in the new
year? Or, start before then?

-nruest

Christina Harlow

Sep 18, 2017, 4:07:41 PM
to pc...@googlegroups.com
Why don’t we have a prelim call sometime next month to just gauge interest, timing, etc.?

C

Joshua Allan Westgard

Sep 18, 2017, 4:47:20 PM
to pc...@googlegroups.com
+1 to a prelim call

Ilya Kreymer

Sep 18, 2017, 8:47:05 PM
to pc...@googlegroups.com
Hi everyone,

Great, yes a prelim call around mid-October sounds like a great idea to start discussion and plan for work next year. If anyone else is interested, feel free to respond to this thread or just to me at: ilya.kreymer[at]rhizome.org so we can get an initial count. We'll announce the call on this list if anyone wants to join later.

Thanks, and looking forward to chatting about web archives and PCDM in the near future!
Ilya


Christina Harlow

Sep 18, 2017, 11:48:11 PM
to pc...@googlegroups.com
Why don’t we just set up an open doodle to be filled in by anyone interested? Like this guy: http://doodle.com/poll/87y9ices2xdtt764 with date options from Oct 9 through Oct 19 and times that are on the cusp of goodish for US Pacific to Western Europe. People can fill it in by Oct 1st, say, and everyone should feel empowered to share this out?

I wonder if we can also just set up some space in the PCDM community wiki for the meeting / agenda / notes. I’m happy to make a blurb page for sharing that out.

That way, we can just coordinate the group using community resources + lists.

Sounds good, Ilya?
Christina

Ilya Kreymer

Sep 19, 2017, 2:17:05 AM
to pc...@googlegroups.com
Sounds great! Thanks for setting up the Doodle already, makes things even simpler. Feel free to share this out to whoever else might be inclined to join. Space for agenda/notes on the wiki sounds like a great plan as well.

Thanks,
Ilya

Dragan Espenschied

Sep 19, 2017, 2:17:06 AM
to PCDM
Same here! Shall we try to find a date via doodle?

anna.p...@rhizome.org

Sep 19, 2017, 10:51:59 AM
to PCDM
Greetings,

Many thanks, Christina, for giving this work some serious momentum! Also thanks to Josh and Nick for your interest. 

I agree a blurb page/space for coordinating work and a Doodle poll will be very useful. I have set up this poll that reflects some scheduling limitations of the Webrecorder team: http://doodle.com/poll/9a95agwwexx3uecx. Hopefully within these 19 options we can find a time in the range of 10/9-20 to talk. All, please share this link as you see fit.

The Webrecorder team, myself included, will also be happy to contribute to the shared workspace as things move forward (thanks again, Christina, for offering to set that up). Regarding logistics moving forward, please feel welcome to contact me now that I have returned to the office. Ilya and Dragan will lead on some content discussions, but I'll most likely be managing this work if there is community interest in pursuing this project.

Thanks for your consideration and insights.

All best,
Anna

Christina Harlow

Sep 19, 2017, 10:14:00 PM
to pc...@googlegroups.com
Here’s a blurb to share out (or add to) wrt this proposed working group and the kick off meeting: https://github.com/duraspace/pcdm/wiki/PCDM-Web-Archiving-WG

Please do share out to other possible interested parties. Note: it has the link to the canonical doodle poll, since two got generated: http://doodle.com/poll/87y9ices2xdtt764 and please do fill it out by October 1.

Anna, we can coordinate on setting up the call information and kick off meeting just after Oct 1.

Thanks all,

C

Joshua Allan Westgard

Sep 19, 2017, 11:06:44 PM
to pc...@googlegroups.com
I'm really hesitant to further complicate the doodle polling, but can I suggest that we have a shorter timeframe for the poll? My calendar after Oct. 1 -- when we're proposing to lock in this meeting -- is bound to be totally different than it is right now.

Josh

Eoin Kilfeather

Sep 25, 2017, 10:46:44 AM
to PCDM

Hi all,

I just participated in the poll. We are at a very early stage of looking at this issue and would love to hear thoughts on how to manage it. One issue we are facing is large top-level domain crawls (possibly thousands of WARCs and sites). These have historically been delivered in an unstructured way, with the expectation that the CDX indexer will sort it all out and present the playback. This isn't, of course, how we generally deal with our bread-and-butter preservation concerns, i.e. TIFFs and associated MARC metadata. So we have been scratching our heads a bit on what the best approach is.

All the best,

Eoin Kilfeather
National Library of Ireland.

Christina Harlow

Sep 25, 2017, 11:55:50 AM
to pc...@googlegroups.com
+1 Thanks Eoin!


ikre...@gmail.com

Oct 2, 2017, 9:22:00 PM
to PCDM
Hi,

Just one last reminder to fill out the Doodle for the call before a time is chosen: http://doodle.com/poll/87y9ices2xdtt764

Look forward to discussing PCDM and Web archives with everyone soon!

Ilya

Christina Harlow

Oct 7, 2017, 7:59:38 PM
to PCDM
Hi all-

The Doodle Goddesses have spoken.

We'll be having a kick off / working group initiation call on Thursday, October 12, at Noon Eastern / 9 AM Pacific (https://www.timeanddate.com/worldclock/fixedtime.html?msg=PCDM+Web+Archiving+Modeling+Kick+Off&iso=20171012T09&p1=224&ah=1).

The open agenda is here: https://github.com/duraspace/pcdm/wiki/PCDM-Web-Archiving-WG PLEASE PLEASE PLEASE put your name, ideas, etc. there if you are interested in taking part (especially if you cannot make this kick off meeting).

Best,
Christina

Christina Harlow

Oct 12, 2017, 12:00:00 PM
to PCDM
Hi all-

Friendly reminder this is happening now. 

Zoom info: (link https://stanford.zoom.us/j/612948313)

Join from PC, Mac, Linux, iOS or Android: https://stanford.zoom.us/j/612948313
Or iPhone one-tap (US Toll): +18333021536,,612948313# or +16507249799,,612948313#
Or Telephone:
    Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free)
          
    Meeting ID: 612 948 313
    International numbers available: https://stanford.zoom.us/zoomconference?m=dFq7LOtQld4H72QnLrY4oIjzZNmjTfbO

    Meeting ID: 612 948 313
    SIP: 6129...@zoomcrc.com

Best,
Christina