PCDM Model for Newspapers and Journals

Michael Bolton

Sep 8, 2017, 10:54:53 AM

to PCDM

Hello all,

Ilya's recent post on a model for Web Archives prompted me to check the status of other PCDM models, in particular, newspapers and journals.

I read the wiki entry on PCDM Mapping for Welsh Newspapers and noted it was last modified on April 2, 2015. It also stated it was a first attempt at creating the mapping for PCDM and IIIF. We are looking at doing some work with journals and conference proceedings and would be interested in hearing any updates on the project. From what I have read, I think this is an interesting project and it looks like the model is fairly complete. I also checked out the Welsh Journals web site, as well as the newspaper site, and I like what I see.

I would be interested in hearing more about the project. Is it still a current project, and is there interest in moving it beyond a beta?

Thanks

Michael W. Bolton

Joshua Westgard

Sep 9, 2017, 1:12:04 PM

to PCDM

Hi Michael,

We have a production newspaper repository and a PCDM model for newspapers that is very much in line with the NLW model. We presented a diagram on our poster at the 2016 Samvera (then called Hydra) Connect meeting. The poster is available on the meeting wiki:

https://wiki.duraspace.org/display/hydra/Hydra+Connect+2016+Posters

A few things have changed since that diagram was drawn up -- we removed the hasMember relationship between collection and issue (now issues point to collections but not the other way around). Also, this is not reflected on the diagram, but for a while we were relating Articles to Issues by a hasRelatedObject predicate, but we've since moved away from that idea and are following NLW in using hasMember for that. There are a few other differences in how we've managed OCR text blocks, but by and large the two models are compatible. I have a draft PCDM profile for this that will eventually be published on the PCDM wiki.
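The relationships described above can be sketched as plain RDF triples. This is only an illustration, not the actual profile: the example.org URIs are hypothetical, and it assumes that "issues point to collections" is expressed with pcdm:memberOf, which is not stated explicitly here.

```python
PCDM = "http://pcdm.org/models#"

# Hypothetical resource URIs, for illustration only.
COLLECTION = "http://example.org/newspapers/title-1"
ISSUE = "http://example.org/newspapers/title-1/issue-1"
ARTICLE = "http://example.org/newspapers/title-1/issue-1/article-1"

TRIPLES = [
    # Assumption: the issue points to its collection via pcdm:memberOf;
    # the collection carries no hasMember triple back to the issue.
    (ISSUE, PCDM + "memberOf", COLLECTION),
    # Articles are attached to the issue with pcdm:hasMember.
    (ISSUE, PCDM + "hasMember", ARTICLE),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI tuples as N-Triples."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def members_of(container, triples):
    """Find members of a container, whichever direction the link was stored in."""
    found = set()
    for s, p, o in triples:
        if p == PCDM + "hasMember" and s == container:
            found.add(o)
        elif p == PCDM + "memberOf" and o == container:
            found.add(s)
    return sorted(found)


print(to_ntriples(TRIPLES))
print(members_of(COLLECTION, TRIPLES))
```

Because membership is only stored on the issue side, anything that lists a collection's issues has to query by the inverse direction, which is what `members_of` does.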

This is a long way of saying that yes, we would be interested in participating in a community effort to describe and maintain a common model for newspaper content. There is also a Newspapers in Samvera working group that is looking at this problem (https://wiki.duraspace.org/display/samvera/Samvera+Newspapers+Interest+Group).

Joshua Westgard

Sep 9, 2017, 1:16:08 PM

to PCDM

PS: Sorry I neglected to add that by 'we' I mean the University of Maryland Libraries.

Josh Westgard

Systems Librarian, Digital Programs and Initiatives

University of Maryland Libraries

west...@umd.edu

Michael Bolton

Sep 11, 2017, 11:44:50 AM

to PCDM

Josh,

Thanks for the reply. I appreciate the information and the links.

Our team here would indeed be interested in helping develop a common model for Newspapers. We are also working on a project to upload some back issues of journals to an Open Journal System instance and possibly an ingest to HathiTrust. I would be interested in how the models would look for a journal as well as for conference proceedings. Maybe we could add that to the project when the time is right. I do see a lot of similarities between journals and newspapers, and maybe a single model will cover them all. I don't have enough experience with PCDM to say, but I do look forward to learning more about it.

I did a Google search to find The Historic Maryland Newspapers Project. Is this the repository you mentioned in your note? Is your newspaper digitization project part of the Chronicling America project? We are using the Chronam software for our newspaper project but we are also tracking the work being done at Oregon on the Open Newspaper Initiative.

Please let me know how you think we should proceed and what we can do to help.

Thanks again.

Eben English

Sep 12, 2017, 11:14:15 AM

to PCDM

Hi Michael,

I would recommend getting in touch with Glen Robson, who was one of the main people behind the NLW journals and newspapers projects. He's just been named the technical coordinator for IIIF (http://iiif.io/news/2017/08/30/technical-coordinator/), so I'm actually not sure how to best get in touch with him at this point. He's usually fairly active on the IIIF-Discuss list [1], and has been leading the meetings of the IIIF Newspapers Group.

Boston Public Library and University of Utah are currently working on an IMLS-funded project to standardize newspaper content management within the Samvera / Hyrax framework [2]. Part of this effort is to publish a community-vetted model for all types of newspaper content objects (titles, microfilm reels, bound volumes, issues, pages, articles, files, ALTO, etc.) expressed in RDF using the PCDM ontology. There has been a fair amount of discussion on this model via the Samvera Newspapers Interest Group, and we're hoping to publish a draft model very soon. (Josh has been involved with this as well.)

Once the draft is made available, we'd love to get feedback, especially from people who might be using the model outside of the Samvera/Islandora/Fedora spheres.

Eben

Boston Public Library

1. https://groups.google.com/forum/#!forum/iiif-discuss

2. https://www.imls.gov/grants/awarded/lg-70-17-0043-17

Glen Robson

Sep 13, 2017, 7:55:15 AM

to pc...@googlegroups.com

Hi Michael,

I'm glad you found the Newspaper PCDM mapping useful and it's great to hear others are extending it. The mapping was done while the National Library of Wales (NLW) was looking at piloting the migration from Fedora 3 to Fedora 4. Currently NLW is still on Fedora 3, so we haven't worked further on the PCDM mappings. Instead we concentrated on mapping our existing METS-based metadata into IIIF. The NLW Newspaper website which was the basis for the PCDM mapping:

http://newspapers.library.wales

currently only uses IIIF images, but Manifests and Collections exist for the full hierarchy and I hope they will be made use of in the next (as yet unscheduled) redevelopment of the newspapers website. While at NLW I documented the IIIF Newspaper structure on our Dev wiki. Unfortunately the wiki is currently down, but I will send on the links when it's available again.

For the Journals we started with a IIIF-first approach and based the website on harvesting the IIIF resources (Collections, Manifests and Annotation Lists). There are also details of this work on the dev wiki, and I'll send those on as well when it's back up.

For the IIIF Newspaper group we have the following document which maps various Newspaper concepts to IIIF concepts:

https://docs.google.com/document/d/1di1uYGau_ABUPmhnOcd8KH5yZyHS1sDbiZrp5aTpx5M/edit?usp=sharing

For the NLW Newspaper PCDM mapping we leant heavily on the predicates used by IIIF (hidden behind the json-ld) to make it as compatible as possible.

I hope the above helps but let me know if I can help further. As Eben mentioned I started as the IIIF Technical Coordinator on Monday but I can still contact NLW if required.

Thanks

Glen Robson

IIIF Technical Coordinator

--
You received this message because you are subscribed to the Google Groups "PCDM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pcdm+unsubscribe@googlegroups.com.
To post to this group, send email to pc...@googlegroups.com.
Visit this group at https://groups.google.com/group/pcdm.
For more options, visit https://groups.google.com/d/optout.

Michael Bolton

Sep 13, 2017, 8:58:45 AM

to PCDM

Eben,

Thanks for the info. I will have to check out the special interest groups you mentioned. And fortunately for me, Glen just posted an update to this group.

And good luck with your project. I will be interested in seeing the draft when you are ready.

Michael Bolton

Sep 13, 2017, 9:17:45 AM

to PCDM

Hello Glen,

Thanks for the follow-up message.

I like the Welsh Newspapers Online site and that's really what got us started down this path. That must have been a lot of work breaking down the pages to identify maps, illustrations, articles, etc. But it makes for an impressive interface. What software/application are you using to build and manage the site? Do you have an application that helps view the pages and mark the regions of interest? I read the About which was very helpful and it led me to the Welsh Journals site, again, a very nice site. We have a fairly long run of journals we would like to get online and I believe the Journals site would be a great start. Do you have a separate PCDM model for Journals or is it a variation of the Newspaper model? Also, I appreciate the statement on Open Data. We hope to see more of that as new collections come online.

I do have a lot to learn and the information I am getting from this group is very helpful. I am also looking forward to seeing the documents on your wiki. I am sure I will have more questions as I study the documents. I am excited to see this much interest in developing Newspaper and Journal models.

Again, thanks for the info and good luck with your IIIF work.

Glen Robson

Sep 14, 2017, 8:28:21 AM

to pc...@googlegroups.com

Hi Michael,

Thanks for your comments on the Newspaper and Journals websites. For the Newspaper site we outsourced the OCR and article segmentation to a company called Jouve (see https://www.llgc.org.uk/blog/?p=278 for further details) and followed the Australian Trove METS/ALTO model. The website itself is a custom-written PHP site backed by a Solr database.

The detailed writeup on how the Journals website works is now available again at:

http://dev.llgc.org.uk/wiki/index.php?title=IIIF_Journals

But I’m afraid we didn’t use PCDM for the site but instead mapped our METS/ALTO to IIIF and EDM. The Newspaper IIIF details are also now available at:

http://dev.llgc.org.uk/wiki/index.php?title=IIIF_Newspapers

I hope that helps and let me know if I can help further.

Cheers

Glen

Michael Bolton

Oct 27, 2017, 9:56:02 AM

to PCDM

Hello All,

I saw Eben's post on a PCDM Profile for Newspapers and I am studying it now. It may take a while for me to fully understand the relationships but it is indeed an interesting read. I see the model addresses related works such as OCR'ed text and thumbnails so my question may be rather basic. As I generate manifests for various collections, potentially newspapers, where do I store the manifests? Is there a best practice for manifests or is that such a basic part of the model that I am overlooking it? Is there a "best way" to store manifests so they can be discovered and easily accessed? I have been thinking all along they need to be in the repository and associated with my collections in some way.

Peter Binkley

Oct 27, 2017, 12:04:41 PM

to pc...@googlegroups.com

The model covers how to structure the manifest and what to include rather than where to put it in your stack, but it's good to figure out a good practice. (And I'm speaking as someone who hasn't implemented IIIF in our repository yet - that comes next year). Some people don't store the manifest at all, but generate it on the fly. So the question of where it lives is more a matter of its uri than its physical location. And once you're talking about uris in IIIF, you're talking about linked data. If you view source on the Mirador demo http://projectmirador.org/demo/ you'll see the uris of a bunch of manifests from several institutions. I like the Stanford model, e.g. https://purl.stanford.edu/ch264fq0568/iiif/manifest.json, where the manifest is treated as a rendering of the object: you just take the object uri and append /iiif/manifest.json. The ids within the manifest can then follow suit: https://purl.stanford.edu/ch264fq0568/iiif/canvas/ch264fq0568_1 . And so, linked data assertions can be made about individual canvases or even zones on canvases (e.g. newspaper articles) in a way that doesn't embed a lot of implementation detail (like paths to images in the storage system), and are therefore easier to maintain as the stack changes and develops in the future.
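A minimal sketch of that URI convention, assuming the pattern can be read off the two Stanford examples above. The function names are hypothetical, and the canvas numbering scheme (object id plus `_1`, `_2`, ...) is inferred from the single canvas URI quoted, so treat it as a guess rather than Stanford's documented rule.

```python
def manifest_uri(object_uri: str) -> str:
    """Treat the manifest as a rendering of the object: object URI plus a fixed suffix."""
    return object_uri.rstrip("/") + "/iiif/manifest.json"


def canvas_uri(object_uri: str, index: int) -> str:
    """Mint a canvas id under the same object URI (numbering scheme is an assumption)."""
    base = object_uri.rstrip("/")
    object_id = base.rsplit("/", 1)[-1]  # last path segment, e.g. "ch264fq0568"
    return base + "/iiif/canvas/" + object_id + "_" + str(index)


obj = "https://purl.stanford.edu/ch264fq0568"
print(manifest_uri(obj))
print(canvas_uri(obj, 1))
```

Nothing in these ids refers to a storage path or image server, which is what keeps assertions about a canvas stable when the stack behind it changes.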

Regarding discovery, there's another group working on that: http://iiif.io/community/groups/discovery/ . Have you looked at IIIF collections yet? http://iiif.io/api/presentation/2.1/#collection

I'm not sure I've answered your question, but I hope some of that is at least useful context.

Peter

Peter Binkley, Ph.D., MLIS / Digital Initiatives Technology Librarian / peter....@ualberta.ca

2-10K Cameron Library / University of Alberta / Edmonton, Alberta / Canada T6G 2J8
phone 780-492-3743 / fax 780-492-9243

Eben English

Oct 27, 2017, 12:44:46 PM

to PCDM

At BPL we don't store the manifests anywhere; our repository application generates them on the fly when the client makes the request.

Example item: http://ark.digitalcommonwealth.org/ark:/50959/s4655h10k

Example item manifest: http://ark.digitalcommonwealth.org/ark:/50959/s4655h10k/manifest

You can see the code we use for this here, which may or may not be useful for you, depending on your system/language/implementation/etc.:

https://github.com/boston-library/commonwealth-vlr-engine/blob/master/lib/commonwealth-vlr-engine/iiif_manifest.rb

This code uses methods from a Ruby gem called osullivan: https://github.com/iiif-prezi/osullivan

That being said, the code above can be pretty slow when generating a manifest for an item with a lot of images in the main sequence. We would definitely get a big performance benefit from some better caching, or from potentially storing the manifest JSON.

In the model we proposed [1], I think if you were to try to store the manifest for an entire issue, the manifest would be a pcdm:File object that would be related to the NewspaperIssue object via a pcdm:fileOf relationship.
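As a rough illustration of that last point, a stored manifest could be described with two triples. The URIs below are hypothetical, and the sketch leaves out whatever typing the proposed profile applies to the NewspaperIssue itself.

```python
PCDM = "http://pcdm.org/models#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs for an issue and its stored manifest.
ISSUE = "http://example.org/repo/issue-1"
MANIFEST_FILE = ISSUE + "/files/manifest.json"


def stored_manifest_triples(issue_uri, file_uri):
    """Describe the manifest JSON as a pcdm:File that is a file of the issue."""
    return [
        (file_uri, RDF_TYPE, PCDM + "File"),
        (file_uri, PCDM + "fileOf", issue_uri),
    ]


def files_of(object_uri, triples):
    """List every resource that declares itself a file of the given object."""
    return [s for s, p, o in triples if p == PCDM + "fileOf" and o == object_uri]


triples = stored_manifest_triples(ISSUE, MANIFEST_FILE)
print(files_of(ISSUE, triples))
```

Note that `files_of` returns every file of the issue, so a real repository would still need some way (a type, a MIME type, a naming rule) to tell the manifest apart from other files.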

Thanks,
Eben

1. https://groups.google.com/d/msg/pcdm/4PouAdXLheM/T7AOJPwqBgAJ

Michael Bolton

Oct 27, 2017, 12:47:13 PM

to pc...@googlegroups.com

Peter,

This information is great and I can indeed see how the model covers structuring the manifest. I did follow the links to the Presentation API specification, which was a big help. I also noticed that the Collection section of the specification indicated that Collections and Manifests are deprecated and may change in the 3.0 specification. I am not sure what that means and how it will impact this model.

I also like dynamically generated manifests. If I wanted to create a manifest of several images from a larger image collection, how would I go about that task? Would I create a special collection that just included those images? I guess that would allow me to generate the manifest on the fly.

And just so I understand, making an Article a IIIF Range would provide for the manifest that would include the pages for the article, correct?

Again, thanks for the response. It does provide a lot of good information.

Thanks

_____________________________________________________________________

Michael W. Bolton | Assistant Dean, Digital Initiatives

Sterling C. Evans Library | Texas A&M University

5000 TAMU | College Station, TX 77843-5000

Ph: 979-845-5751 | Michael...@tamu.edu

http://library.tamu.edu

Michael Bolton

Oct 27, 2017, 2:15:28 PM

to pc...@googlegroups.com

Eben,

Good point on the performance issues. I can see where that would be a potential issue. It does seem to make sense to generate some items ahead of time and save them in some fashion. I can see a benefit in doing that with thumbnails for instance. Thanks for the links and the information.

Also, it appears you are using ARKs for items in your repository. Do all items in the repository have an ARK?

Thanks again.

Peter Binkley

Oct 27, 2017, 3:22:15 PM

to pc...@googlegroups.com

Regarding including selected images in a manifest, that's easy - they don't even have to be your images. If you look at the structure of the manifest, with a sequence of canvases which each have images, which each have resources, the resource points to iiif-enabled images anywhere, from which the IIIF client pulls info.json files to discover their capabilities and do the appropriate presentation. There's a great demo of cross-institution images here: http://demos.biblissima-condorcet.fr/chateauroux/demo/ (I love this demo!) - click the "Miniatures" tab, then make both layers visible - you're seeing the main image from one institution and the restored miniature from another.

The manifest that makes this happen is here: http://iiif.biblissima.fr/chateauroux/B360446201_MS0005/manifest.json . If you search for canvas-981394, you'll find that this canvas has two images, one from cnrs.fr like the manifest, and one from gallica.bnf.fr representing the inserted miniature. The former fills the whole canvas, while the latter has an xywh parameter on its "on" property, which specifies where it's supposed to appear. The two services are at different institutions and even use different versions of the IIIF API, but their resources can be brought together in a coherent rendering by the manifest - which could just as easily have been published by a third party, without involving either source institution at all. It shows the amazing interoperability that IIIF can enable.
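The two-image canvas described above can be sketched in Presentation API 2.x terms as follows. This is a simplified illustration, not a copy of the Biblissima manifest: all URIs and dimensions are made up, the helper name is hypothetical, and a real `service` block would also carry `@context` and `profile`.

```python
import json


def painting(canvas_id, image_url, service_id, box=None):
    """Build one sc:painting annotation; box is (x, y, w, h) in canvas coordinates."""
    target = canvas_id
    if box is not None:
        # An xywh fragment restricts the image to a region of the canvas.
        target += "#xywh=" + ",".join(str(n) for n in box)
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            "@id": image_url,
            "@type": "dctypes:Image",
            "service": {"@id": service_id},
        },
        "on": target,
    }


# Hypothetical canvas and two image services hosted by different institutions.
CANVAS = "http://example.org/iiif/ms-1/canvas/f1"
PAGE = "http://images-a.example.org/iiif/page-1"
MINIATURE = "http://images-b.example.net/iiif/miniature-1"

canvas = {
    "@id": CANVAS,
    "@type": "sc:Canvas",
    "label": "f. 1r",
    "height": 4000,
    "width": 3000,
    "images": [
        # The page image fills the whole canvas.
        painting(CANVAS, PAGE + "/full/full/0/default.jpg", PAGE),
        # The miniature is placed on one region of the same canvas.
        painting(CANVAS, MINIATURE + "/full/full/0/default.jpg", MINIATURE,
                 box=(1200, 800, 600, 700)),
    ],
}

print(json.dumps(canvas, indent=2))
```

The image URLs here use Image API 2.x syntax; a server on an older version of the API would use different file names, which is why a client reads each service's info.json rather than assuming the URL form.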

Regarding article ranges, you would still include the pages as canvases (which would enable the user to browse the pages of the newspaper issue in the same way as they would a book), and provide the ranges in a separate sequence that provide access to the article level.
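A minimal sketch of an article-level range, assuming Presentation API 2.1, where ranges are listed under the manifest's `structures` property alongside the page-level sequence. The URIs, labels, and coordinates are hypothetical.

```python
def article_range(range_id, label, regions):
    """Build an sc:Range from (canvas_id, box) pairs.

    box is (x, y, w, h) for the part of the page the article occupies,
    or None when the article covers the whole page.
    """
    canvases = []
    for canvas_id, box in regions:
        if box is None:
            canvases.append(canvas_id)
        else:
            canvases.append(canvas_id + "#xywh=" + ",".join(str(n) for n in box))
    return {
        "@id": range_id,
        "@type": "sc:Range",
        "label": label,
        "canvases": canvases,
    }


# Hypothetical issue with an article that starts on page 1 and continues on page 2.
BASE = "http://example.org/iiif/issue-1"
PAGE_1 = BASE + "/canvas/p1"
PAGE_2 = BASE + "/canvas/p2"

structures = [
    article_range(BASE + "/range/article-1", "Article 1",
                  [(PAGE_1, (100, 200, 900, 1500)), (PAGE_2, None)]),
]

print(structures[0]["canvases"])
```

The pages themselves stay in the manifest's sequence as ordinary canvases, so page-by-page browsing and article-level navigation can coexist in one manifest.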

Peter

west...@umd.edu

Oct 28, 2017, 6:11:19 PM

to PCDM

Michael,

We have a small Rails app that generates the manifests. It is here:

https://github.com/umd-lib/pcdm-manifests

We also generate manifests on the fly, and they are cached for better performance.

You can see how it all fits together for the end users here:

https://www.lib.umd.edu/univarchives/student-newspapers

Josh

Michael Bolton

Oct 30, 2017, 9:00:07 AM

to pc...@googlegroups.com

Josh,

This is really nice. Thanks for the links. We are bringing up the Chronam software for our newspapers and will go live with the site in a few months. I know there is a lot of work going on to upgrade or replace that app with more current technology. I like how you use Mirador as the viewer. I notice that when I hover over a column or article, it highlights. How does that work?

Again, thanks for the links and for sharing your manifest generator app. We are using API-X and the Amherst PCDM Extension to do our manifests now and it is always interesting to see other implementations.

Thanks again.

Michael Bolton

Oct 30, 2017, 9:01:31 AM

to pc...@googlegroups.com

Peter,

This is really a cool demo. Thanks for the links and the information.

Joshua Allan Westgard

Oct 30, 2017, 10:11:53 AM

to pc...@googlegroups.com

Hi Michael,

Regarding the highlighting, right now what is being highlighted is the text block. We have OCR that is broken up into blocks in this way, and those blocks are modeled as web annotation objects on the page. When the page is loaded in the viewer, the annotation layer containing the boundaries of those blocks is loaded as well.

Ultimately, we would like to group text blocks into articles and do the highlighting at that level, but that's something we've put off for the time being.
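For readers unfamiliar with the approach, here is a hedged sketch of what an annotation layer of OCR text blocks might look like. It assumes the Open Annotation style used by IIIF Presentation 2.x annotation lists; the actual serialization in the repository may differ, and the URIs, coordinates, and text are invented.

```python
def text_block_annotation(canvas_id, box, text):
    """One OCR text block, targeted at a rectangular region of the page canvas."""
    x, y, w, h = box
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            "@type": "cnt:ContentAsText",
            "format": "text/plain",
            "chars": text,
        },
        "on": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


def annotation_list(list_id, canvas_id, blocks):
    """Wrap the blocks of one page in an sc:AnnotationList for a viewer to load."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": list_id,
        "@type": "sc:AnnotationList",
        "resources": [
            text_block_annotation(canvas_id, box, text) for box, text in blocks
        ],
    }


# Hypothetical page canvas with two OCR blocks: ((x, y, w, h), text).
CANVAS = "http://example.org/iiif/issue-1/canvas/p1"
BLOCKS = [
    ((120, 300, 800, 240), "First block of OCR text"),
    ((120, 560, 800, 410), "Second block of OCR text"),
]

page_list = annotation_list(CANVAS + "/list/text-blocks", CANVAS, BLOCKS)
print(len(page_list["resources"]), page_list["resources"][0]["on"])
```

In Presentation 2.1 a canvas points at such a list through its `otherContent` property, and the viewer uses each annotation's xywh region to draw the box that highlights on hover.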

Josh
