What should we do about identifying PDF/UA?

David Clipsham

Mar 26, 2024, 7:17:04 AM

to PRONOM

In line with the shift towards more accessible services, I'm hearing of pushes towards adoption of PDF/UA (Universal Accessibility), but PRONOM currently doesn't identify PDF/UA files.

Identification is easy enough - e.g. for PDF/UA-1 we want to find a <pdfuaid:part> tag with a value of '1'
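As a rough illustration of that check, here is a minimal Python sketch that scans a file's bytes for the `pdfuaid:part` value. It is not a PRONOM signature and not how DROID or Siegfried work internally. The function name `find_pdfua_part` is made up for this example, and it assumes the XMP packet is stored uncompressed and uses either the element form or the attribute shorthand.

```python
import re

# Element form:   <pdfuaid:part>1</pdfuaid:part>
# Attribute form: pdfuaid:part="1"  (\x22 is a double quote, \x27 a single quote)
# Assumption: the XMP packet is uncompressed, so a plain byte scan can see it.
_PDFUA_PART = re.compile(
    rb"<pdfuaid:part>\s*(\d+)\s*</pdfuaid:part>"
    rb"|pdfuaid:part\s*=\s*[\x22\x27](\d+)[\x22\x27]"
)


def find_pdfua_part(data: bytes):
    """Return the declared PDF/UA part number as an int, or None if absent."""
    match = _PDFUA_PART.search(data)
    if match is None:
        return None
    # Exactly one of the two capture groups is set, depending on the form found.
    return int(match.group(1) or match.group(2))
```

Something like `find_pdfua_part(open(path, "rb").read())` would then return `1` for a file declaring PDF/UA-1. A hit only shows that the file claims conformance, not that it actually conforms.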

However, a file can be both PDF/UA and PDF/A, only one of the two, or neither, which complicates priority somewhat.

In the past, a general aim for PRONOM has been to avoid known clashes so that each identification event gets a single outcome, which is managed by setting priorities where appropriate. For example, in a typical case the more specific PDF/UA would get priority over the less specific PDF 1.4.

However, it will be of value to know if a file is both, for example, PDF/A-2u and PDF/UA-1.

So do we:

a) create a new PDF/UA-1 entry (with priority over PDF 1.x) and accept that sometimes we might find files that identify as both this and some PDF/A variant

b) create a PDF/UA-1 entry, and further entries named something like 'PDF/A-2u with PDF/UA-1 compliance'. In this case I believe we'd need entries for each of the 8 PDF/A-1 to PDF/A-3 variants, and these would need priority over those plus the various PDF 1.x entries.

It all gets very tangled.

Obviously here I've focused on PDF/UA-1, but I'm aware that PDF/UA-2 has just recently been published (https://www.iso.org/standard/82278.html)

I'm keen to hear others' thoughts.

David

Johan van der Knijff

Mar 26, 2024, 12:14:40 PM

to PRONOM

Some thoughts on this:

I would definitely not go with option b), as the number of possible combinations would quickly become unwieldy. Keep in mind that besides the various versions of the PDF/A and PDF/UA profiles that you mention, there's also PDF/E, PDF/VT and PDF/X (most of which also have sub-versions, like PDF/A and PDF/UA). So creating entries for all combinations of those profiles would quickly become very tangled indeed, to the point where it would be a pain for PRONOM maintainers and users alike. Say, for instance, a user wants to create a report for all files that are identified as PDF/UA-1. With combined entries, the user would have to know every entry whose combination includes PDF/UA-1, which would make this simple query unnecessarily complex.
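To put a rough number on that growth, here is a small Python sketch that counts how many combined entries option b) would need if every cross-family combination got its own entry. The variant lists are illustrative and incomplete, and `combined_entry_count` is a name made up for this example. The result is an upper bound, since not every combination may be valid in practice.

```python
from itertools import combinations
from math import prod

# Illustrative, incomplete variant lists (assumption: one variant per family
# can apply to a file at a time).
FAMILIES = {
    "PDF/A": ["1a", "1b", "2a", "2b", "2u", "3a", "3b", "3u"],
    "PDF/UA": ["1", "2"],
    "PDF/X": ["1a", "3", "4"],
    "PDF/E": ["1"],
    "PDF/VT": ["1"],
}


def combined_entry_count(families):
    """Count entries needed to cover every combination of 2+ families."""
    sizes = [len(variants) for variants in families.values()]
    total = 0
    for group_size in range(2, len(sizes) + 1):
        for group in combinations(sizes, group_size):
            # One entry per choice of a variant from each family in the group.
            total += prod(group)
    return total
```

With only the eight PDF/A variants and PDF/UA-1 this gives the 8 extra entries David mentions. With the five illustrative families above it gives 416, on top of the single-profile entries that already exist.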

Option a) (accept multiple id matches) definitely looks a lot better, both from a maintainer's and a user's point of view. But it does make me wonder if this couldn't (shouldn't?) be taken even one step further, by also including the (less specific) match for PDF 1.4? This is mostly because I would expect that in many cases the required granularity depends quite a bit on organisation-specific contexts. E.g. some orgs might not be concerned at all whether a file is PDF/A or PDF/UA, while at the same time needing info about the specific version of the parent PDF standard (1.x/2.x). Other orgs might only be interested in the most specific matches (here: PDF/A and PDF/UA), or even both.

But I'm not entirely sure how well this fits into the current PRONOM/DROID architecture, and whether this might result in adverse effects on existing workflows. I'd be very interested to hear others' opinions on this!

Cheers,

Johan

Tyler Thorsted

Mar 26, 2024, 12:53:36 PM

to PRONOM

PRONOM has precedent for defining a format like TIFF and leaving other tools to characterize it further, but then we have formats like PDF, which has been expanded to include many types of PDF.

I had assumed a PDF/UA had to be some sort of PDF/A, but as David points out, it can stand alone or be an additional conformance. I agree with Johan that the possible combinations of all the various types of PDF will continue to get unmanageable. I created a PDF a while back which is both a PDF/A-2b and a PDF/X-4, which is the same issue. Siegfried stops when it finds the PDF/X-4 identification. They are all subsets of each other.

My vote is to choose the priority. Since they are all subsets of each other, maybe this can be used to define which format takes priority as one could assume if it is a PDF/UA, it should conform to PDF/A as well?

Of David's options, having an entry just for PDF/UA-1 makes sense, letting the chips fall where they may when there are multiple matches. Siegfried will probably pick the first one it finds, which appears to be the PDF/A identification.

Tyler Thorsted

Johan van der Knijff

Mar 26, 2024, 1:09:30 PM

to PRONOM

@Tyler:

> maybe this can be used to define which format takes priority as one could assume if it is a PDF/UA, it should conform to PDF/A as well?

Except you can't, they're both different profiles within PDF, and the fact that a file is PDF/UA doesn't mean it conforms to PDF/A or vice versa. So any kind of priority you define here will be largely arbitrary. And again, some users might be primarily interested in the PDF/A match, and others in the PDF/UA match.

Tyler Thorsted

Mar 26, 2024, 1:57:25 PM

to PRONOM

Johan,

I guess my point was that PDFs do have a hierarchy: a PDF/UA-2 must be a PDF 2.0, for example, or a PDF/A-3 must be PDF 1.7, etc. Maybe some of that could be used in reducing the possible combinations.

In looking at PDF/UA, I also noticed the WTPDF conformance standard.

https://pdfa.org/wp-content/uploads/2024/02/Well-Tagged-PDF-WTPDF-1.0.pdf

Maybe we need to have PDF identification be container based so we can identify the subsets appropriate for each version?

Tyler Thorsted

Johan van der Knijff

Mar 26, 2024, 2:11:58 PM

to PRONOM

@Tyler Ah yes, of course you're right about the hierarchy between e.g. PDF 1.7, 2.0 etc. and their underlying profiles. For the case of David's example you'd then still end up with 2 matches at the same hierarchical level (which, by the way, would be completely fine with me).
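A minimal Python sketch of what that could look like: a base-version match is dropped when a profile built on it also matches, while profiles at the same level are both kept. The `PARENT` table and the `resolve` function are hypothetical and only cover David's example. This is not how DROID or Siegfried apply PRONOM priorities.

```python
# Hypothetical parent links for illustration only: each profile points at the
# base PDF version it is assumed to build on.
PARENT = {
    "PDF/A-2u": "PDF 1.7",
    "PDF/UA-1": "PDF 1.7",
    "PDF/UA-2": "PDF 2.0",
}


def resolve(matches, keep_ancestors=False):
    """Drop matches that are ancestors of another match; keep siblings."""
    if keep_ancestors:
        # Report everything, including the less specific base version.
        return list(matches)
    superseded = set()
    for name in matches:
        parent = PARENT.get(name)
        while parent is not None:
            superseded.add(parent)
            parent = PARENT.get(parent)
    return [name for name in matches if name not in superseded]
```

For a file matching `["PDF 1.7", "PDF/A-2u", "PDF/UA-1"]` this returns both profiles and drops the base version. Passing `keep_ancestors=True` keeps all three, for users who also want the parent PDF version reported.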

David Clipsham

Mar 26, 2024, 2:28:00 PM

to PRONOM

Cool thank you both. Honestly I'm fine with multiple ID also. I think it's the proper outcome here.

David

Tyler Thorsted

Sep 5, 2024, 10:51:01 AM

to PRONOM

Remembering this conversation now that the ISO standards have been released for free.