Tech Company GLAM collection training requests - framing the issues and how to engage

tgpad...@gmail.com

Mar 18, 2026, 10:51:53 AM

to Collections as Data

Hi all,

I'm sure many of you are tracking or taking part in conversations about how to engage with tech companies that want to train on library, archive, and museum collections.

I wanted to highlight a recent post by Dave Hansen at Authors Alliance, "Library and Archives 101: AI and The False Promise of Control," as I think it introduces a number of important arguments for us as a community to consider. It is especially useful to read it alongside the UVA AI Protocol, as the two diverge in fundamental ways. For my two cents, I find the Authors Alliance post more compelling, though I would want to retain exceptions to collection training based on community responsibilities, e.g., Indigenous data and oral histories (as Mia Ridge mentioned elsewhere).

I'm curious how either piece is resonating with your work. I'm also curious how these mostly U.S.-focused discussions relate to similar work happening in other national contexts.

Thomas

Thomas Padilla

Associate Dean for Research and Learning

University of Nebraska-Lincoln

Steven Claeyssens

Mar 19, 2026, 3:31:10 PM

to Collections as Data

hi Thomas, all,

Yes, this is an important discussion. I would like to take the opportunity to set something straight. Hansen writes that the KB, where I work, “has restricted access to its major digital collections for commercial AI training—while making the same materials available to a government-backed Dutch language model.” That is simply not correct. Our position is that all these parties can gain access to our public domain materials, but not to the in-copyright materials that we make publicly available online, based on agreements with publishers and CMOs. In practice, we have indications that the large commercial players have nonetheless crawled (parts of) the copyrighted portion, whereas GPT-NL properly asked what they are allowed to use and what not. As a result, they had access to less material, not more, as Hansen suggests.

This also explains why we believe we should have some form of “control.” We are allowed to publish in-copyright materials based on our agreements, for private use and research purposes, not for AI training (cf. www.delpher.nl, newspapers = "kranten"). If those terms are violated, this truly unique arrangement comes under pressure, and that would be a real shame. This has nothing to do with agency or the hope of a good financial deal.

With best regards,

Steven

Steven Claeyssens

Curator of Digital Collections

KB, National library of the Netherlands

Dave Hansen

Mar 19, 2026, 4:29:02 PM

to Collections as Data

Hi Steven,

Thanks for this clarification -- it's very helpful (and sorry I misunderstood). I'll mark a correction on the essay. It wasn't clear to me from the news releases (e.g., this one) that the commercial restriction was being driven by underlying agreements with the publishers and CMOs rather than just an internal desire to restrict these types of uses. This quote in particular is what led me in that direction (translated from the Dutch):

"We believe that AI applications should be developed in an ethically responsible way. For example, we consider it important that copyright is respected, that sources are cited, and that personal data is protected," explains KB management team member Martijn Kleppe. "That is not the case with many commercial AI companies. They do not ask permission to collect this data and are not transparent about how this data is used."

I understand that agreements with publishers or other rightsholders can sometimes impose limits that libraries and archives just can't do much about.

And I just entirely glossed over that the Dutch model is being excluded from access to in-copyright material (though I do wonder if GPT-NL or similar might have a case to argue that such use is permitted under Article 3 of the CDSM? I suppose not, given that its goals are much broader than just scientific research).

Thanks,

Dave

Thomas Padilla

Mar 20, 2026, 2:47:53 PM

to Collections as Data

Chiming in to share Rosalyn Metz's piece, which comments on both the UVA effort and the Authors Alliance post while introducing some new arguments.

https://rosalynmetz.substack.com/p/the-balance-of-knowledge

Thomas

Thomas Padilla

Associate Dean for Research and Learning

University of Nebraska-Lincoln

--
This group aims to foster a welcoming and inclusive experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age, religion, nationality, or political beliefs. Harassment of participants will not be tolerated in any form. Harassment includes any behavior that participants find intimidating, hostile or offensive. Participants asked to stop any harassing behavior are expected to comply immediately. Please contact Thomas Padilla if you have concerns.
To view this discussion visit https://groups.google.com/d/msgid/collectionsasdata/70d0588f-5630-4ecd-88d9-5a8156052799n%40googlegroups.com.

Ingrid Mason

Mar 20, 2026, 10:17:43 PM

to Thomas Padilla, Collections as Data

Hi Thomas, all,

I'm not sure I'm going to assist the discussion so much as add more questions. Thanks for posting your query, and thanks to those who have responded. My reflections are below.

Good wishes, Ingrid

--

Interesting post by Rosalyn Metz and she gets to the nub of the problem quickly. Forces for private vs public interest in access are in tension. I've been trying to work out what's behind your question Thomas: "how to engage with tech companies who want to train on library, archive, and museum collections".

So I backed up and asked: why engage, and what's in it for me (WIIFM), or for the organisation I work for and those who own or are reflected in the material? And what is it about "tech companies" that causes the concern (profit is a known and reasonable outcome from industry, but it can be a diabolical motivator)? Then I asked another question: what would breach the social licence a GLAM organisation has as a holder of knowledge and/or heritage? What lessons does history offer, and so on?

No simple answers to offer; rather, more questions: what have we learned from the tragedy of the commons, and what does it cost us to jeopardise rights over research and heritage collections (knowledge and information)? I arrive at a loss of trust, and I understand that a rights owner might be much more cautious about licensing their works for publishing online. The ultimate result will be an inhibiting effect (knowledge needs to flow, but how quickly, to whom, and why?).

Where do I land? I have accepted that as a community we will need to look at alternatives and establish gates and new access models to be able to protect interests and trust third parties to operate ethically in an open environment (private or public). Gates and alternate models and pathways still make resources available; the extra step is the tempering of availability and the negotiation that gets put in.

So, why should public interests dominate and prevail by being judicious in entering into agreements? My 2c: trust is a fundamental social institution (hard to establish, easy to lose).

Frankly, the conversation about expanding our commitments to mediated access, where material can still be made available but in a gated or controlled arrangement, isn't getting much airtime (or if it is, I'm missing out on it!). Yet the mediated model is used heavily in research where matters of sensitivity, whether personal or commercial, come into play around data and software, and it has well-established norms in heritage in a physical sense. The tech stack is all there to do this and it is very mature; so is a risk model (the Five Safes Framework).

What's missing is the rationale and models as types for joint ventures, and an acceptance of the need to negotiate and establish new norms for mediation. When working with researchers in eScience/Research to support the release of open data, some useful case studies emerged, e.g., an annual survey of religious practices was released, but at such a high level as to avoid identifying people (useful, protective); and a small sample of images from a photographic collection was released to indicate the range of material (useful, protective).

We (at the National Museum of Australia) are starting to digitise a card catalogue containing very sensitive First Nations information that comes with a very complex ethical backdrop and history. It is fairly safe to say the datasets will never be in the hands of a tech company. The same goes for the work with audio in film, TV, oral histories, etc. at the NFSA, a rich mix of public interest, commercial, and private opinion. I can only imagine none, or only a very small portion, of that ever being made available to tech companies without negotiation. Maybe, given this, there are substantive delineating factors that separate research and heritage collections. Not sure; maybe not in the end, as both need some level of openness for many good reasons and some level of mediation for many other good reasons. The question is then: how mediated, how open, and how do we negotiate and communicate that?

It is not as if we've had a perfect record of making publicly known all that is in a collection, let alone the collection itself. Many times I have learned about indexes and catalogues inside institutions that have never seen the light of day. We have history in this regard, and it has not always been about lack of funds and will; it has also been about protecting interests and rights (or shame and inertia).

I'd welcome hearing others' thoughts on this front. I recently relayed this message to a tech company rep: that the context for copyright in Australia and Aotearoa New Zealand and the legacies of colonialism are going to be defining features in this terrain, and that I am very focused on "useful and protective" as matters of balance. So it comes back to: where's the harm, and who is going to be harmed? Care ethics are written all over this space, to interrogate and deliberate on.
