For those of you interested in qualitative data, the New York Times article “How The Times Is Digging Into Millions of Pages of Epstein Files” is an example of analyzing a huge corpus using AI and human tools. The corpus includes “three million pages, 180,000 images and 2,000 videos”. It’s online, but posted in a way that makes retrieval complicated (not surprising).
Here is a link, although it might be paywalled.
https://www.nytimes.com/2026/02/12/insider/jeffrey-epstein-files-documents.html
The kind of metadata that might facilitate working with a huge trove like this is worth considering.
Larry
Thanks for sharing, Larry!
Access to such large data sets is now more common, and it has made qualitative data analysis substantially more arduous: there are many preparatory steps to take before the materials are ready for analysis. The article shows how many of these steps can be handled by leveraging A.I.
Trump. Clinton. Gates. Duke of York. My colleagues and I came up with a list of those terms and others about prominent people, places and events that involved Epstein; we’ve added more every day. Some searches were more topical, seeking details on Epstein’s time in jail and death. The plan was to divide those terms and phrases among the reporters and then begin searching the files to see what we found that was new and potentially newsworthy.
The excerpt above describes their manual search strategy. They highlight the application of A.I. for searching, organising, synthesising, etc., but not for expert judgement. Some interesting excerpts (after them I sketch, very roughly, what two of these techniques might look like in code):
The first thing we always try to do is make things searchable. But here we also needed ways for reporters to get at the things that weren’t easy targets for search. One way we did that was by leveraging something called “semantic search,” which lets reporters search for concepts and find matching text even if the exact language isn’t in the document. We also built an A.I.-powered tagging and categorization tool to bucket the documents by type and add labels for things that we thought may be useful indicators of newsworthiness [...]
With A.I., information — text, images, video, audio — is like a liquid; it can be molded into different formats and searched in rich, expressive ways. A.I. will never replace the expert judgment of reporters, but it can make their lives easier and amplify their reporting ambitions.
A.I. is really bad at news judgment — what information to include, whether it’s important. A.I. can be sloppy and make mistakes that are inexcusable in journalism. It’s super industrious but not super intelligent. A.I. outputs can amplify biases in society. And in my experience, A.I. is not great at producing original ideas (but decent at synthesizing or distilling them).
The way we use A.I. is quite different than how most people interface with Gemini and other tools. We are writing software that gives discrete tasks to A.I. that we feel comfortable the technology can handle reliably. For example, we may ask it to let us know if a page has an image or if a document is an email. The stuff we get back may help reporters get to the right material faster, but ultimately a reporter’s eyes on actual documents are what is driving every story.
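As promised, here is a minimal sketch of what the “semantic search” mentioned above might look like in practice. Everything in it is my own illustration: the Times does not say which libraries or models it used. I assume the open-source sentence-transformers package and a small general-purpose embedding model; documents and queries are embedded into the same vector space and ranked by cosine similarity.

# Minimal semantic-search sketch (illustrative only; not the Times's code).
# Assumes the open-source sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Toy stand-ins for documents in a large corpus.
documents = [
    "Flight logs list several trips on the private jet that month.",
    "The deposition transcript covers meetings at the townhouse.",
    "An email thread about rescheduling an appointment.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# The query shares no exact wording with the first document;
# embedding similarity still surfaces it.
query = "records of airplane travel"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0].tolist()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")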
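And the “discrete tasks” pattern from the last excerpt might look something like the sketch below: one narrow, checkable question per call, whose output a reporter can spot-check. Again, the client library, model name, and prompt are my own illustrative assumptions; the article does not describe the Times's actual code.

# A sketch of giving one discrete, verifiable task to an A.I. model.
# The OpenAI client and model name here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_email(document_text: str) -> bool:
    """Ask the model one narrow yes/no question per document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system", "content": "Answer with exactly 'yes' or 'no'."},
            {"role": "user",
             "content": "Is the following document an email?\n\n"
                        + document_text[:4000]},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")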
Therefore, sharing interoperable metadata, the results of artificial intelligence tools, and expert reviews and annotations could save a considerable amount of time and resources, as well as improve future research on these materials. This would benefit not only journalists, but also qualitative social science researchers, linguists, historians, lawyers, and many others.
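To make that concrete, a shared record for a single document might bundle provenance, machine-generated labels, and human review status. The field names below are a hypothetical sketch, loosely inspired by Dublin Core, not an existing standard:

# Hypothetical interoperable metadata record for one document in such a corpus.
# Core fields loosely follow Dublin Core; the "ai_labels" and "human_review"
# blocks are my own sketch of what could usefully be shared.
record = {
    "identifier": "corpus/doc-000001",       # illustrative ID scheme
    "format": "application/pdf",
    "language": "en",
    "type": "email",                         # from an AI tagging step
    "subject": ["travel", "scheduling"],     # AI-suggested topic labels
    "ai_labels": {
        "contains_image": False,
        "model": "example-classifier-v1",    # hypothetical model name
        "confidence": 0.91,
    },
    "human_review": {
        "verified": True,                    # a reporter checked the labels
        "reviewer": "reviewer-id",
        "date": "2026-02-12",
    },
}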
Noemi
--
Noemi Betancort Cabrera
Data and systems librarian - Qualiservice metadata manager
Staats- und Universitätsbibliothek Bremen
Digitale Dienste
Bibliothekstraße 9
28359 Bremen
Tel. 0421/218-59592
Fax. 0421/218-98 59592