Word Document Legacy Finding Aids and Python


Alexis Jimenez

Sep 25, 2025, 9:02:55 AM
to Archivesspac...@lyrasislists.org
I work for a religious archive, and we have all of our finding aids in Word documents. Upon starting here in March, I implemented ArchivesSpace for our archives to gain intellectual and physical control over our collections. I need to find the most efficient and practical way to extract the data into Excel for upload into ASpace. Suggestions I have gotten say to write Python code to streamline the task. I unfortunately am a lone wolf archivist, and I have no Python experience. Does anyone have any idea what my best course of action would be?

Thank you!
Alexis Jimenez
Archivist
Sisters of St. Francis of Philadelphia
609 S. Convent Road
Aston, PA 19014

Paul Sutherland

Sep 25, 2025, 10:10:56 AM
to Alexis Jimenez, Archivesspac...@lyrasislists.org
Hi Alexis,

Greetings from a few miles up the road! My first questions would be about the Word documents themselves:
  • Do they have tables inside or is everything free text? Tables can be helpful as the cells get separated.
  • Which Word format are they? If they are .rtf files, that could help, since the RTF markup can be used to break the content apart.
  • How many are there?
My own approach to this has been to copy and paste the document text into Notepad++ and do some find and replace (sometimes using simple regular expressions) until I get something that has tab separation between the fields. That can then be copied into Excel for further work. It's scrappy and, depending on the number and internal consistency of the finding aids, may not be feasible.
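
For anyone who wants to script the find-and-replace step, here is a minimal Python sketch of the same idea. The line layout it expects ("Box 1, Folder 3: Title, 1950-1955") and the name `to_tab_separated` are assumptions for illustration only; the pattern would need adjusting to match the actual finding aids. Lines that do not fit the pattern are set aside for manual review rather than guessed at.

```python
import re

# Hypothetical line layout: "Box 1, Folder 3: Correspondence, 1950-1955".
# Adjust the pattern to whatever your own finding aids actually look like.
LINE_PATTERN = re.compile(
    r"^Box\s+(?P<box>\d+),\s*Folder\s+(?P<folder>\d+):\s*"
    r"(?P<title>.+?)(?:,\s*(?P<date>\d{4}(?:-\d{4})?))?$"
)


def to_tab_separated(text):
    """Return (rows, unmatched): tab-separated rows plus lines needing review."""
    rows, unmatched = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = LINE_PATTERN.match(line)
        if match is None:
            unmatched.append(line)  # leave for a human to look at
            continue
        rows.append("\t".join([
            match["box"], match["folder"], match["title"], match["date"] or "",
        ]))
    return rows, unmatched


sample = """Box 1, Folder 3: Correspondence, 1950-1955
Box 1, Folder 4: Photographs
Miscellaneous loose papers"""

rows, unmatched = to_tab_separated(sample)
print("\n".join(rows))             # paste-ready for Excel
print("Needs review:", unmatched)
```

The tab-separated rows can be pasted straight into Excel, which splits them into columns on the tabs.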

Paul

--
Paul Sutherland
Archivist of Indigenous Materials
Center for Native American and Indigenous Research
Library & Museum
American Philosophical Society
105 S. 5th Street, 2nd Floor
Philadelphia, PA 19106
Lenapehoking

I am currently on a reduced schedule working Tuesdays-Thursdays. My colleagues at CNAIR and I can be reached at cn...@amphilsoc.org.

I respectfully acknowledge that I work and reside in Lenapehoking, the homeland of the Lenape people in past, present, and future generations. I am grateful for the past and ongoing generosity of numerous Indigenous communities and individuals who have offered guidance, expertise, and opportunities for collaboration that make my work possible.

Learn more about ...
- The Indigenous Subject Guide to our Indigenous collections, updated frequently
- Blog posts by CNAIR staff & fellows

- Fellowships (residential and non) for working with our collections and elsewhere.
- Scheduling a visit to our Reading Room to view our collections
- Our Museum's current exhibit Philadelphia: The Revolutionary City, open through the end of 2025

Kevin Schlottmann

Sep 25, 2025, 10:22:47 AM
to Paul Sutherland, Alexis Jimenez, Archivesspac...@lyrasislists.org
Hi Alexis,

You might try ChatGPT here -- I have had occasional success asking it to turn unstructured text into delimited data.  The prompts need to be very precise, the human labor is shifted to the back end to proofread the output, and it only works if the original Word doc is somewhat consistent, but if it works it will be way faster than learning Python or doing it entirely by hand.
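
Part of the proofreading step can be automated. The sketch below is only an illustration (the function name `find_bad_rows` and the tab delimiter are assumptions): it flags rows in delimited output whose number of fields differs from what is expected, which catches dropped or merged columns but not wrong content.

```python
import csv
import io


def find_bad_rows(delimited_text, expected_columns, delimiter="\t"):
    """Return (line_number, row) pairs whose field count is not expected_columns."""
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter)
    bad = []
    for row in reader:
        # csv.reader yields [] for blank lines; ignore those
        if row and len(row) != expected_columns:
            bad.append((reader.line_num, row))
    return bad


output = "box\tfolder\ttitle\tdate\n1\t3\tCorrespondence\t1950-1955\n1\t4\tPhotographs\n"
for line_number, row in find_bad_rows(output, expected_columns=4):
    print(f"Line {line_number} has {len(row)} fields: {row}")
```

A human still needs to check that the values themselves are right.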

Kevin 

--
Kevin Schlottmann
Head of Archives Processing
Rare Book & Manuscript Library
Butler Library, Room 801
Columbia University Libraries
Pronouns: he/him/his
535 W. 114th St., New York, NY  10027
(212) 854-8483

Jennifer Brcka

Sep 25, 2025, 11:02:53 AM
to Alexis Jimenez, Archivesspac...@lyrasislists.org
Hi Alexis, 
I've also had some preliminary success using AI for this type of task, and shared a bit about it at the member forum last month. If you're interested, talk/slides can be found here. Best wishes, 
Jennifer

Gray, Krista Lauren

Sep 25, 2025, 11:12:41 AM
to Alexis Jimenez, Archivesspac...@lyrasislists.org
Hi Alexis,

I’m leading the sub-project to convert our word processing finding aids into structured data at the University of Illinois Urbana-Champaign (we have about 4,500 of these) as we migrate from Archon to ArchivesSpace. Our word processing finding aids are essentially container lists (currently, collection-level metadata is entered into the Archon collection record, and the container list is linked as a PDF).

While the Python code I’ve written for this project isn’t ready to be shared broadly at this time, I’d be happy to meet with you over Zoom or Teams to see whether it might be helpful for your use cases and whether you’d be able and willing to try it out.

As a note, the code I have assumes the finding aid does not use tables. If your word processing finding aids are table-based, one of the other respondents’ suggestions will likely work better for you.
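
To give a sense of what reading a table-free finding aid can involve, here is a minimal sketch that pulls paragraph text out of a .docx file using only the Python standard library. It is not the Illinois project code, and the function name `docx_paragraphs` is hypothetical. It assumes .docx files (older .doc files would need to be re-saved first), treats tab characters inside runs as field separators, and flattens any table cells into ordinary paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each non-empty paragraph in a .docx file, in order."""
    # A .docx file is a zip archive; the main text lives in word/document.xml
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        # Only look at direct children of runs, so tab-stop definitions
        # in the paragraph properties are not mistaken for tab characters.
        for run in para.iter(W + "r"):
            for node in run:
                if node.tag == W + "t":
                    parts.append(node.text or "")
                elif node.tag == W + "tab":
                    parts.append("\t")
        text = "".join(parts)
        if text.strip():
            paragraphs.append(text)
    return paragraphs


# Example use (hypothetical filename):
# for line in docx_paragraphs("finding_aid.docx"):
#     print(line)
```

Each returned string is one paragraph, so a container list typed one entry per line comes back as one entry per list item, ready for further splitting.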

If you are interested in talking more, please email me off-list (gra...@illinois.edu).

Good luck!

Krista

Megan Brett

Sep 25, 2025, 11:13:12 AM
to Alexis Jimenez, 'Tom Hanstra' via Archivesspace_Users_Group
Hello all,

Although OpenRefine does not support Word, you can copy and paste the text in through the Clipboard option, or save the document as rich text and import it that way. I've had some luck working that way, particularly with documents that contain lots of lists. There are some recordings about ASpace and OpenRefine in the Help Center (for which I was a co-presenter).

Megan

James Truitt

Sep 26, 2025, 1:37:50 PM
to Archivesspace_Users_Group, Megan Brett, 'Tom Hanstra' via Archivesspace_Users_Group, Alexis Jimenez
Hello from another Philadelphian!

I'll endorse Paul and Megan's suggestions—depending on how structured your word docs are (and how consistent that structure is), you can probably get a lot of the way with regular expressions in a good text editor (and maybe OpenRefine if needed). If you use a Mac, BBEdit is a great environment for using regular expressions.

Best,
James Truitt
Swarthmore College