Normalizing modern Microsoft Office files

Patrick Burden

Apr 21, 2025, 9:08:01 AM
to archivematica
Good morning all. I wanted to see if anyone has solved a problem that I am currently dealing with in Archivematica.  We have a bunch of files across several folders in DOCX, XLSX, and PPTX format.  We want to normalize the access copies to PDF and CSV.  When I run this process in Archivematica, it gives me an error, apparently because of the XML nature of the files, and it is unable to process them.

At the moment my workaround is to create batch files that do the conversion.  I was able to get DOCX to PDF working via a PowerShell script, but I am struggling with XLSX to CSV and PPTX to PDF.  Here is the script so you can get an idea of what I am looking for.

***
# Convert every .docx under the current directory to PDF using Word's COM interface.
$documents_path = $PWD.Path
$word_app = New-Object -ComObject Word.Application
Get-ChildItem -Path $documents_path -Recurse -Filter *.docx | ForEach-Object {
    $document = $word_app.Documents.Open($_.FullName)
    # Write the PDF next to the source file, keeping the base name.
    $pdf_filename = "$($_.DirectoryName)\$($_.BaseName).pdf"
    # 17 is wdFormatPDF in Word's WdSaveFormat enumeration.
    $document.SaveAs([ref] $pdf_filename, [ref] 17)
    $document.Close()
}
$word_app.Quit()
***

Has anyone been able to solve the other formats, or is there a solution for these issues that I am not familiar with?  Thank you in advance for your help.

Nico Poppelier

Apr 22, 2025, 2:56:19 AM
to archivematica
Hello Patrick,

The best method I know is this one:

tmpdir=/var/archivematica/sharedDirectory/tmp
libreoffice --headless --invisible --convert-to pdf --outdir "$tmpdir" "%fileFullName%"
mv "$tmpdir/%fileName%.pdf" "%outputDirectory%%prefix%%fileName%%postfix%.pdf"

We have this in place in our Archivematica installation.
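
The same LibreOffice approach could plausibly cover the XLSX to CSV and PPTX to PDF cases from the original question. The following is only a sketch under assumptions, not a tested Archivematica command: the function names `target_format` and `convert_office_file` are made up for illustration, and it assumes the LibreOffice binary is on the PATH as `libreoffice` (some installations name it `soffice`).

```shell
#!/bin/sh
# Hypothetical helper: choose a LibreOffice --convert-to target
# from the file extension (CSV for spreadsheets, PDF otherwise).
target_format() {
    case "$1" in
        *.xlsx|*.XLSX) echo csv ;;
        *.docx|*.DOCX|*.pptx|*.PPTX) echo pdf ;;
        *) return 1 ;;   # unknown extension: signal failure
    esac
}

# Hypothetical wrapper: convert one file into the given output directory.
# Usage: convert_office_file <source file> <output directory>
convert_office_file() {
    src=$1
    outdir=$2
    fmt=$(target_format "$src") || {
        echo "unsupported file type: $src" >&2
        return 1
    }
    # Assumption: the binary is called "libreoffice" on this system.
    libreoffice --headless --convert-to "$fmt" --outdir "$outdir" "$src"
}
```

For example, `convert_office_file budget.xlsx /tmp/out` should leave `/tmp/out/budget.csv`. As far as I know, LibreOffice's default CSV export writes only the first sheet of a workbook, so multi-sheet XLSX files would need extra handling.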

Lately I've begun to wonder whether it is really necessary, because the latest Microsoft Office formats are XML-based, unlike the preceding formats without an 'x'.

Regards, 
Nico Poppelier
University Medical Centre Utrecht, the Netherlands
On Monday, April 21, 2025 at 15:08:01 UTC+2, patrick...@gmail.com wrote: