FITS resource usage on PDFs


Nicole Currens

Sep 11, 2024, 3:51:59 PM
to archivematica
Hi everyone, I'm looking for suggestions on an issue we're having with FITS in Archivematica when it processes PDFs. For context, we manually installed Archivematica on RHEL 9 on a VM with 8 CPUs and 32 GB of RAM.

We are trying to process a 50 GB SIP with about 1,500 files, most of which are PDFs. The transfer fails on the FITS steps at two points: on the Transfer page at the "Characterize and extract metadata" job, and on the Ingest page in the "Process submission documentation" microservice, at the "Characterize and extract metadata on submission documentation" job.

In the Archivematica dashboard failure logs I see some JVM out-of-memory errors. Digging deeper into the logs on our VM, we see that the process that appears to be failing is a Perl process running exiftool as part of FITS, and it fails because it runs out of memory. I have included a screenshot of our resource usage monitoring so you can see the memory spike and the resulting crash.

Other SIPs that are larger/contain more files don't fail on this step or consume nearly as much memory, so we believe this is specifically related to the fact that this SIP is mostly PDFs. And other than this specific situation, the rest of the microservices seem to be running well within the VM limits. So we'd like to avoid adding more memory if we can. We also don't want to turn off FITS.

Given that, I have a few questions:
- Is it normal for exiftool to consume so much memory when processing PDFs?
- Is there anything we can do other than turn off FITS/add more memory to the VM that might improve the performance of exiftool on PDFs?

Would love to hear from anyone who has faced similar problems.
Thanks,
Nicole Currens
Senior Software Developer
University of Texas Libraries
Screenshot 2024-09-06 at 9.28.45 AM.png

Joseph Anderson

Sep 12, 2024, 9:41:44 AM
to archivematica
We once had a memory issue with JHOVE that was caused by TIFFs, but only certain TIFFs that weren't necessarily bigger or smaller than others that processed fine. In that case I just boosted our RAM a bit and it stopped being an issue; something about those particular TIFFs was causing a memory spiral at the amount of RAM we had.

Anyhow, it sounds like you have a lot of RAM at 32 GB. I would try to determine whether a specific PDF within the SIP is causing the problem or whether it's just the quantity of PDFs in a single SIP. It may be a corrupt PDF of some sort. You could try running all 1,500 as separate SIPs to see if one of them gets hung up, or look at the logs to see which file was being processed during the memory spike.
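One way to narrow this down outside Archivematica is to run exiftool on each PDF individually and record the peak memory of each run. The following is only a rough sketch under several assumptions: it is Linux-only, it assumes exiftool is on the PATH, it calls exiftool with no extra options (FITS may invoke it differently), and the function names `run_and_measure` and `rank_by_memory` are made up for illustration.

```python
import os
from pathlib import Path


def run_and_measure(cmd):
    """Run cmd to completion and return (exit_code, peak_rss) for that child.

    peak_rss is the kernel's ru_maxrss value, reported in KiB on Linux.
    Very small readings are not meaningful; this is meant for spotting
    outliers that use gigabytes. A negative exit_code means the child was
    killed by a signal (for example -9 after an OOM kill).
    """
    # Discard the child's stdout; only its resource usage matters here.
    discard_stdout = [(os.POSIX_SPAWN_OPEN, 1, os.devnull, os.O_WRONLY, 0)]
    pid = os.posix_spawnp(cmd[0], list(cmd), os.environ,
                          file_actions=discard_stdout)
    # wait4 returns the resource usage of this specific child process.
    _, status, usage = os.wait4(pid, 0)
    return os.waitstatus_to_exitcode(status), usage.ru_maxrss


def rank_by_memory(folder, tool=("exiftool",), pattern="*.pdf"):
    """Run the tool once per matching file; return results, largest first.

    Each result is (peak_rss, exit_code, path). The default pattern is
    case-sensitive, so files ending in ".PDF" need a second pass.
    """
    results = []
    for path in sorted(Path(folder).rglob(pattern)):
        code, rss = run_and_measure([*tool, str(path)])
        results.append((rss, code, path))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

Calling `rank_by_memory("/path/to/sip")` (a placeholder path) and printing the first few entries should show whether one or two PDFs stand out or whether memory use is uniformly high across the set.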

We ran into another problem last year, when we discovered that all MP4s output from Webex recordings were missing information in their headers and Archivematica was unable to identify them as video/mp4. It wasn't a memory issue, but the point is that applications sometimes output corrupted files.

Best of luck,

Joe Anderson

Susan Borda

Sep 12, 2024, 10:57:27 AM
to archiv...@googlegroups.com
Hi Nicole-
There are a few known issues with FITS and Archivematica:

--
Susan Borda
Digital Preservation Projects Manager
Digital Preservation Unit
University of Michigan Libraries
Buhr Building
My office phone number is temporarily disconnected while I work remotely due to COVID-19. Please contact me via email.
 