Hi,
I'm using Archivematica to store a relatively large set of PDF files (~8,000 files, amounting to ~500 GB) and upload them to AtoM.
A VERY large portion of the time is spent by Ghostscript during PDF normalization, which is very CPU-bound and single-threaded (I have a 24-CPU server, with an 8-CPU VM).
Is it possible to use the 'parallel' command to parallelize that process? Has someone already attempted that?
In that case, I intend to go to "Preservation planning\Normalization\Commands\Command Transcoding to pdfa with Ghostscript" and change:
gs -dNumRenderingThreads=8 -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile="%outputDirectory%%prefix%%fileName%%postfix%.pdf" "%fileFullName%"
to:
parallel -j 8 gs -dNumRenderingThreads=8 -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile="%outputDirectory%%prefix%%fileName%%postfix%.pdf" "%fileFullName%"
Would that be the right place to do that? Should that work? Has someone attempted that?
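For comparison, here is a minimal sketch of file-level parallelism done outside Archivematica: one gs process per PDF, with up to 8 running at a time. It is not how Archivematica schedules its own jobs. The point is that a wrapper like parallel only has something to schedule when it is handed a list of files, whereas the command above handles a single file per invocation. The function names and the output naming (source stem plus ".pdf", without the %prefix%/%postfix% placeholders) are assumptions for illustration. -dNumRenderingThreads is left out because, as far as the Ghostscript documentation describes it, it applies to banded rendering for raster devices rather than to pdfwrite.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_gs_command(src: Path, out_dir: Path) -> list[str]:
    """Return the gs argument list for one PDF, using the flags from this thread."""
    out_file = out_dir / f"{src.stem}.pdf"  # hypothetical naming, no prefix/postfix
    return [
        "gs", "-dPDFA", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite", "-dPDFACompatibilityPolicy=1",
        f"-sOutputFile={out_file}", str(src),
    ]


def normalize_all(src_dir: Path, out_dir: Path, jobs: int = 8) -> list[int]:
    """Run one gs process per PDF in src_dir, at most `jobs` at a time.

    Returns the gs exit codes in sorted file-name order.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [build_gs_command(p, out_dir) for p in sorted(src_dir.glob("*.pdf"))]
    # Threads are enough here: each one only waits for its gs child to exit.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

A call such as normalize_all(Path("/path/to/pdfs"), Path("/path/to/out")) (both paths are placeholders) would give a rough idea of how much throughput the 8-CPU VM gains when several gs processes run side by side.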
Tks,
Roberto
--
-----------------------------------------------------
Marcos Roberto Greiner
Optimists think we are in the best of all worlds
Pessimists fear that this is true
James Branch Cabell
-----------------------------------------------------
--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/archivematica/809515a9-eedc-4af6-a91e-55f0a6ea3e9f%40gmail.com.
I already tried that a couple of weeks ago and it didn't work. Checking again, I still don't think it will work. In "https://github.com/artefactual/archivematica/blob/3e52494735ebfeb0cabc477d95d692034f4b3142/src/MCPClient/README.md#concurrency", it says "(at the time of writing, we just default to running as many processes as you have CPUs, but it might make sense to run fewer or more in some cases)". And indeed, when I run "systemctl status archivematica-mcp-client.service", I get the following:
root@archive:~# /usr/bin/systemctl status archivematica-mcp-client.service
● archivematica-mcp-client.service - Archivematica MCPClient
Loaded: loaded (/lib/systemd/system/archivematica-mcp-client.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-12-09 11:31:07 -03; 22h ago
Main PID: 211462 (python)
Tasks: 20 (limit: 11763)
Memory: 5.7G
CPU: 10h 47min 36.686s
CGroup: /system.slice/archivematica-mcp-client.service
├─211462 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─281110 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─281285 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─281287 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─281289 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─287866 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─288740 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─289068 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
├─290324 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
└─292261 gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=/var/archivematica/sharedDirectory/currentlyProcessing/convenio-protocolo-de-intencoes-acordo-de-cooperacao-me>
Dec 10 10:00:30 archive python[290324]: Can't find (or can't open) font file ArialMT.
Dec 10 10:00:30 archive python[290324]: Didn't find this font on the system!
Dec 10 10:00:30 archive python[290324]: Substituting font Helvetica for ArialMT.
Dec 10 10:00:30 archive python[290324]: Loading NimbusSans-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusSans-Regular... 5407956 3927280 15305976 12702858 4 done.
Dec 10 10:00:30 archive python[290324]: program="Ghostscript"; version="9.55.0"
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
So the MCPClient already seems to be running 9 processes (1 parent plus 1 worker per CPU on the 8-CPU VM), yet the listing shows only a single gs process, so gs is still not being parallelized. Any idea why this could be happening?
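One way to check whether more than one gs process ever runs at once is to sample the process table while a transfer is normalizing. This is a minimal sketch, assuming a Linux ps that accepts "-eo comm=" (command names only, no header); the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def count_commands(ps_output: str) -> Counter:
    """Count processes by command name, given one command name per line."""
    return Counter(line.strip() for line in ps_output.splitlines() if line.strip())


def running_gs() -> int:
    """Return how many processes named 'gs' exist right now."""
    out = subprocess.run(
        ["ps", "-eo", "comm="],  # assumption: procps-style ps, as on Ubuntu
        capture_output=True, text=True, check=True,
    ).stdout
    return count_commands(out)["gs"]  # a Counter returns 0 for missing names
```

Calling running_gs() every few seconds during normalization would show whether the count ever goes above 1.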
Tks,
Roberto
PS: I'm running Archivematica version 1.17.1 on Ubuntu 22.04.