Using parallel during a Transfer


Roberto Greiner

6:52 AM (17 hours ago)
to archivematica

Hi,

I'm using Archivematica to store a relatively large set of PDF files (~8,000 files, amounting to ~500 GB) and upload them to AtoM.

A VERY large portion of the time is spent by Ghostscript during PDF normalization, which is very CPU-bound and single-threaded (I have a 24-CPU server, with an 8-CPU VM).

Is it possible to use the 'parallel' command to parallelize that process? Has someone already attempted that?

In this case, I intend to go to "Preservation planning\Normalization\Commands\Command Transcoding to pdfa with Ghostscript" and change:

gs -dNumRenderingThreads=8 -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile="%outputDirectory%%prefix%%fileName%%postfix%.pdf" "%fileFullName%"

to

parallel -j 8 gs -dNumRenderingThreads=8 -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile="%outputDirectory%%prefix%%fileName%%postfix%.pdf" "%fileFullName%"

Would that be the right place to do that? Should that work? Has someone attempted that?

Tks,

Roberto

-- 
  -----------------------------------------------------
                Marcos Roberto Greiner

   Optimists think that we live in the best of all worlds
    Pessimists are afraid that this is true
                             James Branch Cabell
  -----------------------------------------------------

Santiago Rodríguez Collazo

7:18 AM (16 hours ago)
to archiv...@googlegroups.com
Hi Roberto

I think using parallel won't work in this case, because the parameter passed to the command is the full name of a single file, not a list of files that need to be normalized in parallel.
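
To illustrate the point with a small sketch: a job runner such as parallel fans work out over a list of inputs, but the normalization command is invoked once per file, with the file name already substituted. The snippet below is only an illustration under assumptions: it uses `xargs -P` as a stand-in for `parallel` and `echo` as a stand-in for the `gs` call, so it runs without either tool, and the file names are made up.

```shell
#!/bin/sh
# "echo normalize ..." stands in for the real gs call (assumption for the demo).

# What the normalization command amounts to: one file name has already been
# substituted, so a job runner handed that single argument has exactly one job.
single=$(printf '%s\n' "report.pdf" | xargs -P 8 -I{} echo "normalize {}")

# What a job runner needs in order to help: a list with one input per line.
# Output order is not guaranteed when jobs run concurrently, hence the sort.
batch=$(printf '%s\n' a.pdf b.pdf c.pdf | xargs -P 8 -I{} echo "normalize {}" | sort)

echo "single file name: $(printf '%s\n' "$single" | wc -l | tr -d ' ') job(s)"
echo "list of files:    $(printf '%s\n' "$batch" | wc -l | tr -d ' ') job(s)"
```

With only one input, the `-P 8` (or `-j 8`) setting has nothing to distribute, which is why the concurrency has to come from the layer that schedules the per-file jobs.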

Another approach to take advantage of all the CPUs is to use multiple MCP clients, as explained in https://www.archivematica.org/en/docs/archivematica-1.18/admin-manual/installation-setup/customization/scaling-archivematica/#deploy-multiple-mcpclients

This way, the system will be able to process more PDFs in parallel using all the CPUs available. You might need to experiment with a different number of MCP clients to find the sweet spot between performance and throughput.

/santi





--
Santiago Rodríguez
DevOps, Artefactual Systems Inc.

Roberto Greiner

8:07 AM (16 hours ago)
to archiv...@googlegroups.com

I already tried that a couple of weeks ago and it didn't work. Checking again, I still don't think it will work. In "https://github.com/artefactual/archivematica/blob/3e52494735ebfeb0cabc477d95d692034f4b3142/src/MCPClient/README.md#concurrency", it says "(at the time of writing, we just default to running as many processes as you have CPUs, but it might make sense to run fewer or more in some cases)". And indeed, when I run "systemctl status archivematica-mcp-client.service", I get the following:

root@archive:~# /usr/bin/systemctl status archivematica-mcp-client.service
● archivematica-mcp-client.service - Archivematica MCPClient
     Loaded: loaded (/lib/systemd/system/archivematica-mcp-client.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-12-09 11:31:07 -03; 22h ago
   Main PID: 211462 (python)
      Tasks: 20 (limit: 11763)
     Memory: 5.7G
        CPU: 10h 47min 36.686s
     CGroup: /system.slice/archivematica-mcp-client.service
             ├─211462 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─281110 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─281285 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─281287 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─281289 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─287866 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─288740 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─289068 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             ├─290324 /usr/share/archivematica/virtualenvs/archivematica/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
             └─292261 gs -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=/var/archivematica/sharedDirectory/currentlyProcessing/convenio-protocolo-de-intencoes-acordo-de-cooperacao-me>

Dec 10 10:00:30 archive python[290324]: Can't find (or can't open) font file ArialMT.
Dec 10 10:00:30 archive python[290324]: Didn't find this font on the system!
Dec 10 10:00:30 archive python[290324]: Substituting font Helvetica for ArialMT.
Dec 10 10:00:30 archive python[290324]: Loading NimbusSans-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusSans-Regular... 5407956 3927280 15305976 12702858 4 done.
Dec 10 10:00:30 archive python[290324]: program="Ghostscript"; version="9.55.0"
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO
Dec 10 10:00:30 archive python[290324]: GPL Ghostscript 9.55.0: Text string detected in DOCINFO cannot be represented in XMP for PDF/A1, discarding DOCINFO


So, the MCPClient already seems to be running 9 instances (for 8 CPUs, so 1 parent and 1 child per CPU), but gs is still not being parallelized. Any idea why this could be happening?
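
One way to check how much work is really running at once is to compare the CPU count with the number of MCPClient and gs processes at a given moment. This is only a rough diagnostic sketch, assuming the process names shown in the systemctl output above and that `pgrep` (from procps) is available. It does not explain why the jobs are being serialized.

```shell
#!/bin/sh
# Number of CPUs currently online, as seen by the VM.
cpus=$(getconf _NPROCESSORS_ONLN)

# MCPClient processes (parent plus workers). The [.] keeps the pattern from
# matching a shell whose own command line happens to contain this text.
mcp_procs=$(pgrep -f 'archivematicaClient[.]py' 2>/dev/null | wc -l | tr -d ' ')

# Ghostscript processes running right now (exact process name "gs").
gs_procs=$(pgrep -x gs 2>/dev/null | wc -l | tr -d ' ')

echo "CPUs online:         $cpus"
echo "MCPClient processes: $mcp_procs"
echo "gs processes:        $gs_procs"
```

Running it a few times during normalization (for example under `watch`) shows whether more than one gs process is ever active at the same time.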

Tks,

Roberto

PS: running Archivematica version 1.17.1 on Ubuntu 22.04.
