Failure generating mets.xml file


Roberto Greiner

May 22, 2025, 1:40:47 PM
to archiv...@googlegroups.com

Hi, I'm having a problem processing a relatively large transfer (about 560 PDF files).

I've created a metadata.csv file with the descriptions for these files, and everything is set up to export them to my AtoM server at the end of processing.

I have already tested the workflow with a small transfer (1 PDF), and the whole process worked fine.

When I run the large task with all 560 files, it fails while generating the mets.xml file, with a "ValueError: Empty tag name" error (full output below).

Any idea of what could be the cause of this problem?

Thank you,

Roberto Greiner


Information about my system:

Archivematica version 1.17.0 is installed on an Ubuntu 22.04 VM with 6 GB of RAM and 1 TB of disk space.

The full error output I got was the following:

Empty tag name
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/client/job.py", line 172, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1942, in call
    main(
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1772, in main
    structMapDivObjects = createFileSec(
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1038, in createFileSec
    createFileSec(
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1145, in createFileSec
    DMDIDS = createDMDIDsFromCSVMetadata(
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 274, in createDMDIDsFromCSVMetadata
    dmdsecs = createDmdSecsFromCSVParsedMetadata(job, metadata, state)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 358, in createDmdSecsFromCSVParsedMetadata
    etree.SubElement(other, normalizeNonDcElementName(key)).text = v
  File "src/lxml/etree.pyx", line 3205, in lxml.etree.SubElement
  File "src/lxml/apihelpers.pxi", line 180, in lxml.etree._makeSubElement
  File "src/lxml/apihelpers.pxi", line 1654, in lxml.etree._getNsTag
  File "src/lxml/apihelpers.pxi", line 1689, in lxml.etree.__getNsTag
ValueError: Empty tag name


The first lines of my metadata.csv are the following (the full file has 566 lines):

filename,dc.identifier,dc.title,dc.description,dc.publisher,dc.provenance,dc.subject,dc.language,dc.coverage,dc.coverage,dc.rights,dc.type,dc.format,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/JURTRB 019-CR-CA/000302000001000006001.pdf,000302000001000006001.pdf, Processo: 00405.2005.025.15.00-6 Vara: Vara do Trabalho de Botucatu - SP RECLAMAÇÃO TRABALHISTA Autor CÉLIO AMARILDO DE ALMEIDA Réu: FUNDAÇÃO PARA O DESENVOLVIMENTO DA UNESP - FUNDUNESP 000302000001000006001 FUNDUNESP Fxndaç5 para o Desenvolvhaento da Vnesp,RECLAMANTE: Célio Amarildo de Almeida,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/JURTRB 019-CR-CA/000302000002000006002.pdf,000302000002000006002.pdf, Processo: 00152.2002.127.15.00-8 Vara: Vara do Trabalho de Teodoro Sampaio - SP RECLAMAÇÃO TRABALHISTA Autor: CARLOS EDUARDO REIS Réu: FUNDAÇÃO PARA O DESENVOLVIMENTO DA UNESP - ,RECLAMANTE: Carlos Eduardo Reis,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/JURTRB 19 - CR - CA/000302000003000006003.pdf,000302000003000006003.pdf, Nº 033 Processo: 00146.2002.069.15.00-4 Vara: Vara do Trabalho de Registro - SP RECLAMAÇÃO TRABALHISTA Autor: CARLOS FERREIRA + 4 Réu: FUNDAÇÃO PARA O DESENVOLVIMENTO DA UNESP -  000302000003000006003,RECLAMANTE: Carlos Ferreira + 4,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/JURTRB 016 - MI - MA/000302000004000006004.pdf,000302000004000006004.pdf, Processo: 00788-2003-056-15-00 -8 Vara: Vara do Trabalho de Andradina - SP RECLAMAÇÃO TRABALHISTA Autor: MILTON OLIVEIRA GONÇALVES Réu: FUNDAÇÃO PARA O DESENVOI-VIMENTO DA UNESP -  000302000004000006004,RECLAMANTE: Milton Oliveira Gonçalves,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The AtoM server that is the destination for the export is an Ubuntu 24.04 VM running AtoM 2.9.


-- 
  -----------------------------------------------------
                Marcos Roberto Greiner

   The optimist proclaims that we live in the best of all possible worlds;
    and the pessimist fears this is true.
                             James Branch Cabell
  -----------------------------------------------------

Douglas Cerna

May 22, 2025, 2:17:39 PM
to archiv...@googlegroups.com
Hello,

I really appreciate you sharing the Archivematica version and a sample of your CSV file.

I can reproduce your problem with the metadata.csv file in SampleTransfers/DemoTransferCSV of the archivematica-sampledata repository, by changing the header row from:

filename,dc.title,dc.creator,dc.subject,dc.subject,dc.subject,dc.subject,dc.subject,dc.description,dc.publisher,dc.contributor,dc.date,dc.type,dc.format,dc.identifier,dc.source,dc.language,dc.language,dc.relation,dc.coverage,dc.rights

to:

filename,dc.title,dc.creator,dc.subject,dc.subject,dc.subject,dc.subject,dc.subject,dc.description,dc.publisher,dc.contributor,dc.date,dc.type,dc.format,dc.identifier,dc.source,dc.language,dc.language,,,dc.rights

Notice that I removed the dc.relation and dc.coverage headers, which causes the same error:

Empty tag name
Traceback (most recent call last):
  File "/src/src/archivematica/MCPClient/client/job.py", line 171, in JobContext
    yield
  File "/src/src/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1961, in call
    main(
  File "/src/src/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1791, in main
    structMapDivObjects = createFileSec(
  File "/src/src/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1164, in createFileSec
    DMDIDS = createDMDIDsFromCSVMetadata(
  File "/src/src/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 293, in createDMDIDsFromCSVMetadata
    dmdsecs = createDmdSecsFromCSVParsedMetadata(job, metadata, state)
  File "/src/src/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 377, in createDmdSecsFromCSVParsedMetadata
    etree.SubElement(other, normalizeNonDcElementName(key)).text = v
  File "src/lxml/etree.pyx", line 3206, in lxml.etree.SubElement
  File "src/lxml/apihelpers.pxi", line 180, in lxml.etree._makeSubElement
  File "src/lxml/apihelpers.pxi", line 1654, in lxml.etree._getNsTag
  File "src/lxml/apihelpers.pxi", line 1689, in lxml.etree.__getNsTag
ValueError: Empty tag name

I think the problem is that, when you add non-Dublin Core metadata, the Generate METS.xml document job uses the headers of the metadata.csv file as the tag names of the XML elements that will contain your data.

Notice in the docs how these tags (alternative_title, dates_of_publication, etc.) match the headers in the sample CSV:

  <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="CUSTOM">
    <mets:xmlData>
      <alternative_title>Sunshine Coast News </alternative_title>
      <dates_of_publication>1945-1995 </dates_of_publication>
      <frequency>Weekly </frequency>
      <digital_file_format>image/jpg</digital_file_format>
    </mets:xmlData>
  </mets:mdWrap>

In your sample I notice several empty values at the end of the header row, so I'd recommend filling those in. I'm also curious whether you used the same header structure in your successful transfer, and how that looked in the METS.
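
If the headers are used as tag names, as suggested above, a blank header cell would translate into an empty tag name. A quick check for blank header cells before starting a transfer could look like the sketch below. It uses only Python's standard csv module; the function name find_blank_headers and the sample header are illustrative assumptions and not part of Archivematica.

```python
import csv
import io


def find_blank_headers(csv_text):
    """Return the 1-based positions of header cells that are empty or whitespace."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [pos for pos, name in enumerate(header, start=1) if not name.strip()]


# Illustrative header: two named columns followed by three blank ones.
sample = "filename,dc.title,,,\nobjects/a.pdf,Some title,,,\n"
blank_positions = find_blank_headers(sample)
print(blank_positions)
```

An empty result means every column has a name; any positions returned point at the header cells to fill in or remove.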

Hope this helps.




--
Douglas Cerna (he/him),
Software Developer, Artefactual Systems Inc.
http://www.artefactual.com

Roberto Greiner

May 22, 2025, 2:48:36 PM
to archiv...@googlegroups.com

Actually,

after sending this message I kept running some tests and found that the problem was with one of the PDF files in the batch. I kept trying with smaller batches (50 instead of the full 560 files, then smaller when that failed), and it worked until I hit files 68-70. I've uploaded those manually to AtoM and resumed processing the rest of the files, and it's working fine so far.

So, there was something in one of those PDF files or in their metadata that broke the METS generator. I've looked and didn't notice anything off in there. For documentation, the metadata I used for those PDF files is the following:

filename,dc.identifier,dc.title,dc.description,dc.publisher,dc.provenance,dc.subject,dc.language,dc.coverage,dc.coverage,dc.rights,dc.type,dc.format,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/000302000079000006079.pdf,000302000079000006079 FUNDUNESP Fundação para Desenvolvimento da Unesp Nº DE CONTROLE 30.009.885-6 IPVA - PLACAS DAR 4347 - RENAVAM 746532571 (ITESP) - 18-12-2007,000302000079000006079 FUNDUNESP Fundação para Desenvolvimento da Unesp Nº DE CONTROLE 30.009.885-6 IPVA - PLACAS DAR 4347 - RENAVAM 746532571 (ITESP) - 18-12-2007,LANÇAMENTO IPVA (ITESP),Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/000302000080000006080.pdf,Processo DPE 038/95 000302000080000006080 Interessado: Instituto de Terras - ITESP e Secretaria da Justiça e da Defes a da Cidadania Assunto: Execução de Programa de Desenvolvimento  Estado de SÃO PAULO,Processo DPE 038/95 000302000080000006080 Interessado: Instituto de Terras - ITESP e Secretaria da Justiça e da Defes a da Cidadania Assunto: Execução de Programa de Desenvolvimento  Estado de SÃO PAULO,PROCESSO ITESP DPE 038/95 - Pasta 1 - Institucional,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
objects/000302000081000006081.pdf,Ofício n° 07/03 PRES/AJ/MM. 000302000081000006081 São Paulo 02 . de julho de 2003. Arthur Francisco Marques. Doutor Jonas Villas Boas Fundação Instituto de Terras do Estado de São Paulo,Ofício n° 07/03 PRES/AJ/MM. 000302000081000006081 São Paulo 02 . de julho de 2003. Arthur Francisco Marques. Doutor Jonas Villas Boas Fundação Instituto de Terras do Estado de São Paulo,OFICIO N 07/03 PRES/AJ/MM,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,ponto de acesso de nome,pt_BR,São Paulo - SP,Fundação para o Desenvolvimento da Unesp  - FUNDUNESP,Este documento é de propriedade da Fundunesp. Ao utilizá-lo colocar os devidos créditos à Unesp  segundo Lei de Direito Autoral.,Texto,PDF,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
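
The narrowing-down approach described above (running ever smaller batches until the failing rows are isolated) could be prepared with a sketch like the following. It only splits the metadata.csv text into smaller files that each repeat the header row; the function name split_into_batches and the sample rows are illustrative assumptions, and the matching PDFs would still need to be placed in each transfer by hand.

```python
import csv
import io


def split_into_batches(csv_text, batch_size):
    """Yield CSV texts that each repeat the header and hold up to batch_size data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return
    header, data = rows[0], rows[1:]
    # batch_size must be a positive integer, otherwise range() raises ValueError.
    for start in range(0, len(data), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data[start:start + batch_size])
        yield out.getvalue()


# Illustrative data: three rows split into batches of two.
sample = "filename,dc.title\nobjects/a.pdf,A\nobjects/b.pdf,B\nobjects/c.pdf,C\n"
batches = list(split_into_batches(sample, 2))
print(len(batches))
```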

Thank you,

Roberto

Roberto Greiner

May 23, 2025, 7:52:22 AM
to archiv...@googlegroups.com


On 22/05/2025 15:17, 'Douglas Cerna' via archivematica wrote:

I think the problem is that, when you add non-Dublin Core metadata, the Generate METS.xml document job uses the headers of the metadata.csv file as the tag names of the XML elements that will contain your data.

Notice in the docs how these tags (alternative_title, dates_of_publication, etc.) match the headers in the sample CSV:

  <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="CUSTOM">
    <mets:xmlData>
      <alternative_title>Sunshine Coast News </alternative_title>
      <dates_of_publication>1945-1995 </dates_of_publication>
      <frequency>Weekly </frequency>
      <digital_file_format>image/jpg</digital_file_format>
    </mets:xmlData>
  </mets:mdWrap>

In your sample I notice several empty values at the end of the header row, so I'd recommend filling those in. I'm also curious whether you used the same header structure in your successful transfer, and how that looked in the METS.


Yes, I used the same structure in all the tasks. I'm attaching the compressed METS.xml file for one of the attempts (I hope this list accepts compressed attachments :-) ). I'm compressing it because it is about 750 KB uncompressed.

In the meantime, I ran another job with another batch of 100 PDF files and it ran without any errors, exporting the files to AtoM at the end.

About the empty fields, I really hadn't paid attention to them, and there is no reason for them to be there. They are probably there because I used Excel to create the original file and may have applied some formatting that made Excel treat those cells as valid fields. I will remove them (also, I'm now using LibreOffice to edit the CSVs).
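
Removing the unnamed columns can also be scripted instead of done in a spreadsheet. The sketch below is one possible way, using only Python's standard csv module; the function name drop_blank_columns and the sample data are illustrative assumptions. Note that it discards any values sitting under a blank header, so it is worth checking first that those columns really are empty.

```python
import csv
import io


def drop_blank_columns(csv_text):
    """Return CSV text without the columns whose header cell is blank."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    keep = [i for i, name in enumerate(rows[0]) if name.strip()]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Rows shorter than the header are padded so indexing never fails.
        padded = row + [""] * (len(rows[0]) - len(row))
        writer.writerow([padded[i] for i in keep])
    return out.getvalue()


# Illustrative data: two named columns followed by two blank ones.
sample = "filename,dc.title,,\nobjects/a.pdf,Some title,,\n"
cleaned = drop_blank_columns(sample)
print(cleaned)
```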

PS: In the attached job the files that failed are the ones pointing to the files 000302000080000006080.pdf, 000302000081000006081.pdf and 000302000082000006082.pdf.

METS.zip

Roberto Greiner

May 23, 2025, 2:21:52 PM
to archiv...@googlegroups.com

After replying to this email I kept running more batches without fixing the metadata.csv file, and ended up getting the same error. I then corrected the metadata file (removing all the empty fields), ran the batch again, and got no error this time. So the cause was indeed the empty fields. The question remains why it only fails sometimes...
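
Why it only fails sometimes is not established in this thread. One thing that could be checked, purely as a guess, is whether the failing rows carry any content (even a single space) under a blank header while the passing rows do not. The sketch below lists such rows; the function name rows_with_data_under_blank_headers and the sample data are illustrative assumptions, and nothing here relies on Archivematica internals.

```python
import csv
import io


def rows_with_data_under_blank_headers(csv_text):
    """Return (line_number, filename) for rows with any content under a blank header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    blank = {i for i, name in enumerate(header) if not name.strip()}
    hits = []
    for row in reader:
        # Whitespace-only cells count as content here, since they are easy to miss.
        if any(i in blank and cell != "" for i, cell in enumerate(row)):
            hits.append((reader.line_num, row[0]))
    return hits


# Illustrative data: the second data row has a stray space under a blank header.
sample = "filename,dc.title,,\nobjects/a.pdf,A,,\nobjects/b.pdf,B, ,\n"
suspects = rows_with_data_under_blank_headers(sample)
print(suspects)
```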

Roberto Greiner
