Bug: AM does not support diacritic file names from a manually normalized unzipped bag


romain guedj

Mar 30, 2022, 7:05:57 AM
to archivematica
Hi Sarah,

AM cannot create the METS file through create_transfer_mets.py when manually normalized files have diacritic characters in their filenames, so the SIP fails.

Context:
AM 1.12.1

Unzipped bag with the following structure:

K:.
│   bag-info.txt
│   manifest-md5.txt
│   processingMCP.xml
│   bagit.txt
│   tagmanifest-md5.txt
│
├───data
│   └───skip-transfer-directory
│       └───1er Communion CHATEL 2008
│           │   1ère Communion annonce.doc
│           │
│           ├───Reportage
│           │       DSC_0001.jpg
│           │       DSC_0002.jpg
│           │
│           └───Groupe
│                   30x45.jpg
│                   G6SS8038.DCR
│                   20x30.jpg
│                   G6SS8038.psd
│                   G6SS8038.JPG
│
├───metadata
│       metadata.csv
│
└───manualNormalization
    ├───access
    │   └───skip-transfer-directory
    │       └───1er_Communion_CHATEL_2008
    │               1ère Communion annonce.pdf
    │
    └───preservation
        └───skip-transfer-directory
            └───1er_Communion_CHATEL_2008
                    1ère Communion annonce.odt

metadata.csv

parts,dc.title,dc.identifier,bcu.process
objects/,FD-KEHREN-OBERSON-ARCHNUMFR_6932-0128,ARCHNUMFR 6932-0128,hierarchical

stdout

Module createTransferMETS_v1.0

--sipUUID "ff538b91-e7b4-4b9b-9965-7ee4ec28e35e" --basePath "/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/FD-KEHREN-OBERSON-ARCHNUMFR_6932-0128-ff538b91-e7b4-4b9b-9965-7ee4ec28e35e/" --xmlFile "/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/FD-KEHREN-OBERSON-ARCHNUMFR_6932-0128-ff538b91-e7b4-4b9b-9965-7ee4ec28e35e/"metadata/submissionDocumentation/METS.xml --basePathString "transferDirectory"

Standard streams
Errors and diagnostics (stderr)

'ascii' codec can't decode byte 0xc3 in position 78: ordinal not in range(128)
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/job.py", line 111, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 800, in call
    args.xml_file, args.base_path, args.base_path_string, args.sip_uuid
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 102, in write_mets
    fsentry_tree.scan()
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 188, in scan
    self.build_tree(self.root_path, parent=self.root_node)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 207, in build_tree
    self.build_tree(dir_entry.path, parent=fsentry)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 207, in build_tree
    self.build_tree(dir_entry.path, parent=fsentry)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 207, in build_tree
    self.build_tree(dir_entry.path, parent=fsentry)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 207, in build_tree
    self.build_tree(dir_entry.path, parent=fsentry)
  File "/usr/lib/archivematica/MCPClient/clientScripts/create_transfer_mets.py", line 212, in build_tree
    db_path = "".join([self.db_base_path, entry_relative_path])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 78: ordinal not in range(128)
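
The message above is typical of a UTF-8 encoded byte string being decoded with the ASCII codec: 0xc3 is the first byte of the two-byte UTF-8 encoding of "è". A minimal Python 3 sketch of that failure mode follows. It is not the code from create_transfer_mets.py, and the helper name try_ascii_decode is hypothetical, used only for illustration.

```python
# Sketch only: reproduces the class of error seen in the traceback,
# not the Archivematica code path itself.

# File name from the bag; "\u00e8" is the precomposed character "è".
name = "1\u00e8re Communion annonce.odt"
raw = name.encode("utf-8")  # bytes as they would be stored on a UTF-8 filesystem

# "è" becomes the two bytes 0xC3 0xA8 in UTF-8.
assert raw[1:3] == b"\xc3\xa8"


def try_ascii_decode(data):
    """Hypothetical helper: return the error text if data is not pure ASCII, else None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = try_ascii_decode(raw)
print(message)
```

The position reported here differs from the one in the traceback because the traceback refers to the full path rather than the bare file name.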

Workaround

If I remove the diacritic characters from the file names, the SIP passes.
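
The renaming in this workaround could be scripted roughly as follows. This is a sketch under assumptions, not an Archivematica tool: the function names strip_diacritics and rename_tree are hypothetical, and the approach (Unicode NFD decomposition, then dropping combining marks) is one common way to turn "è" into "e".

```python
import os
import unicodedata


def strip_diacritics(name):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rename_tree(root):
    """Rename files and directories under root, deepest entries first.

    Returns a list of (old_path, new_path) pairs for every rename performed.
    """
    renamed = []
    # topdown=False visits children before their parent directory,
    # so a directory is only renamed after its contents are handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new_name = strip_diacritics(name)
            if new_name == name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if os.path.exists(new_path):
                # Refuse to overwrite an existing entry.
                raise FileExistsError(new_path)
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed


print(strip_diacritics("1\u00e8re Communion annonce.pdf"))
```

Calling rename_tree on the manualNormalization directory would apply the same change to every entry below it. Characters with no canonical decomposition (for example "ø" or "ß") are left unchanged by this approach. Files under data/ are listed in manifest-md5.txt, so renaming those would also require updating the manifest.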

Bag structure without diacritic characters below the manualNormalization directory:
J:.
│   bag-info.txt
│   manifest-md5.txt
│   processingMCP.xml
│   bagit.txt
│   tagmanifest-md5.txt
│
├───data
│   └───skip-transfer-directory
│       └───1er Communion CHATEL 2008
│           │   1ère Communion annonce.doc
│           │
│           ├───Reportage
│           │       DSC_0001.jpg
│           │       DSC_0002.jpg
│           │
│           └───Groupe
│                   30x45.jpg
│                   G6SS8038.DCR
│                   20x30.jpg
│                   G6SS8038.psd
│                   G6SS8038.JPG
│
├───metadata
│       metadata.csv
│
└───manualNormalization
    ├───access
    │   └───skip-transfer-directory
    │       └───1er_Communion_CHATEL_2008
    │               1ere Communion annonce.pdf
    │
    └───preservation
        └───skip-transfer-directory
            └───1er_Communion_CHATEL_2008
                    1ere Communion annonce.odt

stdout is fine:

Module createTransferMETS_v1.0
--sipUUID "b488255e-7366-41d4-9b94-5a4c5f9046f1" --basePath "/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/FD-KEHREN-OBERSON-ARCHNUMFR_6932-0128-b488255e-7366-41d4-9b94-5a4c5f9046f1/" --xmlFile "/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/FD-KEHREN-OBERSON-ARCHNUMFR_6932-0128-b488255e-7366-41d4-9b94-5a4c5f9046f1/"metadata/submissionDocumentation/METS.xml --basePathString "transferDirectory"

Let me know if you can reproduce it.
Thank you in advance,

Cheers,

Romain

Sarah Romkey

Mar 30, 2022, 8:55:42 AM
to archiv...@googlegroups.com
Hello Romain,

Thanks for this report. If you're able, the best place to file an issue is the Archivematica GitHub Issues repository.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
@archivematica / @accesstomemory




--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/archivematica/a20d520e-84af-4311-9473-ab9de2b88a95n%40googlegroups.com.

gclibrary...@gmail.com

Mar 30, 2022, 9:36:30 AM
to archivematica
Yes, I learned the hard way to rename files and directories before an ingest.

romain guedj

Mar 30, 2022, 11:04:32 AM
to archivematica
Hi Sarah,

Issue submitted.

Thank you.

Cheers,

Romain



Guedj Romain

Mar 30, 2022, 11:30:47 AM
to archiv...@googlegroups.com

Hi GC,

When the filenames in a manually normalized unzipped bag are cleaned before the transfer, the SIP is created successfully but manual normalization fails. Even though the filenames are sanitized, AM cannot find them as normalized files.

I have no issue when it is not a bag.

If you have successfully ingested a manually normalized unzipped bag, it would be great if you could share the structure and the metadata.csv and normalization.csv files.

Thank you in advance,

Cheers,
Best regards,

Romain Guedj, e-Archivist
Bibliothèque cantonale et universitaire BCU
