I have been testing the new manual normalization workflow via the normalization.csv file, which provides the correct mapping for potentially conflicting filenames (e.g. image1.jpg and image1.tif) - currently just documented in issue 4922.
Where I've hit a roadblock is with files in subdirectories. The original requirements for this issue indicated that the workflow should allow for subdirectories, so I'm wondering whether this was not implemented or whether I'm missing something in how to structure the objects and CSV.
To boil it down to a fairly simple example, take a CSV file like:
# original, access, preservation
Subfolder/File1.doc,manualNormalization/access/Subfolder/File1.pdf,manualNormalization/preservation/Subfolder/File1.docx
Subfolder/File1.PCD,manualNormalization/access/Subfolder/File1.jpg,manualNormalization/preservation/Subfolder/File1.tif
with the corresponding file structure:
manualNormalization
    normalization.csv
    access
        Subfolder
            File1.pdf
            File1.jpg
    preservation
        Subfolder
            File1.docx
            File1.tif
Subfolder
    File1.doc
    File1.PCD
I get an error for each of File1.doc and File1.PCD:
File found: 4c9307a4-4d92-4ae6-904b-5e9f2f7a0dec %SIPDirectory%objects/SubfolderTwo/File1.doc
STDERR
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 364, in <module>
    sys.exit(main(opts))
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 259, in main
    manually_normalized_file = check_manual_normalization(opts)
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 138, in check_manual_normalization
    return File.objects.get(sip=opts.sip_uuid, currentlocation__startswith=path) #removedtime = 0
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 407, in get
    (self.model._meta.object_name, num))
main.models.MultipleObjectsReturned: get() returned more than one File -- it returned 2!
This seems to be the same error you get if there are filename conflicts and there is no csv file.
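My guess at the mechanism, as a minimal Python 3 sketch and not the actual normalize.py logic: if the `path` passed to `currentlocation__startswith` is the object's location with its extension stripped (an assumption on my part), then File1.doc and File1.PCD share the same prefix and a single-result lookup has to fail. The `locations` list and `find_by_prefix` helper are hypothetical stand-ins for the File table and Django's `get()`.

```python
import os

# Hypothetical stand-in for the File rows belonging to one SIP.
locations = [
    "%SIPDirectory%objects/Subfolder/File1.doc",
    "%SIPDirectory%objects/Subfolder/File1.PCD",
]


def find_by_prefix(prefix, rows):
    """Mimic a get(...__startswith=prefix) lookup: exactly one match or an error."""
    matches = [row for row in rows if row.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError("lookup returned %d rows for %r" % (len(matches), prefix))
    return matches[0]


# Assumption: the lookup prefix is the original path minus its extension.
prefix = os.path.splitext("%SIPDirectory%objects/Subfolder/File1.doc")[0]

try:
    find_by_prefix(prefix, locations)
except LookupError as err:
    # Both File1.doc and File1.PCD start with ".../Subfolder/File1".
    print(err)
```

If that is what happens, it would suggest the CSV row for the subdirectory file is not being matched and the script is falling back to the filename-prefix lookup.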
I also tried changing the subdirectory structure to:
Subfolder
    Doc1.doc
    Doc1.PCD
    manualNormalization
        access
            Doc1.pdf
            Doc1.jpg
        preservation
            Doc1.docx
            Doc1.tif
        normalization.csv
but that didn't work at all: the manually normalized objects and the normalization.csv file were all processed as regular objects, with no recognition of any manual normalization.
If there are no conflicts in filenames (e.g. Doc1.doc and Doc2.PCD), the first folder structure does seem to work, and I managed to get various combinations to work with the csv file when there is no subfolder.
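To rule out a simple path typo on my side, a check along these lines can confirm that every path named in the CSV exists relative to the objects directory. This is only a Python 3 sketch: `missing_paths` is a hypothetical helper, and treating the `#` line as a comment header is my assumption about the CSV format, not documented behaviour.

```python
import csv
import io
import os


def missing_paths(csv_text, objects_dir):
    """Return every path named in the CSV that is not a file under objects_dir."""
    missing = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and the "# original, access, preservation" header
        # (assumed here to be a comment line).
        if not row or row[0].lstrip().startswith("#"):
            continue
        for rel_path in row:
            rel_path = rel_path.strip()
            # Empty cells are allowed, e.g. a row with no access copy.
            if rel_path and not os.path.isfile(os.path.join(objects_dir, rel_path)):
                missing.append(rel_path)
    return missing


# Example call (paths are placeholders for a local transfer):
# with open("objects/manualNormalization/normalization.csv") as handle:
#     print(missing_paths(handle.read(), "objects"))
```

The comparison is only as case-sensitive as the underlying file system, so a mismatch such as File1.pcd versus File1.PCD may go unnoticed on some platforms.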
Thanks
Tim