manual normalization using normalization.csv file


Hutchinson, Tim

Jun 15, 2014, 5:40:44 PM
to <archivematica@googlegroups.com>
I have been testing the new manual normalization workflow via the normalization.csv file, which provides the correct mapping for potentially conflicting filenames (e.g. image1.jpg and image1.tif). This is currently documented only in issue 4922.

Where I've hit a roadblock is for files in subdirectories. The original requirements for this issue indicated that the workflow should allow for subdirectories, so I'm wondering if this was not done or if I'm missing something in terms of how to structure the objects and csv.

To boil it down to a fairly simple example, take a CSV file like:

# original, access, preservation
Subfolder/File1.doc,manualNormalization/access/Subfolder/File1.pdf,manualNormalization/preservation/Subfolder/File1.docx
Subfolder/File1.PCD,manualNormalization/access/Subfolder/File1.jpg,manualNormalization/preservation/Subfolder/File1.tif

with corresponding file structure
manualNormalization
   normalization.csv
   access
     Subfolder
        File1.pdf
        File1.jpg
   preservation
     Subfolder
        File1.docx
        File1.tif
Subfolder
  File1.doc
  File1.PCD
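
For anyone reproducing this layout, here is a minimal sketch (Python 3, not Archivematica code) that parses a normalization.csv like the one above and reports any path it names that is missing from the transfer. The function names and the handling of the `#` header line as a comment are assumptions made for illustration only.

```python
import csv
import io


def read_normalization_csv(text):
    """Parse normalization.csv text into (original, access, preservation) rows.

    Lines starting with '#' are skipped as comments (an assumption based on
    the header line in the example CSV).
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields or fields[0].lstrip().startswith("#"):
            continue
        # Pad short rows so a missing access/preservation column becomes "".
        fields = [f.strip() for f in fields] + ["", ""]
        rows.append(tuple(fields[:3]))
    return rows


def missing_paths(rows, existing):
    """Return every non-empty path named in the CSV that is not in `existing`."""
    return [p for row in rows for p in row if p and p not in existing]


CSV_TEXT = """# original, access, preservation
Subfolder/File1.doc,manualNormalization/access/Subfolder/File1.pdf,manualNormalization/preservation/Subfolder/File1.docx
Subfolder/File1.PCD,manualNormalization/access/Subfolder/File1.jpg,manualNormalization/preservation/Subfolder/File1.tif
"""

# Paths relative to the objects directory, matching the tree above.
EXISTING = {
    "Subfolder/File1.doc",
    "Subfolder/File1.PCD",
    "manualNormalization/access/Subfolder/File1.pdf",
    "manualNormalization/access/Subfolder/File1.jpg",
    "manualNormalization/preservation/Subfolder/File1.docx",
    "manualNormalization/preservation/Subfolder/File1.tif",
}

rows = read_normalization_csv(CSV_TEXT)
print(missing_paths(rows, EXISTING))  # [] when the CSV and the tree agree
```

An empty list here only shows that the CSV and the directory tree are consistent with each other; it says nothing about how Archivematica itself resolves the entries.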

I get an error for each of File1.doc and File1.PCD:

File found: 4c9307a4-4d92-4ae6-904b-5e9f2f7a0dec %SIPDirectory%objects/SubfolderTwo/File1.doc

STDERR

Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 364, in <module>
    sys.exit(main(opts))
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 259, in main
    manually_normalized_file = check_manual_normalization(opts)
  File "/usr/lib/archivematica/MCPClient/clientScripts/normalize.py", line 138, in check_manual_normalization
    return File.objects.get(sip=opts.sip_uuid, currentlocation__startswith=path) #removedtime = 0
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 407, in get
    (self.model._meta.object_name, num))
main.models.MultipleObjectsReturned: get() returned more than one File -- it returned 2!

This seems to be the same error you get if there are filename conflicts and there is no csv file.
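
The traceback is consistent with a prefix lookup that matches both normalized files. The sketch below is not the normalize.py code: it stands in a plain list for the File table, and it assumes the lookup prefix is the normalized file's path with the extension removed, which is a guess based on the `currentlocation__startswith=path` call in the traceback.

```python
import os

# Hypothetical stand-in for the File rows of one SIP.
LOCATIONS = [
    "%SIPDirectory%objects/manualNormalization/preservation/Subfolder/File1.docx",
    "%SIPDirectory%objects/manualNormalization/preservation/Subfolder/File1.tif",
]


def get_by_prefix(locations, prefix):
    """Mimic a get()-style lookup: exactly one match, otherwise an error."""
    matches = [loc for loc in locations if loc.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError("expected 1 match, found %d" % len(matches))
    return matches[0]


# Assumption: the prefix is built from the original's path minus its extension.
original = "Subfolder/File1.doc"
prefix = (
    "%SIPDirectory%objects/manualNormalization/preservation/"
    + os.path.splitext(original)[0]
)

try:
    get_by_prefix(LOCATIONS, prefix)
except LookupError as exc:
    print(exc)  # expected 1 match, found 2
```

Under that assumption, File1.docx and File1.tif share the prefix `.../Subfolder/File1`, so a lookup that expects a single row cannot choose between them.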

I also tried changing the subdirectory structure to:

Subfolder
   Doc1.doc
   Doc1.PCD
   manualNormalization
      access
          Doc1.pdf
          Doc1.jpg
      preservation
         Doc1.docx
         Doc1.tif
      normalization.csv

but that didn't work at all: the manually normalized objects and the normalization.csv file were all processed as regular objects (no recognition of any manual normalization).

If there are no conflicts in filenames (e.g. Doc1.doc and Doc2.PCD), the first folder structure does seem to work, and I managed to get various combinations to work with the csv file when there is no subfolder.

Thanks
Tim

Sarah Romkey

Jun 18, 2014, 4:22:26 PM
to archiv...@googlegroups.com
Hi Tim (and all),

One of our developers did some investigation and she discovered that you did indeed uncover a bug: when the code for normalization.csv was written, we didn't anticipate the use case of files in subdirectories. She was able to write a patch, which we will endeavour to include in the 1.2 release. Word to the wise, though: this patch will "break" the old behaviour, in the sense that file paths will now have to be specified in normalization.csv if there are subdirectories present. We'll document this so our users will know. It seemed like the best compromise; I think it's better to be able to include subdirectories than not at all.

I haven't had a chance to do an actual test here to try to replicate the exact behaviour you described, but if I find anything different, I'll let you know.

Cheers,

Sarah

Hutchinson, Tim

Jun 19, 2014, 12:29:37 PM
to archiv...@googlegroups.com
Hi Sarah,

Thanks, that's great news that you were able to provide a patch so quickly. Can you clarify what you mean about needing the path? Would that be the path relative to the top level (as seems to be needed already, e.g. manualNormalization/access/file1.pdf), or the full server path at a given point in the process?

I also did some further testing of another variation. The issue ticket indicates that this wasn't part of the core requirements, but it appears that both the file names under manualNormalization and all the entries in normalization.csv need to be manually sanitized (I tested with spaces, but presumably this applies generally). In the original workflow, on the other hand (no csv file, and no conflicts in the filenames), manual normalization works without sanitizing first, so I had hoped that you would only need to sanitize the entries in the csv. More specifically:
1) normalized files unsanitized, no csv (no conflict in names) => works
2) normalized files unsanitized, csv supplied with no sanitization => fails
3) normalized files unsanitized, csv supplied with sanitization of columns 2 and 3 => fails
4) normalized files unsanitized, csv supplied with sanitization of all columns => fails
5) normalized files sanitized, csv supplied with sanitization of columns 2 and 3 => fails
6) normalized files sanitized, csv supplied with sanitization of all columns => works

So in this case at least there is a workaround, and the good news is that the original files don't have to be renamed.
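
To illustrate the workaround in case 6, here is a rough sketch of sanitizing every column of a CSV row. The replacement rule (anything other than letters, digits, and `._()-` becomes an underscore, with `/` kept as the path separator) is an assumption for illustration; the rule Archivematica actually applies may differ, so compare against the names it produces for the originals.

```python
import re

# Assumption: characters outside this set are replaced with "_".
# "/" is kept so that subdirectory separators survive.
_DISALLOWED = re.compile(r"[^A-Za-z0-9._()/-]")


def sanitize_path(path):
    """Replace disallowed characters in a relative path with underscores."""
    return _DISALLOWED.sub("_", path)


def sanitize_row(row):
    """Sanitize all columns of one normalization.csv row."""
    return [sanitize_path(field.strip()) for field in row]


row = [
    "Sub folder/My file.doc",
    "manualNormalization/access/Sub folder/My file.pdf",
    "manualNormalization/preservation/Sub folder/My file.docx",
]
print(sanitize_row(row))
```

The files under manualNormalization would need to be renamed the same way so that they match the sanitized entries in columns 2 and 3.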

Tim

Sarah Romkey

Jun 19, 2014, 7:47:53 PM
to archiv...@googlegroups.com
Hi Tim,

Sorry to be unclear: the normalization.csv file will need to show any subdirectories below objects. So if the file path is /objects/Subfolder1/file.txt, the entry in normalization.csv will need to be Subfolder1/file.txt.
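
As a small illustration of that rule, the sketch below derives the CSV entry from a path under the objects directory. The helper name `csv_entry_for` and the default `/objects` root are hypothetical, chosen only to mirror the example path in this message.

```python
import posixpath


def csv_entry_for(file_path, objects_dir="/objects"):
    """Return file_path relative to objects_dir, as it would appear in the CSV."""
    rel = posixpath.relpath(file_path, objects_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("%s is not under %s" % (file_path, objects_dir))
    return rel


print(csv_entry_for("/objects/Subfolder1/file.txt"))  # Subfolder1/file.txt
```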

Thank you for testing the sanitization workflow as well; this will be helpful for documentation purposes. It would be nice if you didn't have to manually sanitize the files, though. I think we may want to consider this a bug, albeit not an urgent one.

Thanks Tim!

-Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@accesstomemory / @ArchivesSarah

