Matching criteria CSV Import


r.ruti...@docuteam.ch

Mar 5, 2021, 6:06:40 PM
to AtoM Users
Dear AtoM Users, 

I experienced some problems when trying to update archival descriptions based on a CSV export. Depending on the import settings, the ADs were either created as new records or skipped.

So I did some deeper investigation. 
Could somebody here, either a user or someone from Artefactual, confirm my observations, please? 

The following Test Scenario was used:

1 - CSV creation
*****************
Name the file as you want. In this example, we will call it legacy-id-test.csv.

Content:

legacyId,parentId,qubitParentSlug,accessionNumber,identifier,accessionNumber,title,levelOfDescription
111222333,,,,,,legacyidtest1,
,,,,,,legacyidtest2,

The first row has a legacyId of 111222333, and the second row has a legacyId of <none>.


2 - Import
**********
-> Import the CSV via the GUI (/informationobject/add). After that, the two descriptions should appear at the "root" level in AtoM, since no parent or repository was provided in the CSV.
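
For reference, the same import could also be done from the command line. A rough sketch only, assuming a default install and that the file is in the current directory (all scenarios below use the GUI, though):

php symfony csv:import legacy-id-test.csv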


3 - Export to CSV
*****************
-> Add each of the two newly created items to the clipboard
-> Load the clipboard (/user/clipboard)
-> Click on "Export", check "include draft records", and click on "Export"
-> Download the CSV on the "Manage Jobs" page (/jobs/browse)

4 - Outcome
***********
-> Extract the downloaded ZIP file
-> Open the contained CSV file isad_0000000001.csv
-> It will look something like this:
referenceCode,legacyId,parentId,qubitParentSlug,identifier,accessionNumber,title,levelOfDescription,extentAndMedium,repository,archivalHistory,acquisition,scopeAndContent,appraisal,accruals,arrangement,accessConditions,reproductionConditions,language,script,languageNote,physicalCharacteristics,findingAids,locationOfOriginals,locationOfCopies,relatedUnitsOfDescription,publicationNote,digitalObjectURI,digitalObjectChecksum,generalNote,subjectAccessPoints,placeAccessPoints,nameAccessPoints,genreAccessPoints,descriptionIdentifier,institutionIdentifier,rules,descriptionStatus,levelOfDetail,revisionHistory,languageOfDescription,scriptOfDescription,sources,archivistNote,publicationStatus,physicalObjectName,physicalObjectLocation,physicalObjectType,alternativeIdentifiers,alternativeIdentifierLabels,eventDates,eventTypes,eventStartDates,eventEndDates,eventDescriptions,eventActors,eventActorHistories,eventPlaces,culture,nameAccessPoints,genreAccessPoints,descriptionIdentifier,institutionIdentifier,rules,descriptionStatus,levelOfDetail,revisionHistory,languageOfDescription,scriptOfDescription,sources,publicationStatus,digitalObjectURI,digitalObjectChecksum,generalNote,appraisal,physicalObjectName,physicalObjectLocation,physicalObjectType,alternativeIdentifiers,alternativeIdentifierLabels,eventDates,eventTypes,eventStartDates,eventEndDates,eventDescriptions,eventActors,eventActorHistories,eventPlaces,publicationNote,archivistNote,culture
,74297,,,,,legacyidtest1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Draft,,,,,,,,,,,,,,en,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,en
,74298,,,,,legacyidtest2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Draft,,,,,,,,,,,,,,en,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,en

-> The legacyID for legacyidtest1 is now 74297 (at import: 111222333)
   The legacyID for legacyidtest2 is now 74298 (at import: none)

Explanation from the Docs:
When CSV data is imported into AtoM, values in the legacyID column are stored in AtoM’s keymap table, in a column named source_id. A system administrator or developer can access this information, either via the command-line, or by using a graphical application such as phpMyAdmin to look up existing legacy ID values in the source_id column of the MySQL keymap table.

You cannot see the "original" legacyIDs in the GUI; you need to query the database:
mysql> SELECT source_name, source_id FROM keymap JOIN slug ON keymap.target_id=slug.object_id WHERE slug.slug='legacyidtest1';
+--------------------+-----------+
| source_name        | source_id |
+--------------------+-----------+
| legacy-id-test.csv | 111222333 |
+--------------------+-----------+
1 row in set (0.01 sec)

mysql> SELECT source_name, source_id FROM keymap JOIN slug ON keymap.target_id=slug.object_id WHERE slug.slug='legacyidtest2';
+--------------------+-----------+
| source_name        | source_id |
+--------------------+-----------+
| legacy-id-test.csv |           |
+--------------------+-----------+
1 row in set (0.00 sec)


Update archival description
***************************
Let's assume we want to update one or more archival descriptions by exporting them to CSV, making the changes in LibreOffice Calc, and then importing the CSV using the update behaviour "Update matches ignoring blank fields in CSV".


--- Scenario 1 - Change "title"

Let's change only the title:
legacyidtest1 --> legacyidtest_1
legacyidtest2 --> legacyidtest_2
We leave the rest unchanged.

-> Import in GUI as an update
-> Outcome: The two archival descriptions are not updated but created anew. We now have 4 items:
legacyidtest1, legacyidtest2, legacyidtest_1, legacyidtest_2

Why? Because no match was found.
FIRST, AtoM looks for an exact match on legacyID and source_name. The source_name is the name of the file that was originally used to import the archival description.
As we have seen, the legacyIDs in the export are not the same as the ones used in the original import.
SECOND, AtoM looks for an exact match on title, repository and identifier. 
As we have changed the title, this matching criterion no longer applies.
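
To illustrate the FIRST criterion, here is a minimal SQL sketch of the lookup that has to succeed for a row to match, built from the keymap/slug queries above (whether AtoM additionally filters on target_name is an assumption):

mysql> SELECT target_id
    ->   FROM keymap
    ->  WHERE source_id = '111222333'
    ->    AND source_name = 'legacy-id-test.csv'
    ->    AND target_name = 'information_object';

If this returns no row, e.g. because the re-exported CSV now carries legacyID 74297 instead of 111222333, or because the file was renamed, the FIRST criterion fails and AtoM falls back to the SECOND one.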


--- Scenario 2 - Change "extentAndMedium", but not "title"

In order to get a match on the SECOND criterion, we revert the titles to the original names "legacyidtest1" and "legacyidtest2", but add something to the field "extentAndMedium".
-> Import in GUI as update
-> Outcome: No match either. Maybe it's because we left "repository" and "identifier" blank in the original import? We'll check this later...


--- Scenario 3 - Change "extentAndMedium", change the legacyIDs back to the original values, and rename the file back to the original name "legacy-id-test.csv".

CSV will look like this:

referenceCode,legacyId,parentId,qubitParentSlug,identifier,accessionNumber,title,levelOfDescription,extentAndMedium,repository,archivalHistory,acquisition,scopeAndContent,appraisal,accruals,arrangement,accessConditions,reproductionConditions,language,script,languageNote,physicalCharacteristics,findingAids,locationOfOriginals,locationOfCopies,relatedUnitsOfDescription,publicationNote,digitalObjectURI,digitalObjectChecksum,generalNote,subjectAccessPoints,placeAccessPoints,nameAccessPoints,genreAccessPoints,descriptionIdentifier,institutionIdentifier,rules,descriptionStatus,levelOfDetail,revisionHistory,languageOfDescription,scriptOfDescription,sources,archivistNote,publicationStatus,physicalObjectName,physicalObjectLocation,physicalObjectType,alternativeIdentifiers,alternativeIdentifierLabels,eventDates,eventTypes,eventStartDates,eventEndDates,eventDescriptions,eventActors,eventActorHistories,eventPlaces,culture,nameAccessPoints,genreAccessPoints,descriptionIdentifier,institutionIdentifier,rules,descriptionStatus,levelOfDetail,revisionHistory,languageOfDescription,scriptOfDescription,sources,publicationStatus,digitalObjectURI,digitalObjectChecksum,generalNote,appraisal,physicalObjectName,physicalObjectLocation,physicalObjectType,alternativeIdentifiers,alternativeIdentifierLabels,eventDates,eventTypes,eventStartDates,eventEndDates,eventDescriptions,eventActors,eventActorHistories,eventPlaces,publicationNote,archivistNote,culture
,111222333,,,,,legacyidtest1,,zzz,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Draft,,,,,,,,,,,,,,en,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,en
,,,,,,legacyidtest2,,yyy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Draft,,,,,,,,,,,,,,en,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,en

-> Import in GUI
-> Outcome: Partial success!
legacyidtest1 was updated with the "extentAndMedium" value!
legacyidtest2 was not updated but created anew, because in the original import we did not provide a legacyId for it at all. Makes sense.


--- Scenario 4 - Start all over, this time also providing the identifier and repository.

-> delete all legacyidtest* archival descriptions
-> create CSV, let's name it legacy-id-test-identifier.csv. Content:
legacyId,parentId,qubitParentSlug,accessionNumber,identifier,accessionNumber,title,levelOfDescription,extentAndMedium,repository
,,,,lidt10,,legacyidtest10,,zzz,repo
,,,,lidt20,,legacyidtest20,,yyy,repo


The identifiers are "lidt10" and "lidt20" and the titles are "legacyidtest10" and "legacyidtest20". We also add content to the columns "repository" and "extentAndMedium".

-> Import as new in GUI
-> change only the column "extentAndMedium" in legacy-id-test-identifier.csv (this is the same file we used before, not a download from the clipboard)
legacyId,parentId,qubitParentSlug,accessionNumber,identifier,accessionNumber,title,levelOfDescription,extentAndMedium,repository
,,,,lidt10,,legacyidtest10,,zzz222,repo
,,,,lidt20,,legacyidtest20,,yyy222,repo
-> Import as update in GUI
-> Outcome: 
We still get new archival descriptions; they are NOT updated as expected. It seems like the SECOND criterion in https://www.accesstomemory.org/en/docs/2.6/user-manual/import-export/csv-import/#matching-criteria-for-archival-descriptions does not work!


--- Scenario 5 - create archival description in the GUI

-> create an AD (/informationobject/add) by specifying only an identifier ("identifier_1") and a title ("title_1")
-> export this AD as CSV via the clipboard. You'll get isad_0000000001.csv
-> change CSV by adding "extentAndMedium" ("abc")
-> import isad_0000000001.csv as update
-> Outcome: The description with slug "title-1" is not updated; a new one is created with slug "title-1-2"

Why?
- Don't know!
- The FIRST criterion does not match because there is no legacyID in the keymap table.
Check:
mysql> SELECT source_name, source_id FROM keymap JOIN slug ON keymap.target_id=slug.object_id WHERE slug.slug='title-1';
Empty set (0.00 sec)
- SECOND criterion: Don't know why it doesn't match. O.K., "repository" is not set, but there was no "repository" set in the first place either when the description was created in the GUI.


Conclusion
**********
If we want to bulk update archival descriptions based on a CSV export, we have to be very careful.
- If the original archival description was created by CSV import and a legacyID was given, the FIRST matching criterion (legacyID, source_name) is used, but we have to use that same legacyID (not the one given in a CSV export of the archival description) and also the same source_name.
What if in the meantime other ADs were added to the parent, so that there are ADs not present in the original CSV used for the import? What if we don't have the original CSV with the original legacyIDs anymore? Does one then have to run an SQL query first (see the sketch after this list)?
- If the original archival description was created by CSV import and no legacyID was given, then the SECOND matching criterion (title, repository, and identifier) SHOULD be used, but it is not (bug?).
- If the original archival description was created in the GUI, an update via CSV import does not seem to be possible at all. 
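
As a sketch of such a query (using only the keymap and slug tables already shown above), one could dump the original legacyIDs and slugs of everything imported from a given file:

mysql> SELECT k.source_id, s.slug
    ->   FROM keymap k
    ->   JOIN slug s ON k.target_id = s.object_id
    ->  WHERE k.source_name = 'legacy-id-test.csv';

Rows with an empty source_id were imported without a legacyID and would still depend on the SECOND criterion.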

I am wondering whether you come to the same conclusions as I do. 
Your help and clarification is as always much appreciated!

Best, 
Roger

r.ruti...@docuteam.ch

Mar 5, 2021, 6:07:16 PM
to AtoM Users
Tested with AtoM 2.6.2 by the way...

Dan Gillean

Mar 8, 2021, 10:17:56 AM
to ICA-AtoM Users
Hi Roger, 

A couple of responses to your questions and observations below: 

First: yes, the design of the matching logic is rather poor for general purpose use - it was designed on a limited budget and intended to help with regular updates from one AtoM site to another, such as a regional or national portal, and not for roundtripping updates in a single system. However, this latter case is how most people actually want to use the feature! If you'd like more background on how the matching currently works (and doesn't) and why, there are some older threads in the forum where I've added some context. See for example: 
The second thread above also provides some basic information on a much better solution, if you have access to the command line. The CSV import CLI task has a --roundtrip option that ignores all other matching criteria and instead looks only for an exact match on object ID - i.e. what is included in the legacyID column when you perform an export. See: 
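
For illustration, an invocation could look roughly like this (a sketch only: the file path is an example, and the --update value is an assumption based on the task's match-and-update behaviour; check php symfony help csv:import on your install):

php symfony csv:import --update="match-and-update" --roundtrip /path/to/isad_0000000001.csv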
Also, bonus tip: though it doesn't help with getting the legacyID value, the user interface can tell you the sourcename that was used, if you enter edit mode on a description that was imported and take a look in the Administration area:



Second: you've correctly determined that there is a bug in current 2.6.x releases for the fallback match criteria (title, repository, identifier). This regression was caused by some of the import optimizations we added in 2.6.0, and will be addressed in the upcoming 2.6.3 release. Related issue ticket: 
There are links to related PRs and commits on the issue ticket, as well as more context, if you're curious. 

Long term, I would very much like to see user interface support for the --roundtrip option, handled via a job. I think this would solve many of the frustrations users encounter with this task, and since the base CLI task has already been created, I'm hopeful that the development effort / cost of adding UI support is not too high. However, we at Artefactual haven't yet had the time to implement it ourselves - we're focused on bug fixes, release planning, and other optimizations that will help our users at present, and as such, this feature enhancement may need community support (either in the form of community code contributions or development sponsorship) for us to prioritize. For anyone wanting to learn more about how we develop and maintain AtoM currently, please see: 
One of the quality of life improvements we're currently working on involves adding major enhancements to the csv:check-import task, which was created over a decade ago and has been largely ignored since. We hope to make this a much more useful task for ensuring your CSV import of descriptions does not contain the kind of errors that can lead to data corruption, dozens of blank descriptions, and more. While this won't directly help with import updates, it should be generally useful overall for sense-checking description CSVs before importing. See: 
While this is currently being developed as an enhancement of an existing command-line task, we are also in discussion with a potential sponsor about adding user interface support for the description CSV import check, which would greatly increase the usability of this feature, especially for users relying on third-party hosting who may not have access to the command-line. No promises yet, but we're hopeful this will be included in 2.7! 
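
In its current form the task is run from the AtoM root against a description CSV, roughly like this (a sketch; any extra options and the exact output are assumptions - see php symfony help csv:check-import):

php symfony csv:check-import /path/to/legacy-id-test.csv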

2.6.3 should be out this month! 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him



r.ruti...@docuteam.ch

Mar 10, 2021, 2:15:17 PM
to AtoM Users
Hi Dan
Thank you very much for your extensive reply. I managed to do the update with the --roundtrip parameter. I had tried it earlier, too, but it did not work: nothing happened in the console. I found out today that I have to remove the --quiet parameter when using --roundtrip. Now it works.
Looking forward to 2.6.3 :-)
Cheers, Roger
