You're right that this is confusing - the initial implementation of this feature was not designed to support the way that most people actually want to use it (roundtripping descriptions from their own system to import bulk updates), which is part of why things are confusing right now. I have tried to make the 2.6 documentation more explicit on how things work - please see:
Essentially, this feature was first sponsored as a way for the maintainers of portal sites / union catalogs to be able to regularly receive updates from participating institutions using their own AtoM instances. The idea was that the initial metadata in the portal site comes from an AtoM CSV import, as do subsequent updates.
For the TL;DR and what you can try now, you can jump down to the workarounds section below.
In the meantime, since the use of the legacyID field, and the update import matching logic in general, can be somewhat confusing, I'll try to provide a bit of context below that might help you better understand where we're at currently, why matching is hard, and how we got here.
The current matching logic for archival description imports in AtoM
This is described in our documentation here:
- First AtoM will look for an exact match in the legacyID and the source_name value
- If no match is found on legacyID and source_name, then AtoM will look for an exact match on title, repository, and identifier.
- If there is no exact match on all three of these values, then the record is considered not to have a match. Depending on the options chosen during import, it will either be imported as new, or be skipped and reported on the Job details page
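The matching cascade above can be sketched in pseudocode terms. This is a minimal illustration, not AtoM's actual implementation - the field names and the in-memory stand-ins for the database tables are assumptions for the sake of the example:

```python
def find_match(row, keymap, descriptions, skip_unmatched=False):
    """Sketch of AtoM's update-import matching cascade.

    `keymap` maps (legacyID, source name) pairs to existing records;
    `descriptions` is a list of existing records with title, repository,
    and identifier fields. Both are illustrative stand-ins for the database.
    """
    # 1. Exact match on legacyID + source name (via the keymap table)
    key = (row["legacyId"], row["source_name"])
    if key in keymap:
        return keymap[key]

    # 2. Fallback: exact match on title AND repository AND identifier
    for record in descriptions:
        if (record["title"] == row["title"]
                and record["repository"] == row["repository"]
                and record["identifier"] == row["identifier"]):
            return record

    # 3. No match: either skip (and report it), or import as a new
    #    record, depending on the options chosen during import
    return None if skip_unmatched else "import-as-new"
```

The key point is the ordering: the keymap lookup always wins, and the title/repository/identifier comparison only runs when that lookup fails.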
The legacyID and source name values are described more in this section of the documentation:
However, the purpose and rationale of how this works may not be immediately apparent to users, so I will try to elaborate.
The history and purpose of the legacyID field and the keymap table in the database
Originally, the legacyID field was added to CSV imports to help Artefactual perform migrations from other systems. Since many users were migrating from custom systems that did not follow archival standards (and therefore each database schema was different, and some might have custom fields, etc), we needed a way to uniquely identify all records from the source system to be able to audit the results of a data migration. Relational databases almost always assign a unique, system-generated ID value to records stored in the database - in AtoM we call these object IDs. Our plan was to use these to uniquely identify records during a migration - they would be stored in AtoM in a database table called the keymap table as the legacyID - i.e. the unique ID from the legacy system. The source name was also stored - by default this will use the name of the CSV file, but the command-line task (which we typically use for bulk CSV data migration projects) also has a --source-name parameter that allows you to specify any source name you'd like.
AtoM uses these object IDs on export as well. When you export a CSV of descriptions from AtoM, the legacyID column in the resulting CSV is actually populated with the object ID from AtoM's MySQL database for each description. This way, if you are ever migrating FROM AtoM to a different system, you also have a unique ID to work with, in case your titles, identifiers, etc. are not unique across all of your metadata.
The original design of the matching logic during update imports
As I mentioned, the original intention of the update import development project was to allow portal site maintainers to receive updates from regional users via CSV import.
During an initial import into AtoM, the legacyID value and the source name value are stored in the keymap table. The idea was that these two values combined could be used to match records during future update imports. You could do an original import of new metadata from a contributing institution and give it a source name of "institutionA-descriptions", for example. Contributing institutions would give each row in a CSV a unique legacyID value - but since there might be multiple contributors, the legacyID values might not be unique across contributors (i.e. two different institutions might give the portal site maintainers CSVs that both use 00001 as the legacyID value in a given row) - so the added unique source name helps differentiate them.
Then, during a future import, the contributing institutions would use the same legacyID values in an update CSV, and the portal maintainers would use the same source name value when running the update import. As a fallback, we have the exact match on title, identifier, and repository name, as well as the "skip unmatched" option to prevent accidentally creating new descriptions when no match is found.
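A toy example (illustrative data only) shows why the composite key matters: two contributors can both use 00001 as a legacyID, and the source name is what keeps their records distinct in the keymap:

```python
# Keymap entries keyed by (legacyID, source name) - each pair is unique
# even though the legacyID values alone collide across contributors.
keymap = {
    ("00001", "institutionA-descriptions"): "Fonds X at Institution A",
    ("00001", "institutionB-descriptions"): "Fonds Y at Institution B",
}

# An update row from institution B resolves to the correct record only
# because the source name is part of the lookup key.
print(keymap[("00001", "institutionB-descriptions")])
```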
The problems with this approach
We've learned a lot since this was first implemented, and I think we'd make different choices given a chance to completely redesign the feature. As you've noted in your post, there are a couple of issues with how the matching logic was implemented.
The first of these is that even for the approach to work in the original use case (portal site updates), there's still a high degree of planning and coordination required. It requires:
- That the original metadata in the portal site has been created via CSV import, that came from a CSV export from another AtoM system
- If it was not an AtoM system, then it would require that the contributing institutions have maintained a copy of the original import CSVs, so they can make sure they use the same legacyID values in the future
- That the portal site administrators have maintained a list of the source names used during import, so they can be reused during future updates
- That contributors do not mix new records and updates in a single CSV
- That portal site administrators have access to the command-line for imports, or else rename the CSV every time to match the source name.
In many cases, we have found that this degree of advanced planning, coordination, and collaboration does not occur.
Additionally, as your post points out, the actual use case that most AtoM users want the CSV update imports for is performing bulk updates in their own system. However, the current matching logic makes this difficult:
- Many users create their descriptions via the user interface, meaning they do not have stored legacyID and source name values
- Even if they were created via CSV import, on export AtoM does NOT use the legacyID, but rather the internal database object ID to populate the legacyID column
- Consequently, users would need to keep a copy of the original import CSV, and then manually change the legacyID column values in the export from the object IDs back to the original legacyID values used during the first import.
- The fallback logic also means that titles, identifiers, and repository names cannot be updated via an import.
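To make the roundtrip problem concrete, here's a small illustration (made-up values) of why a straight export/re-import fails to match: the legacyID column in an AtoM export holds the internal object ID, not the legacyID that the keymap remembers from the original import:

```python
# What the keymap remembers from the original import
# (illustrative values): (legacyID, source name) -> object ID
keymap = {("00001", "institutionA-descriptions"): 12345}

# What an AtoM CSV export writes into the legacyID column: the object ID
exported_row = {"legacyId": "12345"}

# Re-importing that export: the lookup key is built from the exported
# legacyID, which was never a key in the keymap, so there is no match
key = (exported_row["legacyId"], "institutionA-descriptions")
print(key in keymap)  # -> False
```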
So... if you want to import updates into your own system, what can be done?
Workarounds for local roundtrip updates
The easiest method of performing local roundtrip updates is via the command-line CSV import, where we've recently added a new --roundtrip option. When used, the current matching logic is completely bypassed: instead, AtoM looks only for EXACT matches between the values in the import CSV's legacyID column and its internal object IDs. See:
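Conceptually, the roundtrip match replaces the whole cascade with a single lookup on the internal object ID. Again, this is a sketch under assumed names, not the real implementation:

```python
def find_roundtrip_match(row, objects_by_id):
    """Match only on AtoM's internal object ID, which is what an AtoM
    CSV export writes into the legacyID column. `objects_by_id` is an
    illustrative stand-in for the object table in the database."""
    # Exact match on object ID, or None - no title/identifier fallback
    return objects_by_id.get(row["legacyId"])
```

Because exports already populate legacyID with the object ID, a CSV exported from the same AtoM instance matches its own records directly, with no keymap bookkeeping required.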
It's important that you still review the update import overview documentation closely - not all fields support updates. Some will only allow you to append new information, rather than overwrite existing metadata. See:
In general, I'd suggest that your workflow be something like this:
- Use the clipboard to export records you want to update. Remember you can just add top levels to the clipboard and use the "include descendants" option - you don't need to individually add every description in a hierarchy
- Open the CSV in a spreadsheet application. We strongly recommend using something like LibreOffice Calc over Microsoft Excel, since Microsoft typically uses its own custom line endings and character encoding that can cause issues on import (more information here)
- Delete any columns that you don't need, to reduce the possibility of accidental updates, overwrites, or appends. If you are using the roundtrip option, you just need to make sure that the legacyID column is still included, plus the columns with metadata you wish to update
- Make your updates in the relevant rows and save
- Use the command-line CSV import task, with the --roundtrip option
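For the column-pruning step, a small script can help when the export is large. This is a sketch: the column names to keep are whatever your update actually needs, and the legacyID column must be among them for the roundtrip match to work:

```python
import csv

def prune_columns(in_path, out_path, keep):
    """Copy a CSV, keeping only the named columns.

    `keep` should include "legacyId" so the roundtrip match still works;
    dropping every column you don't intend to change reduces the risk of
    accidental updates, overwrites, or appends.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Preserve the original column order while filtering
        columns = [c for c in reader.fieldnames if c in keep]
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in columns})
```

For example, `prune_columns("export.csv", "update.csv", {"legacyId", "scopeAndContent"})` would keep just those two columns from the clipboard export before you make your edits.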
It's still far from ideal, but I hope this might help at least!
Future possible work
While it will be a much bigger project to completely rewrite how update imports and the matching logic works, one thing we could do in the future is add support for the --roundtrip option to the user interface. This work is not currently sponsored or slated for inclusion in an upcoming release, but my hope is that it would not require too much work to implement, since the CLI option already exists.
We could potentially also find a way to include the original legacyID values used during an initial import (if they exist) in AtoM CSV exports, to make it easier to discover what these are. Right now, you'd need to either just know, or else have access to the database to make queries. With something like an originalLegacyId column on export, this could be ignored by default in imports, but would make it easier for users to not only see the values, but also change the column name to legacyId if desired. We could also add support for defining a source name in the user interface during CSV imports. Both of these enhancements would help when performing updates from one system to another.