Hi Holly,
It is confusing - the initial implementation of this feature was not designed to support the way that most people actually want to use it (by roundtripping descriptions from their own system to import bulk updates), which is part of why things are confusing right now. I have tried to make the 2.6 documentation more explicit on how things work - please see:
Essentially, this feature was first sponsored as a way for the maintainers of portal sites / union catalogs to be able to regularly receive updates from participating institutions using their own AtoM instances. The idea being that the initial metadata in the portal site comes from an AtoM CSV import, as do subsequent updates.
For the TL;DR and what you can try now, you can jump down to the workarounds section below.
In the meantime, since the use of the legacyID field, and the update import matching logic in general, can be somewhat confusing, I'll try to provide a bit of context below that might help you better understand where we're at currently, why matching is hard, and how we got here.
The current matching logic for archival description imports in AtoM
This is described in our documentation here:
Essentially:
- First AtoM will look for an exact match in the legacyID and the source_name value
- If no match is found on legacyID and source_name, then AtoM will look for an exact match on title, repository, and identifier.
- If there is no exact match on all of these 3 values, then the record is considered not to have a match. Depending on user settings during import, it will either import as new, or be skipped and reported in the Job details page
The legacyID and source name values are described more in this section of the documentation:
However, the purpose and rationale of how this works may not be immediately apparent to users, so I will try to elaborate.
The history and purpose of the legacyID field and the keymap table in the database
Originally, the legacyID field was added to CSV imports to help Artefactual perform migrations from other systems. Since many users were migrating from custom systems that did not follow archival standards (and therefore each database schema was different, and some might have custom fields, etc), we needed a way to uniquely identify all records from the source system to be able to audit the results of a data migration. Relational databases almost always assign a unique, system-generated ID value to records stored in the database - in AtoM we call these object IDs. Our plan was to use these to uniquely identify records during a migration - they would be stored in AtoM in a database table called the keymap table as the legacyID - i.e. the unique ID from the legacy system. The source name was also stored - by default this will use the name of the CSV file, but the command-line task (which we typically use for bulk CSV data migration projects) also has a --source-name parameter that allows you to specify any source name you'd like.
AtoM behaves the same way on export. When you export a CSV of descriptions from AtoM, the legacyID column in the resulting CSV is actually populated with the object ID from AtoM's MySQL database for that object. This way, if you are ever migrating FROM AtoM to a different system, you have a unique ID to work with as well, in case your titles, identifiers, etc are not unique across all metadata.
The original design of the matching logic during update imports
As I mentioned, the original first intention of the update import development project was to allow portal site maintainers to receive updates from regional users via CSV import.
During an initial import into AtoM, the legacyID value and the source name value will be stored in the keymap table. The idea was that these two values combined could be used to provide future import updates. You could do an original import of new metadata from a contributing institution and give it a source name of "institutionA-descriptions" for example. Contributing institutions would give each row in a CSV a unique legacyID value - but since there might be multiple contributors, the legacyID values might not be unique across contributors (i.e. two different institutions might give the portal site maintainers CSVs that both use 00001 for a legacyID value in a given row) - so the added unique source name helps differentiate them.
Then, during a future import, the contributing institutions would use the same legacyID values in an update CSV, and the portal maintainers would use the same sourcename value when running the update import. As a fallback, we have the exact match on title, identifier, and repository name, as well as the "skip unmatched" option to prevent accidentally creating new descriptions when no match is found.
The problems with this approach
We've learned a lot since this was first implemented, and I think we'd make different choices given a chance to completely redesign the feature. As you've noted in your post, there are a couple of issues with how the matching logic was implemented.
The first of these is that even for the approach to work in the original use case (portal site updates), there's still a high degree of planning and coordination required. It requires:
- That the original metadata in the portal site has been created via CSV import, that came from a CSV export from another AtoM system
- If it was not an AtoM system, then it would require that the contributing institutions have maintained a copy of the original import CSVs, so they can make sure they use the same legacyID values in the future
- That the portal site administrators have maintained a list of the source names used during import, so they can be reused during future updates
- That contributors do not mix new records and updates in a single CSV
- That portal site administrators have access to the command-line for imports, or else rename the CSV every time to match the sourcename.
In many cases, we have found that this degree of advanced planning, coordination, and collaboration does not occur.
Additionally, as your email points out, the actual use case that most AtoM users want the CSV update imports for is performing bulk updates in their own system. However, the current matching logic makes this difficult and unlikely:
- Many users create their descriptions via the user interface, meaning they do not have a stored legacyID and sourcename value
- Even if they were created via CSV import, on export AtoM does NOT use the legacyID, but rather the internal database object ID to populate the legacyID column
- Consequently, users would need to have a copy of the original import CSV, and then would manually need to update the legacyID column values from the object ID used on export to the original legacyID values used during the first import.
- The fallback logic also means that titles, identifiers, and repository names cannot be updated via an import.
So.... if you want to import updates in your own system, what can be done?
Workarounds for local roundtrip updates
The easiest method of performing local roundtrip updates is via the command-line CSV import. There, we've recently added a new --roundtrip import option. When used, the current matching logic is completely bypassed. Instead, the matching logic looks only for EXACT matches on the internal object ID value in the import CSV's legacyID column. See:
It's important that you still review the update import overview documentation closely - not all fields support updates. Some will only allow for you to append new information, rather than overwrite existing metadata. See:
In general, I'd suggest that your workflow be something like this:
- Use the clipboard to export records you want to update. Remember you can just add top levels to the clipboard and use the "include descendants" option - you don't need to individually add every description in a hierarchy
- Open the CSV in a spreadsheet application. We strongly recommend using something like LibreOffice Calc over Microsoft Excel, since Microsoft typically uses its own custom line endings and character encoding that can cause issues on import (more information here)
- Delete any columns that you don't need, to reduce the possibility of accidental updates, overwrites, or appends. If you are using the roundtrip option, you just need to make sure that the legacyID column is still included, plus the columns with metadata you wish to update
- Make your updates in the relevant rows and save
- Use the command-line CSV import task, with the --roundtrip option
It's still far from ideal, but I hope this might help at least!
Future possible work
While it will be a much bigger project to completely rewrite how update imports and the matching logic works, one thing we could do in the future is add support for the --roundtrip option to the user interface. This work is not currently sponsored or slated for inclusion in an upcoming release, but my hope is that it would not require too much work to implement, since the CLI option already exists.
We could potentially also find a way to include the original legacyID values used during an initial import (if they exist) in AtoM CSV exports, to make it easier to discover what these are. Right now, you'd need to either just know, or else have access to the database to
make queries. With something like an
originalLegacyId column on export, this could be ignored by default in imports, but would also make it easier for users to not only see the values, but change the column name to
legacyId if desired. We could also add support for defining a source name in the user interface during CSV imports. Both of these enhancements would help when performing updates from one system to another.
Cheers,