LegacyID and interface-created descriptions

jody.robi...@gmail.com

Aug 4, 2020, 11:47:13 AM
to AtoM Users
Hello,
I am wondering whether an archival description has a legacyID if it was created through the user interface rather than by a CSV import. If it does have a legacyID, where would I be able to see what it is?

We are investigating whether we can update descriptions via CSV import, but these are descriptions that were initially created via the user interface, not by an import. Or maybe it would be possible through an EAD import?

Thanks, in advance, for any info on this!
Jody Robinson

Dan Gillean

Aug 5, 2020, 5:09:57 PM
to ICA-AtoM Users
Hi Jody, 

In addition to trying to answer your direct questions, I'll try to provide a bit more context about the feature that might help you identify additional workarounds and strategies. This topic comes up frequently, so I'm going to give a much longer response than you're probably expecting, so that I can refer back to it in the future. 

First, to answer the direct questions: 

If a description was created via the user interface, it does not have a legacyID. There's nowhere in the user interface where you can see a legacyID associated with a description, at present. However, there may still be ways that you can update your descriptions created via the user interface with a CSV update import. In fact, you'll probably have better luck with this than you would with an EAD XML import. 

I'll give some background first, and then proceed to outline options for how you might proceed. 

Background on legacyIDs, and the import update feature

The concept of a legacyID was introduced early on, probably around ICA-AtoM 1.1 or 1.2. Artefactual was seeing increased interest from users migrating to AtoM from legacy systems, and we were undertaking a lot more migration projects as a result.

Carrying out a migration involves a whole mapping stage, in which you work out which fields in your legacy data correspond to those in the target system, and make decisions about where to split or join data elements when there's no 1:1 mapping. One of the more complex pieces of this was maintaining hierarchical relationships (such as files to a series, or series to a collection) from the source to the target system.

To keep track of what was going where and which record was a parent or child of another, while still being able to associate each record imported into AtoM with its original source record, Artefactual developers introduced the concept of the legacyID. Essentially, it captures a unique identifier (typically the database's unique record ID) from the source system, and these were ordered to help capture hierarchical relationships during import. Post-import, we could compare the results against the original source system's hierarchies, because each record imported into AtoM still retained a unique identifier from the source system - though there was no need to show this to end users once the migration was complete. 

AtoM still does something very similar on its own: the database assigns a unique object ID to each record on creation. But since this is not meaningful to the end user (and is different from an archivist-assigned identifier or reference code), you don't typically see these in the user interface either. 

This is why:
  • The legacyID can't be seen in the AtoM user interface
  • Records created via the user interface do not have a legacyID
  • On export, the data populating the legacyID column is not the same legacyID you might have imported the record with
This third point deserves a bit more attention. On export, AtoM puts the objectID (i.e. the unique identifier assigned by AtoM's database to every record) in the legacyID column, and uses this to represent hierarchical relationships (via the parentID column as well). 
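
For example, the first columns of an exported CSV might look something like this (a hypothetical sketch - the numeric values stand in for AtoM-assigned object IDs, and exact header spellings should be checked against the CSV templates in the documentation):

    legacyID,parentID,identifier,title
    12345,,F1,Example fonds
    12346,12345,F1-S1,First series
    12347,12346,F1-S1-01,A file in the first series

Here the parentID of each child row points at the legacyID (really the objectID) of its parent row, which is how the hierarchy is preserved on export.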

Why does it do this? Why not use the legacyID if one is available? First, because your legacyID values might not be unique. If you import a CSV and use 1, 2, 3, 4, etc. as your legacyID values, and then next week someone else does the same with a different import, then an export that contained both hierarchies (say, a command-line export of all records, or a clipboard export of multiple fonds) might not be something you could re-import, as those overlapping ID values would break the hierarchy in unexpected ways. AtoM's object ID values, by contrast, are guaranteed to be unique system-wide, since the database is the system of record and ensures uniqueness. 

Second, as previously noted, not all records will have a legacyID value. We don't want to use a mix of legacyID values where they exist and objectIDs where they don't, because a) users would not be able to tell which is which, and b) there's always the chance that a legacyID from a previous import will have the same value as an AtoM object ID - in which case you'd once again get conflicts when trying to re-import. 

Finally, by using the object ID, AtoM follows the same pattern that Artefactual developers used when dealing with migrations from other legacy systems into AtoM. If you are exporting records because you want to migrate to a different system, you need some kind of unique value so you can audit your work against what AtoM originally held as you import into the target system. Using the object ID gives you a unique code for each record to work with when checking the outcome of your migration. 

LegacyID vs ObjectID and AtoM's database

In AtoM's data model, there's an object table, and almost every major entity type in AtoM will be connected to this table. This is where the objectID of a record is stored - the rest of the record metadata will be in other tables connected via foreign keys (so for descriptions, the information_object and information_object_i18n tables will have most of the rest of the data). As noted above, every record in AtoM will have an objectID. For descriptions, there are two ways to access this information. 

The first, for the more technically minded, requires access to the MySQL database, but we have a query in our docs that will return the object ID of a record based on the slug input in the query. A similar query can be used when you know the title or authorized name of a record, but not the slug.
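
For reference, here is a sketch of what those lookups might look like (assuming the slug and information_object_i18n tables as found in AtoM's schema; the slug and title values are hypothetical):

    -- Look up a record's object ID from its slug
    SELECT object_id FROM slug WHERE slug = 'example-fonds';

    -- Or, when you know a description's title but not its slug
    -- (may return multiple rows if the title is not unique;
    -- the id returned here is the same value as the object ID)
    SELECT id FROM information_object_i18n
    WHERE title = 'Example fonds' AND culture = 'en';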

For descriptions, the other, easier way doesn't involve CLI access: add the record to the clipboard and export it. Remember, on export AtoM will add the objectID value to the legacyID column - so you can open your CSV, find the right row, and look at the legacyID value in that row to see what AtoM's internal objectID value is for the record. 

For actual legacyID values originating from an import, AtoM stores these in a separate table called keymap. As our documentation states here and here, you can use the legacyID and parentID columns to manage the creation of hierarchical relationships in a CSV you intend to import. On import, the legacyID value is written to the keymap table in the source_id column, along with the source_name - if no source name is provided by the user on import (there's an option to add one in the CLI import commands, but not for imports via the user interface or from Archivematica), then the CSV's filename is used as a default.
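
To make that concrete: below is a minimal, hypothetical CSV sketch showing how the legacyID and parentID columns express a hierarchy on import, followed by a query you could run afterwards to inspect what was written to the keymap table (a sketch, assuming the keymap columns described above and an import file named example.csv):

    legacyID,parentID,identifier,title
    1,,F1,Example fonds
    2,1,F1-S1,First series
    3,2,F1-S1-01,A file in the first series

    -- Inspect the keymap rows created by importing example.csv
    SELECT source_id, source_name, target_id, target_name
    FROM keymap
    WHERE source_name = 'example.csv';

The target_id values returned are the objectIDs that AtoM assigned to the newly created descriptions.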

The sourcename will be visible in the user interface if one exists, though you'll have to enter edit mode on a record to see it. It can be found in the Administration area:

[screenshot: the Administration area of a description's edit page, showing the source name]

If you're curious to learn more about AtoM's data model, we make copies of the AtoM database entity relationship model diagrams available on the wiki, here:

AtoM's import update functionality and record matching logic

The ability to import a CSV as an update to existing records was added much later - first in the 2.4 release. At the time, the purpose of the module was to help users update records being maintained in other systems - for example, a separate regional or national AtoM portal that had a copy of your data. Importantly, this meant the focus was on exports from one system and imports into another - NOT on roundtripping descriptions in a single system (i.e. exporting records, making updates in the CSV, and then re-importing them into the same system). The budget was tight, time was limited, and as a consequence the use case for the development was very narrow in scope. Unfortunately, it seems there is far more interest in roundtripping descriptions than in using this feature to update records in other AtoM instances - but we've not yet had anyone sponsor work to help us improve and address this. 

So with that in mind, how does it work currently?

On initial import, AtoM will write the legacyID value (as source_id) and the source name (either user-input via a CLI option on the import task, or else defaulting to the filename; stored as source_name) to the keymap table for future reference. 

During an update import, AtoM has a cascading set of criteria it uses to find candidate matches for updating. First, it will look for an EXACT match in both source_id (compared against the legacyID value in the incoming CSV) and source_name (compared against any user-input source-name entered in the command-line, or else the incoming CSV filename). If an exact match is found, then AtoM will proceed to update the target record with the corresponding CSV row. 
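
In database terms, that first stage is roughly equivalent to a lookup like this (a sketch, assuming the keymap schema described earlier; the values are hypothetical):

    -- Stage 1 matching, approximately: exact match on source_id AND source_name
    SELECT target_id
    FROM keymap
    WHERE source_id = '42'              -- legacyID value from the incoming CSV row
      AND source_name = 'example.csv'   -- user-supplied source name, or the CSV filename
      AND target_name = 'information_object';

If this returns a row, the description with that target_id (objectID) is the record that gets updated.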

If no match is found, then AtoM falls back to its secondary matching criteria: an exact match on the title, identifier, and repository values in the CSV against those of the candidate match. There are also some separate rules for matching, and potentially updating, related authority records - those are outlined in our documentation here: 
This set of secondary criteria implies a couple things: 
  • It is possible to find matches against records that were not originally imported - it's just harder to do so
  • There are 3 criteria (title, identifier, repository name) because none of these on their own are guaranteed to be unique
  • Because all 3 of these might not be unique, there are some risks - AtoM will make the update against the first record it encounters that meets all of these criteria, so if you have 5 records in your system that all have the same identifier (e.g. F1), title (e.g. "Correspondence"), and repository (e.g. "Example Archives"), then you can't guarantee that the right one will be matched. 
  • If you don't have or know the original legacyID and sourcename values used (or the record was created via the user interface), then you can't use a CSV import to update the title, identifier, or repository of a record - because changing these will cause the match to fail
By default, if no match is found, then AtoM will proceed by importing the record as a new record. Fortunately, there is a "skip unmatched" option in both the user interface and the command-line import task. When this is used, if no match is found, AtoM will simply skip the record, rather than importing it as a new record (and potential duplicate). 
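
If you have database access, you can also gauge that ambiguity risk before importing. Here's a sketch, assuming AtoM's usual table names (information_object, information_object_i18n, and actor_i18n, which holds a repository's authorized form of name) and hypothetical values:

    -- Count candidate matches for a given title + identifier + repository
    SELECT COUNT(*)
    FROM information_object io
    JOIN information_object_i18n i ON i.id = io.id AND i.culture = 'en'
    JOIN actor_i18n r ON r.id = io.repository_id AND r.culture = 'en'
    WHERE i.title = 'Correspondence'
      AND io.identifier = 'F1'
      AND r.authorized_form_of_name = 'Example Archives';

A count greater than 1 means the secondary criteria are ambiguous, and you can't be sure the right record will be updated.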

Finally, if you'd like to see most of the above explained a bit differently, I've written about this in the user forum previously - here's the most comprehensive example: 

Options for how you might proceed

First, be aware that not all AtoM fields can be updated via import! I've tried to update the 2.6 documentation to make this as clear as possible - please review it before proceeding: 
If you have access to the command-line, then there's one good option you can explore, not previously discussed. 

In 2.6, we've added support for a new CLI import option, called --roundtrip. When this option is used:
  • AtoM ignores any legacyID and sourcename values in the keymap table
  • AtoM will also ignore any other matching criteria, and
  • AtoM will ONLY look for matches by comparing the legacyID value in the CSV against AtoM's objectID values. 
This means you could: 
  • Add the collection you want to update to the clipboard, and export it as a CSV
  • Make updates to the CSV as needed, and save
  • Import using the command-line CSV import task, with the --roundtrip option and the --skip-unmatched option (that way, any records that don't match are skipped and reported in the console, rather than coming in as new duplicate records)
See: https://www.accesstomemory.org/docs/2.6/admin-manual/maintenance/cli-import-export/#importing-archival-descriptions
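
For illustration, the command might look something like this (a sketch - the file path is hypothetical, and you should check the linked documentation for the exact options available in your version):

    # Run from the root AtoM installation directory
    php symfony csv:import /path/to/exported-updates.csv \
        --update="match-and-update" \
        --roundtrip \
        --skip-unmatched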

This is a good option because it avoids the complexity of the existing matching criteria, instead relying on objectID matches, which should be a 100% match for a CSV that was exported first. It also doesn't matter whether the records were originally imported or not, since legacyID values in AtoM's database are bypassed entirely. In the future, we hope to add support for this option in the user interface. 

If using the command-line is not an option for you: 

As noted above, there is a secondary set of matching criteria that relies on exact matches on title, identifier, and repository name. Use the "skip unmatched" option, and try out a small import to see if you're able to get a match. 

I would suggest you test with a small batch - ideally on a separate test system; if that's not possible, make sure you (or someone) creates a backup of the database first, just in case you accidentally make a bunch of duplicates. 

Let us know how it goes! 

Dan Gillean, MAS, MLIS
