CSV import of archival descriptions failing to match

132 views
Skip to first unread message

Brandon Uhlman

unread,
May 25, 2018, 11:26:05 AM5/25/18
to ica-ato...@googlegroups.com
Hi, AtoM community.

First time poster, and a pretty new administrator of an AtoM system, so apologies in advance if I miss something obvious.

We are trying to change identifiers for a number of archival descriptions by exporting a CSV of the parent description and all its children from the AtoM interface, modifying the CSV externally, then reimporting it matching by source_name and legacy_id, which are unchanged. [1] By adding some debugging statements, I've confirmed that QubitFlatFileImport::setInformationObjectByKeymap() and QubitFlatFileImport::fetchKeymapEntryBySourceandTargetName() are being called with the expected parameters, but the latter isn't returning any results. :-(

I reviewed the data in the keymap table, and I think I've identified the problem: the SQL query in fetchKeymapEntryBySourceandTargetName() is expecting the legacy_id to be in keymap.source_id, but in our database, they are in keymap.target_id. Based on the AtoM documentation, it seems like that shouldn't be the case.

What I don't know if how the data got that way, or what I should do to fix it.

Does anybody have any suggestions?

Thanks!

Brandon

[1] Based on this thread from earlier this week (https://groups.google.com/d/msg/ica-atom-users/axJIt6PbDtY/1LLoxjgnDAAJ), that does look like a supported approach.

Dan Gillean

unread,
May 25, 2018, 12:18:46 PM5/25/18
to ICA-AtoM Users
Hi Brandon, 

Hmmm, that's interesting. Matching on existing records and updating them in place is inherently a tricky operation - but it does sound possible that you've identified a bug. I will have to ask one of our developers to take a look at this thread and see if they can offer further suggestions based on what you've reported. 

Were the original descriptions created via CSV import, or manually? I get the sense that you've already tried this, but just to be sure: If you are importing via the User interface, then the filename is automatically used as the --source-name value. For the original descriptions, you can check what the source name saved with the descriptions is by looking in the Administration area of the edit template: 


Make sure that you are either changing the import CSV to match any source name listed here, or if using the command-line interface, using the --source-name option to specify the match. 

Have you tried using any of the other limiting options to improve the match - e.g. limit the match to a specific repository, or a specific top-level description? If you are also using the skip unmatched option, then at least if it fails to find a match, it should have no ill consequences on your data (aka nothing will be imported). 

In the meantime, I will see if I can get further suggestions from our team. 

Regards, 



Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/b035cc5c-9b0a-4651-8fd5-b7a24ca00893%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brandon Uhlman

unread,
May 25, 2018, 12:51:16 PM5/25/18
to ICA-AtoM Users

Thanks for the good suggestions. We already caught the source-name-from-import-filename piece.

 

I didn’t try any of the limiting options, though it looks like they only get used later on in QubitFlatFileImport:handleInformationObjectRow() when trying to actually perform the import. $this->checkInformationObjectMatchLimit(), which performs the limit check, only gets called if we match by legacy ID or fields (title, repository, identifier) first, which we never will – in the first case for the reason I mention in my original message; and in the second case because identifier is the field we’re trying to update, so the old and new values don’t match.

 

~B

--

To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.

 

--

You received this message because you are subscribed to the Google Groups "AtoM Users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.

Dan Gillean

unread,
May 29, 2018, 12:24:20 PM5/29/18
to ICA-AtoM Users
Hi Brandon, 

Just a brief update to say hang in there - we're seeing what you're seeing, and one of our developers is considering how best we can assist you. I don't think it's a bug per se, but rather confusion around the intended purpose of the keymap table - as well as some early design decisions around how this was originally implemented that didn't take some possible workflows (like roundtripping from the same system) into account. 

We'll be following up soon with more information and suggestions! 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--

To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.

sbr...@artefactual.com

unread,
Jun 21, 2018, 5:02:33 PM6/21/18
to AtoM Users

Hello Brandon,


AtoM’s CSV import functionality was originally implemented to import from an external system to AtoM. The original purpose of the keymap table was to map an external system’s unique identifier to an AtoM object row id. This allows records to be exported from the origin system a second time to update records already loaded in AtoM.


Keymap table: Maps foreign object ids from non-AtoM system -> AtoM


The case you are describing is different in that you are exporting records from your AtoM system, altering them, then re-importing them into the same AtoM system.


Detailed explanation


When a CSV row from an external system is first imported into AtoM, a keymap record is created. The keymap record will be created with:

  • A source_id matching the CSV row's legacyId (this is the ’source' system’s unique id)

  • A target_id matching the newly created AtoM object id

  • A source name matching the original source file name (this can be overridden using the —source-name) command line option as Dan pointed out.


Once created, this keymap record will be used in the future when matching and updating AtoM records: it maps the source system id -> AtoM id.


When a record is re-exported from the original source system because a user wants to update this record in an AtoM system:

  • This CSV will contain the same legacyId value as when originally exported from the origin system.

  • This value will be found in the keymap.source_id

  • The filename will be the same (or —source-name is set similarly as when first imported)


From that keymap row, we can determine the AtoM unique id (in the keymap.target_id field) - this allows us to correctly match and update an existing AtoM record.



What is happening in your case:


The scenario you are describing is slightly different from the above use case since your source and target systems are the same.


  • Your existing keymap records map the original source system IDs from your first data import into AtoM,  to AtoM unique id’s. You do not want to alter or delete these records.

  • When you export a CSV from your AtoM system, then the legacyId will be populated with the source system’s unique ID which in this case will be the AtoM unique id from your AtoM system.

  • When the keymap matching logic runs, it will fail to match because it will not find a row with a keymap.source_id equal to AtoM’s unique id as contained in your AtoM-created CSV.  


As you are noticing, keymap matching will then fail to match in your case.


There is secondary matching logic that will try to match to an existing AtoM record based on the values for  identifier, title and repository name contained in the import CSV when keymap matching fails. Unfortunately this is also failing in your use case because you are making modifications to the record Identifier.  If it were any field other than identifier, title or repo name you were modifying in the CSV, this matching logic would have succeeded and updated your records in AtoM.


So to send the data on a ‘round trip’ where you want to export from AtoM, modify either the identifier, title or repo name in the CSV, then re-import back into the same AtoM instance:

  • Keymap matching is not applicable, and will fail.

  • Secondary matching on identifier, title and repo name will fail due to these fields being modified.


A third type of matching is required where we can tell AtoM that the CSV legacyId column contains a valid AtoM unique identifier for your exact AtoM system.



Short term resolution


In the short term it is possible to make this change in AtoM to accomplish what you are seeking.


WARNING:

  • This process is untested.

  • You should make a backup of your AtoM database and know how to restore your system should you need.

  • Ideally this should be conducted on a test system first if you have one available.

  • Proceed at your own risk!


I have created a branch of stable/2.4.x that contains a temporary change to lib/QubitFlatfileImport.class.php.


https://github.com/artefactual/atom/commit/ca2d56427b16b1d48185fb0899b3b8ba07e9f562


If you make this change to your AtoM instance and import your CSV file on the command line, the records will match and update your system.


The process would be:


  1. Back up your database.

  2. Backup your copy of lib/QubitFlatfileImport.class.php.

  3. Make the same change I’ve highlighted in "function handleInformationObjectRow()” in lib/QubitFlatfileImport.class.php.

  4. On the server from your atom home directory run: php symphony cc

  5. Restart the php-fpm service.

  6. Restart the memcached service.

  7. Import your CSV on the command line using csv:import.

  8. Undo the change to lib/QubitFlatfileImport.class.php by replacing it with your backup copy of this file. (ensure file permissions are exactly as they were prior to your change to this file)

  9. On the server from your atom home directory run: php symphony cc

  10. Restart the php-fpm service.

  11. Restart the memcached service.


Towards a long term improvement on roundtripping updates in the same system


Given that what you are trying to do appears to be a common use case for many AtoM users, I’ve filed a feature request ticket in the AtoM Wish List project that outlines a possible enhancement to the command-line import to better support roundtripping updates within the same system. This would involve adding logic that, when a specific import option is used, will bypass the legacyID check against the keymap table, and instead check the values directly against AtoM’s objectID values.



You would not want this import option used when importing from one system to another, since the ID used in your local system for a given description will not be globally unique, and may be used by a different description in the target system. Because this has the potential to cause a lot of problems if used incorrectly, I’ve recommended that for now this only be added to the command-line import, as an option that must be set (using something like a --roundtrip flag) by the system administrator when executing the task.


Please note that simply filing this ticket does NOT guarantee its inclusion in a future release. AtoM's development is almost entirely driven by community sponsorship - you can read more about the history of the project, our development philosophy, and how Artefactual maintains and develops AtoM at the following link:



If your institution might be interested in sponsoring such a feature, please feel free to get in touch with us off-list, and Artefactual can prepare a development estimate for you. As with all the feature development work we do, you'll get access to the feature immediately, and the entire community will also gain access in the next major-version public release.



Steve


Steven Breker, BSc
AtoM Programmer
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--

To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-at...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-at...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages