Database corrupted when importing many descriptions at once

ism...@gmail.com

Nov 26, 2020, 3:26:04 AM

to AtoM Users

Good morning. We have AtoM ready to go into production, but we are running into serious problems: after uploading large CSV files containing many descriptions, along with images, the database became corrupted.

We do not know whether the machine needs more resources or the configuration is wrong. Can you guide me? Have you ever seen tasks get blocked? It happens to me often. Thank you.

Dan Gillean

Nov 26, 2020, 5:35:41 PM

to ICA-AtoM Users

Hi Isabel,

Are you importing from the command-line, or via the user interface?

Creating descriptions and all the digital object derivatives can certainly be resource intensive. It can also potentially run into timeout errors, depending on some of your PHP execution limit settings. See:

https://www.accesstomemory.org/docs/latest/admin-manual/installation/execution-limits/

In general, I would encourage you to perform large imports from the command-line, so you can take advantage of some of the CLI options that can reduce upfront processing time. Most importantly:

- DO NOT use the --index option. Instead, wait until your imports are completed, and then re-index afterwards. This can significantly improve import performance.
- Similarly, DO use the --skip-nested-set-build option, and then run the build nested set task after your import (and before re-indexing!).
- You could also try the --skip-derivatives option, and then generate the digital object derivatives afterwards, using the --no-overwrite option on the derivatives generation task to speed it up (so it will only generate derivatives where they are missing, and not all derivatives).

Taken together, these 3 things should help improve the processing of CSV imports.

Keep in mind as well that improperly formatted CSVs can also be the cause of data corruption! Remember that AtoM expects CSVs to use UTF-8 character encoding and unix-style line endings. I've seen cases where spreadsheets prepared in Excel can, upon conversion, have the wrong line endings and/or field delimiters set, which can cause data to be pushed to the wrong columns - sometimes forcing unexpected data into some fields. If you're using a spreadsheet application to prepare your data, we strongly recommend using LibreOffice Calc, as it allows you to easily configure field delimiters and character encoding every time you open a CSV, and will use unix-style line endings by default!
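As a rough illustration of the encoding and line-ending requirements above, here is a minimal Python sketch that inspects a CSV's raw bytes before import. The function names are hypothetical and nothing here is part of AtoM; it only checks for valid UTF-8 and for carriage returns, and it assumes that no field is meant to contain a literal carriage return.

```python
def check_csv_bytes(data):
    """Return a list of human-readable problems found in raw CSV bytes."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append("not valid UTF-8 (first bad byte at offset %d)" % exc.start)
    if b"\r" in data:
        # Covers both Windows (\r\n) and old Mac (\r) line endings.
        problems.append("contains carriage returns, so line endings are not unix-style")
    return problems


def normalize_line_endings(data):
    """Convert \\r\\n and bare \\r to \\n. Assumes no field needs a literal \\r."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

To use it, read the file in binary mode (`open(path, "rb").read()`), pass the bytes to `check_csv_bytes`, and only write back the result of `normalize_line_endings` once you have a backup of the original file.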

If you do try the above suggestions, then I would also suggest the following order for the follow up tasks after the import completes:

1. Rebuild the nested set
2. Generate the missing derivatives
3. Populate the search index
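The import options and follow-up order above can be sketched as a small Python helper that only prints the commands, so they can be reviewed before anyone runs them. The options (--skip-nested-set-build, --skip-derivatives, --no-overwrite) come from this thread; the task names (csv:import, propel:build-nested-set, digitalobject:regen-derivatives, search:populate) are written from memory of the AtoM command-line documentation and should be treated as assumptions to verify against your version. The function name and the CSV path are hypothetical.

```python
import shlex


def import_plan(csv_path):
    """Return the commands for a large import, in the order suggested above.

    Task names are assumptions to check against your AtoM version's docs.
    """
    return [
        # Import without --index, skipping nested set and derivative work.
        ["php", "symfony", "csv:import",
         "--skip-nested-set-build", "--skip-derivatives", csv_path],
        # 1. Rebuild the nested set.
        ["php", "symfony", "propel:build-nested-set"],
        # 2. Generate only the derivatives that are missing.
        ["php", "symfony", "digitalobject:regen-derivatives", "--no-overwrite"],
        # 3. Populate the search index last.
        ["php", "symfony", "search:populate"],
    ]


if __name__ == "__main__":
    for command in import_plan("/path/to/descriptions.csv"):
        print(shlex.join(command))
```

Printing rather than executing is deliberate: the commands would need to be run from the AtoM installation directory by someone with console access, after a backup has been taken.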

Next: it is possible that your system resources are being exhausted by the import process. You could try using something like htop to monitor them during an import and see if you need to make changes. See:

https://www.accesstomemory.org/docs/latest/admin-manual/maintenance/troubleshooting/#troubleshooting-resources-limits

I'll add that you can also break up the CSV into smaller ones at logical breaking points - for example, a series. Once the first CSV with the parent fonds is imported, you can just change the parent series row in the second CSV to add the slug in the qubitParentSlug column - everything else can remain unchanged if the 2nd CSV only contains descendants of the series. Similarly, if you are including multiple large fonds/collections in a single CSV, then you might try breaking them out at the fonds level, as no further changes will be needed for the metadata in this case.
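To make the series-level split concrete, here is a minimal Python sketch operating on rows as dictionaries (for example, as read by `csv.DictReader`). It assumes the CSV uses the legacyId, parentId and qubitParentSlug columns; the function name, the sample IDs and the slug are hypothetical. Clearing parentId on the series row is a choice made in this sketch so that only the slug identifies the parent; it is not something stated in this thread.

```python
def split_at_series(rows, series_legacy_id, fonds_slug):
    """Split rows into (everything else, the series plus its descendants)."""
    # Collect the series and all of its descendants by following parentId.
    subtree = {series_legacy_id}
    changed = True
    while changed:
        changed = False
        for row in rows:
            if row.get("parentId") in subtree and row.get("legacyId") not in subtree:
                subtree.add(row["legacyId"])
                changed = True

    # Copy rows so the caller's data is left untouched.
    first = [dict(r) for r in rows if r.get("legacyId") not in subtree]
    second = [dict(r) for r in rows if r.get("legacyId") in subtree]

    # Only the top series row changes: it now points at the imported fonds.
    for row in second:
        if row["legacyId"] == series_legacy_id:
            row["parentId"] = ""  # assumption: rely on the slug alone
            row["qubitParentSlug"] = fonds_slug
    return first, second
```

The first list would be imported first; once the fonds exists and its slug is known, the second list can be written out and imported as its own CSV.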

Finally: we do have some suggestions in the documentation to help you find and resolve database corruption issues. See:

https://www.accesstomemory.org/docs/latest/admin-manual/maintenance/troubleshooting/#dealing-with-data-corruption

We strongly recommend you make a habit of taking a fresh backup before performing large imports into a production environment!

Regards,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056

@accesstomemory

he / him


ism...@gmail.com

Dec 2, 2020, 8:34:07 AM

to AtoM Users

Hi Dan, thank you very much for the prompt reply as always.

Imports are done via the user interface, because they are carried out not by us but by archive staff who do not have access to the console.

Is there a way that the tasks of indexing, nesting and creating derivatives of digital objects could be run from the interface separately?

I mean, import the CSVs only and at the end launch these tasks manually from the interface.

Regards.

Dan Gillean

Dec 2, 2020, 9:31:55 AM

to ICA-AtoM Users

Hi there,

Unfortunately, at this time we do not have support for running the CSV import via the user interface with some of the options I described (disabling the nested set and skipping the derivatives generation), nor do we have a way of launching command-line tasks from the user interface. Both would require further development to support.

In the meantime, if you are not yet running 2.6, you might consider upgrading. We did make a number of performance optimizations in the 2.6 release that could help with import times. You can see the 2.6 release notes here:

https://wiki.accesstomemory.org/Releases/Release_announcements/Release_2.6

Additionally, I would suggest trying to break up your large imports into smaller pieces. Just make sure that no child record in one CSV refers to a parent in another CSV, unless you are using the slug and not the legacyID to match child to parent - otherwise you might end up with errors or orphaned descriptions!
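A quick pre-import check along these lines can flag rows at risk of being orphaned. This is a minimal Python sketch, assuming the legacyId, parentId and qubitParentSlug columns and rows held as dictionaries (for example, from `csv.DictReader`); the function name is hypothetical and the check looks only at a single CSV in isolation.

```python
def find_unresolved_parents(rows):
    """Return legacyIds of rows whose parent cannot be resolved within this CSV.

    A row is flagged when it names a parentId that is not a legacyId in the
    same file and it has no qubitParentSlug to fall back on.
    """
    known_ids = {row["legacyId"] for row in rows if row.get("legacyId")}
    unresolved = []
    for row in rows:
        parent_id = (row.get("parentId") or "").strip()
        parent_slug = (row.get("qubitParentSlug") or "").strip()
        if parent_id and parent_id not in known_ids and not parent_slug:
            unresolved.append(row.get("legacyId"))
    return unresolved
```

An empty result means every child row either finds its parent in the same file or names one by slug.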

Regards,

Dan Gillean, MAS, MLIS


ism...@gmail.com

Dec 15, 2020, 8:30:10 AM

to AtoM Users

Good morning, okay, I'll update to 2.6 then and shorten the CSVs. Thanks for everything.
