Hi Isabel,
Are you importing from the command-line, or via the user interface?
Creating descriptions and all the digital object derivatives can certainly be resource-intensive. It can also run into timeout errors, depending on some of your PHP execution limit settings. See:
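For reference, these are the php.ini directives that most often cause timeouts during large web-based imports. The values below are only illustrative - tune them for your own server:

```ini
; Illustrative values only - adjust for your environment
max_execution_time = 300    ; seconds a script may run before PHP kills it
memory_limit = 512M         ; per-process memory ceiling
post_max_size = 72M         ; maximum size of POSTed data (must exceed upload size)
upload_max_filesize = 64M   ; maximum size of a single uploaded file
```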
In general, I would encourage you to perform large imports from the command-line, so you can take advantage of some of the CLI options that can reduce upfront processing time. Most importantly:
- DO NOT use the --index option. Instead, wait until your imports are completed, and then re-index afterwards. This can significantly improve import performance.
- Similarly, DO use the --skip-nested-set-build option, and then run the build nested set task after your import (and before re-indexing!)
- You could also try using the --skip-derivatives option, and then generate the digital object derivatives afterwards. Using the --no-overwrite option on the derivatives generation task will speed it up, since it will only generate derivatives where they are missing, rather than regenerating all of them.
Taken together, these three changes should help improve the processing time of CSV imports.
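Put together, a command-line import using those options might look something like this. The CSV path is a placeholder, and the command should be run from the root of your AtoM installation:

```shell
# --index is simply omitted, so no indexing happens during the import
php symfony csv:import \
  --skip-nested-set-build \
  --skip-derivatives \
  /path/to/my-import.csv
```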
Keep in mind that improperly formatted CSVs can also be a cause of data corruption! Remember that AtoM expects CSVs to use UTF-8 character encoding and unix-style line endings. I've seen cases where spreadsheets prepared in Excel can, upon conversion, end up with the wrong line endings and/or field delimiters, which can push data into the wrong columns - sometimes forcing unexpected values into some fields. If you're using a spreadsheet application to prepare your data, we strongly recommend LibreOffice Calc, as it allows you to easily configure field delimiters and character encoding every time you open a CSV, and will use unix-style line endings by default!
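You can also check for (and fix) these problems from the command line before importing. A quick sketch, using a throwaway demo file - substitute your real CSV for "import.csv":

```shell
# Create a small demo CSV with Windows-style (CRLF) line endings
printf 'legacyId,title\r\n1,Example fonds\r\n' > import.csv

# 1. Detect carriage returns; a non-zero count means CRLF line endings
grep -c $'\r' import.csv

# 2. Strip the carriage returns to get unix-style line endings
tr -d '\r' < import.csv > import-unix.csv

# 3. Confirm the file is valid UTF-8 (iconv exits non-zero otherwise)
iconv -f UTF-8 -t UTF-8 import-unix.csv > /dev/null && echo "UTF-8 OK"
```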
If you do try the above suggestions, then I would suggest the following order for the follow-up tasks after the import completes:
- Rebuild the nested set
- Generate the missing derivatives
- Populate the search index
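In command form, that post-import sequence would look something like this, again run from the root of your AtoM installation:

```shell
# 1. Rebuild the nested set
php symfony propel:build-nested-set

# 2. Generate only the missing derivatives
php symfony digitalobject:regen-derivatives --no-overwrite

# 3. Re-populate the search index
php symfony search:populate
```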
Next: it is possible that your system resources are being exhausted by the import process. You could try using something like htop to monitor them during an import and see whether you need to make changes. See:
I'll also add that you can break a large CSV into smaller ones at logical breaking points - for example, at the series level. Once the first CSV containing the parent fonds has been imported, you only need to change the top series row in the second CSV, adding the parent's slug in the qubitParentSlug column - everything else can remain unchanged if the second CSV contains only descendants of that series. Similarly, if you are including multiple large fonds/collections in a single CSV, then you might try breaking them out at the fonds level, as no further changes to the metadata will be needed in that case.
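As a sketch, the first rows of that second CSV might look like this. Only a handful of columns are shown, and the slug value is made up - use the actual slug of the fonds you imported in the first CSV:

```csv
legacyId,parentId,qubitParentSlug,title,levelOfDescription
201,,example-fonds,Correspondence,Series
202,201,,Letter to the editor,Item
```

Note that only the top series row needs a qubitParentSlug value; the rows below it can keep using parentId to reference legacyId values within the same CSV.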
Finally: we do have some suggestions in the documentation to help you find and resolve database corruption issues. See:
We strongly recommend you make a habit of taking a fresh backup before performing large imports into a production environment!
Regards,