csv:import performance problem after upgrading from 2.4 to 2.5

vincent....@fr.ch

Nov 11, 2019, 8:29:05 AM
to AtoM Users

Hi,

We experienced a performance problem after migrating from AtoM 2.4.0 to AtoM 2.5.3. The command csv:import --update=match-and-update --keep-digital-objects is roughly 30x slower than before. The server is running Ubuntu 18.04 LTS with PHP 7.2.24, on the same hardware configuration.

Any idea why it is so much slower? CPU usage is not high, and we do not see any obvious bottleneck.

Thanks for your help.

Best regards

Vincent

Steve Breker

Nov 13, 2019, 8:09:55 PM
to AtoM Users
Hi Vincent

I’ve tested a large CSV description import on both an AtoM 2.4.x and a 2.5.3 VM, and I am not experiencing the slowness you are seeing. The time it takes to import and create descriptions depends greatly on the number and type of relationships between records. Are you able to try loading the exact same spreadsheet on both a 2.4-based and a 2.5-based system?

I've highlighted some things you can check on your system below. Are all the components (Nginx/AtoM, MySQL, ES) installed on a single server, or spread across multiple servers?

1) When you set up your new 18.04 system, do you recall if you made the MySQL configuration changes specified here:
Specifically the change to 'mysqld.cnf' to include:
  • optimizer_switch='block_nested_loop=off'
The optimizer_switch setting affects the performance of certain database actions and needs to be in place for MySQL 5.7 and up.


2) If you run ‘top’ on your AtoM and Nginx servers, do you see a high iowait (wa) value while the import is running? This could indicate that the bottleneck is disk or network I/O (the latter in a distributed setup with a separate MySQL server).
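
For check 1, the active value can be read in the MySQL client with `SELECT @@optimizer_switch;`, which returns one comma-separated string of flags. The following is only a minimal Python sketch for inspecting such a string once it has been copied out by hand. The helper names are hypothetical and are not part of AtoM or MySQL, and the example strings are shortened.

```python
def parse_optimizer_switch(value):
    """Turn 'flag_a=on,flag_b=off' into {'flag_a': 'on', 'flag_b': 'off'}."""
    flags = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, state = item.partition("=")
        flags[name.strip()] = state.strip().lower()
    return flags


def block_nested_loop_is_off(value):
    """Return True only when block_nested_loop is explicitly set to off."""
    return parse_optimizer_switch(value).get("block_nested_loop") == "off"


# Shortened, made-up example strings; a real server returns many more flags.
print(block_nested_loop_is_off("index_merge=on,block_nested_loop=off"))
print(block_nested_loop_is_off("index_merge=on,block_nested_loop=on"))
```

If the function returns False for the string from your server, the 'mysqld.cnf' change is probably not in effect, for example because MySQL was not restarted after editing the file.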


Steve

nicola...@gmail.com

Nov 19, 2019, 10:29:10 AM
to AtoM Users
Hi Steve, 

We strictly followed your installation/upgrade guides, and all the components (Nginx/AtoM, MySQL, ES) are running on a single server.

We actually have two separate single-server test systems:
• Server A as SA: AtoM 2.4.1, Ubuntu 14.04.5 LTS, Php5.6, MySQL 5.5.59, ES 1.7.6
• Server B as SB: AtoM 2.5.3, Ubuntu 18.04.3 LTS, Php7.2, MySQL 5.7.27, ES 6.8.4
Our two servers are virtualized and have the same CPU and memory configurations. (2 CPU, 8GB RAM…).

To isolate the problem, we ran the CSV import in three ways:
• Downgraded SB to PHP 5.6 and ran the import on SB: it was still slow.
• Ran the import on SB against SA’s database: it was still slow.
• Ran the import from SA against SB’s database: it ran fast, like our normal SA setup.

We therefore think that the problem is in the new AtoM code.

Thank you for your help.

Best regards,

Nicolas

Dan Gillean

Nov 19, 2019, 10:33:36 AM
to ICA-AtoM Users
Hi Nicolas, 

I will let Steve continue to follow up on the performance of the CSV import code - I just wanted to mention one thing based on your last message: 

• Server B as SB: AtoM 2.5.3, Ubuntu 18.04.3 LTS, Php7.2, MySQL 5.7.27, ES 6.8.4

Currently, the highest version of Elasticsearch we have confirmed to work with AtoM 2.5 is version 5.6. If ES 6.8 is working for you, great! But you may run into issues as you continue to use the application. 

We hope to be able to upgrade to ES v7.4 for the AtoM 2.6 release, but as this is currently not work sponsored by our community, I cannot guarantee we'll be able to include this upgrade in the next release. 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/e8e40b7a-45f4-4c44-b11e-410b5d19ead3%40googlegroups.com.

Steve Breker

Nov 19, 2019, 3:17:24 PM
to AtoM Users
Hi Nicolas

As Dan points out, ES 5.6 is the recommended version for AtoM 2.5.3. This could be the source of the performance issues you are seeing since indexing does occur during an import - there could be version incompatibilities/errors slowing the process down.

I would recommend correcting the ES version and re-testing the import performance. If you are still seeing performance issues, it may be useful to try turning on MySQL's slow query log and look for clues.
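
The slow query log is controlled by MySQL's `slow_query_log` and `long_query_time` settings. Once it has collected entries during an import, something like the following can rank them by duration. This is a rough Python sketch, assuming the usual text log layout with a `# Query_time:` header line per entry. The function name is hypothetical and the sample log is invented for illustration, not taken from an AtoM system.

```python
import re

# Header line that MySQL writes before each logged statement.
QUERY_TIME = re.compile(r"^# Query_time:\s*([\d.]+)")


def slowest_queries(log_text, top=5):
    """Return up to `top` (seconds, statement) pairs, slowest first."""
    entries = []
    collecting = False
    for line in log_text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            # Start a new entry; its statement lines follow the header.
            entries.append((float(match.group(1)), []))
            collecting = True
        elif line.startswith("#"):
            # Other header lines (# Time, # User@Host) end the statement.
            collecting = False
        elif collecting and line.strip() and not line.startswith("SET timestamp"):
            entries[-1][1].append(line.strip())
    ranked = sorted(entries, key=lambda entry: entry[0], reverse=True)
    return [(seconds, " ".join(parts)) for seconds, parts in ranked[:top]]


# Invented sample in the slow query log layout.
SAMPLE_LOG = """
# Time: 2019-11-19T10:00:00.000000Z
# User@Host: atom[atom] @ localhost []  Id:     5
# Query_time: 0.200000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 1
SET timestamp=1574157600;
SELECT 1;
# Time: 2019-11-19T10:00:13.000000Z
# User@Host: atom[atom] @ localhost []  Id:     5
# Query_time: 12.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 900000
SET timestamp=1574157613;
SELECT * FROM relation WHERE object_id = 42;
"""

for seconds, statement in slowest_queries(SAMPLE_LOG):
    print(f"{seconds:8.3f}s  {statement}")
```

A statement that shows up repeatedly near the top of this list during the import is a good candidate to share on the list.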

Let us know what effect correcting the ES version has.

Thanks,
Steve

vdec...@gmail.com

Nov 20, 2019, 2:13:52 AM
to AtoM Users
Hi,

Nicolas made a mistake. We are running ES 5.6.16. ES 6.8.4 is running on our Archivematica server.

I will analyze the SQL queries.

Thanks for your help.

Vincent

li...@orangeleaf.com

Nov 21, 2019, 10:28:50 AM
to AtoM Users
We are experiencing severe performance issues too when doing an update CSV import.

I have narrowed it down to records where the eventActors field has a value of "Unknown".

Records that have eventActors="Unknown" are taking 1 hr 30 mins each to process.

I have tested it with records where eventActors is either NULL or has a different value, and those take seconds to update.

I have repeated my tests several times, and it always happens with records where I am trying to update event fields and the value in the eventActors field is "Unknown".

vdec...@gmail.com

Nov 21, 2019, 11:09:36 AM
to AtoM Users
Thanks for your feedback.

Right now we are back on Ubuntu 16.04, as the performance is better there. With Ubuntu 18.04 the throughput is very bad, and MySQL spends a lot of time in uninterruptible sleep. Maybe there are more SQL transactions in version 2.5 than in 2.4...

In my CSV, the eventActors field is filled in most of the time.

I will do more analysis and report back when I find something...
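
Processes in uninterruptible sleep are often waiting on disk or network I/O, which is what the iowait (wa) figure in top reflects. As a rough way to put a number on it over the length of an import, the aggregate `cpu` line of `/proc/stat` on Linux can be sampled before and after. The Python sketch below works on two such snapshots. The function names and the sample values are made up for illustration.

```python
def cpu_times(stat_text):
    """Return the numeric fields of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    first, second = cpu_times(before), cpu_times(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Field order on Linux: user nice system idle iowait irq softirq steal.
    # Guest fields are left out because they are already counted in user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


# Made-up snapshots; on a real server, read /proc/stat twice instead.
BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
AFTER = "cpu  150 0 70 900 130 0 0 0 0 0\ncpu0 150 0 70 900 130 0 0 0 0 0\n"

print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

To use it for real, read `/proc/stat` once when the import starts and once when it ends, and pass both texts to `iowait_percent`. A high value on 18.04 and a low one on 16.04 for the same CSV would point towards storage rather than AtoM's code.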

li...@orangeleaf.com

Nov 21, 2019, 11:41:10 AM
to AtoM Users
I should have said we are running AtoM v2.5.2

li...@orangeleaf.com

Nov 21, 2019, 11:49:57 AM
to AtoM Users
I've found that if there is just one record in the set of records I am trying to update where eventActors="Unknown" it slows down the import for all the records.

li...@orangeleaf.com

Nov 21, 2019, 11:55:28 AM
to AtoM Users
Would it be possible for you to do a test update CSV import for a record on your system where eventActors="Unknown", and see if you can replicate the issue I am seeing on our system (i.e. updating a record where eventActors="Unknown" takes 1 hr 30 mins for a single record)?

Dan Gillean

Nov 22, 2019, 12:41:46 PM
to ICA-AtoM Users
Hi Linda, 

I have a suspicion about what's going on here. There's nothing particular about creating or linking to an authority record titled "Unknown" that would be any different from any other record linking or creation process, so I'm not sure what a local test on our part would prove, unless we had your entire dataset to work with. My guess is, you are adding this to every lower-level description where you don't have a creator, rather than just leaving the field blank - and there are now hundreds or thousands of links to this single "Unknown" authority record. Is this the case?

There can be performance issues when you hard-link creators at all levels of description - AtoM includes creator inheritance to avoid this issue. Think of it this way: if you link a collection-level record to Jane Doe's authority record and let inheritance work, then there is just one related resource to update if you then go and edit the title of Jane's authority. If you hard-link Jane at every level of a multi-level description, then AtoM suddenly has to manage hundreds of relations with every update to the authority (or its permissions, etc). I suspect this might also be what is slowing down the import for you. 
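
To make the difference concrete, here is a toy Python model of the Jane Doe example. It is only an illustration of the idea and does not reflect AtoM's actual data model or code. The class and function names are invented.

```python
class Description:
    """A node in a hierarchy; creator=None means 'inherit from the parent'."""

    def __init__(self, title, creator=None):
        self.title = title
        self.creator = creator
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def effective_creator(self):
        """Walk up the tree until a hard-linked creator is found."""
        node = self
        while node is not None:
            if node.creator is not None:
                return node.creator
            node = node.parent
        return None


def walk(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def count_hard_links(root, creator):
    """Number of descriptions linked directly to the given creator."""
    return sum(1 for node in walk(root) if node.creator == creator)


def build_collection(items, hard_link_all):
    """One collection, one series, and `items` item-level descriptions."""
    lower = "Jane Doe" if hard_link_all else None
    root = Description("Collection", creator="Jane Doe")
    series = root.add(Description("Series", creator=lower))
    for number in range(items):
        series.add(Description(f"Item {number}", creator=lower))
    return root


inherited = build_collection(100, hard_link_all=False)
hard_linked = build_collection(100, hard_link_all=True)
print(count_hard_links(inherited, "Jane Doe"))    # 1
print(count_hard_links(hard_linked, "Jane Doe"))  # 102
```

In both trees every description resolves to the same creator, but the hard-linked tree carries 102 relations that have to be managed whenever the authority record changes, against a single relation with inheritance.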

Of course, you can still add different creators at lower levels where needed. But is having an actual authority record titled "Unknown" that is linked to thousands of descriptions at all different levels across different collections actually useful to your researchers? Or to you, given that it is causing such performance issues?

If you don't want to just leave the creator field blank (perhaps because of inheritance from a higher level that would suggest a creator you don't want shown), then the first thing I would suggest trying is to run the creator-unlinker task, to see if it helps: 
This task is designed to remove ONLY unnecessary hard links to creator authority records, where inheritance would produce the same result. So it shouldn't negatively affect your records - but if there are cases where a parent series is already linked to "Unknown" and it's not needed at the file or item level in that series (because inheritance would show "Unknown" just as well), then this will at least clean up those relations. 

If my theory is right and the creator-unlinker task isn't helping, you may want to consider what alternative arrangements would still tell your users what they need to know while sidestepping this issue as much as possible. For example, if inheritance from actual creator records at higher levels is not causing an issue, you might just leave the creator field blank, and possibly add a note somewhere else in the record's metadata fields. If inheritance is an issue but you have some freedom over the arrangement, you might consider grouping the records with "Unknown" creators into a subseries, so you can use inheritance in that subseries, etc. 

We continue to look for ways to optimize AtoM's code and increase performance and scalability with each release, and we're trying to do a bunch of work in the upcoming 2.6 release around this. Record relations are definitely one of the places we have been seeing bottlenecks, and while I can't think of a fix that we've already added to 2.6 that relates directly to this, we'll keep this in mind as we continue to prepare the 2.6 release.

If I'm not right, then any further information you can provide about your workflow and data would be helpful. How many relations does your "Unknown" authority record currently have, for example?
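
For a quick first estimate before looking in the database, the eventActors column of the import CSV itself can be tallied. The Python sketch below assumes that multiple actors in one cell are separated by a pipe character and that a literal NULL is a placeholder to skip. Please check both assumptions against your own file. The function name and the sample rows are invented. It counts rows in the CSV, not relations already stored in AtoM.

```python
import csv
import io
from collections import Counter


def count_event_actors(csv_file, column="eventActors", separator="|"):
    """Count how often each actor name appears in the given CSV column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for name in (row.get(column) or "").split(separator):
            name = name.strip()
            # Skip empty cells and the assumed NULL placeholder.
            if name and name != "NULL":
                counts[name] += 1
    return counts


# Invented sample rows; a real import file has many more columns.
SAMPLE_CSV = (
    "legacyId,title,eventActors\n"
    "1,Fonds A,Jane Doe\n"
    "2,File 1,Unknown\n"
    "3,File 2,Unknown|Jane Doe\n"
    "4,File 3,\n"
    "5,File 4,Unknown\n"
    "6,File 5,NULL\n"
)

counts = count_event_actors(io.StringIO(SAMPLE_CSV))
for name, total in counts.most_common():
    print(f"{total:6d}  {name}")
```

With a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of the `io.StringIO` sample. A single name that dominates the tally is the kind of heavily linked authority record described above.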

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

li...@orangeleaf.com

Nov 25, 2019, 1:16:05 PM
to AtoM Users
Thanks Dan, this explains what my techie colleagues suspected. We've ended up with lots of "Unknown" eventActors as a result of importing legacy data, so we now need to rethink how we do this.