Database performance issue with batch deletion

153 views
Skip to first unread message

Sunny Lee

unread,
Aug 15, 2019, 3:43:59 PM8/15/19
to ica-ato...@googlegroups.com

Hello,

We’ve been having a recurring database performance issue with batch deletion of descriptions and we’d like to get some advice on this.


As of now, we have 91,530 rows in the information_object table and it takes about 6 minutes for executing delete-description command which deletes a description with its 362 child descriptions (details are in the Issue #2). The AtoM version we tested is 2.5.1 and the following issues were existing from the previous versions. 


Issue #1:
Deleting description using UI (archivists use this) gets a timeout error when the parent description has many child descriptions. I can’t specify what number of child descriptions would/wouldn’t cause this issue because that can vary depending on server. (Current our test database server has 4 CPUs with 16G memory. Other database servers we tested had 8 CPUs with 16G memory or cluster database with more capacity.)

Increasing max_execution_limit to a certain big number (https://groups.google.com/d/msg/ica-atom-users/cwiPXeOjYYo/gfQ8AgDzAQAJ) may prevent this problem but I’m hesitant to make this change because it could cause a performance risk on our production infrastructure. Instead, can `Delete` description create a job (if it has over certain number of child descriptions) and show the job ID similarly to Import workflow instead of waiting HTTPS response?

Issue #2:

Deleting a `Series` record which has 362 child descriptions took about 6 minutes on a single dedicated server for AtoM when I ran delete-description command on our test server. It was updating LFT and RGT columns of information_object table every time an information_object was deleted. Both UPDATE queries updated over 40,000 rows per time (multiply 362). Same test on our cluster database ran even slower and connections were piling up. As a result, all sites which used the same cluster database server got 503 error.


We think that the core problem might be that the deletion process takes so long to complete, taking almost six minutes. Could you explain to us why there are so many intensive database queries and commands executed by the deletion request … particularly these two update queries:

-----

UPDATE information_object
SET information_object.LFT = information_object.LFT - '2'
WHERE information_object.LFT >= '100631'

 

UPDATE information_object
SET information_object.RGT = information_object.RGT - '2'
WHERE information_object.RGT >= '100631'


-----

Execute time for delete-description command: 
-----

$ time sudo php symfony tools:delete-description writing-6 -B

>> delete-description [08:43:55 AM] Deleting description "Writing" (slug: writing-6, +361 descendants)

>> delete-description [08:49:39 AM] Finished: 362 descriptions deleted.

 

real      5m44.575s

user     0m1.660s

sys       0m0.632s


-----
 

ANALYZE TABLE looked fine and most of the LFT and RGT rows were indexed. 

----- 

MariaDB [atomDiscArch]> ANALYZE TABLE information_object;

+---------------------------------+---------+----------+----------+

| Table                           | Op      | Msg_type | Msg_text |

+---------------------------------+---------+----------+----------+

| atomDiscArch.information_object | analyze | status   | OK       |

+---------------------------------+---------+----------+----------+

1 row in set (0.03 sec)


-----

Record counts for information_object.LFT >= '100631' and information_object.RGT >= '100631' before running the above delete-description command: 

-----

MariaDB [atomDiscArch]> SELECT count(*) FROM information_object WHERE information_object.LFT >= '100631';

+----------+

| count(*) |

+----------+

|    41121 |

+----------+

1 row in set (0.00 sec)

 

MariaDB [atomDiscArch]>

MariaDB [atomDiscArch]> SELECT count(*) FROM information_object WHERE information_object.RGT >= '100631';

+----------+

| count(*) |

+----------+

|    41125 |

+----------+

1 row in set (0.02 sec)  

-----


Any help would be greatly appreciated!

Sunny

Dan Gillean

unread,
Aug 19, 2019, 5:38:15 PM8/19/19
to ICA-AtoM Users
Hi Sunny, 

The main expense, in terms of computational resources, when deleting a description with children is updating the nested set progressively as each record is deleted. This is why it is taking so long, and why you're seeing so many database queries and updates during the process. 

AtoM uses a relational database to store its data (MySQL). However, relational databases, which are comprised of flat tables, are not particularly suited to handling hierarchical data such as archival desrciptions, where an item might be nested in a file, which is in a series, as part of a collection. As developer Mike Hillyer notes, “Hierarchical data has a parent-child relationship that is not naturally represented in a relational database table.” 

One method of addressing this is to employ a “Nested set model” (Wikipedia). It's a way of giving a set of nested elements left and right value numbers that can be used to speed up ordering and retrieval operations. Hillyer uses the following example of electronics to explain: 



To turn this type of hierarchy into a nested set model that we can use in our relational database, imagine flattening this out into containers nested inside of each other instead of a tree. We then start numbering all the edges, from the left to the right: 



This information is added as lft and rgt (rather than left and right, which are reserved terms in MySQL) values. These values can then be used to retrieve a single node, or a particular "branch" and all its "leaves" (think a subset and its descendants, as well as knowing the adjacent sibling records) without having to perform complicated joins across tables. 

However, during delete operations in AtoM, every time a record is deleted, every other record's left and right values in the nested set are updated - and this is repeated for each description deleted. It ends up being very computationally expensive and is the slowest part of the delete operation. 

In terms of Case #1 in your message: yes, in the future, it would certainly be possible to add a task managed by the job scheduler that would perform the delete operation in the background, to avoid timeouts. It may also be possible to make this job only kick in if the target description has a certain number of descendants. Either way, this will require development to implement - more on that below. 

RE: Case #2 - because it is the repetitive updates to the nested set that are both slow and inefficient, one option would be to add a --disable-nested-set-updates option to the delete-description task. Essentially, when used, AtoM would skip updating the nested set entirely, until the very end of the operation. At this point, it throws out the entire nested set as out-of-date, and regenerates the whole thing - however, even in a fairly large installation, this still takes very little time to complete (usually a couple seconds). Having to do this only once at the end, instead of dozens or thousands of times, would vastly improve the speed and performance of the task! 

We already have such an option on the CSV import task for description (docs here), and we are in the process of adding such an option to the digital object import task. You can see the commit where the work on the load task was done here: 
Adding such an option to the delete descriptions task is comparatively a much smaller and easier development project than adding job scheduler support via the user interface. However, it's also possible that the job could skip the nested set build during deletion as well, and then simply update it at the end of the job. 

Either way, both options would require community support for us to implement in a future AtoM release. This could be in the form of development sponsorship, or a community pull request. If you'd like to learn more about how we develop and maintain AtoM, please see: 
In terms of sponsorship, if your institution is potentially interested in sponsoring one of these enhancements, please feel free to contact me off-list, and Artefactual would be happy to prepare estimates for your review. 

In terms of development - if you are (or have access to) a developer, much of the work in the commit linked above could be reused to update the delete-descriptions task. You can find that task in AtoM's code base here: 
Earlier today I posted a response to another user full of development resource links - you can find them here: 

Otherwise, I understand your hesitance in modifying the PHP execution limit settings, but I can't think of many other workarounds that might solve this without development work. 

Best, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/1fc0d283-9e29-4a24-851d-00f63a35dce3%40googlegroups.com.

Sunny Lee

unread,
Aug 20, 2019, 3:21:00 PM8/20/19
to AtoM Users
Hi Dan, 

Thank you for the clear explanation and sharing the resources. These are very useful and informative.
As you suggested, delete-description with --disable-nested-set-updates option and update the nested set entirely at the end would greatly improve the performance. 
I was in the process of testing AtoM 2.5 installation and listing any issues of it. As our archivists test their import data on the development site first, we don't need to run batch deletion on the production site often. For that reason, I will add it to our wishlist for now and have a further discussion on this if the demand for this improvement increases.

Much Appreciated, 
Sunny

To unsubscribe from this group and stop receiving emails from it, send an email to ica-ato...@googlegroups.com.

r.ruti...@docuteam.ch

unread,
Jul 29, 2021, 6:34:50 AM7/29/21
to AtoM Users
Hi Dan
The  --disable-nested-set-updates option does not seem to work anymore, right? Is there another way to speed up the deletion process?
Best, Roger

Dan Gillean

unread,
Jul 29, 2021, 9:27:24 AM7/29/21
to ICA-AtoM Users
Hi Roger, 

It was a suggestion based on the feedback about the task performance, but the work to implement it was not sponsored, and it has not yet been included in AtoM for the delete descriptions task. It should still work on the digital object load task, but if this is not what you're finding, let me know - I will reproduce the issue locally and file a bug ticket if needed. 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him


Reply all
Reply to author
Forward
0 new messages