poor mysql performance while indexing large transfers

Skip to first unread message

Aaron Elkiss

Sep 29, 2016, 10:15:53 AM9/29/16
to archivema...@googlegroups.com
Hi all,

I have a suggestion to make with regards to indexing large transfers.

It appears that when indexing transfers, for every file, there's a
query run like:

SELECT `Files`.`fileUUID`, `Files`.`sipUUID`, `Files`.`transferUUID`,
`Files`.`originalLocation`, `Files`.`currentLocation`,
`Files`.`fileGrpUse`, `Files`.`fileGrpUUID`, `Files`.`checksum`,
`Files`.`checksumType`, `Files`.`fileSize`, `Files`.`label`,
`Files`.`enteredSystem`, `Files`.`removedTime` FROM `Files` WHERE
(`Files`.`transferUUID` = '3349f91e-450a-4b40-84a7-43ebf17653a0' AND
`Files`.`currentLocation` = REDACTED)`

That is, it's querying the Files table for a file with a given
transferUUID and currentLocation.

The problem is that by default there's no index on currentLocation,
only transferUUID.

So, with a large transfer (~50,000 files), it's scanning the entire
set of files in the transfer to find one with the given current
location, and doing that for each file in the transfer - so a
quadratic slowdown for the number of files in the transfer.

By adding an index on currentLocation, I saw performance change so
that instead of mysql using 100% CPU, it's using only a few percent
and the ElasticSearch and elasticSearchIndexProcessTransfer processses
are instead using the bulk of the CPU time.

Is there a reason this isn't created by default? If not, it seems like
it would make sense to add.

Aaron Elkiss
Library IT Architecture & Engineering
University of Michigan

Holly Becker

Oct 4, 2016, 5:30:41 PM10/4/16
to Archivematica Tech
That's an excellent idea. I made a pull request with the changes https://github.com/artefactual/archivematica/pull/509  Let me know if you'd like to change how you're credited in the commit message.

Reply all
Reply to author
0 new messages