Sep 29, 2016, 10:15:53 AM
I have a suggestion regarding indexing large transfers.
It appears that when indexing a transfer, a query like the
following is run for every file:
SELECT `Files`.`fileUUID`, `Files`.`sipUUID`, `Files`.`transferUUID`,
`Files`.`fileGrpUse`, `Files`.`fileGrpUUID`, `Files`.`checksum`,
`Files`.`checksumType`, `Files`.`fileSize`, `Files`.`label`,
`Files`.`enteredSystem`, `Files`.`removedTime` FROM `Files` WHERE
(`Files`.`transferUUID` = '3349f91e-450a-4b40-84a7-43ebf17653a0' AND
`Files`.`currentLocation` = REDACTED)
That is, it queries the Files table for a file with a given
transferUUID and currentLocation.
The problem is that by default there is no index on currentLocation.
So, with a large transfer (~50,000 files), the database scans the
entire set of files in the transfer to find the one with the given
current location, and it does that once per file in the transfer - a
slowdown quadratic in the number of files in the transfer.
After adding an index on currentLocation, performance changed so
that instead of MySQL using 100% CPU, it uses only a few percent,
and the ElasticSearch and elasticSearchIndexProcessTransfer processes
use the bulk of the CPU time instead.
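For reference, the index I added was along these lines (the index name and the 255-byte prefix are my choices, not anything prescribed by the schema; MySQL requires a prefix length when indexing long TEXT/BLOB columns, so adjust to match your column type):

```sql
-- Index currentLocation so the per-file lookup above stops scanning
-- every row in the transfer. The prefix length (255) is an assumption;
-- it is only needed if currentLocation is a TEXT/BLOB column.
CREATE INDEX `Files_currentLocation`
    ON `Files` (`currentLocation`(255));
```

You can confirm the index is being used by running EXPLAIN on the SELECT above and checking that the `key` column reports the new index rather than a full table scan.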
Is there a reason this isn't created by default? If not, it seems like
it would make sense to add.
Library IT Architecture & Engineering
University of Michigan