Aaron Elkiss
Sep 29, 2016, 10:15:53 AM
to archivema...@googlegroups.com
Hi all,
I have a suggestion with regard to indexing large transfers.
It appears that when indexing transfers, a query like the following is
run for every file:
SELECT `Files`.`fileUUID`, `Files`.`sipUUID`, `Files`.`transferUUID`,
`Files`.`originalLocation`, `Files`.`currentLocation`,
`Files`.`fileGrpUse`, `Files`.`fileGrpUUID`, `Files`.`checksum`,
`Files`.`checksumType`, `Files`.`fileSize`, `Files`.`label`,
`Files`.`enteredSystem`, `Files`.`removedTime` FROM `Files` WHERE
(`Files`.`transferUUID` = '3349f91e-450a-4b40-84a7-43ebf17653a0' AND
`Files`.`currentLocation` = REDACTED)
That is, it's querying the Files table for a file with a given
transferUUID and currentLocation.
The problem is that by default there is no index on currentLocation,
only on transferUUID. So, with a large transfer (~50,000 files), each
lookup scans the entire set of file rows in the transfer to find the
one with the given currentLocation, and that scan is repeated once per
file - so indexing time grows quadratically with the number of files
in the transfer.
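For what it's worth, you can confirm the full scan with EXPLAIN;
assuming the stock table definitions, it should show only the
transferUUID index being considered, e.g.:

EXPLAIN SELECT `Files`.`fileUUID` FROM `Files`
WHERE `Files`.`transferUUID` = '3349f91e-450a-4b40-84a7-43ebf17653a0'
AND `Files`.`currentLocation` = REDACTED;

The "rows" estimate there should be roughly the number of files in the
transfer, which is what you'd expect without a currentLocation index.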
After adding an index on currentLocation, I saw performance change so
that instead of MySQL using 100% CPU, it uses only a few percent, and
the ElasticSearch and elasticSearchIndexProcessTransfer processes
consume the bulk of the CPU time instead.
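In case it's useful to anyone else, the index I added was along these
lines (the index name is my own choice):

CREATE INDEX `Files_currentLocation_idx`
ON `Files` (`currentLocation`(255));

Note that if currentLocation is a TEXT or long VARCHAR column, MySQL
requires a prefix length like the 255 above; adjust to taste.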
Is there a reason this isn't created by default? If not, it seems like
it would make sense to add.
Aaron Elkiss
Library IT Architecture & Engineering
University of Michigan