Hi,
We are also having problems running the OAI-PMH on AtoM 2.2. We have 1,114,758 records in the object table (I'm hoping thats the correct way to count the number of records). By adding the following indexes I've currently got it down to 5 minutes per OAI-PMH page of results:
create index lft_rft_id on information_object (ID,RGT,LFT);
CREATE INDEX RGT ON information_object (rgt);
CREATE INDEX LFT ON information_object (LFT);
but this would still take around 38 days to harvest the complete repository. The main culprit seems to be the following two SQL statements:
SELECT COUNT(*) FROM (SELECT object.CLASS_NAME AS object_CLASS_NAME, object.CREATED_AT AS object_CREATED_AT, object.UPDATED_AT AS object_UPDATED_AT, object.ID AS object_ID, object.SERIAL_NUMBER AS object_SERIAL_NUMBER, information_object.ID AS information_object_ID, information_object.IDENTIFIER AS information_object_IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER AS information_object_OAI_LOCAL_IDENTIFIER, information_object.LEVEL_OF_DESCRIPTION_ID AS information_object_LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID AS information_object_COLLECTION_TYPE_ID, information_object.REPOSITORY_ID AS information_object_REPOSITORY_ID, information_object.PARENT_ID AS information_object_PARENT_ID, information_object.DESCRIPTION_STATUS_ID AS information_object_DESCRIPTION_STATUS_ID, information_object.DESCRIPTION_DETAIL_ID AS information_object_DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER AS information_object_DESCRIPTION_IDENTIFIER, information_object.SOURCE_STANDARD AS information_object_SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID AS information_object_DISPLAY_STANDARD_ID, information_object.LFT AS information_object_LFT, information_object.RGT AS information_object_RGT, information_object.SOURCE_CULTURE AS information_object_SOURCE_CULTURE FROM `status`, `information_object`, `object` WHERE status.STATUS_ID=160 AND information_object.ID=status.OBJECT_ID AND information_object.ID=object.ID ORDER BY object.UPDATED_AT ASC LIMIT 100) AS propelmatch4cnt;
SELECT object.CLASS_NAME, object.CREATED_AT, object.UPDATED_AT, object.ID, object.SERIAL_NUMBER, information_object.ID, information_object.IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER, information_object.LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID, information_object.REPOSITORY_ID, information_object.PARENT_ID, information_object.DESCRIPTION_STATUS_ID, information_object.DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER, information_object.SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID, information_object.LFT, information_object.RGT, information_object.SOURCE_CULTURE FROM `object`, `information_object`, `status` WHERE status.STATUS_ID=160 AND information_object.ID=status.OBJECT_ID AND information_object.ID=object.ID ORDER BY object.UPDATED_AT ASC LIMIT 100;
which each take over a minute run. All of the WHERE fields have indexes but looking at the execution plan for the second one:
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | status | ref | status_FI_1,status_FI_3 | status_FI_3 | 5 | const | 241592 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | information_object | eq_ref | PRIMARY,lft_rft_id | PRIMARY | 4 | atom.status.object_id | 1 | |
| 1 | SIMPLE | object | eq_ref | PRIMARY | PRIMARY | 4 | atom.status.object_id | 1 | |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+
its examining every row and using a file sort to sort the results (then doing a limit). As well as being slow it requires 5G of space in /tmp/ to write out the file for sorting. I've added the following index:
create index id_updated on object (updated_at,id );
but unfortunately mysql chooses not to use it. If i add an indexing hint to the SQL:
SELECT object.CLASS_NAME, object.CREATED_AT, object.UPDATED_AT, object.ID, object.SERIAL_NUMBER, information_object.ID, information_object.IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER, information_object.LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID, information_object.REPOSITORY_ID, information_object.PARENT_ID, information_object.DESCRIPTION_STATUS_ID, information_object.DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER, information_object.SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID, information_object.LFT, information_object.RGT, information_object.SOURCE_CULTURE FROM `object`
use index (id_updated) , `information_object`, `status` WHERE status.STATUS_ID=160 AND information_object.ID=status.OBJECT_ID AND information_object.ID=object.ID ORDER BY object.UPDATED_AT ASC LIMIT 100;
the query takes less than a second. The execution plan is:
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+
| 1 | SIMPLE | object | index | NULL | id_updated | 12 | NULL | 100 | |
| 1 | SIMPLE | status | ref | status_FI_1,status_FI_3 | status_FI_1 | 4 |
atom.object.id | 1 | Using where |
| 1 | SIMPLE | information_object | eq_ref | PRIMARY,lft_rft_id | PRIMARY | 4 | atom.status.object_id | 1 | Using where |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+
which seems to suggest it does the object sort first, then limit before joining the other tables which is a lot quicker and only examines 100 rows.
Does this index addition make sense and if so is there anyway to get AtoM to use it? Are there any other improvements we could make?
Thanks for your help
Glen Robson
National Library of Wales