OAI requests killing database


suh...@gmail.com

unread,
Sep 4, 2015, 1:00:44 PM9/4/15
to ICA-AtoM Users
Hi all,

We have a new AtoM instance up here at Ryerson, and want to harvest it via the OAI interface.  However, whenever I make *any* request, it launches a flurry of SQL queries like so:

SELECT object.CLASS_NAME, object.CREATED_AT, object.UPDATED_AT, object.ID, object.SERIAL_NUMBER, 
information_object.ID, information_object.IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER, 
information_object.LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID, 
information_object.REPOSITORY_ID, information_object.PARENT_ID, information_object.DESCRIPTION_STATUS_ID, 
information_object.DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER, 
information_object.SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID, information_object.LFT, 
information_object.RGT, information_object.SOURCE_CULTURE 
FROM `object`, `information_object` 
WHERE information_object.LFT<'41412' 
AND information_object.RGT>'41415' 
AND information_object.ID=object.ID 
ORDER BY information_object.LFT ASC,
information_object.LFT ASC


... where the LFT and RGT values increment through the entire table (we have about 40,000 descriptions).  On our current production machine this takes about 50s with mysqld taking about 90% CPU during that time (and php the remaining 10%).

This even occurs for requests such as Identify and ListMetadataFormats, which unless I'm overlooking something, shouldn't even have to query the descriptions at all.

I'm about to dive in and see if I can trace what is causing this, but wanted to ask here if anyone with more familiarity with the OAI code might be able to offer some insight.  We are seeing some ORM/database-related performance issues when working on objects that have a large number of relations (eg. http://atom.rula.info/index.php/kodak-canada-inc-2), but this seems like something entirely separate.

On a somewhat unrelated note, I noticed during my profiling that there weren't any indexes on the LFT and RGT columns; adding some seemed to make a small improvement, so perhaps some table optimization could help. I've tried various mysqld optimizations, but does anyone have experience or recommendations for improving database performance for an AtoM instance of this size?
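For reference, the index additions I experimented with were along these lines (the index names here are just illustrative, not what AtoM ships with):

```sql
-- Illustrative only: add indexes on the nested-set columns
CREATE INDEX information_object_lft_idx ON information_object (LFT);
CREATE INDEX information_object_rgt_idx ON information_object (RGT);
```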

Many thanks in advance,
MJ

suh...@gmail.com

unread,
Sep 4, 2015, 5:38:32 PM9/4/15
to ICA-AtoM Users, suh...@gmail.com
After some poking around this afternoon, I discovered that this has to do with how the QubitOai class and the arOaiPlugin are calling QubitInformationObject::getCollections(), which then iterates through all objects individually to check their ACL -- something that isn't necessary for OAI, since it doesn't have permission levels (and can be restricted via API key).

I've also come across a couple of other areas where the code can be cleaned up and improved, so I'm going to be working on a branch here:

Once there's a pull request, I'll post a reply to this thread with a URL.  I'd still love to hear from anyone regarding database performance improvements.

Cheers,
MJ

Dan Gillean

unread,
Sep 4, 2015, 5:51:02 PM9/4/15
to ICA-AtoM Users
Hi MJ!

Wow, thanks for taking such initiative and for being willing to share it back with the AtoM community! I look forward to seeing what you come up with. You're right that the OAI code does not need to consider all permissions - but publication status should be respected, so an AtoM instance is not exposing draft records to a harvester.

Many of our team have taken the day off, to extend the long weekend (for any international readers, Monday is a holiday in Canada) - I'll make sure our developers take a look at this thread on Tuesday when we've returned and offer some suggestions. Increasingly we have users with large holdings, and improving the performance and scalability of the application is an important topic to us these days, so I'm sure that they'll have some thoughts to share!

The Ryerson site looks great! Please let us know when it is formally launched, so that we can add it to our Users list! We'd be happy to tweet out a link and help generate some interest any other way we can as well.

Cheers,

Dan Gillean, MAS, MLIS
AtoM Product Manager / Systems Analyst,
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "ICA-AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.
Visit this group at http://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/2b02ff4f-9de3-414d-9b66-98c2636b2214%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

suh...@gmail.com

unread,
Sep 8, 2015, 11:18:20 AM9/8/15
to ICA-AtoM Users
Hey Dan,

No worries; thanks for the response on your day off!

My focus at the moment is just addressing the issues that are causing problems for our OAI harvesting and improving compliance for other harvesters.  It doesn't look like publication status is currently being honoured, but I agree it should be; I'll take a closer look.

As Jesús will likely remember, AtoM’s scalability has long been a concern of mine (going back to when Jack was customizing Propel in v0.6 or so), so it’s nice to have a chance to look at the code again and offer what I can with a few more years’ experience behind me.

Thanks for your offer — we’re still waiting on some branding / UI work before we officially launch the site, but I’ll be sure to let you know so we can inform the community appropriately together.  The staff here at RULA are already pretty excited to be working with AtoM, so I’m sure they’ll want to rave. :)

MJ

mi...@artefactual.com

unread,
Sep 8, 2015, 6:00:15 PM9/8/15
to ICA-AtoM Users, suh...@gmail.com
Hi MJ!!!

Regarding database performance improvements, I'm not the best person to answer, but we've swapped out some DB queries for Elasticsearch, simplified some logic to generate fewer queries (like the treeview code), and added caching here and there. I think we're also using Percona for some deployments, which likely helps a bit.

Cheers,
Mike

David Juhasz

unread,
Sep 8, 2015, 6:54:04 PM9/8/15
to ica-ato...@googlegroups.com
Hi MJ!

Nice to hear from you! When querying the DB, using raw SQL statements via PDO is much faster than using the ORM - especially when iterating through a lot of rows.  We tend to use PDO for queries that return multiple rows, and the ORM for single row results where the performance hit isn't critical, and insert and update operations where the ORM magic is very helpful.
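As a rough sketch of what that looks like (the table and column names below come from the queries quoted earlier in this thread; the DSN and credentials are placeholders, since in AtoM the connection is configured centrally):

```php
<?php
// Placeholder DSN/credentials for illustration only.
$pdo = new PDO('mysql:host=localhost;dbname=atom', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One raw query fetches all the rows at once, instead of the ORM
// hydrating an object (and potentially running extra queries) per row.
$sql = 'SELECT io.ID, io.IDENTIFIER, io.OAI_LOCAL_IDENTIFIER, o.UPDATED_AT
        FROM information_object io
        JOIN object o ON o.ID = io.ID
        ORDER BY o.UPDATED_AT ASC
        LIMIT 100';

foreach ($pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // Work with plain arrays; no per-row ORM hydration.
    echo $row['OAI_LOCAL_IDENTIFIER'], "\n";
}
```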

Best,
David

--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com


suh...@gmail.com

unread,
Sep 9, 2015, 10:43:40 AM9/9/15
to ICA-AtoM Users
Thanks David & Mike.  The PDO / ORM approach makes perfect sense, and it explains a few places in the code I've come across.  The current OAI tweaks have our responses down to about 3-4s (from 50s previously), so that's fast enough for production use.  I may still chip away at a few features (eg. resumption token support for ListSets -- we have ~1000 top-level objects), but otherwise I'll just fire off occasional PRs opportunistically whenever I think I'm "optimizing" something.

Here's the PR for our fork -- I can make a copy of this against qa/2.3.x if that's helpful for your workflow:  https://github.com/ryersonlibrary/atom/pull/5

Great to hear from y'all.  Nice to be back in touch.  :)

MJ

Glen Robson

unread,
Oct 14, 2015, 8:05:36 PM10/14/15
to ICA-AtoM Users, suh...@gmail.com
Hi,

We are also having problems running the OAI-PMH interface on AtoM 2.2. We have 1,114,758 records in the object table (I'm hoping that's the correct way to count the number of records). By adding the following indexes I've currently got it down to 5 minutes per OAI-PMH page of results:

CREATE INDEX lft_rft_id ON information_object (ID, RGT, LFT);
CREATE INDEX RGT ON information_object (RGT);
CREATE INDEX LFT ON information_object (LFT);

but this would still take around 38 days to harvest the complete repository. The main culprits seem to be the following two SQL statements:

SELECT COUNT(*) FROM (
SELECT object.CLASS_NAME AS object_CLASS_NAME, object.CREATED_AT AS object_CREATED_AT,
object.UPDATED_AT AS object_UPDATED_AT, object.ID AS object_ID,
object.SERIAL_NUMBER AS object_SERIAL_NUMBER, information_object.ID AS information_object_ID,
information_object.IDENTIFIER AS information_object_IDENTIFIER,
information_object.OAI_LOCAL_IDENTIFIER AS information_object_OAI_LOCAL_IDENTIFIER,
information_object.LEVEL_OF_DESCRIPTION_ID AS information_object_LEVEL_OF_DESCRIPTION_ID,
information_object.COLLECTION_TYPE_ID AS information_object_COLLECTION_TYPE_ID,
information_object.REPOSITORY_ID AS information_object_REPOSITORY_ID,
information_object.PARENT_ID AS information_object_PARENT_ID,
information_object.DESCRIPTION_STATUS_ID AS information_object_DESCRIPTION_STATUS_ID,
information_object.DESCRIPTION_DETAIL_ID AS information_object_DESCRIPTION_DETAIL_ID,
information_object.DESCRIPTION_IDENTIFIER AS information_object_DESCRIPTION_IDENTIFIER,
information_object.SOURCE_STANDARD AS information_object_SOURCE_STANDARD,
information_object.DISPLAY_STANDARD_ID AS information_object_DISPLAY_STANDARD_ID,
information_object.LFT AS information_object_LFT,
information_object.RGT AS information_object_RGT,
information_object.SOURCE_CULTURE AS information_object_SOURCE_CULTURE
FROM `status`, `information_object`, `object`
WHERE status.STATUS_ID=160
AND information_object.ID=status.OBJECT_ID
AND information_object.ID=object.ID
ORDER BY object.UPDATED_AT ASC
LIMIT 100
) AS propelmatch4cnt;


SELECT object.CLASS_NAME, object.CREATED_AT, object.UPDATED_AT, object.ID, object.SERIAL_NUMBER,
information_object.ID, information_object.IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER,
information_object.LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID,
information_object.REPOSITORY_ID, information_object.PARENT_ID, information_object.DESCRIPTION_STATUS_ID,
information_object.DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER,
information_object.SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID, information_object.LFT,
information_object.RGT, information_object.SOURCE_CULTURE
FROM `object`, `information_object`, `status`
WHERE status.STATUS_ID=160
AND information_object.ID=status.OBJECT_ID
AND information_object.ID=object.ID
ORDER BY object.UPDATED_AT ASC
LIMIT 100;

which each take over a minute to run. All of the WHERE fields have indexes, but looking at the execution plan for the second one:

+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table              | type   | possible_keys           | key         | key_len | ref                   | rows   | Extra                                        |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | status             | ref    | status_FI_1,status_FI_3 | status_FI_3 | 5       | const                 | 241592 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | information_object | eq_ref | PRIMARY,lft_rft_id      | PRIMARY     | 4       | atom.status.object_id |      1 |                                              |
|  1 | SIMPLE      | object             | eq_ref | PRIMARY                 | PRIMARY     | 4       | atom.status.object_id |      1 |                                              |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+--------+----------------------------------------------+

it's examining every row and using a filesort to sort the results (then applying the limit). As well as being slow, this requires 5 GB of space in /tmp/ to write out the file for sorting. I've added the following index:

CREATE INDEX id_updated ON object (updated_at, id);

but unfortunately MySQL chooses not to use it. If I add an index hint to the SQL:

SELECT object.CLASS_NAME, object.CREATED_AT, object.UPDATED_AT, object.ID, object.SERIAL_NUMBER,
information_object.ID, information_object.IDENTIFIER, information_object.OAI_LOCAL_IDENTIFIER,
information_object.LEVEL_OF_DESCRIPTION_ID, information_object.COLLECTION_TYPE_ID,
information_object.REPOSITORY_ID, information_object.PARENT_ID, information_object.DESCRIPTION_STATUS_ID,
information_object.DESCRIPTION_DETAIL_ID, information_object.DESCRIPTION_IDENTIFIER,
information_object.SOURCE_STANDARD, information_object.DISPLAY_STANDARD_ID, information_object.LFT,
information_object.RGT, information_object.SOURCE_CULTURE
FROM `object` USE INDEX (id_updated), `information_object`, `status`
WHERE status.STATUS_ID=160
AND information_object.ID=status.OBJECT_ID
AND information_object.ID=object.ID
ORDER BY object.UPDATED_AT ASC
LIMIT 100;

the query takes less than a second. The execution plan is:

+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+
| id | select_type | table              | type   | possible_keys           | key         | key_len | ref                   | rows | Extra       |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+
|  1 | SIMPLE      | object             | index  | NULL                    | id_updated  | 12      | NULL                  |  100 |             |
|  1 | SIMPLE      | status             | ref    | status_FI_1,status_FI_3 | status_FI_1 | 4       | atom.object.id        |    1 | Using where |
|  1 | SIMPLE      | information_object | eq_ref | PRIMARY,lft_rft_id      | PRIMARY     | 4       | atom.status.object_id |    1 | Using where |
+----+-------------+--------------------+--------+-------------------------+-------------+---------+-----------------------+------+-------------+

which seems to suggest it sorts the object table first, then applies the limit before joining the other tables, which is much quicker and only examines 100 rows.

Does this index addition make sense, and if so, is there any way to get AtoM to use it? Are there any other improvements we could make?

Thanks for your help

Glen Robson
National Library of Wales

David Juhasz

unread,
Oct 16, 2015, 1:45:23 PM10/16/15
to ica-ato...@googlegroups.com, suh...@gmail.com
Hi Glen,

Thanks for sharing your detective work.  This is really interesting data on what queries AtoM is performing and the impact that tuning the indexes can have.

AtoM is using an old Propel 1.3 ORM implementation which is producing the poorly optimized queries you are seeing.  Unfortunately I don't think Propel 1.3 supports any type of index hinting.  And while we would love to replace the current AtoM ORM with a modern ORM, it would be a huge undertaking that would require substantial funding.

One possibility for improving that query is to replace the ORM query with a raw PDO SQL query. In the past we've seen query time improvements of an order of magnitude or more when switching from an ORM-generated query to raw SQL.  The disadvantage of raw SQL is that it is not as portable as the ORM (in theory), so it would make it harder to use AtoM with a PostgreSQL RDBMS, for example.  In reality, though, AtoM is thoroughly hitched to MySQL, so while it's nice to dream of database portability, it's not really possible to use AtoM with another RDBMS at this point.

I don't see any reason that an (updated_at, id) index would be a bad idea.  In the query you're working with, I do wonder whether a plain updated_at index would produce similar speed improvements and perhaps be more likely to be used by MySQL without hinting?  I don't know much about how MySQL chooses whether to use an index, though, so this is mostly guesswork on my part.
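If it's useful, a quick way to test that guess would be something along these lines (the index name here is just illustrative):

```sql
-- Illustrative sketch: a single-column index on updated_at
CREATE INDEX object_updated_at_idx ON object (updated_at);

-- Then run EXPLAIN on the full ListRecords query from the earlier message
-- (without the USE INDEX hint) to see whether the optimizer picks it up.
```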

Regards,
David

--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com

Glen Robson

unread,
Oct 16, 2015, 8:18:41 PM10/16/15
to ica-ato...@googlegroups.com, suh...@gmail.com
Hi David,

Thanks for getting back on the questions below. I'm afraid I tried adding just an index on updated_at, and unfortunately MySQL still processes all of the rows in the table rather than doing the sort, then the limit, then the join, which is where the speed increase comes from. I'll talk to our developer and see if a raw PDO query is something we can implement.

Thanks for your help

Glen