Hello all,
As part of determining a data model for our non-Hydra repository in fcrepo4, I have been running some basic performance tests against various implementations revolving around PCDM. It was suggested that I share the results, which are here:
These tests were performed with a fairly small number of intellectual objects (as opposed to Fedora objects), between 200 and 5000, resulting in at most 55000 Fedora objects, depending on the implementation. I did not use transactions.
The first tab compares all the implementations at 1000 objects created/moved/deleted. It lists the time per object/operation, the number of triples or Fedora objects created in Fuseki, etc. After that are graphs comparing all implementations, with creates and moves performed incrementally. The rest of the sheets contain data and graphs for each individual implementation.
You can see a description of the tests and the scripts that performed them here:
All tests were run using the fcrepo-vagrant VM, with authentication turned off. Tomcat had 4 GB of memory and the VM had 8 GB. Otherwise no changes were made.
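For anyone who wants to reproduce the per-object/operation figures, here is a minimal sketch of how such timings could be collected and compared. This is not the actual test script; the names `time_per_operation` and `percent_longer` are hypothetical helpers, and the operation passed in stands in for whatever create/move/delete request is being measured.

```python
import time
from typing import Callable, Dict


def time_per_operation(operation: Callable[[int], None], count: int) -> Dict[str, float]:
    """Call operation(i) for i in range(count) and report wall-clock timings.

    `operation` is a placeholder for one create/move/delete request.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    start = time.perf_counter()
    for i in range(count):
        operation(i)
    total = time.perf_counter() - start
    return {"total_seconds": total, "seconds_per_operation": total / count}


def percent_longer(slower: float, baseline: float) -> float:
    """How much longer `slower` takes than `baseline`, as a percentage."""
    return (slower / baseline - 1.0) * 100.0
```

With this definition, "2.7 times as long" and "170% longer" describe the same ratio, which is worth keeping in mind when reading the comparisons.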
Discussion
==========
We were most interested in the Flat Hierarchy, which is modeled on the PCDM in Action/Sufia implementation using IndirectContainers, DirectContainers and proxies. At 1000 objects, creation takes about 30% longer than the all-DirectContainer approach, and about 2.7 times longer than the vanilla Fedora structure. Performance also drops off in what appears to be a non-linear pattern further out, taking 82% longer than the DirectContainer implementation to create 5000 objects. The others (minus the basic container implementation) appear to be fairly steady at this small number of objects. It would be helpful to see how the Flat Hierarchy performs at larger numbers of objects, and I may test this next.
Delete performance is understandably slower as well, since flat hierarchy objects need to be deleted individually rather than by deleting the parent collection, but this is a worst-case comparison for this implementation.
The most prominent finding is that if PCDM is going to happen, it appears you are much better off using fcrepo's membership-populating LDP containers than populating membership via API calls. This is demonstrated in the "Basic Container PCDM" results, particularly for moves. While it's no surprise that more API calls would take longer, I was pretty surprised by the difference, considering there were no extra container or proxy objects. Populating membership relations is a significant cost of using PCDM.
I would be interested to know if there are other related factors people would like to evaluate. I haven't tried to do anything with PCDM's ordering components since my institution isn't really interested in using them at this time, but if anyone else would like to investigate that, I suspect others would like to know the costs.
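To make the extra objects concrete, here is a rough sketch of the Turtle bodies a client might send when setting up a DirectContainer, an IndirectContainer, and a proxy. It assumes the standard LDP, PCDM and ORE vocabularies; the function names and any URIs passed in are hypothetical and are not taken from the test scripts. As I understand it, each body would go in a PUT or POST with Content-Type text/turtle.

```python
LDP = "http://www.w3.org/ns/ldp#"
PCDM = "http://pcdm.org/models#"
ORE = "http://www.openarchives.org/ore/terms/"


def direct_container_turtle(parent_uri: str) -> str:
    """Children created inside this container become pcdm:hasMember of the parent."""
    return (
        f"@prefix ldp: <{LDP}> .\n"
        f"@prefix pcdm: <{PCDM}> .\n"
        "\n"
        "<> a ldp:DirectContainer ;\n"
        f"   ldp:membershipResource <{parent_uri}> ;\n"
        "   ldp:hasMemberRelation pcdm:hasMember .\n"
    )


def indirect_container_turtle(parent_uri: str) -> str:
    """Membership is derived from the ore:proxyFor target of each proxy child."""
    return (
        f"@prefix ldp: <{LDP}> .\n"
        f"@prefix pcdm: <{PCDM}> .\n"
        f"@prefix ore: <{ORE}> .\n"
        "\n"
        "<> a ldp:IndirectContainer ;\n"
        f"   ldp:membershipResource <{parent_uri}> ;\n"
        "   ldp:hasMemberRelation pcdm:hasMember ;\n"
        "   ldp:insertedContentRelation ore:proxyFor .\n"
    )


def proxy_turtle(member_uri: str, parent_uri: str) -> str:
    """A proxy created inside the IndirectContainer, pointing at the real member."""
    return (
        f"@prefix ore: <{ORE}> .\n"
        "\n"
        "<> a ore:Proxy ;\n"
        f"   ore:proxyFor <{member_uri}> ;\n"
        f"   ore:proxyIn <{parent_uri}> .\n"
    )
```

Under this sketch, each member of an IndirectContainer costs one additional Fedora object (the proxy) on top of the member itself, which is consistent with the larger Fedora object counts for the proxy-based implementation.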
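As an illustration of why populating membership by hand costs more, here is a sketch of the requests a client might issue to move one member between two parents when it maintains pcdm:hasMember itself. This is an assumption about how such a client could work, not a description of the test scripts; `plan_manual_move` and its arguments are hypothetical, and the SPARQL Update bodies assume the parent resource is the target of a PATCH.

```python
from typing import List, Tuple

PCDM = "http://pcdm.org/models#"


def add_member_update(member_uri: str) -> str:
    """SPARQL Update body that adds a pcdm:hasMember triple to the PATCHed parent."""
    return (
        f"PREFIX pcdm: <{PCDM}>\n"
        f"INSERT {{ <> pcdm:hasMember <{member_uri}> }} WHERE {{ }}\n"
    )


def remove_member_update(member_uri: str) -> str:
    """SPARQL Update body that removes a pcdm:hasMember triple from the PATCHed parent."""
    return (
        f"PREFIX pcdm: <{PCDM}>\n"
        f"DELETE {{ <> pcdm:hasMember <{member_uri}> }} WHERE {{ }}\n"
    )


def plan_manual_move(
    old_member_uri: str, new_member_uri: str, old_parent_uri: str, new_parent_uri: str
) -> List[Tuple[str, str, str]]:
    """Return (HTTP method, target URI, body or destination) for each request.

    With server-managed containers only the MOVE would be needed; here the
    client also has to fix up membership on both parents.
    """
    return [
        ("PATCH", old_parent_uri, remove_member_update(old_member_uri)),
        ("MOVE", old_member_uri, new_member_uri),
        ("PATCH", new_parent_uri, add_member_update(new_member_uri)),
    ]
```

In this sketch a single move turns into three requests instead of one, before counting any cost on the server side for updating the parents.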
- Ben Pennell