PCDM Performance comparison in Fedora 4


Benjamin Pennell

Feb 16, 2016, 4:18:50 PM
to fedor...@googlegroups.com, hydra...@googlegroups.com, pc...@googlegroups.com
Hello all,

As part of determining a data model for our non-hydra repository in fcrepo4, I have been running some basic performance tests against various implementations revolving around PCDM.  It was suggested that I share the results, which are here:


These tests were performed with a fairly small number of intellectual objects (versus Fedora objects), between 200 and 5000, resulting in at most 55000 Fedora objects, depending on the implementation.  I did not use transactions.

The first tab compares all the implementations at 1000 objects created/moved/deleted.  It lists the time taken per object and per operation, how many triples or Fedora objects are created in Fuseki, etc.  After that are graphs comparing all implementations, with creates and moves performed incrementally.  The rest of the sheets contain data and graphs per individual implementation.

You can see a description of the tests and the scripts that performed them here:

All tests were run using the fcrepo-vagrant VM, with authentication turned off.  Tomcat had 4gb of memory and the vm had 8gb.  Otherwise no changes were made.

Discussion
==========

We were most interested in the Flat Hierarchy, which is modeled on the PCDM in Action/Sufia implementation using IndirectContainers, DirectContainers and proxies.  At 1000 objects, creation takes about 30% longer than the all-DirectContainer approach, and takes about 2.7 times longer than the vanilla Fedora structure.  Performance also drops off in what appears to be a non-linear pattern further out, taking 82% longer to create 5000 objects than the direct container implementation.  The others (minus the basic container implementation) appear to be fairly steady at this small number of objects.  It would be helpful to see how the Flat Hierarchy performs at larger numbers of objects, and I may test this next.
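For anyone reproducing comparisons like "30% longer" from their own timings, the arithmetic can be sketched as below. The function names and the timings are made up for illustration and are not taken from the spreadsheet.

```python
def percent_longer(baseline_seconds: float, candidate_seconds: float) -> float:
    """How much longer the candidate takes than the baseline, in percent."""
    return (candidate_seconds / baseline_seconds - 1.0) * 100.0


def times_as_long(baseline_seconds: float, candidate_seconds: float) -> float:
    """Plain ratio of candidate time to baseline time."""
    return candidate_seconds / baseline_seconds


if __name__ == "__main__":
    # Hypothetical per-object creation times, in seconds.
    direct_container = 0.10
    flat_hierarchy = 0.13
    print(round(percent_longer(direct_container, flat_hierarchy)), "% longer")
    print(round(times_as_long(direct_container, flat_hierarchy), 2), "times as long")
```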

Delete performance is understandably slower as well, since flat hierarchy objects needed to be deleted individually rather than deleting the parent collection, but this is a worst case comparison for this implementation.

The most prominent finding is that if PCDM is going to happen, it appears you are much better off using fcrepo's membership-populating LDP containers than populating membership via API calls.  This is demonstrated in the "Basic Container PCDM" results, particularly for moves.  While it's no surprise that more API calls would take longer, I was pretty surprised by the difference, considering there were no extra container or proxy objects.  Populating membership relations is a significant cost of using PCDM.
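To make the two styles concrete, here is a minimal sketch of the request bodies involved. It is not taken from the benchmark scripts: the helper names are hypothetical, and the assumption that the manual approach uses a SPARQL Update PATCH is mine. The LDP and PCDM predicates themselves are the standard ones.

```python
# Standard namespaces for LDP and PCDM.
PREFIXES = (
    "@prefix ldp: <http://www.w3.org/ns/ldp#> .\n"
    "@prefix pcdm: <http://pcdm.org/models#> .\n"
)


def direct_container_turtle(parent_uri: str) -> str:
    """Turtle body for an ldp:DirectContainer (hypothetical helper).

    With this container in place, the server asserts pcdm:hasMember on
    parent_uri for every resource created inside the container.
    """
    return (
        PREFIXES
        + "<> a ldp:DirectContainer ;\n"
        + f"   ldp:membershipResource <{parent_uri}> ;\n"
        + "   ldp:hasMemberRelation pcdm:hasMember .\n"
    )


def manual_membership_sparql(child_uri: str) -> str:
    """SPARQL Update a client could PATCH to the parent to add the triple itself.

    Assumption: one extra request like this per child in the manual approach.
    """
    return (
        "PREFIX pcdm: <http://pcdm.org/models#>\n"
        f"INSERT {{ <> pcdm:hasMember <{child_uri}> . }} WHERE {{ }}\n"
    )


if __name__ == "__main__":
    print(direct_container_turtle("http://localhost:8080/fcrepo/rest/collection1"))
    print(manual_membership_sparql("http://localhost:8080/fcrepo/rest/collection1/obj1"))
```

In the container style the membership triple costs the client nothing beyond creating the child; in the manual style each create or move carries at least one additional update request.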

I would be interested to know if there are other related factors people would like to evaluate.  I haven't tried to do anything with PCDM's ordering components since my institution isn't really interested in using it at this time, but if anyone else would like to investigate that I suspect others would like to know the costs of it.

 - Ben Pennell

Trey Pendragon

Feb 17, 2016, 12:54:56 PM
to Fedora Tech, hydra...@googlegroups.com, pc...@googlegroups.com
Hey Ben,

This is amazing work. Really outstanding, thank you for doing this and sharing the results. Just a couple comments:

The basic container approach seems significantly faster, with the pretty important side effect being that changing membership means changing the URI of the resource, which seems like a bad thing to me. Also, how do you put one object in two parents?

The graph of the flat PCDM approach is really interesting. I wonder if the non-linear time increase is a result of calculating the indirect container triples on the fly, rather than having them actually in the repository. Is it possible for Fedora to instead actually shove the triples in the repo when you POST to an indirect container?

- Trey Pendragon
Princeton University Library

Benjamin Pennell

Feb 18, 2016, 5:19:30 PM
to Fedora Tech, hydra...@googlegroups.com, pc...@googlegroups.com
Thanks Trey!

I should clarify that the "Basic Container" approach is actually the slowest in these tests for create and move operations (especially for moves).  I've added the y-axis label to the graphs where I missed it previously to help clarify.  In terms of the flat PCDM results, I haven't looked into the internals of how Fedora populates the generated triples, but that feature does appear to be beneficial for both ldp:DirectContainers and ldp:IndirectContainers in this kind of situation.

By request, I have also added results for read operations.  The implementations seem to be pretty similar overall, which makes sense given that intermediate membership containers would be skipped over when walking the tree.  Things did start to get erratic at higher numbers of objects with some of the approaches, particularly the direct container hierarchical and the basic container approaches.  I'm not sure why these in particular would be problematic, and it may be helpful to rerun the test to see if it's consistent.

Benjamin Pennell

Feb 19, 2016, 12:19:32 PM
to Fedora Tech, hydra...@googlegroups.com
And apologies for not answering two of your questions.  For the no-PCDM, basic and direct approaches, URIs would change with move operations.  If you wanted an object to be in a second parent, I believe you would have to make the second hierarchy use ldp:IndirectContainers with proxies, or some other approach to setting up membership relations.  I believe that doesn't preclude using the built-in Fedora hierarchy as the primary hierarchy.
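As a rough illustration of the second-parent idea, the sketch below builds the Turtle for an ldp:IndirectContainer and for a proxy created inside it. The helper names and example URIs are hypothetical and this is not lifted from the benchmark scripts; the ldp: and ore: predicates are the standard ones.

```python
PREFIXES = (
    "@prefix ldp: <http://www.w3.org/ns/ldp#> .\n"
    "@prefix pcdm: <http://pcdm.org/models#> .\n"
    "@prefix ore: <http://www.openarchives.org/ore/terms/> .\n"
)


def indirect_container_turtle(second_parent_uri: str) -> str:
    """Turtle body for a members container under the second parent.

    The server derives <second_parent> pcdm:hasMember <X> from each
    contained proxy that states ore:proxyFor <X>.
    """
    return (
        PREFIXES
        + "<> a ldp:IndirectContainer ;\n"
        + f"   ldp:membershipResource <{second_parent_uri}> ;\n"
        + "   ldp:hasMemberRelation pcdm:hasMember ;\n"
        + "   ldp:insertedContentRelation ore:proxyFor .\n"
    )


def proxy_turtle(member_uri: str) -> str:
    """Turtle body for a proxy pointing at an object that lives elsewhere."""
    return PREFIXES + f"<> ore:proxyFor <{member_uri}> .\n"


if __name__ == "__main__":
    base = "http://localhost:8080/fcrepo/rest"  # assumed local endpoint
    print(indirect_container_turtle(f"{base}/collection2"))
    print(proxy_turtle(f"{base}/collection1/obj1"))
```

Because the object itself stays under its original parent, its URI does not change when it gains the second membership; only the proxy is created or deleted.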

In my opinion, institutions will have different needs in terms of hierarchy and how they identify their resources, which is why the PCDM specification tries to remain implementation agnostic, so it is helpful to know what the performance implications are for these features.


I also wanted to mention that I've updated the way deletes are performed in the benchmarks, so that individual objects are now deleted in all cases rather than the root collection, since deleting the root collection heavily biased the results against the flat implementation.


Andrew Woods

Feb 22, 2016, 2:54:40 PM
to Benjamin Pennell, fedor...@googlegroups.com, Esmé Cowles, Hydra-Tech, pc...@googlegroups.com
Hello Ben,
Thanks for the rigor you put into testing and documenting these different PCDM/LDP scenarios. 

From a performance perspective, it is very useful to have these tests to help inform modeling decisions. However, strictly from a modeling perspective (performance aside) it is probably worth a discussion of the pros/cons related to the five models you have demonstrated. As Trey touched on, the characteristics of the models (e.g. Can resources move without changing URI? Can membership in multiple collections be accomplished with consistent client logic, etc) will likely narrow the field of approaches from which an institution will choose before even considering performance.

My suspicion is that installations will be most interested in performance related to creation and read. Presumably "read" will typically be addressed with effective caching, but your scripts could prove to be very helpful in setting benchmarks for optimizing "creation".

All of your results paint a consistent performance picture with the exception of the "basic-pcdm" creation case, where you are modeling with PCDM but creating all of the relationships with API calls vs. using the LDP containment interaction models. I would be very interested to explore the root of that non-linear behavior.

Potentially related to this example of non-linear behavior, all of your tests use PUT to create named resources. It would be interesting to see the results of your exact same tests using F4's auto-generated URIs with POST requests to see if the performance differs significantly.
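The difference between the two creation styles can be sketched as follows. The requests are only constructed, not sent; the base URL is an assumption about a local fcrepo4-vagrant endpoint and the function names are hypothetical.

```python
from urllib.request import Request

# Assumption: REST endpoint of a local Fedora 4 instance.
BASE = "http://localhost:8080/fcrepo/rest"


def put_named(parent_path: str, name: str) -> Request:
    """PUT to a client-chosen URI, as the current tests do."""
    return Request(f"{BASE}/{parent_path}/{name}", method="PUT")


def post_autogenerated(parent_path: str) -> Request:
    """POST to the parent and let the repository mint the child's URI."""
    return Request(f"{BASE}/{parent_path}", data=b"", method="POST")


if __name__ == "__main__":
    for req in (put_named("collection1", "obj1"), post_autogenerated("collection1")):
        print(req.get_method(), req.full_url)
```

With POST, the URI of the new resource comes back in the Location header of the response, so a test script would need to record it for later move and delete steps.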

Strictly from a testing infrastructure perspective, a big +1 for using the fcrepo4-vagrant box, which affords a consistent platform for reproducibility. One thing that should likely be tuned if we are using fcrepo4-vagrant for performance testing is the JAVA_OPTS used by Tomcat7. Currently, the JVM defaults are used, which are sub-optimal.

As a side note, I know Esmé is also performing a comparative analysis of alternate backends (MySQL and Postgres) to the default LevelDB. I would encourage you both to attend the next "Fedora Performance and Scalability" meeting to discuss how we can most effectively move these efforts forward:

Regards,
Andrew