Number of Objects in a container


Nals Star

May 12, 2017, 10:23:11 AM
to Fedora Tech
Hi,

We are planning to ingest data from different projects into the repository. Each project may range from several GB to several TB.

My questions are as follows:

1. Is there any limit on the number of binary objects that you can put in a container, and is there any limit on subcontainers?
2. I would like to have a separate container for each project and ingest its data there, which may involve many subcontainers. Is that a good idea?
3. When I log into the Fedora homepage, how do I restrict the view to show only top-level containers instead of all the subcontainers and files in them?


Any suggestions, please?

Thanks
Nals

Peter Matthew Eichman

May 12, 2017, 5:33:43 PM
to fedor...@googlegroups.com
Hi Nals,

1. In theory, Fedora (as an LDP implementation) doesn't have any limits on how many objects you can place in a container. In practice, the Modeshape implementation runs into the so-called "many members" problem when you get up to (roughly) a thousand or more objects. At this point, performance becomes seriously degraded.

To get around this, the current implementation will try to balance where your objects go by generating a random UUID identifier for them and placing them in a pairtree structure. This helps limit the number of direct children any one node has, and helps avoid the many members problem.

This pairtree strategy is only done automatically if you use POST to create resources and let Fedora decide the URI for them. If you create a resource using PUT, Fedora will use whatever URI you give it.

2. This depends somewhat on the details of your content modelling. In general, I believe the community best practice is to have a few (or even just one) top level containers and create resources in them using POST and Fedora's pairtree strategy. This has the benefit of keeping most meaning out of your URIs (in accordance with the principles from "Cool URIs Don't Change" [1]). However, if you really do need separate management of each project's resources then it might make sense to create a container for each project.

3. I'm not entirely sure what you mean by this. When you create resources using POST, the intervening pairtree nodes are not part of the LDP structure. So, if you have created a resource by POST to the root of your repository, and it created a resource whose URI path had "01/23/45/67/01234567-abcd-def0-abcdabcdabcdabcd", that resource (not the "01" node) will be the child of the root (in LDP terms) and show up in its list of children.
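The path in the example above can be reproduced with a short sketch. It assumes, based only on that example, that the leading characters of the minted identifier are split into four two-character segments. The function name and the `levels` and `width` parameters are hypothetical, and this is not Fedora's actual implementation.

```python
import uuid


def pairtree_path(identifier: str, levels: int = 4, width: int = 2) -> str:
    """Return a pairtree-style path for an identifier.

    Assumption: the first levels * width characters are split into
    fixed-width segments, followed by the full identifier, matching the
    "01/23/45/67/0123..." shape of the example in the thread.
    """
    prefix_len = levels * width
    if len(identifier) < prefix_len:
        raise ValueError("identifier too short for the requested pairtree depth")
    # Cut the prefix into fixed-width chunks: "01", "23", "45", "67".
    segments = [identifier[i:i + width] for i in range(0, prefix_len, width)]
    return "/".join(segments + [identifier])


# The identifier from the example in the thread.
example_path = pairtree_path("01234567-abcd-def0-abcdabcdabcdabcd")
print(example_path)

# A randomly generated UUID, similar to what a POST without a slug might mint.
random_path = pairtree_path(str(uuid.uuid4()))
print(random_path)
```

With two hexadecimal characters per segment, each intermediate node can have at most 256 direct children, which is how this layout keeps any single node from accumulating many members.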

Hope this helps,
-Peter


--
You received this message because you are subscribed to the Google Groups "Fedora Tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fedora-tech+unsubscribe@googlegroups.com.
To post to this group, send email to fedor...@googlegroups.com.
Visit this group at https://groups.google.com/group/fedora-tech.
For more options, visit https://groups.google.com/d/optout.



--
Peter Eichman
Senior Software Developer
University of Maryland Libraries

Christopher Johnson

May 13, 2017, 4:36:26 PM
to Fedora Tech
Hi list,

As performance is a topic that seems to arise frequently, I did a bit of searching and found this [1].

It seems that Modeshape can theoretically support at least 100k children in one container.

Perhaps it might be helpful to generate some performance threshold reports that compare different repository configs so that this speculative "upper node limit" can be analyzed and defined more clearly for Fedora.  

There are certainly many use cases where deep and opaque pairtree container modelling is unnecessary and annoying. The cleanest solution for many graph models may simply be "large" wide flat trees. For me it is just a question of knowing the operational costs in relation to a well-known "upper node limit" so that I can provide object creation constraints in a model.

Cheers,
Christopher



Andrew Woods

May 14, 2017, 9:54:58 AM
to fedor...@googlegroups.com
Hello Christopher,
The following test results provide some of what you are suggesting:

Regards,
Andrew


Christopher Johnson

May 15, 2017, 4:25:42 AM
to Fedora Tech
Hi Andrew,

Thanks for the info. From what I can see, these tests were run against fcrepo-4.0.0-beta-01, which used ModeShape 3.8.0.Final. I would assume that these results are bound to that (quite old) implementation.

In addition to the block segmentation feature, it seems that there are several techniques for designing for large numbers of child nodes [1], such as disabling JCR versioning.

I am interested in analyzing how these techniques actually work.  It might also be useful to see how ModeShape 5.3 compares to the original benchmark in Fedora.

Cheers,
Christopher

Andrew Woods

May 15, 2017, 8:01:14 AM
to fedor...@googlegroups.com
Hello Christopher,
Performing a more current round of testing would be very useful. Ideally, such testing would include results that can be compared to the earlier Fedora results [1]. 

If this is something that you would have an interest in leading, I suspect we could get others in the community to help out as well.

What are your thoughts?

Regards,
Andrew


Christopher Johnson

May 16, 2017, 3:23:28 AM
to Fedora Tech
Hi Andrew,

I would certainly like to help with this task.  Implementing the ModeShape Performance Test Framework [1] might be a good starting point.

A Fedora Performance Test Framework could also potentially be part of a CI build process...

Cheers,
Christopher

Andrew Woods

May 17, 2017, 11:43:13 AM
to fedor...@googlegroups.com
Hello Christopher,
Having a Fedora performance test framework run as a part of the CI build process would be a big win. That information would be valuable to include in release notes as well.

In light of the Fedora API Specification [1] and being able to quantitatively compare different versions of Fedora as well as different implementations of the specification, it seems like it would be most useful to put effort towards a test framework that focuses on interactions at the Fedora API level, versus at the ModeShape level. That is not to say that we should not be exploring the use of various ModeShape configurations [2].

We already have a strong starting point with the Fedora JMeter tests [3]. Would you be interested in either using these existing tests or creating new ones towards clarifying performance characteristics as well as improving our CI build process?

Thanks,
Andrew


Christopher Johnson

May 18, 2017, 3:44:53 AM
to Fedora Tech
Hi Andrew,

I definitely have a lot of ideas about this. I suppose the first step is to create an RFC and/or tracking task? The product selection should be community driven. I would argue that distributed load testing with Kubernetes has merit. Testing the API with the existing JMeter code is certainly within the scope of that. Also, simply creating Docker images for the different Fedora/ModeShape versions and configurations would make bootstrapping and initialization less complex.


Andrew Woods

May 22, 2017, 12:27:21 PM
to fedor...@googlegroups.com
Hello Christopher,
This topic could potentially represent an entry point for addressing a few project priorities:
1. Having repeatable tests for demonstrating performance characteristics of Fedora
2. Including those tests as a part of the CI builds and/or release process
3. Establishing more general performance tests against the Fedora API Specification [1]

As already noted, there has been a significant amount of work from the "Performance and Scalability" working group [2] related to #1, albeit somewhat outdated. It would take relatively little effort to rerun those tests against the current release of Fedora. Additionally, as you are suggesting, there are probably patterns and infrastructure that could be incorporated towards improving that effort.

Ideally, these performance tests would eventually target the Fedora API Specification to not only demonstrate changes in performance across Fedora releases, but also across different implementations of that specification.

My question at this point is, "Are there folks in the community who would like to join Christopher in moving any or all of these efforts forward?" If so, let's coordinate!

Regards,
Andrew


Christopher Johnson

May 22, 2017, 2:03:32 PM
to Fedora Tech
Hi Andrew,

Thanks for the well-written overview.  

I can add that I did a bit of research on distributed load testing over the weekend. I ran a short version of the "number of containers" test plan on a Kubernetes cluster, using JMeter's remote mode on 2 workers, as a proof of concept. The results and configuration are here.

Linking an ephemeral performance testing infrastructure with the fcrepo CI is definitely possible. First, I guess a Jenkins pipeline could include Docker builds and publishing to some cloud container hosting platform. (There are probably different opinions on this, and cost optimization is a factor, so I will refrain from offering an opinion of my own.) The "right container platform" should also include the status monitoring and reporting tools that are necessary for correlating performance test results with variable hardware configurations.

I am glad to help on this in any number of ways (writing performance tests or test plans, defining repository (and repository extension) container configurations and build scripts, doing comparative perf test infrastructure cost evaluation, etc.), as it interests me a lot!  

Cheers,
Christopher

Christopher Johnson

May 24, 2017, 5:05:22 AM
to Fedora Tech
Hi list,

To address the original question in this thread, I have found a possible value for the number of objects in a container at which latency reaches a maximum threshold, using the default ModeShape configuration at this fcrepo commit. I made a new test, "Single Wide Container", that PUTs objects into one container. The performance was logged using the existing test measuring instrumentation.

Here is the chart. Using 5 workers, it took roughly 30 minutes to create ~14000 objects. I stopped the test shortly after the latency spiked from 1500ms to 4000ms at 13160 objects.

I also monitored the resource usage of the fcrepo pod. Based on this, I do not think that the latency spike was a consequence of a resource limitation.

The test build process can easily be reproduced. I will try varying the ModeShape configuration to see how (or if) this threshold can be increased.
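For a rough sense of scale, the figures above can be turned into throughput numbers, and a spike like the one described can be located programmatically. The sketch below is not the instrumentation used in the test: the function names, the factor of 2, and most of the sample series are hypothetical. Only the ~14000 objects, 30 minutes, 5 workers, and the jump from 1500 ms to 4000 ms at 13160 objects come from the message.

```python
from typing import Optional, Sequence, Tuple


def throughput(objects: int, minutes: float, workers: int) -> Tuple[float, float]:
    """Return (objects per second overall, objects per second per worker)."""
    overall = objects / (minutes * 60)
    return overall, overall / workers


def first_spike(samples: Sequence[Tuple[int, float]], factor: float = 2.0) -> Optional[int]:
    """Return the object count at which latency first exceeds `factor`
    times the previous sample's latency, or None if it never does."""
    for (_, prev_ms), (count, cur_ms) in zip(samples, samples[1:]):
        if cur_ms > factor * prev_ms:
            return count
    return None


overall, per_worker = throughput(objects=14000, minutes=30, workers=5)
print(f"{overall:.1f} objects/s overall, {per_worker:.1f} objects/s per worker")

# Hypothetical (object count, latency in ms) series; only the final jump
# (1500 ms to 4000 ms at 13160 objects) is taken from the message.
samples = [(1000, 800.0), (5000, 1100.0), (13000, 1500.0), (13160, 4000.0)]
print(first_spike(samples))
```

Comparing each sample to its predecessor is a deliberately simple choice; a real analysis would more likely smooth the series first so that a single slow request is not mistaken for a threshold.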

- Christopher

Yinlin

May 25, 2017, 9:43:52 AM
to Fedora Tech
Hi

Since this topic relates to Fedora performance and scaling, and there is a call for others in the community to collaborate, we could hold a Performance - Scale meeting sometime in June (Monday, 11am) if there is a need.

Please let me know and I will resume this meeting.
 
- Yinlin

Andrew Woods

May 25, 2017, 7:36:45 PM
to fedor...@googlegroups.com
Hello Christopher,
Thank you for developing this test and sharing the results. 

As mentioned earlier, it would be valuable to the community to have an updated run of the Performance and Scalability tests [1] with the most recent Fedora release. If the execution of those tests could be further automated, it would facilitate their more frequent execution (at release time and potentially as a part of CI builds).

Naturally, any infrastructure that you create would serve as the foundation for future tests that work against the Fedora API Specification. 

If this is an effort in which you would be interested, please indicate how we on the list can help. Additionally, as Yinlin mentioned, if calling a meeting for broader collaboration would be of interest, let's make it happen.

Regards,
Andrew


Christopher Johnson

May 27, 2017, 12:09:48 AM
to Fedora Tech
Hi Andrew and Yinlin,

Having a meeting would be useful. There are a few items that could precede any concrete CI work:

1. Formalize any API performance test plan definitions in an ontology. This would also make test code and output references consistent.
Example: http://fedora.info/definitions/v4/performance#NumberofContainers
2. Similarly, an fcrepo subject test configuration ontology could be created. In theory, this establishes the range properties for the CI image builder. Defining config classes explicitly is perhaps also useful for comparing old Fedora versions with new ones.
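As an illustration of item 1, a test plan term could be declared as an RDFS class. The sketch below renders such a declaration as Turtle text. Only the namespace and the `NumberofContainers` local name come from the example URI; the helper function, the label, and the comment text are hypothetical and do not describe any published ontology.

```python
PERF_NS = "http://fedora.info/definitions/v4/performance#"  # from the example URI
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def turtle_class(local_name: str, label: str, comment: str) -> str:
    """Render one rdfs:Class declaration as Turtle text.

    Assumes label and comment contain no double quotes or backslashes,
    so no string escaping is performed.
    """
    return (
        f"@prefix perf: <{PERF_NS}> .\n"
        f"@prefix rdfs: <{RDFS_NS}> .\n"
        "\n"
        f"perf:{local_name} a rdfs:Class ;\n"
        f'    rdfs:label "{label}" ;\n'
        f'    rdfs:comment "{comment}" .\n'
    )


term = turtle_class(
    "NumberofContainers",
    "Number of Containers",  # hypothetical label
    "Test plan that measures container creation.",  # hypothetical description
)
print(term)
```

Test code and result files could then refer to the term by its full URI, which is what would keep references consistent across runs.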

I can start on this, and we could then review it at a meeting on 11 June?

Christopher Johnson
Scientific Associate
Bibliotheca Albertina, Leipzig

Yinlin

May 27, 2017, 8:01:32 AM
to Fedora Tech
Hi Christopher,

This is great! I will announce this meeting to the group early next week.
The meeting date would be June 12 (Monday), 11am EST.

Our time zones are six hours apart, so please let me know if this time works for you.

Thanks
Yinlin

west...@umd.edu

May 27, 2017, 8:08:14 AM
to Fedora Tech
University of Maryland would be interested in helping out with running tests where we can. At this point I cannot guarantee that someone will be available for the June 12 meeting, but we are interested in this effort.

Josh Westgard

Christopher Johnson

May 27, 2017, 5:29:07 PM
to Fedora Tech
Hi Yinlin, Josh,

12 June at 11:00 EST will work fine for me.  

Here is an HTML preview of how I think the performance ontology could look. There are a lot of concepts in performance testing that can be specified, so this is still quite basic.

I will also try to make a preliminary SUT ontology, but this is more complex, and after reading this thread, a bit of guidance from the system configuration experts would be helpful.

Christopher Johnson
Scientific Associate
Bibliotheca Albertina, Leipzig

Yinlin

Jun 1, 2017, 8:39:21 AM
to Fedora Tech
Hi Christopher,

That's great. I will announce this meeting in today's tech meeting and also on Fedora Tech.
Please feel free to update the draft agenda: https://wiki.duraspace.org/display/FF/2017-06-12+Performance+-+Scale+meeting

Thanks,
Yinlin