scalability testing

Nick Ruest

Jan 17, 2013, 10:51:38 AM
to dataverse...@googlegroups.com
Hi folks,

I am working on a platform analysis report for my university library,
and in it we are addressing scalability testing. I've searched around
and don't really see anything on scalability testing. Does anybody know
if such a test has ever been done?

cheers!

-nruest

Philip Durbin

Jan 17, 2013, 11:07:03 AM
to dataverse...@googlegroups.com
Hi Nick,
By scalability are you wondering how many gigabytes of studies a
Dataverse Network can hold?

"Dataverse now offers access to more social science data than any
other system in the world" according to the paper at
http://gking.harvard.edu/publications/restructuring-social-science

Are you wondering how many front end Glassfish servers you might need
to handle the traffic? (We have two for https://dvn.iq.harvard.edu for
example.)

Can you please elaborate on what you mean by testing? Do you mean
something like ApacheBench?
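
For instance, a minimal smoke test (the hostname here is just a
placeholder, and you'd want to point it at a test instance rather than
production) might be:

  ab -n 1000 -c 10 https://dvn.example.edu/dvn/

That sends 1,000 requests with 10 concurrent connections and reports
requests per second and response-time percentiles.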

Thanks,

Phil

--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Nick Ruest

Jan 17, 2013, 3:23:11 PM
to dataverse...@googlegroups.com
Good follow-ups!


On 13-01-17 11:07 AM, Philip Durbin wrote:
> Hi Nick,
>
> On Thu, Jan 17, 2013 at 10:51 AM, Nick Ruest <rue...@gmail.com> wrote:
>> I am working on a platform analysis report for my university library, and in
>> it we are addressing scalability testing. I've searched around and don't
>> really see anything on scalability testing. Does anybody know if such a test
>> has ever been done?
>
> By scalability are you wondering how many gigabytes of studies a
> Dataverse Network can hold?

Nah. I assume this is limited by hardware, so whatever you could throw
at it storage-wise would be what you get.

> "Dataverse now offers access to more social science data than any
> other system in the world" according to the paper at
> http://gking.harvard.edu/publications/restructuring-social-science

Checking this out right now. Thanks!

> Are you wondering how many front end Glassfish servers you might need
> to handle the traffic? (We have two for https://dvn.iq.harvard.edu for
> example.)
> Can you please elaborate on what you mean by testing? Do you mean
> something like ApacheBench?

Yeah. I should have been more specific ;)

Really, what I was looking for was an object limit, or the largest number
of objects anybody has had in a system during a scalability test. If that
is in the above document, then I'll have my answer.

Thanks for the quick reply!

cheers!

-nruest

> Thanks,
>
> Phil
>

Philip Durbin

Jan 17, 2013, 3:26:39 PM
to dataverse...@googlegroups.com
On Thu, Jan 17, 2013 at 3:23 PM, Nick Ruest <rue...@gmail.com> wrote:
>> By scalability are you wondering how many gigabytes of studies a
>> Dataverse Network can hold?
>
> Nah. I assume this is limited by hardware, so whatever you could throw at
> it storage-wise would be what you get.

Right. Just buy more storage, basically. I mean, there must be some
upper limit somewhere, but I think we understand each other.

>> "Dataverse now offers access to more social science data than any
>> other system in the world" according to the paper at
>> http://gking.harvard.edu/publications/restructuring-social-science
>
>
> Checking this out right now. Thanks!
>
>
>> Are you wondering how many front end Glassfish servers you might need
>> to handle the traffic? (We have two for https://dvn.iq.harvard.edu for
>> example.)
>> Can you please elaborate on what you mean by testing? Do you mean
>> something like ApacheBench?
>
>
> Yeah. I should have been more specific ;)
>
> Really, what I was looking for was an object limit, or the largest number
> of objects anybody has had in a system during a scalability test. If that
> is in the above document, then I'll have my answer.

Hmm, it won't be... I don't actually have the specifics handy at the moment.

Condon, Kevin

Jan 17, 2013, 4:27:31 PM
to dataverse...@googlegroups.com

>
>> Are you wondering how many front end Glassfish servers you might need
>> to handle the traffic? (We have two for https://dvn.iq.harvard.edu for
>> example.)
>> Can you please elaborate on what you mean by testing? Do you mean
>> something like ApacheBench?
>
>Yeah. I should have been more specific ;)
>
>Really, what I was looking for was an object limit, or the largest number
>of objects anybody has had in a system during a scalability test. If that
>is in the above document, then I'll have my answer.

Nick, that kind of information is not in the document. We have not done
formal scalability testing as such, though we have stress-tested specific
functional areas that needed improvement. The type of load generated by
different functions varies quite a bit: harvesting a large remote
repository, ingesting large, complex data files, running analytical
models on files of varying complexity, searching on study metadata,
downloading an entire file or subset, converting one file format to
another, or just general web browsing.

At this stage we could describe what we use for production equipment and
attempt to address any concerns you may have about your intended use case.

We are currently deploying our service within our library, and part of that
effort is to measure actual usage more closely and size the system accordingly.

Kevin

Nick Ruest

Jan 17, 2013, 4:58:58 PM
to dataverse...@googlegroups.com


On 13-01-17 04:27 PM, Condon, Kevin wrote:
>
>>
>>> Are you wondering how many front end Glassfish servers you might need
>>> to handle the traffic? (We have two for https://dvn.iq.harvard.edu for
>>> example.)
>>> Can you please elaborate on what you mean by testing? Do you mean
>>> something like ApacheBench?
>>
>> Yeah. I should have been more specific ;)
>>
>> Really, what I was looking for was an object limit, or the largest number
>> of objects anybody has had in a system during a scalability test. If that
>> is in the above document, then I'll have my answer.
>
> Nick, that kind of information is not in the document. We have not done
> formal scalability testing as such, though we have stress-tested specific
> functional areas that needed improvement. The type of load generated by
> different functions varies quite a bit: harvesting a large remote
> repository, ingesting large, complex data files, running analytical
> models on files of varying complexity, searching on study metadata,
> downloading an entire file or subset, converting one file format to
> another, or just general web browsing.
>
> At this stage we could describe what we use for production equipment and
> attempt to address any concerns you may have about your intended use case.

I would be really interested to hear about this.

The specific item we are addressing in the document is: "Scalability --
The purpose of this measure is to comment on the ability of the software
to handle a sufficiently large number of objects."

I believe all that has been provided thus far, along with the production
examples, should be sufficient for me to address this section. It sounds
like, given enough hardware, you can throw whatever you want at it,
within reason.

> We are currently deploying our service within our library, and part of that
> effort is to measure actual usage more closely and size the system accordingly.
>
> Kevin
>

-nruest

Stephen Marks

Jan 17, 2013, 5:07:19 PM
to dataverse...@googlegroups.com
Hi Nick--  =)

Your interpretation more or less squares with what we've found here at SP. FWIW, we've found the most taxing operations to be forms of ingest, be they through the web interface or through harvesting.

If the IQSS DVN is running off of two Glassfish servers, that makes me happy, because it seems we've got a long time left before we outgrow our current setup. Thanks for the insights, Kevin & Philip.

Steve

Condon, Kevin

Jan 17, 2013, 5:33:18 PM
to dataverse...@googlegroups.com


>>
>> At this stage we could describe what we use for production equipment and
>> attempt to address any concerns you may have about your intended use
>>case.
>
>I would be really interested to hear about this.
>
>The specific item we are addressing in the document is: "Scalability --
>The purpose of this measure is to comment on the ability of the software
>to handle a sufficiently large number of objects."
>
>I believe all that has been provided thus far, along with the production
>examples, should be sufficient for me to address this section. It sounds
>like, given enough hardware, you can throw whatever you want at it,
>within reason.
>
>-nruest

For an idea of objects in the system, our IQSS DVN currently has the
following: Dataverses: 494 | Studies: 51,637 | Files: 719,823

Dataverses are virtual repositories for individual researchers as well as
organizations. Studies can be either references to actual studies or
cataloging metadata describing the work. Files, of course, are
individual data files and supporting documents.

Prior to migrating our infrastructure to the library this past October, we
were running our IQSS DVN on the following configuration:
2 web server instances, each with 2 quad-core CPUs and 48GB RAM, running
RHEL 5 (we're now running RHEL 6)
1 database server with 2 six-core CPUs and 64GB RAM, running RHEL 5 and
PostgreSQL 8.4
1 Rserve server for statistical analysis and format transformations, with
2 six-core CPUs and 64GB RAM
Load balanced by a Coyote Point Equalizer, model 450

This was a deluxe configuration intended to support whatever we threw at
it. Our new configuration is similar, but running RHEL 6 and on bigger
machines, due to the availability of newer equipment and the requirement
to support additional, separate DVNs for the library.

To economize and plan for future scalability, we're looking at measuring
actual usage more closely and breaking out certain functions onto
dedicated machines.


Philip Durbin

Jan 24, 2013, 1:57:39 PM
to dataverse...@googlegroups.com
On Thu, Jan 17, 2013 at 5:33 PM, Condon, Kevin <kco...@hmdc.harvard.edu> wrote:
> For an idea of objects in the system, our IQSS DVN currently has the
> following: Dataverses: 494 | Studies: 51,637 | Files: 719,823

As of today this represents close to one terabyte on disk.
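
(Back-of-the-envelope: one terabyte spread across 719,823 files works out
to roughly 1.4MB per file on average.)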