# RE: R package to analyze USL models

### Uriel Carrasquilla

Jan 21, 2013, 11:30:18 AM
Hi Stefan.
This is really great work.
It seems to me that given the load and throughput in a queuing center you can calculate the scalability.
I have been trying to get into R but spend most of my time with Perl so please bear with me.
What is not clear to me is the following:
1) did you feed the sigma and kappa coefficients or does the model calculate them?
2) the metrics available to me are not in load and throughput format; can I still use Little's law to derive them?

Now to show my ignorance when it comes to USL models:
Can you please elaborate on what a scalability of 111.5987 means?
Is it something to be compared with another scalability number?
Regards,
Uri

Sent: Sunday, January 20, 2013 3:28 PM
Subject: R package to analyze USL models

Hi!

I would like to announce a first release of my new R package to analyze scalability with USL models. It seems that currently there is no package available for this task and so I started rolling my own according to the method described in GCaP.

The workhorse of this package is the function usl() that solves the USL model just like lm() solves a linear model. Regressor and response are given by an R formula together with a data frame holding the measured values. Currently the function expects to find a value for load=1 in order to perform the normalization.

Here is an example of how the implementation currently works:

library(usl)

# Load the SPEC SDM91 dataset
data(specsdm91)

# Show the data
specsdm91

  load throughput
1    1       64.9
2   18      995.9
3   36     1652.4
4   72     1853.2
5  108     1828.9
6  144     1775.0
7  216     1702.2

# Create usl model from data frame according to formula
usl.model <- usl(throughput ~ load, specsdm91)

# Show model summary
summary(usl.model)

Call:
usl(formula = throughput ~ load, data = specsdm91)

Coefficients:
sigma         kappa
1.704689e-02  7.892498e-05

# Show point of maximum scalability
peak.scalability(usl.model)
[1] 111.5987

# Plot data together with the predicted scalability function
plot(specsdm91)
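The reported peak can also be checked by hand: per GCaP, the maximum of the USL scalability function lies at N* = sqrt((1 - sigma) / kappa), so plugging in the fitted coefficients from the summary above should reproduce the value:

```r
# Sanity check: the USL scalability function
# C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1))
# peaks at N* = sqrt((1 - sigma) / kappa).
sigma <- 1.704689e-02
kappa <- 7.892498e-05
sqrt((1 - sigma) / kappa)
# [1] 111.5987  (matches peak.scalability(usl.model))
```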

There are two data sets from the GCaP book bundled with the package: raytracer from Chapter 5 and specsdm91 from Chapter 6. More work needs to be done on the robustness and documentation of the package.

If you would like to have a look at the package yourself, you can find the R sources on GitHub: https://github.com/smoeding/usl

There are also binary packages for Windows and Mac available from a private package repository. You can install these binary versions for R 2.15 with the following command (tested only on Mac):

Regards,
Stefan

--
You received this message because you are subscribed to the Google Groups "Guerrilla Capacity Planning" group.
To view this discussion on the web visit https://groups.google.com/d/msg/guerrilla-capacity-planning/-/BmgHEFBwzGcJ.
To post to this group, send email to guerrilla-cap...@googlegroups.com.
To unsubscribe from this group, send email to guerrilla-capacity-...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/guerrilla-capacity-planning?hl=en.

### DrQ

Jan 21, 2013, 12:06:15 PM
Uriel,

Most of your questions are answered in this overview of the USL, which summarizes Chaps. 4-6 in my 2007 GCaP book. I haven't had time yet to look at what Stefan has done, but I'm sure I will add to the list of R packages for the USL on that page. He has essentially gathered up all the USL equation/model details into a library so that you can just focus on the model fitting aspects. This will be very helpful for a lot of people.

Since you have been more focused on using PDQ, I'll give you the nickel comparison with the USL approach. Although it's not obvious, the USL is also grounded in queueing theory (Theorem 2008), but you don't need to know or use that fact. The inputs are the load (N), which corresponds to vusers, processes, threads, or whatever actually does work on the system. N is the independent variable. The dependent variable is the corresponding throughput X(N). With some reasonable number of ordered pairs (N, X), we can apply nonlinear regression analysis with the USL as the (nonlinear) model to compute the overall scalability based on the normalized throughputs. This is very different from ordinary statistical regression using some arbitrary model of the type you find in Excel, for example.

You are quite correct in that a similar (though not exactly the same) thing could be accomplished with PDQ. After all, it can also be used to calculate throughput curves (as shown in my Perl::PDQ book). The key difference is this. To use PDQ (for anything) necessitates having service times as input parameters. No service, no queues. If you do not have that information available (which is often the case), you cannot use PDQ to reflect the real world system. You either have to find or develop the instrumentation that will provide those data or give up. The USL skips over that constraint.

It's always a matter of what trade-offs you are able or willing to make.

### M Edward Borasky

Jan 23, 2013, 6:46:35 PM
I wish I still had my R USL code - it was quite involved, actually. I
can probably re-create it, though. It was robust, in the sense of
ignoring outliers, fitted the throughput for N=1 if that wasn't given,
and gave confidence intervals for the fitted parameters.

--
Workbench: http://znmeb.github.com/Computational-Journalism-Publishers-Workbench/

How the Hell can the lion sleep with all those people singing "A weem
oh way!" at the top of their lungs?

### Uriel Carrasquilla

Jan 23, 2013, 11:43:43 PM
I am so intrigued that I bought the 2007 GCaP book.
I thought I had it but could not find it.
I should have it by Friday if Amazon delivers as promised.
Not to get ahead of myself: if I had a cluster of servers, each with identical, inexpensive hardware and each doing proxy work (i.e., a hit ratio that determines how quickly a query is handled), I should be able to use R or Excel to get the two coefficients.
Based on those two coefficients I can then make predictions from the fitted USL chart.
What should N be: the number of machines in the cluster, the number of HTTP/HTTPS requests, or the number of Gbits of traffic?
And X(N): the number of requests/sec processed, or Gbits/sec?
Does it matter?
I have those details.
How do I find the residence time?
Uri

Sent: Monday, January 21, 2013 12:06 PM
Subject: Re: R package to analyze USL models


### DrQ

Jan 24, 2013, 1:02:47 PM
> I bought the 2007 GCaP book.

OK. Done. Sold. Next? :)

> Not to get ahead of myself,

It sounds almost silly to say it, but you first have to decide what modeling question you want to address. Is it software scalability or hardware scalability? This is part of how you decompose any problem into its relevant components. Later, you may decide to merge the solutions back together. In that case, you will end up with some kind of 3-dimensional surface, like Fig. 4.10.

To get a better idea of how to apply the USL to clusters, see Sect. 4.6. In general, you will need to have 2 sets of USL coeffs: intranode and internode.

### M Edward Borasky

Jan 26, 2013, 4:22:54 PM
The strategy is simple:

1. Write the *throughput* equation, not the capacity equation. The
capacity equation assumes you know X(1) and factors it out.
2. Build a data frame of N and X values. If you know X(1) specify it
in the data frame.
3. Do a nonlinear least squares fit to the throughput equation to
solve for the two (or three if you don't know X(1)) unknown
parameters.
4. Bonus points:
a. Robust fit: there are tactics to ignore outliers. Iterative
re-weighting is one way to do it if you don't have a canned routine.
See _Data Analysis and Regression: A Second Course in Statistics_
http://www.amazon.com/Data-Analysis-Regression-Second-Statistics/dp/020104854X
for the gory details.
b. It's actually a *constrained* least squares fit. Both parameters
are greater than or equal to 0 and one (I forget which one and don't
have the book handy) is less than or equal to 1. That's the part I
don't remember - whether there's a "native" R routine for a
constrained robust non-linear least squares or if I rolled my own from
one of the general minimization routines. I did this a long time ago -
2007, IIRC - and R may have gained such a routine if it didn't have
one then.
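A rough sketch of steps 1-3 in R, using nls() with the "port" algorithm so the non-negativity constraints can be expressed as bounds. The data are the SPEC SDM91 values quoted earlier in the thread; the start values are just ballpark guesses:

```r
# USL throughput equation: X(N) = lambda*N / (1 + sigma*(N-1) + kappa*N*(N-1)).
# lambda stands in for the unknown X(1), so three parameters are fitted.
specsdm91 <- data.frame(
  load       = c(1, 18, 36, 72, 108, 144, 216),
  throughput = c(64.9, 995.9, 1652.4, 1853.2, 1828.9, 1775.0, 1702.2)
)

fit <- nls(
  throughput ~ lambda * load / (1 + sigma * (load - 1) + kappa * load * (load - 1)),
  data      = specsdm91,
  start     = list(lambda = 60, sigma = 0.02, kappa = 0.0001),
  lower     = c(lambda = 0, sigma = 0, kappa = 0),
  algorithm = "port"
)

coef(fit)  # lambda (~ X(1)), plus the sigma and kappa coefficients
```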

On Sat, Jan 26, 2013 at 4:14 AM, Stefan Moeding <s.mo...@gmail.com> wrote:
> Hi!
>
>
> On Thursday, January 24, 2013 12:46:35 AM UTC+1, M Edward Borasky wrote:
>>
>> I wish I still had my R USL code - it was quite involved, actually. I
>> can probably re-create it, though. It was robust, in the sense of
>> ignoring outliers, fitted the throughput for N=1 if that wasn't given,
>> and gave confidence intervals for the fitted parameters.
>
>
> That sounds really neat.
> Maybe you could point out the methods you used for the N=1 estimation?
>
> I am thinking about a linear estimation without intercept for a subset of the
> data. Getting the right/best subset is the tricky part.
>
> Regards,
> Stefan

### DrQ

Jan 26, 2013, 4:30:48 PM
You can also use the optimize function with callback.

### M Edward Borasky

Jan 26, 2013, 4:28:31 PM
Bonus point c. Confidence intervals for the parameters. If your
nonlinear least squares solver doesn't give them, use a bootstrap or
jackknife resampling wrapper to compute them.
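A bootstrap wrapper along those lines might look like the sketch below, where fit_fn is a placeholder for whatever routine returns the fitted parameter vector for a data frame of load/throughput pairs:

```r
# Percentile bootstrap confidence intervals for fitted USL parameters.
# fit_fn(df) is assumed to return a named numeric vector of parameters.
# In practice, wrap fit_fn in tryCatch() to skip resamples where the
# fit fails to converge.
boot_ci <- function(df, fit_fn, B = 1000, level = 0.95) {
  est <- replicate(B, {
    resampled <- df[sample(nrow(df), replace = TRUE), ]
    fit_fn(resampled)
  })
  alpha <- (1 - level) / 2
  # est has one row per parameter, one column per bootstrap replicate
  apply(est, 1, quantile, probs = c(alpha, 1 - alpha))
}
```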

Jan 26, 2013, 4:48:51 PM
IIRC you did it as a one-liner in Mathematica. I wonder if you could
just paste that code into Wolfram|Alpha and have your browser do the
hard work. ;-)
### DrQ

Jan 27, 2013, 1:32:42 PM
re: Teamquest White Paper

Storm in a numerical teacup. Rest assured that Mathematica and R do the right thing.

On Sunday, January 27, 2013 9:14:27 AM UTC-8, Stefan Moeding wrote:
Hi!

On Saturday, January 26, 2013 10:22:54 PM UTC+1, M Edward Borasky wrote:
The strategy is simple:
[...]

Ah, I see. I was thinking about a way to use a linear fit to predict X(1), while you
used a nonlinear fit for the whole equation including sigma and kappa.

It looks like nls() can solve these and there is also an option to add constraints.

Talking about different formulas, here is one for DrQ:

On www.teamquest.com under Resources->White Papers there is a paper
'Reevaluating "Evaluating Scalability Parameters: A Fitting End"'. It includes
an adaptation of the way sigma and kappa are calculated and states that this
calculation leads to a better match between measurements and the calculated fit.
Any thoughts on that?

Regards,
Stefan

### Baron Schwartz

Jan 27, 2013, 2:22:47 PM

### M Edward Borasky

Jan 27, 2013, 3:46:22 PM
On Sun, Jan 27, 2013 at 9:14 AM, Stefan Moeding <s.mo...@gmail.com> wrote:
> Hi!
>
> On Saturday, January 26, 2013 10:22:54 PM UTC+1, M Edward Borasky wrote:
>>
>> The strategy is simple:
>>
>> [...]
>
>
> Ah, I see. I was thinking about a way to use a linear fit to predict X(1)
> while you used a nonlinear fit for the whole equation including sigma and kappa.
>
> It looks like nls() can solve these and there is also an option to add
> constraints.

It's starting to come back to me. My view at the time was colored by
_Data Analysis and Regression: A Second Course in Statistics_ *and*
_Compact Numerical Methods for Computers: Linear Algebra and Function
Minimisation_ by John C. Nash. I took the Excel spreadsheets from the
PerfDynamics web site and extracted the raw data from the USL test
cases. I tried all the nonlinear fitting tools I could find in R, but
there were a couple of those datasets which crashed them, usually
because of a data pattern that violated the model.

The thing that worked "best" for the most cases was to write it as an
optimization problem and use the Nelder-Mead minimizer in "optim" to
do the minimization. "optim" has since been replaced / extended by
"optimx". Writing it as an optimization lets you do either least
squares (minimize the sum of the squared residuals) *or* the more
robust "least absolute deviation (LAD)". I think I allowed either
option. Nelder-Mead is hopelessly slow for large problems, but it's
fine for tiny ones like this and it almost always converges to a
global minimum.

What you lose by doing it this way is confidence intervals. You can
get those back via a bootstrap or jackknife, but I didn't go that far.
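A condensed sketch of that optimization approach, again using the SPEC SDM91 values quoted earlier and rough start values:

```r
# Fit the USL throughput equation by direct minimization with optim()
# (Nelder-Mead by default). Setting lad = TRUE switches the objective
# from least squares to the more robust least absolute deviation.
load       <- c(1, 18, 36, 72, 108, 144, 216)
throughput <- c(64.9, 995.9, 1652.4, 1853.2, 1828.9, 1775.0, 1702.2)

usl_x <- function(p, N) p[1] * N / (1 + p[2] * (N - 1) + p[3] * N * (N - 1))

objective <- function(p, lad = FALSE) {
  r <- throughput - usl_x(p, load)
  if (lad) sum(abs(r)) else sum(r^2)
}

optim(c(60, 0.02, 0.0001), objective)$par              # least squares
optim(c(60, 0.02, 0.0001), objective, lad = TRUE)$par  # least absolute deviation
```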

### DrQ

Jan 27, 2013, 4:22:24 PM
Before we get too carried away with all the mathematical minutiae, let's not lose sight of the fact that we generally don't even know the level of accuracy of the X(N) data. Usually, we only have one sample, and are supposed to be grateful for having that. Like the superluminal neutrino guys, measuring the wrong thing to a 6σ confidence level doesn't make it correct.

Very often, I find that just calculating the efficiencies (C(N)/N), which I need for the USL regression, already reveals something funky in the measurements. Sometimes efficiencies are bigger than 100%. What? Sometimes a lot less than 100%. What? Somebody has a lot of explaining to do (and it isn't me).
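That sanity check takes a couple of lines of R once the throughputs are normalized (again using the SPEC SDM91 numbers quoted earlier in the thread):

```r
# Efficiency check: C(N) = X(N)/X(1), efficiency = C(N)/N.
# Values above 1 (i.e. > 100%) flag suspect measurements.
load       <- c(1, 18, 36, 72, 108, 144, 216)
throughput <- c(64.9, 995.9, 1652.4, 1853.2, 1828.9, 1775.0, 1702.2)

efficiency <- (throughput / throughput[1]) / load
round(efficiency, 3)
```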

The process of applying the USL (or trying to) is often more valuable than the accuracy of any predictions it might make. That's because the USL forces the measurements to be viewed within a formal scalability framework that otherwise does not exist.

### DrQ

Feb 8, 2013, 12:34:53 PM
Just cite the source and you should be fine.

@BOOK{njgBOOK07,
  AUTHOR    = {Neil J. Gunther},
  TITLE     = {Guerrilla Capacity Planning:
               A Tactical Approach to Planning for Highly Scalable Applications and Services},
  EDITION   = {1st},
  PUBLISHER = {Springer},
  YEAR      = 2007,
}

On Friday, February 8, 2013 12:18:25 AM UTC-8, Stefan Moeding wrote:

I did some more work on the package.

The summary now includes some statistics about the distribution of the efficiency. There is also the method efficiency() to extract the values from the model.

Nonlinear regression is implemented with two different algorithms. One uses the standard nls() function and one is built upon the nlmrt package. The author of that package is the already-mentioned John C. Nash. I expect it to be more robust than the standard nls() function.

There is certainly room for improvement but I believe the package can already be used for some analysis work. Therefore I would like to submit the package to CRAN and make it available to a broader audience.

Since the work is derived from GCaP I need the confirmation that the distribution of the source code does not violate any copyright/license statements. So to comply with the CRAN policy I would like to ask if there are any objections to this plan.

Regards,

Stefan

### M Edward Borasky

Feb 10, 2013, 1:23:39 AM
Should the R version of PDQ also be in CRAN?

--
http://j.mp/CompJournoStick/

The National Coal Institute reminds you, "There's no fuel like an old fuel."

### DrQ

Feb 10, 2013, 11:56:54 AM
Naturally, but not yet. We're working toward ensuring that PDQ satisfies the CRAN criteria, and version 6.0.1 sets the foundation.

### Uriel Carrasquilla

Feb 12, 2013, 10:58:15 AM
Great, installed without a glitch.
What was the end result of the conversation when throughput for load=1 is not available?
Do I still need to estimate it?
Regards,
Uri

Sent: Tuesday, February 12, 2013 2:35 AM

Subject: Re: R package to analyze USL models

I am happy to announce the availability of the usl 1.0.0 package on CRAN.
You should be able to install the package from your preferred mirror by simply calling:

install.packages("usl")

Regards,
Stefan


### Stefan Parvu

Feb 12, 2013, 10:56:22 AM
>
> install.packages("usl")
>

http://cran.r-project.org/web/packages/usl/index.html
sweet. Many thanks indeed.

stefan

### DrQ

Feb 12, 2013, 3:26:08 PM
OK, you are now in the USL Hall of Fame and may proceed to quit your day-job. :)

Feb 13, 2013, 1:15:01 AM
On Tue, Feb 12, 2013 at 12:26 PM, DrQ <red...@yahoo.com> wrote:
> OK, you are now in the USL Hall of Fame and may proceed to quit your
> day-job. :)

And it saves me the trouble of hunting through my backups for my old code. ;-)

### Mohan

Feb 27, 2013, 7:37:29 PM
Hi,
Can I ask if there is help to split the USL function into its constituent parts and understand each part separately? That way beginners would know the importance of each statistical idea that forms the entire function. Is that in the books?

Thanks,
Mohan

On Wed, Feb 27, 2013 at 1:05 PM, Stefan Moeding wrote:
Release 1.1.0 of my usl package has just appeared on CRAN.
It may take another day until your mirror has all the binaries.

The update has only a small functional change. The efficiency() function now returns a named vector to show which efficiency value corresponds to which load value.

The main reason for the update is the vignette for the package. It includes step-by-step examples and should help with the general analysis approach.

After installation of the updated package you can access the vignette with the following R command:

R> vignette("usl")

The vignette is also accessible from the package page on http://cran.r-project.org/web/packages/usl/index.html