Re: Input Analyzer Arena Download 16

Nelson Suggs

Jul 17, 2024, 1:45:17 PM

to tacomsona

Previously, we examined the modeling of discrete distributions. In this section, we will look at modeling a continuous distribution using the functionality available in R. This example starts with step 3 of the input modeling process. That is, the data has already been collected. Additional discussion of this topic can be found in Chapter 6 of (Law 2007).

The first steps are to visualize the data and check for independence. This can be readily accomplished using the hist, plot, and acf functions in R. Assume that the data is in a file called taskTimes.txt within the R working directory.
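A minimal sketch of these first steps follows. The file taskTimes.txt is not available here, so the vector y is filled with simulated values as a stand-in (an assumption made only so the code runs on its own); the plot titles and axis labels are also illustrative.

```r
# Stand-in data so the sketch is self-contained. With the real file you
# would instead read the observations with: y <- scan("taskTimes.txt")
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical task times

# Histogram: overall shape of the distribution
hist(y, main = "Task Times", xlab = "task time")

# Observations in collection order: look for trends or shifts over time
plot(y, type = "b", main = "Task Times",
     xlab = "Observation #", ylab = "task time")

# Autocorrelation plot: bars outside the dashed bounds suggest dependence
a <- acf(y, main = "ACF Plot for Task Times")
```

If the time-ordered plot shows no pattern and the autocorrelations beyond lag 0 stay within the bounds, treating the observations as independent is reasonable.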

Download https://ssurll.com/2yMWol

An analysis of the statistical properties of the task times can be easily accomplished in R using the summary, mean, var, sd, and t.test functions. The summary command summarizes the distributional properties in terms of the minimum, maximum, median, mean, and 1st and 3rd quartiles of the data.
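The calls named above can be sketched as follows. The data vector is again a simulated stand-in (an assumption), and t.test is used here mainly for the confidence interval it reports on the mean.

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times

print(summary(y))   # min, 1st quartile, median, mean, 3rd quartile, max

xbar <- mean(y)     # sample average
s2   <- var(y)      # sample variance (n - 1 in the denominator)
s    <- sd(y)       # sample standard deviation, sqrt(s2)

tt <- t.test(y)     # one-sample t-test; default confidence level is 95%
print(tt$conf.int)  # confidence interval for the mean
```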

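Define $h_j = c_j/n$ as the relative frequency for the $j^{th}$ interval, where $c_j$ is the number of observations that fall in the interval and $n$ is the total number of observations. Note that $\sum\nolimits_{j=1}^{k} h_j = 1$. A plot of the cumulative relative frequency, $\sum\nolimits_{i=1}^{j} h_i$, for each $j$ is called a cumulative distribution plot. A plot of $h_j$ should resemble the true probability distribution in shape because, according to the mean value theorem of calculus, the probability of an interval equals the interval width times the density evaluated at some point within the interval.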

The number of intervals is a key decision parameter and will affect the visual quality of the histogram and ultimately the chi-squared test statistic calculations that are based on the tabulated counts from the histogram. In general, the visual display of the histogram is highly dependent upon the number of class intervals. If the widths of the intervals are too small, the histogram will tend to have a ragged shape. If the widths of the intervals are too large, the resulting histogram will be very block-like. Two common rules for setting the number of intervals are:
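The rules themselves are not listed in this post. As a hedged illustration only, two rules that are widely used for this purpose are the square-root rule, $k = \sqrt{n}$, and Sturges' rule, $k = 1 + \log_2 n$; whether these are the two intended here is an assumption. The sketch below computes both for a hypothetical sample size and rounds up, which is how R's nclass.Sturges() rounds.

```r
n <- 100                              # hypothetical sample size

k_sqrt    <- ceiling(sqrt(n))         # square-root rule
k_sturges <- ceiling(1 + log2(n))     # Sturges' rule, rounded up

# nclass.Sturges() applies the same formula to the length of a data vector
k_builtin <- nclass.Sturges(numeric(n))

print(c(sqrt = k_sqrt, sturges = k_sturges, builtin = k_builtin))
```

Either value is only a starting point; it is worth redrawing the histogram with a few nearby values of $k$ before settling on one.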

Making a frequency diagram in R is very simple with the hist() function. The hist() function provides the frequency version of the histogram and hist(x, freq=F) provides the density version of the histogram. The hist() function will automatically determine breakpoints using the Sturges rule as its default. You can also provide your own breakpoints in a vector. The hist() function will automatically compute the counts associated with the intervals.

Notice how the hist command returns a result object. In the example, the result object is assigned to the variable $h$. By printing the result object, you can see all the tabulated results. For example, the variable h$counts shows the tabulation of the counts based on the default breakpoints.

The variable h$density holds the relative frequencies divided by the interval length. In terms of notation, this is $f_j = h_j/\Delta b_j$. This is referred to as the density because it estimates the height of the probability density curve.
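A short sketch of working with the result object, assuming simulated stand-in data, is shown next. It checks numerically that h$density equals the relative frequencies divided by the interval widths, which is the relationship $f_j = h_j/\Delta b_j$.

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times

h <- hist(y, plot = FALSE)   # same tabulation as hist(y), without drawing
print(h$breaks)              # default breakpoints
print(h$counts)              # counts c_j per interval
print(h$density)             # density estimates f_j per interval

rel_freq <- h$counts / length(y)   # relative frequencies h_j
width    <- diff(h$breaks)         # interval widths, Delta b_j
print(all.equal(h$density, rel_freq / width))

hist(y, freq = FALSE)        # density version of the histogram
```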

You can also tabulate the counts without using the hist command by providing a vector of breaks to the cut() function and passing the result to the table() command. The following listing illustrates how to do this.
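The original listing is not reproduced in this post. A minimal sketch of the idea, assuming simulated stand-in data and an arbitrary interval width of 5, could look like this:

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times

# Break points 0, 5, 10, ... up to the first multiple of 5 at or above max(y)
b <- seq(0, ceiling(max(y) / 5) * 5, by = 5)

# cut() assigns each observation to an interval (b[j-1], b[j]]
intervals <- cut(y, breaks = b)

# table() counts the observations in each interval
counts <- table(intervals)
print(counts)
```

Passing the same break vector to hist(y, breaks = b) gives the same counts, which is a convenient cross-check.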

For this situation, we will hypothesize that the task times come from a gamma distribution. Therefore, we need to estimate the shape ($\alpha$) and the scale ($\beta$) parameters. In order to do this we can use an estimation technique such as the method of moments or the maximum likelihood method. For simplicity and illustrative purposes, we will use the method of moments to estimate the parameters.

The method of moments is a technique for constructing estimators of the parameters that is based on matching the sample moments (e.g. sample average, sample variance, etc.) with the corresponding distribution moments. This method equates sample moments to population (theoretical) ones. Recall that the mean and variance of the gamma distribution are:
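For a gamma distribution with shape $\alpha$ and scale $\beta$, the mean is $E[X] = \alpha\beta$ and the variance is $Var[X] = \alpha\beta^2$. Setting $\bar{x} = \alpha\beta$ and $s^2 = \alpha\beta^2$ and solving gives $\hat{\beta} = s^2/\bar{x}$ and $\hat{\alpha} = \bar{x}^2/s^2$. The sketch below applies these formulas to simulated stand-in data (an assumption); it uses the usual sample variance with $n-1$ in the denominator.

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times

xbar <- mean(y)   # first sample moment
s2   <- var(y)    # sample variance

beta_hat  <- s2 / xbar      # scale estimate
alpha_hat <- xbar^2 / s2    # shape estimate

print(c(shape = alpha_hat, scale = beta_hat))
```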

\[\begin{aligned}
\chi^2_0 & = \sum\limits_{j=1}^{6} \frac{\left( c_j - np_j \right)^2}{np_j}\\
& = \frac{\left( 6.0 - 7.89\right)^2}{7.89} + \frac{\left(34-28.54\right)^2}{28.54} + \frac{\left(25-28.79\right)^2}{28.79} + \frac{\left(16-18.42\right)^2}{18.42} \\
& + \frac{\left(11-9.41\right)^2}{9.41} + \frac{\left(8-6.95\right)^2}{6.95} \\
& = 2.74
\end{aligned}\]
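The arithmetic in this equation can be checked with a few lines of R, using the observed counts $c_j$ and expected counts $np_j$ exactly as they appear above. The degrees of freedom used for the p-value, $k - s - 1$ with $s = 2$ estimated gamma parameters, is the standard adjustment and is an assumption here, since the post does not state it.

```r
observed <- c(6, 34, 25, 16, 11, 8)                    # c_j
expected <- c(7.89, 28.54, 28.79, 18.42, 9.41, 6.95)   # n * p_j

chisq0 <- sum((observed - expected)^2 / expected)
print(round(chisq0, 2))

# Assumed degrees of freedom: 6 intervals - 2 estimated parameters - 1
dof <- length(observed) - 2 - 1
p_value <- pchisq(chisq0, df = dof, lower.tail = FALSE)
print(p_value)
```

A large p-value means the test gives no reason to reject the hypothesized gamma distribution.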

The Kolmogorov-Smirnov (K-S) Test compares the hypothesized distribution, $\hat{F}(x)$, to the empirical distribution and does not depend on specifying intervals for tabulating the test statistic. The K-S test compares the theoretical cumulative distribution function (CDF) to the empirical CDF by checking the largest absolute deviation between the two over the range of the random variable. The K-S Test is described in detail in (Law 2007), which also includes a discussion of the advantages/disadvantages of the test. For example, (Law 2007) indicates that the K-S Test is more powerful than the Chi-Squared test and has the ability to be used on smaller sample sizes.

To apply the K-S Test, we must be able to compute the empirical distribution function. The empirical distribution is the proportion of the observations that are less than or equal to $x$. Recalling Equation (B.8), we can define the empirical distribution as in Equation (B.12).

Since the empirical distribution function is characterized by the proportion of the data values that are less than or equal to the $i^{th}$ order statistic for each $i=1, 2, \cdots, n$, Equation (B.12) can be re-written as:
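The equations referred to here are not reproduced in this post. As a sketch in assumed notation (the symbol $\tilde{F}_n$ and the order statistics $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ are not taken from the text above), the standard definition of the empirical distribution function and its value at the order statistics, when there are no ties, are:

\[\begin{aligned}
\tilde{F}_n(x) & = \frac{\#\left\{ x_i \leq x \right\}}{n} \\
\tilde{F}_n(x_{(i)}) & = \frac{i}{n}, \quad i = 1, 2, \cdots, n
\end{aligned}\]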

Intuitively, a large value for the K-S test statistic indicates a poor fit between the empirical and the hypothesized distributions. The null hypothesis is that the data comes from the hypothesized distribution. While the K-S Test can also be applied to discrete data, special tables must be used for getting the critical values. Additionally, the K-S Test in its original form assumes that the parameters of the hypothesized distribution are known, i.e. given without estimating from the data. Research on the effect of using the K-S Test with estimated parameters has indicated that it will be conservative in the sense that the actual Type I error will be less than specified.
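As a sketch of how the statistic is computed, the standard form is $D_n = \max(D_n^{+}, D_n^{-})$ with $D_n^{+} = \max_i \{ i/n - \hat{F}(x_{(i)}) \}$ and $D_n^{-} = \max_i \{ \hat{F}(x_{(i)}) - (i-1)/n \}$, where $x_{(i)}$ is the $i^{th}$ order statistic. This notation, the simulated data, and the method of moments gamma fit below are assumptions for illustration. The code computes $D_n$ by hand and compares it with R's ks.test.

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times
n <- length(y)

# Method of moments estimates for the hypothesized gamma distribution
alpha_hat <- mean(y)^2 / var(y)
beta_hat  <- var(y) / mean(y)

# Hypothesized CDF evaluated at the order statistics
Fhat <- pgamma(sort(y), shape = alpha_hat, scale = beta_hat)
i <- seq_len(n)

d_plus  <- max(i / n - Fhat)         # empirical CDF above hypothesized CDF
d_minus <- max(Fhat - (i - 1) / n)   # hypothesized CDF above empirical CDF
d_n     <- max(d_plus, d_minus)

# Built-in test; extra arguments are passed on to pgamma
ks <- ks.test(y, "pgamma", shape = alpha_hat, scale = beta_hat)
print(c(by_hand = d_n, ks.test = unname(ks$statistic)))
```

Because the parameters were estimated from the same data, the p-value reported by ks.test should be read with the caveat given above.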

We have now completed the chi-squared goodness of fit test as well as the K-S test. The Chi-Squared test has more general applicability than the K-S Test. Specifically, the Chi-Squared test applies to both continuous and discrete data; however, it suffers from depending on the interval specification. In addition, it has a number of other shortcomings which are discussed in (Law 2007). Additional advantages and disadvantages of the K-S Test are given in (Law 2007).

There are other statistical tests that have been devised for testing the goodness of fit for distributions. One such test is the Anderson-Darling Test, which is described in (Law 2007). This test detects tail differences and has a higher power than the K-S Test for many popular distributions. It can be found as standard output in commercial distribution fitting software.

Another valuable diagnostic tool is to make probability-probability (P-P) plots and quantile-quantile (Q-Q) plots. A P-P Plot plots the empirical distribution function versus the theoretical distribution evaluated at each order statistic value. Recall that the empirical distribution is defined as:

The Q-Q Plot is similar in spirit to the P-P Plot. For the Q-Q Plot, the quantiles of the empirical distribution (which are simply the order statistics) are plotted versus the quantiles from the hypothesized distribution. Let $0 \leq q \leq 1$ so that the $q^{th}$ quantile of the distribution is denoted by $x_q$ and is defined by:

For example, the z-values for the standard normal distribution tables are the quantiles of that distribution. The quantiles of a distribution are readily available if the inverse CDF of the distribution is available. Thus, the quantile can be defined as:

where $F^{-1}$ represents the inverse of the cumulative distribution function (not the reciprocal). For example, if the hypothesized distribution is N(0,1) then 1.96 = $\Phi^{-1}(0.975)$ so that $x_{0.975}$ = 1.96 where $\Phi(z)$ is the CDF of the standard normal distribution. When you give a probability to the inverse of the cumulative distribution function, you get back the corresponding value on the horizontal axis that is associated with that area under the curve, i.e. the quantile.
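In symbols, the $q^{th}$ quantile satisfies $F(x_q) = q$, so that $x_q = F^{-1}(q)$. In R, the inverse CDF functions carry the prefix q (qnorm, qgamma, and so on) and the CDF functions carry the prefix p. The short sketch below reproduces the N(0,1) example; the gamma parameters in the last line are hypothetical.

```r
x_q <- qnorm(0.975)   # inverse CDF of N(0,1) at q = 0.975
print(round(x_q, 2))  # 1.96

q_back <- pnorm(x_q)  # applying the CDF recovers the probability
print(q_back)

# The same idea for a gamma distribution with hypothetical parameters
x_med <- qgamma(0.5, shape = 3, scale = 4)   # median of that gamma
print(x_med)
```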

Thus, in order to make a P-P Plot, the CDF of the hypothesized distribution must be available and in order to make a Q-Q Plot, the inverse CDF of the hypothesized distribution must be available. When the inverse CDF is not readily available there are other methods for making Q-Q plots for many distributions. These methods are outlined in (Law 2007). The following example will illustrate how to make and interpret the P-P plot and Q-Q plot for the hypothesized gamma distribution for the task times.

The Q-Q plot should appear approximately linear with intercept zero and slope 1, i.e. a 45 degree line, if there is a good fit to the data. In addition, curvature at the ends implies too long or too short tails, convex or concave curvature implies asymmetry, and stragglers at either end may be outliers. The P-P Plot should also appear linear with intercept 0 and slope 1. The abline() function was used to add the reference line to the plots. Figure B.16 illustrates the Q-Q plot. As can be seen in the figures, neither plot appears to show any significant departure from a straight line. Notice that the Q-Q plot is a little off in the right tail.
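The plotting code is not reproduced in this post. A minimal sketch of both plots, assuming simulated stand-in data, a method of moments gamma fit, and the continuity-adjusted empirical probabilities $(i - 0.5)/n$ (a common choice, not confirmed by the text above), might look like this:

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times
n <- length(y)

# Method of moments estimates for the hypothesized gamma distribution
a <- mean(y)^2 / var(y)   # shape
b <- var(y) / mean(y)     # scale

y_sorted <- sort(y)                # order statistics
p_emp <- (seq_len(n) - 0.5) / n    # assumed empirical probabilities

# P-P plot: theoretical CDF at the order statistics vs empirical probabilities
p_theo <- pgamma(y_sorted, shape = a, scale = b)
plot(p_theo, p_emp, main = "P-P Plot for Task Times",
     xlab = "Theoretical probability", ylab = "Empirical probability")
abline(0, 1)   # reference line with intercept 0 and slope 1

# Q-Q plot: theoretical quantiles vs order statistics
q_theo <- qgamma(p_emp, shape = a, scale = b)
plot(q_theo, y_sorted, main = "Q-Q Plot for Task Times",
     xlab = "Theoretical quantile", ylab = "Empirical quantile")
abline(0, 1)
```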

Then, the gofstat function does all the work to compute the chi-square goodness of fit, K-S test statistic, as well as other goodness of fit criteria. The results lead to the same conclusion that we had before: the gamma distribution is a good model for this data.
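The post does not say which package gofstat comes from or how the fit was produced. Assuming it is the gofstat function of the fitdistrplus package applied to a fit from fitdist, a sketch on simulated stand-in data could be as follows. Note that fitdist uses maximum likelihood by default and reports the gamma fit as shape and rate (rate = 1/scale), so its estimates will differ somewhat from method of moments estimates.

```r
set.seed(42)
y <- rgamma(100, shape = 3, scale = 4)   # hypothetical stand-in for the task times

# Run only if the (assumed) fitdistrplus package is installed
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  fit <- fitdistrplus::fitdist(y, "gamma")   # maximum likelihood fit
  print(fit$estimate)                        # shape and rate estimates

  gof <- fitdistrplus::gofstat(fit)          # goodness of fit statistics
  print(gof)                                 # includes K-S, CvM, A-D, AIC, BIC
  print(gof$chisqpvalue)                     # chi-squared p-value
}
```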