<>= options(SweaveHooks=list(fig=function(){par(cex.main=1.1, mar=c(4.1,3.6,1.6,1.1), pty="s", mgp=c(2.25,0.5,0), tck=-0.02)})) @% \part{Sampling Distributions, \& the Central Limit Theorem -- Lab 3a} For additional background, see SPDM: Part II. \section{Sampling Distributions} The exercises that follow demonstrate the sampling distribution of the mean, for various different population distributions. More generally, sampling distributions of other statistics may be important. Inference with respect to means is commonly based on the sampling distribution of the mean, or of a difference of means, perhaps scaled suitably. The ideas extend to the statistics (coefficients, etc) that arise in regression or discriminant or other such calculations. These ideas are important in themselves, and will be useful background for later laboratories and lectures. Here, it will be assumed that sample values are independent. There are several ways to proceed. \begin{itemize} \item The distribution from which the sample is taken, although not normal, is assumed to follow a common standard form. For example, in the life testing of industrial components, an exponential or Weibull distribution might be assumed. The relevant sampling distribution can be estimated by taking repeated random samples from this distribution, and calculating the statistic for each such sample. \item If the distribution is normal, then the sample distribution of the mean will also be normal. Thus, taking repeated random samples is unnecessary; theory tells us the shape of the distribution. \item Even if the distribution is not normal, the Central Limit Theorem states that, by taking a large enough sample, the sampling distribution can be made arbitrarily close to normal. Often, given a population distribution that is symmetric, a sample of 4 or 5 is enough, to give a sampling distribution that is for all practical purposes normal. \item The final method [the "bootstrap"] that will be described is empirical. The distribution of sample values is treated as if it were the population distribution. The form of the sampling distribution is then determined by taking repeated random with replacement samples (bootstrap samples), of the same size as the one available sample, from that sample. The value of the statistic is calculated for each such bootstrap sample. The repeated bootstrap values of the statistic are used to build a picture of the sampling distribution. With replacement samples are taken because this is equivalent to sampling from a population in which each of the avaialble sample values is repeated an infinite number of times. \end{itemize} The bootstrap method obviously works best if the one available sample is large, thus providing an accurate estimate of the population distribution. Likewise, the assumption that the sampling distribution is normal is in general most reasonable if the one available sample is of modest size, or large. Inference is inevitably hazardous for small samples, unless there is prior information on the likely form of the distribution. As a rough summary, I hazard the following comments: \begin{itemize} \item Simulation (repeated resampling from a theoretical distribution or distributions) is useful \begin{itemize} \item as a check on theory (the theory may be approximate, or of doubtful relevance) \item where there is no adequate theory \item to provide insight, especially in a learning context. \end{itemize} \item The bootstrap (repeated resampling from an empirical distribution or distributions) can be useful \begin{itemize} \item when the sample size is modest and uncertainty about the distributional form may materially affect the assessment of the shape of the sampling distribution; \item when standard theoretical models for the population distribution seem unsatisfactory. \end{itemize} \end{itemize} The idea of a sampling distribution is wider than that of a sampling distribution of a statistic. It can be useful to examine the sampling distribution of a graph, i.e., to examine how the shape of a graph changes under repeated bootstrap sampling. \begin{fmpage}{36pc} \exhead{1} First, take a random sample from the normal distribution, and plot the estimated density function: <>= y <- rnorm(100) plot(density(y), type="l") @ Now take repeated samples of size 4, calculate the mean for each such sample, and plot the density function for the distribution of means: <>= av <- numeric(100) # will take 100 means for (i in 1:100){ av[i] <- mean(rnorm(4)) } lines(density(av), col="red") @% Repeat the above: taking samples of size 9, and of size 25. \end{fmpage} \begin{fmpage}{36pc} \exhead{2} It is also possible to take random samples, usually with replacement, from a vector of values, i.e., from an empirical distribution. This is the bootstrap idea. Again, it may of interest to study the sampling distributions of means of different sizes. Consider the distribution of heights of female Adelaide University students, in the data frame \texttt{survey} (\textit{MASS} package). The following takes 100 bootstrap samples of size 4, calculating the mean for each such sample: <>= library(MASS) y <- na.omit(survey[survey$Sex=="Female", "Height"]) av <- numeric(100) for(i in 1:100)av[i] <- mean(sample(y, 4, replace=TRUE)) @ % Repeat, taking samples of sizes 9 and 16. In each case, use a density plot to display the (empirical) sampling distribution. \end{fmpage} \begin{fmpage}{36pc} \exhead{3} Repeat exercise 1 above: (a) taking values from a uniform distribution (replace \texttt{rnorm(4)} by \texttt{runif(4)}); (b) from an exponential distribution with rate 1 (replace \texttt{rnorm(4)} by \texttt{rexp(4, rate=1)}).\newline [As noted above, density plots are not a good tool for assessing distributional form. They are however quite effective, as here, for showing the reduction in the standard deviation of the sampling distribution of the mean as the sample size increases. The next exercise but one will repeat the comparisons, using normal probability plots in place of density curves.] \end{fmpage} \begin{fmpage}{36pc} \exhead{4} The final exercise of Assignment 2 compared density plots, for several of the variables, between rows that had one or more missing values and those that had no missing values. We can use the bootstrapping idea to ask how apparent differences stand up against repeated simulation. The distribution for \texttt{bmi} looked as though it might have shifted a bit, for data where one or more rows was missing, relative to other rows. We can check whether this apparent shift is consistent under repeated sampling. Here again is code for the graph for \texttt{bmi} <>= library(MASS) library(lattice) complete <- complete.cases(Pima.tr2) completeF <- factor(c("oneORmore", "none")[as.numeric(complete)+1]) Pima.tr2$completeF <- completeF densityplot(~bmi, groups=completeF, data=Pima.tr2, auto.key=list(columns=2)) # Use completeF in place of complete, for better labeling @ % Here is code for taking one bootstrap sample from each of the two categories of row, then repeating the density plot. <>= rownum <- seq(along=complete) # generate row numbers allpresSample <- sample(rownum[complete], replace=TRUE) # By default, sample size is the number of vales of the first argument NApresSample <- sample(rownum[!complete], replace=TRUE) densityplot(~bmi, groups=completeF, data=Pima.tr2, auto.key=list(columns=2), subset=c(allpresSample, NApresSample)) @ % Wrap these lines of code in a function. Repeat the formation of the bootstrap samples and the plots several times. Does the shift in the distribution seem consistent under repeating sampling? \end{fmpage} \begin{fmpage}{36pc} \exhead{5} More commonly, one compares examines the displacement, under repeated sampling, of one mean relative to the other. Here is code for the calculation: <>= twot <- function(n=99){ complete <- complete.cases(Pima.tr2) rownum <- seq(along=complete) # generate row numbers d2 <- numeric(n+1) d2[1] <- with(Pima.tr2, mean(bmi[complete], na.rm=TRUE)- mean(bmi[!complete], na.rm=TRUE)) for(i in 1:n){ allpresSample <- sample(rownum[complete], replace=TRUE) NApresSample <- sample(rownum[!complete], replace=TRUE) d2[i+1] <- with(Pima.tr2, mean(bmi[allpresSample], na.rm=TRUE)- mean(bmi[NApresSample], na.rm=TRUE)) } d2 } # NB: There are n+1 values in all; the mean difference for the # actual sample values, plus differences for n bootstrap samples ## Now estimate and plot the density function d2 <- twot(n=999) dens <- density(d2) plot(dens) ## Check the proportion of values of d2 that are less than or equal to 0 sum(d2<0)/length(d2) @ % Those who are familiar with $t$-tests may recognize the final calculation as a bootstrap equivalent of the $t$-test. \end{fmpage} \vspace*{9pt} \begin{fmpage}{36pc} \exhead{6} The range that contains the central 95\% of values of \texttt{d2} gives a 95\% confidence (or coverage) interval for the mean difference. Given that there are 1000 values in total, the interval is the range from the 26th to the 975th value, when values are sorted in order of magnitude, thus: <<95% CI>>= round(sort(d2)[c(26, 975)], 2) @ % Repeat the calculation of \texttt{d2} and the calculation of the resulting 95\% confidence interval, several times. \end{fmpage} \vspace*{9pt} \section{The Central Limit Theorem} Theoretically based $t$-statistic and related calculations rely on the assumption that the sampling distribution of the mean is normal. The Central Limit Theorem assures that the distribution will for a large enough sample be arbitrarily close to normal, providing only that the population distribution has a finite variance. Simulation of the sampling distribution is especially useful if the population distribution is not normal, providing an indication of the size of sample needed for the sampling distribution to be acceptably close to normal. \begin{fmpage}{36pc} \exhead{7} The function \texttt{simulateSampDist()} (\textit{DAAGxtras}) allows investigation of the sampling distribution of the mean or other stastistic, for an arbitrary population distribution and sample size. Figure \ref{fig:sim-density} shows sampling distributions for samples of sizes 4 and 9, from a normal population. The function call is <>= library(DAAGxtras) sampvalues <- simulateSampDist(numINsamp=c(4,9)) plotSampDist(sampvalues=sampvalues, graph="density", titletext=NULL) @% Experiment with sampling from normal, uniform, exponential and $t_2$-distributions. What is the effect of varying the value of \texttt{numsamp}? \newline [To vary the kernel and/or the bandwidth used by \texttt{density()}, just add the relevant arguments in the call to to \texttt{simulateSampDist()}, e.g. \texttt{sampdist(numINsamp=4, bw=0.5)}. Any such additional arguments (here, \texttt{bw}) are passed via the \texttt{\ldots} part of the parameter list.] \end{fmpage} \setkeys{Gin}{width=\textwidth} \begin{figure}[H] \begin{center} \begin{minipage}[c]{0.375\linewidth} <>= sampvalues <- simulateSampDist(numINsamp=c(4,9)) plotSampDist(sampvalues=sampvalues, graph="density", titletext=NULL) @ \end{minipage} \hspace*{0.05\linewidth} \begin{minipage}[c]{0.4\linewidth} \caption{Empirical density curve, for a normal population and for the sampling distributions of means of samples of sizes 4 and 9 from that population.}\label{fig:sim-density} \end{minipage} \end{center} \end{figure} \begin{fmpage}{36pc} \exhead{8} The function \texttt{simulateSampDist()} has an option (\texttt{graph="qq"}) that allows the plotting of a normal probability plot. Alternatively, by using the argument \texttt{graph=c("density","qq")}, the two types of plot appear side by side, as in Figure \ref{fig:density-qq}. Figure \ref{fig:density-qq} is an example of its use. <>= sampvalues <- simulateSampDist() plotSampDist(sampvalues=sampvalues, graph=c("density", "qq")) @% In the right panel, the slope is proportional to the standard deviation of the distribution. For means of a sample size equal to 4, the slope is reduced by a factor of 2, while for a sample size equal to 9, the slope is reduced by a factor of 3. Comment in each case on how the spread of the density curve changes with increasing sample size. How does the qq-plot change with increasing sample size? Comment both on the slope of a line that might be passed through the points, and on variability about that line. \end{fmpage} \setkeys{Gin}{width=0.8\textwidth} \begin{figure}[H] \begin{center} <>= <> @ \end{center} \caption{Empirical density curves, for a normal population and for the sampling distributions of means of samples of sizes 4 and 9, are in the left panel. The corresponding normal probability plots are shown in the right panel.}\label{fig:density-qq} \end{figure} \begin{fmpage}{36pc} \exhead{9} How large is "large enough", so that the sampling distribution of the mean is close to normal? This will depend on the population distribution. Obtain the equivalent for Figure \ref{fig:density-qq}, for the following populations: \begin{enumerate} \item A $t$-distribution with 2 degrees of freedom\newline [\texttt{rpop = function(n)rt(n, df=2)}] \item A log-normal distribution, i.e., the logarithms of values have a normal distribution\newline [\texttt{rpop = function(n, c=4)exp(rnorm(n)+c)}] \item The empirical distribution of heights of female Adelaide University students, in the data frame \texttt{survey} (\textit{MASS} package). In the call to \texttt{simulateSampDist()}, the parameter \texttt{rpop} can specify a vector of numeric values. Samples are then obtained by sampling with replacement from these numbers. For example: <>= library(MASS) y <- na.omit(survey[survey$Sex=="Female", "Height"]) sampvalues <- simulateSampDist(y) plotSampDist(sampvalues=sampvalues) @ % \end{enumerate} How large a sample seems needed, in each instance, so that the sampling distribution is approximately normal -- around 4, around 9, or greater than 9? \end{fmpage}