<>= options(SweaveHooks=list(fig=function(){par(cex.main=1.1, mar=c(4.1,4.1,2.6,2.1), pty="s", mgp=c(2.25,0.5,0), tck=-0.02)})) @% \part{Populations and Samples -- Theoretical and Empirical Distributions} For background, see SPDM: Sections 5-7. R functions that will be used in this laboratory include: \begin{enumerate} \item \texttt{dnorm()}: Obtain the density values for the theoretical normal distribution; \item \texttt{pnorm()}: Given a normal deviate or deviates, obtain the cumulative probability; \item \texttt{qnorm()}: Given the cumulative probabilty. calculate the normal deviate; \item \texttt{sample()}: take a sample from a vector of values. Values may be taken without replacement (once taken from the vector, the value is not available for subsequent draws), or with replacement (values may be repeated); \item \texttt{density()}: fit an empirical density curve to a set of values; \item \texttt{rnorm()}: Take a random sample from a theoretical normal distribution; \item \texttt{runif()}: similar to \texttt{rnorm()}, but sampling is from a uniform distribution; \item \texttt{rt()}: similar to \texttt{rnorm()}, but sampling is from a $t$-distribution (the degrees of freedom must be given as the second parameter); \item \texttt{rexp()}: similar to \texttt{rnorm()}, but sampling is from an exponential distribution; \item \texttt{qqnorm()}: Compare the empirical distribution of a set of values with the empirical normal distribution. \end{enumerate} \section{Populations and Theoretical Distributions} \begin{fmpage}{36pc} \exhead{1} \begin{enumerate} \item Plot the density and the cumulative probability curve for a normal distribution with a mean of 2.5 and SD = 1.5.\newline Code that will plot the curve is <>= curve(dnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5) curve(pnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5) @ \item From the cumulative probability curve in (a), read off the area under the density curve between \texttt{x=0.5} and \texttt{x=4}. Check your answer by doing the calculation <>= pnorm(4, mean=2.5, sd=1.5) - pnorm(0.5, mean=2.5, sd=1.5) @ % \end{enumerate} \end{fmpage} \begin{fmpage}{36pc} \exhead{1, continued} \begin{enumerate} \item The density for the distribution in items (i) and (ii), given by \texttt{dnorm(x, 2.5, 1.5)}, gives the relative number of observations per unit interval that can be expected at the value \texttt{x}. For example \texttt{dnorm(x=2, 2.5, 1.5)} $\simeq$ 0.2516. Hence \begin{itemize} \item [(i)] In a sample of 100 the expected number of observations per unit interval, in the immediate vicinity of \texttt{x = 2}, is 25.16 \item [(ii)] In a sample of 1000 the expected number of observations per unit interval, in the immediate vicinity of \texttt{x = 2}, is 251.6 \item [(iii)] The expected number of values from a sample of 100, between 1.9 amd 2.1, is approximately 0.2 $\times$ 251.6 = 50.32\newline [The number can be calculated more exactly as <>= (pnorm(2.1, 2.5, 1.5) - pnorm(1.9, 2.5, 1.5)) * 1000 @ % ] \end{itemize} Repeat the calculation to get approximate and more exact values for the expected number \begin{itemize} \item [(i)] between 0.9 and 1.1 \item [(ii)] between 2.9 and 3.1 \item [(iii)] between 3.9 and 4.1 \end{itemize} \end{enumerate} \end{fmpage} By way of example, here is the code for (a): <>= curve(dnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5) curve(pnorm(x, mean=2.5, sd=1.5), from = 2.5-3*1.5, to = 2.5+3*1.5) @ \begin{fmpage}{36pc} \exhead{2} \begin{enumerate} \item Plot the density and the cumulative probability curve for a $t$-distribution with 3 degrees of freedom. Overlay, in each case, a normal distribution with a mean of 0 and SD=1. \newline [Replace \texttt{dnorm} by \texttt{dt}, and specify \texttt{df=10}] \item Plot the density and the cumulative probability curve for an exponential distribution with a rate parameter equal to 1 (the default). Repeat, with a rate parameter equal to 2. (When used as a failure time distribution; the rate parameter is the expected number of failures per unit time.) \end{enumerate} \end{fmpage} \section{Samples and Estimated Density Curves} \begin{fmpage}{36pc} \exhead{3} Use the function \texttt{rnorm()} to draw a random sample of 25 values from a normal distribution with a mean of 0 and a standard deviation equal to 1.0. Use a histogram, with \texttt{probability=TRUE} to display the values. Overlay the histogram with: (a) an estimated density curve; (b) the theoretical density curve for a normal distribution with mean 0 and standard deviation equal to 1.0. Repeat with samples of 100 and 500 values, showing the different displays in different panels on the same graphics page. \end{fmpage} <>= par(mfrow=c(1,3), pty="s") x <- rnorm(50) hist(x, probability=TRUE) lines(density(x)) xval <- pretty(c(-3,3), 50) lines(xval, dnorm(xval), col="red") @ \begin{fmpage}{36pc} \exhead{4} Data whose distribution is close to lognormal are common. Size measurements of biological organisms often have this character. As an example, consider the measurements of body weight (\texttt{body}), in the data frame \texttt{Animals} (\textit{MASS}). Begin by drawing a histogram of the untransformed values, and overlaying a density curve. Then \begin{enumerate} \item Draw an estimated density curve for the logarithms of the values. Code is given immediately below. \item Determine the mean and standard deviation of \verb!log(Animals$body)!. Overlay the estimated density with the theoretical density for a normal distribution with the mean and standard deviation just obtained. \end{enumerate} Does the distribution seem normal, after transformation to a logarithmic scale? \end{fmpage} <>= library(MASS) plot(density(Animals$body)) logbody <- log(Animals$body) plot(density(logbody)) av <- mean(logbody) sdev <- sd(logbody) xval <- pretty(c(av-3*sdev, av+3*sdev), 50) lines(xval, dnorm(xval, mean=av, sd=sdev)) @ \begin{fmpage}{36pc} \exhead{5} The following plots an estimated density curve for a random sample of 50 values from a normal distribution: <>= plot(density(rnorm(50)), type="l") @ % \begin{enumerate} \item Plot estimated density curves, for random samples of 50 values, from (a) the normal distribution; (b) the uniform distribution (\texttt{runif(50)}); (c) the $t$-distribution with 3 degrees of freedom. Overlay the three plots (use \texttt{lines()} in place of \texttt{plot()} for densities after the first). \item Repeat the previous exercise, but taking random samples of 500 values. \end{enumerate} \end{fmpage} \begin{fmpage}{36pc} \exhead{6} There are two ways to make an estimated density smoother: \begin{enumerate} \item One is to increase the number of samples, For example: <>= plot(density(rnorm(500)), type="l") @ \item The other is to increase the bandwidth. For example <>= plot(density(rnorm(50), bw=0.2), type="l") plot(density(rnorm(50), bw=0.6), type="l") @ \end{enumerate} Repeat each of these with bandwidths (\texttt{bw}) of 0.15, with the default choice of bandwidth, and with the bandwidth set to 0.75. \end{fmpage} \begin{fmpage}{36pc} \exhead{7} Here we experiment with the use of \texttt{sample()} to take a sample from an empirical distribution, i.e., from a vector of values that is given as argument. Here, the sample size will be the number of values in the argument. Any size of sample is however permissible. <>= sample(1:5, replace=TRUE) # With replacement sample, size 5 for(i in 1:10)print(sample(1:5, replace=TRUE)) # 10 such samples plot(density(log10(Animals$body))) lines(density(sample(log10(Animals$body), replace=TRUE)), col="red") @ Repeat the final density plot several times, perhaps using different colours for the curve on each occasion. This gives an indication of the stability of the estimated density curve with respect to sample variation. Laboratory exercises 4 will pursue these ideas in more detail. \end{fmpage} \vspace*{9pt} \section{*Normal Probability Plots} \begin{fmpage}{36pc} \exhead{8} Partly because of the issues with bandwidth and choice of kernel, and partly because it is hard to density estimates are not a very effective means for judging normality. A much better tool is the normal probability plot, which works with cumulative probability distributions. Try <>= qqnorm(rnorm(10)) qqnorm(rnorm(50)) qqnorm(rnorm(200)) @ % For samples of modest to large sizes, the points lie close to a line. The function \texttt{qreference()} (\textit{DAAG}) takes one sample as a reference (by default it uses a random sample) and by default provides 5 other random normal samples for comparison. For example: <>= library(DAAG) qreference(m=10) qreference(m=50) qreference(m=200) @ % \end{fmpage} \begin{fmpage}{36pc} \exhead{9} The intended use of \texttt{qreference()} is to draw a normal probability for a set of data, and place alongside it some number of normal probability plots for random normal data. For example <>= qreference(possum$totlngth) @ % Obtain similar plots for each of the variables \texttt{taill}, \texttt{footlgth} and \texttt{earconch} in the possum data. Repeat the exercise for males and females separately \end{fmpage} \begin{fmpage}{36pc} \exhead{10} Use normal probability plots to assess whether the following sets of values, all from data sets in the \texttt{DAAG} package, have distributions that seem consistent with the assumption that they have been sampled from a normal distribution? \begin{enumerate} \item the difference \texttt{heated - ambient}, in the data frame \texttt{pair65} (\textit{DAAG})? \item the values of \texttt{earconch}, in the \texttt{possum} data frame (\textit{DAAG})? \item the values of \texttt{body}, in the data frame \texttt{Animals} (\textit{MASS})? \item the values of \texttt{log(body)}, in the data frame \texttt{Animals} (\textit{MASS})? \end{enumerate} \end{fmpage} \vspace*{9pt} \section{Boxplots -- Simple Summary Information on a Distribution} In the data frame \texttt{cfseal} (\textit{DAAG}), several of the columns have a number of missing values. A relevant question is: ``Do missing and non-missing rows have similar values, for columns that are complete?'' \begin{fmpage}{36pc} \exhead{11} Use the following to find, for each column of the data frame \texttt{cfseal}, the number of missing values: \begin{verbcode} sapply(cfseal, function(x)sum(is.na(x))) \end{verbcode} Observe that for \texttt{lung}, \texttt{leftkid}, \texttt{rightkid}, and \texttt{intestines} values are missing in the same six rows. For each of the remaining columns compare, do boxplots that compare the distribution of values for the 24 rows that had no missing values with the distribution of values for the 6 rows that had missing values. \end{fmpage} Here is code that can be used to get started: \begin{verbcode} present <- complete.cases(cfseal) boxplot(age ~ present, data=cfseal) \end{verbcode} Or you might use the lattice function and do the following: \begin{verbcode} present <- complete.cases(cfseal) library(lattice) present <- complete.cases(cfseal) bwplot(present ~ age, data=cfseal) \end{verbcode} \begin{fmpage}{36pc} \exhead{12} Tabulate, for the same set of columns for which boxplots were obtained in Exercise 2, differences in medians, starting with: \begin{verbcode} median(age[present]) - median(age[!present])) \end{verbcode} Calculate also the ratios of the two interquartile ranges, i.e. \begin{verbcode} IQR(age[present]) - IQR(age[!present])) \end{verbcode} \end{fmpage}