\part{Practice with R} \section{Information about the Columns of Data Frames} \begin{fmpage}{36pc} \exhead{1} Functions that may be used to get information about data frames include \texttt{str()}, \texttt{dim()}, \texttt{row.names()} and \texttt{names()}. Try each of these functions with the data frames \texttt{allbacks}, \texttt{ant111b} and \texttt{tinting} (all in \textit{DAAG}). For getting information about each column of a data frame, use \texttt{sapply()}. For example, the following applies the function \texttt{class()} to each column of the data frame \texttt{ant111b}. <>= library(DAAG) sapply(ant111b, class) @ % For columns in the data frame \texttt{tinting} that are factors, use \texttt{table()} to tabulate the number of values for each level. \end{fmpage} \vspace*{9pt} \section{A Tabulation Exercise} \begin{fmpage}{36pc} \exhead{2} Tabulate the number of observations in each of the different districts in the data frame \texttt{rockArt} (\textit{DAAGxtras}). Create a factor \texttt{groupDis} in which all \texttt{District}s with less than 5 observations are grouped together into the category \texttt{other}. \end{fmpage} <>= library(DAAGxtras) groupDis <- as.character(rockArt$District) tab <- table(rockArt$District) le4 <- rockArt$District %in% names(tab)[tab <= 4] groupDis[le4] <- "other" groupDis <- factor(groupDis) @ % \section{A For Loop} \begin{fmpage}{36pc} \exhead{3} The following code uses a \verb!for! loop to plot graphs that compare the relative population growth (here, by the use of a logarithmic scale) for the Australian states and territories. <>= library(DAAG) oldpar <- par(mfrow=c(2,4)) for (i in 2:9){ plot(austpop[,1], log(austpop[, i]), xlab="Year", ylab=names(austpop)[i], pch=16, ylim=c(0,10))} par(oldpar) @% Which Australian adminstration(s) showed the most rapid increase in the early years? Which showed the most rapid increase in later years? \end{fmpage} \vspace*{9pt} \section{Data Exploration -- Distributions of Data Values} \begin{fmpage}{36pc} \exhead{4} The data frame \texttt{rainforest} (\textit{DAAG} package) has data on four different rainforest species. Use \verb!table(rainforest$species)! to check the names and numbers of the species present. In the sequel, attention will be limited to the species Acmena \textit{smithii}. The following plots a histogram showing the distribution of the diameter at base height: <>= library(DAAG) # The data frame rainforest is from DAAG Acmena <- subset(rainforest, species=="Acmena smithii") hist(Acmena$dbh) @ % Above, frequencies were used to label the the vertical axis (this is the default). An alternative is to use a density scale (\texttt{prob=TRUE}). The histogram is interpreted as a crude density plot. The density, which estimates the number of values per unit interval, changes in discrete jumps at the breakpoints (= class boundaries). The histogram can then be directly overlaid with a density plot, thus: <>= hist(Acmena$dbh, prob=TRUE, xlim=c(0,50)) # Use a density scale lines(density(Acmena$dbh, from=0)) @ % Why use the argument \texttt{from=0}? What is the effect of omitting it? \vspace{3pt} [Density estimates, as given by R's function \texttt{density()}, change smoothly and do not depend on an arbitrary choice of breakpoints, making them generally preferable to histograms. They do sometimes require tuning to give a sensible result. Note especially the parameter \texttt{bw}, which determines how the bandwidth is chosen, and hence affects the smoothness of the density estimate.] \end{fmpage} \vspace*{9pt} \section{Random Samples} \begin{fmpage}{36pc} \exhead{5} By taking repeated random samples from the normal distribution, and plotting the distribution for each such sample, one can get an idea of the effect of sampling variation on the sample distribution. A random sample of 100 values from a normal distribution (with mean 0 and standard deviation 1) can be obtained, and a histogram and overlaid density plot shown, thus: <>= y <- rnorm(100) hist(y, probability=TRUE) # probability=TRUE gives a y density scale lines(density(y)) @ % \begin{itemize} \item[(a)] Take 5 samples of size 25, then showing the plots. \item[(b)] Take 5 samples of size 100, then showing the plots. \item[(c)] Take 5 samples of size 500, then showing the plots. \item[(d)] Take 5 samples of size 2000, then showing the plots. \end{itemize} (Hint: By preceding the plots with \texttt{par(mfrow=c(4,5))}, all 20 plots can be displayed on the one graphics page. To bunch the graphs up more closely, make the further settings \texttt{par(mar=c(3.1,3.1,0.6,0.6), mgp=c(2.25,0.5,0))}) \vspace*{4pt} Comment on the usefulness of a sample histogram and/or density plot for judging whether the population distribution is likely to be close to normal. \vspace*{3pt} \end{fmpage} \vspace*{4pt} Histograms and density plots are, for ``small'' samples, notoriously variable under repeated sampling. This is true even for sample sizes as large as 50 or 100. \vspace*{4pt} \begin{fmpage}{36pc} \exhead{6} This explores the function \texttt{sample()}, used to take a sample of values that are stored or enumerated in a vector. Samples may be with or without replacement; specify \texttt{replace = FALSE} (the default) or \texttt{replace = TRUE}. The parameter \texttt{size} determines the size of the sample. By default the sample has the same size (length) as the vector from which samples are taken. Take several samples of size 5 from the vector \texttt{1:5}, with \texttt{replace=FALSE}. Then repeat the exercise, this time with \texttt{replace=TRUE}. Note how the two sets of samples differ. \end{fmpage} \vspace*{8pt} \begin{fmpage}{36pc} \exhead{7} If in Exercise 4 above a new random sample of trees could be taken, the histogram and density plot would change. How much might we expect them to change? \vspace*{3pt} The boostrap approach treats the one available sample as a microcosm of the population. Repeated with replacement samples are taken from the one available sample. This is equivalent to repeating each sample value and infinite number of times, then taking random samples from the population that is thus created. The expectation is that variation between those samples will be comparable to variation between samples from the original population. \begin{itemize} \item[(a)] Take repeated (5 or more) bootstrap samples from the Acmena dataset of Exercise 4, and show the density plots. [Use \verb!sample(Acmena$dbh, replace=TRUE)!]. \item[(b)] Repeat, now with the \texttt{cerealsugar} data from \textit{DAAG}. \end{itemize} \end{fmpage} \vspace*{9pt} \section{Smooth Curves} \begin{fmpage}{36pc} \exhead{8\textbf{*}} The following compares three different smoothing functions. Comment on the different syntax and, in the case of \texttt{lowess()}, the different default output that is returned. Why, for the smooth obtained using \texttt{lowess()}, is it necessary to sort data in order of values of \texttt{dbh}? (Try omitting the ordering, and observe the result.) <>= Acmena <- subset(rainforest, species=="Acmena smithii") ## Use lowess() plot(wood ~ dbh, data=Acmena) ord <- order(Acmena$dbh) with(Acmena[ord, ], lines(predict(loess(wood ~ dbh)) ~ dbh)) ## Now use panel.smooth() plot(wood ~ dbh, data=Acmena) with(Acmena, panel.smooth(dbh, wood)) # Observe that panel.smooth() does not at this time allow either # a graphics formula as argument, or the data parameter. @ % For each of the functions just noted, what are the parameters that control the smoothness of the curve? What, in each case, is the default? \end{fmpage} \newpage \section{Information on Workspace Objects} \begin{fmpage}{36pc} \exhead{9} An R workspace includes objects \texttt{possum1}, \texttt{possum2}, \ldots \texttt{possum5}. The folowing shows how to get the size of one of these objects one at a time. <>= possum1 <- rnorm(10) object.size(possum1) @ % The names of the objects can be obtained with <<>>= nam <- ls(pattern="^possum") @ To get the sizes from the names that are held in \texttt{nam}, do <>= sapply(nam, function(x)object.size(get(x))) @ % Create objects \texttt{possum2}, \ldots \texttt{possum5}, and enter this command. Explain the successive steps in the computation.\newline [Hint: Compare \texttt{class(possum1)} with \texttt{class("possum1")}, and \texttt{object.size(possum1)} with \texttt{object.size("possum1")}] \end{fmpage} \vspace*{8pt} \begin{fmpage}{36pc} \exhead{10\textbf{*}} The function \texttt{ls()} lists, by default, the names of objects in the current environment. If used from the command line, it lists the objects in the workspace. If used in a function, it lists the names of the function's local variables. To get a listing of the contents of the workspace, do the following <>= workls <- function()ls(name=".GlobalEnv") workls() @ % \begin{itemize} \item[(a)] If \texttt{ls(name=".GlobalEnv")} is replaced by \texttt{ls()}, the function lists the names of its local variables. Modify \texttt{workls()} so that you can use it to demonstrate this.\newline [Hint: Consider adapting \texttt{if(is.null(name))ls())} for the purpose.] \item[(b)] Write a function that calculates the sizes of all objects in the workspace, then listing the names and sizes of the largest ten objects. \end{itemize} \end{fmpage} \vspace{9pt} \section{Different Ways to Do a Calculation -- Timings} \begin{fmpage}{36pc} \exhead{10\textbf{*}} This exercise will investigate the relative times for alternative ways to do a calculation. The function \texttt{system.time()} will provide timings. The numbers that are printed on the command line, if results are not assigned to an output object, are the user cpu time, the system cpu time, and the elapsed time. First, create both matrix and data frame versions of a largish data set. <>= xxMAT <- matrix(runif(480000), ncol=50) xxDF <- as.data.frame(xxMAT) @ % \end{fmpage} \begin{fmpage}{36pc} \exhead{11\textbf{*}, continued} Repeat each of the calculations that follow several times, noting the extent of variation between repeats. If there is noticeable variation, make the setting \texttt{options(gcFirst=TRUE)}, and check whether this leads to more consistent timings. NB: If your computer chokes on these calculations, reduce the dimensions of \texttt{xxMAT} and \texttt{xxDF} \begin{itemize} \item[(a)] The following compares the times taken to increase each element by 1: <>= system.time(invisible(xxMAT+1))[1:3] system.time(invisible(xxDF+1))[1:3] @ % \item[(b)] Now compare the following alternative ways to calculate the means of the 50 columns: <>= ## Use apply() [matrix argument], or sapply() [data frame argument] system.time(av1 <- apply(xxMAT, 2, mean))[1:3] system.time(av1 <- sapply(xxDF, mean))[1:3] ## Use a loop that does the calculation for each column separately system.time({av2 <- numeric(50); for(i in 1:50)av[i] <- mean(xxMAT[,i]) })[1:3] system.time({av2 <- numeric(50); for(i in 1:50)av[i] <- mean(xxDF[,i]) })[1:3] ## Matrix multiplication system.time({colOFones <- rep(1, dim(xxMAT)[2]) av3 <- xxMAT %*% colOFones / dim(xxMAT)[2] })[1:3] @ % \item[(c)] Pick one of the above calculations. Vary the number of rows in the matrix, keeping the number of columns constant, and plot each of user CPU time and system CPU time against number of rows of data. \end{itemize} Why is matrix multiplication is so efficient, relative to equivalent calculations that use \texttt{apply()}, or that use for loops? \end{fmpage} \vspace*{9pt} \section{Functions -- Making Sense of the Code} \begin{fmpage}{36pc} \exhead{12\textbf{*}} Data in the data frame \texttt{fumig} (\textit{DAAGxtras}) are from a series of trials in which produce was exposed to a fumigant over a 2-hour time period. Concentrations of fumigant were measured at times 5, 10, 30, 60, 90 and 120 minutes. Code given following this exercise calculates a concentration-time (c-t) product that measures exposure to the fumigant, leading to the measure \texttt{ctsum}. \vspace*{3pt} Examine the code in the three alternative functions given below, and the data frame \texttt{fumig} (in the \texttt{DAAGxtras} package) that is given as the default argument for the parameter \texttt{df}. Do the following: \begin{itemize} \item[(a)] Run all three functions, and check that they give the same result. \item[(b)] Annotate the code for \texttt{calcCT1()} to explain what each line does. \item[(c)] Are fumigant concentration measurements noticeably more variable at some times than at others? \item[(d)] Which function is fastest? [In order to see much difference, it will be necessary to put the functions in loops that run perhaps 1000 or more times.] \end{itemize} \end{fmpage} \begin{fmpage}{36pc} \paragraph{Code for 3 functions that do equivalent calculations} <>= ## Function "calcCT1" "calcCT1" <- function(df=fumig, times=c(5,10,30,60,90,120), ctcols=3:8){ multiplier <- c(7.5,12.5,25,30,30,15) m <- dim(df)[1] ctsum <- numeric(m) for(i in 1:m){ y <- unlist(df[i, ctcols]) ctsum[i] <- sum(multiplier*y)/60 } df <- cbind(ctsum=ctsum, df[,-ctcols]) df } ## ## Function "calcCT2" "calcCT2" <- function(df=fumig, times=c(5,10,30,60,90,120), ctcols=3:8){ multiplier <- c(7.5,12.5,25,30,30,15) mat <- as.matrix(df[, ctcols]) ctsum <- mat%*%multiplier/60 cbind(ctsum=ctsum, df[,-ctcols]) } ## ## Function "calcCT3" "calcCT3" <- function(df=fumig, times=c(5,10,30,60,90,120), ctcols=3:8){ multiplier <- c(7.5,12.5,25,30,30,15) mat <- as.matrix(df[, ctcols]) ctsum <- apply(mat, 1, function(x)sum(x*multiplier))/60 cbind(ctsum=ctsum, df[,-ctcols]) } @ \end{fmpage} \vspace*{9pt} \section{$^*$Use of \texttt{sapply()} to Give Multiple Graphs} \begin{fmpage}{36pc} \exhead{13\textbf{*}} Here is code for the calculations that compare the relative population growth rates for the Australian states and territories, but avoiding the use of a loop: <>= oldpar <- par(mfrow=c(2,4)) invisible( sapply(2:9, function(i, df) plot(df[,1], log(df[, i]), xlab="Year", ylab=names(df)[i], pch=16, ylim=c(0,10)), df=austpop) ) par(oldpar) @ % Run the code, and check that it does indeed give the same result as an explicit loop.\newline [Use of \texttt{invisible()} as a wrapper suppresses printed output that gives no useful information.] Note that \texttt{lapply()} could be used in place of \texttt{sapply()}. \end{fmpage} There are several subtleties here: \begin{description} \item[(i)] The first argument to \texttt{sapply()} can be either a list (which is, technically, a non-atomic vector) or a vector.\footnote{By ``vector'' we usually mean an atomic vector, with ``atoms'' that are of one of the modes "logical", "integer", "numeric", "complex", "character"' or "raw". (Vectors of mode "raw" can for our purposes be ignored.)} Here, we have supplied the vector 2:9 \item[(ii)] The second argument is a function. Here we have supplied an anonymous function that has two arguments. The argument \texttt{i} takes as its values, in turn, the sucessive elements in the first argument to \texttt{sapply} \item[(iii)] Where as here the anonymous function has further arguments, they are supplied as additional arguments to \texttt{sapply()}. Hence the parameter \texttt{df=austpop}. \end{description} \vspace*{3pt} \enlargethispage{24pt} \section{$^*$The Internals of R -- Functions are Pervasive} \begin{fmpage}{36pc} \exhead{14\textbf{*}} This exercise peeks into the internals of R's handling arithmetic and related computations. Those internals are close enough to the surface that users can experiment with them. \vspace*{3pt} The binary arithmetic operators \texttt{+}, \texttt{-}, \texttt{*}, \texttt{/} and \verb!^! are implemented as functions. (R is a functional language; albeit with features that compromise its purity as a member of this genre!) Try the following: <>= "+"(2,5) "-"(10,3) "/"(2,5) "*"("+"(5,2), "-"(3,7)) @ % \vspace*{3pt} There are two other binary arithmetic operators -- \verb!%%! and \verb!%/%!. Look up the relevant help page, and explain, with examples, what they do. Try <>= (0:25) %/% 5 (0:25) %% 5 @ % Of course, these are also implemented as functions. Write code that demonstrates this. \vspace*{3pt} Note also that \verb![! is implemented as a function. Try <>= z <- c(2, 6, -3, NA, 14, 19) "["(z, 5) heights <- c(Andreas=178, John=185, Jeff=183) "["(heights, c("Jeff", "John")) @ % Rewrite these using the usual syntax. \vspace*{3pt} Use this syntax to extract, from the data frame \texttt{possumsites} (\textit{DAAG}), the altitudes for Byrangery and Conondale. \end{fmpage} \vspace*{4pt} \noindent Note: Expressions in which arithmetic operators appear as explicit functions with binary arguments translate directly into postfix reverse Polish notation, introduced in 1920 by the Polish logician and mathematician Jan Lukasiewicz. Postfix notation is widely used in the interpreters and compilers that translate computer language code into machine or assembly language instructions. See the Wikipedia article ``Reverse Polish Notation''.