%% First set of exercises
<>=
options(SweaveHooks=list(fig=function(){par(cex.main=1.1,
        mar=c(4.1,4.1,2.6,2.1), pty="s", mgp=c(2.25,0.5,0), tck=-0.02)}))
graphics.off()
x11(width=8, height=8)
@

\part{R Basics}

\section{Data Input}

\begin{fmpage}{36pc}
\exhead{1} The file \textbf{fuel.txt} is one of several files that the
function \texttt{datafile()} (from \textit{DAAG}), when called with a
suitable argument, has been designed to place in the working directory.
On the R command line, type \texttt{library(DAAG)}, then
\texttt{datafile("fuel")}, thus:\footnote{This and other files used in these
notes for practice in data input are also available from the web page\\
\url{http://www.maths.anu.edu.au/~johnm/datasets/text/}.}
<>=
library(DAAG)
datafile(file="fuel")  # NB datafile, not dataFile
# See help(datafile) for information on this function
@ %
Alternatively, copy \textbf{fuel.txt} from the directory \textbf{data} on
the DVD to the working directory.

\vspace*{3pt}
Use \texttt{file.show()} to examine the file.\footnote{Alternatively, open
the file in R's script editor (under Windows, go to \underline{File} |
\underline{Open script...}), or in another editor.} Check carefully whether
there is a header line. Use the R Commander menu to input the data into R,
with the name \texttt{fuel}. Then, as an alternative, use
\texttt{read.table()} directly. (If necessary, use the code generated by the
R Commander as a crib.) In each case, display the data frame and check that
the data have been input correctly.

\vspace*{3pt}
{\small Note: If the file is elsewhere than in the working directory, a
fully specified file name, including the path, is necessary.
For example, to input \textbf{travelbooks.txt} from the directory
\textbf{data} on drive \textbf{D:}, type
<>=
travelbooks <- read.table("D:/data/travelbooks.txt")
@ %
For input to R functions, forward slashes replace backslashes.}
\end{fmpage}

\vspace*{6pt}

\begin{fmpage}{36pc}
\exhead{2} The files \textbf{molclock1.txt} and \textbf{molclock2.txt} are
in the \textbf{data} directory on the DVD.\footnote{Again, these are among
the files that you can use the function \texttt{datafile()} to place in the
working directory.}

\vspace*{4pt}
As in Exercise 1, use the R Commander to input each of these, then use
\texttt{read.table()} directly to achieve the same result. Check, in each
case, that the data have been input correctly.

\vspace*{3pt}
\end{fmpage}

\vspace*{6pt}

\section{The \texttt{paste()} Function}

\begin{fmpage}{36pc}
\exhead{3} Here are examples that illustrate the use of \texttt{paste()}:
<>=
paste("Leo", "the", "lion")
paste("a", "b")
paste("a", "b", sep="")
paste(1:5)
paste(1:5, collapse="")
@ %
What are the respective effects of the parameters \texttt{sep} and
\texttt{collapse}?
\end{fmpage}

\vspace*{9pt}

\section{Missing Values}

\begin{fmpage}{36pc}
\exhead{4} The following counts, for each species, the number of missing
values for the column \texttt{root} of the data frame \texttt{rainforest}
(\textit{DAAG}):
<>=
library(DAAG)
with(rainforest, table(complete.cases(root), species))
@ %
For each species, how many rows are ``complete'', i.e., have no values that
are missing?
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{5} For each column of the data frame \texttt{Pima.tr2}
(\textit{MASS}), determine the number of missing values.
\end{fmpage}

\vspace*{9pt}

\section{Subsets of Dataframes}

\begin{fmpage}{36pc}
\exhead{6} Use \texttt{head()} to check the names of the columns, and the
first few rows of data, in the data frame \texttt{rainforest}
(\textit{DAAG}). Use \verb!table(rainforest$species)!
to check the names and numbers of each species that are present in the
data. The following extracts the rows for the species
\textit{Acmena smithii}:
<>=
library(DAAG)
Acmena <- subset(rainforest, species=="Acmena smithii")
@%
The following extracts the rows for the species \texttt{Acacia mabellae}
and \texttt{Acmena smithii}:
<>=
AcSpecies <- subset(rainforest, species %in% c("Acacia mabellae",
                                               "Acmena smithii"))
@%
Now extract the rows for all species except \texttt{C. fraseri}.
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{7} Extract the following subsets from the data frame \texttt{ais}
(\textit{DAAG}):
\begin{itemize}
\item[(a)] Extract the data for the rowers.
\item[(b)] Extract the data for the rowers, the netballers and the tennis
players.
\item[(c)] Extract the data for the female basketballers and rowers.
\end{itemize}
@%
\end{fmpage}

\vspace*{9pt}

\section{Scatterplots}

\begin{fmpage}{36pc}
\exhead{8} Using the Acmena data from the data frame \texttt{rainforest},
plot \texttt{wood} (wood biomass) vs \texttt{dbh} (diameter at breast
height), trying both untransformed scales and logarithmic scales. Here is
suitable code:
<>=
Acmena <- subset(rainforest, species=="Acmena smithii")
plot(wood ~ dbh, data=Acmena)
plot(wood ~ dbh, data=Acmena, log="xy")
@ %
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{8, continued} Use of the argument \verb!log="xy"! gives logarithmic
scales on both the $x$ and $y$ axes. For purposes of adding a line, or other
additional features that use $x$ and $y$ coordinates, note that logarithms
are to base 10.
<>=
plot(wood~dbh, data=Acmena, log="xy")
## Use lm() to fit a line, and abline() to add it to the plot
Acmena10.lm <- lm(log10(wood) ~ log10(dbh), data=Acmena)
abline(Acmena10.lm)
<>=
## Now print the coefficients, for a log10 scale
coef(Acmena10.lm)
## For comparison, print the coefficients for a natural log scale
Acmena.lm <- lm(log(wood) ~ log(dbh), data=Acmena)
coef(Acmena.lm)
@ %
Write down the equation that gives the fitted relationship between
\texttt{wood} and \texttt{dbh}.
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{9} The \verb!orings! data frame gives data on the damage that had
occurred in US space shuttle launches prior to the disastrous Challenger
launch of January 28, 1986. Only the observations in rows 1, 2, 4, 11, 13,
and 18 were included in the pre-launch charts used in deciding whether to
proceed with the launch. Add a new column to the data frame that identifies
rows that were included in the pre-launch charts. Now make three plots of
\verb!Total! incidents against \verb!Temperature!:
\begin{enumerate}
\item Plot only the rows that were included in the pre-launch charts.
\item Plot all rows.
\item Plot all rows, using different symbols or colors to indicate whether
or not points were included in the pre-launch charts.
\end{enumerate}
Comment, for each of the first two graphs, on whether an open or closed
symbol is preferable. For the third graph, comment on your reasons for the
choice of symbols.
\end{fmpage}

Use the following to identify rows that hold the data that were presented
in the pre-launch charts:
<>=
included <- logical(23)    # orings has 23 rows
included[c(1,2,4,11,13,18)] <- TRUE
@
The construct \texttt{logical(23)} creates a vector of length 23 in which
all values are \texttt{FALSE}. The following are two possibilities for the
third plot; can you improve on these choices of symbols and/or colors?
<>=
plot(Total ~ Temperature, data=orings, pch=included+1)
plot(Total ~ Temperature, data=orings, col=included+1)
@

\begin{fmpage}{36pc}
\exhead{10} Using the data frame \texttt{oddbooks}, use graphs to
investigate the relationships between:
\begin{enumerate}
\item weight and volume;
\item density and volume;
\item density and page area.
\end{enumerate}
\end{fmpage}

\vspace*{9pt}

\section{Factors}

\begin{fmpage}{36pc}
\exhead{11} Investigate the use of the functions \texttt{as.character()}
and \texttt{unclass()} with a factor argument. Comment on their use in the
following code:
<>=
par(mfrow=c(1,2), pty="s")
plot(weight ~ volume, pch=unclass(cover), data=allbacks)
plot(weight ~ volume, data=allbacks, type="n")
with(allbacks, text(weight ~ volume, labels=as.character(cover)))
par(mfrow=c(1,1))
@ %
[\texttt{mfrow=c(1,2)}: plot layout is 1 row $\times$ 2 columns;
\texttt{pty="s"}: square plotting region.]
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{12} Run the following code:
<>=
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
gender <- factor(gender, levels=c("male", "female"))
table(gender)
gender <- factor(gender, levels=c("Male", "female"))  # Note the mistake
# The level was "male", not "Male"
table(gender)
rm(gender)    # Remove gender
@ %
The output from the final \verb!table(gender)! is
<>=
gender <- factor(c(rep("female", 91), rep("male", 92)))
gender <- factor(gender, levels=c("Male", "female"))  # Note the mistake
# The level was "male", not "Male"
table(gender)
rm(gender)    # Remove gender
@%
Explain the numbers that appear.
\end{fmpage}

\vspace*{9pt}

\section{Stripcharts (base graphics) and Stripplots (\textit{lattice})}

\begin{fmpage}{36pc}
\exhead{13} Look up the help for the lattice function \verb!dotplot()!.
Compare the following, noting the differences in syntax between the lattice
graphics function \texttt{stripplot()} and the base graphics function
\texttt{stripchart()}.
<>=
with(ant111b, stripchart(harvwt ~ site))    # Base graphics
library(lattice)
stripplot(site ~ harvwt, data=ant111b)
## Now switch the x and y axes
stripplot(harvwt ~ site, data=ant111b)      # Lattice graphics
@ %
\end{fmpage}

\begin{fmpage}{36pc}
\exhead{14} Check the class of each of the columns of the data frame
\texttt{cabbages} (\textit{MASS}). Do side by side plots of \texttt{HeadWt}
against \texttt{Date}, for each of the levels of \texttt{Cult}.
<>=
stripplot(Date ~ HeadWt | Cult, data=cabbages)
@%
\end{fmpage}

The lattice graphics function \texttt{stripplot()} seems generally
preferable to the base graphics function \texttt{stripchart()}. It has
functionality that \texttt{stripchart()} lacks, and a consistent syntax that
it shares with other lattice functions.

\vspace*{9pt}

\begin{fmpage}{36pc}
\exhead{15} In the data frame \texttt{nsw74psid3}, use \texttt{stripplot()}
to compare, between levels of \texttt{trt}, the continuous variables
\texttt{age}, \texttt{educ}, \texttt{re74} and \texttt{re75}. It is possible
to generate all the plots at once, side by side. A simplified version of the
plot is:
<>=
stripplot(trt ~ age + educ, data=nsw74psid3, outer=TRUE, scales="free")
@ %
What are the effects of \texttt{scales = "free"}, and \texttt{outer = TRUE}?
(Try leaving these at their defaults.)
\end{fmpage}

\vspace*{9pt}

\section{Tabulation}

\begin{fmpage}{36pc}
\exhead{16} In the data set \texttt{nsw74psid3}, compare, for each of the
two levels of \texttt{trt}, the relative numbers for each of the factors
\texttt{black}, \texttt{hisp} (Hispanic), and \texttt{marr} (married).
\end{fmpage}

\vspace*{9pt}

\section{Sorting}

\begin{fmpage}{36pc}
\exhead{17} Sort the rows in the data frame \texttt{Acmena} in order of
increasing values of \texttt{dbh}.\newline
[Hint: Use the function \texttt{order()}, applied to \texttt{dbh}, to
determine the order of row numbers required to sort rows in increasing order
of \texttt{dbh}. Reorder rows of \texttt{Acmena} to appear in this order.]
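
As a minimal sketch of how \texttt{order()} works, the following uses a
small made-up data frame (\texttt{toydf} and its columns \texttt{id} and
\texttt{size} are hypothetical, and are not part of \textit{DAAG}).
\texttt{order()} returns the positions that would sort its argument; those
positions can then be used as row indices:
\begin{verbatim}
## toydf is a hypothetical data frame, for illustration only
toydf <- data.frame(id=c("p","q","r"), size=c(30,10,20))
ord <- order(toydf$size)   # positions of the sizes, smallest first
ord
toydf[ord, ]               # rows rearranged in order of increasing size
\end{verbatim}
The same indexing idiom applies to any data frame.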
<>=
Acmena <- subset(rainforest, species=="Acmena smithii")
ord <- order(Acmena$dbh)
acm <- Acmena[ord, ]
@ %
Sort the row names of \texttt{possumsites} (\textit{DAAG}) into alphanumeric
order. Reorder the rows of \texttt{possumsites} in order of the row names.
\end{fmpage}

\vspace*{9pt}

\section{For Loops}

\begin{fmpage}{36pc}
\exhead{18}
\vspace*{-18pt}
\begin{enumerate}
\item Create a \texttt{for} loop that, given a numeric vector, prints out
one number per line, with its square and cube alongside.
\item Look up \texttt{help("while")}. Show how to use a \texttt{while} loop
to achieve the same result.
\item Show how to achieve the same result without the use of an explicit
loop.
\end{enumerate}
\end{fmpage}

\section{A Function}

\begin{fmpage}{36pc}
\exhead{19} The following function calculates the mean and standard
deviation of a numeric vector.
<>=
meanANDsd <- function(x){
  av <- mean(x)
  sdev <- sd(x)
  c(mean=av, sd = sdev)  # The function returns this vector
}
@ %
Modify the function so that: (a) the default is to use \texttt{rnorm()} to
generate 20 random normal numbers, and return the standard deviation; (b) if
there are missing values, the mean and standard deviation are calculated for
the remaining values.
\end{fmpage}

\section{Further Practice with Data Input}

\begin{fmpage}{36pc}
\exhead{20\textbf{*}} The function \texttt{read.csv()} is a variant of
\texttt{read.table()} that is designed to read in comma-delimited files such
as may be obtained from Excel. Use this function to read in the file
\textbf{crx.data} that is available from the web page
\url{http://mlearn.ics.uci.edu/databases/credit-screening/}. Check the file
\textbf{crx.names} to see which columns should be numeric, which categorical
and which logical.
Make sure that the number of missing values in each column matches the
number given in the file \textbf{crx.names}.
\end{fmpage}

For a first pass at inputting the data, try:
<>=
crxpage <- "http://mlearn.ics.uci.edu/databases/credit-screening/crx.data"
crx <- read.csv(url(crxpage), header=TRUE)
@ %

\begin{fmpage}{36pc}
\exhead{21\textbf{*}} For a challenging data input task, input the data from
\textbf{bostonc.txt}.\footnote{Use \texttt{datafile("bostonc")} to place it
in the working directory, or access the copy on the DVD.} Examine the
contents of the initial lines of the file carefully before trying to read it
in. It will be necessary to change \texttt{sep}, \texttt{comment.char} and
\texttt{skip} from their defaults. Note that \verb!\t! denotes a tab
character.
\end{fmpage}

\vspace*{9pt}

\section{Data Input from a Web Page}

\begin{fmpage}{36pc}
\exhead{22\textbf{*}} With a live internet connection, files can be read
directly from a web page. Here are examples:
<>=
webfolder <- "http://www.maths.anu.edu.au/~johnm/datasets/text/"
webpage <- paste(webfolder, "molclock.txt", sep="")
molclock <- read.table(url(webpage))
@ %
At a time when a live internet connection is available, use this approach to
input the file \textbf{travelbooks.txt} that is available from this same web
page.
\end{fmpage}
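
For Exercise 21, the roles of \texttt{sep}, \texttt{skip} and
\texttt{comment.char} can be sketched on made-up input. The lines used here
are hypothetical and are not the contents of \textbf{bostonc.txt}; they are
supplied through the \texttt{text} argument of \texttt{read.table()}, so
that no file is needed. The names \texttt{toylines} and \texttt{toy} are
likewise chosen only for illustration:
\begin{verbatim}
## Hypothetical input: two title lines, a tab-separated header,
## two data rows, and a comment line that starts with "%"
toylines <- c("Made-up title line",
              "Another line to be skipped",
              "id\tvalue\tnote",
              "1\t2.5\tfirst",
              "% this line is a comment",
              "2\t3.5\tsecond")
toy <- read.table(text=toylines, sep="\t", skip=2, header=TRUE,
                  comment.char="%")
toy
str(toy)
\end{verbatim}
In this sketch \texttt{skip=2} discards the two title lines, \verb!sep="\t"!
splits fields at tab characters, and \verb!comment.char="%"! causes
everything from a percent sign to the end of the line to be ignored. The
settings that a real file needs may well differ, and have to be worked out
by inspecting its first few lines.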