\part{Informal and Formal Data Exploration -- Laboratory 4} \section{Informal Data Exploration} These exercises explore data that will later be used for exercises in error rate estimation. \begin{fmpage}{36pc} \exhead{1} Look up the help page for the data frame \texttt{Pima.tr2} (\textit{MASS} package), and note the columns in the data frame. The eventual interest is in using use variables in the first seven column to classify diabetes according to \texttt{type}. Here, we explore the individual columns of the data frame. \begin{enumerate} \item Several columns have missing values. Analysis methods inevitably ignore or handle in some special way rows that have one or more missing values. It is therefore desirable to check whether rows with missing values seem to differ systematically from other rows. \vspace*{3pt} Determine the number of missing values in each column, broken down by \texttt{type}, thus: <>= library(MASS) ## Create a function that counts NAs count.na <- function(x)sum(is.na(x)) ## Check function count.na(c(1, NA, 5, 4, NA)) ## For each level of type, count the number of NAs in each column for(lev in levels(Pima.tr2$type)) print(sapply(subset(Pima.tr2, type==lev), count.na)) @ % The function \texttt{by()} can be used to break the calculation down by levels of a factor, avoiding the use of the \texttt{for} loop, thus: <>= by(Pima.tr2, Pima.tr2$type, function(x)sapply(x, count.na)) @ % \item Create a version of the data frame \texttt{Pima.tr2} that has \texttt{anymiss} as an additional column: <>= missIND <- complete.cases(Pima.tr2) Pima.tr2$anymiss <- c("miss","nomiss")[missIND+1] @ % For remaining columns, compare the means for the two levels of \texttt{anymiss}, separately for each level of \texttt{type}. Compare also, for each level of \texttt{type}, the number of missing values. \end{enumerate} \end{fmpage} \begin{fmpage}{36pc} \exhead{2} \begin{itemize} \item[(a)] Use strip plots to compare values of the various measures for the levels of \texttt{anymiss}, for each of the levels of \texttt{type}. Are there any columns where the distribution of differences seems shifted for the rows that have one or more missing values, relative to rows where there are no missing values?\newline Hint: The following indicates how this might be done efficiently: <>= library(lattice) stripplot(anymiss ~ npreg + glu | type, data=Pima.tr2, outer=TRUE, scales=list(relation="free"), xlab="Measure") @ % \end{itemize} \end{fmpage} \begin{fmpage}{36pc} \exhead{2, continued} \begin{itemize} \item[(b)] Density plots are in general better than strip plots for comparing the distributions. Try the following, first with the variable \texttt{npreg} as shown, and then with each of the other columns except \texttt{type}. Note that for \texttt{skin}, the comparison makes sense only for \texttt{type=="No"}. Why? <>= library(lattice) ## npreg & glu side by side (add other variables, as convenient) densityplot( ~ npreg + glu | type, groups=anymiss, data=Pima.tr2, auto.key=list(columns=2), scales=list(relation="free")) @ % \end{itemize} \end{fmpage} \begin{fmpage}{36pc} \exhead{3} Better than either strip plots or density plots may be Q-Q plots. Using \texttt{qq()} from \textit{lattice}, investigate their use. In this exercise, we use random samples from normal distributions to help develop an intuitive understanding of Q-Q plots, as they compare with density plots. \begin{enumerate} \item First consider comparison using (i) a density plot and (ii) a Q-Q plot when samples are from populations in which one of the means is shifted relative to the other. Repeat the following several times, <>= y1 <- rnorm(100, mean=0) y2 <- rnorm(150, mean=0.5) # NB, the samples can be of different sizes df <- data.frame(gp=rep(c("first","second"), c(100,150)), y=c(y1, y2)) densityplot(~y, groups=gp, data=df) qq(gp ~ y, data=df) @ % \item Now make the comparison, from populations that have different standard deviations. For this, try, e.g. <>= y1 <- rnorm(100, sd=1) y2 <- rnorm(150, sd=1.5) @ % Again, make the comparisons using both density plots and Q-Q plots. \end{enumerate} \end{fmpage} \begin{fmpage}{36pc} \exhead{4} Now consider the data set \texttt{Pima.tr2}, with the column \texttt{anymiss} added as above. \begin{enumerate} \item First make the comparison for \texttt{type="No"}. <>= qq(anymiss ~ npreg, data=Pima.tr2, subset=type=="No") @ % Compare this with the equivalent density plot, and explain how one translates into the other. Comment on what these graphs seem to say. \item The following places the comparisons for the two levels of \texttt{type} side by side: <>= qq(anymiss ~ npreg | type, data=Pima.tr2) @ % Comment on what this graph seems to say. \end{enumerate} NB: With \texttt{qq()}, use of ``\texttt{+}'' to get plots for the different columns all at once will not, in the current version of \textit{lattice}, work. \end{fmpage} \vspace*{9pt} \section{Bootstrap sampling} \begin{fmpage}{36pc} \exhead{5} The following takes a with replacement sample of the rows of \texttt{Pima.tr2}. <>= rows <- sample(1:dim(Pima.tr2)[1], replace=TRUE) densityplot(~ bmi, groups=type, data=Pima.tr2[rows, ], scales=list(relation="free"), xlab="Measure") @% Repeat, but using \texttt{anymiss} as the grouping factor, and with different panels for the two levels of \texttt{type}. Repeat for several different bootstrap samples. Are there differences between levels of \texttt{anymiss} that seem consistent over repeated bootstrap samples? \end{fmpage} \vspace*{8pt} \begin{fmpage}{36pc} \exhead{6} Exercise 2 compared density plots, for several of the variables, between rows that had one or more missing values and those that had no missing values. We can use the bootstrapping idea to check the copnsistency of apparent differences across repeated bootstrap samples. \vspace*{3pt} The distribution for \texttt{bmi} gives the impression that it has a different shape, between rows where one or more values was missing and rows where no values were missing, at least for \texttt{type=="Yes"}. One way to check whether this difference is consistent under repeated sampling is to treat the sample as representative of the population, and take repeated with replacement (``bootstrap'') samples from it. \vspace*{3pt} The following takes a bootstrap sample, then showing the Q-Q plot <>= rownum <- 1:dim(Pima.tr2)[1] # generate row numbers chooserows <- sample(rownum, replace=TRUE) qq(anymiss ~ bmi | type, data=Pima.tr2[chooserows, ], auto.key=list(columns=2)) @ % Wrap these lines of code in a function. Repeat the formation of the bootstrap samples and the plots several times. Does the shift in the distribution seem consistent under repeating sampling? \end{fmpage} Judgements based on examination of graphs are inevitably subjective. They do however make it possible to compare differences in the shapes of distributions. Here, the shape difference is of more note than any difference in mean or median. \vspace*{8pt} \begin{fmpage}{36pc} \exhead{7} In the data frame \texttt{nswdemo} (\textit{DAAGxtras} package), compare the distribution of \texttt{re78} for those who received work training (\texttt{trt==1}) with controls (\texttt{trt==0}) who did not. <>= library(DAAGxtras) densityplot(~ re78, groups=trt, data=nswdemo, from=0, auto.key=list(columns=2)) @% The distributions are highly skew. A few very large values may unduly affect the comparison. \vspace*{3pt} A reasonable alternative is to compare values of \texttt{log(re78+23)}. The value 23 is chosen between it is half the minimum non-zero value of \texttt{re78}. Here is the density plot. <>= unique(sort(nswdemo$re78))[1:3] # Examine the 3 smallest values densityplot(~ log(re78+23), groups=trt, data=nswdemo, auto.key=list(columns=2)) @% Do the distribution for control and treated have similar shapes? \end{fmpage} \begin{fmpage}{36pc} \exhead{8} Now examine the displacement, under repeated bootstrap sampling, of one mean relative to the other. Here is code for the calculation: <>= twoBoot <- function(n=999, df=nswdemo, ynam="re78", gp="trt"){ d2 <- numeric(n+1) fac <- df[, gp] if(!is.factor(fac))fac <- factor(fac) if(length(levels(fac)) != 2) stop(paste(gp, "must have 2 levels")) y <- df[, ynam] d2[1] <- diff(tapply(y, fac, mean)) for(i in 1:n){ chooserows <- sample(1:length(y), replace=TRUE) faci <- fac[chooserows] yi <- y[chooserows] d2[i+1] <- diff(tapply(yi, faci, mean)) } d2 } ## d2 <- twoBoot() quantile(d2, c(.025,.975)) # 95% confidence interval @ % \end{fmpage} Note that a confidence interval should not be interpreted as a probability statement. It takes no account of prior probability. Rather, 95\% of intervals that are calculated in this way can be expected to contain the true probability. \vspace*{9pt} \begin{fmpage}{36pc} \exhead{9} Another possibility is to work with a permutation distribution. If the difference between treated and controls is entirely due to sampling variation, then permuting the treatment labels will give another sample from this same distribution. Does the observed difference between treated and controls seem ``extreme'', relative to this permutation distribution? Here is code that may be used. <>= dnsw <- numeric(1000) y <- nswdemo$re78 treat <- nswdemo$trt dnsw[1] <- mean(y[treat==1]) - mean(y[treat==0]) for(i in 2:1000){ trti <- sample(treat) dnsw[i] <- mean(y[trti==1]) - mean(y[trti==0]) } sum(dnsw > dnsw[1])/length(dnsw) 2*min(sum(d2<0)/length(d2), sum(d2>0)/length(d2)) # 2-sided comparison @ % Compare the result with that from the bootstrap approach. \vspace*{3pt} Replace \texttt{re78} with \texttt{log(re78+23)} and repeat the calculations. \vspace*{3pt} Note: In formalizing the result for a test of hypothesis, note that the difference between \texttt{treat==1} and \texttt{treat==1} might go in either direction. \end{fmpage}