As in 2006 and 2007, John Maindonald and Graham Williams will share
responsibility for this course.
Lectures and laboratories in 2008
Lectures will be held on Thursdays 14.00 - 16.00h, in Engineering (ENGN) T,
and Fridays 9.00-10.00h, all in G35.
The weekly laboratory is held on Thursdays, from
9.00 to 11.00am. From Thursday September 4, the laboratory will be held in
the RSPAS Joint Schools Training Lab, in the RSSS wing of the Coombs
Building. The RSSS wing is the hexagon between the main entrance and
the Chancelry, on the ground floor.
To get to the laboratory, go to the breezeway (marked X on the map;
if outside the University, you will need to log in via a proxy to get
to the map) between the Tea Room quadrangle and East Road (the tea
room balcony is marked Y on the map). If you stand with your back to
the Chancelry building, there will be doors on your left and right. Go
through the door on your left; it is next to a white door that
contains a fire hose. Continue along that corridor, which ends at
the entrance to the training lab.
Data mining gives a new twist on data deployment and analysis
methodologies that have been developed over the past century or
more. Recent developments have included:
Huge increases in computational power and in computer storage.
A synergy between theoretical and algorithmic advances, advances
in software and in computational power.
Integration of what were formerly stand-alone abilities into single
software systems with a single interface and command language.
(The R system is a prime example; see below)
New types of data, and new opportunities for collecting data, arising
from advances in instrumentation, from the internet and from widespread
deployment of databases. Chapter 5 of Ayres (2007), entitled "Why now?",
has interesting commentary on the impact of such advances.
Classification is a major pre-occupation of data mining, with a more
limited focus on regression with a continuous outcome variable.
The aim is typically prediction rather than the more challenging
task of interpretation of model parameters.
Students intending to take this course will find it useful to come with
some initial familiarity with the open-source R system for scientific and
statistical computing and for graphics. The R system is available without
charge for downloading from the internet.
It is a marvelous example of what can be achieved when highly
skilled specialists co-operate internationally, using the internet for
communication and co-ordination.
Background Reading
Ian Ayres 2007, Super Crunchers. Why Thinking-By-Numbers is the New
Way to be Smart. Bantam.
[This places data mining in a wider context of data-based
decision-making in business, government and consumer affairs. While
popular in style and short on analysis detail, it offers a useful
overview of ways in which applications of data mining and related
analytical techniques are developing and changing, in part because of
the new opportunities and challenges of the internet.]
Thomas H. Davenport and Jeanne G. Harris 2007, Competing on
Analytics: The New Science of Winning. Harvard Business School
Press. [Analytics is a buzzword for the application of data
mining type approaches in commerce. Davenport and Harris give a
useful overview of issues for the deployment of analytical techniques
within organizations - benefits and traps, choice of amenable tasks,
the role of management, skill base issues, etc.]
John Maindonald and John Braun 2007, Data Analysis and Graphics Using R - An Example-Based Approach, 2nd edn, Cambridge University Press.
[Of greatest relevance to the course are Chapter 2
on Styles of Data Analysis, Chapters 5 & 6 (through to 6.3) on Linear Models,
Chapter 8 (through to 8.3) on logistic regression, Chapter 11 on Tree-based
Methods, and Chapter 12 (through to 12.2) on Multivariate Data Exploration &
Discrimination.]