Data Mining -- Introduction

Basic Ideas and Tools for Data Mining

This page collects documents and laboratory exercises that provide background for a talk and tutorials on data mining.

Slides for Talk

A Statistical Perspective on Data Mining

A further set of slides covers points for which there will not be room in the talk:
Data Mining Statistical Issues -- Some Further Comments

Notes

Data mining -- ideas and tools
These notes offer a perspective on the nature of data mining and on the tools used. They are still in relatively unpolished draft form.

A case study of an attempt to use propensities to bridge the source/target divide.
This document continues the discussion of Laboratory exercises IV.

Laboratory Notes, and R Scripts

Laboratory Exercises     R Scripts

Background Reading

Ian Ayres 2007, Super Crunchers. Why Thinking-By-Numbers is the New Way to be Smart. Bantam.
[This places data mining in a wider context of data-based decision-making in business, government and consumer affairs. While popular in style and short on analysis detail, it offers a useful overview of ways in which applications of data mining and related analytical techniques are developing and changing, in part because of the new opportunities and challenges of the internet.]

Richard A. Berk 2008, Statistical Learning from a Regression Perspective. Springer.
["... none of the techniques has ever lived up to their most optimistic billing. Widespread misuse has further increased the gap between promised ... and actual performance. ... therefore the tone will be cautious, some might even say dark". More positively, Berk argues that there are new ideas and insights, and insightful new perspectives on more traditional methods.]
Review of Berk's book

Thomas H. Davenport and Jeanne G. Harris 2007, Competing on Analytics: The New Science of Winning. Harvard Business School Press.
[Analytics is a buzzword for the application of data-mining-type approaches in commerce. Davenport and Harris give a useful overview of issues in deploying analytical techniques within organizations: benefits and traps, choice of amenable tasks, the role of management, skill base issues, etc.]

John Maindonald and John Braun 2010, Data Analysis and Graphics Using R - An Example-Based Approach, 3rd edn. Cambridge University Press.
[Of greatest relevance to the course are Chapter 2 on Styles of Data Analysis, Chapters 5 & 6 (through to 6.3) on Linear Models, Chapter 8 (through to 8.3) on logistic regression, Chapter 11 on Tree-based Methods, and Chapter 12 (through to 12.2) on Multivariate Data Exploration & Discrimination.]

Links

Data Mining Methodological Weaknesses and Suggested Fixes.    Overheads
Paper presented at the Australasian Data Mining Conference (AusDM06), Sydney, Nov 29-30, 2006.
Course materials for ANU Data Mining course in 2005 - 2008    Updated and more complete set of lab exercises
Suggestions for getting started on R     New York Times article on R
Graham Williams' data mining web page
(NB in particular rattle, a GUI for a data mining toolkit)
John Maindonald's data mining talks and papers