Datasets consist of observations sampled from a population. They can be as
large as terabytes with many variables and many records. The population
may consist of subpopulations, each with a different set of dependencies
among the variables. Data mining provides tools and techniques to identify
this structure, which enables valid predictions.
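As a small illustration of why subpopulation structure matters, the sketch below uses hypothetical toy data (not course material) in which two subpopulations have opposite dependencies between x and y. The helper name `pearson` and all values are assumptions made for this example. Within each group the correlation is perfect (+1 and -1), yet the pooled correlation is zero, so an analysis that ignores the subpopulations would miss both relationships.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: two subpopulations with opposite dependencies.
group_a_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_a_y = [2.0 * x + 1.0 for x in group_a_x]    # y rises with x
group_b_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b_y = [11.0 - 2.0 * x for x in group_b_x]   # y falls with x

r_a = pearson(group_a_x, group_a_y)
r_b = pearson(group_b_x, group_b_y)
# Pooling the groups hides both dependencies.
r_pooled = pearson(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: {r_a:.2f}, group B: {r_b:.2f}, pooled: {r_pooled:.2f}")
```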
Data mining is the name given to a variety of new analytical and statistical
techniques that are already widely used in business and are starting to
spread into social science research. Closely related terms are machine
learning, pattern recognition, and predictive analytics. Data mining methods
can be applied to visual and to textual data, but the focus of this class is
on the application of data mining to symbolic or numerical data. In this
area, data mining offers interesting alternatives to conventional statistical
modeling methods such as regression and its offshoots.
Each student will undertake a data mining analysis project as a final
paper, typically analyzing a dataset chosen by the student.
The topic list may include but is not limited to:
Exploratory Data Analysis
Distance and Similarity Measures
- Hierarchical Clustering: Agglomerative and Divisive
- Linear Manifold Clustering
- Graph Theoretic Clustering
Prediction and Classification with K-Nearest Neighbors
Classification and Regression Trees
- Training and Test Sets
Course objectives:
- Understand the mathematical and statistical foundations of the methodology and algorithms of data mining techniques
- Become proficient with data mining software such as WEKA and R
- Given a dataset, be able to discover patterns and relationships in the data that may be used for descriptive modeling or to make valid predictions
Understanding of the mathematical and statistical foundations will be
assessed through a midterm (40%) and homework (20%). Proficiency in using
data mining software and in discovering patterns and relationships in a
dataset will be assessed through a project (40%).
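To make the weighting concrete, here is a minimal sketch of how the three components could combine into a single course score. The syllabus gives only the percentages, so the weighted sum, the 0-100 scale, the function name `course_score`, and the example scores are all assumptions made for illustration.

```python
# Weights taken from the assessment breakdown above.
WEIGHTS = {"midterm": 0.40, "homework": 0.20, "project": 0.40}

def course_score(scores):
    """Weighted course score from component scores on an assumed 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical student: 0.40 * 80 + 0.20 * 90 + 0.40 * 85
example = {"midterm": 80.0, "homework": 90.0, "project": 85.0}
total = course_score(example)
print(f"course score: {total:.1f}")
```

Under these assumptions, the midterm and the project each carry twice the weight of the homework.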