# Data Mining

**Rationale**

Datasets consist of observations sampled from a population. They can be as

large as terabytes with many variables and many records. The population

may consist of subpopulations with each subpopulation having dierent sets

of dependencies among the variables. Data Mining has tools and techniques

to identify the structure that enable making valid predictions.

**Course Description:**

Data mining is the name given to a variety of new analytical and statisti-

cal techniques that are already widely used in business, and are starting to

spread into social science research. Other closely-related terms are machine

learning 'pattern recognition' and predictive analytics. Data mining methods

can be applied to visual and to textual data, but the focus of this class is

on the application of data mining to symbolic or numerical data. In this

area, data mining oers interesting alternatives to conventional statistical

modeling methods such as regression and its oshoots.

Each student will undertake a data mining analysis project as a nal

paper, typically analyzing a dataset chosen by the student.

**Topic List**

The topic list may include but is not limited to:

**Exploratory Data Analysis**

**Association Rules**

**Distance and Similarity Measures**

**Clustering**

-K-means

-Hierarchical Clustering: Agglomerative and Divisive

-Subspace Clustering

-Linear Manifold Clustering

-Graph Theoretic Clustering

-Spectral Clustering

-Mixture Models

-Biclustering

-Density-based Clustering

**Prediction and Classication with K-Nearest Neighbors**

**Discriminant Analysis**

**Classication and Regression Trees**

**Random Forests**

**Logistic Regression**

**Validation Techniques**

-Training and Test Sets

-Permutation Tests

-Bootstrap Resampling

**Learning Goals**

- Understand the mathematical and statistics foundations of the methodology and algorithms of data mining techniques
- Become procient with data mining software such as WEKA and R
- Given a dataset, be able to discover patterns and relationships in the data that may be used for descriptive modeling or to make valid predictions

**Assessment**

Assessment of understanding the mathematical and statistic foundations will

be done through a midterm (40%) and homeworks (20%). Assessment of

prociency in using data mining software and discovery of patterns and re-

lationships in a data set will be done by a project (40%).