Data Mining Methods

Instructors: Distinguished Professor Paul Attewell & Distinguished Professor Robert M. Haralick

Paul Attewell

Professor Attewell's recent research has been in the sociology of education, with a focus on the relationship between educational institutions and social inequality. He has studied middle and high schools and colleges. His co-authored book Passing the Torch: Does Higher Education Pay Off Across the Generations? won the American Educational Research Association's Outstanding Book Award and the Grawemeyer Award in Education. His current research focuses on the reasons for low degree completion rates in non-selective colleges, and includes randomized controlled field experiments in which lower-income undergraduates are encouraged, through monetary incentives, to increase their "academic momentum" in college.

Robert Haralick

Professor Haralick has made a series of contributions in the field of computer vision. In the high-level vision area, he has worked on inferring 3D geometry from one or more perspective projection views. He has also identified a variety of vision problems which are special cases of the consistent labeling problem. His papers on consistent labeling, arrangements, relation homomorphism, matching, and tree search translate some specific computer vision problems to the more general combinatorial consistent labeling problem and then discuss the theory of the look-ahead operators that speed up the tree search. The most basic of these is called Forward Checking. This gives a framework for the control structure required in high-level vision problems. He has also extended the forward-checking tree search technique to propositional logic.

In the low- and mid-level areas, Professor Haralick has worked in image texture analysis using spatial gray tone co-occurrence texture features. These features have been used with success on biological cell images, x-ray images, satellite images, aerial images and many other kinds of images taken at small and large scales. In the feature detection area, Professor Haralick has developed the facet model for image processing. The facet model states that many low-level image processing operations can be interpreted relative to what the processing does to the estimated underlying gray tone intensity surface of which the given image is a sampled noisy version. The facet papers develop techniques for edge detection, line detection, noise removal, peak and pit detection, as well as a variety of other topographic gray tone surface features. For shape analysis and extraction he developed the techniques of mathematical morphology, including the mathematical morphology sampling theorem and recursive morphological operations.

His most recent work is in the machine learning area, particularly in the manifold clustering of high-dimensional data sets and the application of pattern recognition to mathematical combinatorial problems. His current work is in the learning of knowledge and structure through relation decomposition.

Course Description

Data mining (DM) is the name given to a variety of new analytical and statistical techniques that are already widely used in business and are starting to spread into social science research. Other closely related terms are ‘machine learning,’ ‘pattern recognition,’ and ‘predictive analytics.’ Data mining methods can be applied to visual and to textual data, but the focus of this class is on the application of DM to quantitative or numerical data. In this area, DM offers interesting alternatives to conventional statistical modeling methods such as regression and its offshoots.

This class is taught jointly by a professor of computer science and a professor of sociology, and typically enrolls a mix of computer science and social science doctoral students. It aims to provide an introduction to data mining methods and their application to data analysis. The course reviews the main DM techniques and explains the logic of each. It emphasizes contrasts between conventional statistical analyses and DM approaches. Students work with each technique in a computer classroom, using JMP Pro software. Each student will undertake a DM analysis project as a final paper, typically analyzing a dataset chosen by the student.