Professor Harry Wechsler
Department of Computer Science
George Mason University
Fairfax, VA 22030
e-mail : wechsler@cs.gmu.edu
web : http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
GEORGE MASON UNIVERSITY
FALL 2002
CS 750
Theory and Applications of Data Mining
Class Information
001 00892 W 7:20 p.m. – 10:00 p.m. R A205
Office Hours
W 6:00 p.m. - 7:00 p.m. or by appointment (SITE II - Rm. 461)
Textbook
1. Data Mining : Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001
web site for slides : http://www.cs.sfu.ca/~han/bk
References
1. V. Cherkassky and F. Mulier, Learning from Data : Concepts, Theory, and Methods, John Wiley, 1999.
2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
4. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
5. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning : Data Mining, Inference, and Prediction, Springer, 2001.
Course Description
Concepts and techniques in data mining and their multidisciplinary applications. Topics include data warehousing and databases, data cleaning and transformation, pattern transformation and data compression, concept description, association and correlation rules, data classification and predictive modeling, clustering, performance analysis and scalability, data mining in advanced database systems including text, audio and images, and emerging themes and future challenges related to the forthcoming semantic web. A team term project and a topical review are required.
Motivation
The explosive growth in generating, collecting and storing data has created an urgent need for new techniques and automated tools that can intelligently assist us in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing from areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students a first opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.
Goals
The objective of this course is to introduce graduate students to current research, technological advances, and trends in data mining. Data mining, which supports knowledge discovery in databases (KDD), is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases. Students will be exposed to the above topics via lectures and appropriate reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining.
Follow-up with Professor Wechsler : 1. INFT 844, Pattern Recognition, Spring 2003, and 2. PhD dissertations.
Grading
PROJECT → 75 %
IN-DEPTH REVIEW → 25 %
Term Project
Students work in teams on the term project.
The scope and range of the project are to be agreed with the instructor.
The task involves significant amounts of data.
The project includes the following STEPS :
1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling.
3. Cleaning and integration / Preprocessing.
4. Transformation / Reduction.
5. Data Mining.
6. Modeling, test & evaluation, and performance assessment.
7. Visualization and knowledge discovery.
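As a rough illustration only, the data-handling STEPS 2 through 6 above might be sketched in Python as follows. Nothing here is prescribed by the course: the function names, the toy record layout (numeric features followed by a class label), the min-max scaling, and the 1-nearest-neighbor classifier are all assumptions chosen to keep the sketch short. An actual project would use data and methods agreed with the instructor.

```python
import random


def select_sample(records, fraction, seed=0):
    """STEP 2 (assumed form): draw a reproducible random sample of the records."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def clean(records):
    """STEP 3 (assumed form): drop records that contain a missing (None) field."""
    return [r for r in records if None not in r]


def min_max_scale(records):
    """STEP 4 (assumed form): rescale each feature to [0, 1]; the label (last field) is kept."""
    n_features = len(records[0]) - 1
    lows = [min(r[i] for r in records) for i in range(n_features)]
    highs = [max(r[i] for r in records) for i in range(n_features)]
    scaled = []
    for r in records:
        feats = [
            (r[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in range(n_features)
        ]
        scaled.append((*feats, r[-1]))
    return scaled


def nearest_neighbor_predict(train, features):
    """STEP 5 (assumed method): 1-nearest-neighbor with squared Euclidean distance."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[:-1], features))
    return min(train, key=squared_distance)[-1]


def accuracy(train, test):
    """STEP 6 (assumed metric): fraction of held-out records classified correctly."""
    hits = sum(nearest_neighbor_predict(train, r[:-1]) == r[-1] for r in test)
    return hits / len(test)


if __name__ == "__main__":
    # Hypothetical toy data: (feature_1, feature_2, label); one record has a missing value.
    raw = [
        (1.0, 10.0, "low"), (2.0, 12.0, "low"), (1.5, None, "low"),
        (8.0, 30.0, "high"), (9.0, 28.0, "high"), (2.5, 11.0, "low"),
        (8.5, 31.0, "high"), (1.2, 9.0, "low"),
    ]
    data = min_max_scale(clean(select_sample(raw, fraction=1.0)))
    split = int(len(data) * 0.7)
    train, test = data[:split], data[split:]
    print("held-out accuracy:", accuracy(train, test))
```

For brevity the sketch scales the full sample before splitting it; a careful evaluation would fit the scaling on the training portion only and then apply it to the held-out records.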
Reviews and class presentations are conducted stepwise
throughout the course (see tentative schedule).
Final Project Presentation (SLIDES) (at most 30 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Assessment of your project.
Final Project Presentation (HARD COPY) (at most 15 pages)
Submit a Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
| August 28 | Ch. 1 : Introduction – Data Warehouses, Databases, Data Mining and Knowledge Discovery, and the Semantic Web. |
| September 4 | Ch. 2 : Data Warehouse and OLAP Technology. STEP 1 |
| September 11 | Data Cleaning. STEP 2 - 3 |
| September 18 | Ch. 3 : Data Transformation and Preprocessing. STEP 4 |
| September 25 | Ch. 4 : System Architecture. |
| October 2 | Ch. 5 : Concept Description and Performance Evaluation. |
| October 9 | Machine Learning, Pattern Recognition {Bayes, Linear Discriminants, EM}, Statistical Learning Theory (SLT), Structural Risk Minimization (SRM), and Support Vector Machines (SVM). |
| October 16 | Ch. 6 : Mining Association Rules |
| October 23 | No Class |
| October 30 | Ch. 7 : Classification and Prediction. Decision Trees (DT). STEP 5 |
| November 6 | Ch. 7 : Classification and Prediction. Bayes and Naïve Bayes, Bayesian Networks, Neural Networks {BP – Backpropagation}, Evolutionary Computation and Genetic Algorithms (GAs), Fuzzy Systems and Regression. |
| November 13 | Ch. 8 : Cluster Analysis. Self-Organization and Learning Vector Quantization (LVQ) and Radial Basis Functions (RBF). STEP 6 - 7 |
| November 20 | Ch. 9 : Mining Complex Types of Data ; Ch. 10 : Applications and Trends ; Biometrics and Face Recognition |
| December 4 | FINAL PROJECT PRESENTATION |
| December 11 | FINAL PROJECT PRESENTATION |