Professor Harry Wechsler
Department of Computer Science
George Mason University
Fairfax, VA 22030
e-mail : wechsler@cs.gmu.edu
web : http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
GEORGE MASON UNIVERSITY
FALL 2003
CS 750
Theory and Applications of Data Mining
Class Information
Section 001 (00960) : W 7:20 p.m. – 10:00 p.m., R A125
Prerequisites
CS 450 (“databases”) and CS 580 (“AI”) or instructor’s permission
Office Hours
W 6:00 p.m. – 7:00 p.m. or by appointment (SITE II - Rm. 461)
Textbook
1. Data Mining : Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001
web site for slides : http://www.cs.sfu.ca/~han/bk
References
1. V. Cherkassky and F. Mulier, Learning from Data : Concepts, Theory, and Methods, John Wiley, 1999.
2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
4. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
5. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning : Data Mining, Inference, and Prediction, Springer, 2001.
Course Description
Concepts and techniques in data mining and their multidisciplinary applications.
Topics include data warehousing and databases, data cleaning and transformation,
pattern transformation and data compression, concept description, association
and correlation rules, data classification and predictive modeling, clustering,
performance analysis and scalability, data mining in advanced database systems
including text, audio, and images, and emerging themes and future challenges
related to the forthcoming semantic web.
A team term project and a topical review are required.
Motivation
The explosive growth in generating, collecting, and storing data has created an
urgent need for new techniques and automated tools that can intelligently
assist us in transforming the vast amounts of data into useful information and
knowledge. Data mining is a multidisciplinary field, drawing from areas
including AI, database technology, data visualization, information retrieval,
high performance computing, machine learning, mathematical programming, neural
networks, pattern recognition, statistical learning theory, and statistics.
The course gives graduate students the opportunity to learn about the
management and use of large data repositories through a multidisciplinary
approach.
Goals
The objective of this course is to introduce graduate students to current
research, technological advances, and trends in data mining.
Data mining, which supports knowledge discovery in databases (KDD), is
the automated extraction of patterns representing knowledge implicitly stored
in large databases, data warehouses, and other massive information
repositories. The course focuses on issues
related to the feasibility, usefulness, efficiency, and scalability of
automated techniques for the discovery of patterns hidden in large
databases. Students will be exposed to
the above topics via lectures and appropriate reading assignments, including
recent journal and conference papers. Students are expected to complete a term
project and to make an in-depth presentation on a topic related to data mining.
Follow-up with Professor Wechsler : (1) INFT 844, Pattern Recognition, Spring 2005 and (2) PhD dissertations.
Grading
PROJECT → 75 %
IN-DEPTH REVIEW → 25 %
Term Project
Students work in teams on the term project.
The scope and range of the project are to be agreed upon with the instructor.
The task involves significant amounts of data.
The project includes the following STEPS :
1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling.
3. Cleaning and integration / Preprocessing.
4. Transformation / Reduction.
5. Data Mining.
6. Modeling, test & evaluation, and performance assessment.
7. Visualization and knowledge discovery.
Reviews and class presentations are conducted stepwise
throughout the course (see tentative schedule). First, a draft for each STEP is
expected the week the STEP is listed in the tentative schedule below.
Based upon feedback received in class, the same STEP is then completed and
presented again the following week.
Final Project Presentation (SLIDES) (at most 30 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Performance Evaluation and Assessment of your project.
Final Project Presentation (HARD COPY) (at most 15 pages)
Submit a Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
Date | Topics
August 27 | Ch. 1 : Introduction – Data Warehouses, Databases, Data Mining and Knowledge Discovery, and the Semantic Web.
September 3 | Ch. 2 : Data Warehouse and OLAP Technology. STEP 1
September 10 | Ch. 3 : Data Transformation and Preprocessing. STEP 2 - 3
September 17 | Ch. 4 : System Architecture. Machine Learning : DTs (Decision Trees). STEP 4
September 24 | Ch. 5 : Concept Description
October 1 | Neural Networks : MLP (Multilayer Perceptrons and Backpropagation), Clustering and K-Means, and RBFs (Radial Basis Functions)
October 8 | Performance Assessment : Training (and Validation), Testing and Evaluation. Ch. 6 : Mining Association Rules : Apriori Algorithm
October 15 | No Class
October 22 | Ch. 7 : Classification and Prediction.
October 29 | Ch. 8 : Cluster Analysis. Self-Organization and Learning Vector Quantization (LVQ). STEP 5
November 5 | Spatial and Temporal Data Mining
November 12 | Ch. 9 : Mining Complex Types of Data; Ch. 10 : Applications and Trends; Biometrics and Face Recognition. STEP 6 - 7
November 19 | Pattern Recognition {Bayes, Linear Discriminant, EM}, Statistical Learning Theory (SLT), Generalization and Prediction Risk, Structural Risk Minimization (SRM), and Support Vector Machines (SVM).
December 3 | FINAL PROJECT PRESENTATION
December 10 | FINAL PROJECT PRESENTATION