Professor Harry Wechsler

Department of Computer Science

George Mason University

Fairfax, VA 22030

e-mail : wechsler@cs.gmu.edu

web : http://cs.gmu.edu/~wechsler/

(703) 993-1533 (office)

(703) 993-1530 (sec)

(703) 993-1710 (fax)

 

GEORGE MASON UNIVERSITY

FALL 2003

 

CS 750 Theory and Applications of Data Mining

Class Information

001  00960    W   7:20 p.m. –  10:00 p.m.  R    A125

Prerequisites

CS 450 (“databases”) and CS 580 (“AI”) or instructor’s permission

Office Hours

W   6:00 p.m. - 7:00 p.m. or by appointment (SITE II - Rm. 461)

Textbook

1. Data Mining : Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001

web site for slides  : http://www.cs.sfu.ca/~han/bk

References

1. V. Cherkassky and F. Mulier, Learning from Data : Concepts, Theory, and Methods, John Wiley, 1999.

2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.

3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.

4. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.

5. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning : Data Mining, Inference, and Prediction, Springer, 2001.

Course Description

Concepts and techniques in data mining and their multidisciplinary applications. Topics include data warehousing and databases, data cleaning and transformation, pattern transformation and data compression, concept description, association and correlation rules, data classification and predictive modeling, clustering, performance analysis and scalability, data mining in advanced database systems including text, audio, and images, and emerging themes and future challenges related to the forthcoming semantic web. A term team project and a topical review are required.

Motivation

The explosive growth in the generation, collection, and storage of data has created an urgent need for new techniques and automated tools that can intelligently assist us in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing from areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students the opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.

Goals

The objective of this course is to introduce graduate students to current research and technological advances and trends in data mining. Data mining, which supports knowledge discovery in databases (KDD), is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases. Students will be exposed to the above topics via lectures and appropriate reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining.

 

Follow-Up with Professor Wechsler : 1. INFT 844 - Pattern Recognition - Spring 2005 and 2. PhD dissertations.

Grading

PROJECT → 75 %

IN-DEPTH REVIEW → 25 %

Term Project

Students work in teams on a term project.
The scope and range of the project are to be agreed upon with the instructor.
The task involves significant amounts of data.
The project includes the following STEPS :


1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling.
3. Cleaning and integration / Preprocessing.
4. Transformation / Reduction.
5. Data Mining.
6. Modeling, test & evaluation, and performance assessment.
7. Visualization and knowledge discovery.
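As a rough illustration of how STEPS 2 through 6 might fit together on a toy problem, here is a minimal Python sketch. Everything in it is an assumption made for illustration only: the function names, the synthetic two-feature records, and the choice of a nearest-centroid classifier for STEP 5 are hypothetical and are not prescribed by the course. Actual projects use methods and data agreed upon with the instructor.

```python
import random

def select_sample(records, fraction, seed=0):
    """STEP 2 (sketch): draw a reproducible random sample of the records."""
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)

def clean(records):
    """STEP 3 (sketch): drop records that contain a missing value (None)."""
    return [r for r in records if None not in r]

def min_max_scale(records):
    """STEP 4 (sketch): rescale each feature to [0, 1]; the last field is the label."""
    features = [r[:-1] for r in records]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]
    scaled = []
    for r in records:
        feats = tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(r[:-1], lows, highs)
        )
        scaled.append(feats + (r[-1],))
    return scaled

def fit_centroids(train):
    """STEP 5 (sketch): compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for *feats, label in train:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, feats):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(feats, centroids[lab])),
    )

def accuracy(centroids, test):
    """STEP 6 (sketch): fraction of test records classified correctly."""
    hits = sum(predict(centroids, r[:-1]) == r[-1] for r in test)
    return hits / len(test)

if __name__ == "__main__":
    # Hypothetical synthetic data: two Gaussian blobs plus one incomplete record.
    rng = random.Random(0)
    raw = [(rng.gauss(0, 1), rng.gauss(0, 1), "a") for _ in range(50)]
    raw += [(rng.gauss(4, 1), rng.gauss(4, 1), "b") for _ in range(50)]
    raw.append((None, 1.0, "a"))

    data = min_max_scale(clean(raw))
    rng.shuffle(data)
    split = int(0.7 * len(data))
    train, test = data[:split], data[split:]

    model = fit_centroids(train)
    print("held-out accuracy:", accuracy(model, test))
```

The held-out split in the demo matters: accuracy measured on the training records alone would overstate performance, which is why STEP 6 separates training from testing and evaluation.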

Reviews and class presentations are conducted stepwise
throughout the course (see the tentative schedule below). A first draft of each step is due
the week that STEP is listed in the tentative schedule.
Based upon the feedback received in class, the same step is completed and
presented again the following week.

Final Project Presentation (SLIDES) (at most 30 minutes)

1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5), and model selection (“training strategy”).

2. Brief Description of STEPS 1 – 7.

3. Performance Evaluation and Assessment of your project.

Final Project Presentation (HARD COPY) (at most 15 pages)

Submit a Technical Report (TR) that covers your Final Project Presentation.

 

Tentative Schedule

August 27

Ch. 1 : Introduction – Data Warehouses, Databases, Data Mining and Knowledge Discovery, and the Semantic Web.

September 3

Ch. 2 : Data Warehouse and OLAP Technology.     STEP 1

September 10

Ch. 3 : Data Transformation and Preprocessing.     STEPS 2 - 3

September 17

Ch. 4 :  System Architecture. Machine Learning : DTs (Decision Trees).  STEP 4

September 24

Ch. 5 : Concept Description

October 1

Neural Networks : MLP (MultiLayer Perceptrons and BackPropagation), Clustering and K-Means, and RBFs (Radial Basis Functions)

October 8

Performance Assessment : Training (and Validation), Testing and Evaluation. Ch. 6 : Mining Association Rules : Apriori Algorithm

October 15

No Class

October 22

Ch. 7 : Classification and Prediction.  

October 29

Ch. 8 : Cluster Analysis. Self-Organization and Learning Vector Quantization (LVQ).  STEP 5

November 5

Spatial and Temporal Data Mining

November 12

Ch. 9 : Mining Complex Types of Data; Ch. 10 : Applications and Trends; Biometrics and Face Recognition.     STEPS 6 - 7

November 19

Pattern Recognition {Bayes, Linear Discriminant, EM}, Statistical Learning Theory (SLT), Generalization and Prediction Risk, Structural Risk Minimization (SRM), and Support Vector Machines (SVM).

December 3

FINAL PROJECT PRESENTATION

December 10

FINAL PROJECT PRESENTATION