Spring 2014: Data Mining [CS484]

Professor: Carlotta Domeniconi, Rm 4424 ENG, carlotta\AT\cs.gmu.edu
Teaching Assistant: Tanwistha Saha, tsaha\AT\masonlive.gmu.edu
Office Hours: (Professor) TR 4:30pm - 5:30pm, or by appointment; (TA) M 4:30pm - 6:30pm, Rm. 4456
Prerequisites: CS310 and STAT344 (C or better in both).
Some programming experience is expected. Students should be familiar with basic probability and statistics concepts, and linear algebra.
Location and Time: We meet in the Art and Design Building 2026, TR 12:00pm - 1:15pm
Textbook: P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006. Book's companion website
Course Web Page

General Description and Preliminary List of Topics

Data mining is the process of automatically discovering useful information in large data repositories. The course covers key concepts and algorithms at the core of data mining.

Topics include classification, clustering, association analysis, and anomaly detection.

Outcomes

The ability to apply computing principles, probability and statistics relevant to the data mining discipline to analyze data.
A thorough understanding of model programming with data mining tools, and of algorithms for estimation, prediction, and pattern discovery.
The ability to analyze a problem, identifying and defining the computing requirements appropriate to its solution: data collection and preparation, functional requirements, selection of models and prediction algorithms, software, and performance evaluation.
The ability to understand performance metrics used in the data mining field to interpret the results of applying an algorithm or model, to compare methods and to reach conclusions about data.
The ability to communicate effectively to an audience the steps followed and the results obtained in solving a data mining problem (through a term project).

Grading

Assignments: 15%
Midterm: 25%
Final: 25%
Project: 30%
Participation: 5%

Exams are closed book. Assignments must be completed individually. Group work is NOT allowed unless otherwise stated by the instructor. Any deviation from this policy will be considered a violation of the GMU Honor Code. In addition, the CS department has its own Honor Code policies; any deviation from these is also considered an Honor Code violation.

Software and Data:

UCI Machine Learning Repository is a repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas.

More datasets

Resources: software and data

Weka is an open-source Java package implementing many learning algorithms.

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

SVM light and LibSVM are two popular implementations of various SVM algorithms.

TMG is a MATLAB toolbox that can be used for various tasks in text mining.