Spring 2010: Advanced Pattern Recognition [CS775]

Instructor: Carlotta Domeniconi, Rm 4424 Engineering Building, carlotta@cs.gmu.edu

Office hours: Monday: 3:00pm-4:00pm, or by appointment

Prerequisites: CS 688 or permission of instructor. Some programming experience is expected. Students should be familiar with basic probability and statistics concepts, linear algebra, optimization, and multivariate calculus.

Time: We meet in ST1 122, M 4:30pm - 7:10pm

Useful books: (No required textbook. Reading material will be provided.)

C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Book's companion website

Useful material

Overview on Linear Algebra

Andrew Moore's Tutorials: Collection of tutorials on topics of interest for this class
Schedule of Classes

General Description and Preliminary List of Topics:
The course covers advanced topics in pattern recognition and machine learning. Recent conference and journal papers will be discussed in depth. Tentative topics: Mixture models and EM; Ensemble methods; Co-clustering; Transfer learning; Semi-supervised learning; Learning with external knowledge; Generative approaches to topic modeling; Kernel methods. Actual topics covered will depend on time available and students' interests.
Course Format:
Lectures by the instructor, student presentations, and discussions. Research papers and handouts will be made available. The course requires a project, homework assignments, and exams (exact format to be decided). The homework assignments and the project will require some programming.
Course Project:
The project gives you an opportunity to explore in depth a particular topic or area of the course that interests you. The topic should be related to the material covered in class, but otherwise you are free to choose it. Possible types of projects include:

An application research project: The project demonstrates the use of some techniques discussed in class in an application domain (e.g., text mining, bioinformatics, computer vision, image processing, artificial intelligence, etc.). The properties, drawbacks, and advantages of the techniques used are analyzed within the context of the explored application domain.

A theoretical or methodological research project: A study of different classes of models and approaches; a theoretical proof or experimental demonstration of properties of known algorithms; or the design of a new approach.

Papers on Kernel methods and SVMs: (Under construction)

A tutorial on support vector machines for pattern recognition

On Kernel-target alignment

Sequence and Tree Kernels with Statistical Feature Mining

Making Tree Kernels practical for Natural Language Processing

Semi-supervised graph clustering: a kernel approach

Papers on Information Retrieval and Text Mining: (Under construction)

Computing semantic relatedness using Wikipedia-based explicit semantic analysis

Probabilistic Latent Semantic Analysis

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews

Constructing Informative Priors using Transfer Learning

Papers on Subspace Clustering: (Under construction)

Information-Theoretic Co-clustering

Ensemble Methods and Clustering Ensembles: (Under construction)

Proactive learning: cost-sensitive active learning with multiple imperfect oracles

Get another label? Improving data quality and data mining using multiple noisy labelers

Projective Clustering Ensembles

An ensemble approach to identifying informative constraints for semi-supervised clustering

Solving cluster ensemble problems by bipartite graph partitioning

Cluster ensembles - a knowledge reuse framework for combining multiple partitions

Weighted Cluster Ensembles: Methods and Analysis

Semi-supervised learning: (Under construction)

A probabilistic framework for semi-supervised clustering

An adaptive kernel method for semi-supervised clustering

Metric Learning: (Under construction)

Locally adaptive metrics for clustering high dimensional data

Transfer Learning: (Under construction)

Improving SVM accuracy by training on auxiliary data sources

Mapping and revising Markov logic networks for transfer learning

Self-taught Clustering

Knowledge transfer via multiple model local structure mapping

Software and Data:

The Distance Metric Learning Toolkit: a Matlab toolkit for distance metric learning. It includes the code for the NCA (Neighbourhood Components Analysis) algorithm.

The Transfer Learning Toolkit: a Matlab toolkit for transfer learning. It also contains benchmark datasets consisting of multiple tasks.

MALLET: MAchine Learning for LanguagE Toolkit

Stanford Topic Modeling Toolbox

Matlab Topic Modeling Toolbox

UCI Machine Learning Repository is a repository of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. A beta version of a new and improved site is also available.

UCI Knowledge Discovery in Databases Archive is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas.

Weka is an open source Java package implementing many learning algorithms.

YALE (Yet Another Learning Environment) is another open source Java package. It includes a GUI which allows automation of the whole data process, from feature normalization to feature selection, learning, and cross-validation.

SVM light and LibSVM are two popular implementations of various SVM algorithms.

Pointers to Support Vector Machine and Gaussian Processes Software.

Software, datasets, publications related to data mining.

Microarray Data:

Labeled microarray data.
Performance comparison of classifiers on the above microarray data.

Gene Expression Profiles in Hereditary Breast Cancer. This data set does not have labels.
Here is the text file containing the matrix of expression levels: rows are genes (3226); columns are conditions (22).
Lymphoma (9 classes) data set. Can be used for classification and clustering.
A large collection of microarray data, including Leukemia and Lung cancer data, is available here.
NCI60 data.
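
The hereditary breast cancer expression matrix listed above is described as a text file with genes as rows (3226) and conditions as columns (22). A minimal Python sketch for reading such a file follows. It assumes the file is plain whitespace-delimited numbers with no header row or gene-name column, which the listing does not confirm; the function names `load_expression_matrix` and `transpose` are illustrative only.

```python
def load_expression_matrix(path, n_genes=3226, n_conditions=22):
    """Read a whitespace-delimited numeric matrix: one gene per row.

    Assumption: no header line and no gene-name column. Adjust the
    parsing if the actual file differs.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            rows.append([float(value) for value in fields])

    # Check the shape against the stated dimensions (genes x conditions).
    if len(rows) != n_genes or any(len(r) != n_conditions for r in rows):
        raise ValueError(
            "expected %d x %d matrix, found %d rows" % (n_genes, n_conditions, len(rows))
        )
    return rows


def transpose(matrix):
    """Return conditions x genes, e.g. to cluster conditions instead of genes."""
    return [list(column) for column in zip(*matrix)]
```

Since the data set has no labels, it suits clustering experiments: cluster the rows to group genes, or cluster the transposed matrix to group the 22 conditions.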

Software and Data for Text:

Online NIPS proceedings. NIPS online: the text repository.
TREC (Text REtrieval Conference) data. NIST collection of documents returned by a retrieval system, with relevance judgements.
TMG is a Matlab Toolbox that can be used for various tasks in text mining.
Snowball. Software to perform stemming.
The 'MC' Toolkit. A Toolkit for Creating Vector Models from Text Documents.
Rainbow. Software for statistical text classification.
SPAM Archive.
Classic3 data set. Collections of abstracts from three categories: MEDLINE (abstracts from medical journals); CISI (abstracts from IR papers); CRANFIELD (abstracts from aerodynamics papers).
Classic3 data set processed with word information.
20Newsgroups data set.
Reuters21578 data set.
LSI. A list of papers and Software related to Latent Semantic Indexing.
LPU. A system for learning from Positive and Unlabeled Examples.