Spring 2010: Advanced Pattern Recognition [CS775]
- Instructor: Carlotta Domeniconi, Rm 4424 Engineering Building, carlotta@cs.gmu.edu
- Office hours: Monday 3:00pm-4:00pm, or by appointment
- Prerequisites: CS 688 or permission of instructor. Some programming experience is expected. Students should be familiar with basic probability and statistics concepts, linear algebra, optimization, and multivariate calculus.
- Time: We meet in ST1 122, M 4:30pm - 7:10pm
Useful books: (No required textbook. Reading material will be provided.)
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. (Book's companion website)
General Description and Preliminary List of Topics:
The course covers advanced topics in pattern recognition and machine learning.
Recent conference and journal papers will be discussed in depth.
Tentative topics: Mixture models and EM; Ensemble methods; Co-clustering;
Transfer learning; Semi-supervised learning; Learning with external knowledge;
Generative approaches to topic modeling; Kernel methods.
Actual topics covered will depend on time available and students' interests.
Course Format:
Lectures by the instructor, students' presentations, and discussions.
Research papers and handouts will be made available.
The course requires a project, homework assignments, and exams (exact format to be decided).
Homework assignments and the project will require some programming.
Course Project:
The project gives you an opportunity to explore in depth a particular topic/area of the course that interests you. The topic of the project, of course, should be related to the material covered in class, but otherwise you are free to select the specific topic. Possible types of projects include:
- An application research project: The project demonstrates the application of some techniques discussed in class in an application domain (e.g., text mining, bioinformatics, computer vision, image processing, artificial intelligence, etc.). The properties, drawbacks, and advantages of the techniques used are analyzed within the context of the explored application domain.
- A theoretical or methodological research project: A study of different classes of models and approaches; proving, either theoretically or experimentally, properties of known algorithms; or designing a new approach.
Papers on Kernel methods and SVMs: (Under construction)
A tutorial on support vector machines for pattern recognition
On Kernel-target alignment
Sequence and Tree Kernels with Statistical Feature Mining
Making Tree Kernels practical for Natural Language Processing
Semi-supervised graph clustering: a kernel approach
Papers on Information Retrieval and Text Mining: (Under construction)
Computing semantic relatedness using Wikipedia-based explicit semantic analysis
Probabilistic Latent Semantic Analysis
Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews
Constructing Informative Priors using Transfer Learning
Papers on Subspace Clustering: (Under construction)
Information-Theoretic Co-clustering
Ensemble Methods and Clustering Ensembles: (Under construction)
Proactive learning: cost-sensitive active learning with multiple imperfect oracles
Get another label? Improving data quality and data mining using multiple noisy labelers
Projective Clustering Ensembles
An ensemble approach to identifying informative constraints for semi-supervised clustering
Solving cluster ensemble problems by bipartite graph partitioning
Cluster ensembles - a knowledge reuse framework for combining multiple partitions
Weighted Cluster Ensembles: Methods and Analysis
Semi-supervised learning: (Under construction)
A probabilistic framework for semi-supervised clustering
An adaptive kernel method for semi-supervised clustering
Metric Learning: (Under construction)
Locally adaptive metrics for clustering high dimensional data
Transfer Learning: (Under construction)
Improving SVM accuracy by training on auxiliary data sources
Mapping and revising Markov logic networks for transfer learning
Self-taught Clustering
Knowledge transfer via multiple model local structure mapping
Software and Data:
The Distance Metric Learning Toolkit: Matlab toolkit for distance metric learning. It includes the code for the NCA (Neighbourhood Components Analysis) algorithm.
The Transfer Learning Toolkit: Matlab toolkit for transfer learning. It also contains benchmark datasets consisting of multiple tasks.
MALLET: MAchine Learning for LanguagE Toolkit
Stanford Topic Modeling Toolbox
Matlab Topic Modeling Toolbox
UCI Machine Learning Repository is a repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
A beta version of a new and improved site is also available
UCI Knowledge Discovery in Databases Archive is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas
Weka is an open source Java package implementing many learning algorithms
YALE (Yet Another Learning Environment) is another open source Java package. It includes a GUI which allows automation of the whole data process from feature normalization to feature selection, learning and cross-validation
SVM light and LibSVM are two popular implementations of various SVM algorithms
Pointers to Support Vector Machine and Gaussian Processes Software.
Software, datasets, publications related to data mining.
Microarray Data:
Labeled microarray data.
Performance comparison of classifiers on the above microarray data.
Gene Expression Profiles in Hereditary Breast Cancer.
This data set does not have labels.
Here is the text file containing the matrix of expression levels: rows are genes (3226); columns are conditions (22).
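The genes-by-conditions layout described above could be read with a sketch like the following. It is only an illustration: it assumes the file is plain whitespace-delimited numbers with no header row or gene labels (the actual file format is not stated here), and the function names `load_expression_matrix` and `transpose` are hypothetical helpers, not part of any course software. A tiny made-up file stands in for the real 3226 x 22 matrix.

```python
import os
import tempfile


def load_expression_matrix(path):
    """Read a whitespace-delimited numeric matrix: one gene per row, one condition per column.

    Assumption: no header line and no label column.
    """
    matrix = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            matrix.append([float(value) for value in fields])
    # Every gene must have a value for every condition.
    n_cols = len(matrix[0]) if matrix else 0
    for i, row in enumerate(matrix):
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} columns, expected {n_cols}")
    return matrix


def transpose(matrix):
    """Turn genes-by-conditions into conditions-by-genes (one sample per row)."""
    return [list(column) for column in zip(*matrix)]


# Made-up stand-in for the real data file: 3 genes, 2 conditions.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0.5 1.2\n-0.3 0.8\n2.0 -1.1\n")
    demo_path = tmp.name

genes_by_conditions = load_expression_matrix(demo_path)
conditions_by_genes = transpose(genes_by_conditions)
os.remove(demo_path)

print(len(genes_by_conditions), "genes x", len(genes_by_conditions[0]), "conditions")
```

Clustering the conditions (samples) typically uses the transposed matrix, while clustering genes uses the matrix as stored; for the real file, the loaded shape should be checked against the stated 3226 rows and 22 columns.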
Lymphoma (9 classes) data set.
Can be used for classification and clustering.
A large collection of microarray data, including Leukemia and Lung cancer data, is available here.
NCI60 data.
Software and Data for Text:
Online NIPS proceedings. NIPS online: The text repository.
TREC (Text REtrieval Conference) data. NIST collection of documents returned by a retrieval system, with relevance judgements.
TMG is a Matlab Toolbox that can be used for various tasks in text mining
Snowball. Software to perform stemming.
The 'MC' Toolkit. A Toolkit for Creating Vector Models from Text Documents.
Rainbow. Software for statistical text classification.
SPAM Archive.
Classic3 data set. Collections of Abstracts from three categories:
MEDLINE (abstracts from medical journals); CISI (abstracts from IR papers);
CRANFIELD (abstracts from aerodynamics papers).
Classic3 data set processed with word information.
20Newsgroups data set.
Reuters21578 data set.
LSI. A list of papers and Software related to Latent Semantic Indexing.
LPU. A system for learning from Positive and Unlabeled Examples.