Spring 2010: Advanced Pattern Recognition [CS775]

General Description and Preliminary List of Topics:
The course covers advanced topics in pattern recognition and machine learning. Recent conference and journal papers will be discussed in depth. Tentative topics: Mixture models and EM; Ensemble methods; Co-clustering; Transfer learning; Semi-supervised learning; Learning with external knowledge; Generative approaches to topic modeling; Kernel methods. Actual topics covered will depend on time available and students' interests.
Course Format:
Lectures by the instructor, student presentations, and discussions. Research papers and handouts will be made available. The course requires a project, homework assignments, and exams (exact format to be decided). The homework assignments and the project will require some programming.
Course Project:
The project gives you an opportunity to explore in depth a topic or area of the course that interests you. The project must relate to the material covered in class, but otherwise you are free to choose the specific topic. Possible types of projects include:
  • An application research project: The project demonstrates the application of techniques discussed in class to an application domain (e.g., text mining, bioinformatics, computer vision, image processing, artificial intelligence, etc.). The properties, advantages, and drawbacks of the techniques used are analyzed within the context of that domain.
  • A theoretical or methodological research project: A study of different classes of models and approaches; establishing, either theoretically or experimentally, properties of known algorithms; or designing a new approach.
Papers on Kernel methods and SVMs: (Under construction)
  • A tutorial on support vector machines for pattern recognition
  • On Kernel-target alignment
  • Sequence and Tree Kernels with Statistical Feature Mining
  • Making Tree Kernels practical for Natural Language Processing
  • Semi-supervised graph clustering: a kernel approach
Papers on Information Retrieval and Text Mining: (Under construction)
  • Computing semantic relatedness using Wikipedia-based explicit semantic analysis
  • Probabilistic Latent Semantic Analysis
  • Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews
  • Constructing Informative Priors using Transfer Learning
Papers on Subspace Clustering: (Under construction)
  • Information-Theoretic Co-clustering
Ensemble Methods and Clustering Ensembles: (Under construction)
  • Proactive learning: cost-sensitive active learning with multiple imperfect oracles
  • Get another label? Improving data quality and data mining using multiple noisy labelers
  • Projective Clustering Ensembles
  • An ensemble approach to identifying informative constraints for semi-supervised clustering
  • Solving cluster ensemble problems by bipartite graph partitioning
  • Cluster ensembles - a knowledge reuse framework for combining multiple partitions
  • Weighted Cluster Ensembles: Methods and Analysis
Semi-supervised learning: (Under construction)
  • A probabilistic framework for semi-supervised clustering
  • An adaptive kernel method for semi-supervised clustering
Metric Learning: (Under construction)
  • Locally adaptive metrics for clustering high dimensional data
Transfer Learning: (Under construction)
  • Improving SVM accuracy by training on auxiliary data sources
  • Mapping and revising Markov logic networks for transfer learning
  • Self-taught Clustering
  • Knowledge transfer via multiple model local structure mapping
Software and Data:
  • The Distance Metric Learning Toolkit: a Matlab toolkit for distance metric learning. It includes code for the NCA (Neighbourhood Components Analysis) algorithm.
  • The Transfer Learning Toolkit: a Matlab toolkit for transfer learning. It also contains benchmark datasets consisting of multiple tasks.
  • MALLET: MAchine Learning for LanguagE Toolkit.
  • Stanford Topic Modeling Toolbox
  • Matlab Topic Modeling Toolbox
  • UCI Machine Learning Repository is a repository of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. A beta version of a new and improved site is also available.
  • UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas.
  • Weka is an open source Java package implementing many learning algorithms.
  • YALE (Yet Another Learning Environment) is another open source Java package. It includes a GUI that allows automation of the whole data process, from feature normalization to feature selection, learning, and cross-validation.
  • SVM light and LibSVM are two popular implementations of various SVM algorithms.
  • Pointers to Support Vector Machine and Gaussian Processes Software.
  • Software, datasets, publications related to data mining.
Microarray Data:
  • Labeled microarray data.
    Performance comparison of classifiers on the above microarray data.
  • Gene Expression Profiles in Hereditary Breast Cancer. This data set does not have labels.
    Here is the text file containing the matrix of expression levels: rows are genes (3226); columns are conditions (22).
  • Lymphoma (9 classes) data set. Can be used for classification and clustering.
  • A large collection of microarray data, including Leukemia and Lung cancer data, is available here.
  • NCI60 data.
Software and Data for Text:
  • Online NIPS proceedings. NIPS online: the text repository.
  • TREC (Text REtrieval Conference) data. NIST collection of documents returned by a retrieval system, with relevance judgements.
  • TMG is a Matlab toolbox that can be used for various tasks in text mining.
  • Snowball. Software to perform stemming.
  • The 'MC' Toolkit. A Toolkit for Creating Vector Models from Text Documents.
  • Rainbow. Software for statistical text classification.
  • SPAM Archive.
  • Classic3 data set. Collections of abstracts from three categories: MEDLINE (abstracts from medical journals); CISI (abstracts from IR papers); CRANFIELD (abstracts from aerodynamics papers).
    Classic3 data set processed with word information.
  • 20Newsgroups data set.
  • Reuters21578 data set.
  • LSI. A list of papers and software related to Latent Semantic Indexing.
  • LPU. A system for learning from Positive and Unlabeled Examples.