Sam Blasiak


E-mail: spam@masonlive.gmu.edu
Advisor: Dr. Huzefa Rangwala

I recently defended my dissertation, Latent Variable Models of Sequence Data for Classification and Discovery, in the Department of Computer Science at George Mason University. It covers a number of novel machine learning methods for extracting useful information from (mostly) sequence data, incorporating techniques from Bayesian inference and probabilistic modeling, topic modeling, neural network architectures, and sparse dictionary learning.
I am interested in nearly everything related to machine learning, from the practical details of coding inference algorithms to concepts in learning theory, and I am currently looking for a job where I can pursue these interests further.


My resume

Publications


Relevant Subsequence Discovery with Sparse Dictionary Learning [paper] [code]
Sam Blasiak, Huzefa Rangwala, and Kathryn B. Laskey.
Proceedings of the European Conference on Machine Learning (ECML-PKDD 2013)
Prague, Czech Republic, September 23-27, 2013.
Relevant Subsequence Dictionary Learning (RS-DL) is an alternative method for applying Sparse Dictionary Learning to sequence datasets. In RS-DL, separate dictionaries are constructed for each sequence in a dataset from a set of "relevant subsequence patterns," allowing interesting subsequences to be discovered.
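To give a flavor of the sparse-coding machinery behind RS-DL, here is a minimal sketch (my own toy illustration, not the paper's exact algorithm): a 1-D sequence is reconstructed as a sparse sum of short patterns placed at different offsets, and the nonzero activations mark discovered subsequences. All sizes and the ISTA settings are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
T, P, K = 100, 8, 4             # sequence length, pattern length, num patterns
x = rng.normal(size=T)          # toy sequence
D = rng.normal(size=(K, P))     # dictionary of subsequence patterns
D /= np.linalg.norm(D, axis=1, keepdims=True)

def soft_threshold(v, lam):
    """Proximal operator of the L1 norm (promotes sparse activations)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Sparse coding step: ISTA on the activations `a`, where a[k, t] says how
# strongly pattern k is used at offset t.
a = np.zeros((K, T - P + 1))
step, lam = 0.01, 0.1
for _ in range(200):
    # Reconstruction: sum of patterns placed at their offsets.
    recon = np.zeros(T)
    for k in range(K):
        for t in range(T - P + 1):
            recon[t:t + P] += a[k, t] * D[k]
    resid = x - recon
    # Gradient step on the squared error, then L1 shrinkage.
    grad = np.array([[resid[t:t + P] @ D[k] for t in range(T - P + 1)]
                     for k in range(K)])
    a = soft_threshold(a + step * grad, step * lam)

# Nonzero entries of `a` mark where each discovered pattern occurs.
print("active (pattern, offset) pairs:", np.argwhere(np.abs(a) > 1e-8)[:5])
```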
Joint Segmentation and Clustering in Text Corpuses [paper] [supplementary material] [wikipeople dataset]
Sam Blasiak, Sithu Sudarsan, and Huzefa Rangwala.
Proceedings of the 2013 SIAM International Conference on Data Mining (SDM 2013)
Austin, Texas, May 2-4, 2013.
The Joint Segmentation and Clustering (JSC) model combines ideas from topic modeling and segmental semi-Markov models. Compared to techniques where segmentation and clustering are performed individually, the JSC model improves performance both in recovering semantic information from documents and in producing concise representations.
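The semi-Markov half of the model rests on a segmental dynamic program. The sketch below shows the generic version of that program: finding the best segmentation given a segment-scoring function, which is assumed here (JSC's actual inference is more involved).
```python
def best_segmentation(n, score, max_len):
    """score(i, j) -> log-score of a segment covering positions i..j-1."""
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = best[i] + score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    # Recover segment boundaries by walking backpointers from the end.
    cuts, j = [], n
    while j > 0:
        cuts.append((back[j], j))
        j = back[j]
    return list(reversed(cuts))

# Toy usage: prefer segments of length 3 on a sequence of length 9.
print(best_segmentation(9, lambda i, j: -abs((j - i) - 3), max_len=5))
```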
A Family of Feed Forward Models for Sequence Classification [paper]
Sam Blasiak, Huzefa Rangwala, and Kathryn B. Laskey.
Proceedings of the European Conference on Machine Learning (ECML-PKDD 2012)
Bristol, UK, September 25-27, 2012.
We created a family of feed-forward models that has many similarities with convolutional neural networks. However, rather than using the standard convolutional layer, these networks extract informative features from protein sequences using a structure inspired by Profile Hidden Markov Models. These networks are competitive with top-performing kernel methods on standard biological sequence datasets but use a significantly different mode of operation.
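As a rough illustration of this style of feature extraction (my own simplification, not the paper's exact architecture): score a sequence against small position-specific weight profiles at every offset and max-pool over offsets, yielding a fixed-length feature vector. All sizes below are toy values.
```python
import numpy as np

rng = np.random.default_rng(1)
A, T, P, F = 20, 50, 7, 16             # alphabet, seq length, profile length, features
seq = rng.integers(0, A, size=T)       # toy protein sequence as integer codes
X = np.eye(A)[seq]                     # one-hot encoding, shape (T, A)
W = rng.normal(size=(F, P, A)) * 0.1   # F position-specific weight profiles

def extract_features(X, W):
    """Slide each profile over the sequence and keep its best-matching offset."""
    T = X.shape[0]
    F, P, _ = W.shape
    scores = np.array([
        [(X[t:t + P] * W[f]).sum() for t in range(T - P + 1)]
        for f in range(F)
    ])
    return scores.max(axis=1)          # fixed-length feature vector, shape (F,)

feats = extract_features(X, W)         # ready for any standard classifier
print(feats.shape)
```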
Beam Methods for the Profile Hidden Markov Model [paper] [supplementary material]
Sam Blasiak, Huzefa Rangwala, and Kathryn B. Laskey.
Proceedings of the 2012 SIAM International Conference on Data Mining (SDM 2012)
Anaheim, California, April 25-28, 2012.
In this paper, we extended the Profile Hidden Markov Model (pHMM), a type of HMM commonly used to model biological sequences, to allow an infinite number of hidden states. A key component of this work is a beam method we developed that makes inference tractable for the Infinite pHMM. The beam method also applies to the finite pHMM, where it allows faster inference than the standard forward-backward algorithm. In experiments, we showed that our approximate inference method significantly speeds up inference while retaining accuracy (in terms of perplexity) on standard protein datasets.
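The flavor of beam pruning, shown here in a generic HMM forward pass rather than the paper's Profile-HMM-specific construction (the transition and emission matrices below are toy values):
```python
import numpy as np

def forward_beam(obs, pi, A, B, beam=1e-3):
    """Forward algorithm that drops states whose probability falls below
    `beam` times the best state at each step, trading exactness for speed."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = alpha * (alpha >= beam * alpha.max())   # prune small entries
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()   # (approximate) likelihood of the observation sequence

rng = np.random.default_rng(2)
S, V = 10, 5                              # num states, vocabulary size
pi = np.full(S, 1 / S)
A = rng.dirichlet(np.ones(S), size=S)     # rows sum to 1
B = rng.dirichlet(np.ones(V), size=S)
obs = rng.integers(0, V, size=30)
print(forward_beam(obs, pi, A, B))
```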
A Hidden Markov Model Variant for Sequence Classification [paper] [video]
Sam Blasiak and Huzefa Rangwala.
Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011)
Barcelona, Spain, July 16-22, 2011.
The Hidden Markov Model Variant is a probabilistic model, inspired by both topic models and Hidden Markov Models, that learns fixed-length vector representations from sequence data. After training, vector representations extracted from the model can be fed to a classifier or any other machine learning tool that requires vector, rather than sequence, input.
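The general usage pattern, independent of the specific model, looks like this: compute a fixed-length vector per sequence, then hand those vectors to an off-the-shelf classifier. The toy "representation" below is just normalized bigram counts, purely for illustration (not the model's learned representation).
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transition_features(seq, n_symbols):
    """Flattened, normalized bigram-count matrix as a fixed-length vector."""
    C = np.zeros((n_symbols, n_symbols))
    for a, b in zip(seq[:-1], seq[1:]):
        C[a, b] += 1
    return (C / max(len(seq) - 1, 1)).ravel()

rng = np.random.default_rng(3)
# Toy dataset: class 0 favors self-transitions, class 1 mostly resamples.
def sample(sticky):
    s, out = 0, []
    for _ in range(40):
        s = s if rng.random() < sticky else int(rng.integers(0, 4))
        out.append(s)
    return out

X = [transition_features(sample(st), 4) for st in [0.9] * 50 + [0.1] * 50]
y = [0] * 50 + [1] * 50
clf = LogisticRegression().fit(X, y)   # any vector-input classifier works here
print(clf.score(X, y))
```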

Other Work


At the 2011 Mid-Atlantic Student Colloquium on Speech, Language and Learning, I gave a presentation on Maximum Entropy Discrimination (MED). MED is a method, originally developed by Jaakkola et al. and extended in recent work by Zhu et al. and others, that adds constraints to generative models to make them better suited to supervised learning.
In my presentation and worksheet, I tried to explain the basic concept behind MED and how it relates to the more standard Support Vector Machine. I also briefly discussed how MED can be applied in the context of topic modeling (Zhu et al.).
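For reference, the two objectives I contrasted can be written side by side (standard formulations with generic symbols; in the full MED setup the margins gamma_i can themselves be given a prior and optimized):
```latex
% SVM: a point estimate of the weights via a regularized hinge loss.
% MED: a distribution over weights, close to a prior, subject to
% expected-margin constraints on each training example.
\begin{align*}
\text{SVM:}\quad & \min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
    + C \sum_{i} \max\bigl(0,\ 1 - y_i\, w^{\top} x_i\bigr) \\
\text{MED:}\quad & \min_{p(w)}\ \mathrm{KL}\bigl(p(w)\,\big\Vert\, p_0(w)\bigr)
    \quad \text{s.t.} \quad
    \mathbb{E}_{p(w)}\bigl[\, y_i\, f(x_i; w) \,\bigr] \ge \gamma_i
    \quad \forall i
\end{align*}
% With a Gaussian prior p_0 and a linear discriminant f, the MED solution
% recovers an SVM-like dual, which is the connection the worksheet walks through.
```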
For my master's project at Brandeis University, I wrote an application to simulate genetic regulatory networks, which we called the Cell Regulation Simulator (CRS). I also conducted a number of comparisons between CRS and other regulatory network simulation programs.
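As a rough illustration of the kind of dynamics such a simulator integrates, here is a minimal sketch of a standard two-gene mutual-repression motif with Hill kinetics (toy parameters; CRS itself supported much richer models):
```python
from scipy.integrate import solve_ivp

def toggle_switch(t, x, alpha=5.0, n=2.0, decay=1.0):
    """Each protein represses the other's production (Hill repression)."""
    a, b = x
    da = alpha / (1.0 + b**n) - decay * a
    db = alpha / (1.0 + a**n) - decay * b
    return [da, db]

sol = solve_ivp(toggle_switch, (0, 20), [1.0, 0.1])
print(sol.y[:, -1])   # settles into one of two stable expression states
```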
At Brandeis, I wrote an implementation of the Brill Grammar Induction algorithm. Brill Grammar Induction greedily creates a set of rules that transform a naive parse of training set sentences into a parse that is near the ground truth parse tree. This set of rules is then saved and constitutes the trained model.
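The greedy training loop has roughly this shape (a schematic sketch; `candidate_rules`, `apply_rule`, and `errors` stand in for the rule templates and scoring details, which I'm leaving abstract):
```python
def brill_train(parses, gold, candidate_rules, apply_rule, errors, max_rules=50):
    """Greedily pick the rule that most reduces error on the training set."""
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        base = errors(parses, gold)
        for rule in candidate_rules(parses, gold):
            gain = base - errors([apply_rule(rule, p) for p in parses], gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:        # no rule improves the parses; stop
            break
        parses = [apply_rule(best_rule, p) for p in parses]
        learned.append(best_rule)    # the ordered rule list is the model
    return learned
```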
Besphered is a clone of a popular game that I wrote between missions in Afghanistan. A lot of people in my unit were playing the game in their spare time, and I became interested in developing an artificial intelligence to play it. I ended up writing a simple search algorithm that rates boards using both the number of available moves and the number of adjacent blocks of the same color, and it seemed to work pretty well. I also ended up doing some work on the user interface. The spheres are created at startup using a raytracer that I wrote with some assistance from Ken Perlin's website. I was particularly pleased with how well the double-bounce animation turned out.
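The board-rating idea was along these lines (a simplified reconstruction, not the game's actual code; `available_moves` is an assumed stand-in for the move generator, and the board is a 2-D grid of color codes):
```python
def rate_board(board, available_moves, move_weight=10, adjacency_weight=1):
    """Score = weighted count of legal moves plus same-color adjacencies."""
    rows, cols = len(board), len(board[0])
    adjacent = sum(
        (c + 1 < cols and board[r][c] == board[r][c + 1]) +
        (r + 1 < rows and board[r][c] == board[r + 1][c])
        for r in range(rows) for c in range(cols)
    )
    return move_weight * len(available_moves(board)) + adjacency_weight * adjacent
```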
Anagrams was an idea I had for a real-time multiplayer game that never quite got to the multiplayer stage. The goal is to find more anagrams than your opponent as letters are added to a shared pool at set time intervals. Once an anagram is found, its letters are collected by the player who found it. Currently, a single player just competes against the clock. I was pretty proud of the letter animations I created for this game.
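The core check in the game, whether a word can be assembled from the current letter pool, is simple (a minimal sketch; the real game adds timing, scoring, and removal of claimed letters):
```python
from collections import Counter

def can_claim(word, pool):
    """True if every letter of `word` is available in `pool`."""
    need, have = Counter(word), Counter(pool)
    return all(have[ch] >= n for ch, n in need.items())

pool = list("aelpsttr")
print(can_claim("pleats", pool))   # True
print(can_claim("zebra", pool))    # False
```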