We describe algorithms that use cloud shadows as a form of stochastically structured light to support 3D scene geometry estimation. Taking video captured from a static outdoor camera as input, we use the relationship between the intensity time series of pairs of pixels as the primary input to our algorithms. We describe two cues that relate the 3D distance between a pair of points to the pair of intensity time series. The first cue results from the fact that two points that are nearby in the world are more likely to be under a cloud at the same time than two distant points. We describe methods for using this cue to estimate focal length and scene structure. The second cue is based on the motion of cloud shadows across the scene; this cue results in a set of linear constraints on scene structure. These constraints have an inherent ambiguity, which we show how to overcome by combining the cloud motion cue with the spatial cue. We evaluate our method on several time-lapse videos of real outdoor scenes.
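To make the spatial cue concrete, here is a minimal numpy sketch of the pairwise-intensity statistic as we read it: the normalized correlation between two pixels' time series, which is higher for scene points that tend to fall under the same cloud shadow. The simulated "shadow front" and all names are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def pixel_affinity(video, p, q):
    """Normalized correlation between the intensity time series of two
    pixels p=(row, col), q=(row, col) in a (T, H, W) grayscale video.
    Nearby scene points tend to be shadowed at the same times, so their
    series correlate more strongly than those of distant points."""
    a = video[:, p[0], p[1]].astype(float)
    b = video[:, q[0], q[1]].astype(float)
    a -= a.mean(); b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy usage: simulate a shadow front sweeping across a 1-row image.
rng = np.random.default_rng(0)
T, W = 500, 40
video = np.ones((T, 1, W))
front = rng.integers(0, W, size=T)          # hypothetical shadow edge
for t in range(T):
    video[t, 0, :front[t]] *= 0.5           # shadowed pixels are darker
video += 0.01 * rng.standard_normal(video.shape)

print(pixel_affinity(video, (0, 10), (0, 11)))  # neighbours: high
print(pixel_affinity(video, (0, 1), (0, 38)))   # far apart: much lower
```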
We present a novel patch-based probabilistic graphical model for semi-supervised video segmentation. At the heart of our model is a temporal tree structure which links patches in adjacent frames through the video sequence. This permits exact inference of pixel labels without resorting to traditional short-time-window video processing or instantaneous decision making. The input to our algorithm is one or more labelled key frames of a video sequence, and the output is pixel-wise labels together with their confidences. We propose an efficient inference scheme that performs exact inference over the temporal tree, and optionally a per-frame label smoothing step using loopy BP, to estimate pixel-wise labels and their posteriors. These posteriors are used to learn pixel unaries by training a Random Decision Forest in a semi-supervised manner. These unaries are used in a second iteration of label inference to improve the segmentation quality. We demonstrate the efficacy of our proposed algorithm using several qualitative and quantitative tests on both foreground/background and multi-class video segmentation problems using publicly available and our own datasets.
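Exact inference on a tree is what makes this model tractable; the sketch below shows the simplest special case, max-product (Viterbi) inference on a chain of frames. The paper's tree links patches across frames with learnt potentials; the unary and pairwise log-scores here are invented for illustration.

```python
import numpy as np

def viterbi_chain(unary, pairwise):
    """Exact MAP inference on a chain, the simplest instance of exact
    tree inference. unary: (T, L) per-frame label log-scores;
    pairwise: (L, L) log-compatibilities between adjacent frames."""
    T, L = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, L), int)
    for t in range(1, T):
        cand = score[:, None] + pairwise     # score of every transition
        back[t] = cand.argmax(axis=0)        # best predecessor per label
        score = cand.max(axis=0) + unary[t]
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack the best path
        labels.append(int(back[t][labels[-1]]))
    return labels[::-1]

# Toy usage: 2 labels with a smoothness prior; unaries favour one switch.
unary = np.log(np.array([[0.9, 0.1]] * 3 + [[0.2, 0.8]] * 3))
pairwise = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
print(viterbi_chain(unary, pairwise))        # [0, 0, 0, 1, 1, 1]
```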
One of the main difficulties in facial age estimation is that the learning algorithms cannot expect sufficient and complete training data. Fortunately, faces at close ages look quite similar, since aging is a slow and smooth process. Inspired by this observation, instead of considering each face image as an instance with one label (age), this paper regards each face image as an instance associated with a label distribution. The label distribution covers a certain number of class labels, representing the degree to which each label describes the instance. In this way, one face image can contribute not only to the learning of its chronological age, but also to the learning of its adjacent ages. Two algorithms, named IIS-LLD and CPNN, are proposed to learn from such label distributions. Experimental results on two aging face databases show remarkable advantages of the proposed label distribution learning algorithms over the compared single-label learning algorithms, whether specially designed for age estimation or general-purpose.
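A short sketch of the core idea: converting a single chronological age into a label distribution over nearby ages. The Gaussian form and the sigma value are our own illustrative assumptions; the paper's contribution is learning from such distributions, not this particular construction.

```python
import numpy as np

def age_label_distribution(age, n_ages=70, sigma=2.0):
    """Turn one chronological age into a discrete label distribution
    over all ages: a normalized Gaussian centred on the true age, so
    adjacent ages also receive some descriptive degree."""
    ages = np.arange(n_ages)
    d = np.exp(-0.5 * ((ages - age) / sigma) ** 2)
    return d / d.sum()

dist = age_label_distribution(25)
print(dist[23:28].round(3))   # probability mass shared with adjacent ages
```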
Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting persons is a very challenging task, even in a multi-camera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multi-view image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower-dimensional global one, is applied one-by-one to each individual, followed by surface estimation to capture detailed non-rigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multi-person motions, including dancing, wrestling, and hugging.
We propose a novel geometric framework for analyzing 3D faces, with the specific goals of comparing, matching, and averaging their shapes. Here we represent facial surfaces by radial curves emanating from the nose tips and use elastic shape analysis of these curves to develop a Riemannian framework for analyzing shapes of full facial surfaces. This representation, along with the elastic Riemannian metric, seems natural for measuring facial deformations and is robust to challenges such as large facial expressions (especially those with open mouths), large pose variations, missing parts, and partial occlusions due to glasses, hair, etc. This framework is shown to be promising from both empirical and theoretical perspectives. In terms of the empirical evaluation, our results match or improve upon state-of-the-art methods on three prominent databases: FRGCv2, GavabDB, and Bosphorus, each posing a different type of challenge. From a theoretical perspective, this framework allows for formal statistical inferences, such as the estimation of missing facial parts using PCA on tangent spaces and computing average shapes.
With the availability of live-scan palmprint technology, high resolution palmprint recognition has started to receive significant attention in forensics and law enforcement. In forensic applications, latent palmprints provide critical evidence, as it is estimated that about 30 percent of the latents recovered at crime scenes are those of palms. Considering the large number of minutiae and the large area of foreground region in full palmprints, novel strategies need to be developed for efficient and robust latent palmprint matching. In this paper, a coarse-to-fine matching strategy based on minutiae clustering and minutiae match propagation is designed specifically for palmprint matching. The proposed palmprint matching algorithm has been evaluated on a latent-to-full palmprint database consisting of 446 latents and 12,489 background full prints. The matching results show a rank-1 identification accuracy of 79.4%, which is significantly higher than the 60.8% identification accuracy of a state-of-the-art latent palmprint matching algorithm on the same latent database. The average computation time of our algorithm for a single latent-to-full match is about 141 ms for a genuine match and 50 ms for an impostor match, on a Windows XP desktop system with a 2.2 GHz CPU and 1 GB of RAM.
This paper presents a new intrinsic calibration method that allows us to calibrate a generic single-viewpoint camera. From the video sequence obtained while the camera undergoes random motion, we compute the pairwise time correlation of the luminance signal for the pixels. We show that the pairwise correlation of any pixel pair is a function of the distance between the pixel directions on the visual sphere. This leads to formalizing calibration as a problem of metric embedding from non-metric measurements: we want to find the disposition of pixels on the visual sphere, from similarities that are an unknown function of the distances. This problem is a generalization of multidimensional scaling (MDS) that has so far resisted a comprehensive observability analysis and a generic solution. We show that the observability depends on both the local geometric properties and the global topological properties of the target manifold. It follows that, in contrast to the Euclidean case, on the sphere we can recover the scale of the point distribution. We describe an algorithm that is robust across manifolds and can recover a metrically accurate solution when the metric information is observable. We demonstrate the performance of the algorithm for several cameras (pin-hole, fish-eye, omnidirectional).
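For intuition about the embedding step, here is classical (Torgerson) MDS, the Euclidean special case that the paper generalizes. The paper's setting is harder: distances live on the visual sphere and are known only through an unknown monotone function of the correlations, neither of which this sketch handles.

```python
import numpy as np

def classical_mds(D, dim=3):
    """Classical MDS: embed n points from an n x n matrix D of squared
    pairwise distances by double-centring and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J                     # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]          # top-'dim' eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy usage: recover a planar point layout from its pairwise distances.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Y = classical_mds(D2, dim=2)                 # X up to rotation/reflection
print(np.allclose(D2, ((Y[:, None] - Y[None]) ** 2).sum(-1), atol=1e-8))
```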
We propose a novel approach for the estimation of the pose and focal length of a camera from a set of 3D-to-2D point correspondences. Our method compares favorably to competing approaches in that it is both more accurate than existing closed form solutions, as well as faster and also more accurate than iterative ones. Our approach is inspired by the EPnP algorithm, a recent O(n) solution for the calibrated case. Yet, we show that considering the focal length as an additional unknown renders the linearization and relinearization techniques of the original approach no longer valid, especially with large amounts of noise. We present new methodologies to circumvent this limitation, termed exhaustive linearization and exhaustive relinearization, which perform a systematic exploration of the solution space in closed form. The method is evaluated on both real and synthetic data, and our results show that besides producing precise focal length estimation, the retrieved camera pose is almost as accurate as the one computed using EPnP, which assumes a calibrated camera.
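As a point of reference for the problem being solved (not the paper's method, which is far more robust to noise), the classical DLT baseline also recovers pose plus focal length from 3D-to-2D correspondences: estimate the full 3x4 projection matrix and read off the focal length by RQ decomposition.

```python
import numpy as np
from scipy.linalg import rq

def pnp_unknown_focal_dlt(X, x):
    """Baseline: estimate P = K[R|t] by DLT from n >= 6 noise-free
    3D-2D correspondences, then factor out focal length and pose."""
    n = X.shape[0]
    A = np.zeros((2 * n, 12))
    for i in range(n):
        Xi = np.append(X[i], 1.0)
        A[2 * i, 4:8] = -Xi;  A[2 * i, 8:12] = x[i, 1] * Xi
        A[2 * i + 1, 0:4] = Xi; A[2 * i + 1, 8:12] = -x[i, 0] * Xi
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)  # null vector of A
    K, R = rq(P[:, :3])                        # K upper tri, R orthogonal
    S = np.diag(np.sign(np.diag(K)))           # enforce positive diagonal
    K, R = K @ S, S @ R
    K /= K[2, 2]
    t = np.linalg.inv(K) @ P[:, 3]
    if np.linalg.det(R) < 0:
        R, t = -R, -t
    return K[0, 0], R, t                       # focal (pixels), pose

# Toy usage with a synthetic camera of focal length 800 pixels.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (20, 3)) + [0, 0, 6]    # points in front of camera
K = np.diag([800.0, 800.0, 1.0])
xh = (K @ X.T).T
x = xh[:, :2] / xh[:, 2:]
f, R, t = pnp_unknown_focal_dlt(X, x)
print(round(f, 2))                             # ~800.0
```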
Modeling Temporal Interactions with Interval Temporal Bayesian Networks for Complex Activity Recognition
Complex activities typically consist of multiple primitive events happening in parallel or sequentially over a period of time. Understanding such activities requires recognizing not only each individual event but, more importantly, capturing their spatiotemporal dependencies over different time intervals. Most current graphical-model-based approaches have several limitations. First, time-sliced graphical models such as Hidden Markov Models (HMMs) and Dynamic Bayesian Networks are typically based on time points and hence can capture only three temporal relations: precedes, follows, and equals. Second, HMMs are probabilistic finite-state machines whose state spaces grow exponentially as the number of parallel events increases. Third, other approaches such as syntactic and description-based methods, while rich in modeling temporal relationships, do not have the expressive power to capture uncertainties. To address these issues, we introduce the Interval Temporal Bayesian Network (ITBN), a novel graphical model that combines the Bayesian Network with the Interval Algebra (IA) to explicitly model the temporal dependencies over time intervals. Advanced machine learning methods are introduced to learn the ITBN model structure and parameters. Experimental results show that by reasoning with spatiotemporal dependencies the proposed model leads to a significantly improved performance when modeling and recognizing complex activities involving both parallel and sequential events.
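The interval relations come from Allen's Interval Algebra, whose 13 relations subsume the three point-based ones. A small sketch of classifying the relation between two event intervals, which is the kind of quantity ITBN attaches to network edges (the network itself is not shown here):

```python
def allen_relation(a, b):
    """Classify the Allen interval-algebra relation between events
    a=(start, end) and b=(start, end). Seven relations are listed;
    the remaining six are the inverses of all but 'equals'."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "precedes"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1 and a2 < b2:
        return "starts"
    if a1 > b1 and a2 == b2:
        return "finishes"
    if a1 > b1 and a2 < b2:
        return "during"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    return "inverse of one of the above"

print(allen_relation((0, 3), (3, 7)))   # meets
print(allen_relation((1, 4), (2, 6)))   # overlaps
```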
The Sturm-Triggs type iteration is a classic approach for solving the projective structure-from-motion (SfM) factorization problem, which iteratively solves for the projective depths, scene structure, and camera motions in an alternating fashion. Like many other iterative algorithms, the Sturm-Triggs iteration suffers from common drawbacks, such as requiring a good initialization and possibly failing to converge, or converging only to a local minimum. In this paper, we formulate the projective SfM problem as a novel element-wise factorization (i.e., Hadamard factorization) problem, as opposed to the conventional matrix factorization. Thanks to this formulation, we are able to solve for the projective depths, structure, and camera motions simultaneously by convex optimization. To address the scalability issue, we adopt a continuation based algorithm. Our method is a global method, in the sense that it is guaranteed to obtain a globally-optimal solution up to the relaxation gap. Another advantage is that our method can handle challenging real-world situations such as missing data and outliers quite easily, all in a natural and unified manner. Extensive experiments on both synthetic and real images show results comparable with those of state-of-the-art methods.
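For contrast, here is a compact sketch of the classic alternation this paper improves on (not its convex element-wise formulation): scale the homogeneous image points by current projective depths, project to the nearest rank-4 matrix, re-read the depths, and repeat. Its lack of convergence guarantees is exactly the paper's motivation.

```python
import numpy as np

def sturm_triggs(W, iters=50):
    """Sturm-Triggs-style alternation. W is 3m x n, with each 3-row
    block holding the homogeneous image points of one of m views."""
    lam = np.ones((W.shape[0] // 3, W.shape[1]))  # projective depths
    for _ in range(iters):
        S = W * np.repeat(lam, 3, axis=0)         # depth-scaled points
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        S4 = (U[:, :4] * s[:4]) @ Vt[:4]          # nearest rank-4 matrix
        lam = S4[2::3] / W[2::3]                  # depths from 3rd rows
        lam /= np.abs(lam).max()                  # fix the gauge/scale
    return S4                                     # ~ P (3m x 4) @ X (4 x n)

# Toy usage on noise-free synthetic data (3 views, 30 points).
rng = np.random.default_rng(3)
P = rng.standard_normal((9, 4)); X = rng.standard_normal((4, 30))
M = P @ X
W = M / np.repeat(M[2::3], 3, axis=0)             # homogenize each view
S4 = sturm_triggs(W)
S4h = S4 / np.repeat(S4[2::3], 3, axis=0)         # re-homogenize
print(np.abs(S4h - W).max())   # residual; may or may not be near zero,
                               # since convergence is not guaranteed
```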
We present a novel approach to localizing parts in images of human faces. The approach combines the output of local detectors with a non-parametric set of global models for the part locations based on over one thousand hand-labeled exemplar images. By assuming that the global models generate the part locations as hidden variables, we derive a Bayesian objective function. This function is optimized using a consensus of models for these hidden variables. The resulting localizer handles a much wider range of expression, pose, lighting, and occlusion than prior ones. We show excellent performance on real-world face datasets such as Labeled Faces in the Wild (LFW) and a new dataset, Labeled Face Parts in the Wild (LFPW), and show that our localizer achieves state-of-the-art performance on the less challenging BioID dataset.
In this paper, we propose an efficient algorithm to detect dense subgraphs of a weighted graph. The proposed algorithm, called the Shrinking and Expansion Algorithm (SEA), iterates between two phases, namely an expansion phase and a shrink phase, until convergence. For a current subgraph, the expansion phase adds the most related vertices based on the average affinity between each vertex and the subgraph. The shrink phase considers all pairwise relations in the current subgraph and filters out vertices whose average affinities to other vertices are smaller than the average affinity of the resulting subgraph. In both phases, SEA operates on small subgraphs, and thus it is very efficient. Significant dense subgraphs are robustly enumerated by running SEA from each vertex of the graph. We evaluate SEA on two different applications: solving correspondence problems and cluster analysis. Both theoretical analysis and experimental results show that SEA is very efficient and robust, especially when there is a large amount of noise in the edge weights.
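A simplified sketch of the expand/shrink alternation as we read it. The acceptance thresholds below (including the 0.9 slack factor) are our own simplification of the paper's average-affinity criteria, not its exact update rules.

```python
import numpy as np

def sea(A, seed, max_iter=100):
    """Alternate expansion and shrinking on a symmetric affinity matrix
    A (zero diagonal), starting from a seed vertex, until the vertex
    set of the subgraph is stable."""
    S = {seed}
    for _ in range(max_iter):
        idx = sorted(S)
        # average affinity of every vertex to the current subgraph
        to_S = A[:, idx].sum(axis=1) / max(len(idx), 1)
        bar = to_S[idx].mean()               # subgraph's own average
        grown = S | {v for v in range(len(A)) if to_S[v] > bar}
        idx = sorted(grown)
        # drop vertices weakly tied to the rest of the grown subgraph
        to_G = A[np.ix_(idx, idx)].sum(axis=1) / max(len(idx) - 1, 1)
        kept = {v for v, a in zip(idx, to_G) if a >= 0.9 * to_G.mean()}
        kept = kept or {seed}
        if kept == S:
            return S
        S = kept
    return S

# Toy usage: a planted dense cluster {0..4} in a noisy 12-vertex graph.
rng = np.random.default_rng(4)
A = rng.uniform(0.0, 0.2, (12, 12))
A[:5, :5] = rng.uniform(0.8, 1.0, (5, 5))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
print(sorted(sea(A, seed=0)))                # the planted cluster [0..4]
```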
Stacked Autoencoders for Unsupervised Feature Learning and Multiple Organ Detection in a Pilot Study Using 4D Patient Data
Medical image analysis remains a challenging application area for artificial intelligence. When applying machine learning, obtaining ground-truth labels for supervised learning is more difficult than in many more common applications of machine learning. This is especially so for datasets with abnormalities, as tissue types and the shapes of the organs in these datasets differ widely. However, organ detection in such an abnormal dataset has many promising potential real-world applications, such as automatic diagnosis, automated radiotherapy planning, and medical image retrieval, where new multi-modal medical images provide more information about the imaged tissues for diagnosis. Here we test the application of deep learning methods to organ identification in magnetic resonance medical images, with visual and temporal hierarchical features learnt to categorise object classes from an unlabelled multi-modal DCE-MRI dataset, so that only weakly supervised training is required for a classifier. A probabilistic patch-based method was employed for multiple organ detection, with the features learnt from the deep learning model. This shows the potential of the deep learning model for application to medical images, despite the difficulty of obtaining libraries of correctly labelled training datasets, and despite the intrinsic abnormalities present in patient datasets.
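To illustrate the stacked-autoencoder idea in miniature: train one autoencoder layer on raw data, then train a second layer on the first layer's codes, and use the stacked codes as unsupervised features. This toy numpy sketch uses tied weights and plain gradient descent, and omits the denoising and regularization choices a real pipeline would need; all hyperparameters are hypothetical.

```python
import numpy as np

def train_autoencoder(X, hidden, epochs=200, lr=0.1, seed=0):
    """One tied-weight sigmoid autoencoder layer, trained by full-batch
    gradient descent on the squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, hidden)) * 0.1
    b, c = np.zeros(hidden), np.zeros(d)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sig(X @ W + b)                   # encode
        Y = sig(H @ W.T + c)                 # decode (tied weights)
        dY = (Y - X) * Y * (1 - Y)           # delta at output layer
        dH = (dY @ W) * H * (1 - H)          # delta at hidden layer
        W -= lr * (X.T @ dH + dY.T @ H) / n  # both paths touch W
        b -= lr * dH.mean(axis=0)
        c -= lr * dY.mean(axis=0)
    return lambda Z: sig(Z @ W + b)          # the layer's feature map

# Stack two layers: unsupervised codes a weak classifier could then use.
X = (np.random.default_rng(1).random((256, 32)) > 0.5).astype(float)
f1 = train_autoencoder(X, hidden=16)
f2 = train_autoencoder(f1(X), hidden=8)
features = f2(f1(X))                         # 8-D learnt representation
print(features.shape)
```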
We describe the use of two spike-and-slab models for modeling real-valued data, with an emphasis on their applications to object recognition. The first model, which we call spike-and-slab sparse coding (S3C), is a pre-existing model for which we introduce a faster approximate inference algorithm. We introduce a deep variant of S3C, which we call the partially directed deep Boltzmann machine (PD-DBM), and extend our S3C inference algorithm for use on this model. We describe learning procedures for each. We demonstrate that our inference procedure for S3C enables scaling the model to unprecedentedly large problem sizes, and demonstrate that using S3C as a feature extractor results in very good object recognition performance, particularly when the number of labeled examples is low. We show that the PD-DBM generates better samples than its shallow counterpart, and that unlike DBMs or DBNs, the PD-DBM may be trained successfully without greedy layerwise training.
In surveillance applications, the head and body orientation of people is of primary importance for assessing many behavioural traits. Unfortunately, in this context people are often represented by only a few noisy pixels, making their characterization difficult. We address this issue by proposing a computational framework based on an expressive descriptor, the covariance of features. Covariances have previously been employed for pedestrian detection, which is in effect a binary classification problem on Riemannian manifolds. In this paper, we show how to extend this approach to the multi-class case, presenting a novel descriptor, named Weighted ARray of COvariances (WARCO), especially suited for dealing with tiny image representations. The extension requires a novel differential geometry approach, in which covariances are projected onto a unique tangent space, where standard machine learning techniques can be applied. In particular, we adopt the Campbell-Baker-Hausdorff expansion as a means to approximate, on the tangent space, the genuine (geodesic) distances on the manifold, in a very efficient way. We test our methodology on multiple benchmark datasets, and also propose new testing sets, obtaining convincing results in all cases.
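For readers unfamiliar with covariance descriptors, the sketch below shows the base construction: per-pixel features of a patch summarized by their covariance matrix, compared with a log-Euclidean distance (a standard stand-in for the geodesic distance). WARCO itself goes further, using weighted arrays of such covariances projected to a common tangent space; the feature set chosen here is a typical example, not the paper's exact one.

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(patch):
    """Covariance of per-pixel features (x, y, intensity, |dx|, |dy|)
    over a grayscale patch, regularized to stay positive definite."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = np.gradient(patch.astype(float))
    F = np.stack([xs.ravel(), ys.ravel(), patch.ravel(),
                  np.abs(dx).ravel(), np.abs(dy).ravel()])
    return np.cov(F) + 1e-6 * np.eye(5)

def log_euclidean_dist(C1, C2):
    """Distance between SPD matrices via their matrix logarithms."""
    return np.linalg.norm(logm(C1) - logm(C2), "fro")

# Toy usage on two random 16x16 patches.
rng = np.random.default_rng(5)
a, b = rng.random((16, 16)), rng.random((16, 16))
print(round(log_euclidean_dist(covariance_descriptor(a),
                               covariance_descriptor(b)), 3))
```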
We describe a method for articulated human detection and human pose estimation in static images based on a new representation of deformable part models. Rather than modeling articulation using a family of warped (rotated and foreshortened) templates, we use a mixture of small, non-oriented parts. We describe a general, flexible mixture model that jointly captures spatial relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models that encode just spatial relations. Our models have several notable properties: (1) they efficiently model articulation by sharing computation across similar warps; (2) they efficiently model an exponentially large set of global mixtures through composition of local mixtures; and (3) they capture the dependency of global geometry on local appearance (parts look different at different locations). When relations are tree-structured, our models can be efficiently optimized with dynamic programming. We introduce novel criteria for evaluating pose estimation and human detection, both separately and jointly, and show that currently-used evaluation criteria may conflate these two issues. We present experimental results on standard benchmarks that suggest our approach is the state-of-the-art system for pose estimation, improving on past work on the challenging Parse and Buffy datasets, while being orders of magnitude faster.
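A tiny instance of the tree-structured dynamic program mentioned above: one root and one child part on a 1D grid of candidate positions, where the child passes the root a message holding its best appearance-plus-deformation score per root position. Real models do this over 2D grids, whole trees, and mixture types per part; the quadratic deformation cost and all scores here are illustrative.

```python
import numpy as np

def best_pair_config(score_root, score_part, offset, w=1.0):
    """DP for a two-part 1D model: maximize root appearance + child
    appearance - deformation, where deformation penalizes deviation of
    the child from its preferred offset relative to the root."""
    pos = np.arange(len(score_root))
    # deformation cost of child at j given root at i
    deform = w * (pos[None, :] - pos[:, None] - offset) ** 2
    msg = (score_part[None, :] - deform).max(axis=1)  # child -> root
    total = score_root + msg
    i = int(total.argmax())                           # best root position
    j = int((score_part - deform[i]).argmax())        # best child given root
    return i, j, float(total[i])

# Toy usage: strong detections at positions 30 (root) and 37 (child).
rng = np.random.default_rng(10)
sr, sp = rng.random(50), rng.random(50)
sr[30] += 2.0; sp[37] += 2.0
print(best_pair_config(sr, sp, offset=7))  # expect root 30, child 37
```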
The albedo of a Lambertian object is a surface property that contributes to an object's appearance under changing illumination. As a signature independent of illumination, the albedo is useful for object recognition. Single-image albedo estimation algorithms suffer from shadows and non-Lambertian effects in the image. In this paper, we propose a sequential algorithm to estimate the albedo from a sequence of images of a known 3D object in varying poses and illumination conditions. We first show that by knowing/estimating the pose of the object at each frame of a sequence, the object's albedo can be efficiently estimated using a Kalman filter. We then extend this to the case of unknown pose by simultaneously tracking the pose as well as updating the albedo through a Rao-Blackwellized particle filter. More specifically, the albedo is marginalized from the posterior distribution and estimated analytically using the Kalman filter, while the pose parameters are estimated using importance sampling and by minimizing the projection error of the face onto its spherical harmonic subspace, which results in an illumination-insensitive pose tracking algorithm. Illustrations and experiments are provided to validate the effectiveness of the approach using various synthetic and real sequences, followed by applications to unconstrained, video-based face recognition.
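In the known-pose case, the per-pixel estimator reduces to a scalar Kalman filter: the (nearly constant) albedo is the state and each frame contributes a noisy measurement. The sketch below shows that filter in isolation; in the paper, the measurement model and its noise come from the illumination/pose estimates, which we replace here with a fixed hypothetical variance.

```python
import numpy as np

def albedo_kalman(measurements, meas_var, proc_var=1e-6):
    """Scalar Kalman filter: track a static albedo from a stream of
    noisy shading-corrected intensity measurements."""
    a_hat, P = measurements[0], 1.0          # initial state and variance
    for z in measurements[1:]:
        P += proc_var                        # predict (albedo ~ static)
        K = P / (P + meas_var)               # Kalman gain
        a_hat += K * (z - a_hat)             # update with innovation
        P *= (1 - K)
    return a_hat, P

# Toy usage: a true albedo of 0.6 observed through noise over 200 frames.
rng = np.random.default_rng(6)
z = 0.6 + 0.05 * rng.standard_normal(200)
a, P = albedo_kalman(z, meas_var=0.05 ** 2)
print(round(a, 3), round(P, 6))              # estimate converges near 0.6
```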
A computational problem that arises frequently in computer vision is that of estimating the parameters of a model from data that has been contaminated by noise and outliers. More generally, any practical system that seeks to estimate quantities from noisy data measurements must have some means of dealing with data contamination. The Random Sample Consensus (RANSAC) algorithm is one of the most popular tools for robust estimation. Recent years have seen an explosion of activity in this area, leading to the development of a number of techniques that improve upon the efficiency and robustness of the basic algorithm. In this paper, we present a comprehensive overview of recent research in RANSAC-based robust estimation by analyzing the various approaches that have been explored over the years. We provide a common context for this analysis by introducing a new framework, which we call Universal RANSAC (USAC). USAC extends the hypothesize-and-verify structure of RANSAC to incorporate a number of important practical and computational considerations. In addition, we provide a general-purpose C++ software library that implements the USAC framework by leveraging state-of-the-art algorithms for the various modules. The implementation we provide can be used by researchers either as a stand-alone tool for robust estimation or as a benchmark for evaluating new techniques.
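For completeness, here is the basic hypothesize-and-verify loop that USAC generalizes, shown for 2D line fitting: sample a minimal set, fit a model, count inliers, keep the best. USAC layers non-uniform sampling, degeneracy checks, local optimization, and early termination on top of this skeleton; the threshold and iteration count below are illustrative.

```python
import numpy as np

def ransac_line(points, iters=500, thresh=0.02, seed=0):
    """Vanilla RANSAC for a 2D line: minimal sample of 2 points,
    inliers counted by point-to-line distance."""
    rng = np.random.default_rng(seed)
    best_line, best_inliers = None, np.zeros(len(points), bool)
    for _ in range(iters):
        p, q = points[rng.choice(len(points), 2, replace=False)]
        d = q - p
        n = np.array([-d[1], d[0]])          # line normal
        norm = np.linalg.norm(n)
        if norm < 1e-12:
            continue                         # degenerate sample
        n /= norm
        dist = np.abs((points - p) @ n)      # point-to-line distances
        inliers = dist < thresh
        if inliers.sum() > best_inliers.sum():
            best_line, best_inliers = (p, n), inliers
    return best_line, best_inliers

# Toy usage: 70% of points on the line y = x, the rest uniform outliers.
rng = np.random.default_rng(7)
t = rng.uniform(0, 1, 70)
inl = np.c_[t, t] + 0.005 * rng.standard_normal((70, 2))
pts = np.vstack([inl, rng.uniform(0, 1, (30, 2))])
_, mask = ransac_line(pts)
print(mask.sum())                            # roughly 70 inliers found
```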
To date, almost all experimental evaluations of machine learning-based recognition algorithms in computer vision have taken the form of "closed set" recognition, whereby all testing classes are known at training time. A more realistic scenario for vision applications is "open set" recognition, where incomplete knowledge of the world is present at training time, and unknown classes can be submitted to an algorithm during testing. This article explores the nature of open set recognition and formalizes its definition as a constrained minimization problem. The open set recognition problem is not well addressed by existing algorithms because it requires strong generalization. As a step towards a solution, we introduce a novel "1-vs-Set Machine," which sculpts a decision space from the marginal distances of a 1-class or binary SVM with a linear kernel. This methodology applies to several different applications in computer vision where open set recognition is a challenging problem, including object recognition and face verification. We consider both in this work, with large-scale cross-dataset experiments performed over the Caltech 256 and ImageNet sets, as well as face matching experiments performed over the Labeled Faces in the Wild set.
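To make the open-set setting concrete, the sketch below uses a crude score-thresholding baseline rather than the 1-vs-Set machine itself: train a linear SVM on the known classes only, then reject test samples whose decision score falls inside a rejection band near the boundary. The paper instead learns a second "slab" plane that bounds the positive region from both sides; the tau threshold and the synthetic data here are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(8)
known_a = rng.normal([0, 0], 0.3, (100, 2))  # known class 0
known_b = rng.normal([3, 3], 0.3, (100, 2))  # known class 1
unknown = rng.normal([0, 3], 0.3, (20, 2))   # class unseen in training

X = np.vstack([known_a, known_b])
y = np.array([0] * 100 + [1] * 100)
clf = LinearSVC(C=1.0).fit(X, y)

# Reject anything whose (absolute) decision score is inside the band.
tau = 0.5
margin = np.abs(clf.decision_function(unknown))
print(f"{(margin < tau).mean():.0%} of unknown-class samples rejected")
# A large share of the unseen class lies near the learned boundary and
# falls inside the rejection band; known-class test points mostly do not.
```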
In this work we use Shape Grammars for Facade Parsing, which amounts to segmenting 2D building facades into balconies, walls, windows and doors in an architecturally meaningful manner. The main thrust of our work is the introduction of Reinforcement Learning (RL) techniques to deal with the computational complexity of the problem. RL provides us with efficient tools such as Q-learning and state aggregation which we exploit to accelerate facade parsing. We initially phrase the 1D parsing problem in terms of Markov Decision Processes, paving the way for the application of RL-based tools. We then develop novel techniques for the 2D shape parsing problem that take into account the specificities of the facade parsing problem. Specifically, we use state aggregation to enforce the symmetry of facade floors and also demonstrate that RL can seamlessly exploit bottom-up, image-based guidance during optimization. We provide systematic results on the Paris building dataset and obtain state-of-the-art results in a fraction of the time required by previous methods. We validate our method under diverse imaging conditions and make our software and results available online.
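To give a flavour of the RL machinery involved (not the paper's grammar, state aggregation, or reward design), the toy below casts 1D labelling as an MDP: the state is the current facade row, each action assigns the next strip a label, and the reward is a hypothetical bottom-up score of how well the label fits the strip. Tabular Q-learning then recovers a greedy labelling.

```python
import numpy as np

# Hypothetical stand-in for image evidence: per-row fit of each label.
rng = np.random.default_rng(9)
n_rows, n_labels = 12, 3                     # e.g. wall / window / roof
fit = rng.random((n_rows, n_labels))
fit[:, 0] += 0.5                             # make label 0 often the best

Q = np.zeros((n_rows + 1, n_labels))         # row n_rows is terminal
alpha, gamma, eps = 0.2, 0.95, 0.2
for episode in range(2000):
    for s in range(n_rows):                  # sweep the facade bottom-up
        a = rng.integers(n_labels) if rng.random() < eps else int(Q[s].argmax())
        r = fit[s, a]                        # reward: evidence for label a
        Q[s, a] += alpha * (r + gamma * Q[s + 1].max() - Q[s, a])

parse = Q[:n_rows].argmax(axis=1)            # greedy labelling per row
print(parse)                                 # mostly label 0 in this toy
```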