PLoS Computational Biology
The Genealogical Population Dynamics of HIV-1 in a Large Transmission Chain: Bridging within and among Host Evolutionary Rates
by Bram Vrancken, Andrew Rambaut, Marc A. Suchard, Alexei Drummond, Guy Baele, Inge Derdelinckx, Eric Van Wijngaerden, Anne-Mieke Vandamme, Kristel Van Laethem, Philippe Lemey
Transmission lies at the interface of human immunodeficiency virus type 1 (HIV-1) evolution within and among hosts and separates distinct selective pressures that impose differences in both the mode of diversification and the tempo of evolution. In the absence of comprehensive direct comparative analyses of the evolutionary processes at different biological scales, our understanding of how fast within-host HIV-1 evolutionary rates translate to lower rates at the between-host level remains incomplete. Here, we address this by analyzing pol and env data from a large HIV-1 subtype C transmission chain for which both the timing and the direction are known for most transmission events. To this end, we develop a new transmission model in a Bayesian genealogical inference framework and demonstrate how to constrain the viral evolutionary history to be compatible with the transmission history while simultaneously inferring the within-host evolutionary and population dynamics. We show that accommodating a transmission bottleneck affords the best fit to our data, but the sparse within-host HIV-1 sampling prevents accurate quantification of the concomitant loss in genetic diversity. We draw inference under the transmission model to estimate HIV-1 evolutionary rates among epidemiologically related patients and demonstrate that they lie in between fast intra-host rates and lower rates among epidemiologically unrelated individuals infected with HIV subtype C. Using a new molecular clock approach, we quantify and find support for a lower evolutionary rate along branches that accommodate a transmission event or branches that represent the entire backbone of transmitted lineages in our transmission history. Finally, we recover the rate differences at the different biological scales for both synonymous and non-synonymous substitution rates, which is only compatible with the ‘store and retrieve’ hypothesis positing that viruses stored early in latently infected cells preferentially transmit or establish new infections upon reactivation.
Leadership in Moving Human Groups
by Margarete Boos, Johannes Pritz, Simon Lange, Michael Belz
How is movement of individuals coordinated as a group? This is a fundamental question of social behaviour, encompassing phenomena such as bird flocking, fish schooling, and the innumerable activities in human groups that require people to synchronise their actions. We have developed an experimental paradigm, the HoneyComb computer-based multi-client game, to empirically investigate human movement coordination and leadership. Using economic games as a model, we set monetary incentives to motivate players on a virtual playfield to reach goals via players' movements. We asked (I) whether humans coordinate their movements when information is limited to an individual group member's observation of adjacent group member motion, (II) whether an informed group minority can lead an uninformed group majority to the minority's goal, and if so, (III) how this minority exerts its influence. We showed that in a human group, on the basis of movement alone, a minority can successfully lead a majority. Minorities lead successfully when (a) their members choose similar initial steps towards their goal field and (b) they are among the first in the whole group to make a move. Using our approach, we empirically demonstrate that the rules of swarming behaviour apply to humans. Even complex human behaviour, such as leadership and directed group movement, follows simple rules that are based on visual perception of local movement.
by Justin S. Hogg, Leonard A. Harris, Lori J. Stover, Niketh S. Nair, James R. Faeder
Detailed modeling and simulation of biochemical systems is complicated by the problem of combinatorial complexity, an explosion in the number of species and reactions due to myriad protein-protein interactions and post-translational modifications. Rule-based modeling overcomes this problem by representing molecules as structured objects and encoding their interactions as pattern-based rules. This greatly simplifies the process of model specification, avoiding the tedious and error-prone task of manually enumerating all species and reactions that can potentially exist in a system. From a simulation perspective, rule-based models can be expanded algorithmically into fully enumerated reaction networks and simulated using a variety of network-based simulation methods, such as ordinary differential equations or Gillespie's algorithm, provided that the network is not exceedingly large. Alternatively, rule-based models can be simulated directly using particle-based kinetic Monte Carlo methods. This “network-free” approach produces exact stochastic trajectories with a computational cost that is independent of network size. However, memory and run time costs increase with the number of particles, limiting the size of system that can be feasibly simulated. Here, we present a hybrid particle/population simulation method that combines the best attributes of both the network-based and network-free approaches. The method takes as input a rule-based model and a user-specified subset of species to treat as population variables rather than as particles. The model is then transformed by a process of “partial network expansion” into a dynamically equivalent form that can be simulated using a population-adapted network-free simulator. The transformation method has been implemented within the open-source rule-based modeling platform BioNetGen, and the resulting hybrid models can be simulated using the particle-based simulator NFsim. Performance tests show that significant memory savings can be achieved using the new approach, and a monetary cost analysis provides a practical measure of its utility.
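As a rough illustration of the network-based, population-level style of simulation that the abstract contrasts with particle-based methods, the sketch below runs Gillespie's direct method on a single reversible binding reaction, A + B <-> C. The species, rate constants and function name are hypothetical choices made for this example; it is not BioNetGen or NFsim code and does not implement the hybrid method itself. Note that the state is three integers regardless of how many molecules are present, which is the memory advantage of treating species as populations.

```python
import random

def gillespie_binding(n_a, n_b, n_c, k_on, k_off, t_end, seed=0):
    """Direct-method Gillespie simulation of A + B <-> C.

    Returns a list of (time, n_a, n_b, n_c) states, one per reaction event,
    starting with the initial state at time 0.
    """
    rng = random.Random(seed)
    t = 0.0
    trajectory = [(t, n_a, n_b, n_c)]
    while True:
        # Propensities of the two reactions in the current state.
        a_bind = k_on * n_a * n_b
        a_unbind = k_off * n_c
        a_total = a_bind + a_unbind
        if a_total == 0.0:
            break  # no reaction can fire
        # Waiting time to the next event is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction that fires, in proportion to its propensity.
        if rng.random() * a_total < a_bind:
            n_a, n_b, n_c = n_a - 1, n_b - 1, n_c + 1
        else:
            n_a, n_b, n_c = n_a + 1, n_b + 1, n_c - 1
        trajectory.append((t, n_a, n_b, n_c))
    return trajectory

# Hypothetical parameters, chosen only to produce a short trajectory.
trajectory = gillespie_binding(n_a=50, n_b=30, n_c=0,
                               k_on=0.01, k_off=0.1, t_end=5.0)
```

Because only counts are stored, the cost per event here does not grow with the number of molecules, whereas a particle-based simulator would hold one object per molecule.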
by Paul J. McMurdie, Susan Holmes
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
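To make concrete what rarefying does to the data, here is a minimal sketch that subsamples every library down to the smallest library size and counts how many reads are thrown away. The two toy samples and the function name are invented for illustration; this is not the phyloseq implementation, and it does not reproduce the Negative Binomial alternatives the abstract recommends.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a vector of taxon counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds library size")
    rng = random.Random(seed)
    # Expand counts into one taxon label per read, then draw without replacement.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied

# Hypothetical libraries: s1 has 1000 reads, s2 has 100 reads.
samples = {"s1": [900, 90, 10], "s2": [50, 40, 10]}
depth = min(sum(c) for c in samples.values())  # smallest library size
rarefied = {name: rarefy(c, depth) for name, c in samples.items()}
# Reads that no longer contribute to any downstream test.
discarded = sum(sum(c) for c in samples.values()) - depth * len(samples)
```

In this toy case 900 of the 1100 observed reads are discarded, which gives a feel for the loss of statistical power the authors describe.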
by David Samu, Anil K. Seth, Thomas Nowotny
In the past two decades some fundamental properties of cortical connectivity have been discovered: small-world structure, pronounced hierarchical and modular organisation, and strong core and rich-club structures. A common assumption when interpreting results of this kind is that the observed structural properties are present to enable the brain's function. However, the brain is also embedded into the limited space of the skull and its wiring has associated developmental and metabolic costs. These basic physical and economic aspects place separate, often conflicting, constraints on the brain's connectivity, which must be characterized in order to understand the true relationship between brain structure and function. To address this challenge, here we ask which aspects of the structural organisation of the brain are conserved, and to what extent, if we preserve specific spatial and topological properties of the brain but otherwise randomise its connectivity. We perform a comparative analysis of a connectivity map of the cortical connectome at both high and low resolution, utilising three different types of surrogate networks: spatially unconstrained (‘random’), connection length preserving (‘spatial’), and connection length optimised (‘reduced’) surrogates. We find that unconstrained randomisation markedly diminishes all investigated architectural properties of cortical connectivity. By contrast, spatial and reduced surrogates largely preserve most properties and, interestingly, often more so in the reduced surrogates. Specifically, our results suggest that the cortical network is less tightly integrated than its spatial constraints would allow, but more strongly segregated than its spatial constraints would necessitate. We additionally find that hierarchical organisation and rich-club structure of the cortical connectivity are largely preserved in spatial and reduced surrogates and hence may be partially attributable to cortical wiring constraints. In contrast, the high modularity and strong s-core of the high-resolution cortical network are significantly stronger than in the surrogates, underlining their potential functional relevance in the brain.
by Marjet Elemans, Arnaud Florins, Luc Willems, Becca Asquith
The CD8+ cytotoxic T lymphocyte (CTL) response is an important defence against viral invasion. Although CTL-mediated cytotoxicity has been widely studied for many years, the rate at which virus-infected cells are killed in vivo by the CTL response is poorly understood. To date the rate of CTL killing in vivo has been estimated for three virus infections but the estimates differ considerably, and killing of HIV-1-infected cells was unexpectedly low. This raises questions about the typical anti-viral capability of CTL and whether CTL killing is abnormally low in HIV-1. We estimated the rate of killing of infected cells by CD8+ T cells in two distinct persistent virus infections: sheep infected with Bovine Leukemia Virus (BLV) and humans infected with Human T Lymphotropic Virus type 1 (HTLV-1), which together with existing data allows us to study a total of five viruses in parallel. Although both BLV and HTLV-1 infection are characterised by large expansions of chronically activated CTL with immediate effector function ex vivo and no evidence of overt immune suppression, our estimates are at the lower end of the reported range. This enables us to put current estimates into perspective and shows that CTL killing of HIV-infected cells may not be atypically low. The estimates at the higher end of the range are obtained in more manipulated systems and may thus represent the potential rather than the realised CTL efficiency.
by Gerard J. P. van Westen, Anna Gaulton, John P. Overington
Allosteric modulators are ligands for proteins that exert their effects via a different binding site than the natural (orthosteric) ligand site and hence form a conceptually distinct class of ligands for a target of interest. Here, the physicochemical and structural features of a large set of allosteric and non-allosteric ligands from the ChEMBL database of bioactive molecules are analyzed. In general, allosteric modulators are relatively smaller, more lipophilic and more rigid compounds, though large differences exist between different targets and target classes. Furthermore, there are differences in the distribution of targets that bind these allosteric modulators. Allosteric modulators are over-represented in membrane receptors, ligand-gated ion channels and nuclear receptor targets, but are under-represented in enzymes (primarily proteases and kinases). Moreover, allosteric modulators tend to bind to their targets with a slightly lower potency (5.96 log units versus 6.66 log units, p<0.01). However, this lower absolute affinity is compensated by their lower molecular weight and more lipophilic nature, leading to similar binding efficiency and surface efficiency indices. Subsequently, a series of classifier models is trained, initially target class independent models followed by finer-grained target (architecture/functional class) based models using the target hierarchy of the ChEMBL database. Applications of these insights include the selection of likely allosteric modulators from existing compound collections, the design of novel chemical libraries biased towards allosteric regulators and the selection of targets likely to yield allosteric modulators on screening. All data sets used in the paper are available for download.
Exploring the Conformational Transitions of Biomolecular Systems Using a Simple Two-State Anisotropic Network Model
by Avisek Das, Mert Gur, Mary Hongying Cheng, Sunhwan Jo, Ivet Bahar, Benoît Roux
Biomolecular conformational transitions are essential to biological functions. Most experimental methods report on the long-lived functional states of biomolecules, but information about the transition pathways between these stable states is generally scarce. Such transitions involve short-lived conformational states that are difficult to detect experimentally. For this reason, computational methods are needed to produce plausible hypothetical transition pathways that can then be probed experimentally. Here we propose a simple and computationally efficient method, called ANMPathway, for constructing a physically reasonable pathway between two endpoints of a conformational transition. We adopt a coarse-grained representation of the protein and construct a two-state potential by combining two elastic network models (ENMs) representative of the experimental structures resolved for the endpoints. The two-state potential has a cusp hypersurface in the configuration space where the energies from both the ENMs are equal. We first search for the minimum energy structure on the cusp hypersurface and then treat it as the transition state. The continuous pathway is subsequently constructed by following the steepest descent energy minimization trajectories starting from the transition state on each side of the cusp hypersurface. Application to several systems of broad biological interest such as adenylate kinase, ATP-driven calcium pump SERCA, leucine transporter and glutamate transporter shows that ANMPathway yields results in good agreement with those from other similar methods and with data obtained from all-atom molecular dynamics simulations, in support of the utility of this simple and efficient approach. Notably, the method provides experimentally testable predictions, including the formation of non-native contacts during the transition, which we were able to detect in two of the systems we studied. An open-access web server has been created to deliver ANMPathway results.
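The idea of a two-state potential with a cusp can be sketched in one dimension, where each elastic network model is replaced by a single harmonic well and the cusp "hypersurface" shrinks to one point. Everything below (the well parameters, the function names, and the use of bisection) is an assumption made for illustration; the actual method works on coarse-grained protein coordinates and minimizes the energy over a high-dimensional cusp surface.

```python
def harmonic(x, center, k, offset=0.0):
    """Energy of a 1D harmonic well, standing in for one elastic network model."""
    return 0.5 * k * (x - center) ** 2 + offset

def two_state_energy(x, well_a, well_b):
    """Two-state potential: the lower of the two single-state energies."""
    return min(harmonic(x, *well_a), harmonic(x, *well_b))

def find_cusp(well_a, well_b, tol=1e-10):
    """Bisect between the two minima for the point where both energies are equal."""
    def diff(x):
        return harmonic(x, *well_a) - harmonic(x, *well_b)

    lo, hi = well_a[0], well_b[0]
    if diff(lo) * diff(hi) > 0:
        raise ValueError("energies do not cross between the two minima")
    while abs(hi - lo) > tol:
        mid = 0.5 * (lo + hi)
        # Keep the half-interval that still brackets the sign change.
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical wells given as (center, stiffness, energy offset).
well_a = (0.0, 1.0, 0.0)
well_b = (2.0, 1.0, 0.0)
cusp = find_cusp(well_a, well_b)                    # transition-state analogue
barrier = two_state_energy(cusp, well_a, well_b)    # energy at the cusp
stiff_cusp = find_cusp(well_a, (2.0, 4.0, 0.0))     # stiffer end state B
```

With identical wells the cusp sits midway between the end states; making one well stiffer moves it toward that well. Following the downhill direction from the cusp on either side would then trace the 1D analogue of the two steepest-descent branches of the pathway.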
Investigation of Inflammation and Tissue Patterning in the Gut Using a Spatially Explicit General-Purpose Model of Enteric Tissue (SEGMEnT)
by Chase Cockrell, Scott Christley, Gary An
The mucosa of the intestinal tract represents a finely tuned system where tissue structure strongly influences, and is in turn influenced by, its function as both an absorptive surface and a defensive barrier. Mucosal architecture and histology play a key role in the diagnosis, characterization and pathophysiology of a host of gastrointestinal diseases. Inflammation is a significant factor in the pathogenesis of many gastrointestinal diseases, and is perhaps the most clinically significant control factor governing the maintenance of the mucosal architecture by morphogenic pathways. We propose that appropriate characterization of the role of inflammation as a controller of enteric mucosal tissue patterning requires understanding the underlying cellular and molecular dynamics that determine the epithelial crypt-villus architecture across a range of conditions from health to disease. Towards this end we have developed the Spatially Explicit General-purpose Model of Enteric Tissue (SEGMEnT) to dynamically represent existing knowledge of the behavior of enteric epithelial tissue as influenced by inflammation, with the ability to generate a variety of pathophysiological processes within a common platform and from a common knowledge base. In addition to reproducing healthy ileal mucosal dynamics as well as a series of morphogen knock-out/inhibition experiments, SEGMEnT provides insight into a range of clinically relevant cellular-molecular mechanisms, such as a putative role for Phosphatase and tensin homolog/phosphoinositide 3-kinase (PTEN/PI3K) as a key point of crosstalk between inflammation and morphogenesis, the protective role of enterocyte sloughing in enteric ischemia-reperfusion, and chronic low-level inflammation as a driver for colonic metaplasia. These results suggest that SEGMEnT can serve as an integrating platform for the study of inflammation in gastrointestinal disease.
by Emily Berger, Deniz Yorukoglu, Jian Peng, Bonnie Berger
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
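The minimum error correction (MEC) value mentioned above can be illustrated with a short sketch. It uses the common definition of MEC, namely the smallest number of allele calls that would have to be changed so that every read agrees with one of the proposed haplotypes; the read encoding, the toy data and the function name are assumptions for this example, and the code scores a given phasing rather than performing HapTree's maximum-likelihood assembly.

```python
def mec_score(reads, haplotypes):
    """Minimum error correction score of a proposed phasing.

    reads      -- list of dicts mapping SNP index -> observed allele (0 or 1)
    haplotypes -- list of equal-length allele lists, one per chromosome copy
    """
    total = 0
    for read in reads:
        # Mismatches between this read and each candidate haplotype.
        mismatches = [
            sum(1 for site, allele in read.items() if hap[site] != allele)
            for hap in haplotypes
        ]
        # Assign the read to its closest haplotype and count its errors.
        total += min(mismatches)
    return total

# Hypothetical reads over four heterozygous SNP sites.
reads = [
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0, 3: 1},
    {0: 0, 1: 0, 3: 1},  # fits haplotype 0 except at SNP 3
    {2: 1, 3: 0},
]
diploid = [[0, 0, 1, 0], [1, 1, 0, 1]]
triploid = diploid + [[0, 0, 1, 1]]

mec_diploid = mec_score(reads, diploid)
mec_triploid = mec_score(reads, triploid)
```

Because the score takes a minimum over however many haplotypes are supplied, the same function applies unchanged to diploid and higher-ploidy phasings; here the third haplotype explains the read that the diploid phasing could not.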
by Ewa Szczurek, Niko Beerenwinkel
In large collections of tumor samples, it has been observed that sets of genes that are commonly involved in the same cancer pathways tend not to occur mutated together in the same patient. Such gene sets form mutually exclusive patterns of gene alterations in cancer genomic data. Computational approaches that detect mutually exclusive gene sets rank and test candidate alteration patterns by rewarding the number of samples the pattern covers and by punishing its impurity, i.e., additional alterations that violate strict mutual exclusivity. However, the extant approaches do not account for possible observation errors. In practice, false negatives and especially false positives can severely bias evaluation and ranking of alteration patterns. To address these limitations, we develop a fully probabilistic, generative model of mutual exclusivity, explicitly taking coverage, impurity, as well as error rates into account, and devise efficient algorithms for parameter estimation and pattern ranking. Based on this model, we derive a statistical test of mutual exclusivity by comparing its likelihood to the null model that assumes independent gene alterations. Using extensive simulations, the new test is shown to be more powerful than a permutation test applied previously. When applied to detect mutual exclusivity patterns in glioblastoma and in pan-cancer data from twelve tumor types, we identify several significant patterns that are biologically relevant, most of which would not be detected by previous approaches. Our statistical modeling framework of mutual exclusivity provides increased flexibility and power to detect cancer pathways from genomic alteration data in the presence of noise. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
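The two quantities that earlier approaches trade off, coverage and impurity, can be computed directly from a binary alteration table. The sketch below is an assumed, simplified reading of those terms: coverage is the number of samples altered in at least one gene of the set, and impurity is the number of alterations beyond the first in each covered sample. Gene names and data are invented, and the code does not implement the probabilistic model or the error rates described in the abstract.

```python
def coverage_and_impurity(alterations, gene_set):
    """Score a candidate gene set on a binary alteration table.

    alterations -- dict mapping gene -> list of 0/1 flags, one per sample
    gene_set    -- genes that make up the candidate pattern
    Returns (coverage, impurity).
    """
    n_samples = len(next(iter(alterations.values())))
    coverage = 0
    impurity = 0
    for s in range(n_samples):
        hits = sum(alterations[g][s] for g in gene_set)
        if hits >= 1:
            coverage += 1
            # Every alteration after the first violates strict exclusivity.
            impurity += hits - 1
    return coverage, impurity

# Hypothetical alteration calls for six tumor samples.
alterations = {
    "geneA": [1, 0, 0, 1, 0, 0],
    "geneB": [0, 1, 0, 0, 0, 0],
    "geneC": [0, 0, 1, 1, 0, 0],
}
score_abc = coverage_and_impurity(alterations, ["geneA", "geneB", "geneC"])
score_ab = coverage_and_impurity(alterations, ["geneA", "geneB"])
```

In this toy table the three-gene set covers four samples with one impure alteration, while the two-gene set covers three samples with none. A single false-positive call could flip such a ranking, which is the sensitivity to observation error that motivates the generative model.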
Identification of New IκBα Complexes by an Iterative Experimental and Mathematical Modeling Approach
by Fabian Konrath, Johannes Witt, Thomas Sauter, Dagmar Kulms
The transcription factor nuclear factor kappa-B (NFκB) is a key regulator of pro-inflammatory and pro-proliferative processes. Accordingly, uncontrolled NFκB activity may contribute to the development of severe diseases when the regulatory system is impaired. Since NFκB can be triggered by a huge variety of inflammatory, pro- and anti-apoptotic stimuli, its activation underlies a complex and tightly regulated signaling network that also includes multi-layered negative feedback mechanisms. Detailed understanding of this complex signaling network is mandatory to identify sensitive parameters that may serve as targets for therapeutic interventions. While many details about canonical and non-canonical NFκB activation have been investigated, less is known about cellular IκBα pools that may tune the cellular NFκB levels. IκBα has so far exclusively been described to exist in two different forms within the cell: stably bound to NFκB or, very transiently, as unbound protein. We created a detailed mathematical model to quantitatively capture and analyze the time-resolved network behavior. By iterative refinement with numerous biological experiments, we obtained a highly identifiable model with superior predictive power, which led to the hypothesis of an NFκB-lacking IκBα complex that contains stabilizing IKK subunits. We provide evidence that pathways other than the canonical ones exist that may affect the cellular IκBα status. The additional IκBα:IKKγ complex revealed here may serve as storage for the inhibitor to antagonize undesired NFκB activation under physiological and pathophysiological conditions.
Comparative Analysis of the Macroscale Structural Connectivity in the Macaque and Human Brain
by Alexandros Goulas, Matteo Bastiani, Gleb Bezgin, Harry B. M. Uylings, Alard Roebroeck, Peter Stiers
The macaque brain serves as a model for the human brain, but its suitability is challenged by unique human features, including connectivity reconfigurations, which emerged during primate evolution. We perform a quantitative comparative analysis of the whole brain macroscale structural connectivity of the two species. Our findings suggest that the human and macaque brain as a whole are similarly wired. A region-wise analysis reveals many interspecies similarities of connectivity patterns, but also a lack thereof, primarily involving cingulate regions. We unravel a common structural backbone in both species involving a highly overlapping set of regions. This structural backbone, important for mediating information across the brain, seems to constitute a feature of the primate brain that has persisted through evolution. Our findings illustrate novel evolutionary aspects at the macroscale connectivity level and offer a quantitative translational bridge between macaque and human research.
by Samanthe M. Lyons, Wenlong Xu, June Medford, Ashok Prasad
Biological protein interaction networks such as signal transduction or gene transcription networks are often treated as modular, allowing motifs to be analyzed in isolation from the rest of the network. Modularity is also a key assumption in synthetic biology, where it is similarly expected that when network motifs are combined, they do not lose their essential characteristics. However, the interactions that a network module has with downstream elements change the dynamical equations describing the upstream module and thus may change the dynamic and static properties of the upstream circuit even without explicit feedback. In this work we analyze the behavior of a ubiquitous motif in gene transcription and signal transduction circuits: the switch. We show that adding a downstream component to the simple genetic toggle switch changes its dynamical properties by changing the underlying potential energy landscape and skewing it in favor of the unloaded side; in some situations, adding loads to the genetic switch can also abrogate bistable behavior. We find that an additional positive feedback motif found in naturally occurring toggle switches could tune the potential energy landscape in a desirable manner. We also analyze autocatalytic signal transduction switches and show that a ubiquitous positive feedback switch can lose its switch-like properties when connected to a downstream load. Our analysis underscores the necessity of incorporating the effects of downstream components when understanding the physics of biochemical network motifs, and raises the question as to how these effects are managed in real biological systems. This analysis is particularly important when scaling synthetic networks to more complex organisms.
A Discrete Model of Drosophila Eggshell Patterning Reveals Cell-Autonomous and Juxtacrine Effects
by Adrien Fauré, Barbara M. I. Vreede, Élio Sucena, Claudine Chaouiya
The Drosophila eggshell constitutes a remarkable system for the study of epithelial patterning, both experimentally and through computational modeling. Dorsal eggshell appendages arise from specific regions in the anterior follicular epithelium that covers the oocyte: two groups of cells expressing broad (roof cells) bordered by rhomboid-expressing cells (floor cells). Despite the large number of genes known to participate in defining these domains and the important modeling efforts put into this developmental system, key patterning events still lack a proper mechanistic understanding and/or genetic basis, and the literature appears to conflict on some crucial points. We tackle these issues with an original, discrete framework that considers single-cell models that are integrated to construct epithelial models. We first build a phenomenological model that reproduces wild type follicular epithelial patterns, confirming EGF and BMP signaling input as sufficient to establish the major features of this patterning system within the anterior domain. Importantly, this simple model predicts an instructive juxtacrine signal linking the roof and floor domains. To explore this prediction, we define a mechanistic model that integrates the combined effects of cellular genetic networks, cell communication and network adjustment through developmental events. Moreover, we focus on the anterior competence region, and postulate that early BMP signaling participates with early EGF signaling in its specification. This model accurately simulates wild type pattern formation and is able to reproduce, with an unprecedented level of precision and completeness, various published gain-of-function and loss-of-function experiments, including perturbations of the BMP pathway previously seen as conflicting results. The result is a coherent model built upon rules that may be generalized to other epithelia and developmental systems.
by James M. Osborne, Miguel O. Bernabeu, Maria Bruna, Ben Calderhead, Jonathan Cooper, Neil Dalchau, Sara-Jane Dunn, Alexander G. Fletcher, Robin Freeman, Derek Groen, Bernhard Knapp, Greg J. McInerny, Gary R. Mirams, Joe Pitt-Francis, Biswa Sengupta, David W. Wright, Christian A. Yates, David J. Gavaghan, Stephen Emmott, Charlotte Deane
Crossing Borders for Science
by Sebastian J. Schultheiss, Joshua SungWoo Yang, Wataru Iwasaki, Shu-Hsi Lin, Angela Jean, Magali Michaut
Exchanging ideas with like-minded, enthusiastic people interested in the same topic is crucial for the advancement of a scientist's career. Several Regional Student Groups (RSGs) of the International Society for Computational Biology (ISCB) Student Council have cooperated in the last six years to organize scientific workshops and conferences. With motivated students, it is possible to create a memorable event for fellow scientists; in doing so, the organizers gain valuable experiences. While collaborating across borders and time zones can be difficult, feedback from event organizers was always positive. When limited resources are juxtaposed with great ideas and a network of contacts, the outcome is always an amazing experience, despite organizers being separated geographically across different countries.
Neuronal Spike Timing Adaptation Described with a Fractional Leaky Integrate-and-Fire Model
by Wondimu Teka, Toma M. Marinov, Fidel Santamaria
The voltage trace of neuronal activities can follow multiple timescale dynamics that arise from correlated membrane conductances. Such processes can result in power-law behavior in which the membrane voltage cannot be characterized with a single time constant. The emergent effect of these membrane correlations is a non-Markovian process that can be modeled with a fractional derivative. A fractional derivative is a non-local process in which the value of the variable is determined by integrating a temporally weighted voltage trace, also called the memory trace. Here we developed and analyzed a fractional leaky integrate-and-fire model in which the exponent of the fractional derivative can vary from 0 to 1, with 1 representing the normal derivative. As the exponent of the fractional derivative decreases, the weights of the voltage trace increase. Thus, the value of the voltage is increasingly correlated with the trajectory of the voltage in the past. By varying only the fractional exponent, our model can reproduce upward and downward spike adaptations found experimentally in neocortical pyramidal cells and tectal neurons in vitro. The model also produces spikes with longer first-spike latency and high inter-spike variability with power-law distribution. We further analyze spike adaptation and the responses to noisy and oscillatory input. The fractional model generates reliable spike patterns in response to noisy input. Overall, the spiking activity of the fractional leaky integrate-and-fire model deviates from the spiking activity of the Markovian model and reflects the temporally accumulated intrinsic membrane dynamics that affect the response of the neuron to external stimulation.
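The non-local character of a fractional derivative can be seen in a few lines of code. The sketch below uses the standard Grünwald-Letnikov approximation, which is a substitute chosen here for brevity and is not necessarily the discretization used by the authors; the function names and test trace are likewise assumptions. It shows that at exponent 1 only the two most recent samples matter (an ordinary finite difference), while at smaller exponents every past sample of the trace receives a non-zero weight.

```python
def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov coefficients w_k = (-1)^k * binom(alpha, k), k = 0..n_terms-1."""
    weights = [1.0]
    for k in range(1, n_terms):
        # Recurrence relating consecutive generalized binomial coefficients.
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights

def fractional_derivative(trace, alpha, dt):
    """Approximate the order-alpha derivative at the last sample of `trace`."""
    weights = gl_weights(alpha, len(trace))
    # The newest sample gets w_0, progressively older samples get w_1, w_2, ...
    history = reversed(trace)
    return sum(w * v for w, v in zip(weights, history)) / dt ** alpha

# Hypothetical linearly rising voltage trace sampled every 0.5 time units.
trace = [0.0, 1.0, 2.0, 3.0]
ordinary = fractional_derivative(trace, alpha=1.0, dt=0.5)    # plain slope
fractional = fractional_derivative(trace, alpha=0.5, dt=0.5)  # uses whole history
memory_weights = gl_weights(0.5, 4)
```

In a fractional leaky integrate-and-fire simulation, a sum of this kind over the stored voltage history would replace the single-step update of the classical model, which is why the full trace has to be kept in memory.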
by Jianzhu Ma, Sheng Wang, Zhiyong Wang, Jinbo Xu
Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method, MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMMs, which can only model very short-range residue correlation, MRFs can model long-range residue interaction patterns and thus encode information on the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection should be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding
by Shaun Mahony, Matthew D. Edwards, Esteban O. Mazzoni, Richard I. Sherwood, Akshay Kakumanu, Carolyn A. Morrison, Hynek Wichterle, David K. Gifford
Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS's multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulatory signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.